genotype
Genotype SVs across SV genotyping methods
Last updated: 2025-03-14
Input requirmenets
-
Manifest:
- Format: tab-separated file with header.
- Columns:
file
: path to SV discovery and force calling VCF/BCF filesample
: sample IDaligner
: aligner used to generate the SV call setcaller
: SV caller used to generate the SV call setif_force_call
: whether the SV call set is generated by force-calling (1: True, 0: False)info
(optional): INFO tags in additional toDP
,RE
to be added to the output VCF
-
Representative VCF:
- Format: bi-allelic VCF/BCF files following the VCF specification
- INFO:
SVTYPE
,SVLEN
,ID_LIST
- ID: unique ID within the representative VCF
-
Discovery calling VCF:
- Format: single-sample, bi-allelic VCF/BCF files following the VCF specification
- ID: unique ID listed in the
ID_LIST
of the representative VCF - INFO:
DP
,RE
-
Force calling VCF:
- Format: single-sample, bi-allelic VCF/BCF files following the VCF specification
- ID: the same ID as the representative VCF
- INFO:
DP
,RE
-
Region:
- Format: region string accepted by bcftools (e.g.,
chr1[:100-200]
)
- Format: region string accepted by bcftools (e.g.,
Output
- VCF: single-sample genotyped VCF/BCF files.
- ID: the same ID as the representative VCF
Usage
harmonisv represent [options] -i <input_vcf> -f <manifest> -o <output_vcf> --sample <id>
Genotyping methods
The genotype
command integrates information from various SV discovery and force calling methods to genotype SVs with high sensitivity. The genotype of an SV is determined through the following steps:
-
Extract
DP
,RE
and additional INFO tags from SV calling VCFs and calculateVAF
. If a method performs both discovery calling and force calling, the results of force calling will be used. -
If one method's
VAF
> heterozygous frequency threshold, it is considered to support the SV and counted inSUPP_METHOD
. The maximum number of SV callers using the same aligner is recorded inSUPP_CALLER
. For example, if an SV is supported by callerssniffles
andcuteSV
with alignerminimap2
, as well as callersvim
with alignerNGMLR
, thenSUPP_CALLER
would be set to 2. -
Calculate the average
VAF
of methods that support the SV (MEAN_VAF_CALL
) and determine the genotype as follows:0/1
if the heterozygous frequency threshold <= MEAN_VAF_CALL < homozygous frequency threshold1/1
ifMEAN_VAF_CALL
>= homozygous frequency threshold
-
If no method supports the SV, determine the genotype as:
0/0
ifDP
>= minimum depth threshold./.
ifDP
< minimum depth threshold
The methods used to calculate MEAN_VAF_CALL
can be specified by --include
option. If --include all
is used, all methods will be included. If --include force_call
is used, only force calling methods will be included. Alternatively, a comma-separated list of methods in the format of ALIGNER_CALLER
can be specified (e.g., --include MINIMAP2_SNIFFLES
).
The genotyping results are highly sensitive as they retain SVs supported by any method. To obtain high-confidence SVs, consider further filtering the results:
- filter: filter SVs by training a random forest model
- consensus calling: filter SVs by the number of supporting methods, e.g.,
bcftools view -i 'SUPP_CALLER >= 2'
Examples
In this example, we genotype HG002 using the discovery calling and force calling results from the combinations of 2 aligners (minimap2 and NGMLR) and 3 SV callers (sniffles, cuteSV, and svim).
harmonisv genotype \
-i representative.vcf \
-f manifest.txt \
-o HG002.representative.genotyped.vcf \
--sample HG002
The manifest file manifest.txt
is shown below:
file | sample | aligner | caller | is_force_call | info |
---|---|---|---|---|---|
output/harmonize_force_call/HG002.NGMLR.cuteSV.harmonized.vcf | HG002 | NGMLR | cuteSV | 1 | PRECISE,CIPOS,CILEN |
output/harmonize_force_call/HG002.NGMLR.sniffles.harmonized.vcf | HG002 | NGMLR | sniffles | 1 | |
output/harmonize_force_call/HG002.minimap2.cuteSV.harmonized.vcf | HG002 | minimap2 | cuteSV | 1 | PRECISE,CIPOS,CILEN |
output/harmonize_force_call/HG002.minimap2.sniffles.harmonized.vcf | HG002 | minimap2 | sniffles | 1 | |
output/harmonize/HG002.NGMLR.cuteSV.harmonized.vcf | HG002 | NGMLR | cuteSV | 0 | |
output/harmonize/HG002.NGMLR.sniffles.harmonized.vcf | HG002 | NGMLR | sniffles | 0 | |
output/harmonize/HG002.NGMLR.svim.harmonized.vcf | HG002 | NGMLR | svim | 0 | |
output/harmonize/HG002.minimap2.cuteSV.harmonized.vcf | HG002 | minimap2 | cuteSV | 0 | |
output/harmonize/HG002.minimap2.sniffles.harmonized.vcf | HG002 | minimap2 | sniffles | 0 | |
output/harmonize/HG002.minimap2.svim.harmonized.vcf | HG002 | minimap2 | svim | 0 |
An example output is shown below:
22 16147398 All_method.INS.1 G GCCTCAGCCTCCCAAAAAGTGCTGGAATTATAAGCGTGAGCCACTGTGCCAACCGATTTTTTTGGTATTTTTAGTAAAGATGGGGTTTCATCATCTTGGAACTAGGCTGGTCTTGAACTCCTGATCTCGTGATCCACCCAC . . SVTYPE=INS;END=16147398;SVLEN=140;AC=2;REPRESENT_SV=HG002.NGMLR.svim.INS.1;CIPOS_NGMLR_CUTESV=-14,14;CIPOS_MINIMAP2_CUTESV=-8,8;RE_NGMLR_CUTESV=6;RE_NGMLR_SNIFFLES=6;RE_MINIMAP2_CUTESV=35;RE_MINIMAP2_SNIFFLES=40;RE_MINIMAP2_SVIM=39;RE_NGMLR_SVIM=6;CILEN_NGMLR_CUTESV=-9,9;CILEN_MINIMAP2_CUTESV=-4,4;DP_NGMLR_CUTESV=6;DP_NGMLR_SNIFFLES=6;DP_MINIMAP2_CUTESV=39;DP_MINIMAP2_SNIFFLES=40;DP_MINIMAP2_SVIM=40;DP_NGMLR_SVIM=6;PRECISE_NGMLR_CUTESV;PRECISE_MINIMAP2_CUTESV;VAF_NGMLR_CUTESV=1;VAF_NGMLR_SNIFFLES=1;VAF_MINIMAP2_CUTESV=0.897436;VAF_MINIMAP2_SNIFFLES=1;VAF_MINIMAP2_SVIM=0.975;VAF_NGMLR_SVIM=1;SUPP_METHOD=6;SUPP_METHOD_FORCE=4;SUPP_CALLER=3;SUPP_CALLER_FORCE=2;MAX_RE=40;MEAN_VAF=0.978739;STD_VAF=0.0374884;MEAN_VAF_CALL=0.978739;STD_VAF_CALL=0.0374884 GT:DP:AD 1/1:.:.,.
22 16212542 All_method.DEL.1 CGCTCCAAATATCCACTCGCGGTTTCTGCAAAAAGAGTGTTTCAAAACTTCTCAATCAAAAGAAAGGTTCAACTCTGTGAGATGAATGCACACATCACAAAGAAGTTTCTCAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTCCCTTTTCCACCACAGGCC C . . SVTYPE=DEL;END=16212706;SVLEN=-164;AC=1;REPRESENT_SV=HG002.minimap2.svim.DEL.2;CIPOS_NGMLR_CUTESV=-4,4;CIPOS_MINIMAP2_CUTESV=-22,22;RE_NGMLR_CUTESV=2;RE_NGMLR_SNIFFLES=2;RE_MINIMAP2_CUTESV=46;RE_MINIMAP2_SNIFFLES=46;RE_MINIMAP2_SVIM=44;RE_NGMLR_SVIM=0;CILEN_NGMLR_CUTESV=-1,1;CILEN_MINIMAP2_CUTESV=-7,7;DP_NGMLR_CUTESV=3;DP_NGMLR_SNIFFLES=6;DP_MINIMAP2_CUTESV=66;DP_MINIMAP2_SNIFFLES=68;DP_MINIMAP2_SVIM=72;DP_NGMLR_SVIM=.;PRECISE_NGMLR_CUTESV;PRECISE_MINIMAP2_CUTESV;VAF_NGMLR_CUTESV=0.666667;VAF_NGMLR_SNIFFLES=0.333333;VAF_MINIMAP2_CUTESV=0.69697;VAF_MINIMAP2_SNIFFLES=0.676471;VAF_MINIMAP2_SVIM=0.611111;VAF_NGMLR_SVIM=0;SUPP_METHOD=5;SUPP_METHOD_FORCE=4;SUPP_CALLER=3;SUPP_CALLER_FORCE=2;MAX_RE=46;MEAN_VAF=0.497425;STD_VAF=0.254231;MEAN_VAF_CALL=0.59691;STD_VAF_CALL=0.13482 GT:DP:AD 0/1:.:.,.
Arguments
Input/Output arguments:
- -i, --invcf VCF
- input representative SV call set VCF/BCF
- -f, --manifest TSV
- tab separated manifest file of SV calling results. Column headers should be: file, sample, aligner, caller, is_force_call, info (optional)
- --sample ID
- sample to be genotyped, should be listed in the manifest file
- -o, --outvcf VCF
- output VCF
- -r, --region CHR
- region to genotype (requires indexed VCF)
Genotyping arguments:
- --min-dp 10
- minimum depth to to genotype SVs as 0/0 if no method has VAF >
--hetero-freq
, otherwise ./. (default: 10) - --homo-freq 0.8
- minimum average variant allele fraction to genotype as homozygous (default: 0.8)
- --hetero-freq 0.2
- minimum average variant allele fraction to genotype as heterozygous (default: 0.2)
- --include all
- methods to be included to determine genotype. Options: 'all', 'force_call', or comma-separated list of methods in format of 'ALIGNER_CALLER' (default: 'all')