genotype

Genotype SVs across SV genotyping methods

Last updated: 2025-03-14

Input requirmenets

  • Manifest:

    • Format: tab-separated file with header.
    • Columns:
      • file: path to SV discovery and force calling VCF/BCF file
      • sample: sample ID
      • aligner: aligner used to generate the SV call set
      • caller: SV caller used to generate the SV call set
      • if_force_call: whether the SV call set is generated by force-calling (1: True, 0: False)
      • info (optional): INFO tags in additional to DP, RE to be added to the output VCF
  • Representative VCF:

    • Format: bi-allelic VCF/BCF files following the VCF specification
    • INFO: SVTYPE, SVLEN, ID_LIST
    • ID: unique ID within the representative VCF
  • Discovery calling VCF:

    • Format: single-sample, bi-allelic VCF/BCF files following the VCF specification
    • ID: unique ID listed in the ID_LIST of the representative VCF
    • INFO: DP, RE
  • Force calling VCF:

    • Format: single-sample, bi-allelic VCF/BCF files following the VCF specification
    • ID: the same ID as the representative VCF
    • INFO: DP, RE
  • Region:

    • Format: region string accepted by bcftools (e.g., chr1[:100-200])

Output

  • VCF: single-sample genotyped VCF/BCF files.
  • ID: the same ID as the representative VCF

Usage

harmonisv represent [options] -i <input_vcf> -f <manifest> -o <output_vcf> --sample <id>

Genotyping methods

The genotype command integrates information from various SV discovery and force calling methods to genotype SVs with high sensitivity. The genotype of an SV is determined through the following steps:

  1. Extract DP, RE and additional INFO tags from SV calling VCFs and calculate VAF. If a method performs both discovery calling and force calling, the results of force calling will be used.

  2. If one method's VAF > heterozygous frequency threshold, it is considered to support the SV and counted in SUPP_METHOD. The maximum number of SV callers using the same aligner is recorded in SUPP_CALLER. For example, if an SV is supported by callers sniffles and cuteSV with aligner minimap2, as well as caller svim with aligner NGMLR, then SUPP_CALLER would be set to 2.

  3. Calculate the average VAF of methods that support the SV (MEAN_VAF_CALL) and determine the genotype as follows:

    • 0/1 if the heterozygous frequency threshold <= MEAN_VAF_CALL < homozygous frequency threshold
    • 1/1 if MEAN_VAF_CALL >= homozygous frequency threshold
  4. If no method supports the SV, determine the genotype as:

    • 0/0 if DP >= minimum depth threshold
    • ./. if DP < minimum depth threshold

The methods used to calculate MEAN_VAF_CALL can be specified by --include option. If --include all is used, all methods will be included. If --include force_call is used, only force calling methods will be included. Alternatively, a comma-separated list of methods in the format of ALIGNER_CALLER can be specified (e.g., --include MINIMAP2_SNIFFLES).

The genotyping results are highly sensitive as they retain SVs supported by any method. To obtain high-confidence SVs, consider further filtering the results:

  • filter: filter SVs by training a random forest model
  • consensus calling: filter SVs by the number of supporting methods, e.g., bcftools view -i 'SUPP_CALLER >= 2'

Examples

In this example, we genotype HG002 using the discovery calling and force calling results from the combinations of 2 aligners (minimap2 and NGMLR) and 3 SV callers (sniffles, cuteSV, and svim).

harmonisv genotype \
-i representative.vcf \
-f manifest.txt \
-o HG002.representative.genotyped.vcf \
--sample HG002

The manifest file manifest.txt is shown below:

file sample aligner caller is_force_call info
output/harmonize_force_call/HG002.NGMLR.cuteSV.harmonized.vcf HG002 NGMLR cuteSV 1 PRECISE,CIPOS,CILEN
output/harmonize_force_call/HG002.NGMLR.sniffles.harmonized.vcf HG002 NGMLR sniffles 1
output/harmonize_force_call/HG002.minimap2.cuteSV.harmonized.vcf HG002 minimap2 cuteSV 1 PRECISE,CIPOS,CILEN
output/harmonize_force_call/HG002.minimap2.sniffles.harmonized.vcf HG002 minimap2 sniffles 1
output/harmonize/HG002.NGMLR.cuteSV.harmonized.vcf HG002 NGMLR cuteSV 0
output/harmonize/HG002.NGMLR.sniffles.harmonized.vcf HG002 NGMLR sniffles 0
output/harmonize/HG002.NGMLR.svim.harmonized.vcf HG002 NGMLR svim 0
output/harmonize/HG002.minimap2.cuteSV.harmonized.vcf HG002 minimap2 cuteSV 0
output/harmonize/HG002.minimap2.sniffles.harmonized.vcf HG002 minimap2 sniffles 0
output/harmonize/HG002.minimap2.svim.harmonized.vcf HG002 minimap2 svim 0

An example output is shown below:

22  16147398    All_method.INS.1    G   GCCTCAGCCTCCCAAAAAGTGCTGGAATTATAAGCGTGAGCCACTGTGCCAACCGATTTTTTTGGTATTTTTAGTAAAGATGGGGTTTCATCATCTTGGAACTAGGCTGGTCTTGAACTCCTGATCTCGTGATCCACCCAC   .   .   SVTYPE=INS;END=16147398;SVLEN=140;AC=2;REPRESENT_SV=HG002.NGMLR.svim.INS.1;CIPOS_NGMLR_CUTESV=-14,14;CIPOS_MINIMAP2_CUTESV=-8,8;RE_NGMLR_CUTESV=6;RE_NGMLR_SNIFFLES=6;RE_MINIMAP2_CUTESV=35;RE_MINIMAP2_SNIFFLES=40;RE_MINIMAP2_SVIM=39;RE_NGMLR_SVIM=6;CILEN_NGMLR_CUTESV=-9,9;CILEN_MINIMAP2_CUTESV=-4,4;DP_NGMLR_CUTESV=6;DP_NGMLR_SNIFFLES=6;DP_MINIMAP2_CUTESV=39;DP_MINIMAP2_SNIFFLES=40;DP_MINIMAP2_SVIM=40;DP_NGMLR_SVIM=6;PRECISE_NGMLR_CUTESV;PRECISE_MINIMAP2_CUTESV;VAF_NGMLR_CUTESV=1;VAF_NGMLR_SNIFFLES=1;VAF_MINIMAP2_CUTESV=0.897436;VAF_MINIMAP2_SNIFFLES=1;VAF_MINIMAP2_SVIM=0.975;VAF_NGMLR_SVIM=1;SUPP_METHOD=6;SUPP_METHOD_FORCE=4;SUPP_CALLER=3;SUPP_CALLER_FORCE=2;MAX_RE=40;MEAN_VAF=0.978739;STD_VAF=0.0374884;MEAN_VAF_CALL=0.978739;STD_VAF_CALL=0.0374884   GT:DP:AD    1/1:.:.,.
22  16212542    All_method.DEL.1    CGCTCCAAATATCCACTCGCGGTTTCTGCAAAAAGAGTGTTTCAAAACTTCTCAATCAAAAGAAAGGTTCAACTCTGTGAGATGAATGCACACATCACAAAGAAGTTTCTCAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTCCCTTTTCCACCACAGGCC   C   .   .   SVTYPE=DEL;END=16212706;SVLEN=-164;AC=1;REPRESENT_SV=HG002.minimap2.svim.DEL.2;CIPOS_NGMLR_CUTESV=-4,4;CIPOS_MINIMAP2_CUTESV=-22,22;RE_NGMLR_CUTESV=2;RE_NGMLR_SNIFFLES=2;RE_MINIMAP2_CUTESV=46;RE_MINIMAP2_SNIFFLES=46;RE_MINIMAP2_SVIM=44;RE_NGMLR_SVIM=0;CILEN_NGMLR_CUTESV=-1,1;CILEN_MINIMAP2_CUTESV=-7,7;DP_NGMLR_CUTESV=3;DP_NGMLR_SNIFFLES=6;DP_MINIMAP2_CUTESV=66;DP_MINIMAP2_SNIFFLES=68;DP_MINIMAP2_SVIM=72;DP_NGMLR_SVIM=.;PRECISE_NGMLR_CUTESV;PRECISE_MINIMAP2_CUTESV;VAF_NGMLR_CUTESV=0.666667;VAF_NGMLR_SNIFFLES=0.333333;VAF_MINIMAP2_CUTESV=0.69697;VAF_MINIMAP2_SNIFFLES=0.676471;VAF_MINIMAP2_SVIM=0.611111;VAF_NGMLR_SVIM=0;SUPP_METHOD=5;SUPP_METHOD_FORCE=4;SUPP_CALLER=3;SUPP_CALLER_FORCE=2;MAX_RE=46;MEAN_VAF=0.497425;STD_VAF=0.254231;MEAN_VAF_CALL=0.59691;STD_VAF_CALL=0.13482    GT:DP:AD    0/1:.:.,.

Arguments

Input/Output arguments:

-i, --invcf VCF
input representative SV call set VCF/BCF
-f, --manifest TSV
tab separated manifest file of SV calling results. Column headers should be: file, sample, aligner, caller, is_force_call, info (optional)
--sample ID
sample to be genotyped, should be listed in the manifest file
-o, --outvcf VCF
output VCF
-r, --region CHR
region to genotype (requires indexed VCF)

Genotyping arguments:

--min-dp 10
minimum depth to to genotype SVs as 0/0 if no method has VAF > --hetero-freq, otherwise ./. (default: 10)
--homo-freq 0.8
minimum average variant allele fraction to genotype as homozygous (default: 0.8)
--hetero-freq 0.2
minimum average variant allele fraction to genotype as heterozygous (default: 0.2)
--include all
methods to be included to determine genotype. Options: 'all', 'force_call', or comma-separated list of methods in format of 'ALIGNER_CALLER' (default: 'all')