harmonize

Harmonize SV VCFs across samples and SV calling methods

Last updated: 2025-03-19

Input requirmenets

  • VCF:
    • Format: bi-allelic VCF/BCF files following the VCF specification
    • INFO: SVTYPE, if other name is used, specify by --svtype

Output

  • VCF: harmonized VCF/BCF files.
    • ID: {PREFIX}.{SVTYPE}.{NUMBER} if --rename-id --id-prefix PREFIX is specified, otherwise the same as input.

Usage

harmonisv harmonize [options] -i <input_vcf> -o <output_vcf> 

Input filtering

By default, the following records will be removed:

  • Records with missing ID or duplicated ID if --rename-id is not specified
  • Records without INFO/SVTYPE
  • Records' INFO/SVTYPE not in "INS, DEL, DUP, INV, CNV, BND"

If the types of SVs are stored in INFO field with different names, use --svtype to specify the tag name.

If the SV types are not in the above list, use --INS, --DEL, --DUP, --INV, --CNV, --BND to normalize the SV types. For example, --info SVTYPE_raw=SVTYPE --DUP DUP,DUP:TANDEM will normalize SVTYPE=DUP:TANDEM to SVTYPE=DUP;SVTYPE_raw=DUP:TANDEM.

To disable the check, use --no-check.

VCF headers

The VCF headers produced by different SV calling methods are usually not compatable. It is necessary to have a unified VCF header for downstream analysis. If only INFO tags from the original VCFs will be used, one simple method is to use harmonisv harmonize-header to combine the headers from all VCFs (see harmonize-header). However, if any new INFO tags will be added / renamed, please use --header-str to specify them.

For example, the following command will combine the headers of method A, B, and C, and add INFO/STRANDS header.

harmonisv harmonize-header \
-i A.vcf,B.vcf,C.vcf \
-o harmonized_header.txt \

for vcf in A.vcf B.vcf C.vcf; do
    harmonisv harmonize \
    -i $vcf \
    -o harmonized.$vcf \
    --header harmonized_header.txt \
    --header-str 'STRANDS,1,String,Strand orientation of supporting reads'
done

Examples

In this example, we will rename variant ID, normalize SVTYPE, and extract the basic SV calling information:

  • SVTYPE
  • SVLEN
  • END
  • STRANDS (if exists)
  • DP (sequencing depth)
  • RE (number of reads supporting the SV)

Please note that SV-calling methods may store this information under different names. For instance, INFO/SUPPORT might be used for RE, and FORMAT/DP for DP. The sequencing depth might also be provided separately for REF and ALT alleles (e.g., FORMAT/DR and FORMAT/DV). Additionally, the same type of SV might have different names in different methods (e.g., DUP:TANDEM vs DUP). All the harmonized information will be stored in the INFO field.

Sniffles2 (ver 2.0.6)

harmonisv harmonize \
-i HG002.minimap2.sniffles.vcf \             # input VCF
-o HG002.minimap2.sniffles.harmonized.vcf \  # output VCF
--info SVTYPE,SVLEN,END,STRANDS=STRAND \     # INFO fields to be kept, rename STRAND to STRANDS
--format-to-info RE=DV \                     # Extract FORMAT/DV to INFO/RE
--format-to-info-sum DP=DR,DP=DV \           # Calculate INFO/DP = FORMAT/DR + FORMAT/DV
--header harmonized_header.txt \             # Replace VCF header
--id-prefix HG002.minimap2.sniffles \        # {SAMPLE}.{ALIGNER}.{CALLER}
--rename-id                                  # Rename all variant ID

SVIM (ver 2.0.0)

harmonisv harmonize \                         
-i HG002.minimap2.svim.vcf \                 # input VCF
-o HG002.minimap2.svim.harmonized.vcf \      # output VCF
--info SVTYPE,SVLEN,END,RE=SUPPORT \         # INFO fields to be kept, rename SUPPORT to RE
--format-to-info DP=DP \                     # Extract FORMAT/DP to INFO/DP
--DUP DUP,DUP:TANDEM,DUP:INT \               # Normalize the name of duplications
--header harmonized_header.txt \             # Harmonized VCF header
--id-prefix HG002.minimap2.svim \            # {SAMPLE}.{ALIGNER}.{CALLER}
--rename-id                                  # Rename all variant ID

cuteSV (ver 2.0.3)

harmonisv harmonize \
-i HG002.minimap2.cuteSV.vcf \               # input VCF
-o HG002.minimap2.cuteSV.harmonized.vcf \    # output VCF
--info SVTYPE,SVLEN,END,RE \                 # INFO fields to be kept
--format-to-info-sum DP=DR,DP=DV \           # Calculate INFO/DP = FORMAT/DR + FORMAT/DV
--header harmonized_header.txt \             # Harmonized VCF header
--id-prefix HG002.minimap2.cuteSV \          # {SAMPLE}.{ALIGNER}.{CALLER}
--rename-id                                  # Rename all variant ID

By default, all INFO tags not specified in --info will be removed. To keep all original INFO tags, use --keep-all. To keep the original INFO tags before renaming, use --keep-old. Meanwhile, allele count (AC) and allele number (AN) will also be automatically computed from genotypes. To disable this feature, use --no-AC.

Particularly, harmonisv harmonize -i input.vcf -o output.vcf --keep-all --no-AC --no-check should produce the same variant records as the input (some VCF headers are still added).

Arguments

Input/Output arguments:

-i, --invcf VCF
input VCF
-o, --outvcf VCF
output VCF

VCF INFO manipulation:

--info TAG
Comma separated INFO tags to extract or rename. INFO tags can be renamed by NEW=OLD, from high prioirty to low priority, e.g., 'NEW=OLD1,NEW=OLD2' means if OLD1 is present, use OLD1, otherwise use OLD2.
--info-sum TAG
Comma separated INFO tags to extract and sum. Old tags with the same new tag name will be summed up, e.g., 'NEW=OLD1,NEW=OLD2' means 'INFO/NEW = INFO/OLD1 + INFO/OLD2'. Please define the header of new tags in --header or --header-str
--format-to-info TAG
Comma separated FORMAT tags to sum across samples and add to INFO, from high prioirty to low priority, e.g., DP=DP means 'INFO/DP = sum(FORMAT/DP)'.
--format-to-info-sum TAG
Comma separated FORMAT tags to sum across samples and tags and add to INFO, e.g., 'DP=DR,DP=DV' means 'INFO/DP = sum(FORMAT/DR) + sum(FORMAT/DV)'. Please define the header of new tags in --header or --header-str
--info-to-alt TAG
Comma separated INFO tags to fill in ALT, from high prioirty to low priority. This is useful if insertion sequence stored in INFO.
--keep-old
Keep original INFO tags after renaming or sum (default: False)
--keep-all
Keep all original INFO tags (default: False)
--no-AC
Disable automatically add AC and AN to INFO (default: False)

VCF header manipulation:

--header FILE
New VCF header file to replace the header of input VCF.
--header-str string
Semicolon separated INFO header string added to new header (metadata separated by comma), e.g., 'DP,1,Integar,Sequencing depth;AF,1,Float,Allele frequency'

Structural variation format:

--id-prefix PREFIX
Rename SV ID to PREFIX.raw_ID. Final ID should be Sample.Aligner.Caller.unique_id for downstream analysis
--rename-id
Rename SV ID to PREFIX.SVTYPE.No., must use with --id-prefix
--svtype SVTYPE
INFO tag stores the structural variation type (default: SVTYPE), will rename it to SVTYPE if not.
--INS INS
Comma separated SVTYPE string for insertions, will be nomalized as INS (default: INS)
--DEL DEL
Comma separated SVTYPE string for deletions, will be nomalized as DEL (default: DEL)
--DUP DUP
Comma separated SVTYPE string for duplications, will be nomalized as DUP (default: DUP)
--INV INV
Comma separated SVTYPE string for inversions, will be nomalized as INV (default: INV)
--CNV CNV
Comma separated SVTYPE string for copy number variations, will be nomalized as CNV (default: CNV)
--BND BND
Comma separated SVTYPE string for breakends, will be nomalized as BND (default: BND)
--no-check
Disable check and filter on variant ID and SVTYPE (default: False)