represent

Select the representative SV from merged SVs

Last updated: 2025-03-14

Input requirmenets

VCF:
- Format: bi-allelic VCF/BCF files following the VCF specification
- INFO: SVTYPE, SVLEN, AC
- ID: unique ID across all input VCFs
VCF file list:
- Format: one VCF/BCF per line
SV merging results:
- Format: comma-separated list of merged SV IDs, unique SVs that are not merged with others should also be included without comma
Region:
- Format: region string accepted by bcftools (e.g., chr1[:100-200])

Output

VCF: representative SV VCF/BCF files.
- INFO:
  - REPRESENT_SV: original ID of the representative SV
  - ID_LIST: list of SVs merged to the representative SV
- ID: {SVTYPE}.{NUMBER} by default. --id-prefix PREFIX will add {PREFIX}. to the beginning, --region STR will add {STR}_ before {NUMBER}.

Usage

harmonisv represent [options] -i <input_vcf> -o <output_vcf> --merge <merge_file>
harmonisv represent [options] -f <file_list> -o <output_vcf> --merge <merge_file>

Representative SV

SV calling results can vary significantly across different methods or samples. Therefore, merging SVs across samples and methods is a common practice to identify non-redundant SVs. Once the SVs are merged, the position, length, and alleles of the merged SV can be determined based on the representative SV. The represent command is used to select the representative SV from the merged SVs using one of the following criteria:

--by-max TAG: This option selects the SV with the maximum value of INFO/TAG.
--by-freq: This option selects the SV with the maximum frequency of POS and SVLEN. If more than one SV has the same frequency, it selects the one closest to the average POS and SVLEN.

Examples

1. Select representative SVs from SVs merged across samples and methods

In this example, we will first use Jasmine to merge SVs across samples and methods, and then select the representative SV based on the prevalence of SVs' positions and lengths.


# prepare VCF file list
for vcf in SampleA.Method1.vcf SampleA.Method2.vcf SampleB.Method1.vcf SampleB.Method2.vcf; do
    echo ${vcf}
done > vcf_list.txt

# SV merging across samples and methods
jasmine \
file_list=vcf_list.txt \
out_file=merged.vcf \
--keep_var_ids \
--ignore_strand 

# extract SV merging results
bcftools query -f '%IDLIST\n' merged.vcf > merge.txt

# Select representative SV based on the frequency of POS and SVLEN
harmonisv represent \
-f vcf_list.txt \
-o representative.vcf \
--merge merge.txt \
--by-freq \
--id-prefix Merged \            # rename output SV ID as Merged.SVTYPE.NUMBER
--save-id  \                    # save merged SVs to INFO/ID_LIST
--min-len-input 35 \            # only SVs with SVLEN < 35 are considered when selecting representative SV
--min-len-output 50             # drop representative SVs with SVLEN < 50

2. Remove duplicated SVs within the same sample and method

SV calling methods may produce duplicated SVs within the same sequencing data. This example will first use Jasmine to merge SVs within the same sample and method, and then select the representative SV based on the maximum value of reads supporting the SV.

# SV merging within the same sample and method
jasmine \
file_list=input.vcf \
out_file=jasmine.vcf \
--comma_filelist \
max_dist=200 \
--allow_intrasample \
--nonlinear_dist \
--ignore_strand \
--keep_var_ids

# extract SV merging results
bcftools query -f '%INTRASAMPLE_IDLIST\n' jasmine.vcf > dup_sv.txt

# Select representative SV based on the maximum value of reads supporting the SV (INFO/RE)
harmonisv represent \
-i input.vcf \
-o dedup.vcf \
--merge dup_sv.txt \
--by-max RE \
--min-len-input 30 \
--min-len-output 30

Arguments

Input/Output arguments:

-i, --invcf VCF: input VCF
-f, --file-list TSV: list of VCF files, one VCF per line
-o, --outvcf VCF: output VCF
--merge FILE: SV merging results, each line is a comma-separated list of IDs of merged SVs, or the ID of unique SVs
-r, --region CHR: genomic region to work on (requires VCF with index)

Representative SV selection arguments:

--id-prefix PREFIX: Rename output SV ID as PREFIX.SVTYPE.NUMBER
--by-max TAG: Select representative SV by maximum INFO/TAG value
--by-freq: Select representative SV by frequency of POS, SVLEN, if >1 SVs have same frequency, select the one closest to average POS, SVLEN
--save-id: Save original IDs to INFO/ID_LIST
--keep-hom-ref: Keep SVs with AC=0 before selecting representative SV (Default: False)
--min-len-input N: Remove SVLEN < min_len_input before selecting representative SV (Default: 35)
--min-len-output N: Remove SVLEN < min_len_output from output (Default: 50)