The pipeline consolidates variant calls produced by multiple variant callers into a unified representation. Raw variant calls from each algorithm are first filtered and normalized independently, and then merged into a single candidate variant set.
This step ensures that variants detected by different algorithms are represented consistently and that equivalent variants reported by multiple callers are unified into a single record.
Each VCF file produced by the variant callers undergoes preprocessing before merging. This step standardizes variant representation and removes redundant records.
bcftools_PASS_norm_dedup.sh \
-i input.vcf.gz \
-f additional.vcf.gz \
-r reference.fasta
Arguments:
The preprocessing stage performs the following operations:
These operations are implemented using BCFtools.
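As a rough illustration of how such a preprocessing chain could be assembled, the sketch below builds three BCFtools invocations. The three steps are an assumption inferred only from the script name bcftools_PASS_norm_dedup.sh (keep PASS records, normalize against the reference, drop exact duplicates); the actual script may use different options or a different order. The function names build_preprocess_commands and as_shell_pipeline are hypothetical, and the sketch only constructs the commands without running them.

```python
import shlex


def build_preprocess_commands(input_vcf, reference_fasta, output_vcf):
    """Return bcftools argument lists for an assumed PASS -> norm -> dedup chain."""
    # Keep only records whose FILTER column is PASS.
    keep_pass = ["bcftools", "view", "-f", "PASS", "-Ou", input_vcf]
    # Left-align and normalize indels against the reference; "-" reads stdin.
    normalize = ["bcftools", "norm", "-f", reference_fasta, "-Ou", "-"]
    # Remove records that are exact duplicates and write compressed VCF.
    dedup = ["bcftools", "norm", "-d", "exact", "-Oz", "-o", output_vcf, "-"]
    return [keep_pass, normalize, dedup]


def as_shell_pipeline(commands):
    """Join the argument lists into a single shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    cmds = build_preprocess_commands(
        "input.vcf.gz", "reference.fasta", "normalized.vcf.gz"
    )
    print(as_shell_pipeline(cmds))
```

Building the commands as argument lists keeps file names safely quoted and makes each step easy to inspect before anything is executed.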
After preprocessing, the normalized VCF files generated by each variant caller are merged into a unified variant representation.
merge_callers.py \
-i TNhaplotyper2:tnhaplotyper2.vcf.gz \
-i Strelka2:strelka2.vcf.gz \
-i RUFUS:rufus.vcf.gz \
-i longcallD:longcalld.vcf.gz \
-s sample \
-o merged.vcf.gz
Arguments:
CALLER:VCF (can be provided multiple times). Supported callers are TNhaplotyper2, Strelka2, RUFUS, longcallD.
The merging step consolidates variants reported by multiple callers into a single record when they share the same genomic position and allele representation (CHROM, POS, REF, ALT).
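The repeated CALLER:VCF input can be handled with a small argument parser. The following is a minimal sketch, not the actual merge_callers.py: the helper names parse_caller_spec and build_parser are hypothetical, and the meanings assumed for -s (sample name) and -o (output path) are inferred from the example command above.

```python
import argparse

# Caller names listed as supported in this section.
SUPPORTED_CALLERS = ("TNhaplotyper2", "Strelka2", "RUFUS", "longcallD")


def parse_caller_spec(spec):
    """Split one CALLER:VCF value into (caller, vcf_path) and validate it."""
    caller, sep, path = spec.partition(":")
    if not sep or not path:
        raise argparse.ArgumentTypeError(f"expected CALLER:VCF, got {spec!r}")
    if caller not in SUPPORTED_CALLERS:
        raise argparse.ArgumentTypeError(f"unsupported caller {caller!r}")
    return caller, path


def build_parser():
    """Build a parser mirroring the flags shown in the example command."""
    parser = argparse.ArgumentParser(description="Merge per-caller VCF files.")
    # action="append" collects every -i occurrence into one list.
    parser.add_argument("-i", dest="inputs", action="append",
                        type=parse_caller_spec, required=True,
                        metavar="CALLER:VCF")
    parser.add_argument("-s", dest="sample", required=True)  # assumed: sample name
    parser.add_argument("-o", dest="output", required=True)  # assumed: output VCF
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for caller, path in args.inputs:
        print(caller, path)
```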
For each merged variant, the pipeline records which algorithms detected the variant.
This information is stored in the INFO field:
CALLERS=Strelka2,TNhaplotyper2
This annotation allows downstream filtering steps to evaluate support for each variant across independent algorithms.
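The merging rule and the CALLERS annotation can be illustrated with a few lines of Python. This is a simplified sketch that works on plain (CHROM, POS, REF, ALT) tuples rather than real VCF records; the function names merge_calls and callers_info are hypothetical, and the alphabetical ordering of caller names is an assumption chosen to match the example above.

```python
def merge_calls(calls_by_caller):
    """Group variants by (CHROM, POS, REF, ALT) and collect supporting callers.

    calls_by_caller maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples taken from its preprocessed VCF.
    """
    merged = {}
    for caller, records in calls_by_caller.items():
        for chrom, pos, ref, alt in records:
            key = (chrom, pos, ref, alt)
            # Identical keys from different callers collapse into one entry.
            merged.setdefault(key, set()).add(caller)
    return merged


def callers_info(callers):
    """Format the supporting callers as a CALLERS INFO entry."""
    return "CALLERS=" + ",".join(sorted(callers))


if __name__ == "__main__":
    calls = {
        "TNhaplotyper2": [("chr1", 100, "A", "T")],
        "Strelka2": [("chr1", 100, "A", "T"), ("chr2", 5, "G", "C")],
    }
    for key, callers in merge_calls(calls).items():
        print(key, callers_info(callers))
```

Because the key includes REF and ALT, two callers are only unified when they report the same allele representation, which is why normalization has to happen before merging.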
During merging, the pipeline reconstructs the VCF header to ensure consistency across callers. The process includes:
##contig definitions
##SAMPLE=<ID=...>
After validation, variants are annotated based on the level of cross-technology support observed across sequencing platforms.
Variants are labeled CrossCaller when they are independently detected by two or more somatic callers (listed in the CALLERS INFO field).
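The CrossCaller rule can be expressed as a small check on the INFO column. The sketch below is illustrative only: parse_info and is_cross_caller are hypothetical names, every caller listed in CALLERS is assumed to count as a somatic caller, and the function only returns a boolean because how the pipeline stores the label is not specified here.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_cross_caller(info, min_callers=2):
    """Return True when CALLERS lists at least min_callers distinct callers."""
    callers = parse_info(info).get("CALLERS", "")
    if callers is True:  # CALLERS present as a bare flag with no value
        return False
    names = {name for name in callers.split(",") if name}
    return len(names) >= min_callers


if __name__ == "__main__":
    print(is_cross_caller("DP=30;CALLERS=Strelka2,TNhaplotyper2"))
    print(is_cross_caller("CALLERS=RUFUS"))
```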
All the relevant code can be accessed in the GitHub repository: