pipelines-docs

Cross-Technology Validation

After hierarchical filtering, candidate variants undergo cross-technology validation to assess support across multiple sequencing platforms and evaluate consistency with donor-level genomic context.

This step integrates evidence from short-read and long-read sequencing data to distinguish true mosaic variants from sequencing artifacts and germline variation.

Validation Workflow

Candidate variants are evaluated using long-read evidence, statistical models, and haplotype information. The validation process applies a series of filters to ensure that retained variants are supported by sequencing data and consistent with expected mosaic variant properties.

Coverage and Error Model Filtering

Variants are retained only if sufficient read coverage is present at the variant locus in either short-read data or pooled donor-level long-read data.

For each candidate variant, a binomial test (Poisson approximation) is applied to evaluate whether the observed alternate read count exceeds expectations under a 0.1% sequencing error model (p < 0.01) [1].

At typical short-read coverage (~300X), this threshold corresponds to a minimum of roughly 2–3 alternate reads supporting the variant.
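The error model test can be illustrated with a minimal sketch. It assumes the p-value is the upper tail P(X ≥ k) of a Poisson distribution with mean depth × error rate; the exact tail convention used by the pipeline is not stated here, and the function names are hypothetical rather than taken from the repository scripts.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def passes_error_model(alt_reads: int, depth: int,
                       error_rate: float = 0.001,
                       alpha: float = 0.01) -> bool:
    """True if the alt count is unlikely under the sequencing error model."""
    lam = depth * error_rate  # expected number of error reads
    return poisson_upper_tail(alt_reads, lam) < alpha


def min_alt_reads(depth: int, error_rate: float = 0.001,
                  alpha: float = 0.01) -> int:
    """Smallest alt read count that passes the test at this depth."""
    k = 1
    while not passes_error_model(k, depth, error_rate, alpha):
        k += 1
    return k


if __name__ == "__main__":
    for depth in (100, 300):
        print(depth, min_alt_reads(depth))
```

Under these assumptions the minimum is 2 alternate reads at 100X and 3 at 300X, since two reads at 300X give p ≈ 0.037.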

Germline Variant Exclusion

Variants supported by long-read data are evaluated using a germline binomial test to ensure the observed variant allele fraction (VAF) is inconsistent with a germline heterozygous genotype.

Variant allele fractions derived from pooled donor-level long-read data (PacBio and ONT) are evaluated jointly across both technologies.

If a variant lacks long-read support based on pileup evidence, the germline binomial test is applied to the tissue-specific short-read data.

Variants whose VAF is consistent with the expected germline heterozygous level (~0.5) are excluded.
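A minimal sketch of the germline exclusion logic follows. It assumes a one-sided lower-tail binomial test against p = 0.5, a significance level of 0.01, and simple summing of PacBio and ONT counts for the joint evaluation. None of these details are specified above, and the function names are hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p), summed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(1.0, total)


def rejects_germline_het(alt: int, depth: int, alpha: float = 0.01) -> bool:
    """True if the VAF is significantly below 0.5 (assumed one-sided test)."""
    return binom_cdf(alt, depth, 0.5) < alpha


def keep_variant(sr, pb, ont, alpha: float = 0.01) -> bool:
    """Each argument is an (alt_reads, depth) pair for one data source.

    Pooled long-read counts are used when any long-read alt support exists;
    otherwise the tissue-specific short-read counts are tested.
    """
    lr_alt, lr_depth = pb[0] + ont[0], pb[1] + ont[1]
    alt, depth = (lr_alt, lr_depth) if lr_alt > 0 else sr
    return rejects_germline_het(alt, depth, alpha)


if __name__ == "__main__":
    print(keep_variant(sr=(3, 300), pb=(0, 30), ont=(0, 20)))
    print(keep_variant(sr=(3, 300), pb=(14, 30), ont=(11, 20)))
```

In the first call there is no long-read support, so the short-read counts are tested and the variant is kept. In the second, the pooled long-read VAF is 25/50, which cannot be distinguished from a heterozygous germline genotype.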

Long-Read Phasing

PacBio reads are used to phase candidate variants relative to nearby heterozygous germline SNVs identified from DNAscope Hybrid germline calls.

For each variant:

The nearest heterozygous germline SNV within ±5 kb is identified.
PacBio reads spanning both loci are used to determine haplotypic consistency.

If no heterozygous germline SNV is present within ±5 kb, the variant is retained and labeled:

UNABLE_TO_PHASE
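The phasing lookup described above can be sketched as follows. This is an illustration only: it assumes the heterozygous germline SNV positions for one chromosome are available as a sorted list, and it treats a variant as haplotype-consistent when every spanning read carrying the mosaic allele also carries the same germline allele. That criterion and the function names are assumptions, not the logic of the phasing scripts.

```python
import bisect
from collections import Counter

WINDOW = 5_000  # +/- 5 kb search window


def nearest_het_snv(variant_pos, het_positions, window=WINDOW):
    """Return the closest het germline SNV within the window, or None.

    het_positions must be sorted ascending and on the same chromosome.
    """
    i = bisect.bisect_left(het_positions, variant_pos)
    neighbours = het_positions[max(0, i - 1):i + 1]  # one on each side
    in_window = [p for p in neighbours if abs(p - variant_pos) <= window]
    if not in_window:
        return None
    return min(in_window, key=lambda p: abs(p - variant_pos))


def alt_is_haplotype_consistent(spanning_reads):
    """spanning_reads: iterable of (germline_allele, carries_mosaic_alt).

    Assumed criterion: all mosaic-alt reads share one germline allele.
    """
    alt_haps = Counter(g for g, carries_alt in spanning_reads if carries_alt)
    return len(alt_haps) == 1


if __name__ == "__main__":
    snv = nearest_het_snv(10_000, [1_000, 9_200, 16_000])
    print(snv if snv is not None else "UNABLE_TO_PHASE")
    reads = [("A", True), ("A", False), ("G", False), ("A", True)]
    print(alt_is_haplotype_consistent(reads))
```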

Cross-Evidence Classification

After validation, variants are annotated based on the level of cross-technology support observed across sequencing platforms.

CrossTech

Variants are labeled CrossTech when they are detected in two or more sequencing technologies:

Illumina
PacBio
ONT

Detection requires support exceeding the sequencing error threshold defined in the coverage filtering step.

This annotation indicates independent evidence for the variant across multiple sequencing platforms, increasing confidence in the call.
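As an illustration of the CrossTech rule, the sketch below counts the technologies in which a variant exceeds the error threshold. It assumes the same 0.1% error rate and p < 0.01 cutoff are applied per technology with an upper-tail Poisson p-value; the input layout and function names are hypothetical.

```python
import math

ERROR_RATE = 0.001  # 0.1% sequencing error model
ALPHA = 0.01        # significance cutoff


def exceeds_error_threshold(alt: int, depth: int) -> bool:
    """True if P(X >= alt) under Poisson(depth * ERROR_RATE) is below ALPHA."""
    if alt <= 0:
        return False
    lam = depth * ERROR_RATE
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(alt))
    return (1.0 - cdf) < ALPHA


def cross_tech_label(counts):
    """counts maps technology name to an (alt_reads, depth) pair.

    Returns ("CrossTech" or None, sorted list of detecting technologies).
    """
    detected = sorted(tech for tech, (alt, depth) in counts.items()
                      if exceeds_error_threshold(alt, depth))
    label = "CrossTech" if len(detected) >= 2 else None
    return label, detected


if __name__ == "__main__":
    print(cross_tech_label(
        {"Illumina": (6, 300), "PacBio": (3, 30), "ONT": (0, 25)}))
```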

[1] The binomial test evaluates whether the observed number of alternate reads exceeds the expectation under a sequencing error model (0.1%). A Poisson approximation is used to estimate the probability of observing the alternate read count under the null hypothesis.

Source Code

The relevant code is available in the GitHub repository:

minipileup-parallel.sh [minipileup]
minipileup-parallel_sr_only.sh [minipileup]
tier_filter_variants_SR_PB_ONT.py [tier_filter_variants_SR_PB_ONT]
phase_mosaic_vars.sh [phase_mosaic_vars]
phasing_step1_get_closest_germline.sh [phasing_step1_get_closest_germline]
phasing_step2_phase_mosaic.py [phasing_step2_phase_mosaic]
bcftools_regions.sh [Bcftools]
