pipelines-docs

Duplicate Reads

In this step, the pipeline marks duplicate reads. Duplicate reads are sequencing artifacts that originate during library preparation and sequencing runs. Duplicate reads are evaluated per-library using the LB tag in the read groups.

The pipeline does not remove duplicate reads; it flags them directly in the BAM file.
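Because duplicates are kept and only flagged, downstream tools have to inspect the SAM FLAG field to exclude them. The sketch below shows one way to test the duplicate bit (0x400 in the SAM specification) and count flagged records from SAM-formatted text lines. The function names are illustrative and are not part of the pipeline code.

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG field marks a PCR or optical duplicate.
DUPLICATE_FLAG = 0x400


def is_duplicate(flag: int) -> bool:
    """Return True if the SAM FLAG value has the duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)


def count_duplicates(sam_lines):
    """Return (total, duplicates) for an iterable of SAM text lines."""
    total = 0
    duplicates = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        total += 1
        # FLAG is the second column of a SAM alignment record.
        if is_duplicate(int(fields[1])):
            duplicates += 1
    return total, duplicates
```

The input lines could come, for example, from the text output of samtools view deduped.bam.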

Detecting and Marking Duplicates

Detect duplicate reads

sentieon driver -i sorted.bam \
                --algo LocusCollector \
                --fun score_info \
                score.txt

Mark duplicate reads

sentieon driver -i sorted.bam \
                --algo Dedup \
                --optical_dup_pix_dist 2500 \
                --score_info score.txt \
                deduped.bam

Arguments:

--optical_dup_pix_dist: maximum offset, in pixels, between two duplicate clusters for them to be considered optical duplicates. For patterned flow cells (NovaSeq, HiSeq 4000, HiSeq X), the pipeline uses 2500.
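To illustrate what the pixel distance means, the following sketch applies a simplified rule modeled on Picard's description of optical duplicates: two duplicate reads count as optical duplicates when they come from the same tile and their cluster coordinates differ by no more than the threshold on both axes. It assumes seven-field Illumina read names (instrument:run:flowcell:lane:tile:x:y). The helper names are hypothetical, and this is not Sentieon's actual implementation.

```python
from typing import NamedTuple


class ClusterPosition(NamedTuple):
    lane: str
    tile: str
    x: int
    y: int


def parse_illumina_name(read_name: str) -> ClusterPosition:
    """Parse 'instrument:run:flowcell:lane:tile:x:y' (assumed format)."""
    fields = read_name.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected read name format: {read_name!r}")
    return ClusterPosition(
        lane=fields[3], tile=fields[4], x=int(fields[5]), y=int(fields[6])
    )


def is_optical_pair(a: ClusterPosition, b: ClusterPosition, pix_dist: int = 2500) -> bool:
    """Simplified check: same lane and tile, and within pix_dist on both axes."""
    same_tile = (a.lane, a.tile) == (b.lane, b.tile)
    return same_tile and abs(a.x - b.x) <= pix_dist and abs(a.y - b.y) <= pix_dist
```

With the threshold of 2500 used by the pipeline, two clusters on the same tile that are 1000 pixels apart in x and 1500 in y would pass this check, while clusters 3000 pixels apart in y would not.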

Integrity Check

To confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.
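The in-house code is not reproduced here, but a check of this kind can be sketched as follows. The 28-byte sequence is the BGZF end-of-file block defined in the SAM/BAM specification; the function name has_bgzf_eof is an assumption chosen for illustration.

```python
import os

# The empty BGZF block that terminates a complete BAM file (28 bytes),
# as defined in the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False  # too short to contain the marker at all
    with open(path, "rb") as handle:
        # Read exactly the last 28 bytes of the file.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A missing marker usually indicates that the file was truncated, for example by an interrupted write or transfer. A present marker does not by itself prove that the rest of the file is intact.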

Implementation with Sentieon

The pipeline uses the Sentieon LocusCollector algorithm to calculate duplicate metrics per library and the Sentieon Dedup algorithm to mark duplicate reads in the BAM file. The pipeline runs Sentieon version 202308.01, which corresponds to Picard 2.9.0. Together, the two algorithms are equivalent to the MarkDuplicates algorithm in Picard.

Detect and mark duplicate reads (Picard equivalent)

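# METRICS_FILE is required by Picard MarkDuplicates; the file name below is a placeholder.
java -jar picard.jar MarkDuplicates \
      INPUT=sorted.bam \
      OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
      METRICS_FILE=dedup_metrics.txt \
      OUTPUT=deduped.bam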

Source Code

All the relevant code can be accessed in the GitHub repository:

sentieon_Dedup.sh [LocusCollector + Dedup]
