In this step, the pipeline marks duplicate reads. Duplicate reads are sequencing artifacts that arise during library preparation and sequencing. Duplicates are evaluated per library, using the LB tag in the read groups.
The pipeline does not remove duplicate reads; they are flagged directly in the BAM file.
sentieon driver -i sorted.bam \
  --algo LocusCollector \
  --fun score_info \
  score.txt
sentieon driver -i sorted.bam \
  --algo Dedup \
  --optical_dup_pix_dist 2500 \
  --score_info score.txt \
  deduped.bam
Arguments:
To confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.
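The in-house code itself is not reproduced here. As a minimal sketch of this kind of check, the snippet below compares the last 28 bytes of a file against the empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. The function name `has_bgzf_eof` is hypothetical and chosen for illustration only; the pipeline's actual implementation may differ.

```python
import os

# 28-byte empty BGZF block that the SAM/BAM specification defines as the
# end-of-file marker of a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker.

    Hypothetical helper for illustration; not the pipeline's own code.
    """
    if os.path.getsize(path) < len(BGZF_EOF):
        # Too short to contain the marker at all.
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read(len(BGZF_EOF)) == BGZF_EOF
```

A call such as `has_bgzf_eof("deduped.bam")` returns False for a file that was truncated during writing, because the trailing empty block is missing. The check only inspects the tail of the file, so it does not validate the records that precede the marker.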
The pipeline implementation uses Sentieon LocusCollector to calculate duplicate metrics per library and the Dedup algorithm to mark duplicate reads in the BAM file. The pipeline uses Sentieon version 202308.01, which corresponds to Picard 2.9.0. Together, the two algorithms are equivalent to the MarkDuplicates algorithm in Picard.
java -jar picard.jar MarkDuplicates \
  INPUT=sorted.bam \
  OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
  OUTPUT=deduped.bam
All the relevant code can be accessed in the GitHub repository: