Recent Illumina platforms that use one- or two-channel sequencing chemistry, such as NovaSeq, may introduce homopolymer runs of G bases (polyG) as artifacts. On these systems G is the "dark" base, detected as the absence of signal, so when base calling continues after synthesis has terminated, erroneous high-confidence G bases are called at the ends of the affected reads. A large number of these reads may then align to reference regions with high G content (e.g., chr2:32916230-32916625), creating problems for downstream processing.
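To get a rough sense of how many reads in a file carry such a tail, the reads whose sequence ends in a long run of G bases can be counted. The sketch below is only an approximation and is not how fastp detects polyG: the helper name `count_polyg_tails` is hypothetical, the input is assumed to be an uncompressed FASTQ file with four lines per record, and the default threshold of 10 bases is an assumption chosen for illustration.

```shell
# Hypothetical helper, not part of fastp or the pipeline.
# Prints the number of reads whose sequence ends in at least MIN_LEN G bases.
# Assumes an uncompressed FASTQ file with exactly four lines per record.
count_polyg_tails() {
    fastq="$1"
    min_len="${2:-10}"   # assumed threshold, adjust as needed
    # Sequence lines are lines 2, 6, 10, ... of the file.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    awk 'NR % 4 == 2' "$fastq" | grep -c -E "G{${min_len}}\$" || true
}
```

For example, `count_polyg_tails reads.fastq` prints the count for the first-in-pair reads, and `count_polyg_tails mates.fastq 20` applies a stricter threshold to the mates.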
As part of FASTQ file preprocessing, raw reads generated by Illumina sequencing systems are filtered with fastp to remove read pairs containing polyG artifacts:
    fastp \
        --dont_eval_duplication \
        --disable_adapter_trimming \
        --disable_quality_filtering \
        --trim_poly_g \
        --length_required read_length \
        -i reads.fastq -I mates.fastq \
        -o reads.filtered.fastq -O mates.filtered.fastq
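Arguments:
--dont_eval_duplication: skips duplication rate evaluation, which saves time and memory.
--disable_adapter_trimming: turns off adapter trimming, which fastp enables by default.
--disable_quality_filtering: turns off quality filtering, which fastp enables by default.
--trim_poly_g: forces trimming of polyG tails.
--length_required read_length: discards reads shorter than read_length. Because polyG trimming shortens the affected reads, setting read_length to the sequencing read length removes every read pair in which a polyG tail was trimmed.
-i, -I: input FASTQ files for the reads and their mates.
-o, -O: output FASTQ files for the filtered reads and their mates.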
The pipeline uses fastp version 0.23.2.
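The `read_length` placeholder in the command above has to be replaced with an actual value. As a minimal sketch, assuming uncompressed FASTQ input in which every raw read has the same length, the value can be taken from the first record and passed to fastp. The function names `first_read_length` and `filter_polyg` are hypothetical and only illustrate the idea; they are not part of the pipeline.

```shell
# Hypothetical helper: prints the length of the first read in a FASTQ file.
# Assumes uncompressed input and that all raw reads share this length.
first_read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Hypothetical wrapper around the fastp command documented above.
# Usage: filter_polyg reads.fastq mates.fastq
filter_polyg() {
    reads="$1"
    mates="$2"
    read_length=$(first_read_length "$reads")
    fastp \
        --dont_eval_duplication \
        --disable_adapter_trimming \
        --disable_quality_filtering \
        --trim_poly_g \
        --length_required "$read_length" \
        -i "$reads" -I "$mates" \
        -o "${reads%.fastq}.filtered.fastq" -O "${mates%.fastq}.filtered.fastq"
}
```

Calling `filter_polyg reads.fastq mates.fastq` would write `reads.filtered.fastq` and `mates.filtered.fastq`. If the input has already been trimmed to variable lengths, the first record is not a reliable guide and the read length should be supplied explicitly instead.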