As the second step, the pipeline assigns each read to a read group, an identifier that groups reads together. A read group (@RG) captures relevant information about the sample and the sequencing process and technology, and is used by various downstream bioinformatics tools.
The relevant fields in defining a read group include ID (read group identifier), SM (sample), PL (platform), PU (platform unit), and LB (library).
To assign read groups, an in-house Python script is used. It automatically generates read groups from Illumina read names and handles multiple read groups in the same file (e.g., when reads from multiple lanes are merged into a single file).
The read groups are assigned as follows:
ID: <sample name>.<instrument>_<run>_<flow cell>.<lane>
SM: <sample name>
PL: <platform>
PU: <instrument>_<run>_<flow cell>.<lane>
LB: <sample name>.<library>
For example, in a BAM file header:
@RG ID:SMAHT1.ST-E00127_336_HJ7YHCCXX.8 SM:SMAHT1 PL:ILLUMINA PU:ST-E00127_336_HJ7YHCCXX.8 LB:SMAHT1.HISEQ-LIB1
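The naming scheme above can be illustrated with a short Python sketch. This is not the in-house script itself. It assumes Illumina read names of the form <instrument>:<run>:<flow cell>:<lane>:..., and the function names (read_group_fields, header_line, collect_read_groups) are hypothetical. The sample and library names are taken as inputs, since they cannot be derived from the read name.

```python
def read_group_fields(read_name, sample, library, platform="ILLUMINA"):
    """Derive @RG fields from one Illumina read name (hypothetical helper).

    Assumes the read name starts with
    <instrument>:<run>:<flow cell>:<lane>:...
    """
    # Drop the leading '@' of a FASTQ header and any comment after the first space.
    core = read_name.lstrip("@").partition(" ")[0]
    parts = core.split(":")
    if len(parts) < 4:
        raise ValueError(f"Unexpected read name format: {read_name!r}")
    instrument, run, flow_cell, lane = parts[:4]
    unit = f"{instrument}_{run}_{flow_cell}.{lane}"
    return {
        "ID": f"{sample}.{unit}",
        "SM": sample,
        "PL": platform,
        "PU": unit,
        "LB": f"{sample}.{library}",
    }


def header_line(fields):
    """Format the fields as a tab-separated @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


def collect_read_groups(read_names, sample, library):
    """Return one set of fields per distinct read group, in order of first appearance.

    Covers the case where reads from several lanes share one file.
    """
    groups = {}
    for name in read_names:
        fields = read_group_fields(name, sample, library)
        groups.setdefault(fields["ID"], fields)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up read names: two reads from lane 8 and one from lane 7.
    names = [
        "@ST-E00127:336:HJ7YHCCXX:8:1101:1000:2000 1:N:0:ATCACG",
        "@ST-E00127:336:HJ7YHCCXX:8:1101:1001:2001 1:N:0:ATCACG",
        "@ST-E00127:336:HJ7YHCCXX:7:1101:1002:2002 1:N:0:ATCACG",
    ]
    for rg in collect_read_groups(names, "SMAHT1", "HISEQ-LIB1"):
        print(header_line(rg))
```

In this sketch, reads from different lanes of the same run get different ID and PU values but share the same SM and LB, so downstream tools can still treat them as one sample and library.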
All the relevant code is accessible in the GitHub repository: