pipelines-docs

Read Annotation

In this step, the pipeline annotates individual full-length non-chimeric (FLNC) reads with the isoform-level classification generated in the previous step, so that downstream analyses can trace high-confidence isoforms back to the specific reads that support them.

The annotation is performed using a custom in-house script that lifts isoform classification to the read level.
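
As a rough illustration of what lifting a classification to the read level involves, the sketch below joins a read-to-isoform table with an isoform classification table. It is not the in-house script. The function name `lift_to_reads` and the column names (`id`, `pbid`, `isoform`, `structural_category`, `associated_gene`, `associated_transcript`, `subcategory`) are assumptions about the input files, and counting supporting reads directly from the mapping is one possible choice, not necessarily the one the script makes.

```python
import csv
import io
from collections import Counter


def lift_to_reads(read_stat_file, classification_file):
    """Map each read ID to the classification of the isoform it supports.

    Column names are assumptions for illustration: `id` and `pbid` in the
    read-to-isoform table, `isoform` plus category columns in the
    classification table.
    """
    class_rows = {
        row["isoform"]: row
        for row in csv.DictReader(classification_file, delimiter="\t")
    }
    read_to_isoform = {
        row["id"]: row["pbid"]
        for row in csv.DictReader(read_stat_file, delimiter="\t")
    }
    # Number of reads assigned to each isoform in the mapping.
    support = Counter(read_to_isoform.values())

    annotated = {}
    for read_id, isoform in read_to_isoform.items():
        row = class_rows.get(isoform)
        if row is None:
            # Isoform is absent from the filtered classification; skip the read.
            continue
        annotated[read_id] = {
            "isoform": isoform,
            "structural_category": row["structural_category"],
            "gene": row["associated_gene"],
            "transcript": row["associated_transcript"],
            "subcategory": row["subcategory"],
            "read_count": support[isoform],
        }
    return annotated


# Tiny in-memory example with made-up identifiers.
read_stat = io.StringIO("id\tpbid\nread1\tPB.1.1\nread2\tPB.1.1\nread3\tPB.2.1\n")
classification = io.StringIO(
    "isoform\tstructural_category\tassociated_gene\tassociated_transcript\tsubcategory\n"
    "PB.1.1\tfull-splice_match\tGENE_A\tTX_A1\tmulti-exon\n"
)
result = lift_to_reads(read_stat, classification)
print(result["read1"]["structural_category"], result["read1"]["read_count"])
```

In this toy input, `read3` maps to an isoform that is missing from the classification table, so it receives no annotation.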

Annotation Tags

The annotations are stored in the output BAM file as the following custom tags:

Tag	Format	Description
`in:Z:`	string	Isoform ID.
`sc:Z:`	string	Structural category. One of: `full-splice_match`, `incomplete-splice_match`, `novel_in_catalog`, `novel_not_in_catalog`, `genic`, `antisense`, `fusion`, `intergenic`, `genic_intron`.
`gn:Z:`	string	Associated reference gene name.
`tn:Z:`	string	Associated reference transcript name.
`sb:Z:`	string	Subcategory with additional splicing information. Values may include `mono-exon`, `multi-exon`, and `intron_retention`; multiple values are separated by semicolons.
`ct:i:`	int	Total number of reads supporting the isoform.
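
To show how these tags might be consumed downstream, here is a minimal sketch that extracts them from one SAM-formatted record, such as a line printed by `samtools view annotated_flnc.bam`. It relies only on the SAM convention that optional fields follow the 11 mandatory columns in `TAG:TYPE:VALUE` form. The function name `parse_custom_tags` and the example record are hypothetical.

```python
CUSTOM_TAGS = ("in", "sc", "gn", "tn", "sb", "ct")


def parse_custom_tags(sam_line):
    """Return the custom annotation tags found on one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Optional fields start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if tag not in CUSTOM_TAGS:
            continue
        tags[tag] = int(value) if value_type == "i" else value
    # The subcategory may hold several semicolon-separated values.
    if "sb" in tags:
        tags["sb"] = tags["sb"].split(";")
    return tags


# Made-up record for illustration only.
record = "\t".join([
    "read1", "0", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII",
    "in:Z:PB.1.1", "sc:Z:full-splice_match", "gn:Z:GENE_A", "tn:Z:TX_A1",
    "sb:Z:multi-exon", "ct:i:2",
])
tags = parse_custom_tags(record)
print(tags["sc"], tags["ct"])
```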

Annotating FLNC Reads by Isoform Class

Annotate FLNC reads

FLNC_ImportTags.py \
  --input_flnc aligned_flnc.bam \
  --output_flnc annotated_flnc.bam \
  --read_stat read_stat.txt \
  --classification filtered_classification.txt \
  --index

Arguments:

--input_flnc: input BAM file containing the aligned FLNC reads to annotate.
--output_flnc: output BAM file to which the annotated reads are written.
--read_stat: file from the collapsing step with read-to-isoform mappings (read_stat).
--classification: classification file from the filtering step.
--index: flag to index the output BAM file. Requires the reads to be coordinate-sorted.
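
Because indexing depends on sort order, it can be worth checking the input before passing `--index`. The sketch below inspects BAM header text, for example the output of `samtools view -H aligned_flnc.bam`, for the `SO:coordinate` field on the `@HD` line. The helper name `is_coordinate_sorted` is hypothetical, and the header only declares the sort order; it does not prove that the records actually follow it.

```python
def is_coordinate_sorted(sam_header):
    """Return True if the @HD header line declares SO:coordinate."""
    for line in sam_header.splitlines():
        if line.startswith("@HD"):
            return "SO:coordinate" in line.split("\t")[1:]
    # No @HD line means the sort order is undeclared.
    return False


# Made-up header text for illustration only.
sorted_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
unsorted_header = "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000"
print(is_coordinate_sorted(sorted_header), is_coordinate_sorted(unsorted_header))
```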

Implementation

The annotation step is implemented using a custom Python script maintained in-house.

Source Code

All the relevant code can be accessed in the GitHub repository:

FLNC_ImportTags.py
