pipelines-docs

STAR Index

The STAR index is generated from the standard Genome Reference Consortium Human Build 38 (GRCh38) reference released by the Broad Institute, as described in the GTEx analysis pipeline.

The STAR index uses GENCODE comprehensive gene annotations. For more detailed information, please refer to the GENCODE documentation under the “Genome Annotations” section.

Downloading and Preparing the Genome Reference

Download the reference genome
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
ALT, HLA, and decoy contigs are excluded from the reference genome FASTA using the following Python code:
with open('Homo_sapiens_assembly38.fasta', 'r') as fasta:
    contigs = fasta.read()
contigs = contigs.split('>')
contig_ids = [i.split(' ', 1)[0] for i in contigs]

# exclude ALT, HLA and decoy contigs
filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
    if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])

with open('Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta', 'w') as fasta:
    fasta.write(filtered_fasta)
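
The script above reads the entire FASTA into memory, which for a human reference means several gigabytes. Where memory is tight, the same exclusion rule could be applied line by line. The following is a minimal sketch, not part of the original pipeline: the function names `is_excluded`, `iter_filtered`, and `filter_fasta` are hypothetical, and the contig ID is taken as the first whitespace-delimited token of the header, which is slightly more permissive than splitting on a single space.

```python
def is_excluded(contig_id):
    """Same rule as the original script: drop ALT, HLA and decoy contigs."""
    return (contig_id.endswith('_alt')
            or contig_id.startswith('HLA')
            or contig_id.endswith('_decoy'))


def iter_filtered(lines):
    """Yield only the FASTA lines that belong to contigs we keep."""
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith('>'):
            # Assumption: the contig ID is the first whitespace-delimited
            # token after '>'.
            fields = line[1:].split()
            contig_id = fields[0] if fields else ''
            keep = not is_excluded(contig_id)
        if keep:
            yield line


def filter_fasta(in_path, out_path):
    """Stream in_path to out_path, skipping excluded contigs."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        dst.writelines(iter_filtered(src))
```

Calling `filter_fasta('Homo_sapiens_assembly38.fasta', 'Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta')` should give the same contigs as the script above for headers that carry a space after the contig name, while holding only one line in memory at a time.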
Generate FASTA indexes
samtools faidx Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta

java -jar picard.jar \
    CreateSequenceDictionary \
    R=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
    O=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.dict
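
As an optional sanity check, the `.fai` file written by `samtools faidx` can be scanned to confirm that no ALT, HLA, or decoy contigs remain. The sketch below relies only on the fact that the first tab-separated column of a `.fai` line is the sequence name. The helper name `excluded_contigs` is hypothetical and the check is not part of the original pipeline.

```python
def excluded_contigs(fai_lines):
    """Return contig names from .fai lines that should have been filtered out."""
    leftovers = []
    for line in fai_lines:
        # Column 1 of a .fai record is the sequence name.
        name = line.split('\t', 1)[0].strip()
        if (name.endswith('_alt')
                or name.startswith('HLA')
                or name.endswith('_decoy')):
            leftovers.append(name)
    return leftovers
```

Passing an open handle on `Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta.fai` to `excluded_contigs` should return an empty list if the filtering step worked as intended.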

Generating STAR Index

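Generate the STAR index for a given read length, setting --sjdbOverhang to the read length minus 1 (for example, 100 for 101 bp reads):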
sentieon STAR \
  --runMode genomeGenerate \
  --genomeDir STARv2710b_assembly38_noALT_noHLA_noDecoy \
  --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
  --sjdbGTFfile gencode.annotation.gtf \
  --sjdbOverhang read_length-1
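
Because `read_length-1` in the command above is a placeholder rather than a literal value, the arithmetic can be wrapped in a small helper when indexes are built for several read lengths. This is a minimal sketch under stated assumptions: the function name `star_index_args` is hypothetical, and the default paths simply repeat the file names used in the command above.

```python
def star_index_args(read_length,
                    genome_dir='STARv2710b_assembly38_noALT_noHLA_noDecoy',
                    fasta='Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta',
                    gtf='gencode.annotation.gtf'):
    """Build the argument list for the index-generation command shown above."""
    if read_length < 2:
        # The overhang is read_length - 1 and must be positive.
        raise ValueError('read_length must be at least 2')
    return [
        'sentieon', 'STAR',
        '--runMode', 'genomeGenerate',
        '--genomeDir', genome_dir,
        '--genomeFastaFiles', fasta,
        '--sjdbGTFfile', gtf,
        '--sjdbOverhang', str(read_length - 1),
    ]
```

The returned list can be handed to `subprocess.run(star_index_args(101), check=True)`, which for 101 bp reads would pass `--sjdbOverhang 100`.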

Implementation with Sentieon

The Sentieon implementation replicates the original STAR code. The current reference index was generated using Sentieon version 202308.01, which corresponds to STAR version 2.7.10b.

