The STAR index is generated from the standard Genome Reference Consortium Human Build 38 (GRCh38) reference as distributed by the Broad Institute, as described in the GTEx analysis pipeline.
The STAR index uses the GENCODE comprehensive gene annotations. For more detailed information, please refer to the GENCODE documentation under the “Genome Annotations” section.
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
with open('Homo_sapiens_assembly38.fasta', 'r') as fasta:
    contigs = fasta.read()
contigs = contigs.split('>')
contig_ids = [i.split(' ', 1)[0] for i in contigs]
# exclude ALT, HLA and decoy contigs
filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
    if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
with open('Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta', 'w') as fasta:
    fasta.write(filtered_fasta)
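The script above reads the whole FASTA into memory, which may be inconvenient on machines with limited RAM. As a minimal sketch, assuming the same exclusion rule, the filter can also be written to stream the file line by line. The function names `is_excluded` and `filter_fasta` are hypothetical helpers introduced for illustration and are not part of the GTEx pipeline.

```python
def is_excluded(contig_id):
    """Return True for ALT, HLA and decoy contigs (same rule as the script above)."""
    return (contig_id.endswith('_alt')
            or contig_id.startswith('HLA')
            or contig_id.endswith('_decoy'))


def filter_fasta(in_path, out_path):
    """Copy in_path to out_path line by line, skipping excluded contigs."""
    keep = True
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if line.startswith('>'):
                # contig ID: header text up to the first space, without the newline
                contig_id = line[1:].split(' ', 1)[0].strip()
                keep = not is_excluded(contig_id)
            if keep:
                dst.write(line)
```

Under these assumptions, calling `filter_fasta('Homo_sapiens_assembly38.fasta', 'Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta')` should produce the same output file as the in-memory version while holding only one line at a time.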
samtools faidx Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta
java -jar picard.jar \
    CreateSequenceDictionary \
    R=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
    O=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.dict
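Before building the STAR index, it can be worth confirming that no ALT, HLA or decoy contigs remain in the filtered reference. The following is a minimal sketch, assuming the `.fai` index written by `samtools faidx` sits next to the FASTA. The function name `unexpected_contigs` is a hypothetical helper, not part of the pipeline.

```python
def unexpected_contigs(fai_path):
    """Return contig names in a .fai index that should have been filtered out."""
    bad = []
    with open(fai_path) as fai:
        for line in fai:
            # the first tab-separated column of a .fai record is the contig name
            name = line.rstrip('\n').split('\t', 1)[0]
            if (name.endswith('_alt') or name.startswith('HLA')
                    or name.endswith('_decoy')):
                bad.append(name)
    return bad
```

For a correctly filtered reference, `unexpected_contigs('Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta.fai')` would be expected to return an empty list.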
sentieon STAR \
    --runMode genomeGenerate \
    --genomeDir STARv2710b_assembly38_noALT_noHLA_noDecoy \
    --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
    --sjdbGTFfile gencode.annotation.gtf \
    --sjdbOverhang read_length-1
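In the command above, `read_length-1` is a placeholder rather than a literal value: it stands for the sequencing read length minus one. As a minimal sketch of how the concrete value might be filled in, the hypothetical helper `star_index_args` below assembles the same argument list for a given read length. The read length of 101 in the usage note is an assumed example, not a value taken from this document.

```python
def star_index_args(read_length, genome_dir, fasta, gtf):
    """Build the genomeGenerate argument list with --sjdbOverhang = read_length - 1."""
    if read_length < 2:
        raise ValueError('read_length must be at least 2')
    return [
        'sentieon', 'STAR',
        '--runMode', 'genomeGenerate',
        '--genomeDir', genome_dir,
        '--genomeFastaFiles', fasta,
        '--sjdbGTFfile', gtf,
        # placeholder "read_length-1" resolved to an integer
        '--sjdbOverhang', str(read_length - 1),
    ]
```

With an assumed read length of 101, the last two elements of the returned list would be `--sjdbOverhang` and `100`. The list could then be passed to `subprocess.run` to launch the indexing step.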
The Sentieon implementation replicates the original STAR code. The current reference index was generated using Sentieon version 202308.01, which corresponds to STAR version 2.7.10b.