The RSEM Reference is generated from the standard Genome Reference Consortium Human Build 38 (GRCh38) assembly released by the Broad Institute, as described in the GTEx analysis pipeline.
The RSEM Reference uses the GENCODE comprehensive gene annotations. For more detail, refer to the GENCODE documentation under the “Genome Annotations” section.
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
with open('Homo_sapiens_assembly38.fasta', 'r') as fasta:
    contigs = fasta.read()
contigs = contigs.split('>')
contig_ids = [i.split(' ', 1)[0] for i in contigs]
# exclude ALT, HLA and decoy contigs
filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
    if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
with open('Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta', 'w') as fasta:
    fasta.write(filtered_fasta)
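The script above reads the entire FASTA into memory before filtering. Where memory is a concern, the same exclusion rule can be applied one line at a time. The following is a minimal sketch under stated assumptions, not the code used to build the reference: `keep_contig` and `filter_fasta` are hypothetical helper names, the contig ID is taken as the header text up to the first whitespace (the script above splits on a space only), and the demo records are made up.

```python
import io

def keep_contig(contig_id):
    """Return False for ALT, HLA and decoy contigs (same rule as the script above)."""
    return not (contig_id.endswith('_alt')
                or contig_id.startswith('HLA')
                or contig_id.endswith('_decoy'))

def filter_fasta(src, dst):
    """Copy FASTA records from src to dst, skipping excluded contigs."""
    keep = False
    for line in src:
        if line.startswith('>'):
            # Header line: decide whether the record that follows is kept
            fields = line[1:].split(None, 1)
            contig_id = fields[0] if fields else ''
            keep = keep_contig(contig_id)
        if keep:
            dst.write(line)

# Small in-memory demonstration with made-up records
demo = io.StringIO(
    ">chr1  AC:X\nACGT\n"
    ">chr1_KI270762v1_alt  AC:Y\nGGGG\n"
    ">HLA-A*01:01\tHLA00001\nTTTT\n"
    ">chrUn_JTFH01000001v1_decoy\nCCCC\n"
    ">chrM\nAAAA\n"
)
out = io.StringIO()
filter_fasta(demo, out)
print(out.getvalue())
```

For real files, open the input and output with `open()` and pass the two handles to `filter_fasta` in place of the `StringIO` objects.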
samtools faidx Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta
java -jar picard.jar \
    CreateSequenceDictionary \
    R=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
    O=Homo_sapiens_assembly38_noALT_noHLA_noDecoy.dict
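The `.fai` index and the `.dict` sequence dictionary describe the same filtered FASTA, so they should list the same contigs with the same lengths. A quick consistency check could look like the sketch below. It assumes the usual layouts: tab-separated `.fai` lines with name and length in the first two columns, and `@SQ` header lines carrying `SN:` and `LN:` tags in the dictionary. `read_fai` and `read_dict` are hypothetical helper names, and the demo lines (offsets, checksum, path) are illustrative only.

```python
def read_fai(lines):
    """Map contig name -> length from samtools faidx index lines."""
    lengths = {}
    for line in lines:
        if line.strip():
            name, length = line.rstrip('\n').split('\t')[:2]
            lengths[name] = int(length)
    return lengths

def read_dict(lines):
    """Map contig name -> length from the @SQ lines of a sequence dictionary."""
    lengths = {}
    for line in lines:
        if line.startswith('@SQ'):
            # Each field after '@SQ' is a TAG:value pair; split on the first colon only
            tags = dict(f.split(':', 1) for f in line.rstrip('\n').split('\t')[1:])
            lengths[tags['SN']] = int(tags['LN'])
    return lengths

# Illustrative lines standing in for the real .fai and .dict files
fai_demo = [
    "chr1\t248956422\t112\t100\t101\n",
    "chrM\t16569\t3000000000\t100\t101\n",
]
dict_demo = [
    "@HD\tVN:1.6\n",
    "@SQ\tSN:chr1\tLN:248956422\tM5:abc\tUR:file:/x.fasta\n",
    "@SQ\tSN:chrM\tLN:16569\n",
]
consistent = read_fai(fai_demo) == read_dict(dict_demo)
print(consistent)
```

With the real files, pass open file handles (which iterate line by line) instead of the demo lists.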
rsem-prepare-reference \
    --gtf gencode.annotation.gtf \
    Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
    rsem_reference
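Because transcript sequences are extracted from the FASTA using coordinates in the GTF, annotations on contigs that were removed from the FASTA cannot be resolved. Before building the reference, it may be worth checking that every sequence name in the GTF is present in the FASTA index. The sketch below assumes a standard tab-separated GTF (sequence name in column 1, `#` comment lines) and a samtools `.fai` index; `gtf_contigs` and `missing_contigs` are hypothetical helper names, and the demo records are made up.

```python
def gtf_contigs(lines):
    """Collect the sequence names (column 1) used in a GTF, skipping comments."""
    names = set()
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue
        names.add(line.split('\t', 1)[0])
    return names

def missing_contigs(gtf_lines, fai_lines):
    """Return GTF sequence names that do not appear in the FASTA index."""
    fasta_names = {l.split('\t', 1)[0] for l in fai_lines if l.strip()}
    return sorted(gtf_contigs(gtf_lines) - fasta_names)

# Made-up records: chrZ is annotated but absent from the index
gtf_demo = [
    "##description: demo\n",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";\n',
    'chrZ\tdemo\tgene\t1\t10\t.\t+\t.\tgene_id "demo";\n',
]
fai_demo = ["chr1\t248956422\t112\t100\t101\n"]
missing = missing_contigs(gtf_demo, fai_demo)
print(missing)
```

An empty list means every annotated contig is available in the FASTA; any names reported are worth reviewing before running `rsem-prepare-reference`.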
The current reference was generated using RSEM v1.3.3.