High-Performance Computing at the NIH
RSEM on Biowulf & Helix

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads, and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, it can generate BAM and Wiggle files in both transcript coordinates and genomic coordinates. Genomic-coordinate files can be visualized in both the UCSC Genome Browser and the Broad Institute’s Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized in IGV. RSEM also has its own scripts to generate transcript read depth plots in PDF format. A unique feature of RSEM is that the read depth plots can be stacked, with read depth contributed by unique reads shown in black and read depth contributed by multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

NOTE: RSEM uses quite a bit of temporary directory space. Please make sure to use the "--temporary-folder" flag with "rsem-run-prsem-testing-procedure" and "rsem-calculate-expression" when submitting a batch job. See the example below.


Based on the tutorial https://github.com/bli25ucb/RSEM_tutorial

The following steps have already been completed on our systems. Users can use the index files in the shared area. Some of the steps cannot be run by users because of permission settings.

Biowulf > $ mkdir /fdb/rsem  ## users do not have permission to run the following steps in /fdb/rsem, but can create a directory under personal space such as /data/$USER and replace /fdb/rsem in the steps below with that directory.

Biowulf > $ sinteractive --mem=10g
cn1234 > $ module load rsem
cn1234 > $ cd /fdb/rsem
cn1234 > $ ln -s /fdb/ensembl/pub/release-82/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa  Mus_musculus.GRCm38.dna.toplevel.fa
cn1234 > $ ln -s /fdb/ensembl/pub/release-82/gtf/mus_musculus/Mus_musculus.GRCm38.82.chr.gtf  Mus_musculus.GRCm38.82.chr.gtf

Build reference from genome (Users can use the reference directly without building) :
cn1234 > $ rsem-prepare-reference --bowtie2 --gtf Mus_musculus.GRCm38.82.chr.gtf Mus_musculus.GRCm38.dna.toplevel.fa ref_from_genome/mouse_ref

Build reference from Ensembl transcripts (Users can use the reference directly without building) :
Download mouse_ref_building_from_transcripts.tar.gz from https://www.dropbox.com/s/ie67okalzaw8zzj/mouse_ref_building_from_transcripts.tar.gz?dl=0
cn1234 > $ tar xvfz mouse_ref_building_from_transcripts.tar.gz
This created two files: mouse_ref.fa and mouse_ref_mapping.txt

cn1234 > $ rsem-prepare-reference --transcript-to-gene-map mouse_ref_mapping.txt --bowtie2 mouse_ref.fa ref_from_transcripts/mouse_ref


Running on Helix

$ module load rsem bowtie/2-2.2.6  STAR
$ rsem-calculate-expression -p 4 --paired-end --bowtie2 --estimate-rspd --append-names --output-genome-bam SRR937564_1.fastq SRR937564_2.fastq /fdb/rsem/ref_from_genome/mouse_ref LPS_6h

Running a single batch job on Biowulf

1. Create a script file similar to the lines below.


#!/bin/bash
module load rsem  bowtie/2-2.2.6  STAR
cd /data/$USER/dir
rsem-calculate-expression -p $SLURM_CPUS_PER_TASK --temporary-folder /lscratch/$SLURM_JOBID/tempdir --paired-end --bowtie2 --estimate-rspd --append-names --output-genome-bam SRR937564_1.fastq SRR937564_2.fastq /fdb/rsem/ref_from_genome/mouse_ref LPS_6h

2. Submit the script on Biowulf.

$ sbatch --cpus-per-task=4 --gres=lscratch:100 jobscript

For a multithreaded job, such as one that uses -p in rsem-calculate-expression, use '--cpus-per-task'; $SLURM_CPUS_PER_TASK is then set automatically to the number requested (4 in this case).

rsem-calculate-expression can cause severe file system performance degradation if the temporary directory is not located on a local lscratch disk. To use temporary local space, submit with "--gres=lscratch:XXX", which allocates XXX GB of space for the job. The space can be accessed at "/lscratch/$SLURM_JOBID/directoryNameYouChoose".
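The check described above can be sketched as a small helper for a job script. This is a minimal sketch, not part of RSEM or the Biowulf tooling: the function name pick_rsem_tmp and the LSCRATCH_ROOT variable are hypothetical, and it assumes that the /lscratch/$SLURM_JOBID directory exists only when the job was submitted with "--gres=lscratch:XXX".

```shell
#!/bin/bash
# Sketch: pick a job-local temporary folder for RSEM, or fail loudly.
# LSCRATCH_ROOT is a hypothetical override used for illustration;
# it defaults to /lscratch as described in the text.
pick_rsem_tmp() {
    local root="${LSCRATCH_ROOT:-/lscratch}"
    local jobdir="$root/${SLURM_JOBID:-}"
    # Assumption: the per-job directory is present only when
    # --gres=lscratch:XXX was requested at submission time.
    if [ -n "${SLURM_JOBID:-}" ] && [ -d "$jobdir" ] && [ -w "$jobdir" ]; then
        echo "$jobdir/tempdir"
    else
        echo "No local scratch for this job; resubmit with --gres=lscratch:XXX" >&2
        return 1
    fi
}
```

In a job script, the result could be captured with tmp=$(pick_rsem_tmp) || exit 1 and then passed to --temporary-folder, so that a job submitted without lscratch stops before RSEM starts writing temporary files elsewhere.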

To request more memory (the default is 2 GB per CPU, so 8 GB in this case), use the --mem flag:
$ sbatch --cpus-per-task=4 --mem=10g --gres=lscratch:100 jobscript
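The relationship between the CPU count, the default memory, and the --mem flag can be sketched as a helper that prints (but does not run) a submission command. This is only an illustration: the function name build_sbatch_cmd is hypothetical, the 2 GB per CPU default is taken from the text above, and only flags already shown on this page are used.

```shell
#!/bin/bash
# Sketch: print an sbatch command line for the job script.
# Arguments: CPUs, memory in GB (may be empty), lscratch in GB, script name.
build_sbatch_cmd() {
    local cpus="$1" mem_gb="$2" scratch_gb="$3" script="$4"
    # Default stated in the text: 2 GB per CPU when --mem is not given.
    local default_mem=$(( cpus * 2 ))
    local cmd="sbatch --cpus-per-task=${cpus} --gres=lscratch:${scratch_gb}"
    # Add --mem only when more than the default is requested.
    if [ -n "$mem_gb" ] && [ "$mem_gb" -gt "$default_mem" ]; then
        cmd="$cmd --mem=${mem_gb}g"
    fi
    echo "$cmd $script"
}

build_sbatch_cmd 4 10 100 jobscript   # 10 GB exceeds the 8 GB default
build_sbatch_cmd 4 "" 100 jobscript   # no memory given, default applies
```

Printing the command instead of running it makes it easy to review the request before submitting.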

Running a swarm of jobs on Biowulf

Set up a swarm command file:

  cd /data/$USER/dir1; rsem command1; rsem command2
  cd /data/$USER/dir2; rsem command1; rsem command2
  cd /data/$USER/dir3; rsem command1; rsem command2
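For many samples, a command file like the one above can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the function names swarm_line and write_swarmfile are hypothetical, each sample is assumed to have paired files named PREFIX_1.fastq and PREFIX_2.fastq, and the RSEM options are copied from the batch example on this page. Because the lines use /lscratch, the swarm would also need local scratch allocated for each command.

```shell
#!/bin/bash
# Sketch: print one swarm command line for a paired-end sample.
# Arguments: working directory, sample prefix (hypothetical naming:
# PREFIX_1.fastq and PREFIX_2.fastq inside that directory).
swarm_line() {
    local dir="$1" sample="$2"
    # Single quotes keep $SLURM_* literal so they expand inside the job,
    # not when the swarm file is written.
    printf 'cd %s; rsem-calculate-expression -p $SLURM_CPUS_PER_TASK --temporary-folder /lscratch/$SLURM_JOBID/%s --paired-end --bowtie2 %s_1.fastq %s_2.fastq /fdb/rsem/ref_from_genome/mouse_ref %s\n' \
        "$dir" "$sample" "$sample" "$sample" "$sample"
}

# Write a swarm file from "directory sample" pairs given as arguments.
write_swarmfile() {
    local out="$1"; shift
    : > "$out"                      # start with an empty file
    while [ "$#" -ge 2 ]; do
        swarm_line "$1" "$2" >> "$out"
        shift 2
    done
}
```

For example, write_swarmfile swarmfile /data/$USER/dir1 sampleA /data/$USER/dir2 sampleB would produce a two-line swarm file, one command per sample (sampleA and sampleB are placeholder names).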

Submit the swarm file:

  $ swarm -f swarmfile -t 4 --module rsem,bowtie/2-2.2.6,STAR

-t: number of threads per command
-f: name of the swarm file
--module: modules to load for each command line in the file

To allocate more memory, use the -g flag:

  $ swarm -f swarmfile -g 10 --module rsem

-g: memory in GB allocated to each command

For more information regarding running swarm, see swarm.html

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load rsem
cn999$ cd /data/$USER/dir
cn999$ rsem commands

cn999$ exit


Make sure to exit the job once finished.

If more memory is needed, use the --mem flag. For example:

biowulf$ sinteractive --mem=10g


RSEM user guide