High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
STAR on Biowulf and Helix

Description

STAR aligns RNA-Seq data to reference genomes. It is designed to be fast and accurate for known and novel splice junctions. In addition, it has no limit on the read size and can align reads with multiple splice junctions. This is becoming more important as read lengths increase. As a result of the alignment algorithm, poor quality tails and poly-A tails are clipped in the alignment.

Starting with version 2.4.1a, annotations can be included at the mapping stage instead/in addition to the indexing stage.

There are multiple versions of STAR available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail STAR 

To select a module use

module load STAR/[version]

where [version] is the version of choice.

STAR is a multithreaded application. Make sure to match the number of cpus requested with the number of threads ( --runThreadN).

Index files

STAR made a backwards incompatible change to it's index structure with version 2.4.2.

STAR indices can be found in

/fdb/STAR_indices/[STAR VERSION]

For historical reasons there are also three symlinks pointing to the versioned subdirectories in /fdb/STAR_indices:

The structure within each of the versioned directories is similar but not identical. STAR indices occupy a lot of storage space so at some point in the future older indices will have to be removed.

Environment variables set

References

Documentation

On Helix

STAR is not an appropriate application for helix. Please use an sinteractive session for interactive work.

Interactive job on Biowulf

STAR is used to create genome indices as well as to align and map short reads to the indexed genome. A GTF format annotation of transcripts can be provided during indexing or, since version 2.4.1a, on the fly during mapping. As described above, indices with different annotations and overhang lengths indexed with the newest version of STAR are available in /fbd/STAR_current.

A simple example of indexing a small genome with annotation. If annotation is provided, a overhang depending on the readlength to be used has to be provided as well. For example, for 100nt reads:

Allocate an interactive session with sinteractive and use as described below

biowulf$ sinteractive --cpus-per-task=12 --mem=30g
node$ module load STAR
node$ cd /data/$USER/test_data
node$ mkdir -p indices/star100-EF4
node$ GENOME=/fdb/igenomes/Saccharomyces_cerevisiae/Ensembl/EF4
node$ STAR \
    --runThreadN 2 \
    --runMode genomeGenerate \
    --genomeDir indices/star100-EF4 \
    --genomeFastaFiles $GENOME/Sequence/WholeGenomeFasta/genome.fa \
    --sjdbGTFfile $GENOME/Annotation/Genes/genes.gtf \
    --sjdbOverhang 99 \
    --genomeSAindexNbases 11

However, a default value of 100 will work well in many cases.

Align single end 50nt RNA-Seq data to the mouse genome:

node$ cd /data/$USER/test_data
node$ mkdir -p bam/rnaseq_STAR
node$ GENOME=/fdb/STAR_current/UCSC/mm10/genes-50
node$ STAR \
    --runThreadN 12 \
    --genomeDir $GENOME \
    --sjdbOverhang 50 \
    --readFilesIn fastq/rnaseq_500k.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/rnaseq_STAR/test
Batch job on Biowulf

Batch scripts should use the $SLURM_CPUS_PER_TASK environment variable to determine the number of threads to use. This variable is set by slurm according to the number of requested cores. An example script for aligning single end RNA-Seq data with STAR might look like the following:

#! /bin/bash
# this file is STAR.batch
set -o pipefail
set -e

function fail {
    echo "$@" >&2
    exit 1
}

module load samtools/1.2 || fail "could not load samtools module"
module load STAR         || fail "could not load STAR module"
cd /data/$USER/test_data || fail "no such directory"
mkdir -p bam/rnaseq_STAR
GENOME=/fdb/STAR_current/UCSC/mm10/genes-50
STAR \
    --runThreadN $SLURM_CPUS_PER_TASK \
    --genomeDir $GENOME \
    --sjdbOverhang 50 \
    --readFilesIn fastq/rnaseq_500k.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/rnaseq_STAR/test

STAR requires ~30-35GB of memory for a mammalian genome. Submit the job to run on 12 cores with a bit memory to space:

biowulf$ sbatch --cpus-per-task 12 --mem=36g STAR.batch
Swarm of jobs on Biowulf

Set up a command file for two jobs, each aligning single ended reads. Note that swarm allows line continuations.

cd /data/$USER/test_data \
  && mkdir -p bam/rnaseq_STAR \
  && STAR \
    --runThreadN $SLURM_CPUS_PER_TASK \
    --genomeDir /fdb/STAR_current/UCSC/mm10/genes-50 \
    --sjdbOverhang 50 \
    --readFilesIn fastq/rnaseq_500k.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/rnaseq_STAR/test
cd /data/$USER/test_data \
  && mkdir -p bam/rnaseq_STAR2 \
  && STAR \
    --runThreadN $SLURM_CPUS_PER_TASK \
    --genomeDir /fdb/STAR_current/UCSC/mm10/genes-50 \
    --sjdbOverhang 50 \
    --readFilesIn fastq/rnaseq_250k.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/rnaseq_STAR2/test

The command file is submitted with swarm, again making sure to request sufficient memory for STAR.

biowulf$ swarm -f swarm_commands -t 12 -g 36 --module STAR/2.5.2b