High-Performance Computing at the NIH
TopHat on Biowulf & Helix

Description

TopHat is a splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

TopHat uses either bowtie2 (the default) or bowtie (optional; required for colorspace reads, since bowtie2 does not support colorspace). TopHat is limited to a maximum read length of 1024 nt.
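
The bowtie (1) pipeline is selected with the --bowtie1 flag; colorspace data additionally requires the --color option and a colorspace bowtie index (see the TopHat documentation). A minimal sketch of a --bowtie1 run, reusing the iGenomes index path pattern from the Index files section below:

helix$ # sketch only: substitute your own organism/build and input file
helix$ tophat --bowtie1 \
    --output-dir=./tophat_bt1 \
    --num-threads=2 \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/genome \
    fastq/rnaseq_500k.fastq.gz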

TopHat2 now includes TopHat-Fusion's ability to look for fusions between different transcripts (--fusion-search).
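
A minimal sketch of a fusion-search run (the paired read files are placeholders; further TopHat-Fusion options such as --fusion-min-dist are described in the TopHat-Fusion documentation):

helix$ # sketch only: paired-end read files below are placeholders
helix$ tophat --fusion-search \
    --output-dir=./tophat_fusion \
    --num-threads=2 \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    sample_1.fastq.gz sample_2.fastq.gz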

Index files

TopHat needs either bowtie2 or bowtie indices, which are available as part of the iGenomes package at

/fdb/igenomes/[organism]/[source]/[build]/Sequence/Bowtie[2]Index/*

More information on the locally available iGenomes builds/organisms is available from our scientific database index. For general information about iGenomes, see the iGenomes readme.
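
The index argument passed to tophat is the index prefix, i.e. the path up to but not including the index file extensions. For example, for the mm9 bowtie2 index used in the sessions below (reads.fastq.gz is a placeholder):

helix$ ls /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/
helix$ tophat [options] /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome reads.fastq.gz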

Running tophat on Helix

The module that sets up the environment for running TopHat also places the bowtie2 and samtools executables on the path:

helix$ module load tophat
helix$ module list
 Currently Loaded Modules:
   1) bowtie/2-2.2.3   2) boost/1.56   3) samtools/1.2   4) tophat/2.0.13

Note that TopHat is a multithreaded application. The number of threads is set with -p/--num-threads and should not exceed 2 for interactive runs on helix. For larger numbers of threads use compute nodes (see below).

Example session of a simple single-end alignment without searching for novel splice junctions:

helix$ cd /data/$USER/test_data
helix$ tophat \
    --output-dir=./tophat_test \
    --min-anchor-length=10 \
    --num-threads=2 \
    --b2-sensitive \
    --no-novel-juncs \
    --GTF=annot/140609_refseq_nomir_nosnor.gtf \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_500k.fastq.gz
[2015-04-30 14:18:33] Beginning TopHat run (v2.0.13)
-----------------------------------------------
[2015-04-30 14:18:33] Checking for Bowtie
Bowtie version:        2.2.3.0
[2015-04-30 14:18:33] Checking for Bowtie index files (genome)..
[2015-04-30 14:18:33] Checking for reference FASTA file
[2015-04-30 14:18:33] Generating SAM header for /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome
[2015-04-30 14:18:59] Reading known junctions from GTF file
[2015-04-30 14:19:01] Preparing reads
left reads: min. length=50, max. length=50, 499885 kept reads (115 discarded)
[2015-04-30 14:19:08] Building transcriptome data files ./tophat_test/tmp/140609_refseq_nomir_nosnor
[2015-04-30 14:19:36] Building Bowtie index from 140609_refseq_nomir_nosnor.fa
.....
-----------------------------------------------
[2015-04-30 14:31:12] A summary of the alignment counts can be found in ./tophat_test/align_summary.txt
[2015-04-30 14:31:12] Run complete: 00:12:38 elapsed
helix$ ll tophat_test
total 45M
-rw-rw-r-- 1 user user  41M Apr 30 14:31 accepted_hits.bam
-rw-rw-r-- 1 user user  200 Apr 30 14:31 align_summary.txt
-rw-rw-r-- 1 user user  37K Apr 30 14:31 deletions.bed
-rw-rw-r-- 1 user user  27K Apr 30 14:31 insertions.bed
-rw-rw-r-- 1 user user 2.3M Apr 30 14:31 junctions.bed
drwxrwxr-x 2 user user 4.0K Apr 30 14:31 logs
-rw-rw-r-- 1 user user   66 Apr 30 14:19 prep_reads.info
-rw-rw-r-- 1 user user 835K Apr 30 14:31 unmapped.bam

Running a single tophat batch job on Biowulf

Set up a script file to submit to the batch queue. For example, the following script is similar to the alignment done above, except that it (a) looks for novel splice junctions and (b) uses all the cores requested when the job is submitted to the batch queue:

#! /bin/bash
# filename: tophat_batch.sh
set -e

module load tophat
cd /data/$USER/test_data
tophat \
    --output-dir=./tophat_test \
    --min-anchor-length=10 \
    --num-threads=$SLURM_CPUS_PER_TASK \
    --b2-sensitive \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_500k.fastq.gz

This script can then be submitted to the queue with

biowulf$ sbatch --mem=20g --cpus-per-task=12 tophat_batch.sh
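
After submission, the job can be monitored with standard Slurm commands, for example:

biowulf$ squeue -u $USER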

Running a swarm of tophat batch jobs on Biowulf

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file containing one command per line (line continuations are allowed); swarm will submit one job for each command as part of a job array. Here is a sample. Note that by default tophat creates an output directory called tophat_out. Therefore, to run multiple jobs in parallel, either run each job in a separate directory or specify -o/--output-dir on each command line:

tophat -o job1 -p $SLURM_CPUS_PER_TASK --b2-sensitive \
    --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_500k.fastq.gz
tophat -o job2 -p $SLURM_CPUS_PER_TASK --b2-sensitive \
    --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_250k.fastq.gz

The jobs can then be submitted with the swarm utility specifying 16 cores and 10GB of memory for each job:

biowulf$ swarm -f swarm_cmd_file -g 10 -t 16 --module tophat
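
For larger numbers of input files, the swarm command file can be generated with a short bash loop rather than written by hand. A sketch, assuming one job per .fastq.gz file in the fastq/ directory (output directory naming and options are illustrative); the resulting file is submitted with swarm as shown above:

biowulf$ # sketch only: adjust the glob, tophat options, and index path to your data
biowulf$ for f in fastq/*.fastq.gz; do
    echo "tophat -o $(basename "$f" .fastq.gz)_out -p \$SLURM_CPUS_PER_TASK --b2-sensitive /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome $f"
done > swarm_cmd_file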

Running an interactive job on Biowulf

Interactive work requiring resources should be carried out on interactive compute nodes, not on the head node or helix. Interactive nodes are allocated with sinteractive. For example, to request an 8-core interactive node with 10 GB of memory:

biowulf$ sinteractive -c 8 --mem=10g
alloc.exe: Granted job allocation 17455
slurm stepprolog here!
srun: error: x11: no local DISPLAY defined, skipping
Begin slurm taskprolog!
End slurm taskprolog!
user@node$ cd /data/$USER/test_data
user@node$ module load tophat
user@node$ tophat -o job1 -p8 /path/to/genome /path/to/fastq.gz
...
user@node$ exit
biowulf$

Documentation