tophat on Biowulf

TopHat is a splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.



References:

Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way.

Documentation
Important Notes

TopHat makes use of either bowtie2 (the default) or bowtie (optional; necessary for colorspace reads as bowtie2 does not support colorspace). TopHat is limited to a maximal read length of 1024nts.

TopHat2 now includes TopHat-Fusion's ability to look for fusions between different transcripts (--fusion-search).

Index files

TopHat will need either bowtie2 or bowtie indices which are available as part of the igenomes package at

/fdb/igenomes/[organism]/[source]/[build]/Sequence/Bowtie[2]Index/*

More information on the locally available igenomes builds/organisms is available from our scientific database index. For more information about igenomes in general, iGenomes readme.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input in bold):

[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=10g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ cd /data/$USER/test_data
[user@cn3144 ~]$ module load tophat
[user@cn3144 ~]$ tophat -o job1 -p8 /path/to/genome /path/to/fastq.gz
...
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. tophat.sh). For example:

#! /bin/bash
module load tophat
cd /data/${USER}/test_data
tophat \
  --output-dir=./tophat_test \
  --min-anchor-length=10 \
  --num-threads=$SLURM_CPUS_PER_TASK \
  --b2-sensitive \
  /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
  fastq/rnaseq_500k.fastq.gz

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=20g tophat.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Note that tophat by default creates a ouput directory called tophat_out. Therefore, to run multiple jobs in parallel, the jobs either have to be run in separate directories or the command line has to specify the -o / --output_dir for each job.

Create a swarmfile (e.g. tophat.swarm). For example:

tophat -o job1 -p ${SLURM_CPUS_PER_TASK} --b2-sensitive \
  --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \    
  /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
  fastq/rnaseq_500k.fastq.gz
tophat -o job2 -p ${SLURM_CPUS_PER_TASK} --b2-sensitive \
  --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \    
  /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
  fastq/rnaseq_250k.fastq.gz

Submit this job using the swarm command.

swarm -f tophat.swarm -g 10 -t 16 --module tophat
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module tophat Loads the tophat module for each subjob in the swarm