TopHat is a splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way.
TopHat makes use of either bowtie2 (the default) or bowtie (optional; necessary for colorspace reads, since bowtie2 does not support colorspace). TopHat is limited to a maximum read length of 1024 nt. TopHat2 also includes TopHat-Fusion's ability to look for fusions between different transcripts (--fusion-search).
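As a brief illustration of these options (the read file names and index paths below are placeholders, not files that ship with the module):

# fusion-aware paired-end run using the default bowtie2 back end
tophat --fusion-search -o fusion_out -p 8 \
    /path/to/Bowtie2Index/genome sample_R1.fastq.gz sample_R2.fastq.gz
# force the bowtie (1) back end instead, e.g. for colorspace data (add --color for colorspace reads)
tophat --bowtie1 -o bowtie1_out -p 8 /path/to/BowtieIndex/genome sample.fastq.gz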
TopHat will need either bowtie2 or bowtie indices, which are available as part of the igenomes package at

/fdb/igenomes/[organism]/[source]/[build]/Sequence/Bowtie[2]Index/*

[organism] is the specific organism of interest (Gallus_gallus, Rattus_norvegicus, etc.)
[source] is the source for the sequence (NCBI, Ensembl, UCSC)
[build] is the specific genome build of interest (hg19, build37.2, GRCh37)

More information on the locally available igenomes builds/organisms is available from our scientific database index. For more information about igenomes in general, see the iGenomes readme.
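For example, to see the Bowtie2 index files installed for one of the builds (mm9 here, as used in the batch example below; other organisms and builds follow the same pattern):

ls /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/

TopHat takes the index prefix, i.e. the path up to genome without the .bt2 extensions, as shown in the examples below.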
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=10g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ cd /data/$USER/test_data
[user@cn3144 ~]$ module load tophat
[user@cn3144 ~]$ tophat -o job1 -p8 /path/to/genome /path/to/fastq.gz
...

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. tophat.sh). For example:
#!/bin/bash
module load tophat
cd /data/${USER}/test_data
tophat \
    --output-dir=./tophat_test \
    --min-anchor-length=10 \
    --num-threads=$SLURM_CPUS_PER_TASK \
    --b2-sensitive \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_500k.fastq.gz
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=20g tophat.sh
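Once the job finishes, the results are written to the directory given by --output-dir (./tophat_test in the script above). The file names in the comments below are the standard TopHat output names:

ls ./tophat_test
# accepted_hits.bam             - aligned reads
# junctions.bed                 - splice junctions identified
# insertions.bed, deletions.bed - small insertions/deletions
# align_summary.txt             - overall mapping statistics
cat ./tophat_test/align_summary.txt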
Note that tophat by default creates an output directory called tophat_out. Therefore, to run multiple jobs in parallel, either run each job in a separate directory or specify a distinct -o / --output-dir on the command line for each job, as the swarm example below does.
Create a swarmfile (e.g. tophat.swarm). For example:
tophat -o job1 -p ${SLURM_CPUS_PER_TASK} --b2-sensitive \
    --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_500k.fastq.gz
tophat -o job2 -p ${SLURM_CPUS_PER_TASK} --b2-sensitive \
    --GTF=annot/140609_refseq_nomir_nosnor.gtf --no-novel-juncs \
    /fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome \
    fastq/rnaseq_250k.fastq.gz
Submit this job using the swarm command.
swarm -f tophat.swarm -g 10 -t 16 --module tophat
where
-g #            | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #            | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module tophat | Loads the tophat module for each subjob in the swarm