trinotate on Biowulf

Trinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes from model or non-model organisms. Trinotate combines several well-referenced methods for functional annotation, including homology search against known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and comparison to curated annotation databases (eggNOG/GO/KEGG). All functional annotation data derived from the analysis of transcripts are integrated into a SQLite database, which allows fast, efficient searching for terms related to a specific scientific hypothesis, as well as generation of a whole-transcriptome annotation report.
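The annotation report that Trinotate generates from this database is plain tab-delimited text, so it can be filtered with standard command-line tools. A minimal sketch, using a two-row mock report in place of real `Trinotate Trinotate.sqlite report` output (the column layout here is simplified for illustration; real reports have many more columns):

```shell
# The report generated by `Trinotate Trinotate.sqlite report` is
# tab-delimited, with "." marking empty fields.  Mock up a tiny report
# here purely for illustration:
printf 'gene_id\ttranscript_id\tPfam\ncomp1_c0\tcomp1_c0_seq1\tPF00069.26^Pkinase\ncomp2_c0\tcomp2_c0_seq1\t.\n' > mock_report.tsv

# Keep the header plus transcripts that received a Pfam domain annotation:
awk -F'\t' '$3 != "."' mock_report.tsv
```

On a real report, replace `$3` with the index of the column of interest, as listed in the report's header row.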

Documentation
Important Notes

Users MUST allocate lscratch to run Trinotate. This is because its dependency, signalp, requires a temporary directory that defaults to lscratch.

To use the Trinotate Webserver, make sure to set up ssh tunneling. See interactive session below.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

The example below runs the runMe.Biowulf.sh script created specifically for Biowulf. Executing this script first extracts the reference dataset from $TRINOTATE_DATA_TAR, then runs the computation, generates reports, and sets up data for the Trinotate WebServer.

Allocate an interactive session with lscratch and the --tunnel option.

Sample session (user input in bold):

[user@biowulf]$ sinteractive --mem=10g --gres=lscratch:100 --tunnel
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

Created 1 generic SSH tunnel(s) from this compute node to 
biowulf for your use at port numbers defined 
in the $PORTn ($PORT1, ...) environment variables.


Please create a SSH tunnel from your workstation to these ports on biowulf.
On Linux/MacOS, open a terminal and run:

    ssh  -L 45000:localhost:45000 biowulf.nih.gov

For Windows instructions, see https://hpc.nih.gov/docs/tunneling

Use the instructions above to set up the SSH tunnel in a separate terminal. See ssh tunneling for more information.

[user@cn3144 ~]$ module load trinotate

[user@cn3144 ~]$ cd /lscratch/$SLURM_JOBID

[user@cn3144 46116226]$ cp -r $TRINOTATE_TEST_DATA . 

[user@cn3144 46116226]$ cd test_data

[user@cn3144 test_data]$ ./runMe.Biowulf.sh

[user@cn3144 test_data]$ run_TrinotateWebserver.pl

Copy the URL provided by the last command above and paste it into your browser. Once you are done examining your data, press Ctrl+C in the interactive session to terminate the Trinotate Webserver.

[user@cn3144 test_data]$ exit
salloc.exe: Relinquishing job allocation 46116226

[user@biowulf ~]$

Trinotate also produces spreadsheets and html files. The easiest way to view them is to use hpcdrive to mount your Biowulf /home or /data area onto your desktop, then click on the file. For the test job above, since the output is in /lscratch/$SLURM_JOBID (temporary local disk on the node), you should copy the desired files back to your /data area before exiting the session.
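A minimal sketch of copying the reports out of lscratch before the session ends (the SRC and DEST values are illustrative defaults; adjust them to your own job and /data directory):

```shell
# Copy Trinotate outputs from node-local lscratch (wiped when the job ends)
# to permanent storage.  SRC and DEST below are illustrative defaults.
SRC="${SRC:-/lscratch/$SLURM_JOBID/test_data}"
DEST="${DEST:-/data/$USER/trinotate_results}"
mkdir -p "$DEST" && cp "$SRC"/*.xls "$SRC"/*.html "$DEST"/ \
    || echo "copy failed: check that $SRC exists and $DEST is writable" >&2
```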

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. trinotate.sh). For example:

#!/bin/bash
set -e
module load trinotate
# Initialize the SQLite database (example file names from a Trinity/TransDecoder run)
Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map \
    --transcript_fasta Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder.pep
# Load precomputed search results and generate the annotation report, e.g.:
Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
Trinotate Trinotate.sqlite report > trinotate_annotation_report.xls

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=4 --mem=20g --gres=lscratch:25 trinotate.sh
Note: these are suggested values for cpus-per-task and mem. Based on your initial runs, you may need to increase or decrease them.