Trinity on Biowulf & Helix
Description

Trinity represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so: Inchworm assembles the reads into the unique sequences of transcripts, Chrysalis clusters related contigs and builds a de Bruijn graph for each cluster, and Butterfly then traces the individual graphs in parallel to report the final transcripts.

Trinity was developed at the Broad Institute & the Hebrew University of Jerusalem.

In addition to these core functions, Trinity also includes scripts to do in silico normalization, transcript quantitation, differential expression, and other downstream analyses.
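
These helper scripts ship with the Trinity installation under $TRINITY_ROOT/util (see the interactive example at the bottom of this page).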

There may be multiple versions of Trinity available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail trinity 

To select a module use

module load trinity/[version]

where [version] is the version of choice.

Trinity is a multithreaded application and can also submit subjobs to the batch system from the main job. Make sure the number of CPUs requested matches the number of threads used by the main job and its subjobs.
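
For example, a job allocated 12 CPUs should start Trinity with a matching thread count; the values below simply mirror the single-node batch example further down this page:

biowulf$ sbatch --cpus-per-task=12 --mem=30g trinity.sh ...
# and, inside trinity.sh
Trinity ... --max_memory 28G --CPU 12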

Environment variables set
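
At least $TRINITY_ROOT, the Trinity installation directory, is set by the trinity module; it is used below to locate the bundled sample data. Use module show trinity to see everything the module sets. For example:

biowulf$ module load trinity
biowulf$ echo $TRINITY_ROOT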

Dependencies

Bowtie is loaded automatically by the trinity module. RSEM, eXpress, and R need to be loaded manually by the user when needed for the transcript quantitation or differential expression scripts.
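
For example, to prepare for the downstream quantitation and differential expression scripts (the module names here are the same ones used in the interactive example below):

node$ module load trinity R rsem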

References

Documentation

Trinotate, a comprehensive suite for automatic functional annotation of transcriptomes (particularly de novo assembled transcriptomes from model or non-model organisms), is also available.

On Helix

Trinity is not suited for Helix. Please submit a batch job or use an sinteractive session instead.

Batch job on Biowulf

A Trinity run can be divided into two phases: (I) a phase that uses more memory but is not easily parallelized, and (II) a phase that uses less memory but can be parallelized. Below is a trace of the memory usage and the number of active threads for a genome-guided Trinity assembly of ~15M normalized paired-end human reads running on a single node with 10 allocated CPUs:

Trinity memory/thread profile

Trinity can either be run entirely within a single node, as in the graphs above, or it can be configured to start subjobs to parallelize Phase II of the assembly.

To run Trinity on a single node, create a batch script similar to the following example. Note that Trinity creates a lot of temporary data, so the use of lscratch is recommended.

#! /bin/bash
# this file is trinity.sh
function die() {
    echo "$@" >&2
    exit 1
}
module load trinity || die "Could not load trinity module"
[[ -d /lscratch/$SLURM_JOB_ID ]] || die "no lscratch allocated"

# the input bam file is passed as the first argument
inbam=$1
[[ -n "$inbam" ]] || die "usage: trinity.sh input.bam"
mkdir /lscratch/$SLURM_JOB_ID/in
mkdir /lscratch/$SLURM_JOB_ID/out
# stage the input into lscratch; Trinity creates temporary files alongside the input
cp $inbam /lscratch/$SLURM_JOB_ID/in
bam=/lscratch/$SLURM_JOB_ID/in/$(basename $inbam)
out=/lscratch/$SLURM_JOB_ID/out

# genome-guided assembly; --CPU and --max_memory should be in line with
# the CPUs and memory requested in the sbatch command below
Trinity --genome_guided_bam $bam \
    --SS_lib_type RF \
    --output  $out \
    --genome_guided_max_intron 10000 \
    --max_memory 28G \
    --CPU 12
# copy the final assembly back to shared storage before lscratch is cleaned up
mv $out/Trinity-GG.fasta /data/$USER/trinity_out/$(basename $inbam .bam)-Trinity-GG.fasta

Submit to the queue with sbatch:

biowulf$ sbatch --mem=30g --cpus-per-task=12 --gres=lscratch:150 trinity.sh /data/$USER/trinity_in/sample.bam
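
Note that --cpus-per-task matches Trinity's --CPU, and that the 30 GB memory allocation is slightly larger than Trinity's --max_memory of 28G, which leaves a little headroom for the job.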

To distribute the Phase II subjobs as separate Slurm batch jobs, the Trinity command has to be modified, a config file has to be provided, and the output directory has to be on a file system that is accessible from all jobs (i.e. your /data directory). lscratch should still be used for the input file, as Trinity creates a number of temporary files in the same directory. The modified script would be similar to this:

#! /bin/bash
# this file is trinity.sh
function die() {
    echo "$@" >&2
    exit 1
}
module load trinity || die "Could not load trinity module"
[[ -d /lscratch/$SLURM_JOB_ID ]] || die "no lscratch allocated"

# the input bam file is passed as the first argument
inbam=$1
mkdir /lscratch/$SLURM_JOB_ID/in
# stage the input into lscratch; Trinity creates temporary files alongside the input
cp $inbam /lscratch/$SLURM_JOB_ID/in
bam=/lscratch/$SLURM_JOB_ID/in/$(basename $inbam)
# the output directory has to be on a shared file system so that the subjobs can access it
out=/data/$USER/trinity_out/somedir

Trinity --genome_guided_bam $bam \
    --SS_lib_type RF \
    --output  $out \
    --genome_guided_max_intron 10000 \
    --max_memory 28G \
    --CPU 12 \
    --grid_conf trinity_grid.conf \
    --grid_node_CPU 2 \
    --grid_node_max_memory 4G

where the file 'trinity_grid.conf' looks like this:

# grid type: 
grid=SLURM

# template for a grid submission
cmd=sbatch --mem=4000 --time=03:00:00 --cpus-per-task=2

# note -e error.file -o out.file are set internally, so don't set them in the above cmd.

##########################################################################################
# settings below configure the Trinity job submission system, not tied to the grid itself.
##########################################################################################

# number of grid submissions to be maintained at steady state by the Trinity submission system 
max_nodes=100

# number of commands that are batched into a single grid submission job.
cmds_per_node=1000
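
With the settings above, Trinity keeps at most 100 subjobs queued or running at any one time, and each of those subjobs bundles up to 1000 of the Phase II commands into a single Slurm job.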

The job is then submitted with

biowulf$ sbatch --mem=30g --cpus-per-task=2 --gres=lscratch:100 trinity.sh /data/$USER/trinity_in/sample.bam

Interactive job on Biowulf

Allocate an interactive session with sinteractive and use it as follows:

biowulf$ sinteractive --mem=5g --cpus-per-task=4
# need to load R and rsem for differential gene expression analysis
# also need to load the older samtools as it appears that trinity
# uses the older samtools sort syntax
node$ module load trinity R rsem samtools
node$ cd /data/$USER/test_data/trinity
node$ cp -pr $TRINITY_ROOT/sample_data .
node$ cd sample_data/test_full_edgeR_pipeline
node$ ./runMe.sh
#!/bin/sh -ve

# run the pipeline
/usr/local/apps/trinity/trinityrnaseq_r20140717/util/run_Trinity_edgeR_pipeline.pl  --samples_file `pwd`/samples_n_reads_decribed.txt 
pwd


#################################################################
Uncompressing rnaseq_reads/Sp_ds.10k.right.fq.gz
#################################################################
CMD: gunzip -c rnaseq_reads/Sp_ds.10k.right.fq.gz > rnaseq_reads/Sp_ds.10k.right.fq
TIME: 0.0 min. for gunzip -c rnaseq_reads/Sp_ds.10k.right.fq.gz > rnaseq_reads/Sp_ds.10k.right.fq
[...]
-------------------------------------------
----------- Jellyfish  --------------------
-- (building a k-mer catalog from reads) --
-------------------------------------------

Tuesday, August 12, 2014: 11:30:38
CMD: /spin1/sys2/usrlocal/apps/trinity/trinityrnaseq_r20140717/trinity-plugins/jellyfish/bin/jellyfish count -t 4 -m 25 -s 152186397  both.fa

CMD finished (17 seconds)
[...etc...]
node$ exit
biowulf$