High-Performance Computing at the NIH
Taco on NIH HPC Systems

TACO: Multi-sample transcriptome assembly from RNA-Seq

Transcriptome assemblers reconstruct full-length transcripts from the short sequence fragments generated by RNA-Seq. Large consortia such as TCGA, ICGC, GTEx, ENCODE, the Cancer Cell Line Encyclopedia (CCLE), and others have performed RNA-Seq on thousands of human tissues and cell lines, providing an unparalleled resource for investigating transcriptional diversity and complexity. Towards this end, we present Transcriptome Assemblies Combined into One (TACO), an algorithm that reconstructs a consensus transcriptome from a collection of individual assemblies. TACO employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. TACO vastly outperforms existing software tools such as Cuffmerge and StringTie merge. Please refer to our manuscript in Nature Methods for further details and the results of our comparison analysis.

TACO also contains an easy-to-use companion tool for comparing meta-assemblies to reference transcriptomes, assessing overlap with the reference as well as protein-coding potential.

Example files are in the /usr/local/apps/taco/test directory.
To test TACO with the example files:

  $ cp -r /usr/local/apps/taco/test /data/$USER
  $ cd /data/$USER/test
  $ sinteractive --mem=5g
  $ module load taco
  $ taco_run --gtf-expr-attr expr taco_test_files.txt 
  

The above command reports errors while removing temporary files. These errors do not affect the TACO run and can be safely ignored:

2017-03-15 14:45:53,987 pid=8879 INFO - Removing temporary files
2017-03-15 14:45:54,034 pid=8879 ERROR - Error removing tmp files path=output/tmp/worker000/.nfs000000000a8f3fef000013ab message=( , OSError(16, 'Device or resource busy'), )
2017-03-15 14:45:54,034 pid=8879 ERROR - Error removing tmp files path=output/tmp/worker000/.nfs000000000a8f3ff0000013ac message=(, OSError(16, 'Device or resource busy'), )
2017-03-15 14:45:54,035 pid=8879 ERROR - Error removing tmp files
......
......
path=output/tmp/worker000 message=( , OSError(39, 'Directory not empty'), )
2017-03-15 14:45:54,037 pid=8879 INFO - Done
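
To confirm that a run produced no errors other than these temporary-file cleanup messages, the log can be filtered. The following is a minimal sketch, assuming the TACO log was saved to a file (for example by appending `> taco.log 2>&1` to the command). The function name `unexpected_taco_errors` and the file name `taco.log` are hypothetical and not part of TACO.

```shell
# Hypothetical helper (not part of TACO): print ERROR lines from a saved
# TACO log, skipping the known "Error removing tmp files" messages.
unexpected_taco_errors() {
    # $1: path to a log file captured from taco_run (assumed to exist)
    grep ' ERROR - ' "$1" | grep -v 'Error removing tmp files' || true
}
```

Running `unexpected_taco_errors taco.log` prints nothing when the only errors in the log are the cleanup messages shown above.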

On Helix

Sample session:


[helix ~]$ module load taco
[helix ~]$ taco_run -h
usage: taco_run [-h] [-o DIR] [-p N] [-v] [--resume] [--assemble BED]
                [--gtf-expr-attr ATTR] [--filter-min-length N]
                [--filter-min-expr X] [--filter-splice-juncs]
                [--ref-genome-fasta REF_GENOME_FASTA_FILE]
                [--isoform-frac FRAC] [--max-isoforms N]
                [--assemble-unstranded] [--no-assemble-unstranded]
                [--change-point] [--no-change-point]
                [--change-point-pvalue ]
                [--change-point-fold-change ]
                [--change-point-trim] [--no-change-point-trim]
                [--path-kmax kmax] [--path-frac X] [--max-paths N]
                [sample_file]

TACO: Multi-sample transcriptome assembly from RNA-Seq

positional arguments:
  sample_file

optional arguments:
  -h, --help            show this help message and exit
  -o DIR, --output-dir DIR
                        directory where output files will be stored (if
                        already exists then --resume must be specified)
                        [default=output]
  -p N, --num-processes N
                        Run TACO in parallel with N processes [default=1]
  -v, --verbose         Enabled detailed logging (for debugging)
  --resume              Resumes an existing run that may have ended
                        prematurely. Specify the location of the run using the
                        -o/--output-dir option.
  --assemble BED        Assemble transfrags produced by a previous TACO run,
                        bypassing the GTF aggregation step. Accepts a taco-
                        formatted BED file.
  --gtf-expr-attr ATTR  GTF attribute field containing expression
                        [default=FPKM]
  --filter-min-length N
                        Filter input transfrags with length < N prior to
                        assembly [default=200]
  --filter-min-expr X   Filter input transfrags with transcripts per milliion
                        (TPM) < X prior to assembly [default=0.5]
  --filter-splice-juncs
                        Filter input transfrags that possess non-canonical
                        splice motifs prior to assembly. Splice motifs are
                        GTAG and GCAG are allowed [default=False]. Requires
                        genome sequence to be specified using --ref-genome-
                        fasta.
  --ref-genome-fasta REF_GENOME_FASTA_FILE
                        Reference genome sequence in FASTA format needed to
                        assess splice junction motif sequences. Use in
                        conjunction with --filter-splice-juncs.
  --isoform-frac FRAC   Report transcript isoforms with expression fraction >=
                        FRAC (0.0-1.0) relative to the major isoform within
                        each gene [default=0.05]
  --max-isoforms N      Maximum isoforms to report for each gene [default=0]
  --assemble-unstranded
                        Enable assembly of unstranded transfrags
                        [default=False]
  --no-assemble-unstranded
                        Disable assembly of unstranded transfrags
  --change-point        Enable change point detection [default=True]
  --no-change-point     Disable change point detection

Advanced Options:
  (recommend leaving at their default settings for most purposes)

  --change-point-pvalue 
                        Mann-Whitney-U p-value threshold for calling change
                        points [default=0.01]
  --change-point-fold-change 
                        Fold change threshold between the means of two
                        putative segments on either side of a change point. A
                        value of 0.0 is the most strict setting, effectively
                        calling no change points. Conversely, setting the
                        value to 1.0 calls allchange points [default=0.85]
  --change-point-trim   Trim transfrags around change points [default=True]

  --no-change-point-trim
                        Disable trimming around change points
  --path-kmax kmax      Limit optimization for choosing parameter k for path
                        graph (DeBruijn graph) to k <= kmax [default=0]
  --path-frac X         dynamic programming algorithm will stop finding
                        suboptimal paths when path expression drops below a
                        fraction X (0.0-1.0) of the total locus expression
                        [default=0.0]
  --max-paths N         dynamic programming algorithm will stop after finding
                        N paths [default=0]
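
The help output does not describe the format of sample_file. As a sketch, assuming it is a plain-text file listing one GTF assembly path per line (the example file taco_test_files.txt can be inspected to confirm this), such a file could be built as follows. The function name `make_sample_file` and the directory layout are hypothetical.

```shell
# Hypothetical helper (not part of TACO): list the GTF files found directly
# inside a directory, one path per line, sorted for reproducibility.
# ASSUMPTION: taco_run expects one GTF path per line in its sample file.
make_sample_file() {
    # $1: directory holding per-sample GTF assemblies
    # $2: sample file to write
    find "$1" -maxdepth 1 -type f -name '*.gtf' | sort > "$2"
}
```

For example, `make_sample_file /data/$USER/assemblies my_samples.txt` followed by `taco_run -o my_output my_samples.txt`, where both paths are placeholders.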


Batch job on Biowulf

Create a batch input file (e.g. script.sh). For example:

#!/bin/bash
module load taco

cd /data/$USER/dir
taco command 1
taco command 2
......

Then submit the file on Biowulf:

[biowulf ~]$ sbatch script.sh

For more information about the sbatch command, see https://hpc.nih.gov/docs/userguide.html#submit
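
As a more concrete sketch, the commands below write a batch script that runs the example data with several processes. It assumes the example files were copied to /data/$USER/test as in the test described earlier. The file name `taco_batch.sh`, the output directory `taco_out`, and the resource values (4 CPUs, 5 GB) are illustrative assumptions, not recommended settings. The `-p` and `-o` options are those listed in the `taco_run -h` output.

```shell
# Write an example batch script; the quoted 'EOF' keeps $USER and
# $SLURM_CPUS_PER_TASK unexpanded until the job actually runs.
cat > taco_batch.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=5g

module load taco

cd /data/$USER/test
# Match the number of TACO processes to the CPUs allocated by Slurm.
# taco_out must not already exist unless --resume is given.
taco_run -p "$SLURM_CPUS_PER_TASK" -o taco_out --gtf-expr-attr expr taco_test_files.txt
EOF
```

The resulting file can then be submitted with `sbatch taco_batch.sh`.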

Swarm of Jobs on Biowulf

Create a swarmfile (e.g. script.swarm). For example:

# this file is called script.swarm
cd dir1;taco command 1; taco command 2
cd dir2;taco command 1; taco command 2
cd dir3;taco command 1; taco command 2
[...]

Submit this job using the swarm command.

swarm -f script.swarm --module taco

For more information regarding swarm: https://hpc.nih.gov/apps/swarm.html#usage
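
When there are many directories, the swarmfile can be generated rather than typed by hand. The following is a minimal sketch, assuming each directory holds its own sample file and that directory names contain no spaces. The function name `make_taco_swarm`, the file name `samples.txt`, and the output directory `taco_out` are hypothetical.

```shell
# Hypothetical helper: emit one swarm line per directory argument.
# Each line changes into the directory and runs taco_run on a sample
# file assumed to be named samples.txt.
make_taco_swarm() {
    for d in "$@"; do
        printf 'cd %s; taco_run -o taco_out samples.txt\n' "$d"
    done
}

# Example: build script.swarm for three directories.
make_taco_swarm dir1 dir2 dir3 > script.swarm
```

The generated script.swarm can be submitted with the same `swarm -f script.swarm --module taco` command shown above.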

Interactive job on Biowulf

Allocate an interactive session. Sample session:

[biowulf ~]$ sinteractive --mem=5g
salloc.exe: Pending job allocation 15194042
salloc.exe: job 15194042 queued and waiting for resources
salloc.exe: job 15194042 has been allocated resources
salloc.exe: Granted job allocation 15194042
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1719 are ready for job

[cn1719 ~]$ module load taco

[cn1719 ~]$ taco command 
Documentation