Biowulf High Performance Computing at the NIH
talon on Biowulf
TALON is a Python package for identifying and quantifying known and novel genes and isoforms in long-read transcriptome datasets. TALON is technology agnostic in that it works from mapped SAM files, allowing data from different sequencing platforms (e.g. PacBio and Oxford Nanopore) to be analyzed side by side.

References:

  • Dana Wyman et al., A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run through the steps below using two replicates of human cardiac atrium tissue sequenced on a PacBio Sequel II:

[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=16G --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load talon
[user@cn3144]$ cp -L ${TALON_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 4.6G
-rw-r--r-- 1 user group 172M Apr 20 14:31 ENCFF291EKY.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF291EKY.bam.bai
-rw-r--r-- 1 user group 189M Apr 20 14:31 ENCFF613SDS.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF613SDS.bam.bai
-rw-r--r-- 1 user group 1.3G Apr 20 14:31 gencode.v35.primary_assembly.annotation.gtf
-rw-r--r-- 1 user group 3.0G Apr 20 14:31 GRCh38.primary_assembly.genome.fa
-rw-r--r-- 1 user group 6.4K Apr 20 14:31 GRCh38.primary_assembly.genome.fa.fai
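The `-L` option makes `cp` follow symlinks, so the staged copies in lscratch are regular files rather than links back to the shared filesystem. A quick illustration of the behavior, using throwaway files rather than the TALON test data:

```shell
cd /tmp && mkdir -p cpdemo && cd cpdemo
echo "payload" > real.txt
ln -sf real.txt link.txt        # link.txt is a symlink to real.txt
mkdir -p out
cp -L link.txt out/             # -L copies the link's target, not the link
if [ -f out/link.txt ] && [ ! -L out/link.txt ]; then
    echo "dereferenced copy"
fi
```

Without `-L`, the copy would be another symlink pointing back at the original location.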
[user@cn3144]$ gtf=gencode.v35.primary_assembly.annotation.gtf
[user@cn3144]$ genome=GRCh38.primary_assembly.genome.fa
[user@cn3144]$ bam1=ENCFF291EKY.bam
[user@cn3144]$ bam2=ENCFF613SDS.bam
[user@cn3144]$ talon_initialize_database \
        --f $gtf \
        --a gencode_35 \
        --g GRCh38 \
        --o example_talon
chr1
bulk update genes...
bulk update gene_annotations...
bulk update transcripts...
[...snip...]
[user@cn3144]$ mkdir -p labeled tmp
[user@cn3144]$ ### label reads with fraction-A information to flag potential internal priming
[user@cn3144]$ talon_label_reads --f $bam1 \
        --g $genome  \
        --t $SLURM_CPUS_PER_TASK \
        --ar 20 \
        --tmpDir=/lscratch/$SLURM_JOBID/tmp \
        --deleteTmp \
        --o labeled/${bam1%.bam}
[ 2021-04-20 17:10:44 ] Started talon_label_reads run.
[ 2021-04-20 17:10:44 ] Splitting SAM by chromosome...
[ 2021-04-20 17:10:44 ] -----Writing chrom files...
[ 2021-04-20 17:10:59 ] Launching parallel jobs...
[ 2021-04-20 17:11:14 ] Pooling output files...
[ 2021-04-20 17:11:27 ] Run complete
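The `${bam1%.bam}` parameter expansion strips the `.bam` suffix, so `--o labeled/${bam1%.bam}` makes talon_label_reads write its outputs under labeled/ with the accession as the file name prefix. A minimal illustration of the expansion itself:

```shell
bam1=ENCFF291EKY.bam
# %.bam removes the shortest trailing match of ".bam"
echo "labeled/${bam1%.bam}"     # prints: labeled/ENCFF291EKY
```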
[user@cn3144]$ talon_label_reads --f $bam2 \
        --g $genome  \
        --t $SLURM_CPUS_PER_TASK \
        --ar 20 \
        --tmpDir=/lscratch/$SLURM_JOBID/tmp \
        --deleteTmp \
        --o labeled/${bam2%.bam}
[...snip...]
[user@cn3144]$ ### run talon annotator
[user@cn3144]$ cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__
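Per the TALON documentation, each line of the config file has four comma-separated fields: a dataset name, a sample description, the platform, and the path to the labeled SAM file. A quick sanity check that every line is well formed (the contents below mirror the example above with the shell variables already expanded):

```shell
cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/ENCFF291EKY_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/ENCFF613SDS_labeled.sam
__EOF__
# flag any line that does not have exactly 4 comma-separated fields
awk -F',' 'NF != 4 { bad = 1 } END { print (bad ? "malformed" : "ok") }' config.csv
```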
[user@cn3144]$ talon \
        -t $SLURM_CPUS_PER_TASK \
        --f config.csv \
        --db example_talon.db \
        --build GRCh38 \
        --o example
[user@cn3144]$ ### summarize results
[user@cn3144]$ talon_summarize \
       --db example_talon.db \
       --v \
       --o example
[user@cn3144]$ ### run downstream tools (e.g. talon_filter_transcripts, talon_abundance) and copy results back to shared space
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. talon.sh). For example:

#!/bin/bash
set -e

module load talon/5.0

bam1=ENCFF291EKY.bam
bam2=ENCFF613SDS.bam
gtf=gencode.v35.primary_assembly.annotation.gtf
genome=GRCh38.primary_assembly.genome.fa

cd /lscratch/$SLURM_JOB_ID
cp -L ${TALON_TEST_DATA:-none}/* .
talon_initialize_database \
    --f $gtf \
    --a gencode_35 \
    --g GRCh38 \
    --o example_talon

mkdir -p labeled tmp
talon_label_reads --f $bam1 \
    --g $genome  \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOBID/tmp \
    --deleteTmp \
    --o labeled/${bam1%.bam}
talon_label_reads --f $bam2 \
    --g $genome  \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOBID/tmp \
    --deleteTmp \
    --o labeled/${bam2%.bam}

cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__

talon \
    -t $SLURM_CPUS_PER_TASK \
    --f config.csv \
    --db example_talon.db \
    --build GRCh38 \
    --o example

talon_summarize \
   --db example_talon.db \
   --v \
   --o example

# lscratch is deleted when the job ends; copy results back first
cp -r example* labeled "$SLURM_SUBMIT_DIR"

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] --gres=lscratch:50 talon.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. talon.swarm) in a directory containing the input files; note that the labeled/ output directory must exist before the jobs run. For example:

talon_label_reads --f ENCFF291EKY.bam \
    --g GRCh38.primary_assembly.genome.fa  \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOBID \
    --deleteTmp \
    --o labeled/ENCFF291EKY
talon_label_reads --f ENCFF613SDS.bam \
    --g GRCh38.primary_assembly.genome.fa  \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOBID \
    --deleteTmp \
    --o labeled/ENCFF613SDS

Submit this job using the swarm command.

swarm -f talon.swarm [-g 10] [-t 6] --gres=lscratch:50 --module talon/5.0
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module talon Loads the talon module for each subjob in the swarm
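With more than a couple of replicates, the swarmfile can be generated with a short loop instead of writing each command by hand. A sketch using the two accessions from the example above (the single-quoted format string keeps $SLURM_CPUS_PER_TASK and $SLURM_JOBID unexpanded so each subjob evaluates them at run time):

```shell
genome=GRCh38.primary_assembly.genome.fa
# one talon_label_reads command per BAM, written to talon.swarm
for bam in ENCFF291EKY.bam ENCFF613SDS.bam; do
    printf 'talon_label_reads --f %s --g %s --t $SLURM_CPUS_PER_TASK --ar 20 --tmpDir=/lscratch/$SLURM_JOBID --deleteTmp --o labeled/%s\n' \
        "$bam" "$genome" "${bam%.bam}"
done > talon.swarm
```

The resulting file contains one line per BAM and can be submitted with the swarm command shown above.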