TALON is a Python package for identifying and quantifying known and novel genes/isoforms in long-read transcriptome data sets. TALON is technology-agnostic: it works from mapped SAM files, so data from different sequencing platforms (e.g. PacBio and Oxford Nanopore) can be analyzed side by side.
Example files are available in $TALON_TEST_DATA once the talon module is loaded.
Allocate an interactive session and run through the steps below, using two replicates of human cardiac atrium tissue sequenced on a PacBio Sequel II:
[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=16G --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load talon
[user@cn3144]$ cp -L ${TALON_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 4.6G
-rw-r--r-- 1 user group 172M Apr 20 14:31 ENCFF291EKY.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF291EKY.bam.bai
-rw-r--r-- 1 user group 189M Apr 20 14:31 ENCFF613SDS.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF613SDS.bam.bai
-rw-r--r-- 1 user group 1.3G Apr 20 14:31 gencode.v35.primary_assembly.annotation.gtf
-rw-r--r-- 1 user group 3.0G Apr 20 14:31 GRCh38.primary_assembly.genome.fa
-rw-r--r-- 1 user group 6.4K Apr 20 14:31 GRCh38.primary_assembly.genome.fa.fai
[user@cn3144]$ gtf=gencode.v35.primary_assembly.annotation.gtf
[user@cn3144]$ genome=GRCh38.primary_assembly.genome.fa
[user@cn3144]$ bam1=ENCFF291EKY.bam
[user@cn3144]$ bam2=ENCFF613SDS.bam
[user@cn3144]$ talon_initialize_database \
    --f $gtf \
    --a gencode_35 \
    --g GRCh38 \
    --o example_talon
chr1
bulk update genes...
bulk update gene_annotations...
bulk update transcripts...
[...snip...]
[user@cn3144]$ mkdir -p labeled tmp

[user@cn3144]$ ### check internal priming sites
[user@cn3144]$ talon_label_reads --f $bam1 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam1%.bam}
[ 2021-04-20 17:10:44 ] Started talon_label_reads run.
[ 2021-04-20 17:10:44 ] Splitting SAM by chromosome...
[ 2021-04-20 17:10:44 ] -----Writing chrom files...
[ 2021-04-20 17:10:59 ] Launching parallel jobs...
[ 2021-04-20 17:11:14 ] Pooling output files...
[ 2021-04-20 17:11:27 ] Run complete
[user@cn3144]$ talon_label_reads --f $bam2 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam2%.bam}
[...snip...]

[user@cn3144]$ ### run talon annotator
[user@cn3144]$ cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__
[user@cn3144]$ talon \
    -t $SLURM_CPUS_PER_TASK \
    --f config.csv \
    --db example_talon.db \
    --build GRCh38 \
    --o example

[user@cn3144]$ ### summarize results
[user@cn3144]$ talon_summarize \
    --db example_talon.db \
    --v \
    --o example

[user@cn3144]$ ### run any other tools and then copy results back to shared space
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
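The config.csv in the session above is written by hand for two replicates. With more inputs it may be easier to generate the lines. The following is a minimal sketch, not part of TALON: make_talon_config is a hypothetical helper, and the ex_rep<N> dataset names, the second field, and the platform string are simply carried over from the example and should be adjusted for real data.

```shell
#!/bin/bash
# Sketch (hypothetical helper, not a TALON tool): print one config.csv line
# per BAM file, using the same four comma-separated fields as the example:
#   dataset name, second field (GRCh38 in the example), platform, labeled SAM

make_talon_config() {
    local desc=$1 platform=$2
    shift 2
    local n=1 bam base
    for bam in "$@"; do
        # Strip any directory and the .bam suffix: ENCFF291EKY.bam -> ENCFF291EKY
        base=$(basename "$bam" .bam)
        # Assumes talon_label_reads was run with --o labeled/<base>,
        # which yields labeled/<base>_labeled.sam as in the session above.
        printf 'ex_rep%d,%s,%s,labeled/%s_labeled.sam\n' \
            "$n" "$desc" "$platform" "$base"
        n=$((n + 1))
    done
}

# Reproduces the two lines of the example config on stdout
make_talon_config GRCh38 PacBio-Sequel2 ENCFF291EKY.bam ENCFF613SDS.bam
```

Redirecting the output to config.csv gives a file equivalent to the one created with the here-document.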
Create a batch script (e.g. talon.sh). For example:
#!/bin/bash
module load talon/5.0

bam1=ENCFF291EKY.bam
bam2=ENCFF613SDS.bam
gtf=gencode.v35.primary_assembly.annotation.gtf
genome=GRCh38.primary_assembly.genome.fa

cd /lscratch/$SLURM_JOB_ID
cp -L ${TALON_TEST_DATA:-none}/* .

talon_initialize_database \
    --f $gtf \
    --a gencode_35 \
    --g GRCh38 \
    --o example_talon

mkdir -p labeled tmp
talon_label_reads --f $bam1 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam1%.bam}
talon_label_reads --f $bam2 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam2%.bam}

cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__

talon \
    -t $SLURM_CPUS_PER_TASK \
    --f config.csv \
    --db example_talon.db \
    --build GRCh38 \
    --o example

talon_summarize \
    --db example_talon.db \
    --v \
    --o example
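In the batch script above, ${TALON_TEST_DATA:-none} expands to the literal string "none" if the variable is unset, so a failed module load only shows up later as missing files. One possible guard is to check the inputs right after the copy step and stop early. This is a minimal sketch: check_inputs is a hypothetical helper, not part of TALON or the talon module, and the demo uses a temporary directory rather than the real data.

```shell
#!/bin/bash
# Sketch (hypothetical helper): succeed only if every listed file exists
# and is non-empty; report each problem on stderr.

check_inputs() {
    local f status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Small demonstration with throwaway files
demo_dir=$(mktemp -d)
echo "placeholder" > "$demo_dir/reads.bam"

check_inputs "$demo_dir/reads.bam" && echo "inputs ok"
check_inputs "$demo_dir/reads.bam" "$demo_dir/missing.gtf" || echo "stopping early"

rm -rf "$demo_dir"
```

In a batch script the call might look like `check_inputs "$bam1" "$bam2" "$gtf" "$genome" || exit 1`, placed after the cp line.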
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] --gres=lscratch:50 talon.sh
Create a swarmfile (e.g. talon.swarm). For example:
talon_label_reads --f ENCFF291EKY.bam \
    --g GRCh38.primary_assembly.genome.fa \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID \
    --deleteTmp \
    --o labeled/ENCFF291EKY
talon_label_reads --f ENCFF613SDS.bam \
    --g GRCh38.primary_assembly.genome.fa \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID \
    --deleteTmp \
    --o labeled/ENCFF613SDS
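For more than a handful of BAM files, the swarmfile can be generated instead of typed. The sketch below prints one talon_label_reads command per input with the same options as the example above. make_swarm_line is a hypothetical helper, and each command is written on a single line rather than with backslash continuations.

```shell
#!/bin/bash
# Sketch (hypothetical helper): print one swarm command per BAM file,
# mirroring the options used in the example swarmfile.

make_swarm_line() {
    local bam=$1 genome=$2
    local base
    base=$(basename "$bam" .bam)   # output prefix, e.g. ENCFF291EKY
    # The single-quoted format string keeps $SLURM_CPUS_PER_TASK and
    # $SLURM_JOB_ID literal, so they are expanded inside each subjob
    # and not when the swarmfile is generated.
    printf 'talon_label_reads --f %s --g %s --t $SLURM_CPUS_PER_TASK --ar 20 --tmpDir=/lscratch/$SLURM_JOB_ID --deleteTmp --o labeled/%s\n' \
        "$bam" "$genome" "$base"
}

genome=GRCh38.primary_assembly.genome.fa
for bam in ENCFF291EKY.bam ENCFF613SDS.bam; do
    make_swarm_line "$bam" "$genome"
done
```

Redirecting the loop output (for example `done > talon.swarm`) writes the swarmfile.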
Submit this job using the swarm command.
swarm -f talon.swarm [-g 10] [-t 6] --gres=lscratch:50 --module talon/5.0
where
-g # | Number of gigabytes of memory required for each process (one command in the swarm command file). |
-t # | Number of threads/CPUs required for each process (one command in the swarm command file). |
--module talon | Loads the talon module for each subjob in the swarm. |