TranscriptClean on Biowulf

TranscriptClean is a Python program that corrects mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome. It is designed for use with SAM files from the PacBio Iso-Seq and Oxford Nanopore transcriptome sequencing technologies. A variant-aware mode is available for users who want to avoid correcting away known variants in their data.

References:

Dana Wyman, Ali Mortazavi
TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts
Bioinformatics, Volume 35, Issue 2, January 2019, Pages 340–342

Documentation

Transcript Clean Main Site

Important Notes

Module Name: transcript_clean (see the modules page for more information)
Example files in /usr/local/apps/transcript_clean/example_files

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4338 are ready for job

[user@cn4338 ~]$ module load transcript_clean
[+] Loading transcript_clean  2.1   on cn4338 
[+] Loading singularity  4.1.5  on cn4338 
[user@cn4338 ~]$  cp -a /usr/local/apps/transcript_clean/example_files .
[user@cn4338 ~]$  cd example_files
[user@cn4338 example_files]$  transcriptclean -h
Usage: transcriptclean [options]

Options:
  -h, --help            show this help message and exit
  -s FILE, --sam=FILE   Input SAM file containing transcripts to correct. Must
                        contain a header.
  -g FILE, --genome=FILE
                        Reference genome fasta file. Should be the same one
                        used during mapping to generate the provided SAM file.
  -t N_THREADS, --threads=N_THREADS
                        Number of threads to run program with.
  -j FILE, --spliceJns=FILE
                        Splice junction file obtained by mapping Illumina
                        reads to the genome using STAR, or alternately,
                        extracted from a GTF using the accessory script. More
                        formats may be supported in the future.
  -v FILE, --variants=FILE
                        VCF formatted file of variants to avoid correcting
                        away in the data (optional).
  --maxLenIndel=MAXLENINDEL
                        Maximum size indel to correct (Default: 5 bp)
  --maxSJOffset=MAXSJOFFSET
                        Maximum distance from annotated splice junction to
                        correct (Default: 5 bp)
  -o FILE, --outprefix=FILE
                        Output file prefix. '_clean' plus a file extension
                        will be added to the end.
  -m CORRECTMISMATCHES, --correctMismatches=CORRECTMISMATCHES
                        If set to false, TranscriptClean will skip mismatch
                        correction. Default: true
  -i CORRECTINDELS, --correctIndels=CORRECTINDELS
                        If set to false, TranscriptClean will skip indel
                        correction. Default: true
  --correctSJs=CORRECTSJS
                        If set to false, TranscriptClean will skip splice
                        junction correction. Default: true, but you must
                        provide a splice junction annotation file in order for
                        it to work.
  --dryRun              If this option is set, TranscriptClean will read in
                        the sam file and record all insertions, deletions, and
                        mismatches, but it will skip correction. This mode is
                        useful for checking the distribution of transcript
                        errors in the data before running correction.
  --primaryOnly         If this option is set, TranscriptClean will only
                        output primary mappings of transcripts (i.e. it will
                        filter out unmapped and multimapped lines from the
                        SAM input).
  --canonOnly           If this option is set, TranscriptClean will output
                        only canonical transcripts and transcripts containing
                        annotated noncanonical junctions to the clean SAM file
                        at the end of the run.
  --tmpDir=TMP_PATH     If you would like the tmp files to be written
                        somewhere different than the final output, provide the
                        path to that location here.
  --bufferSize=BUFFER_SIZE
                        Number of lines to output to file at once by each
                        thread during run. Default = 100
  --deleteTmp           If this option is set, the temporary directory
                        generated by TranscriptClean (TC_tmp) will be removed
                        at the end of the run.

Example

Basic Correction

[user@cn4338 example_files]$ transcriptclean \
  --sam GM12878_chr1.sam \
  --genome chr1.fa \
  --outprefix outputs 
  Reading genome ..............................
  No splice annotation provided. Will skip splice junction correction.
  No variant file provided. Transcript correction will not be variant-aware.
  Reference file processing took 0:00:00
  Correcting transcripts...
  Took 0:00:27 to process transcript batch.
  Took 0:00:00 to combine all outputs.
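
Splice Junction and Variant-Aware Correction (sketch)

The options listed in the help output above can be combined to enable splice junction correction and the variant-aware mode. The following is a minimal sketch, not a tested recipe: the file names spliceJns.txt and variants.vcf are hypothetical placeholders (they are not confirmed to be part of the example files), and the output prefix variant_aware is arbitrary. The script prints the assembled command and runs it only if transcriptclean is on the PATH and all four input files exist.

```shell
#!/bin/bash
# Sketch only: spliceJns.txt and variants.vcf are placeholder names, not
# files known to exist in example_files. Substitute your own inputs.
sam=GM12878_chr1.sam
genome=chr1.fa
splice_jns=spliceJns.txt              # hypothetical splice junction file (-j)
variants=variants.vcf                 # hypothetical VCF of known variants (-v)
threads="${SLURM_CPUS_PER_TASK:-1}"   # use the Slurm CPU count if it is set

# Assemble the command from options shown in `transcriptclean -h`.
set -- transcriptclean \
  --sam "$sam" \
  --genome "$genome" \
  --spliceJns "$splice_jns" \
  --variants "$variants" \
  --threads "$threads" \
  --outprefix variant_aware

tc_cmd="$*"
echo "$tc_cmd"

# Run only when the tool is available and every input file is present.
if command -v transcriptclean >/dev/null 2>&1 \
   && [ -f "$sam" ] && [ -f "$genome" ] \
   && [ -f "$splice_jns" ] && [ -f "$variants" ]; then
  "$@"
fi
```

Adding --dryRun to the argument list should, per the help text, record insertions, deletions, and mismatches without correcting them, which can be useful for checking the error distribution before a full run.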

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. transcript-clean.sh). For example:

#!/bin/bash
set -e
module load transcript_clean
transcriptclean --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] transcript-clean.sh
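
When requesting more than one CPU, the thread count passed to transcriptclean should match the allocation. The sketch below writes a variant of the batch script that reads the count from SLURM_CPUS_PER_TASK, which Slurm sets when --cpus-per-task is given. The script name transcript-clean-mt.sh and the resource values in the commented sbatch line (4 CPUs, 8g of memory) are illustrative assumptions, not recommendations; the input and output paths are the same placeholders used in the batch script above.

```shell
#!/bin/bash
# Write a multithreaded batch script (the file name is an arbitrary choice).
cat > transcript-clean-mt.sh <<'EOF'
#!/bin/bash
set -e
module load transcript_clean
# Fall back to 1 thread if --cpus-per-task was not given to sbatch.
transcriptclean \
  --sam transcripts.sam \
  --genome hg38.fa \
  --threads "${SLURM_CPUS_PER_TASK:-1}" \
  --outprefix /my/path/outfile
EOF

# Syntax check only; this does not run TranscriptClean.
bash -n transcript-clean-mt.sh

# Submit with an explicit CPU and memory request (values are assumptions):
# sbatch --cpus-per-task=4 --mem=8g transcript-clean-mt.sh
```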