TranscriptClean is a Python program that corrects mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome. It is designed for use with SAM files from the PacBio Iso-Seq and Oxford Nanopore transcriptome sequencing technologies. A variant-aware mode is available for users who want to avoid correcting away known variants in their data.
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4338 are ready for job
[user@cn4338 ~]$ module load transcript_clean
[+] Loading transcript_clean 2.1 on cn4338
[+] Loading singularity 4.1.5 on cn4338
[user@cn4338 ~]$ cp -a /usr/local/apps/transcript_clean/example_files .
[user@cn4338 ~]$ cd example_files
[user@cn4338 example_files]$ transcriptclean -h
Usage: transcriptclean [options]
Options:
-h, --help show this help message and exit
-s FILE, --sam=FILE Input SAM file containing transcripts to correct. Must
contain a header.
-g FILE, --genome=FILE
Reference genome fasta file. Should be the same one
used during mapping to generate the provided SAM file.
-t N_THREADS, --threads=N_THREADS
Number of threads to run program with.
-j FILE, --spliceJns=FILE
Splice junction file obtained by mapping Illumina
reads to the genome using STAR, or alternately,
extracted from a GTF using the accessory script. More
formats may be supported in the future.
-v FILE, --variants=FILE
VCF formatted file of variants to avoid correcting
away in the data (optional).
--maxLenIndel=MAXLENINDEL
Maximum size indel to correct (Default: 5 bp)
--maxSJOffset=MAXSJOFFSET
Maximum distance from annotated splice junction to
correct (Default: 5 bp)
-o FILE, --outprefix=FILE
Output file prefix. '_clean' plus a file extension
will be added to the end.
-m CORRECTMISMATCHES, --correctMismatches=CORRECTMISMATCHES
If set to false, TranscriptClean will skip mismatch
correction. Default: true
-i CORRECTINDELS, --correctIndels=CORRECTINDELS
If set to false, TranscriptClean will skip indel
correction. Default: true
--correctSJs=CORRECTSJS
If set to false, TranscriptClean will skip splice
junction correction. Default: true, but you must
provide a splice junction annotation file in order for
it to work.
--dryRun If this option is set, TranscriptClean will read in
the sam file and record all insertions, deletions, and
mismatches, but it will skip correction. This mode is
useful for checking the distribution of transcript
errors in the data before running correction.
--primaryOnly If this option is set, TranscriptClean will only
output primary mappings of transcripts (ie it will
filter out unmapped and
multimapped lines from the SAM input.
--canonOnly If this option is set, TranscriptClean will output
only canonical transcripts and transcripts containing
annotated noncanonical junctions to the clean SAM file
at the end of the run.
--tmpDir=TMP_PATH If you would like the tmp files to be written
somewhere different than the final output, provide the
path to that location here.
--bufferSize=BUFFER_SIZE
Number of lines to output to file at once by each
thread during run. Default = 100
--deleteTmp If this option is set, the temporary directory
generated by TranscriptClean (TC_tmp) will be removed
at the end of the run.
Basic Correction
[user@cn4338 example_files]$ transcriptclean \
    --sam GM12878_chr1.sam \
    --genome chr1.fa \
    --outprefix outputs
Reading genome ..............................
No splice annotation provided. Will skip splice junction correction.
No variant file provided. Transcript correction will not be variant-aware.
Reference file processing took 0:00:00
Correcting transcripts...
Took 0:00:27 to process transcript batch.
Took 0:00:00 to combine all outputs.
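The basic run skips splice junction correction and is not variant-aware because neither optional file is supplied. The following is a minimal sketch of a run that supplies both, using only options listed in the help text. The file names `spliceJns.txt` and `variants.vcf` and the output prefix are hypothetical placeholders, not files known to ship with the example data; substitute your own.

```shell
# Hypothetical input names; replace with your own files.
SAM=GM12878_chr1.sam
GENOME=chr1.fa
SPLICE_JNS=spliceJns.txt   # assumed name: splice junction file (e.g. from STAR)
VARIANTS=variants.vcf      # assumed name: VCF of known variants to preserve
OUT_PREFIX=outputs_variant_aware

# Only run if the module has been loaded and the command is on PATH.
if command -v transcriptclean >/dev/null 2>&1; then
    transcriptclean \
        --sam "$SAM" \
        --genome "$GENOME" \
        --spliceJns "$SPLICE_JNS" \
        --variants "$VARIANTS" \
        --outprefix "$OUT_PREFIX"
else
    echo "transcriptclean not found; run 'module load transcript_clean' first" >&2
fi
```

Adding `--dryRun` to the same command should record the insertions, deletions, and mismatches without correcting them, which the help text suggests as a way to inspect the error distribution first.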
Create a batch input file (e.g. transcript-clean.sh). For example:
#!/bin/bash
set -e
module load transcript_clean
transcriptclean --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] transcript-clean.sh
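Since the help text lists a `--threads` option, one possible refinement is to tie the thread count to the CPUs that Slurm allocates. The sketch below writes such a batch script with a here-document. The resource values in the commented `sbatch` line (4 CPUs, 16 GB) are illustrative assumptions, not recommendations from the TranscriptClean documentation, and the input paths are the same placeholders used above.

```shell
# Write a batch script that passes the Slurm CPU allocation to --threads.
# The quoted 'EOF' keeps ${SLURM_CPUS_PER_TASK} unexpanded until the job runs.
cat > transcript-clean.sh <<'EOF'
#!/bin/bash
set -e
module load transcript_clean
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# fall back to a single thread otherwise.
transcriptclean \
    --sam transcripts.sam \
    --genome hg38.fa \
    --threads "${SLURM_CPUS_PER_TASK:-1}" \
    --outprefix /my/path/outfile
EOF

# Submit with example resource values (assumptions; adjust for your data):
# sbatch --cpus-per-task=4 --mem=16g transcript-clean.sh
```

Requesting CPUs with `--cpus-per-task` without also passing `--threads` would leave the extra CPUs idle, which is why the two are linked here.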