TranscriptClean is a Python program that corrects mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome. It is designed for use with SAM files from the PacBio Iso-seq and Oxford Nanopore transcriptome sequencing technologies. A variant-aware mode is available for users who want to avoid correcting away known variants in their data.
Allocate an interactive session and run the program.
Sample session (user input follows each $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn4338 ~]$ module load transcript_clean
[+] Loading transcript_clean 2.1 on cn4338
[+] Loading singularity 4.1.5 on cn4338
[user@cn4338 ~]$ cp -a /usr/local/apps/transcript_clean/example_files .
[user@cn4338 ~]$ cd example_files
[user@cn4338 ~]$ transcriptclean -h
Usage: transcriptclean [options]

Options:
  -h, --help            show this help message and exit
  -s FILE, --sam=FILE   Input SAM file containing transcripts to correct.
                        Must contain a header.
  -g FILE, --genome=FILE
                        Reference genome fasta file. Should be the same one
                        used during mapping to generate the provided SAM file.
  -t N_THREADS, --threads=N_THREADS
                        Number of threads to run program with.
  -j FILE, --spliceJns=FILE
                        Splice junction file obtained by mapping Illumina
                        reads to the genome using STAR, or alternately,
                        extracted from a GTF using the accessory script. More
                        formats may be supported in the future.
  -v FILE, --variants=FILE
                        VCF formatted file of variants to avoid correcting
                        away in the data (optional).
  --maxLenIndel=MAXLENINDEL
                        Maximum size indel to correct (Default: 5 bp)
  --maxSJOffset=MAXSJOFFSET
                        Maximum distance from annotated splice junction to
                        correct (Default: 5 bp)
  -o FILE, --outprefix=FILE
                        Output file prefix. '_clean' plus a file extension
                        will be added to the end.
  -m CORRECTMISMATCHES, --correctMismatches=CORRECTMISMATCHES
                        If set to false, TranscriptClean will skip mismatch
                        correction. Default: true
  -i CORRECTINDELS, --correctIndels=CORRECTINDELS
                        If set to false, TranscriptClean will skip indel
                        correction. Default: true
  --correctSJs=CORRECTSJS
                        If set to false, TranscriptClean will skip splice
                        junction correction. Default: true, but you must
                        provide a splice junction annotation file in order
                        for it to work.
  --dryRun              If this option is set, TranscriptClean will read in
                        the sam file and record all insertions, deletions,
                        and mismatches, but it will skip correction. This
                        mode is useful for checking the distribution of
                        transcript errors in the data before running
                        correction.
  --primaryOnly         If this option is set, TranscriptClean will only
                        output primary mappings of transcripts (ie it will
                        filter out unmapped and multimapped lines from the
                        SAM input.
  --canonOnly           If this option is set, TranscriptClean will output
                        only canonical transcripts and transcripts containing
                        annotated noncanonical junctions to the clean SAM
                        file at the end of the run.
  --tmpDir=TMP_PATH     If you would like the tmp files to be written
                        somewhere different than the final output, provide
                        the path to that location here.
  --bufferSize=BUFFER_SIZE
                        Number of lines to output to file at once by each
                        thread during run. Default = 100
  --deleteTmp           If this option is set, the temporary directory
                        generated by TranscriptClean (TC_tmp) will be removed
                        at the end of the run.
Basic Correction
[user@cn4338 example_files]$ transcriptclean \
    --sam GM12878_chr1.sam \
    --genome chr1.fa \
    --outprefix outputs
Reading genome ..............................
No splice annotation provided. Will skip splice junction correction.
No variant file provided. Transcript correction will not be variant-aware.
Reference file processing took 0:00:00
Correcting transcripts...
Took 0:00:27 to process transcript batch.
Took 0:00:00 to combine all outputs.
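The basic run above skips splice junction correction and is not variant-aware, because no annotation or variant file was supplied. The following is a minimal sketch of how the --spliceJns, --variants, and --threads options from the help text might be combined. The splice junction and VCF file names, the thread count, and the output prefix are hypothetical placeholders, not files known to ship with the example data. The script prints the assembled command and runs it only if the tool and all inputs are present.

```shell
#!/bin/bash
# Sketch only: SJ_FILE, VCF_FILE, THREADS and the output prefix are
# hypothetical placeholders; substitute your own files and values.
SAM=GM12878_chr1.sam
GENOME=chr1.fa
SJ_FILE=my_splice_junctions.tab
VCF_FILE=my_variants.vcf
THREADS=4

# Assemble the command as a string (assumes no spaces in file names).
CMD="transcriptclean --sam $SAM --genome $GENOME --spliceJns $SJ_FILE --variants $VCF_FILE --threads $THREADS --outprefix outputs_sj_var"

# Print the command so it can be reviewed before anything runs.
echo "$CMD"

# Run only when the tool is on PATH and every input file exists.
if command -v transcriptclean >/dev/null 2>&1 \
   && [ -f "$SAM" ] && [ -f "$GENOME" ] \
   && [ -f "$SJ_FILE" ] && [ -f "$VCF_FILE" ]; then
    $CMD
else
    echo "Skipping run: transcriptclean or an input file is missing." >&2
fi
```

Adding --dryRun to the same command should, according to the help text, record the errors without correcting them, which can be a useful first pass on a new dataset.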
Create a batch input file (e.g. transcript-clean.sh). For example:
#!/bin/bash
set -e
module load transcript_clean
transcriptclean --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] transcript-clean.sh
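Since transcriptclean accepts a --threads option, it may help to tie the thread count to the CPUs that Slurm actually allocates. The sketch below writes a variant of the batch script that reads the standard SLURM_CPUS_PER_TASK environment variable (falling back to 1 when it is unset) and then prints a possible submission command. The values 4 CPUs and 16g of memory are illustrative assumptions, not tested recommendations for this tool.

```shell
#!/bin/bash
# Write a batch script whose thread count follows the Slurm allocation.
# The quoted 'EOF' keeps ${SLURM_CPUS_PER_TASK:-1} unexpanded until the
# job itself runs.
cat > transcript-clean.sh <<'EOF'
#!/bin/bash
set -e
module load transcript_clean
transcriptclean --sam transcripts.sam --genome hg38.fa \
    --threads "${SLURM_CPUS_PER_TASK:-1}" \
    --outprefix /my/path/outfile
EOF

# Possible submission command; CPU and memory values are assumptions.
echo "sbatch --cpus-per-task=4 --mem=16g transcript-clean.sh"
```

With this arrangement, changing --cpus-per-task at submission time changes the thread count without editing the script.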