ESPRESSO is a novel method for processing alignments of long-read RNA-seq data
that improves splice junction accuracy and isoform quantification.
ESPRESSO jointly considers alignments of all long reads aligned to a gene
and uses error profiles of individual reads
to improve the identification of splice junctions
and the discovery of their corresponding transcript isoforms.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive
[user@cn3144 ~]$ module load espresso
[+] Loading singularity  4.0.1  on cn3144
[+] Loading espresso  1.4.0

[user@cn3144 ~]$ ESPRESSO_C.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  C_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_C.pl -I work_dir -F ref.fa -X target_ID

Arguments:

    -I, --in
          work directory (generated by ESPRESSO_S)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -X, --target_ID
          ID of sample to process (required)
    -H, --help
          show this help information
    -T, --num_thread
          thread number (default: 5)
    --sort_buffer_size
          memory buffer size for running 'sort' commands (default: 2G)

[user@cn3144 ~]$ ESPRESSO_Q.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  Q_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_Q.pl -L work_dir/samples.tsv.updated -A anno.gtf

Arguments:

    -L, --list_samples
          tsv list of multiple samples (each bam in a line with 1st column as
          sorted bam file, 2nd column as sample name in output, 3rd column as
          directory of ESPRESSO_C results; this list can be generated by
          ESPRESSO_S according to the initially provided tsv list; required)
    -A, --anno
          input annotation file in GTF format (optional)
    -O, --out_dir
          output directory (default: directory of -L)
    -V, --tsv_compt
          output tsv for compatible isoform(s) of each read (optional)
    -T --num_thread
          how many threads to use (default: 5)
    -H, --help
          show this help information
    -N, --read_num_cutoff
          min perfect read count for all splice junctions of novel isoform
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for all splice junctions of novel isoform
          (default: 0)
    -S, --SJ_dist
          max number of bases that an alignment endpoint can extend past the
          start or end of a matched isoform (default: 35)
    --internal_boundary_limit
          max number of bases that an alignment endpoint can extend into an
          intron of a matched isoform (default: 6)
    --allow_longer_terminal_exons
          allow an alignment to match an isoform even if the alignment endpoint
          extends more than --SJ_dist past the start or end

[user@cn3144 ~]$ ESPRESSO_S.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  S_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_S.pl -L samples.tsv -F ref.fa -A anno.gtf -O work_dir

Arguments:

    -L, --list_samples
          tsv list of sample(s) (each file in a line with 1st column as sorted
          BAM/SAM file and 2nd column as sample name; required)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -A, --anno
          input annotation file in GTF format (optional)
    -B, --SJ_bed
          input custom reliable splice junctions in BED format (optional; each
          reliable SJ in one line, with the 1st column as chromosome, the 2nd
          column as upstream splice site 0-base coordinate, the 3rd column as
          downstream splice site and 6th column as strand)
    -O, --out
          work directory (existing files in this directory may be OVERWRITTEN;
          default: ./)
    -H, --help
          show this help information
    -N, --read_num_cutoff
          min perfect read count for denovo detected candidate splice junctions
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for denovo detected candidate splice junctions:
          Set this as 1 for completely GTF-dependent processing (default: 0)
    -C, --cont_del_max
          max continuous deletion allowed; intron will be identified if longer
          (default: 50)
    -M, --chrM
          tell ESPRESSO the ID of mitochondrion in reference file (default:
          chrM)
    -T, --num_thread
          thread number (default: minimum of 5 and sam file number)
    -Q, --mapq_cutoff
          min mapping quality for processing (default: 1)
    --sort_buffer_size
          memory buffer size for running 'sort' commands (default: 2G)

[user@cn3144 ~]$ git clone https://github.com/Xinglab/espresso
[user@cn3144 ~]$ cd espresso
[user@cn3144 ~]$ python-espresso tests/high_confidence_sjs/test.py
test (__main__.HighConfidenceSjsTest.test) ... (config=strict) (config=num) (config=ratio) (config=num_and_ratio) (config=gtf) (config=bed) (config=bed_and_num_and_ratio) ok
----------------------------------------------------------------------
Ran 1 test in 230.805s

[user@cn3144 ~]$ python-espresso tests/alignments/test.py
test (__main__.ChrNameMismatchTest.test) ... ok
test (__main__.CigarFormatTest.test) ... ok
test (__main__.MissingSequenceTest.test) ... ok
test (__main__.SecondaryAlignmentTest.test) ... ok
----------------------------------------------------------------------
Ran 4 tests in 90.046s

[user@cn3144 ~]$ python-espresso tests/isoform_assignment/test.py
test (__main__.IsoformAssignmentTest.test) ... ok
test (__main__.NoExternalBoundaryTest.test) ... ok
test (__main__.ReadEndpointsTest.test) ... ok
----------------------------------------------------------------------
Ran 3 tests in 140.279s

[user@cn3111 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
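
The help output above implies a three-step order: ESPRESSO_S creates the work directory and samples.tsv.updated, ESPRESSO_C processes one sample from that work directory, and ESPRESSO_Q quantifies from the updated sample list. The following is a minimal sketch of that order, not a validated pipeline. It uses only options shown in the help text. The file names (ref.fa, anno.gtf, sample1.sorted.bam, work_dir), the `run` helper, and the DRY_RUN variable are hypothetical placeholders. The target ID 0 passed to -X is an assumption for a single-sample list; check samples.tsv.updated for the actual ID.

```shell
#!/bin/sh
# Sketch of the ESPRESSO S -> C -> Q order, built from the usage lines above.
# All file names are placeholders (assumptions), not files shipped with the module.
set -eu

REF=ref.fa        # must be the same FASTA that was given to the mapper
GTF=anno.gtf      # annotation in GTF format (optional for S and Q)
WORK=work_dir     # existing files here may be overwritten by ESPRESSO_S
THREADS=5         # matches the documented default

# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# samples.tsv: column 1 = sorted BAM/SAM file, column 2 = sample name
printf '%s\t%s\n' sample1.sorted.bam sample1 > samples.tsv

# Step 1: ESPRESSO_S writes the work directory and samples.tsv.updated
run ESPRESSO_S.pl -L samples.tsv -F "$REF" -A "$GTF" -O "$WORK" -T "$THREADS"

# Step 2: ESPRESSO_C, once per sample (target ID 0 is an assumption)
run ESPRESSO_C.pl -I "$WORK" -F "$REF" -X 0 -T "$THREADS"

# Step 3: ESPRESSO_Q on the list updated by ESPRESSO_S
run ESPRESSO_Q.pl -L "$WORK/samples.tsv.updated" -A "$GTF" -T "$THREADS"
```

Run as-is, the script prints the three commands for inspection. Setting DRY_RUN=0 inside an interactive session, after `module load espresso`, would execute them, provided the placeholder inputs are replaced with real files.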