ESPRESSO: Error Statistics PRomoted Evaluator of Splice Site Options

ESPRESSO is a novel method for processing alignments of long-read RNA-seq data that improves splice junction accuracy and isoform quantification. It jointly considers all long reads aligned to a gene and uses the error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms.

References:

Yuan Gao, Feng Wang, Robert Wang, Eric Kutschera, Yang Xu, Stephan Xie, Yuanyuan Wang, Kathryn E. Kadash-Edmondson, Lan Lin, Yi Xing
ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data
Science Advances 9(3): eabq5072 (2023)

Documentation

ESPRESSO GitHub page: https://github.com/Xinglab/espresso

Important Notes

Module Name: espresso (see the modules page for more information)
Unusual environment variables set:
- ESPRESSO_HOME: installation directory
- ESPRESSO_BIN: executable directory
- ESPRESSO_DATA: sample data directory
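
A quick way to confirm that the module has set these variables in the current shell is to print them. The following is a minimal sketch; the function name show_espresso_env is hypothetical, and the values printed depend on the installation.

```shell
# Hypothetical helper: print each variable the espresso module is documented
# to set, or "(unset)" if the module has not been loaded in this shell.
show_espresso_env() {
    for name in ESPRESSO_HOME ESPRESSO_BIN ESPRESSO_DATA; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        printf '%s=%s\n' "$name" "${value:-(unset)}"
    done
}

show_espresso_env
```

Run it after "module load espresso"; a line ending in "(unset)" suggests the module is not loaded in that shell.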

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf ~]$ sinteractive
[user@cn3144 ~]$ module load espresso
[+] Loading singularity  4.0.1  on cn3144
[+] Loading espresso  1.4.0
[user@cn3144 ~]$ ESPRESSO_C.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  C_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_C.pl -I work_dir -F ref.fa -X target_ID

Arguments:

    -I, --in
          work directory (generated by ESPRESSO_S)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -X, --target_ID
          ID of sample to process (required)

    -H, --help
          show this help information

    -T, --num_thread
          thread number (default: 5)
    --sort_buffer_size
          memory buffer size for running 'sort' commands (default: 2G)

[user@cn3144 ~]$ ESPRESSO_Q.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  Q_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_Q.pl -L work_dir/samples.tsv.updated -A anno.gtf

Arguments:

    -L, --list_samples
          tsv list of multiple samples (each bam in a line with 1st column as
          sorted bam file, 2nd column as sample name in output, 3rd column as
          directory of ESPRESSO_C results; this list can be generated by
          ESPRESSO_S according to the initially provided tsv list; required)
    -A, --anno
          input annotation file in GTF format (optional)
    -O, --out_dir
          output directory (default: directory of -L)
    -V, --tsv_compt
          output tsv for compatible isoform(s) of each read (optional)
    -T --num_thread
          how many threads to use (default: 5)

    -H, --help
          show this help information

    -N, --read_num_cutoff
          min perfect read count for all splice junctions of novel isoform
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for all splice junctions of novel isoform
          (default: 0)
    -S, --SJ_dist
          max number of bases that an alignment endpoint can extend past the
          start or end of a matched isoform
          (default: 35)
    --internal_boundary_limit
          max number of bases that an alignment endpoint can extend into an
          intron of a matched isoform
          (default: 6)
    --allow_longer_terminal_exons
          allow an alignment to match an isoform even if the alignment endpoint
          extends more than --SJ_dist past the start or end

[user@cn3144 ~]$ ESPRESSO_S.pl -h

Program:  ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options)
Version:  S_1.4.0
Contact:  Yuan Gao <gaoy@email.chop.edu, gy.james@163.com>

Usage:    perl ESPRESSO_S.pl -L samples.tsv -F ref.fa -A anno.gtf -O work_dir

Arguments:

    -L, --list_samples
          tsv list of sample(s) (each file in a line with 1st column as sorted
          BAM/SAM file and 2nd column as sample name; required)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -A, --anno
          input annotation file in GTF format (optional)
    -B, --SJ_bed
          input custom reliable splice junctions in BED format (optional; each
          reliable SJ in one line, with the 1st column as chromosome, the 2nd
          column as upstream splice site 0-base coordinate, the 3rd column as
          downstream splice site and 6th column as strand)
    -O, --out
          work directory (existing files in this directory may be OVERWRITTEN;
          default: ./)

    -H, --help
          show this help information

    -N, --read_num_cutoff
          min perfect read count for denovo detected candidate splice junctions
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for denovo detected candidate splice junctions:
          Set this as 1 for completely GTF-dependent processing (default: 0)

    -C, --cont_del_max
          max continuous deletion allowed; intron will be identified if longer
          (default: 50)
    -M, --chrM
          tell ESPRESSO the ID of mitochondrion in reference file (default:
          chrM)

    -T, --num_thread
          thread number (default: minimum of 5 and sam file number)
    -Q, --mapq_cutoff
          min mapping quality for processing (default: 1)
    --sort_buffer_size
          memory buffer size for running 'sort' commands (default: 2G)

[user@cn3144 ~]$ git clone https://github.com/Xinglab/espresso
[user@cn3144 ~]$ cd espresso
[user@cn3144 ~]$ python-espresso tests/high_confidence_sjs/test.py
test (__main__.HighConfidenceSjsTest.test) ... (config=strict)
(config=num)
(config=ratio)
(config=num_and_ratio)
(config=gtf)
(config=bed)
(config=bed_and_num_and_ratio)
ok

----------------------------------------------------------------------
Ran 1 test in 230.805s

[user@cn3144 ~]$ python-espresso tests/alignments/test.py
test (__main__.ChrNameMismatchTest.test) ... ok
test (__main__.CigarFormatTest.test) ... ok
test (__main__.MissingSequenceTest.test) ... ok
test (__main__.SecondaryAlignmentTest.test) ... ok

----------------------------------------------------------------------
Ran 4 tests in 90.046s

[user@cn3144 ~]$ python-espresso tests/isoform_assignment/test.py
test (__main__.IsoformAssignmentTest.test) ... ok
test (__main__.NoExternalBoundaryTest.test) ... ok
test (__main__.ReadEndpointsTest.test) ... ok

----------------------------------------------------------------------
Ran 3 tests in 140.279s
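
Taken together, the three usage lines above suggest a workflow: ESPRESSO_S.pl prepares a work directory from a list of samples, ESPRESSO_C.pl is run for each sample ID, and ESPRESSO_Q.pl quantifies isoforms from the updated sample list. The following is a minimal sketch of that sequence, not a tested pipeline. The helper names add_sample, run, and espresso_pipeline are hypothetical, as are all file names; only flags that appear in the help output above are used. The sketch assumes the list is named samples.tsv, so that the updated list is work_dir/samples.tsv.updated as in the ESPRESSO_Q.pl usage line, and that the IDs passed to -X are the ones ESPRESSO_S.pl assigns to the samples (check the updated list for the values that apply to your data). Commands are only printed unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: append one line to a samples list.
# Column 1 is the sorted BAM/SAM file, column 2 the sample name (tab-separated).
#   $1 list file   $2 alignment file   $3 sample name
add_sample() {
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three steps.
#   $1 samples list (assumed to be named samples.tsv)
#   $2 reference FASTA (the same file that was given to the mapper)
#   $3 annotation GTF
#   $4 work directory
#   $5... sample IDs to pass to ESPRESSO_C.pl -X
espresso_pipeline() {
    if [ "$#" -lt 4 ]; then
        echo "usage: espresso_pipeline samples.tsv ref.fa anno.gtf work_dir [ID...]" >&2
        return 2
    fi
    samples=$1; ref=$2; gtf=$3; work=$4
    shift 4

    # Step 1: preprocess all samples into the work directory.
    run ESPRESSO_S.pl -L "$samples" -F "$ref" -A "$gtf" -O "$work" || return 1

    # Step 2: process each sample ID separately.
    for id in "$@"; do
        run ESPRESSO_C.pl -I "$work" -F "$ref" -X "$id" || return 1
    done

    # Step 3: quantify isoforms from the list updated by ESPRESSO_S.pl.
    run ESPRESSO_Q.pl -L "$work/samples.tsv.updated" -A "$gtf"
}
```

A possible call, with placeholder paths and a placeholder sample ID of 0:

    add_sample samples.tsv /path/to/sample1.sorted.bam sample1
    espresso_pipeline samples.tsv ref.fa anno.gtf work_dir 0

With the default DRY_RUN=1 the second call only prints the three commands, so they can be reviewed before running them with DRY_RUN=0.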

End the interactive session:

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$