High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Express on Biowulf & Helix

eXpress is a streaming tool for quantifying the abundances of a set of target sequences from sampled subsequences. Example applications include transcript-level RNA-Seq quantification, allele-specific/haplotype expression analysis (from RNA-Seq), transcription factor binding quantification in ChIP-Seq, and analysis of metagenomic data. It is based on an online-EM algorithm [more info here] that results in space (memory) requirements proportional to the total size of the target sequences and time requirements that are proportional to the number of sampled fragments. Thus, in applications such as RNA-Seq, eXpress can accurately quantify much larger samples than other currently available tools greatly reducing computing infrastructure requirements. eXpress can be used to build lightweight high-throughput sequencing processing pipelines when coupled with a streaming aligner (such as Bowtie), as output can be piped directly into eXpress, effectively eliminating the need to store read alignments in memory or on disk.

The environment variable(s) need to be set properly first. The easiest way to do this is by using the modules commands, 'module load express', as in the example below.

biowulf% module load express
This command will load the required bowtie/1.0 and samtools as well as express.

Express was developed by Adam Roberts and Lior Pachter at the University of California, Berkeley. [Express website]

Running Express on Helix

The sample sessions in this document use the example sets from the Express documentation. In the following two sub-sections, you will run eXpress on a sample RNA-Seq dataset with simulated reads from UGT3A2 and the HOXC cluster using the human genome build hg18. Both the transcript sequences (transcripts.fasta) and raw reads (reads_1.fastq, reads_2.fastq) can be found in the /usr/local/express/sample_data directory. For this example to work, you will need to have both bowtie and samtools in your $PATH (you can use 'module' command), but in general any aligner will work and the conversion to BAM is not necessary unless you have insufficient disk space to store the uncompressed SAM.

Before you begin, you must prepare your Bowtie index. Since you wish to allow many multi-mappings, it is useful to build the index with a small offrate (in this case 1). The smaller the offrate, the larger the index and the faster the mapping. If you have disk space to spare, always use an offrate of 1.

  1. Build the index with the following commands.
    $ mkdir /data/userID/express
    $ cp -rp /usr/local/express/sample_data /data/userID/express
    $ cd /data/userID/express/sample_data
    $ module load express
    $ bowtie-build --offrate 1 transcripts.fasta transcript
    

    This command will populate your directory with several index files that allow Bowtie to more easily align reads to the transcripts.

    You can now map the reads to the transcript sequences using the following Bowtie command, which outputs in SAM (-S), allows for unlimited multi-mappings (-a), a maximum insert distance of 800 bp between the paired-ends (-X 800), and 3 mismatches (-v 3). The first three options (a,S,X) are highly recommended for best results. You should also allow for many mismatches, since eXpress models sequencing errors. Furthermore, you will want to take advantage of multiple processors when mapping large files using the -poption. See the Bowtie Manual for more details on various parameters and options.

  2. The SAM output from Bowtie is piped into SAMtools in order to compress it to BAM format. This conversion is optional, but will greatly reduce the size of the alignment file.

    $ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | samtools view -Sb - > hits.bam
    

  3. Once you have aligned your reads to the transcriptome and stored them in a SAM or BAM file, you can run eXpress in default mode with the command:
    $ module load express
    $ express transcripts.fasta hits.bam
    

  4. If you do not wish to store an intermediate SAM/BAM file, you can pipe the Bowtie output directly into eXpress with the command:
    $ module load bowtie/0.12.8 samtools express
      $ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta 
      

Running a single Express batch job on Biowulf

The following batch script uses the same commands as in the preceding example.

#!/bin/bash
#
# this file is called express.bat

module load express
cd /data/$USER/express
cp -rp $EXPRESS_ROOT/sample_data  .
cd sample_data
bowtie-build --offrate 1 transcripts.fasta transcript
bowtie -p $SLURM_CPUS_PER_TASK -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq \
          | express transcripts.fasta 

Submit this job with:

sbatch  --cpus-per-task=4  express.bat

Bowtie is a multi-threaded program, and so the job is submitted to 4 CPUs ('--cpus-per-task=4' in the sbatch command above). In the batch script express.bat, bowtie is run with '-p $SLURM_CPUS_PER_TASK' to utilize all 4 cpus.

Running a swarm of Express batch jobs on Biowulf

Set up a swarm command file along the following lines:

bowtie -p $SLURM_CPUS_PER_TASK -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq  \
        | express transcripts.fasta
bowtie -p $SLURM_CPUS_PER_TASK -aS -X 800 --offrate 1 -v 3 transcript -1 reads_3.fastq -2 reads_4.fastq \ 
       | express transcripts.fasta
etc...
Submit this job with:
swarm -t 4 -f cmdfile --module express

The parameter '-t 4' tells swarm to allocate 4 CPUs for each command above. Express is single-threaded, but bowtie will make use of the 4 CPUS ('-p $SLURM_CPUS_PER_TASK' in the swarm command file above).

By default, each line in the swarm command file above will be executed on 4 CPU and can utilize up to 8 GB of memory. If your commands require more than 8 GB, you should submit with:

swarm -g # -f cmdfile --module express
where '#' is the number of GigaBytes of memory required for a single process (1 line in your swarm command file).

Running an interactive job on Biowulf

Allocate one or more CPUs and run the commands. Sample session:

[susanc@biowulf]$ sinteractive
salloc.exe: Granted job allocation 16255
slurm stepprolog here!
Begin slurm taskprolog!
End slurm taskprolog!
[susanc@p23 express]$ module load express

[susanc@p23 express]$ cp -rp $EXPRESS_ROOT/sample_data  .

[susanc@p23 express]$ cd sample_data

[susanc@p23 sample_data]$ bowtie-build --offrate 1 transcripts.fasta transcript

Settings:
  Output files: "transcript.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  ..etc...
  [susanc@p23 sample_data]$ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta
WARNING: Could not connect to update server to verify current version. Please check at the eXpress website (http://bio.math.berkeley.edu/eXpress).

2015-Mar-11 14:05:22 - No alignment file specified. Expecting streaming input on stdin...

2015-Mar-11 14:05:22 - Loading target sequences and measuring bias background...
2015-Mar-11 14:05:22 - Initialized 15 targets.
2015-Mar-11 14:05:22 - Processing input fragment alignments...
# reads processed: 10000
# reads with at least one reported alignment: 10000 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 14273 paired-end alignments to 1 output stream(s)
2015-Mar-11 14:05:30 - COMPLETED: Processed 10000 mapped fragments, targets are in 8 bundles.
2015-Mar-11 14:05:30 - WARNING: Not enough fragments observed to accurately learn bias parameters. Either disable bias correction (--no-bias-correct) or provide a file containing auxiliary parameters (--aux-param-file).
2015-Mar-11 14:05:30 - Writing results to file...
2015-Mar-11 14:05:30 - Done.

[susanc@p23 sample_data]$ exit
exit
slurm stepepilog here!
salloc.exe: Relinquishing job allocation 16255
salloc.exe: Job allocation 16255 has been revoked.
[susanc@biowulf]$

Documentation

http://bio.math.berkeley.edu/eXpress/manual.html