Biowulf High Performance Computing at the NIH
whippet on Biowulf

Lightweight and Fast; RNA-seq quantification at the event-level

References:

  • T. Sterne-Weiler, R. J. Weatheritt, A. J. Best, K. C. H. Ha, and B. J. Blencowe. Efficient and Accurate Quantitative Profiling of Alternative Splicing Patterns of Any Complexity on a Laptop. Molecular Cell 2018, 72:187-200
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=10g --gres=lscratch:25
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load whippet
[user@cn3144]$ cp -L $WHIPPET_TEST_DATA/* .
[user@cn3144]$ ls -lh
total 7.2G
-rw-r--r-- 1 user group  29M Mar 18 08:51 Mus_musculus.GRCm38.95.gtf.gz
-rw-r--r-- 1 user group 2.6G Mar 18 08:51 Mus_musculus.GRCm38.dna.primary_assembly.fa
-rw-r--r-- 1 user group 2.3G Mar 18 08:51 R1_ENCFF463WEH.fastq.gz
-rw-r--r-- 1 user group 2.3G Mar 18 08:51 R2_ENCFF312OKA.fastq.gz

Note that whippet by default attempts to create a new index inside the whippet installation directory where users don't have write access. Therefore, whippet commands on biowulf need to explicitly use the -x/--index

The first step is to build an index. Note that this has to be done only once per annotation/genome combination. So we'll save the index file for later use

[user@cn3144]$ whippet-index.jl --gtf Mus_musculus.GRCm38.95.gtf.gz \
                      --fasta Mus_musculus.GRCm38.dna.primary_assembly.fa \
                      --index m_musculus_grcm38_ens95 \
                      --suppress-low-tsl
Whippet v0.11 loading and compiling...
INFO: Using proxy ... from http_proxy environment variable
INFO: Using HTTPS proxy ... from https_proxy environment variable
 16.788097 seconds.
Loading GTF file: /lscratch/46116226/Mus_musculus.GRCm38.95.gtf.gz
Loaded 415151 annotated splice-sites from GTF file..
182.078082 seconds (211.84 M allocations: 8.410 GiB, 5.56% gc time)
Indexing transcriptome...
Indexing /lscratch/46116226/Mus_musculus.GRCm38.dna.primary_assembly.fa...
Building Splice Graphs for 1..
 12.099525 seconds (14.28 M allocations: 7.126 GiB, 13.34% gc time)
...
[user@cn3144]$ cp m_musculus_grcm38_ens95.jls /data/$USER/whipet_indices

Then analyse a paired end RNA-Seq experiment


[user@cn3144]$ whippet-quant.jl --biascorrect \
                    --index ./m_musculus_grcm38_ens95 \
                    ENCFF463WEH.fastq.gz  ENCFF312OKA.fastq.gz
Whippet v0.11 loading and compiling...
INFO: Using proxy ... from http_proxy environment variable
INFO: Using HTTPS proxy ... from https_proxy environment variable
 18.474292 seconds
Loading splice graph index... /lscratch/46116226/m_musculus_grcm38_ens95.jls
 10.723297 seconds (5.96 M allocations: 776.678 MiB, 10.32% gc time)
Processing reads from file...
FASTQ_1: /lscratch/46116226/ENCFF463WEH.fastq.gz
FASTQ_2: /lscratch/46116226/ENCFF312OKA.fastq.gz
2019.595531 seconds (3.10 G allocations: 180.318 GiB, 11.79% gc time)
Finished mapping 25109445 paired-end reads of length 101 each out of a total of 31557381 mate-pairs...
Calculating expression values and MLE of equivalence classes with EM:
- 24820 multi-gene mapping read equivalence classes...
- 33369 multi-isoform equivalence classes...
  6.000376 seconds (39.10 k allocations: 3.457 MiB)
Finished calculating transcripts per million (TpM) after 1938 iterations of EM...
Assigning multi-mapping reads based on maximum likelihood estimate..
  0.318131 seconds (348.51 k allocations: 14.446 MiB, 6.94% gc time)
Calculating maximum likelihood estimate of events...
 27.932160 seconds (69.85 M allocations: 4.503 GiB, 14.38% gc time)
Whippet v0.11 done.
2083.024616 seconds (3.20 G allocations: 186.771 GiB, 11.78% gc time)

[user@cn3144]$ ls -lh
total 7.8G
-rw-r--r-- 1 user group 2.3G Mar 18 08:02 ENCFF312OKA.fastq.gz
-rw-r--r-- 1 user group 2.3G Mar 18 08:02 ENCFF463WEH.fastq.gz
-rw-r--r-- 1 user group  29M Nov 25 14:19 Mus_musculus.GRCm38.95.gtf.gz
-rw-r--r-- 1 user group 2.6G Mar 16 10:37 Mus_musculus.GRCm38.dna.primary_assembly.fa
-rw-r--r-- 1 user group 458K Mar 18 08:39 output.gene.tpm.gz
-rw-r--r-- 1 user group 905K Mar 18 08:39 output.isoform.tpm.gz
-rw-r--r-- 1 user group 2.0M Mar 18 08:39 output.jnc.gz
-rw-r--r-- 1 user group  308 Mar 18 08:39 output.map.gz
-rw-r--r-- 1 user group 5.8M Mar 18 08:40 output.psi.gz
-rw-r--r-- 1 user group 3.7M Mar 16 11:28 m_musculus_grcm38_ens95.exons.tab.gz
-rw-r--r-- 1 user group 569M Mar 16 11:28 m_musculus_grcm38_ens95.jls

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. whippet.sh), which uses the input file 'whippet.in'. For example:

We are re-using the index created in the interactive session above

#!/bin/bash
module load whippet/0.11 || exit 1
whippet-quant.jl --biascorrect \
    --index /data/$USER/whipet_indices/m_musculus_grcm38_ens95 \
    ${WHIPPET_TEST_DATA}/ENCFF463WEH.fastq.gz  ${WHIPPET_TEST_DATA}/ENCFF312OKA.fastq.gz

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=10g whippet.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. whippet.swarm). For example:

whippet-quant.jl --biascorrect --index /data/$USER/whipet_indices/m_musculus_grcm38_ens95 s1_r1.fq.gz s1_r2.fq.gz
whippet-quant.jl --biascorrect --index /data/$USER/whipet_indices/m_musculus_grcm38_ens95 s2_r1.fq.gz s2_r2.fq.gz
whippet-quant.jl --biascorrect --index /data/$USER/whipet_indices/m_musculus_grcm38_ens95 s3_r1.fq.gz s3_r2.fq.gz

Submit this job using the swarm command.

swarm -f whippet.swarm -g 10 -t 1 --module whippet
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module whippet Loads the whippet module for each subjob in the swarm