Biowulf High Performance Computing at the NIH
rMATS on Biowulf

MATS is a computational tool to detect differential alternative splicing events from RNA-Seq data. The statistical model of MATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold. The replicate MATS (rMATS) is designed for detection of differential alternative splicing from replicate RNA-Seq data.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive -c 16 --mem 45g --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ mkdir -p /data/$USER/rmats && cd /data/$USER/rmats
Here is how one can use rMATS version 4.0.2:
[user@cn3144 ~]$ module load rmats/4.0.2
[+] Loading rmats  4.0.2  on cn2352
[+] Loading singularity  3.6.1  on cn2352
[user@cn3144 ~]$ cp -r /usr/local/apps/rmats/testData/* .

[user@cn3144 ~]$ export TMPDIR=/lscratch/$SLURM_JOBID # write temp files in lscratch

[user@cn3144 ~]$ rmats --s1 $PWD/s1.txt --s2 $PWD/s2.txt --gtf gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100 --od out_test -t paired --nthread $SLURM_CPUS_PER_TASK --readLength 50 --tophatAnchor 8 --cstat 0.0001 --tstat 6 
In order to use the most recent version 4.1.0, type:
[user@cn3144 ~]$ module load rmats/4.1.0
[+] Loading samtools 1.10  ...
[+] Loading STAR  2.7.1a
[+] Loading openmpi 3.1.4  for GCC 9.2.0
[+] Loading ImageMagick  7.0.8  on cn0860
[+] Loading HDF5  1.10.4
[+] Loading NetCDF 4.7.4_gcc9.2.0
[+] Loading pandoc  2.10.1  on cn0860
[+] Loading pcre2 10.21  ...
[+] Loading R 4.0.0
[+] Loading zlib 1.2.11  ...
[+] Loading LAPACK/3.8.0  3.8.0-gcc4.8.5  libraries...
[+] Loading GSL 2.6 for GCC 4.8.5 ...
[+] Loading rMATS 4.1.0  ...
[user@cn3144 ~]$ cp -r $RMATS_DATA/* .

[user@cn3144 ~]$ cp -r $RMATS_SRC/* .  

[user@cn3144 ~]$ make

[user@cn3144 ~]$ source ./setup_environment.sh

[user@cn3144 ~]$ export TMPDIR=/lscratch/$SLURM_JOBID # write temp files in lscratch

IMPORTANT NOTE: the command make builds the local versions, ./test_rmats and ./run_rmats, of the executables test_rmats and run_rmats. It is important that you use these local versions, i.e. put the prefix "./" before the executable name.

[user@cn3144 ~]$ ./test_rmats ... ... test (tests.alternative_3_splice_site_novel.test.NovelJunction) ... ok test (tests.alternative_3_splice_site_novel.test.NovelSpliceSite) ... ok test (tests.alternative_5_splice_site_novel.test.NovelJunction) ... ok test (tests.alternative_5_splice_site_novel.test.NovelSpliceSite) ... ok test (tests.mutually_exclusive_exons_novel.test.NovelJunction) ... ok test (tests.mutually_exclusive_exons_novel.test.NovelSpliceSite) ... ok test (tests.only_one_sample.test.StatOffTest) ... ok test (tests.only_one_sample.test.StatOnTest) ... ok test (tests.paired_stats.test.FilteredEventTest) ... ok test (tests.paired_stats.test.OneEventTest) ... ok test (tests.paired_stats.test.TwoEventTest) ... ok test (tests.prep_post.test.Test) ... ok test (tests.retained_intron_novel.test.NovelJunction) ... ok test (tests.retained_intron_novel.test.NovelSpliceSite) ... ok test (tests.skipped_exon_basic.test.Test) ... ok test (tests.skipped_exon_novel.test.NovelJunction) ... ok test (tests.skipped_exon_novel.test.NovelSpliceSite) ... ok test (tests.variable_read_length.test.Length1Test) ... ok test (tests.variable_read_length.test.Length1VariableTest) ... ok test (tests.variable_read_length.test.Length2Test) ... ok test (tests.variable_read_length.test.Length2VariableTest) ... ok ---------------------------------------------------------------------- Ran 21 tests in 75.324s OK [user@cn3144 ~]$ ./run_rmats --s1 $PWD/s1.txt --s2 $PWD/s2.txt --gtf $PWD/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.7.2a/GENCODE/Gencode_human/release_27/genes-100 --od out_test -t paired --nthread 16 --readLength 50 --tophatAnchor 8 --cstat 0.0001 --tstat 6 --tmp=$TMPDIR mapping the first sample cmd= STAR --chimSegmentMin 2 --outFilterMismatchNmax 3 --alignEndsType EndToEnd --runThreadN 4 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 8 --alignIntronMax 299999 --genomeDir /fdb/STAR_indices/2.7.2a/GENCODE/Gencode_human/release_27/genes-100 --sjdbGTFfile /data/user/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --outFileNamePrefix /lscratch/63360764/bam1_1/ --readFilesIn 231ESRP.25K.rep-1.R1.fastq 231ESRP.25K.rep-1.R2.fastq mapping sample_0, 231ESRP.25K.rep-1.R1.fastq 231ESRP.25K.rep-1.R2.fastq is done with status 0 Aug 20 06:33:51 ..... started STAR run Aug 20 06:33:51 ..... loading genome Aug 20 06:34:25 ..... processing annotations GTF Aug 20 06:34:38 ..... inserting junctions into the genome indices Aug 20 06:39:57 ..... started mapping Aug 20 06:39:58 ..... finished mapping Aug 20 06:40:00 ..... started sorting BAM Aug 20 06:40:00 ..... finished successfully cmd= STAR --chimSegmentMin 2 --outFilterMismatchNmax 3 --alignEndsType EndToEnd --runThreadN 4 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 8 --alignIntronMax 299999 --genomeDir /fdb/STAR_indices/2.7.2a/GENCODE/Gencode_human/release_27/genes-100 --sjdbGTFfile /data/user/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --outFileNamePrefix /lscratch/63360764/bam1_2/ --readFilesIn 231ESRP.25K.rep-2.R1.fastq 231ESRP.25K.rep-2.R2.fastq mapping sample_0, 231ESRP.25K.rep-2.R1.fastq 231ESRP.25K.rep-2.R2.fastq is done with status 0 Aug 20 06:40:01 ..... started STAR run Aug 20 06:40:01 ..... loading genome Aug 20 06:40:38 ..... processing annotations GTF Aug 20 06:40:50 ..... inserting junctions into the genome indices Aug 20 06:46:10 ..... started mapping Aug 20 06:46:11 ..... finished mapping Aug 20 06:46:14 ..... started sorting BAM Aug 20 06:46:14 ..... finished successfully mapping the first sample cmd= STAR --chimSegmentMin 2 --outFilterMismatchNmax 3 --alignEndsType EndToEnd --runThreadN 4 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 8 --alignIntronMax 299999 --genomeDir /fdb/STAR_indices/2.7.2a/GENCODE/Gencode_human/release_27/genes-100 --sjdbGTFfile /data/user/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --outFileNamePrefix /lscratch/63360764/bam2_1/ --readFilesIn 231EV.25K.rep-1.R1.fastq 231EV.25K.rep-1.R2.fastq Aug 20 06:46:14 ..... started STAR run Aug 20 06:46:14 ..... loading genome Aug 20 06:46:50 ..... processing annotations GTF Aug 20 06:47:03 ..... inserting junctions into the genome indices Aug 20 06:52:21 ..... started mapping Aug 20 06:52:22 ..... finished mapping Aug 20 06:52:24 ..... started sorting BAM Aug 20 06:52:24 ..... finished successfully cmd= STAR --chimSegmentMin 2 --outFilterMismatchNmax 3 --alignEndsType EndToEnd --runThreadN 4 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 8 --alignIntronMax 299999 --genomeDir /fdb/STAR_indices/2.7.2a/GENCODE/Gencode_human/release_27/genes-100 --sjdbGTFfile /data/user/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --outFileNamePrefix /lscratch/63360764/bam2_2/ --readFilesIn 231EV.25K.rep-2.R1.fastq 231EV.25K.rep-2.R2.fastq mapping sample_1, 231EV.25K.rep-2.R1.fastq 231EV.25K.rep-2.R2.fastq is done with status 0 Aug 20 06:52:25 ..... started STAR run Aug 20 06:52:25 ..... loading genome Aug 20 06:53:00 ..... processing annotations GTF Aug 20 06:53:13 ..... inserting junctions into the genome indices Aug 20 06:58:35 ..... started mapping Aug 20 06:58:36 ..... finished mapping Aug 20 06:58:39 ..... started sorting BAM Aug 20 06:58:39 ..... finished successfully gtf: 14.693128108978271 There are 57195 distinct gene ID in the gtf file There are 194665 distinct transcript ID in the gtf file There are 35196 one-transcript genes in the gtf file There are 1191466 exons in the gtf file There are 24080 one-exon transcripts in the gtf file There are 21328 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 3.403532 Average number of exons per transcript is 6.120597 Average number of exons per transcript excluding one-exon tx is 6.843427 Average number of gene per geneGroup is 7.430884 statistic: 0.03266119956970215 novel: 0.15076112747192383 The splicing graph and candidate read have been saved into /lscratch/63360764/2020-08-20-06:58:54_457695.rmats save: 0.0016617774963378906 loadsg: 0.0012867450714111328 ========== Done processing each gene from dictionary to compile AS events Found 37784 exon skipping events Found 2018 exon MX events Found 12583 alt SS events There are 7698 alt 3 SS events and 4885 alt 5 SS events. Found 5539 RI events ========== ase: 1.9961071014404297 count: 0.5084185600280762 Processing count files. Done processing count files.
The results are stored in the folder out_test.
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. submit.sh). For example:

#!/bin/bash
set -e
module load rmats
export TMPDIR=/lscratch/$SLURM_JOBID
run_rmats --s1 $PWD/s1.txt --s2 $PWD/s2.txt --gtf $PWDgtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100 --od out_test -t paired --nthread $SLURM_CPUS_PER_TASK --readLength 50 --tophatAnchor 8 --cstat 0.0001 --tstat 6

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=30g --gres=lscratch:20 submit.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. rmats.swarm). For example:

run_rmats --s1 $PWD/s1.txt --s2 $PWD/s2.txt --gtf $PWDgtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100 --od out_test1 -t paired --nthread $SLURM_CPUS_PER_TASK --readLength 50  --tophatAnchor 8 --cstat
run_rmats --s1 $PWD s3.txt --s2 $PWD s4.txt --gtf $PWDgtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100 --od out_test2 -t paired --nthread $SLURM_CPUS_PER_TASK --readLength 50  --tophatAnchor 8 --cstat
run_rmats --s1 $PWD s5.txt --s2 $PWD s6.txt --gtf $PWDgtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi /fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100 --od out_test3 -t paired --nthread $SLURM_CPUS_PER_TASK --readLength 50  --tophatAnchor 8 --cstat

Submit this job using the swarm command.

swarm -f rmats.swarm [-g 30] [-t 16] --module rmats
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module TEMPLATE Loads the TEMPLATE module for each subjob in the swarm