High-Performance Computing at the NIH
rail-rna on Biowulf & Helix

Description

Rail-RNA is a spliced RNA-Seq aligner. It has three modes of operation:

local
Run on a single compute node using multiple processes
parallel
Run across multiple compute nodes using an ad hoc ipyparallel cluster
elastic
Run on Amazon Elastic MapReduce

Rail-RNA uses cross-sample information to improve the sensitivity and specificity of splice junction detection. In addition to alignments, Rail-RNA also produces a number of other outputs (bigWig density tracks, feature matrices, a read count matrix, ...).

There may be multiple versions of rail-rna available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail rail-rna 

To select a module use

module load rail-rna/[version]

where [version] is the version of choice.

rail-rna is a multithreaded application. Make sure to match the number of CPUs requested with the number of processes (the -p option).

Environment variables set

Dependencies

All dependencies are packaged as part of rail-rna. No additional modules need to be loaded.

References

Documentation

On Helix

Rail-RNA should not be run on Helix.

Batch job on Biowulf

For the following example copy the test data from the Rail-RNA directory:

biowulf$ mkdir fq
biowulf$ cp /usr/local/apps/rail-rna/TEST_DATA/*.fastq fq
biowulf$ tree fq
fq
|-- [user   5.0M]  dm3_example_1_left.fastq
|-- [user   5.0M]  dm3_example_1_right.fastq
|-- [user   5.1M]  dm3_example_2_left.fastq
`-- [user   5.1M]  dm3_example_2_right.fastq

Rail-RNA uses a tab-delimited configuration file (the manifest) to set up an analysis. Its format is

<fastq/fasta 1> <md5sum or 0> [<fastq/fasta 2> <md5sum or 0>] <sample label>
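The second and fourth columns can carry actual md5 checksums instead of 0 if you want Rail-RNA to verify the input files. A quick way to generate a digest for the manifest (illustrative file name, not part of the test data):

```shell
# Create a tiny illustrative FASTQ and compute its md5 for the manifest column
printf '@read1\nACGT\n+\nIIII\n' > example.fastq
md5sum example.fastq | awk '{ print $1 }'   # prints a 32-character hex digest
rm example.fastq
```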

Create a configuration file for the test data:

biowulf$ cat > config <<EOF
./fq/dm3_example_1_left.fastq	0	./fq/dm3_example_1_right.fastq	0	dm3_example-1-1
./fq/dm3_example_2_left.fastq	0	./fq/dm3_example_2_right.fastq	0	dm3_example-2-1
EOF
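Because the manifest must be tab separated, it can be worth checking the field count of each line before submitting. A minimal sketch using awk, shown here with an illustrative one-line manifest:

```shell
# Count tab-separated fields per manifest line; expect 5 fields for
# paired-end samples or 3 for single-end. Illustrative manifest only.
printf './fq/s_left.fastq\t0\t./fq/s_right.fastq\t0\tsample-1\n' > config.example
awk -F'\t' 'NF != 5 && NF != 3 { bad++ } END { print bad+0 " malformed lines" }' config.example
rm config.example
```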

To run Rail-RNA in local mode on a single compute node create a batch script similar to the following example:

#! /bin/bash
#SBATCH --mem=10g
#SBATCH --cpus-per-task=4
#SBATCH --gres=lscratch:20

module load rail-rna || exit 1

dm3=/fdb/igenomes/Drosophila_melanogaster/UCSC/dm3/Sequence/
rail-rna go local -x $dm3/BowtieIndex/genome,$dm3/Bowtie2Index/genome \
   -o out --log logs -p $SLURM_CPUS_PER_TASK --scratch /lscratch/$SLURM_JOBID \
   -m config

Submit to the queue with sbatch:

biowulf$ sbatch rail_local.sh

Using local scratch is optional in local mode.

Parallel job on Biowulf

For larger numbers of samples, Rail-RNA can make use of multiple compute nodes. It does so by connecting as a client to an existing ipyparallel Python cluster. A batch script for Rail-RNA in parallel mode therefore needs to first create an ad hoc ipyparallel cluster. A script to run the example above in parallel mode would look similar to the following:

#! /bin/bash
#SBATCH --ntasks=32
#SBATCH --partition=ibqdr
#SBATCH --cpus-per-task=2
#SBATCH --mem=21g
#SBATCH --exclusive
#SBATCH --time=20
#SBATCH --gres=lscratch:400

# need to increase the limit of processes per user
ulimit -S -u 4096

# the ipython parallel profile
profile=job_${SLURM_JOB_ID}
profile_d=${PWD}/${profile}
# uncomment the following line if you would like to automatically
# delete the profile directory at the end of the run
#trap "sleep 30; rm -rf ${profile_d}" EXIT

module load rail-rna || exit 1

echo "nodes: ${SLURM_NODELIST}"
echo "Launching controller and creating profile '${profile}'"
ipcontroller --init --profile-dir=${profile_d} --ip="*" --log-to-file --ping 6000 &
sleep 10

echo "Launching engines with srun"
srun ipengine --profile-dir=${profile_d} --location=$(hostname) --log-to-file &
sleep 45

echo -e "\n\nRunning rail"
echo "--------------------------------------------------------------------------------"
dm3=/fdb/igenomes/Drosophila_melanogaster/UCSC/dm3/Sequence/
rail-rna go parallel -x $dm3/BowtieIndex/genome,$dm3/Bowtie2Index/genome \
   -o out --log logs --scratch /lscratch/$SLURM_JOBID \
   --max-task-attempts 10 \
   --gzip-intermediates \
   --ipcontroller-json ${profile_d}/security/ipcontroller-client.json \
   -m config
echo "--------------------------------------------------------------------------------"

Important notes

In our tests, rail-rna processed about 0.1 Gbases of human paired-end RNA-Seq data per task per hour.
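That figure can be used for rough wall-time estimates. A back-of-the-envelope sketch with hypothetical numbers (200 Gbases of input across 32 tasks):

```shell
# Rough runtime estimate from the ~0.1 Gbases/task/hour figure above.
# gbases and tasks are hypothetical values chosen only for illustration.
gbases=200
tasks=32
awk -v g=$gbases -v t=$tasks \
    'BEGIN { printf "estimated wall time: %.1f hours\n", g / (t * 0.1) }'
```

With these numbers the estimate is 62.5 hours; scale the --time request accordingly, with headroom.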

Submit to the queue with sbatch:

biowulf$ sbatch rail_parallel.sh

Note that in this case the job is run on an InfiniBand partition since it now spans multiple nodes.