High-Performance Computing at the NIH
mirdeep on Biowulf

miRDeep2 uses the distribution of next-generation sequencing reads in the genome, along with RNA structure prediction, to discover known and novel miRNAs and quantitate their expression. miRDeep2 represents a complete overhaul of the original miRDeep tool.

miRDeep2 is a collection of Perl scripts tied together by 3 main scripts: mapper.pl, quantifier.pl, and miRDeep2.pl.

Of these, mapper.pl and quantifier.pl may run multithreaded bowtie subprocesses. The -o option sets the thread count for mapper.pl, and -T sets it for quantifier.pl. Because the pipeline mixes single-threaded and multithreaded steps, it is best to run the individual steps separately rather than combining them into a single batch job.
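
A minimal sketch of keeping the -o and -T values in step with the CPUs actually allocated. It assumes the step runs under Slurm, which sets SLURM_CPUS_PER_TASK when CPUs are requested with -c, and it falls back to 2 threads otherwise. The helper name mirdeep_threads and the fallback value are illustrative choices for this sketch, not part of miRDeep2.

```shell
#!/bin/bash
# Hypothetical helper: report how many bowtie threads to request.
# Uses SLURM_CPUS_PER_TASK when Slurm has set it, otherwise a fallback
# (default 2, an assumption made for this sketch).
mirdeep_threads() {
    fallback="${1:-2}"
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-$fallback}"
}

threads="$(mirdeep_threads)"

# Print the relevant flags instead of running the tools, so the sketch is
# safe to try outside of a job allocation.
echo "mapper.pl ... -o ${threads}"
echo "quantifier.pl ... -T ${threads}"
```

Deriving both values from one place avoids asking bowtie for more threads than the job was given.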


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$

[user@cn3144 ~]$ module load mirdeep
[user@cn3144 ~]$ cp -r $MIRDEEP_TEST_DATA tutorial
[user@cn3144 ~]$ cd tutorial
[user@cn3144 ~]$ # the following step is not necessary if a prebuilt
[user@cn3144 ~]$ # genome index is available
[user@cn3144 ~]$ bowtie-build cel_cluster.fa cel_cluster

[user@cn3144 ~]$ mapper.pl reads.fa -c -j -k TCGTATGCCGTCTTCTGCTTGT  \
  -l 18 -m -p cel_cluster -o 2 -v -n \
  -s reads_collapsed.fa \
  -t reads_collapsed_vs_genome.arf
.....
[user@cn3144 ~]$ miRDeep2.pl reads_collapsed.fa cel_cluster.fa \
  reads_collapsed_vs_genome.arf \
  mature_ref_this_species.fa \
  mature_ref_other_species.fa \
  precursors_ref_this_species.fa \
  -t C.elegans 2> report.log

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
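
Because miRDeep2.pl reads the files written by mapper.pl, it can help to confirm that those files exist and are non-empty before starting the last step. The following is a minimal sketch of such a guard. The function name require_files is hypothetical, and the file names are the ones used in the sample session above.

```shell
#!/bin/bash
# Hypothetical guard: succeed only if every named file exists and is non-empty.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Check the two mapper.pl outputs that miRDeep2.pl takes as input.
if require_files reads_collapsed.fa reads_collapsed_vs_genome.arf; then
    echo "mapper.pl outputs found; ready for miRDeep2.pl"
else
    echo "rerun mapper.pl before starting miRDeep2.pl" >&2
fi
```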

Batch job
Most jobs should be run as batch jobs.

The single-threaded and multithreaded steps of the miRDeep2 pipeline could be tied together with snakemake or a similar workflow tool capable of submitting batch jobs. For the example here, we will simply write a script that uses job dependencies to chain the three steps of the pipeline:

#! /bin/bash
cd /data/$USER/test_data/mirdeep
cp -r /usr/local/apps/mirdeep/2.0.0.7/tutorial .
cd tutorial
module load mirdeep
bowtie-build cel_cluster.fa cel_cluster

# create files for each job in the pipeline
cat > step1.sh <<'EOF'
#! /bin/bash
#SBATCH --job-name=mirdeep_s1
mapper.pl reads.fa -c -j -k TCGTATGCCGTCTTCTGCTTGT  \
  -l 18 -m -p cel_cluster -v \
  -o ${SLURM_CPUS_PER_TASK} \
  -s reads_collapsed.fa \
  -t reads_collapsed_vs_genome.arf
EOF

cat > step2.sh <<'EOF'
#! /bin/bash
#SBATCH --job-name=mirdeep_s2
quantifier.pl -p precursors_ref_this_species.fa \
  -m mature_ref_this_species.fa \
  -T ${SLURM_CPUS_PER_TASK} \
  -r reads_collapsed.fa -t cel -y 16_19
EOF

cat > step3.sh <<'EOF'
#! /bin/bash
#SBATCH --job-name=mirdeep_s3
miRDeep2.pl reads_collapsed.fa cel_cluster.fa \
  reads_collapsed_vs_genome.arf \
  mature_ref_this_species.fa \
  mature_ref_other_species.fa \
  precursors_ref_this_species.fa \
  -t C.elegans 2> report.log
EOF

# set up the pipeline run
jid1=$(sbatch -c4 step1.sh)
jid2=$(sbatch -c4 --dependency=afterok:${jid1} step2.sh)
jid3=$(sbatch --dependency=afterok:${jid2} step3.sh)

The script above will submit all three steps as separate jobs. Each job will only execute if the previous job finished successfully. Of course, the same can be achieved by manually creating a batch script for each step and submitting them individually as batch jobs.
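
The script above uses the output of sbatch directly as a job ID, which assumes sbatch prints nothing but the ID. A stock Slurm sbatch prints "Submitted batch job <id>" unless --parsable is given, in which case it prints the ID, optionally followed by ";" and a cluster name. The sketch below normalizes these forms and assembles a dependency option. Both helper names are hypothetical, and the commands are only printed, not submitted. For reference, an afterok dependency starts a job only if the listed jobs exited with code 0, whereas afterany starts it once they have ended regardless of exit status.

```shell
#!/bin/bash
# Hypothetical helper: reduce sbatch output to a bare job ID.
# Accepts "12345", "12345;cluster", or "Submitted batch job 12345".
job_id_from() {
    printf '%s\n' "$1" | sed -e 's/^Submitted batch job //' -e 's/;.*$//'
}

# Hypothetical helper: build a --dependency option from a dependency type
# followed by one or more job IDs.
dependency_opt() {
    dep_type="$1"
    shift
    opt="--dependency=${dep_type}"
    for id in "$@"; do
        opt="${opt}:${id}"
    done
    printf '%s\n' "$opt"
}

# Dry run: show the command that would submit the second step.
jid1="$(job_id_from 'Submitted batch job 46116226')"
echo "sbatch -c4 $(dependency_opt afterok "$jid1") step2.sh"
```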