Biowulf High Performance Computing at the NIH
casper on Biowulf

From the casper home page:

CASPER (Context-Aware Scheme for Paired-End Read) is a state-of-the-art merging tool in terms of accuracy and robustness. Using this sophisticated merging method, we can obtain elongated reads from the forward and reverse reads.
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

biowulf$ sinteractive --cpus-per-task=4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load casper
[user@cn3144]$ # 100 nt paired-end reads from fragments sized between 160 and 190 nt,
[user@cn3144]$ # so they all overlap (simulated data)
[user@cn3144]$ cp -r $CASPER_TEST_DATA/A4 .
[user@cn3144]$ ls -lh A4
total 424M
-rw-rw-r-- 1 user group 211M Feb 13  2014 A4_1.fastq
-rw-rw-r-- 1 user group 211M Feb 13  2014 A4_2.fastq
-rw-rw-r-- 1 user group 4.1K Feb 13  2014 A4_reference.fasta
-rw-rw-r-- 1 user group  413 Feb 13  2014 README

[user@cn3144]$ casper -t 4 -o A4 A4/A4_1.fastq A4/A4_2.fastq
=============================================================================
[CASPER] Context-Aware Scheme for Paired-End Read

  Input Files
     -  Forward file     : A4/A4_1.fastq
     -  Reverse file     : A4/A4_2.fastq

  Parameters
     -  Number of threads for parallel processing : 4
     -  K-mer size                                : 17
     -  Threshold for difference of quality score : 19
     -  Threshold for mismatching ratio           : 0.5
     -  Minimum length of overlap                 : 10
     -  Using Jellyfish                           : true

  K-mers : Jellyfish
     -  jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_1.fastq
     -  jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_2.fastq

  Output Files
     -  Merged output file            : A4.fastq

  Merging Result Statistics
     -  Total number of reads     :    1000000
     -  Number of merged reads    :     999936 (99.99%)
     -  Number of unmerged reads  :         64 (0.01%)
     -  TIME for total processing :     35.592 sec
=============================================================================
[user@cn3144]$ # creates output file A4.fastq
[user@cn3144]$ ls -lh
total 358M
drwxrwsr-x 2 user group 4.0K Feb 13  2014 A4
-rw-r--r-- 1 user group 357M Mar 22 09:21 A4.fastq
[user@cn3144]$ exit
[user@biowulf]$
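The statistics above report ~1,000,000 input pairs and 999,936 merged reads. Since a FASTQ record is always 4 lines, those counts can be sanity-checked directly from the files. The `count_reads` helper below is our own sketch (not part of CASPER); the toy FASTQ stands in for the real `A4.fastq` output:

```shell
# count_reads: number of records in a FASTQ file (4 lines per record).
# This helper is illustrative, not part of CASPER itself.
count_reads() { echo $(( $(wc -l < "$1") / 4 )); }

# Toy example: write a 2-read FASTQ and count it.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > toy.fastq
count_reads toy.fastq    # prints 2
```

On the cluster, `count_reads A4.fastq` should match the "Number of merged reads" line that CASPER prints.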

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. casper.sh). For example:

#!/bin/bash

module load casper/0.8.2 || exit 1
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 \
    -o A4 A4/A4_1.fastq A4/A4_2.fastq

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=4 --mem=6g casper.sh
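Slurm's `sbatch` prints a line of the form "Submitted batch job NNN"; the job ID can be captured from it to check on the job later. This is a sketch of standard Slurm usage (the `echo` below stands in for a real submission on the login node):

```shell
# Capture the job ID from sbatch's standard "Submitted batch job NNN" message.
# Here an echo stands in for: sbatch --cpus-per-task=4 --mem=6g casper.sh
submit_msg="Submitted batch job 46116226"
jobid=$(echo "$submit_msg" | awk '{print $4}')
echo "$jobid"    # 46116226

# On the cluster you could then check the job with, e.g.:
#   squeue -j "$jobid"
```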
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. casper.swarm). For example:

casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample1 sample1/sample1_1.fastq sample1/sample1_2.fastq
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample2 sample2/sample2_1.fastq sample2/sample2_2.fastq
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample3 sample3/sample3_1.fastq sample3/sample3_2.fastq
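Since the three swarmfile lines differ only in the sample name, they can be generated with a small loop rather than written by hand. This is our own sketch, assuming each sample lives in a directory named after it containing `<sample>_1.fastq` and `<sample>_2.fastq`:

```shell
# Generate one casper command line per sample directory.
# Single quotes keep $SLURM_CPUS_PER_TASK literal so that swarm's
# subjobs expand it at run time.
for d in sample1 sample2 sample3; do
    printf 'casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o %s %s/%s_1.fastq %s/%s_2.fastq\n' \
        "$d" "$d" "$d" "$d" "$d"
done > casper.swarm

wc -l casper.swarm    # should list 3 lines
```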

Submit this job using the swarm command.

swarm -f casper.swarm -g 6 -t 4 --module casper/0.8.2
where
  -g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
  -t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
  --module casper   Loads the casper module for each subjob in the swarm