High-Performance Computing at the NIH
casper on Biowulf & Helix

Description

From the casper home page:

CASPER (Context-Aware Scheme for Paired-End Read) is a state-of-the-art merging tool in terms of accuracy and robustness. Using this sophisticated merging method, we could get elongated reads from the forward and reverse reads.

There may be multiple versions of casper available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail casper 

To select a module use

module load casper/[version]

where [version] is the version of choice.

casper is a multithreaded application. Make sure the number of threads passed to casper with -t matches the number of CPUs requested for the job.
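One way to keep the two numbers in sync is to read the thread count from the environment instead of typing it twice. The following is a minimal sketch, assuming the job was started with --cpus-per-task so that Slurm sets SLURM_CPUS_PER_TASK; the input file names and the output prefix are hypothetical placeholders.

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK holds the value given to --cpus-per-task.
# Fall back to 1 thread when it is unset (for example outside of a job).
threads="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical input files and output prefix, for illustration only.
cmd="casper -t $threads -o merged reads_1.fastq reads_2.fastq"

# Print the command rather than running it, so the sketch is safe to try anywhere.
echo "$cmd"
```

With this pattern, changing --cpus-per-task at submission time is enough; the casper command line does not need to be edited.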

Environment variables set

Dependencies

The jellyfish module is loaded automatically

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive and run casper as described above. For example:

biowulf$ sinteractive --cpus-per-task=4
salloc.exe: Pending job allocation 36375834
[...snip...]
node$ module load casper/0.8.2
[+] Loading jellyfish 2.2.6
[+] Loading casper 0.8.2
node$ # 100nt paired end reads from fragments sized between 160 and 190 nts
node$ # so they all overlap (simulated data)
node$ cp -r $CASPER_TEST_DATA/A4 .
node$ ls -lh A4
total 424M
-rw-rw-r-- 1 wresch staff 211M Feb 13  2014 A4_1.fastq
-rw-rw-r-- 1 wresch staff 211M Feb 13  2014 A4_2.fastq
-rw-rw-r-- 1 wresch staff 4.1K Feb 13  2014 A4_reference.fasta
-rw-rw-r-- 1 wresch staff  413 Feb 13  2014 README
node$ casper -t 4 -o A4 A4/A4_1.fastq A4/A4_2.fastq
=============================================================================
[CASPER] Context-Aware Scheme for Paired-End Read

  Input Files
     -  Forward file     : A4/A4_1.fastq
     -  Reverse file     : A4/A4_2.fastq

  Parameters
     -  Number of threads for parallel processing : 4
     -  K-mer size                                : 17
     -  Threshold for difference of quality score : 19
     -  Threshold for mismatching ratio           : 0.5
     -  Minimum length of overlap                 : 10
     -  Using Jellyfish                           : true

  K-mers : Jellyfish
     -  jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_1.fastq
     -  jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_2.fastq

  Output Files
     -  Merged output file            : A4.fastq

  Merging Result Statistics
     -  Total number of reads     :    1000000
     -  Number of merged reads    :     999936 (99.99%)
     -  Number of unmerged reads  :         64 (0.01%)
     -  TIME for total processing :     35.592 sec
=============================================================================
node$ # creates output file A4.fastq
node$ ls -lh
total 358M
drwxrwsr-x 2 wresch staff 4.0K Feb 13  2014 A4
-rw-r--r-- 1 wresch staff 357M Mar 22 09:21 A4.fastq
node$ exit
biowulf$
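
As a quick sanity check on the session above, the number of records in the merged file can be compared with the "Number of merged reads" line in the casper summary. This is a rough sketch, assuming the output is a standard FASTQ file with exactly four lines per record; count_fastq_reads is a hypothetical helper name, not part of casper.

```shell
#!/bin/bash
# Count the records in a FASTQ file, assuming exactly 4 lines per record
# (header, sequence, '+' separator, qualities) and no wrapped sequences.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Only report on the merged output if it is present in the current directory.
if [ -f A4.fastq ]; then
    echo "merged reads in A4.fastq: $(count_fastq_reads A4.fastq)"
fi
```

If the count differs from the figure casper printed, the run was probably interrupted or the file is not in four-line FASTQ format.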

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is casper.batch

module load casper/0.8.2 || exit 1
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 \
    -o A4 A4/A4_1.fastq A4/A4_2.fastq

Submit to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=4 --mem=6g casper.batch
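
When several samples need merging, the body of the batch script can loop over the forward-read files. The sketch below assumes every sample follows the <sample>_1.fastq / <sample>_2.fastq naming of the test data and sits one directory below the working directory; mate_of and prefix_of are hypothetical helper names. It prints the casper commands as a dry run instead of executing them.

```shell
#!/bin/bash
# Derive the reverse-read file and the output prefix from a forward-read path.
# Assumes the <sample>_1.fastq / <sample>_2.fastq naming convention.
mate_of()   { echo "${1%_1.fastq}_2.fastq"; }
prefix_of() { basename "${1%_1.fastq}"; }

threads="${SLURM_CPUS_PER_TASK:-1}"

for fwd in */*_1.fastq; do
    [ -e "$fwd" ] || continue          # the glob matched nothing
    rev=$(mate_of "$fwd")
    if [ ! -e "$rev" ]; then
        echo "skipping $fwd: no mate file $rev" >&2
        continue
    fi
    # Dry run: print the command. Remove 'echo' to actually run casper.
    echo casper -t "$threads" -o "$(prefix_of "$fwd")" "$fwd" "$rev"
done
```

Samples are processed one after another here, each using all allocated CPUs, so the resource request stays the same as for a single sample.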