casper on Biowulf
From the casper home page:
CASPER (Context-Aware Scheme for Paired-End Read) is a state-of-the-art merging tool in terms of accuracy and robustness. Using this sophisticated merging method, elongated reads can be obtained from the forward and reverse reads.
Documentation
Important Notes
- Module Name: casper (see the modules page for more information)
- casper is a multithreaded application. Make sure to match the number of CPUs requested with the number of threads (see the example after this list).
- Example files in $CASPER_TEST_DATA
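Inside a Slurm allocation, the $SLURM_CPUS_PER_TASK environment variable holds the number of allocated CPUs, so a convenient way to keep threads and CPUs in sync is to pass it directly to -t. A minimal sketch (the input and output names here are placeholders, not files shipped with the module):

casper -t $SLURM_CPUS_PER_TASK -o out sample_1.fastq sample_2.fastq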
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
biowulf$ sinteractive --cpus-per-task=4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load casper
[user@cn3144]$ # 100nt paired end reads from fragments sized between 160 and 190 nts
[user@cn3144]$ # so they all overlap (simulated data)
[user@cn3144]$ cp -r $CASPER_TEST_DATA/A4 .
[user@cn3144]$ ls -lh A4
total 424M
-rw-rw-r-- 1 user group 211M Feb 13 2014 A4_1.fastq
-rw-rw-r-- 1 user group 211M Feb 13 2014 A4_2.fastq
-rw-rw-r-- 1 user group 4.1K Feb 13 2014 A4_reference.fasta
-rw-rw-r-- 1 user group  413 Feb 13 2014 README
[user@cn3144]$ casper -t 4 -o A4 A4/A4_1.fastq A4/A4_2.fastq
=============================================================================
[CASPER] Context-Aware Scheme for Paired-End Read
Input Files
  - Forward file : A4/A4_1.fastq
  - Reverse file : A4/A4_2.fastq
Parameters
  - Number of threads for parallel processing : 4
  - K-mer size : 17
  - Threshold for difference of quality score : 19
  - Threshold for mismatching ratio : 0.5
  - Minimum length of overlap : 10
  - Using Jellyfish : true
K-mers : Jellyfish
  - jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_1.fastq
  - jellyfish count -m 17 -L 2 -o A4jellykmer -c 3 -s 10M -t 4 A4/A4_2.fastq
Output Files
  - Merged output file : A4.fastq
Merging Result Statistics
  - Total number of reads      : 1000000
  - Number of merged reads     : 999936 (99.99%)
  - Number of unmerged reads   : 64 (0.01%)
  - TIME for total processing  : 35.592 sec
=============================================================================
[user@cn3144]$ # creates output file A4.fastq
[user@cn3144]$ ls -lh
total 358M
drwxrwsr-x 2 user group 4.0K Feb 13 2014 A4
-rw-r--r-- 1 user group 357M Mar 22 09:21 A4.fastq
[user@cn3144]$ exit
[user@biowulf]$
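Before exiting the session, an optional sanity check is to compare the reported merged-read count against the output file itself. Each FASTQ record spans four lines, so dividing the line count by four gives the number of records (standard shell arithmetic, not part of the casper output; assumes the file contains only complete records):

echo $(( $(wc -l < A4.fastq) / 4 ))   # should match the number of merged reads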
Batch job
Most jobs should be run as batch jobs.
Create a batch script (e.g. casper.sh). For example:
#! /bin/bash
module load casper/0.8.2 || exit 1
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 \
    -o A4 A4/A4_1.fastq A4/A4_2.fastq
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=4 --mem=6g casper.sh
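Once the job finishes, its state and exit code can be checked with the standard Slurm sacct command (the job ID below is a placeholder for the ID that sbatch prints):

sacct -j 12345678 --format=JobID,State,ExitCode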
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. casper.swarm). For example:
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample1 sample1/sample1_1.fastq sample1/sample1_2.fastq
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample2 sample2/sample2_1.fastq sample2/sample2_2.fastq
casper -t $SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample3 sample3/sample3_1.fastq sample3/sample3_2.fastq
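For a larger set of samples following the same sampleN/sampleN_{1,2}.fastq layout, the swarmfile can be generated with a small shell loop rather than written by hand. A sketch under that naming assumption (note the escaped $ so that SLURM_CPUS_PER_TASK expands at run time, not when the file is written):

for i in 1 2 3; do
    echo "casper -t \$SLURM_CPUS_PER_TASK -w 15 -k 20 -o sample$i sample$i/sample${i}_1.fastq sample$i/sample${i}_2.fastq"
done > casper.swarm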
Submit this job using the swarm command.
swarm -f casper.swarm -g 6 -t 4 --module casper/0.8.2
where
-g #            | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #            | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module casper | Loads the casper module for each subjob in the swarm