Picard on Biowulf and Helix

Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.

A while back, Picard changed to a single-jar distribution, so current Picard command lines look like this:

   java [java opts] -jar $PICARDJARPATH/picard.jar command [options]
   java -jar $PICARDJARPATH/picard.jar --help

Note 1

MarkDuplicates tends to spawn multiple garbage-collection threads. It is suggested that users add -XX:ParallelGCThreads=5 to the Picard command and request 6 CPUs for the job (1 for Picard, 5 for garbage collection).

java -Xmx???g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar MarkDuplicates ........
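
For example, a complete MarkDuplicates command might look like the sketch below, where the input, output, and metrics file names are hypothetical and -Xmx8g assumes an 8 GB memory allocation:

   # file names below are hypothetical examples
   java -Xmx8g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar MarkDuplicates \
        I=input.bam \
        O=marked_duplicates.bam \
        M=marked_dup_metrics.txt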

Note 2

For tools that can generate lots of temporary files (such as FastqToSam), or when the error message 'slurmstepd: Exceeded job memory limit at some point' appears, it is suggested to add this option to the Picard command:

TMP_DIR=/lscratch/$SLURM_JOBID
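
For example, a FastqToSam command that writes its temporary files to local scratch might look like the sketch below (the FASTQ, output, and sample names are hypothetical):

   # FASTQ, output, and sample names below are hypothetical examples
   java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar FastqToSam \
        FASTQ=sample_R1.fastq.gz \
        FASTQ2=sample_R2.fastq.gz \
        OUTPUT=unaligned.bam \
        SAMPLE_NAME=sample1 \
        TMP_DIR=/lscratch/$SLURM_JOBID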

Then submit with:

sbatch --cpus-per-task=6 --mem=?g --gres=lscratch:200

or

swarm -f swarmfile -t 6 -g ? --gres=lscratch:200

Replace the ? above with the required memory in GB.
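
For example, a batch job requesting 8 GB of memory (an illustrative value) along with 6 CPUs and 200 GB of local scratch would be submitted with:

   # 8g is an illustrative value; size it to your data
   sbatch --cpus-per-task=6 --mem=8g --gres=lscratch:200 myscript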

Environment variables set

$PICARDJARPATH: the directory containing picard.jar
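
To confirm what the module sets, load it and inspect the variable (a minimal check):

   module load picard
   echo $PICARDJARPATH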

Running on Helix

Sample session:

helix$ module load picard
helix$ java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]
Submitting a single batch job

1. Create a script file containing lines similar to those below. Modify the paths to your program and input files before running.

#!/bin/bash 

module load picard   # sets $PICARDJARPATH
cd /data/$USER/somewhereWithInputFile   # directory containing the input files
java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]
....
....

2. Submit the script on Biowulf.

$ sbatch --cpus-per-task=6 myscript

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g., /data/username/cmdfile). Here is a sample file:

cd /data/user/run1/; java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]
cd /data/user/run2/; java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]
........
cd /data/user/run10/; java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]

The -f flag is required and specifies the swarm command file name.
The -g flag specifies the memory in GB needed for each Picard task.

Submit the swarm job:

$ swarm -f swarmfile -g 4 -t 6 --module picard

If a task needs more memory than the default (1.5 GB per line in the swarm file), use the -g flag and, at the same time, change -Xmx4g in your commands to the corresponding value (-Xmx4g becomes -Xmx10g in this example):

$ swarm -g 10 -f swarmfile --module picard
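
The local-scratch allocation from Note 2 can be combined with the higher memory request in the same command:

   # combines the -g/-t settings above with the lscratch allocation from Note 2
   swarm -f swarmfile -g 10 -t 6 --gres=lscratch:200 --module picard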

For more information on running swarm, see swarm.html.


Running an interactive job

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node; instead, allocate an interactive node as described below and run the job there.

[user@biowulf]$ sinteractive --cpus-per-task=6 

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load picard

[user@pXXXX]$ java -Xmx4g -XX:ParallelGCThreads=5 -jar $PICARDJARPATH/picard.jar command [options]

[user@pXXXX]$ exit
[user@biowulf]$ 
Documentation

http://broadinstitute.github.io/picard/