High-Performance Computing at the NIH
GATK on Biowulf
Description

The Genome Analysis Toolkit (GATK) is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit includes a wide variety of tools, with a focus on variant discovery and genotyping as well as emphasis on data quality assurance.

With the release of GATK 3.5-0, MuTect2 and ContEst are included as part of GATK.

There are multiple versions of GATK available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail GATK 

To select a module use

module load GATK/[version]

where [version] is the version of choice.
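
For example, to load a specific version and verify which jar it points to ($GATK_JAR is used throughout the examples on this page):

biowulf$ module load GATK/3.5-0
biowulf$ module list          # confirm the loaded GATK version
biowulf$ echo $GATK_JAR       # path to the GATK jar set by the module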

Environment variables set

    $GATK_HOME
    $GATK_JAR
    $GATK_KEY

Dependencies

References

Documentation

Release notes

Resource bundles

GATK Resource bundles contain reference genome assemblies (.fa, .fai, .dict), variation data (dbSNP, HapMap, 1000 genomes, ...), and some genotype calls for a gold standard genome.

Resource bundles are available at

/fdb/GATK_resource_bundle

Environment module, memory, scratch, and open files

The environment module for GATK (module load GATK or module load GATK/version) sets environment variables such as $GATK_JAR and $GATK_HOME and puts the GATK wrapper script described below on the path.

The GATK wrapper script provides a simplified interface to the GATK tools:

usage: GATK [options] [-m MEM] tool [gatk_tool_args]

This GATK wrapper script invokes GATK tools with a more friendly interface.
For documentation on each of the tools and their arguments see
https://www.broadinstitute.org/gatk/guide/tooldocs/

positional arguments:
  tool               GATK tool. Use 'LIST' or 'list' to get the current list
                     of available tools

optional arguments:
  -h, --help         show this help message and exit
  -m MEM, --mem MEM  maximal heap memory allowed for the Java process.
                     *Format*: <number>[kmg]. *Examples*: 1024k, 1000m, 2g,
                     32g. Passed to Java with -Xmx (default: 2g)
  -p N, --pgct N     maximal number of parallel GC threads allowed for Java
                     VM. Defaults to 1 less than the number of allocated CPUs.
                     (default: -1)
  -n, --dry-run      show command that would have been executed and set up
                     temp dir (default: False)

It takes care of finding the GATK jar and directing temporary files to either /lscratch/${SLURM_JOBID} (when local scratch has been allocated) or /scratch/$USER.

Whenever possible, please use lscratch for temporary files. It will be more performant and reduce strain on the shared filesystems.
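
The -n/--dry-run option described in the help above shows the java command the wrapper would have executed, without actually running the tool. For example, using the small example data referenced later on this page:

node$ GATK -n -m 8g CountReads -R exampleFASTA.fasta -I exampleBAM.bam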

When not using the GATK wrapper script, the basic GATK command should be

java -Xmx####g -Djava.io.tmpdir=/lscratch/${SLURM_JOBID} \
  -XX:ParallelGCThreads=## -jar $GATK_JAR -T <Toolname> [options]

which requires requesting lscratch space at job submission. If you have specific reasons not to use local scratch, and will be running very few simultaneous jobs, use:

java -Xmx####g -Djava.io.tmpdir=/scratch/$USER -XX:ParallelGCThreads=## \
  -jar $GATK_JAR -T <Toolname> [options]

Note that /lscratch is the node-local scratch space. See the user guide for more information about allocating local scratch space.

Limit memory with -Xmx

The -Xmx option to Java specifies the maximal size of the memory allocation pool. It accepts suffixes k, m, and g. This is the option used by the wrapper script to limit memory usage.

Note that you should request additional memory (1-2GB) for batch/interactive jobs on top of the memory given to java to ensure that the whole java process stays below the allocated memory limit.
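
For example, a run that gives Java an 8 GB heap (-Xmx8g) could be submitted with 10 GB of allocated memory; the extra headroom covers JVM overhead beyond the heap (the file name myscript.sh is just a placeholder):

# inside the batch script (myscript.sh): give java an 8 GB heap
java -Xmx8g -Djava.io.tmpdir=/lscratch/${SLURM_JOBID} -jar $GATK_JAR -T <Toolname> [options]

# at submission: allocate 10 GB plus local scratch so the whole process stays under the limit
biowulf$ sbatch --mem=10g --gres=lscratch:10 myscript.sh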

Redirect temporary files to /lscratch/${SLURM_JOBID} or /scratch/$USER with -Djava.io.tmpdir

GATK writes some temporary files during its run. By default, Java writes temp files to /tmp. However, this is not advisable because /tmp is used by the operating system and is very limited in space. The temp files should be redirected to /lscratch/${SLURM_JOBID} (if allocating local disk space) or /scratch/$USER or /scratch/$USER/unique_dir, by adding one of

-Djava.io.tmpdir=/scratch/$USER/unique_tmp_dir
-Djava.io.tmpdir=/lscratch/${SLURM_JOBID}

to the java command as shown above. The wrapper script takes care of this automatically. Note that the temp dir must already exist if specifying it manually. The wrapper script will automatically create it if necessary.
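
For example, when calling java directly without local scratch, a per-user temp directory can be created under /scratch and passed explicitly (the directory name gatk_tmp is arbitrary):

node$ mkdir -p /scratch/$USER/gatk_tmp
node$ java -Xmx8g -Djava.io.tmpdir=/scratch/$USER/gatk_tmp -jar $GATK_JAR -T <Toolname> [options]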

Limit the number of parallel garbage collection threads

The Java VM has a tendency to start many threads including parallel threads for garbage collection. This can be aggravated when there is not sufficient memory allocated to the JVM. Use -XX:ParallelGCThreads=## to limit the number of parallel garbage collection threads to the number of allocated CPUs minus 1 (or less if the GATK tool used does explicit multithreading itself). The GATK wrapper by default sets the number of GC threads to one less than the number of allocated CPUs.
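
When calling java directly (i.e. not via the wrapper), the GC thread count can be derived from the Slurm allocation; a minimal sketch:

# SLURM_CPUS_PER_TASK is set when the job is submitted with --cpus-per-task (-c)
GCT=$(( SLURM_CPUS_PER_TASK - 1 ))
java -Xmx8g -XX:ParallelGCThreads=$GCT -Djava.io.tmpdir=/lscratch/${SLURM_JOBID} \
    -jar $GATK_JAR -T <Toolname> [options]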

Phone home

Note that GATK has a phone home feature which will send some generic information about each GATK run to the developers at the Broad Institute. This feature will not work on the Biowulf computational nodes, which are on a private network. This will not affect your runs in most cases. If you're getting an error related to 'phone-home', add -et NO_ET -K $GATK_KEY to your GATK command line to disable the phone-home feature.

Open files

GATK opens many files simultaneously. It can happen that you hit the open-file limit on Biowulf, and see errors like:

##### ERROR MESSAGE: Couldn't read file .... (Too many open files)

The most important change you can make is to reduce the multithreading. If your GATK command has any multithreading flags turned on (e.g. '-nt' or '-nct'), reduce the number of threads to 1.

The hard open-file limit is set to 4096 on the Biowulf nodes, but the default soft limit is still 1024. To open more than 1024 files simultaneously, you will need to add the following to your batch script:

 ulimit -n 4096 
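
For example, near the top of a batch script (the current soft and hard limits can be checked with ulimit -Sn and ulimit -Hn):

#! /bin/bash
ulimit -n 4096      # raise the soft open-file limit up to the 4096 hard limit
ulimit -Sn          # print the new soft limit to the job output for the record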

Here are some other possibilities (thanks to Ryan Neff, NHGRI):

Relevant discussions from the GATK mailing list:

GATK Parallelization

Some GATK tools can make use of more than one CPU using option -nt or -nct. See the GATK documentation for more detail.
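
For tools that support it, the thread count can be tied to the number of allocated CPUs. A sketch using RealignerTargetCreator (which accepts -nt), with the same placeholder file names as the swarm example below:

GATK -m 7g RealignerTargetCreator \
    -R ref.fasta \
    -I input.bam \
    -o output.intervals \
    --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz \
    -nt $SLURM_CPUS_PER_TASK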

Running a single GATK batch job on Biowulf

Set up a batch script along the following lines:

#! /bin/bash
# gatk.bat
set -e

module load GATK/3.5-0 || exit 1

cd /data/$USER/test_data/GATK

# use 8 GB memory
java -Xmx8g -Djava.io.tmpdir=/lscratch/$SLURM_JOBID -jar $GATK_JAR \
  -T CountReads \
  -R exampleFASTA.fasta \
  -I exampleBAM.bam \
  -et NO_ET -K $GATK_KEY

Submit to the queue with sbatch:

biowulf$ sbatch --mem=8g --gres=lscratch:10 gatk.bat

Since the simple CountReads tool does not support multithreading, only a single core (the default) is requested.

Running a swarm of GATK batch jobs on Biowulf

The following swarm command file would run RealignerTargetCreator, which determines which genomic intervals require realignment, for a number of samples:

# this file is gatk-swarm
GATK -m 7g RealignerTargetCreator \
  -R ref.fasta \
  -I input1.bam \
  -o output1.intervals \
  --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz
GATK -m 7g RealignerTargetCreator \
  -R ref.fasta \
  -I input2.bam \
  -o output2.intervals \
  --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz
[...]

These command lines use the wrapper's -m 7g option, which gives the Java process a 7 GB heap (-Xmx7g). Thus, this swarm job should be submitted with the '-g 7' flag to tell swarm that each process will require 7 GB of memory.

biowulf$ swarm -g 7 -f gatk-swarm --module GATK/3.5-0
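
For more than a handful of samples, the swarm command file can be generated with a short shell loop rather than written by hand. A sketch assuming input files named input1.bam, input2.bam, ... in the current directory:

for bam in input*.bam; do
    # derive the sample number from the file name (input1.bam -> 1)
    n=${bam#input}; n=${n%.bam}
    echo "GATK -m 7g RealignerTargetCreator -R ref.fasta -I $bam -o output${n}.intervals --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz"
done > gatk-swarm

Since swarm treats each line as one command, the generated file lists each command on a single line.
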
Running an interactive job on Biowulf

It may be useful for debugging purposes to run GATK jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive --cpus-per-task=4 --gres=lscratch:10
salloc.exe: Granted job allocation 16535
slurm stepprolog here!
Begin slurm taskprolog!
End slurm taskprolog!
p999$ module load GATK
p999$ cd /data/$USER/test_data/GATK
p999$ java -Xmx7g -Djava.io.tmpdir=/lscratch/$SLURM_JOBID -jar $GATK_JAR -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam --known indel_calls.vcf -nt $SLURM_CPUS_PER_TASK 
[...etc...]

p999$ exit
exit
srun: error: p999: task 0: Exited with exit code 1
slurm stepepilog here!
salloc.exe: Relinquishing job allocation 16535
biowulf$

Make sure to exit the job once you have finished your run.

LiftOverVCF -- converts a VCF file from one reference build to another

LiftOverVCF has been deprecated with release 3.5-0. Use picard's LiftoverVCF instead.

The procedure for lifting over a VCF file from one genome build to another in GATK is a three-step process: (1) LiftoverVCF, (2) sort the VCF, and (3) FilterLiftedVCF. This process is automated by the liftOverVCF script.

Usage: liftOverVCF
   -vcf            <input vcf>
   -gatk           <path to gatk trunk>
   -chain          <chain file>
   -newRef         <path to new reference prefix;
                    we will need newRef.dict, .fasta, and .fasta.fai>
   -oldRef         <path to old reference prefix; we will need oldRef.fasta>
   -out            <output vcf>
   -recordOriginalLocation
                   <Should we record what the
                    original location was in the INFO field?; defaults to false>

The example below uses liftOverVCF to switch from hg18 to hg19.

Required data:

Example:

liftOverVCF  \
    -vcf test.hg18.vcf.gz \
    -chain /fdb/GATK_resource_bundle/liftover_chains/hg18ToHg19.over.chain \
    -newRef /fdb/GATK_resource_bundle/hg19/ucsc.hg19 \
    -oldRef /fdb/GATK_resource_bundle/hg18/Homo_sapiens_assembly18 \
    -out test.hg19.vcf

Walkthrough analysis of a single exome seq data set

Here is an example walkthrough starting from paired-end 100x coverage exome seq fastq files and ending with a raw (unfiltered) VCF file. Note that this is not meant as a drop-in pipeline - Queue would be better suited for that. It is simply an illustration of some of the more commonly used tools in the GATK best practices workflow. This walkthrough uses the gcat_set_053 100x exome seq data obtained from GCAT. The data was originally collected by M. Linderman at the Icahn Institute of Genomics and Multiscale Biology at Mount Sinai for the reference genome NA12878. The GCAT site allows uploading of finished variant calls, which are then compared to gold standard calls as well as to other workflows (see the GCAT publication).


# The data used for this walkthrough was obtained from bioplanet/GCAT. This is
# 100x paired-end exome sequencing data from the Genome in a Bottle reference
# sample (NA12878).

################################################################################
#                                  ALIGNMENT                                   #
################################################################################

# align with bwa mem, filter to MAPQ >= 10, and sort
cat > aln.batch <<"EOF"
#! /bin/bash
set -e
set -o pipefail

function fail {
    echo "$@"
    exit 1
}

cd /data/$USER/test_data/gatk || fail "Could not find test data directory"
module load bwa || fail "could not load bwa module"
module load samtools/1.2 || fail "could not load samtools module"

bwa mem -M -t $(( SLURM_CPUS_PER_TASK - 4 )) \
    -R '@RG\tID:1\tSM:gcat_set_053\tPL:ILLUMINA' \
    /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa \
    fastq/gcat_set_053_1.fastq.gz \
    fastq/gcat_set_053_2.fastq.gz \
  | samtools view -Sbu -q 10 - \
  | samtools sort -@ 3 -m 5g -O bam -T bam/gcat_set_053 - \
  > bam/gcat_set_053.bam
EOF


# mark duplicates with picard
cat > dedup.batch <<"EOF"
#! /bin/bash
set -e

function fail {
    echo "$@"
    exit 1
}

cd /data/$USER/test_data/gatk || fail "Could not find test data directory"
module load picard/1.139 || fail "could not load picard module"
module load samtools/1.2 || fail "could not load samtools module"
java -Xmx28g -Djava.io.tmpdir=./tmp -XX:ParallelGCThreads=2 \
    -jar $PICARDJARPATH/picard.jar \
    MarkDuplicates \
    I=bam/gcat_set_053.bam \
    O=bam/gcat_set_053_dedup.bam \
    M=bam/gcat_set_053.markDuplicates \
    AS=true \
  && mv bam/gcat_set_053_dedup.bam bam/gcat_set_053.bam
samtools index bam/gcat_set_053.bam  
EOF

################################################################################
#                                 RE-ALIGNMENT                                 #
################################################################################

# determine which regions need to be realigned
# use the 1000 genomes indel file as the known file
cat > rtc.batch <<"EOF"
#! /bin/bash
#SBATCH -J rtc
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK/3.5-0 || fail "could not load GATK module"
cd /data/$USER/test_data/gatk
GATK RealignerTargetCreator \
   -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
   -I bam/gcat_set_053.bam \
   -o bam/gcat_set_053.forIndelRealigner.intervals \
   -known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz \
   -nt $SLURM_CPUS_PER_TASK
EOF

# indel realigner
# using output from RealignerTargetCreator
cat > realign.batch <<"EOF"
#! /bin/bash
#SBATCH -J realign
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK/3.5-0 || fail "could not load GATK module"
cd /data/$USER/test_data/gatk
GATK --mem 4g IndelRealigner \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    -I bam/gcat_set_053.bam \
    -known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz \
    -targetIntervals bam/gcat_set_053.forIndelRealigner.intervals \
    -o bam/gcat_set_053_realigned.bam
EOF

################################################################################
#                              BASE RECALIBRATION                              #
################################################################################

# base recalibrator - step 1
cat > recalib1.batch <<"EOF"
#! /bin/bash
#SBATCH -J recalib1
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK/3.5-0 || fail "could not load GATK module"
cd /data/$USER/test_data/gatk
GATK --mem 4g BaseRecalibrator \
   -I bam/gcat_set_053_realigned.bam \
   -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
   -knownSites /fdb/GATK_resource_bundle/hg19/dbsnp_138.hg19.vcf.gz \
   -o bam/gcat_set_053_realigned.BaseRecalibrator
EOF


# base recalibrator - step 2 and plots
cat > recalib2.batch <<"EOF"
#! /bin/bash
#SBATCH -J recalib2
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK/3.5-0 || fail "could not load GATK module"
cd /data/$USER/test_data/gatk
GATK PrintReads \
    -I bam/gcat_set_053_realigned.bam \
    -o bam/gcat_set_053_realigned_recal.bam \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    -BQSR bam/gcat_set_053_realigned.BaseRecalibrator
# rerun baserecalibrator on the recalibrated data (after)
GATK --mem 4g BaseRecalibrator \
    -I bam/gcat_set_053_realigned_recal.bam \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    -knownSites /fdb/GATK_resource_bundle/hg19/dbsnp_138.hg19.vcf.gz \
    -o bam/gcat_set_053_realigned.BaseRecalibrator.after
EOF

# Note: R packages required for BQSR.R: ggplot2, gplots, reshape, gsalib.
# as of 2015-04-27, the BQSR.R script downloaded during the plotting process
# fails if the default for stringsAsFactors is set to FALSE. This is a workaround
# for this problem. A real solution would be to specifically set stringsAsFactors
# to TRUE when loading the CSV file or to explicitly factor
# data$CovariateName.
cat > recalib3.batch <<"EOF"
#! /bin/bash
#SBATCH -J recalib3
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK || fail "could not load GATK module"
cd /data/$USER/test_data/gatk

if [[ -f .Rprofile ]]; then
    mv .Rprofile .Rprofile.bak
fi
echo "options(stringsAsFactors = TRUE)" > .Rprofile
GATK AnalyzeCovariates \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    -before bam/gcat_set_053_realigned.BaseRecalibrator \
    -after  bam/gcat_set_053_realigned.BaseRecalibrator.after \
    -plots  bam/gcat_set_053_realigned_recal_plots.pdf
rm .Rprofile
if [[ -f .Rprofile.bak ]]; then
    mv .Rprofile.bak .Rprofile
fi
EOF

################################################################################
#                               VARIANT CALLING                                #
################################################################################

# now we run the Haplotype caller. This is a single sample analysis which is
# different from what you would do with a cohort (i.e. no need to create a
# GVCF). See GATK documentation for details.

cat > hapcall.batch <<"EOF"
#! /bin/bash
#SBATCH -J hapcall
set -e
function fail {
    echo "$@"
    exit 1
}

module load GATK || fail "could not load GATK module"
cd /data/$USER/test_data/gatk
GATK --mem=10g HaplotypeCaller \
    --genotyping_mode DISCOVERY \
    -stand_call_conf 30 \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    -L annot/SeqCap_EZ_Exome_v3_capture.bed \
    -I bam/gcat_set_053_realigned_recal.bam \
    --dbsnp /fdb/GATK_resource_bundle/hg19/dbsnp_138.hg19.vcf.gz \
    -o vcf/gcat_set_053_raw.vcf
EOF

################################################################################
#                                     RUN                                      #
################################################################################

## Run as a pipeline with dependencies
j1=$(sbatch --mem=32G -c 24 aln.batch)
j2=$(sbatch --mem=30G --dependency=afterok:$j1 --cpus-per-task=4 --out dedup.out dedup.batch)
j4=$(sbatch --mem=5g  --gres=lscratch:5 --out rtc.out -c 6 --dependency=afterok:$j2 rtc.batch)
j5=$(sbatch --mem=5g --dependency=afterok:$j4 --out realign.out realign.batch)
j6=$(sbatch --mem=5g --time=3:00:00 --dependency=afterok:$j5 --out recalib1.out recalib1.batch)
j7=$(sbatch --mem=5g --time=5:00:00 --dependency=afterok:$j6 --out recalib2.out recalib2.batch)
j8=$(sbatch --mem=4g --dependency=afterok:$j7 --out recalib3.out recalib3.batch)
sbatch --mem=10g --dependency=afterok:$j8 --out hapcall.out hapcall.batch 
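
While the chain is running, the state of the jobs (including those still waiting on a dependency) can be checked with squeue, for example:

biowulf$ squeue -u $USER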


Interactive job on Biowulf

Allocate an interactive session with sinteractive and use it as shown below:

biowulf$ sinteractive --mem=4g 
node$ cd /data/$USER/test_data
node$ mkdir GATK
node$ cd GATK
node$ module load GATK
node$ cp $GATK_HOME/resources/example* .
node$ GATK CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO  14:16:14,013 HelpFormatter - --------------------------------------------------------------------------------
INFO  14:16:14,015 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO  14:16:14,016 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  14:16:14,016 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  14:16:14,021 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO  14:16:14,028 HelpFormatter - Executing as xxxxx@cn0001 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_05-b06.
INFO  14:16:14,028 HelpFormatter - Date/Time: 2015/04/23 14:16:14
INFO  14:16:14,028 HelpFormatter - --------------------------------------------------------------------------------
INFO  14:16:14,029 HelpFormatter - --------------------------------------------------------------------------------
INFO  14:16:14,691 GenomeAnalysisEngine - Strictness is SILENT
INFO  14:16:14,814 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  14:16:14,827 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  14:16:14,849 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO  14:16:15,030 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  14:16:15,036 GenomeAnalysisEngine - Done preparing for traversal
INFO  14:16:15,036 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  14:16:15,037 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  14:16:15,037 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  14:16:15,040 ReadShardBalancer$1 - Loading BAM index data
INFO  14:16:15,041 ReadShardBalancer$1 - Done loading BAM index data
INFO  14:16:15,085 CountReads - CountReads counted 33 reads in the traversal
INFO  14:16:15,087 ProgressMeter -            done        33.0     0.0 s      25.0 m        0.3%     0.0 s       0.0 s
INFO  14:16:15,088 ProgressMeter - Total runtime 0.05 secs, 0.00 min, 0.00 hours
INFO  14:16:15,093 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 33 total reads (0.00%)
INFO  14:16:15,094 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  14:16:15,851 GATKRunReport - Uploaded run statistics report to AWS S3
node$ exit
biowulf$