The Genome Analysis Toolkit (GATK) is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit includes a wide variety of tools, with a focus on variant discovery and genotyping and an emphasis on data quality assurance.
Online Tutorial: A practical introduction to GATK 4 on Biowulf
New in May 2021: A self-paced, online tutorial to work through a GATK example on Biowulf. Developed by the Biowulf staff, this tutorial includes a case study of germline variant discovery with WGS data from a trio, and benchmarks for each step. By working through the tutorial, you will learn NGS data preprocessing and how to optimize your Biowulf batch jobs.
There are multiple versions of GATK available. An easy way of selecting the version is to use modules. To see the modules available, type
module avail GATK
To select a module use
module load GATK/[version]
where [version] is the version of choice.
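For example, to select the GATK 4 version used in the examples below and confirm what was loaded (version numbers are illustrative and change over time):

[user@cn1234]$ module load GATK/4.5.0.0
[user@cn1234]$ gatk --version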
GATK Resource bundles contain reference genome assemblies (.fa, .fai, .dict), variation data (dbSNP, HapMap, 1000 genomes, ...), and some genotype calls for a gold standard genome.
Resource bundles are available at
/fdb/GATK_resource_bundle
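For example, to see which genome builds are available (listing abbreviated; the subdirectories shown are the ones referenced elsewhere on this page):

[user@cn1234]$ ls /fdb/GATK_resource_bundle
hg18/  hg19/  hg38/  liftover_chains/  ...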
The environment module for GATK (module load GATK or module load GATK/version) sets the following environment variables:
$GATK_HOME
$GATK_JAR
$GATK_JARPATH
$GATK_KEY
$GATK_QUEUE_JAR
$GATK_TEST_DATA
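For example, after loading the module, $GATK_JAR points at the GATK jar and can be passed directly to java (the path shown is from the GATK/3.8-1 module):

[user@cn1234]$ module load GATK/3.8-1
[user@cn1234]$ echo $GATK_JAR
/usr/local/apps/GATK/3.8-1/GenomeAnalysisTK.jar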
The GATK wrapper script provides a simplified interface to the GATK tools:
usage: GATK [options] [-m MEM] tool [gatk_tool_args]

This GATK wrapper script invokes GATK tools with a more friendly interface.
For documentation on each of the tools and their arguments see
https://www.broadinstitute.org/gatk/guide/tooldocs/

positional arguments:
  tool               GATK tool. Use 'LIST' or 'list' to get the current
                     list of available tools

optional arguments:
  -h, --help         show this help message and exit
  -m MEM, --mem MEM  maximal heap memory allowed for the Java process.
                     *Format*: <number>[kmg]. *Examples*: 1024k, 1000m, 2g, 32g.
                     Passed to Java with -Xmx (default: 2g)
  -p N, --pgct N     maximal number of parallel GC threads allowed for Java VM.
                     Defaults to 1 less than the number of allocated CPUs.
                     (default: -1)
  -n, --dry-run      show command that would have been executed and set up
                     temp dir (default: False)
It takes care of finding the GATK jar and writing temporary files to either

- /lscratch/${SLURM_JOB_ID}: if running on a compute node with allocated local disk space
- ./unique_tmp_dir: if running on a compute node without allocated local disk space

Whenever possible, please use lscratch for temporary files. It will be more performant and reduce strain on the shared filesystems.
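The -n/--dry-run flag is a convenient way to check what the wrapper will do, including which temp directory it sets up, without actually running the tool. A sketch (file names are placeholders):

[user@cn1234]$ GATK -m 4g -n CountReads -R ref.fasta -I input.bam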
When not using the GATK wrapper script, the basic GATK command should be

java -Xmx####g -Djava.io.tmpdir=/lscratch/${SLURM_JOB_ID} \
    -XX:ParallelGCThreads=## -jar $GATK_JAR -T <Toolname> [options]

which requires requesting lscratch space at job submission.
Note that /lscratch is the node-local scratch space. See the user guide for more information about allocating local scratch space.
Some tools may perform better with non-default garbage collectors.
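For example, switching to the G1 collector is one non-default option to try (a sketch; whether this helps is tool- and workload-dependent):

java -Xmx8g -XX:+UseG1GC -jar $GATK_JAR -T <Toolname> [options]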
The -Xmx option to Java specifies the maximal size of the memory allocation pool. It accepts the suffixes k, m, and g. This is the option used by the wrapper script to limit memory usage.
Note that you should request additional memory (1-2GB) for batch/interactive jobs on top of the memory given to java to ensure that the whole java process stays below the allocated memory limit.
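For example, to give Java an 8 GB heap, request roughly 9-10 GB from Slurm (a sketch; gatk.bat stands for your batch script):

# inside gatk.bat
java -Xmx8g -jar $GATK_JAR -T <Toolname> [options]

# at submission: heap plus 1-2 GB of headroom
biowulf$ sbatch --mem=10g gatk.bat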
GATK writes some temporary files during its run. By default, Java writes temp files to /tmp. However, this is not advisable because /tmp is used by the operating system and is very limited in space. The tmp files should be redirected to /lscratch/${SLURM_JOB_ID} (if allocating local disk space) by adding

-Djava.io.tmpdir=/lscratch/${SLURM_JOB_ID}

to the java command as shown above. The wrapper script takes care of this automatically. Note that the temp dir must already exist if specifying it manually; the wrapper script will create it automatically if necessary.
The Java VM tends to start many threads, including parallel garbage collection threads, because it automatically detects all CPUs on a compute node irrespective of the allocation. By default it uses one GC thread per detected CPU up to 8 CPUs, plus 5 more for every 8 CPUs beyond that. This can be aggravated when there is not sufficient memory allocated to the JVM. Use -XX:ParallelGCThreads=## to limit the number of parallel garbage collection threads to no more than the number of allocated CPUs. The GATK wrapper by default sets the number of GC threads to one less than the number of allocated CPUs.
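For example, when running the jar directly you can tie the GC thread count to the Slurm allocation, mirroring the wrapper's convention of one less than the allocated CPUs (a sketch):

java -Xmx8g -XX:ParallelGCThreads=$(( SLURM_CPUS_PER_TASK - 1 )) \
    -jar $GATK_JAR -T <Toolname> [options]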
GATK 3.8-0 made the Intel deflater/inflater the default. While slightly faster than their JDK equivalents, this did introduce a bug that can lead to segfaults when allocating large heaps. Adding --use_jdk_inflater --use_jdk_deflater to the GATK tool options avoids this. The GATK wrapper takes care of this automatically.
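When running the jar directly instead of the wrapper, append the two flags to the tool options (a sketch; the tool and file names are placeholders):

java -Xmx8g -jar $GATK_JAR -T HaplotypeCaller \
    -R ref.fasta -I input.bam -o output.vcf \
    --use_jdk_inflater --use_jdk_deflater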
Note that GATK has a phone-home feature which will send some generic information about each GATK run to the developers at the Broad Institute. This feature will not work on the Biowulf compute nodes, which are on a private network. This will not affect your runs in most cases. If you're getting an error related to 'phone-home', add -et NO_ET -K $GATK_KEY to your GATK command line to disable the phone-home feature.
GATK opens many files simultaneously. It can happen that you hit the open-file limit on Biowulf, and see errors like:
##### ERROR MESSAGE: Couldn't read file .... (Too many open files)
The most important change you can make is to reduce the multithreading. If your GATK command has any multithreading flags turned on (e.g. '-nt' or '-nct'), reduce the number of threads to 1.
If that does not resolve the problem, the limit on the number of open files has to be increased from its default soft limit of 4096. This can be done by adding the following command to your batch script: ulimit -n 16384.
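For example, place the ulimit call near the top of the batch script, before any GATK command (a sketch; the module version and tool are illustrative):

#! /bin/bash
# raise the soft open-file limit for this job
ulimit -n 16384
module load GATK/3.8-1
GATK -m 8g <Toolname> [options]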
Some GATK tools can make use of more than one CPU using the options -nt or -nct. See the GATK documentation for more detail.
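For example, RealignerTargetCreator (used in the swarm example below) supports data threads via -nt. A sketch with placeholder file names, using the wrapper:

GATK -m 7g RealignerTargetCreator \
    -nt $SLURM_CPUS_PER_TASK \
    -R ref.fasta -I input.bam \
    -o output.intervals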
Set up a batch script along the following lines:
#! /bin/bash
# gatk.bat
set -e

module load GATK/3.8-0 || exit 1
cd /data/$USER/test_data/GATK

# use 8 GB memory
java -Xmx8g -Djava.io.tmpdir=/lscratch/$SLURM_JOB_ID -jar $GATK_JAR \
    -T CountReads \
    -R exampleFASTA.fasta \
    -I exampleBAM.bam \
    -et NO_ET -K $GATK_KEY
Submit to the queue with sbatch:
biowulf$ sbatch --mem=9g --gres=lscratch:10 gatk.bat
Since the simple CountReads tool does not support multithreading, only a single core (the default) is requested.
The following swarm command file would run RealignerTargetCreator, which determines which genomic intervals require realignment, for a number of samples:
# this file is gatk-swarm
GATK -m 7g RealignerTargetCreator \
    -R ref.fasta -I input1.bam \
    -o output1.intervals \
    --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz
GATK -m 7g RealignerTargetCreator \
    -R ref.fasta -I input2.bam \
    -o output2.intervals \
    --known /fdb/GATK_resource_bundle/hg19/1000G_phase1.indels.hg19.vcf.gz
[...]
These command lines specify that the process will use 7 GB of memory (-Xmx7g). Thus, this swarm job should be submitted with the '-g 7' flag, to tell swarm that each process will require 7 GB of memory.
biowulf$ swarm -g 7 -f gatk-swarm --module GATK/3.8-1
Allocate an interactive session with sinteractive and use it as shown below.
[user@biowulf]$ sinteractive --gres=lscratch:50 --cpus-per-task=2 --mem=6g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load GATK/3.8-1
[user@cn3144]$ cp ${GATK_TEST_DATA:-none}/NA12878.ba[im]* .
[user@cn3144]$ ls -lh
total 18M
-rw-r--r-- 1 user group 2.2M Jun 30 13:45 NA12878.bai
-rw-r--r-- 1 user group  15M Jun 30 13:45 NA12878.bam
[user@cn3144]$ GATK CountReads \
    -R /fdb/GATK_resource_bundle/hg38/Homo_sapiens_assembly38.fasta \
    -I NA12878.bam
directory for temp files: /lscratch/60475214
Executing 'java -Djava.io.tmpdir=/lscratch/60475214 -Xmx2g -XX:ParallelGCThreads=2 -jar /usr/local/apps/GATK/3.8-1/GenomeAnalysisTK.jar -T CountReads -R /fdb/GATK_resource_bundle/hg38/Homo_sapiens_assembly38.fasta -I NA12878.bam --use_jdk_inflater --use_jdk_deflater'
INFO  13:50:21,043 HelpFormatter - ----------------------------------------------------------------
INFO  13:50:21,047 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled...
INFO  13:50:21,048 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  13:50:21,048 HelpFormatter - For support and documentation go to https://software.broadinstit...
INFO  13:50:21,049 HelpFormatter - [Tue Jun 30 13:50:21 EDT 2020] Executing on Linux 3.10.0-862....
INFO  13:50:21,049 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_181-b13
INFO  13:50:21,056 HelpFormatter - Program Args: -T CountReads -R /fdb/GATK_resource_bundle/hg38/Ho...
INFO  13:50:21,064 HelpFormatter - Executing as user@cn3293 on Linux 3.10.0-862.14.4.el7.x86_64 a...
INFO  13:50:21,065 HelpFormatter - Date/Time: 2020/06/30 13:50:21
INFO  13:50:21,065 HelpFormatter - ----------------------------------------------------------------
INFO  13:50:21,066 HelpFormatter - ----------------------------------------------------------------
INFO  13:50:21,072 GenomeAnalysisEngine - Deflater: JdkDeflater
INFO  13:50:21,073 GenomeAnalysisEngine - Inflater: JdkInflater
INFO  13:50:21,074 GenomeAnalysisEngine - Strictness is SILENT
INFO  13:50:23,245 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  13:50:23,258 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  13:50:23,540 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.28
INFO  13:50:24,955 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  13:50:24,961 GenomeAnalysisEngine - Done preparing for traversal
INFO  13:50:24,962 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  13:50:24,962 ProgressMeter -          | processed | time    | per 1M | completed | total   | remaining
INFO  13:50:24,962 ProgressMeter - Location | reads     | elapsed | reads  |           | runtime | runtime
INFO  13:50:24,964 ReadShardBalancer$1 - Loading BAM index data
INFO  13:50:24,966 ReadShardBalancer$1 - Done loading BAM index data
INFO  13:50:26,502 CountReads - CountReads counted 61614 reads in the traversal
INFO  13:50:26,507 ProgressMeter - done  61614.0  1.0 s  25.0 s  99.9%  1.0 s  0.0 s
INFO  13:50:26,508 ProgressMeter - Total runtime 1.55 secs, 0.03 min, 0.00 hours
INFO  13:50:26,512 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 61614 total reads (0.00%)
INFO  13:50:26,514 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  13:50:26,515 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------
GATK finished (exit code 0)
[user@cn3144]$ exit
LiftOverVCF has been deprecated with release 3.5-0. Use Picard's LiftoverVCF instead.
The procedure for lifting over a VCF file from one genome build to another in GATK is a three-step process: (1) LiftoverVCF, (2) sort the VCF, and (3) FilterLiftedVCF. This process is automated by the liftOverVCF script:
Usage: liftOverVCF
    -vcf    <input vcf>
    -gatk   <path to gatk trunk>
    -chain  <chain file>
    -newRef <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>
    -oldRef <path to old reference prefix; we will need oldRef.fasta>
    -out    <output vcf>
    -recordOriginalLocation
            <Should we record what the original location was in the INFO field?; defaults to false>
This procedure uses liftOverVCF to switch from, say, hg18 to hg19. Chain files are available under /fdb/GATK_resource_bundle/liftover_chains. In addition, an hg18 to hg19 liftover file from UCSC is included as well. Example:
liftOverVCF \
    -vcf test.hg18.vcf.gz \
    -chain /fdb/GATK_resource_bundle/liftover_chains/hg18ToHg19.over.chain \
    -newRef /fdb/GATK_resource_bundle/hg19/ucsc.hg19 \
    -oldRef /fdb/GATK_resource_bundle/hg18/Homo_sapiens_assembly18 \
    -out test.hg19.vcf
The environment module for GATK 4 (module load GATK or module load GATK/version) sets the following environment variable:
$GATK_TEST_DATA
Whenever possible, please use lscratch for temporary files. It will be more performant and reduce strain on the shared filesystems.
GATK now includes its own wrapper script (gatk) and therefore the Biowulf-supplied wrapper for GATK 3 is not available for GATK 4. The basic GATK 4 command is
gatk --java-options "[java options]" <Toolname> [tool options]
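For example, a minimal sketch of a HaplotypeCaller run using a reference from the resource bundle (the BAM and output names are placeholders):

gatk --java-options "-Xmx6g -Djava.io.tmpdir=/lscratch/$SLURM_JOB_ID" \
    HaplotypeCaller \
    -R /fdb/GATK_resource_bundle/hg38/Homo_sapiens_assembly38.fasta \
    -I sample.bam \
    -O sample.vcf.gz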
For detailed GATK help use
gatk --help

Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
    gatk AnyTool toolArgs

Usage template for Spark tools (will NOT work on non-Spark tools)
    gatk SparkTool toolArgs [ -- --spark-runner <LOCAL | SPARK | GCS> sparkArgs ]

Getting help
    gatk --list       Print the list of available tools
    gatk Tool --help  Print help on a particular tool

Configuration File Specification
    --gatk-config-file PATH/TO/GATK/PROPERTIES/FILE

gatk forwards commands to GATK and adds some sugar for submitting spark jobs

  --spark-runner <target>  controls how spark tools are run
    valid targets are:
    LOCAL: run using the in-memory spark runner
    SPARK: run using spark-submit on an existing cluster
           --spark-master must be specified
           --spark-submit-command may be specified to control the Spark submit command
           arguments to spark-submit may optionally be specified after --
    GCS:   run using Google cloud dataproc
           commands after the -- will be passed to dataproc
           --cluster must be specified after the --
           spark properties and some common spark-submit parameters will be
           translated to dataproc equivalents

  --dry-run may be specified to output the generated command line without running it

  --java-options 'OPTION1[ OPTION2=Y ... ]'
           optional - pass the given string of options to the java JVM at runtime.
           Java options MUST be passed inside a single string with space-separated values.
The -Xmx option to Java specifies the maximal size of the heap memory. It accepts the suffixes k, m, and g. This is passed to gatk in the --java-options string. For example
gatk --java-options "-Xmx6g" <Toolname> [tool options]
Note that you should request additional memory (1-2GB) for batch/interactive jobs on top of the memory given to java to ensure that the whole java process stays below the allocated memory limit.
Note that some tools may perform better with non-default garbage collectors. For example here are some possible options for a large heap allocation using ParallelGCThreads:
gatk --java-options "-Djava.io.tmpdir=/lscratch/$SLURM_JOB_ID -Xms48G -Xmx48G \
    -XX:ParallelGCThreads=8" ...
GATK writes some temporary files during its run. By default, Java writes temp files to /tmp. However, this is not advisable because /tmp is used by the operating system and is very limited in space. The tmp files should be redirected to /lscratch/${SLURM_JOB_ID} by adding

-Djava.io.tmpdir=/lscratch/${SLURM_JOB_ID}

to the java options.
For more details on the node-local storage under /lscratch see the user guide.
If you find that your GATK runs have more active threads than you were expecting, you may have to limit the number of parallel garbage collection threads. The JVM options -XX:ParallelGCThreads and -XX:ConcGCThreads can be used to tune the number of threads dedicated to garbage collection.
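For example (a sketch; the values are illustrative starting points, not tuned recommendations):

gatk --java-options "-Xmx8g -XX:ParallelGCThreads=$SLURM_CPUS_PER_TASK -XX:ConcGCThreads=1" \
    <Toolname> [tool options]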
GATK may open many files simultaneously. It can happen that you hit the open-file limit on Biowulf, and see errors like:
##### ERROR MESSAGE: Couldn't read file .... (Too many open files)
If that happens, the limit on the number of open files has to be increased from its default soft limit of 4096. This can be done by adding the following command to your batch script: ulimit -n 16384.
The paradigm for parallelization in GATK 4 is completely different from GATK 3. Spark is used to parallelize workloads, and the tools capable of using this form of parallelism include 'Spark' in their tool name. Some of these tools are still in beta, and their outputs may not be compatible with those of the non-Spark variants. See the GATK documentation for more detail.
Set up a batch script along the following lines:
#! /bin/bash
# gatk.bat
set -e

module load GATK/4.5.0.0 || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1

ref=/fdb/GATK_resource_bundle/hg38/Homo_sapiens_assembly38.fasta
bam=${GATK_TEST_DATA:-none}/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam

# use 8 GB memory
gatk --java-options "-Xmx8g -Djava.io.tmpdir=/lscratch/$SLURM_JOB_ID" \
    CountReads \
    -R $ref \
    -I $bam
Submit to the queue with sbatch:
[user@biowulf]$ sbatch --mem=9g --gres=lscratch:10 gatk.bat
Since the simple CountReads tool does not support multithreading, only a single core (the default) is requested.
The following swarm command file would run MarkDuplicatesSpark, which will also sort and index the BAM file. This is a Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores:
# this file is gatk-swarm
gatk --java-options "-Xmx8g" MarkDuplicatesSpark \
    -I input1.bam \
    -O input1_markdup.bam \
    --spark-runner LOCAL \
    --spark-master local[$SLURM_CPUS_PER_TASK]
gatk --java-options "-Xmx8g" MarkDuplicatesSpark \
    -I input2.bam \
    -O input2_markdup.bam \
    --spark-runner LOCAL \
    --spark-master local[$SLURM_CPUS_PER_TASK]
[...]
These command lines specify that each process will use 8 GB of memory (-Xmx8g). Thus, this swarm job should be submitted with the '-g 9' flag, to tell swarm that each process will require 9 GB of memory (8 GB plus some buffer). Please do not use more than 12 CPUs per process, as parallel efficiency drops beyond that.
biowulf$ swarm -g 9 -f gatk-swarm -t 12 --module GATK/4.5.0.0
It may be useful for debugging purposes to run GATK jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.
[user@biowulf]$ sinteractive --gres=lscratch:50 --cpus-per-task=2 --mem=6g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load GATK/4.5.0.0
[user@cn3144]$ cp ${GATK_TEST_DATA:-none}/NA12878.ba[im]* .
[user@cn3144]$ ls -lh
total 18M
-rw-r--r-- 1 user group 2.2M Jun 30 13:45 NA12878.bai
-rw-r--r-- 1 user group  15M Jun 30 13:45 NA12878.bam
[user@cn3144]$ gatk --java-options "-Xmx5g -Xms5g -Djava.io.tmpdir=/lscratch/$SLURM_JOB_ID" \
    CountReads \
    -R /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    -I NA12878.bam
Using GATK jar /usr/local/apps/GATK/4.5.0.0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx5g -Xms5g -Djava.io.tmpdir=/lscratch/60475214 -jar /usr/local/apps/GATK/4.5.0.0/gatk-package-4.5.0.0-local.jar CountReads -R /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -I NA12878.bam
17:06:09.007 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/apps/GATK/4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:06:09.513 INFO  CountReads - ------------------------------------------------------------
17:06:09.514 INFO  CountReads - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:06:09.514 INFO  CountReads - For support and documentation go to https://software.broadinstitute.org/gatk/
17:06:09.515 INFO  CountReads - Executing as user@cn3293 on Linux v3.10.0-862.14.4.el7.x86_64 amd64
17:06:09.515 INFO  CountReads - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
17:06:09.516 INFO  CountReads - Start Date/Time: June 30, 2020 5:06:08 PM EDT
17:06:09.517 INFO  CountReads - ------------------------------------------------------------
17:06:09.517 INFO  CountReads - ------------------------------------------------------------
17:06:09.519 INFO  CountReads - HTSJDK Version: 2.22.0
17:06:09.519 INFO  CountReads - Picard Version: 2.22.8
17:06:09.519 INFO  CountReads - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:06:09.520 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:06:09.520 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:06:09.521 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:06:09.521 INFO  CountReads - Deflater: IntelDeflater
17:06:09.521 INFO  CountReads - Inflater: IntelInflater
17:06:09.521 INFO  CountReads - GCS max retries/reopens: 20
17:06:09.521 INFO  CountReads - Requester pays: disabled
17:06:09.521 INFO  CountReads - Initializing engine
WARNING: BAM index file /lscratch/60475214/NA12878.bai is older than BAM /lscratch/60475214/NA12878.bam
17:06:10.212 INFO  CountReads - Done initializing engine
17:06:10.212 INFO  ProgressMeter - Starting traversal
17:06:10.212 INFO  ProgressMeter -        Current Locus  Elapsed Minutes  Reads Processed  Reads/Minute
17:06:11.192 INFO  CountReads - 0 read(s) filtered by: WellformedReadFilter
17:06:11.194 INFO  ProgressMeter - chrUn_KN707963v1_decoy:19294  0.0  61614  3768440.4
17:06:11.195 INFO  ProgressMeter - Traversal complete. Processed 61614 total reads in 0.0 minutes.
17:06:11.195 INFO  CountReads - CountReads counted 61614 total reads
17:06:11.195 INFO  CountReads - Shutting down engine
[June 30, 2020 5:06:11 PM EDT] org.broadinstitute.hellbender.tools.CountReads done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=5145362432
Tool returned:
61614