Biowulf High Performance Computing at the NIH
deepvariant on Biowulf

DeepVariant is an analysis pipeline that calls genetic variants from next-generation DNA sequencing data by converting read pileups from BAM files into images and classifying them with a deep neural network.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Note that for the purposes of this demonstration a GPU node is used for all three steps of the pipeline. For real applications, please use GPU nodes only for the call_variants step and make sure to parallelize the make_examples step. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1 --cpus-per-task=6
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4224 are ready for job

[user@cn4224 ~]$ mkdir input
[user@cn4224 ~]$ module load deepvariant/0.4.1-gpu
[user@cn4224 ~]$ cp "${DEEPVARIANT_TEST_DATA}"/* input
[user@cn4224 ~]$ tree input
input
|-- [user   3.7M]  NA12878_S1.chr20.10_10p1mb.bam
|-- [user   5.3K]  NA12878_S1.chr20.10_10p1mb.bam.bai
|-- [user    264]  test_nist.b37_chr20_100kbp_at_10mb.bed
|-- [user   5.6K]  test_nist.b37_chr20_100kbp_at_10mb.vcf.gz
|-- [user    197]  test_nist.b37_chr20_100kbp_at_10mb.vcf.gz.tbi
|-- [user    61M]  ucsc.hg19.chr20.unittest.fasta
|-- [user     23]  ucsc.hg19.chr20.unittest.fasta.fai
|-- [user   593K]  ucsc.hg19.chr20.unittest.fasta.gz
|-- [user     23]  ucsc.hg19.chr20.unittest.fasta.gz.fai
`-- [user    15K]  ucsc.hg19.chr20.unittest.fasta.gz.gzi

[user@cn4224 ~]$ mkdir output
[user@cn4224 ~]$ REF=input/ucsc.hg19.chr20.unittest.fasta
[user@cn4224 ~]$ BAM=input/NA12878_S1.chr20.10_10p1mb.bam

Extract images of pileups from the input bam file with make_examples, which is single-threaded and consumes 1-2 GB of memory.

[user@cn4224 ~]$ make_examples --mode calling \
                      --ref "${REF}" \
                      --reads "${BAM}" \
                      --regions "chr20:10,000,000-10,010,000" \
                      --examples output/examples.tfrecord.gz

This step can be parallelized with GNU parallel on a single machine, in which case multiple sharded output files are generated.

[user@cn4224 ~]$ rm output/examples.tfrecord.gz # discard the output from the non-parallel run above
[user@cn4224 ~]$ module load parallel
[user@cn4224 ~]$ N_SHARDS=3
[user@cn4224 ~]$ mkdir -p logs
[user@cn4224 ~]$ seq 0 $((N_SHARDS-1)) \
                      | parallel -P ${SLURM_CPUS_PER_TASK} --eta --halt 2 --joblog "logs/log" --res "logs" \
                        make_examples --mode calling \
                          --ref "${REF}" \
                          --reads "${BAM}" \
                          --regions "chr20:10,000,000-10,010,000" \
                          --examples output/examples.tfrecord@${N_SHARDS}.gz \
                          --task {}
[user@cn4224 ~]$ tree output
output
|-- [user   200K]  examples.tfrecord-00000-of-00003.gz
|-- [user   186K]  examples.tfrecord-00001-of-00003.gz
|-- [user   191K]  examples.tfrecord-00002-of-00003.gz

Call variants on the sharded output. The model is stored at /opt/models/model.ckpt in the Singularity image containing DeepVariant. To parallelize this step, multiple separate jobs would have to be run; a sketch of that approach follows the call_variants example below. Note that since version 0.5.1 there are two models, one for WGS data and one for WES data: /opt/wgs/model.ckpt and /opt/wes/model.ckpt. For backwards compatibility with version 0.4.1, /opt/models/model.ckpt points to the WGS model as well.

[user@cn4224 ~]$ CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"
[user@cn4224 ~]$ MODEL="/opt/models/model.ckpt"
[user@cn4224 ~]$ call_variants \
                     --outfile "${CALL_VARIANTS_OUTPUT}" \
                     --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
                     --checkpoint "${MODEL}"
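
To parallelize call_variants itself, one job per examples shard could be submitted. The following is only a sketch of that idea, not a captured session: the per-shard output file names are made up for illustration, and how the per-shard call_variants outputs are combined for postprocess_variants depends on the DeepVariant version, so check the upstream documentation before adopting this pattern.

# hypothetical sketch: one GPU job per examples shard, submitted from the
# directory that contains the sharded output/ from make_examples
N_SHARDS=3
MODEL="/opt/models/model.ckpt"
for ((i = 0; i < N_SHARDS; i++)); do
    shard=$(printf "%05d-of-%05d" "$i" "$N_SHARDS")
    sbatch --partition=gpu --gres=gpu:k80:1 --cpus-per-task=6 <<EOF
#! /bin/bash
module load deepvariant/0.4.1-gpu || exit 1
call_variants \
  --outfile output/call_variants_output-${shard}.tfrecord.gz \
  --examples output/examples.tfrecord-${shard}.gz \
  --checkpoint ${MODEL}
EOF
done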

Finally, convert the output to VCF format:

[user@cn4224 ~]$ postprocess_variants \
                      --ref "${REF}" \
                      --infile "${CALL_VARIANTS_OUTPUT}" \
                      --outfile "output/example.vcf.gz"
[user@cn4224 ~]$ tree output
output
|-- [user   4.2K]  call_variants_output.tfrecord.gz
|-- [user   200K]  examples.tfrecord-00000-of-00003.gz
|-- [user   186K]  examples.tfrecord-00001-of-00003.gz
|-- [user   191K]  examples.tfrecord-00002-of-00003.gz
`-- [user   2.2K]  example.vcf.gz
[user@cn4224 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

With version 0.5.1, DeepVariant gained the ability to generate gVCF output.
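
A minimal sketch of a gVCF run on the same test data, assuming the gVCF flags documented upstream for 0.5.1 (--gvcf on make_examples; --nonvariant_site_tfrecord_path and --gvcf_outfile on postprocess_variants); verify the exact flag names with --help for the module version you load.

# hypothetical gVCF sketch; flag names follow the upstream 0.5.1 documentation
module load deepvariant    # a version >= 0.5.1
REF=input/ucsc.hg19.chr20.unittest.fasta
BAM=input/NA12878_S1.chr20.10_10p1mb.bam
make_examples --mode calling \
    --ref "${REF}" \
    --reads "${BAM}" \
    --regions "chr20:10,000,000-10,010,000" \
    --examples output/examples.tfrecord.gz \
    --gvcf output/gvcf.tfrecord.gz
call_variants \
    --outfile output/call_variants_output.tfrecord.gz \
    --examples output/examples.tfrecord.gz \
    --checkpoint /opt/models/model.ckpt
postprocess_variants \
    --ref "${REF}" \
    --infile output/call_variants_output.tfrecord.gz \
    --nonvariant_site_tfrecord_path output/gvcf.tfrecord.gz \
    --outfile output/example.vcf.gz \
    --gvcf_outfile output/example.g.vcf.gz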

Batch job
Most jobs should be run as batch jobs.

Create a batch input file for each step (e.g. deepvariant_step[123].sh) similar to the following. Note that for simplicity there is little error checking in the example scripts. Note also that only step2 makes use of a GPU and that the current version of DeepVariant cannot scale to more than 1 GPU.

#! /bin/bash
# this is deepvariant_step1.sh

module load deepvariant || exit 1
module load parallel

wd=$PWD

[[ -d /lscratch/${SLURM_JOB_ID} ]] && cd /lscratch/${SLURM_JOB_ID}
rm -rf input output logs
mkdir input output logs
cp "${DEEPVARIANT_TEST_DATA}"/* input || exit 1

REF=input/ucsc.hg19.chr20.unittest.fasta
BAM=input/NA12878_S1.chr20.10_10p1mb.bam
MODEL="/opt/models/model.ckpt"
N_SHARDS=12

seq 0 $((N_SHARDS-1)) \
    | parallel -P ${SLURM_CPUS_PER_TASK} --eta --halt 2 \
        --joblog "logs/log" --res "logs" \
      make_examples --mode calling \
        --ref "${REF}" \
        --reads "${BAM}" \
        --regions "chr20:10,000,000-10,010,000" \
        --examples output/examples.tfrecord@${N_SHARDS}.gz \
        --task {} \
|| exit 1

cp -r output "$wd"

#! /bin/bash
# this is deepvariant_step2.sh

module load deepvariant || exit 1

MODEL="/opt/models/model.ckpt"
N_SHARDS=12
CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"

call_variants \
 --outfile "${CALL_VARIANTS_OUTPUT}" \
 --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
 --checkpoint "${MODEL}" \
|| exit 1

#! /bin/bash
# this is deepvariant_step3.sh

module load deepvariant || exit 1

REF=input/ucsc.hg19.chr20.unittest.fasta
CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"

postprocess_variants \
  --ref "${REF}" \
  --infile "${CALL_VARIANTS_OUTPUT}" \
  --outfile "output/example.vcf.gz" \
|| exit 1

Submit these jobs using the Slurm sbatch command. In this example, job dependencies are used to tie the jobs together.

biowulf$ sbatch --cpus-per-task=6 --mem-per-cpu=2g --gres=lscratch:10 deepvariant_step1.sh
1111
biowulf$ sbatch --dependency=afterany:1111 --cpus-per-task=6 \
               --gres=lscratch:10,gpu:k80:1 --partition=gpu deepvariant_step2.sh
1112
biowulf$ sbatch --dependency=afterany:1112 deepvariant_step3.sh
1113
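
The job IDs can also be captured automatically instead of being copied by hand. A small driver script along these lines (a sketch using sbatch --parsable) chains the three steps:

#! /bin/bash
# hypothetical driver: submit the three steps and chain them with dependencies
jid1=$(sbatch --parsable --cpus-per-task=6 --mem-per-cpu=2g --gres=lscratch:10 deepvariant_step1.sh)
jid2=$(sbatch --parsable --dependency=afterany:${jid1} --cpus-per-task=6 \
              --gres=lscratch:10,gpu:k80:1 --partition=gpu deepvariant_step2.sh)
sbatch --dependency=afterany:${jid2} deepvariant_step3.sh
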
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile for the first step of the pipeline (e.g. deepvariant.swarm). For example:

make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr1" --examples sample1_chr1.tfrecord.gz
make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr2" --examples sample1_chr2.tfrecord.gz
make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr3" --examples sample1_chr3.tfrecord.gz

Submit this job using the swarm command.

swarm -f deepvariant.swarm [-g #] --module deepvariant

where
  -g #                  Number of Gigabytes of memory required for each process (1 line in the swarm command file)
  --module deepvariant  Loads the deepvariant module for each subjob in the swarm
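
For a whole-genome run, a swarmfile like the one above can be generated with a short loop rather than written by hand (a sketch; adjust the reference path, BAM name, and chromosome names to your data):

# hypothetical helper: write one make_examples command per chromosome
for chr in chr{1..22} chrX chrY; do
    echo "make_examples --mode calling --ref /path/to/genome.fa --reads sample1.bam --regions ${chr} --examples sample1_${chr}.tfrecord.gz"
done > deepvariant.swarm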