DeepVariant is an analysis pipeline that calls genetic variants from next-generation DNA sequencing data by converting read pileups from BAM files into images and classifying them with a deep neural network.
Starting with version 1.5.0, a module variant with the -deeptrio suffix is provided. This variant is targeted at analyzing trios and duos; an example is provided below.
Test data is available in ${DEEPVARIANT_TEST_DATA} (or ${DEEPTRIO_TEST_DATA} for the deeptrio variant).
Please limit the number of writer threads with the --writer_threads option to call_variants. It defaults to auto-detecting all CPUs, which can lead to the creation of too many threads for our per-user process limit. To avoid this, use --writer_threads=6 (or some other number ≤ allocated CPUs) or raise the per-user process limit with ulimit -u 32768 at the beginning of your batch script.
Allocate an interactive session and run the program. Note that for the purposes of this
demonstration a GPU node is used for all three steps of the pipeline. For real applications, please
only use GPU nodes for the call_variants step and make sure to parallelize the make_examples step.
Sample session:
[user@biowulf]$ sinteractive --gres=gpu:p100:1,lscratch:50 --cpus-per-task=6 --mem=24g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4224 are ready for job
[user@cn4224 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn4224 ~]$ module load deepvariant/1.6.0
[user@cn4224 ~]$ cp -r ${DEEPVARIANT_TEST_DATA:-none} input
[user@cn4224 ~]$ tree input
input
|-- [user 3.7M] NA12878_S1.chr20.10_10p1mb.bam
|-- [user 5.3K] NA12878_S1.chr20.10_10p1mb.bam.bai
|-- [user 264] test_nist.b37_chr20_100kbp_at_10mb.bed
|-- [user 5.6K] test_nist.b37_chr20_100kbp_at_10mb.vcf.gz
|-- [user 197] test_nist.b37_chr20_100kbp_at_10mb.vcf.gz.tbi
|-- [user 61M] ucsc.hg19.chr20.unittest.fasta
|-- [user 23] ucsc.hg19.chr20.unittest.fasta.fai
|-- [user 593K] ucsc.hg19.chr20.unittest.fasta.gz
|-- [user 23] ucsc.hg19.chr20.unittest.fasta.gz.fai
`-- [user 15K] ucsc.hg19.chr20.unittest.fasta.gz.gzi
[user@cn4224 ~]$ mkdir output
[user@cn4224 ~]$ REF=input/ucsc.hg19.chr20.unittest.fasta
[user@cn4224 ~]$ BAM=input/NA12878_S1.chr20.10_10p1mb.bam
Extract images of pileups from the input BAM file with make_examples, which is single-threaded and
consumes 1-2 GB of memory. Note that starting with version 1.1.0 it is necessary to specify
--channels insert_size to make_examples for the WGS and WES models. Other models require different channels.
[user@cn4224 ~]$ make_examples --mode calling \
                   --ref "${REF}" \
                   --reads "${BAM}" \
                   --regions "chr20:10,000,000-10,010,000" \
                   --channels insert_size \
                   --examples output/examples.tfrecord.gz
As of version 1.8.0 the --channels flag no longer exists; use --channel_list instead. For example:
[user@cn4224 ~]$ module load deepvariant/1.8.0
[user@cn4224 ~]$ make_examples --mode calling \
                   --ref "${REF}" \
                   --reads "${BAM}" \
                   --regions "chr20:10,000,000-10,010,000" \
                   --channel_list=read_base,base_quality,mapping_quality,strand,read_supports_variant,base_differs_from_ref,insert_size \
                   --examples output/examples.tfrecord.gz
Please note that make_examples must be called with the --norealign_reads --vsc_min_fraction_indels 0.12
flags for PacBio long reads.
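As an illustration only (the test data above contains no PacBio reads; input/pacbio.bam below is a hypothetical BAM of PacBio long reads), such a call might look like:
[user@cn4224 ~]$ make_examples --mode calling \
                   --ref "${REF}" \
                   --reads input/pacbio.bam \
                   --regions "chr20:10,000,000-10,010,000" \
                   --norealign_reads \
                   --vsc_min_fraction_indels 0.12 \
                   --examples output/examples_pacbio.tfrecord.gz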
This step can be parallelized with GNU parallel on a single machine, in which case multiple output files are generated. Let's clean up the serial results and reproduce them in parallel.
[user@cn4224 ~]$ rm output/examples.tfrecord.gz # discard the output from the non-parallel run above
[user@cn4224 ~]$ module load parallel
[user@cn4224 ~]$ N_SHARDS=3
[user@cn4224 ~]$ mkdir -p logs
[user@cn4224 ~]$ seq 0 $((N_SHARDS-1)) \
    | parallel -j ${SLURM_CPUS_PER_TASK} --eta --halt 2 --joblog "logs/log" --res "logs" \
      make_examples --mode calling \
        --ref "${REF}" \
        --reads "${BAM}" \
        --regions "chr20:10,000,000-10,010,000" \
        --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
        --channels insert_size \
        --task {}
[user@cn4224 ~]$ tree output
output
├── [user 231K] examples.tfrecord-00000-of-00003.gz
├── [user 80] examples.tfrecord-00000-of-00003.gz.example_info.json
├── [user 214K] examples.tfrecord-00001-of-00003.gz
├── [user 80] examples.tfrecord-00001-of-00003.gz.example_info.json
├── [user 234K] examples.tfrecord-00002-of-00003.gz
└── [user 80] examples.tfrecord-00002-of-00003.gz.example_info.json

0 directories, 6 files
Call variants on the sharded output. The model is stored at /opt/models/model.ckpt in the singularity image containing deepvariant. To parallelize this step, multiple separate jobs would have to be run. Note that since version 0.5.1 there are two models - one for WGS data and one for WES data. They are /opt/wgs/model.ckpt and /opt/wes/model.ckpt. For backwards compatibility with version 0.4.1, /opt/models/model.ckpt points to the WGS model as well.
[user@cn4224 ~]$ CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"
[user@cn4224 ~]$ #< 0.9.0 MODEL="/opt/wgs/model.ckpt"
[user@cn4224 ~]$ #< 1.6.0 MODEL="/opt/models/wgs/model.ckpt"
[user@cn4224 ~]$ MODEL="/opt/models/wgs"
[user@cn4224 ~]$ call_variants \
                   --outfile "${CALL_VARIANTS_OUTPUT}" \
                   --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
                   --checkpoint "${MODEL}" \
                   --writer_threads=6
Other models (e.g. WES, PACBIO, ONT_R104) are available under /opt/models in 1.6.{0,1}, as sketched below.
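For example, to call variants on exome data one would point --checkpoint at the WES model instead. The path below is an assumption by analogy with the WGS layout above; please verify it for the version you load.
[user@cn4224 ~]$ MODEL="/opt/models/wes"   # assumed path, analogous to /opt/models/wgs above
[user@cn4224 ~]$ call_variants \
                   --outfile "output/call_variants_output.tfrecord.gz" \
                   --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
                   --checkpoint "${MODEL}" \
                   --writer_threads=6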
Note that starting with version 1.6.0 call_variants creates multiple output files depending on the
number of writer threads. That does not change the way postprocess_variants is called below, but it
is necessary to limit the threads to avoid creating more threads per user than the default limit on
Biowulf compute nodes.
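In a batch job the two measures mentioned above can be combined; a minimal sketch of the relevant part of such a script (paths and shard count taken from the examples on this page):
#! /bin/bash
# sketch: raise the per-user process limit and cap the writer threads for call_variants
module load deepvariant/1.6.0 || exit 1
ulimit -u 32768
N_SHARDS=12
call_variants \
    --outfile "output/call_variants_output.tfrecord.gz" \
    --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
    --checkpoint /opt/models/wgs \
    --writer_threads=6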
Finally, convert the output to VCF format:
[user@cn4224 ~]$ postprocess_variants \
                   --ref "${REF}" \
                   --infile "${CALL_VARIANTS_OUTPUT}" \
                   --outfile "output/example.vcf.gz"
[user@cn4224 ~]$ tree output
output
|-- [user 4.1K] call_variants_output.tfrecord.gz
|-- [user 180K] examples.tfrecord-00000-of-00003.gz
|-- [user 151K] examples.tfrecord-00000-of-00003.gz.run_info.pbtxt
|-- [user 166K] examples.tfrecord-00001-of-00003.gz
|-- [user 151K] examples.tfrecord-00001-of-00003.gz.run_info.pbtxt
|-- [user 174K] examples.tfrecord-00002-of-00003.gz
|-- [user 151K] examples.tfrecord-00002-of-00003.gz.run_info.pbtxt
`-- [user 2.2K] example.vcf.gz
[user@cn4224 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
With version 0.5.1, deepvariant gained the ability to generate gVCF output.
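A sketch of the single-sample workflow above extended to also emit a gVCF. The gVCF-related flags shown here (--gvcf for make_examples, --nonvariant_site_tfrecord_path and --gvcf_outfile for postprocess_variants) are the same ones used in the deeptrio example further down; file names are illustrative.
[user@cn4224 ~]$ make_examples --mode calling \
                   --ref "${REF}" \
                   --reads "${BAM}" \
                   --regions "chr20:10,000,000-10,010,000" \
                   --channels insert_size \
                   --examples output/examples.tfrecord.gz \
                   --gvcf output/gvcf.tfrecord.gz
[user@cn4224 ~]$ call_variants \
                   --outfile output/call_variants_output.tfrecord.gz \
                   --examples output/examples.tfrecord.gz \
                   --checkpoint "${MODEL}" \
                   --writer_threads=6
[user@cn4224 ~]$ postprocess_variants \
                   --ref "${REF}" \
                   --infile output/call_variants_output.tfrecord.gz \
                   --outfile output/example.vcf.gz \
                   --nonvariant_site_tfrecord_path output/gvcf.tfrecord.gz \
                   --gvcf_outfile output/example.g.vcf.gz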
In version 0.9.0 deepvariant added a single command (run_deepvariant) to run the whole pipeline.
However, since steps 1 and 3 can be run without a GPU, it is more efficient to run the steps separately.
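For completeness, a sketch of such a single-command run on the test data (flag values follow the examples above; adjust --num_shards to the allocated CPUs):
[user@cn4224 ~]$ run_deepvariant \
                   --model_type=WGS \
                   --ref "${REF}" \
                   --reads "${BAM}" \
                   --regions "chr20:10,000,000-10,010,000" \
                   --output_vcf output/example.vcf.gz \
                   --output_gvcf output/example.g.vcf.gz \
                   --num_shards ${SLURM_CPUS_PER_TASK}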
The run_deepvariant pipeline wrapper applies additional arguments to make_examples depending on the
choice of model. In the example above with WGS that is just --channels insert_size. For the PACBIO
and ONT_R104 models more options are used (this is current for version 1.6.1):
[user@cn4224 ~]$ special_args=(
                   --add_hp_channel
                   --alt_aligned_pileup='diff_channels'
                   --max_reads_per_partition=600
                   --min_mapping_quality=1
                   --parse_sam_aux_fields
                   --partition_size=25000
                   --phase_reads
                   --pileup_image_width=199
                   --norealign_reads
                   --sort_by_haplotypes
                   --track_ref_reads
                   --vsc_min_fraction_indels=0.12
                 )
[user@cn4224 ~]$ if [[ "${MODEL}" == "/opt/models/pacbio" ]] ; then
                     special_args+=( --min_mapping_quality=1 )
                 elif [[ "${MODEL}" == "/opt/models/ont_r104" ]] ; then
                     special_args+=( --min_mapping_quality=5 --vsc_min_fraction_snps=0.08 )
                 fi
[user@cn4224 ~]$ seq 0 $((N_SHARDS-1)) \
    | parallel -j ${SLURM_CPUS_PER_TASK} --eta --halt 2 --joblog "logs/log" --res "logs" \
      make_examples --mode calling \
        --ref "${REF}" \
        --reads "${BAM}" \
        --regions "chr20:10,000,000-10,010,000" \
        --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
        "${special_args[@]}" \
        --task {}
Deeptrio is built on top of deepvariant and follows the same steps. Deeptrio considers genetic inheritance when calling variants and produces joint examples from all samples to ensure that variant calls are made at the same sites in all samples. It is available as a module variant starting with version 1.5.0.
[user@biowulf]$ sinteractive --gres=gpu:p100:1,lscratch:50 --cpus-per-task=6
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4224 are ready for job
[user@cn4224 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn4224 ~]$ module load deepvariant/1.6.0-deeptrio
[user@cn4224 ~]$ cp -r ${DEEPTRIO_TEST_DATA:-none} input
[user@cn4224 ~]$ tree input
input
|-- [user 2.9G] GRCh38_no_alt_analysis_set.fasta
|-- [user 7.6K] GRCh38_no_alt_analysis_set.fasta.fai
|-- [user 1.3M] HG002.chr20.10_10p1mb.bam
|-- [user 6.7K] HG002.chr20.10_10p1mb.bam.bai
|-- [user 11M] HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
|-- [user 149M] HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
|-- [user 1.6M] HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
|-- [user 1.4M] HG003.chr20.10_10p1mb.bam
|-- [user 6.7K] HG003.chr20.10_10p1mb.bam.bai
|-- [user 13M] HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
|-- [user 140M] HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
|-- [user 1.6M] HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
|-- [user 1.3M] HG004.chr20.10_10p1mb.bam
|-- [user 6.7K] HG004.chr20.10_10p1mb.bam.bai
|-- [user 12M] HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
|-- [user 142M] HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
`-- [user 1.6M] HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
[user@cn4224 ~]$ mkdir output

Make examples
[user@cn4224 ~]$ module load parallel
[user@cn4224 ~]$ N_SHARDS=$SLURM_CPUS_PER_TASK
[user@cn4224 ~]$ mkdir -p logs output/intermediate_results_dir
[user@cn4224 ~]$ seq 0 $((N_SHARDS-1)) \
    | parallel -j ${SLURM_CPUS_PER_TASK} --eta --halt 2 --joblog "logs/log" --res "logs" \
      make_examples --mode calling \
        --ref input/GRCh38_no_alt_analysis_set.fasta \
        --reads_parent1 "input/HG003.chr20.10_10p1mb.bam" \
        --sample_name_parent1 "HG003" \
        --reads_parent2 "input/HG004.chr20.10_10p1mb.bam" \
        --sample_name_parent2 "HG004" \
        --reads "input/HG002.chr20.10_10p1mb.bam" \
        --sample_name "HG002" \
        --regions "chr20:10,000,000-10,010,000" \
        --examples "output/intermediate_results_dir/make_examples.tfrecord@${N_SHARDS}.gz" \
        --channels insert_size \
        --gvcf "output/intermediate_results_dir/gvcf.tfrecord@${N_SHARDS}.gz" \
        --pileup_image_height_child "60" \
        --pileup_image_height_parent "40" \
        --task {}
Call variants on the sharded output. The deeptrio models are stored under /opt/models/deeptrio in the singularity image. To parallelize this step, multiple separate jobs would have to be run - for example, one job for each member of the trio.
[user@cn4224 ~]$ for n in parent1 parent2 child ; do
                     # use the child model for the child sample and the parent model for the parents
                     if [[ "$n" == "child" ]] ; then
                         MODEL="/opt/models/deeptrio/wgs/child/model.ckpt"
                     else
                         MODEL="/opt/models/deeptrio/wgs/parent/model.ckpt"
                     fi
                     call_variants \
                       --outfile output/intermediate_results_dir/call_variants_output_${n}.tfrecord.gz \
                       --examples output/intermediate_results_dir/make_examples_${n}.tfrecord@${N_SHARDS}.gz \
                       --checkpoint "${MODEL}" \
                       --writer_threads=6
                 done
Finally, convert the output to VCF format:
[user@cn4224 ~]$ declare -A samples=( [parent1]=HG003 [parent2]=HG004 [child]=HG002 )
[user@cn4224 ~]$ for n in parent1 parent2 child ; do
                     postprocess_variants \
                       --ref input/GRCh38_no_alt_analysis_set.fasta \
                       --infile output/intermediate_results_dir/call_variants_output_${n}.tfrecord.gz \
                       --outfile output/${samples[$n]}.output.vcf.gz \
                       --nonvariant_site_tfrecord_path output/intermediate_results_dir/gvcf_${n}.tfrecord@${N_SHARDS}.gz \
                       --gvcf_outfile output/${samples[$n]}.g.vcf.gz
                 done
[user@cn4224 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file for each step (e.g. deepvariant_step[123].sh) similar to the following. Note that for simplicity there is little error checking in the example scripts. Note also that only step2 makes use of a GPU and that the current version of DeepVariant cannot scale to more than 1 GPU.
#! /bin/bash
# this is deepvariant_step1.sh

module load deepvariant/1.6.0 || exit 1
module load parallel
wd=$PWD
[[ -d /lscratch/${SLURM_JOB_ID} ]] && cd /lscratch/${SLURM_JOB_ID} || exit 1
rm -rf input output
mkdir input output $wd/logs-parallel-$SLURM_JOB_ID
cp "${DEEPVARIANT_TEST_DATA}"/* input || exit 1
REF=input/ucsc.hg19.chr20.unittest.fasta
BAM=input/NA12878_S1.chr20.10_10p1mb.bam
N_SHARDS=12

seq 0 $((N_SHARDS-1)) \
    | parallel -P ${SLURM_CPUS_PER_TASK} --halt 2 \
        --joblog "$wd/logs-parallel-$SLURM_JOB_ID/log" --res "$wd/logs-parallel-$SLURM_JOB_ID" \
        make_examples --mode calling \
          --ref "${REF}" \
          --reads "${BAM}" \
          --regions "chr20:10,000,000-10,010,000" \
          --examples output/examples.tfrecord@${N_SHARDS}.gz \
          --channels insert_size \
          --task {} \
    || exit 1
cp -r output $wd
#! /bin/bash
# this is deepvariant_step2.sh

module load deepvariant/1.6.0 || exit 1
#< 0.9.0 MODEL="/opt/wgs/model.ckpt"
MODEL="/opt/models/wgs"
N_SHARDS=12
CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"

call_variants \
    --outfile "${CALL_VARIANTS_OUTPUT}" \
    --examples "output/examples.tfrecord@${N_SHARDS}.gz" \
    --checkpoint "${MODEL}" \
    --writer_threads=6 || exit 1
#! /bin/bash
# this is deepvariant_step3.sh

module load deepvariant/1.6.0 || exit 1
REF=input/ucsc.hg19.chr20.unittest.fasta
CALL_VARIANTS_OUTPUT="output/call_variants_output.tfrecord.gz"

postprocess_variants \
    --ref "${REF}" \
    --infile "${CALL_VARIANTS_OUTPUT}" \
    --outfile "output/example.vcf.gz" \
    || exit 1
Submit these jobs using the Slurm sbatch command. In this example, job dependencies are used to tie the jobs together.
biowulf$ sbatch --cpus-per-task=6 --mem-per-cpu=2g --gres=lscratch:10 deepvariant_step1.sh
1111
biowulf$ sbatch --dependency=afterany:1111 --cpus-per-task=6 \
           --gres=lscratch:10,gpu:p100:1 --partition=gpu deepvariant_step2.sh
1112
biowulf$ sbatch --dependency=afterany:1112 deepvariant_step3.sh
1113
Create a swarmfile for the first step of the pipeline (e.g. deepvariant.swarm). For example:
make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr1" --examples sample1_chr1.tfrecord.gz --channels insert_size
make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr2" --examples sample1_chr2.tfrecord.gz --channels insert_size
make_examples --mode calling --ref /path/to/genome.fa --reads "sample1.bam" \
    --regions "chr3" --examples sample1_chr3.tfrecord.gz --channels insert_size
Submit this job using the swarm command.
swarm -f deepvariant.swarm [-g #] --module deepvariant/1.6.0

where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
--module deepvariant | Loads the deepvariant module for each subjob in the swarm
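For example, since make_examples is single-threaded and uses 1-2 GB of memory per process, a submission could look like this (the memory value is illustrative):
biowulf$ swarm -f deepvariant.swarm -g 4 -t 1 --module deepvariant/1.6.0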