High-Performance Computing at the NIH
Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing

The accurate identification of DNA sequence variants is particularly difficult for single molecule sequencing, which has a high per-nucleotide error rate (~5%-15%). Clairvoyante implements a multitask five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. Using well-characterized testing data, Clairvoyante achieved 99.73%, 97.68% and 95.36% precision on known variants, and 98.65%, 92.57% and 77.89% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
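Sample session on a CPU node (the --gres=gpu:p100:1 option shown in the allocation requests a P100 GPU and can be omitted for a CPU-only run):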

[user@biowulf ~]$ sinteractive --gres=gpu:p100:1 --mem=16g
[user@cn2379 ~]$ module load Clairvoyante
[user@cn2379 ~]$ time cv python /Clairvoyante/clairvoyante/callVarBam.py --chkpnt_fn \
    /fdb/Clairvoyante/trained_models/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500 \
 --bam_fn /fdb/Clairvoyante/testing_data/chr21/chr21.bam \
 --ref_fn /fdb/Clairvoyante/testing_data/chr21/chr21.fa \
 --call_fn tensor_can_chr21.vcf \
 --ctgName chr21 
Using CPU implementation
Delay 2 seconds before starting variant calling ...
Loading model ...
Restoring parameters from /fdb/Clairvoyante/trained_models/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500
Calling variants ...
Processed 1000 tensors
Processed 2000 tensors
Processed 3000 tensors
...
Processed 135000 tensors
Processed 136000 tensors
Processed 136725 tensors
Total time elapsed: 594.45 s
...
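Once the run finishes, a quick sanity check is to count the records in the file written via --call_fn. The following is a minimal sketch, assuming the output is a plain-text VCF in which header lines start with '#'. The helper name count_vcf_records is hypothetical and is not part of Clairvoyante.

```shell
#!/bin/bash
# count_vcf_records: hypothetical helper, not part of Clairvoyante.
# Prints the number of non-header lines (lines not starting with '#') in a VCF.
count_vcf_records() {
    # grep exits with status 1 when the count is 0; "|| true" keeps scripts
    # that use "set -e" from aborting in that case.
    grep -vc '^#' "$1" || true
}

vcf="tensor_can_chr21.vcf"   # file name passed to --call_fn in the session above
if [ -f "$vcf" ]; then
    echo "$vcf: $(count_vcf_records "$vcf") variant records"
fi
```

If the file is not present in the current directory, the script prints nothing.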

For variant calling with trained models, a CPU node suffices.
On a P100 GPU node, the same command takes 608.78 s, i.e. it is no faster than on the CPU node (594.45 s).
The real advantage of GPU nodes becomes apparent when training the models.
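The comparison can be worked through from the numbers quoted in this section. The sketch below only does the arithmetic with awk on those figures; the variable names are illustrative and no new measurements are involved.

```shell
#!/bin/bash
export LC_ALL=C    # force '.' as the decimal separator in awk output

cpu_s=594.45       # "Total time elapsed" in the sample session
gpu_s=608.78       # same command on a P100 GPU node
tensors=136725     # last "Processed ... tensors" line in the sample session

# Ratio of GPU to CPU wall time; a value above 1 means the GPU run was slower.
ratio=$(awk -v g="$gpu_s" -v c="$cpu_s" 'BEGIN { printf "%.3f", g / c }')
# Tensors processed per second in the CPU run.
rate=$(awk -v n="$tensors" -v c="$cpu_s" 'BEGIN { printf "%.1f", n / c }')

echo "GPU/CPU wall-time ratio: $ratio"      # 1.024
echo "CPU throughput: $rate tensors/s"      # 230.0
```

The GPU run is about 2.4% slower for this inference-only workload.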

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. cv.sh). For example:

#!/bin/bash
set -e
module load Clairvoyante
cd /data/$USER
cv python /Clairvoyante/clairvoyante/callVarBam.py --chkpnt_fn \
    /fdb/Clairvoyante/trained_models/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500 \
 --bam_fn /fdb/Clairvoyante/testing_data/chr21/chr21.bam \
 --ref_fn /fdb/Clairvoyante/testing_data/chr21/chr21.fa \
 --call_fn tensor_can_chr21.vcf \
 --ctgName chr21
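Because the script uses set -e, it stops at the first failing command. It can also help to fail early with a clear message when an input file is missing. The following is a sketch of such a check that could be placed before the cv command. The helpers require_file and check_inputs are hypothetical and not part of Clairvoyante or the module. The checkpoint path is deliberately not checked, on the assumption that it may be a file-name prefix rather than a single file.

```shell
#!/bin/bash
# require_file: hypothetical helper, not part of Clairvoyante.
# Returns non-zero and prints a message on stderr if the file is not readable.
require_file() {
    if [ ! -r "$1" ]; then
        echo "ERROR: cannot read input file: $1" >&2
        return 1
    fi
}

# check_inputs: hypothetical helper; stops at the first unreadable file.
check_inputs() {
    local f
    for f in "$@"; do
        require_file "$f" || return 1
    done
}

# Usage inside cv.sh, before the cv command (paths from the example above):
# check_inputs /fdb/Clairvoyante/testing_data/chr21/chr21.bam \
#              /fdb/Clairvoyante/testing_data/chr21/chr21.fa
```

With set -e in effect, a failing check_inputs call ends the job before any compute time is spent on variant calling.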

Submit this job using the Slurm sbatch command.

sbatch cv.sh
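The interactive example requested 16 GB of memory, and a batch submission can request the same with sbatch's --mem option. The sketch below builds that command. The DRY_RUN switch is a hypothetical convenience so the command can be inspected without submitting, and the 16g value is simply copied from the interactive session, not a measured requirement.

```shell
#!/bin/bash
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = submit
mem="16g"                 # same value as --mem in the interactive example

cmd=(sbatch --mem="$mem" cv.sh)
submit_line="${cmd[*]}"

if [ "$DRY_RUN" = "1" ]; then
    echo "$submit_line"
else
    "${cmd[@]}"
fi
```

Run with DRY_RUN=0 on a login node to submit the job.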