Vcfanno on NIH HPC Systems

vcfanno annotates a VCF with any number of sorted and tabixed input BED, BAM, and VCF files in parallel. It does this by finding overlaps as it streams over the data and applying user-defined operations on the overlapping annotations.

To parallelize, work is broken down as follows. A slice (array) of query intervals is accumulated until a specified number is reached (usually ~5K-25K) or a gap cutoff is exceeded; at that point, the bounds of the region are used to perform a tabix (or any regional) query on the database files. This is all handled by the irelate library. vcfanno then iterates over the streams that result from the tabix queries and finds intersections with the query stream. This is a parallel chrom-sweep, a method that avoids problems with chromosome ordering.
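
For intuition, the regional fetch for each chunk is equivalent to running a manual tabix query over the chunk's bounding region. A minimal sketch (the annotation file name and coordinates here are hypothetical):

# fetch all database records overlapping one chunk's bounding region
tabix annotations.vcf.gz chr1:10000-2500000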

For VCF, values are pulled by name from the INFO field. For BED, values are pulled by (1-based) column number. For BAM, depth ("count"), "mapq", and "seq" are currently supported.
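
As an illustration, a minimal conf.toml might look like the following. This is a sketch only: the annotation file names, INFO fields, and column numbers are placeholders; see the copy of example/conf.toml below for a working configuration.

# hypothetical annotation sources; adjust file, field, and column names to your data
cat > conf.toml <<'EOF'
[[annotation]]
file="ExAC.vcf.gz"
fields=["AC", "AN"]          # pulled by name from the INFO field
ops=["first", "first"]
names=["exac_ac", "exac_an"]

[[annotation]]
file="scores.bed.gz"
columns=[4]                  # pulled by (1-based) column number
ops=["mean"]
names=["score_mean"]
EOF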

Example files can be copied from /usr/local/apps/vcfanno/example.
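
For instance, to copy them into your data directory:

cp -r /usr/local/apps/vcfanno/example /data/$USER/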

On Helix

Sample session:


[susanc@helix ~]$ module load vcfanno
[susanc@helix ~]$ vcfanno -p 2 -lua example/custom.lua example/conf.toml example/query.vcf.gz > annotated.vcf
-p Sets the number of processes that vcfanno can use during annotation. vcfanno parallelizes well up to 15 or so cores; for multithreaded runs, use the Biowulf cluster instead of Helix.

Batch job on Biowulf

Create a batch input file (e.g. vcfanno.sh). For example:

#!/bin/bash
module load vcfanno

cd /data/$USER/dir
vcfanno -p $SLURM_CPUS_PER_TASK -lua example/custom.lua example/conf.toml example/query.vcf.gz > annotated.vcf

Submit this job using the Slurm sbatch command. Note that the variable $SLURM_CPUS_PER_TASK is used within the batch file to specify the number of threads that the program should spawn. This variable is set by Slurm when the job runs, and matches the value specified in --cpus-per-task=# in the sbatch command below.

sbatch --cpus-per-task=4 vcfanno.sh
Swarm of Jobs on Biowulf

Create a swarmfile (e.g. vcfanno.swarm). For example:

# this file is called vcfanno.swarm
cd dir1; vcfanno -p $SLURM_CPUS_PER_TASK -lua custom.lua conf.toml query.vcf.gz > annotated.vcf
cd dir2; vcfanno -p $SLURM_CPUS_PER_TASK -lua custom.lua conf.toml query.vcf.gz > annotated.vcf
cd dir3; vcfanno -p $SLURM_CPUS_PER_TASK -lua custom.lua conf.toml query.vcf.gz > annotated.vcf
[...]

Submit this job using the swarm command.

swarm -f vcfanno.swarm -t 4 --module vcfanno

The -t flag sets the number of CPUs allocated to each command in the swarm file; this value is passed to each command as $SLURM_CPUS_PER_TASK.
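
If each command needs more memory than the default, the -g flag (GB of memory per process) can be added; for example:

swarm -f vcfanno.swarm -t 4 -g 8 --module vcfanno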

Interactive job on Biowulf
Allocate an interactive session and run vcfanno. Sample session:
[susanc@biowulf ~]$ sinteractive --cpus-per-task=4
salloc.exe: Pending job allocation 15194042
salloc.exe: job 15194042 queued and waiting for resources
salloc.exe: job 15194042 has been allocated resources
salloc.exe: Granted job allocation 15194042
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1719 are ready for job

[susanc@cn1719 ~]$ module load vcfanno

[susanc@cn1719 ~]$ vcfanno -p $SLURM_CPUS_PER_TASK -lua custom.lua conf.toml query.vcf.gz > annotated.vcf
Documentation
vcfanno on GitHub: https://github.com/brentp/vcfanno