VEP on Biowulf

VEP (Variant Effect Predictor) determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.

Important Notes

By default, VEP requires internet connectivity to the Ensembl databases. THIS IS NOT POSSIBLE ON THE BIOWULF CLUSTER! Instead, the databases have been cached locally in a version-specific directory, $VEP_CACHEDIR, which is set by the VEP module, allowing offline analysis.
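A minimal offline run against the local cache might look like the following (the input and output file names here are placeholders):

module load VEP
vep -i my_variants.vcf -o my_variants.out \
 --offline --cache --dir_cache $VEP_CACHEDIR \
 --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa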

Reference Files

--assembly is needed for human sequences because two assemblies are available (GRCh37 and GRCh38). For species with only a single assembly, such as cat, dog, and mouse, --assembly is not required.
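For example (the mouse FASTA file name below is a placeholder; check $VEP_CACHEDIR for the reference files actually provided):

# human data: two assemblies exist, so --assembly is required
vep -i human.vcf -o human.out --offline --cache --dir_cache $VEP_CACHEDIR \
 --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa

# mouse data: only one assembly is available, so --assembly can be omitted
vep -i mouse.vcf -o mouse.out --offline --cache --dir_cache $VEP_CACHEDIR \
 --species mus_musculus --fasta $VEP_CACHEDIR/mus_musculus.fa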

Plugins

There are a large number of plugins available for use with VEP. Some of these plugins require third-party reference data. Most of this data is available within $VEP_CACHEDIR, but some is available in the /fdb tree. Here is an example of using plugins:

module load VEP
# NOTE: $GS must point to a GeneSplicer installation directory (used by the GeneSplicer plugin below)
vep \
 -i $VEP_EXAMPLES/homo_sapiens_GRCh38.vcf \
 -o example.out \
 --offline \
 --cache \
 --force_overwrite \
 --dir_cache $VEP_CACHEDIR \
 --species human \
 --assembly GRCh38 \
 --fasta $VEP_CACHEDIR/GRCh38.fa \
 --plugin CSN \
 --plugin Blosum62 \
 --plugin Carol \
 --plugin Condel,$VEP_CACHEDIR/Plugins/config/Condel/config,b \
 --plugin Phenotypes \
 --plugin ExAC,$VEP_CACHEDIR/ExAC.r0.3.sites.vep.vcf.gz \
 --plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200 \
 --plugin CADD,$VEP_CACHEDIR/whole_genome_SNVs.tsv.gz,$VEP_CACHEDIR/InDels.tsv.gz \
 --plugin Downstream \
 --plugin LoFtool \
 --plugin FATHMM,"python $VEP_CACHEDIR/fathmm.py" \
 --af_gnomad \
 --custom $VEP_CACHEDIR/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz,gnomADg,vcf,exact,0,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH
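To see which plugin modules are bundled with the local cache, list the Plugins directory:

ls $VEP_CACHEDIR/Plugins/*.pm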

For more information about plugins, type

perldoc $VEP_CACHEDIR/Plugins/[name].pm

where [name] is the name of the plugin.
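For example, to read the documentation for the CADD plugin used above:

perldoc $VEP_CACHEDIR/Plugins/CADD.pm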

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load VEP
[user@cn3144 ~]$ ln -s $VEP_EXAMPLES/homo_sapiens_GRCh38.vcf .
[user@cn3144 ~]$ vep -i homo_sapiens_GRCh38.vcf -o test.out --offline --cache --dir_cache $VEP_CACHEDIR --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa 

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. VEP.sh). For example:

#!/bin/bash
module load VEP
vep -i trial1.vcf --offline --cache --dir_cache $VEP_CACHEDIR --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa --output_file trial1.out

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] VEP.sh
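For example, to request 4 CPUs and 8 GB of memory (illustrative values; adjust to your input size):

sbatch --cpus-per-task=4 --mem=8g VEP.sh

Note that vep runs single-threaded unless its --fork option is added to the command in VEP.sh; without it, extra allocated CPUs will sit idle.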
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. VEP.swarm). For example:

vep -i trial1.vcf --offline --cache --dir_cache $VEP_CACHEDIR --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa --output_file trial1.out
vep -i trial2.vcf --offline --cache --dir_cache $VEP_CACHEDIR --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa --output_file trial2.out
vep -i trial3.vcf --offline --cache --dir_cache $VEP_CACHEDIR --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa --output_file trial3.out
vep -i trial4.vcf --offline --cache --dir_cache $VEP_CACHEDIR --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa --output_file trial4.out

By default, vep writes to the same output file ("variant_effect_output.txt") unless directed to do otherwise with the --output_file (-o) option. For swarms of multiple runs, be sure to include this option with a distinct output file name on each line.

Submit this job using the swarm command.

swarm -f VEP.swarm [-g #] [-t #] --module VEP
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module VEP Loads the VEP module for each subjob in the swarm
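For example, to allocate 8 GB of memory to each process (an illustrative value; the single-threaded commands above do not need more than the default single CPU):

swarm -f VEP.swarm -g 8 --module VEP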