Biowulf High Performance Computing at the NIH
VEP on Biowulf

VEP (Variant Effect Predictor) determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.

References:

Documentation
Important Notes

By default VEP requires internet connectivity to the Ensembl databases. THIS IS NOT POSSIBLE ON THE BIOWULF CLUSTER! Instead, the databases have been locally cached into a version-specific directory $VEP_CACHEDIR, as set by the VEP module, allowing for offline analysis.

Without supplying --dir_cache $VEP_CACHEDIR, VEP will attempt to create a cache directory in your home directory as ~/.vep. This may cause your /home directory to fill up, causing unexpected failures to your jobs. Make sure to include --cache --dir_cache $VEP_CACHEDIR with all VEP commands.

Commands

Reference Files

--assembly is needed for human sequences because there are two available (GRCh37 and GRCh38). For cat, dog, and mouse, no assembly required.

Plugins

There are a large number of plugins available for use with VEP. Some of these plugins require third-party reference data. Most of this data is available within $VEP_CACHEDIR, but some are available in the /fdb tree. Here is an example for using plugins:

module load VEP/104
vep \
 -i ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \
 -o example.out \
 --offline \
 --cache \
 --force_overwrite \
 --dir_cache ${VEP_CACHEDIR} \
 --species human \
 --assembly GRCh38 \
 --fasta ${VEP_CACHEDIR}/GRCh38.fa \
 --plugin CSN \
 --plugin Blosum62 \
 --plugin Carol \
 --plugin Condel,$VEP_CACHEDIR/Plugins/config/Condel/config,b \
 --plugin Phenotypes,file=${VEP_CACHEDIR}/Plugins/Phenotypes.pm_human_104_GRCh38.gvf.gz \
 --plugin ExAC,$VEP_CACHEDIR/ExAC.r0.3.sites.vep.vcf.gz \
 --plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200 \
 --plugin CADD,${VEP_CACHEDIR}/CADD_1.4_GRCh38_whole_genome_SNVs.tsv.gz,${VEP_CACHEDIR}/CADD_1.4_GRCh38_InDels.tsv.gz \
 --plugin Downstream \
 --plugin LoFtool \
 --plugin FATHMM,"python ${VEP_CACHEDIR}/fathmm.py" \
 --af_gnomad \
 --custom $VEP_CACHEDIR/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz,gnomADg,vcf,exact,0,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH

Here is a synopsis of --plugin examples using GRCh38:

Some of the plugins have multiple versions of reference files.

NOTE: The name of the variation file for the Phenotypes plugin depends on the version of VEP being used. In the above example, VEP/104 is being used, and the variation file is ${VEP_CACHEDIR}/Plugins/Phenotypes.pm_homo_sapiens_104_GRCh38.gvf.gz. Please run the command ls -l ${VEP_CACHEDIR} to see all the reference files available.

For more information about plugins, type

perldoc $VEP_CACHEDIR/Plugins/[name].pm

where [name] is the name of the plugin.

LOFTEE Plugin

The LOFTEE (Loss-Of-Function Transcript Effect Estimator) plugin is a bit trickier than others, and is not well documented. It is available for VEP versions >= 101. Here are examples for annotating GRCh37 and GRCh38 aligned VCF files:

ml VEP/101

export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh37
vep \
  --offline --cache --dir_cache ${VEP_CACHEDIR} \
  --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh37.vcf \
  --species human --assembly GRCh37 --fasta ${VEP_CACHEDIR}/GRCh37.fa \
  --output_file GRCh37.txt --stats_file GRCh37_summary.txt \
  --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh37,\
conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/phylocsf_gerp.sql,\
human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/human_ancestor.fa.gz >& GRCh37.out &

export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh38
vep \
  --offline --cache --dir_cache ${VEP_CACHEDIR} \
  --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \
  --species human --assembly GRCh38 --fasta ${VEP_CACHEDIR}/GRCh38.fa \
  --output_file GRCh38.txt --stats_file GRCh38_summary.txt \
  --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh38,\
human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/human_ancestor.fa.gz,\
conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/loftee.sql,\
gerp_bigwig:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load VEP
[user@cn3144 ~]$ ln -s $VEP_EXAMPLES/homo_sapiens_GRCh38.vcf .
[user@cn3144 ~]$ vep -i homo_sapiens_GRCh38.vcf -o test.out --offline --cache --dir_cache $VEP_CACHEDIR --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa 

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. VEP.sh). For example:

#!/bin/bash
module load VEP
vep -i trial1.vcf --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] VEP.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. VEP.swarm). For example:

vep -i trial1.vcf --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out
vep -i trial2.vcf --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial2.out
vep -i trial3.vcf --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial3.out
vep -i trial4.vcf --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial4.out

By default, vep will write to the same output file ("variant_effect_output.txt") unless directed to do otherwise using the --output option. For swarms of multiple runs, be sure to include this option.

Submit this job using the swarm command.

swarm -f VEP.swarm [-g #] [-t #] --module VEP
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module VEP Loads the VEP module for each subjob in the swarm