VEP (Variant Effect Predictor) determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
References:
- VEP Main Page
- VEP Tutorial
- ensembl-vep Git Page
-
- Old VEP versions
- VEP Main Page (version 111)
- VEP Main Page (version 110)
- VEP Main Page (version 109)
- VEP Main Page (version 108)
- VEP Main Page (version 107)
- VEP Main Page (version 106)
- VEP Main Page (version 105)
- VEP Main Page (version 104)
- VEP Main Page (version 103)
- VEP Main Page (version 102)
- VEP Main Page (version 101)
- VEP Main Page (version 100)
- Old VEP versions
- Module Name: VEP (see the modules page for more information)
- Singlethreaded
- Environment variables set
- VEP_HOME
- VEP_EXAMPLES
- VEP_CACHEDIR -- reference data and plugins
- GS -- location of the GeneSplicer application
- PERL5LIB
- Example files in $VEP_EXAMPLES
- Reference data in $VEP_CACHEDIR
By default VEP requires internet connectivity to the Ensembl databases. THIS IS NOT POSSIBLE ON THE BIOWULF CLUSTER! Instead, the databases have been locally cached into a version-specific directory $VEP_CACHEDIR, as set by the VEP module, allowing for offline analysis.
Without supplying --dir $VEP_CACHEDIR, VEP will attempt to create a cache directory in your home directory as ~/.vep. This may cause your /home directory to fill up, causing unexpected failures to your jobs. Make sure to include --cache --dir $VEP_CACHEDIR with all VEP commands (including haplo and variant_recoder).
Older versions of VEP used --dir_cache $VEP_CACHEDIR rather than --dir $VEP_CACHEDIR. Both are valid.
Commands
- convert_cache.pl
- filter_vep: filter the output of VEP
- vep: predict effect of variants
- haplo: Ensembl transcript haplotypes view
- variant_recoder: translating between different variant encodings
Reference Files
--assembly is needed for human sequences because there are two available (GRCh37 and GRCh38). For cat, dog, and mouse, no assembly required.
- --fasta $VEP_CACHEDIR/cat.fa --species cat
- --fasta $VEP_CACHEDIR/dog.fa --species dog
- --fasta $VEP_CACHEDIR/mouse.fa --species mouse
- --fasta $VEP_CACHEDIR/zebrafish.fa --species zebrafish
- --fasta $VEP_CACHEDIR/GRCh38.fa --species human --assembly GRCh38
- --fasta $VEP_CACHEDIR/GRCh37.fa --species human --assembly GRCh37
Plugins
There are a large number of plugins available for use with VEP. Some of these plugins require third-party reference data. Most of this data is available within $VEP_CACHEDIR, but some are available in the /fdb tree. Here is an example for using plugins:
module load VEP/111 vep \ -i ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \ -o example.out \ --offline \ --cache \ --force_overwrite \ --dir ${VEP_CACHEDIR} \ --species human \ --assembly GRCh38 \ --fasta ${VEP_CACHEDIR}/GRCh38.fa \ --everything \ --plugin CSN \ --plugin Blosum62 \ --plugin Carol \ --plugin Condel,$VEP_CACHEDIR/Plugins/config/Condel/config,b \ --plugin Phenotypes,file=${VEP_CACHEDIR}/Plugins/Phenotypes.pm_human_111_GRCh38.gvf.gz \ --plugin ExAC,$VEP_CACHEDIR/ExAC.r0.3.sites.vep.vcf.gz \ --plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200 \ --plugin CADD,${VEP_CACHEDIR}/CADD_1.4_GRCh38_whole_genome_SNVs.tsv.gz,${VEP_CACHEDIR}/CADD_1.4_GRCh38_InDels.tsv.gz \ --plugin Downstream \ --plugin LoFtool \ --plugin FATHMM,"python ${VEP_CACHEDIR}/fathmm.py"
Here is a synopsis of --plugin examples using GRCh38:
- AlphaMissense:
--plugin AlphaMissense,file=${VEP_CACHEDIR}/AlphaMissense_hg38.tsv.gz,cols=all
- BayesDel:
--plugin BayesDel,file=${VEP_CACHEDIR}//BayesDel_170824_addAF_all_scores_GRCh38.txt.gz
- Blosum62:
--plugin Blosum62
- CADD:
--plugin CADD,${VEP_CACHEDIR}/CADD_1.4_GRCh38_whole_genome_SNVs.tsv.gz,${VEP_CACHEDIR}/CADD_1.4_GRCh38_InDels.tsv.gz
- Carol:
--plugin Carol
- Condel:
--plugin Condel,${VEP_CACHEDIR}/Plugins/Condel/config,b
- CSN
--plugin CSN
- dbNSFP:
--plugin dbNSFP,${VEP_CACHEDIR}/dbNSFP4.2a_hg38.txt.gz,LRT_score,GERP++_RS
- dbscSNV:
--plugin dbscSNV,${VEP_CACHEDIR}/dbscSNV1.1.txt.gz
- DisGeNET:
--plugin DisGeNET,file=${VEP_CACHEDIR}/all_variant_disease_pmid_associations_sorted.tsv.gz,disease=1,filter_source=GWASDB&GWASCAT
- DosageSensitivity:
--plugin DosageSensitivity,file=${VEP_CACHEDIR}//Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
- Downstream:
--plugin Downstream
- Enformer:
--plugin Enformer,file=$VEP_CACHEDIR/enformer_grch38.vcf.gz
- ExAC:
--plugin ExAC,${VEP_CACHEDIR}/ExAC.r0.3.1.sites.vep.vcf.gz
- FATHMM:
--plugin FATHMM,"python ${VEP_CACHEDIR}/fathmm.py"
- G2P:
--plugin G2P,file=${VEP_CACHEDIR}/SkinG2P.csv
- GeneSplicer:
--plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200
- gnomad:
--custom file=${VEP_CACHEDIR}/gnomad.genomes.v4.0.sites.noVEP.vcf.gz,\ short_name=gnomADg,format=vcf,type=exact,coords=0,\ fields=AF_afr%AF_amr%AF_asj%AF_eas%AF_fin%AF_nfe%AF_oth
- HGVSReferenceBase:
--hgvs --plugin HGVSReferenceBase
- Lof:
--dir_plugins ${VEP_CACHEDIR}/Plugins/loftee_GRCh38 \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh38,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/human_ancestor.fa.gz,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/loftee.sql,\ gerp_bigwig:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw
- LoFtool:
--plugin LoFtool
- MaxEntScan:
--plugin MaxEntScan,${VEP_CACHEDIR}/MaxEntScan
- MTR:
--plugin MTR,${VEP_CACHEDIR}/mtrflatfile_2.0.txt.gz
- OpenTargets:
--plugin OpenTargets,file=${VEP_CACHEDIR}//OTGenetics.tsv.gz
- Phenotypes:
--plugin Phenotypes,file=${VEP_CACHEDIR}/Plugins/Phenotypes.pm_human_111_GRCh38.gvf.gz
- REVEL:
--plugin REVEL,${VEP_CACHEDIR}/revel_GRCh38_1.3.tsv.gz
- SpliceAI:
--plugin SpliceAI,snv=${VEP_CACHEDIR}/spliceai_scores.masked.snv.hg38.vcf.gz,indel=${VEP_CACHEDIR}/spliceai_scores.masked.indel.hg38.vcf.gz
- SpliceVault:
--plugin SpliceVault,file=${VEP_CACHEDIR}//SpliceVault_data_GRCh38.tsv.gz
- StructuralVariantOverlap:
--plugin StructuralVariantOverlap,file=${VEP_CACHEDIR}/gnomad_v2_sv.sites.vcf.gz
- VARITY:
--plugin VARITY,file=${VEP_CACHEDIR}//varity_all_predictions_GRCh38.tsv.gz
Some of the plugins have multiple versions of reference files.
NOTE: The name of the variation file for the Phenotypes plugin depends on the version of VEP being used. In the above example, VEP/111 is being used, and the variation file is ${VEP_CACHEDIR}/Plugins/Phenotypes.pm_homo_sapiens_111_GRCh38.gvf.gz. Please run the command ls -l ${VEP_CACHEDIR} to see all the reference files available.
For more information about plugins, type
perldoc $VEP_CACHEDIR/Plugins/[name].pm
where [name] is the name of the plugin.
LOFTEE Plugin
The LOFTEE (Loss-Of-Function Transcript Effect Estimator) plugin is a bit trickier than others, and is not well documented. It is available for VEP versions >= 101. Here are examples for annotating GRCh37 and GRCh38 aligned VCF files:
ml VEP/101 export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh37 vep \ --offline --cache --dir ${VEP_CACHEDIR} \ --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh37.vcf \ --species human --assembly GRCh37 --fasta ${VEP_CACHEDIR}/GRCh37.fa \ --output_file GRCh37.txt --stats_file GRCh37_summary.txt \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh37,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/phylocsf_gerp.sql,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/human_ancestor.fa.gz >& GRCh37.out export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh38 vep \ --offline --cache --dir ${VEP_CACHEDIR} \ --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \ --species human --assembly GRCh38 --fasta ${VEP_CACHEDIR}/GRCh38.fa \ --output_file GRCh38.txt --stats_file GRCh38_summary.txt \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh38,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/human_ancestor.fa.gz,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/loftee.sql,\ gerp_bigwig:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw >& GRCh38.out
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load VEP [user@cn3144 ~]$ ln -s $VEP_EXAMPLES/homo_sapiens_GRCh38.vcf . [user@cn3144 ~]$ vep -i homo_sapiens_GRCh38.vcf -o test.out --offline --cache --dir $VEP_CACHEDIR --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Create a batch input file (e.g. VEP.sh). For example:
#!/bin/bash module load VEP vep -i trial1.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] VEP.sh
Create a swarmfile (e.g. VEP.swarm). For example:
vep -i trial1.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out vep -i trial2.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial2.out vep -i trial3.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial3.out vep -i trial4.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial4.out
By default, vep will write to the same output file ("variant_effect_output.txt") unless directed to do otherwise using the --output option. For swarms of multiple runs, be sure to include this option.
Submit this job using the swarm command.
swarm -f VEP.swarm [-g #] [-t #] --module VEPwhere
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module VEP | Loads the VEP module for each subjob in the swarm |