VEP (Variant Effect Predictor) determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
By default VEP requires internet connectivity to the Ensembl databases. THIS IS NOT POSSIBLE ON THE BIOWULF CLUSTER! Instead, the databases have been locally cached into a version-specific directory $VEP_CACHEDIR, as set by the VEP module, allowing for offline analysis.
Without supplying --dir $VEP_CACHEDIR, VEP will attempt to create a cache directory in your home directory as ~/.vep. This may cause your /home directory to fill up, causing unexpected failures to your jobs. Make sure to include --cache --dir $VEP_CACHEDIR with all VEP commands (including haplo and variant_recoder).
Older versions of VEP used --dir_cache $VEP_CACHEDIR rather than --dir $VEP_CACHEDIR. Both are valid.
--assembly is needed for human sequences because there are two available (GRCh37 and GRCh38). For cat, dog, and mouse, no assembly required.
There are a large number of plugins available for use with VEP. Some of these plugins require third-party reference data. Most of this data is available within $VEP_CACHEDIR, but some are available in the /fdb tree. Here is an example for using plugins:
module load VEP/111 vep \ -i ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \ -o example.out \ --offline \ --cache \ --force_overwrite \ --dir ${VEP_CACHEDIR} \ --species human \ --assembly GRCh38 \ --fasta ${VEP_CACHEDIR}/GRCh38.fa \ --everything \ --plugin CSN \ --plugin Blosum62 \ --plugin Carol \ --plugin Condel,$VEP_CACHEDIR/Plugins/config/Condel/config,b \ --plugin Phenotypes,file=${VEP_CACHEDIR}/Plugins/Phenotypes.pm_human_111_GRCh38.gvf.gz \ --plugin ExAC,$VEP_CACHEDIR/ExAC.r0.3.sites.vep.vcf.gz \ --plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200 \ --plugin CADD,${VEP_CACHEDIR}/CADD_1.4_GRCh38_whole_genome_SNVs.tsv.gz,${VEP_CACHEDIR}/CADD_1.4_GRCh38_InDels.tsv.gz \ --plugin Downstream \ --plugin LoFtool \ --plugin FATHMM,"python ${VEP_CACHEDIR}/fathmm.py"
Here is a synopsis of --plugin examples using GRCh38:
--plugin AlphaMissense,file=${VEP_CACHEDIR}/AlphaMissense_hg38.tsv.gz,cols=all
--plugin BayesDel,file=${VEP_CACHEDIR}//BayesDel_170824_addAF_all_scores_GRCh38.txt.gz
--plugin Blosum62
--plugin CADD,${VEP_CACHEDIR}/CADD_1.4_GRCh38_whole_genome_SNVs.tsv.gz,${VEP_CACHEDIR}/CADD_1.4_GRCh38_InDels.tsv.gz
--plugin Carol
--plugin Condel,${VEP_CACHEDIR}/Plugins/Condel/config,b
--plugin CSN
--plugin dbNSFP,${VEP_CACHEDIR}/dbNSFP4.2a_hg38.txt.gz,LRT_score,GERP++_RS
--plugin dbscSNV,${VEP_CACHEDIR}/dbscSNV1.1.txt.gz
--plugin DisGeNET,file=${VEP_CACHEDIR}/all_variant_disease_pmid_associations_sorted.tsv.gz,disease=1,filter_source=GWASDB&GWASCAT
--plugin DosageSensitivity,file=${VEP_CACHEDIR}//Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
--plugin Downstream
--plugin Enformer,file=$VEP_CACHEDIR/enformer_grch38.vcf.gz
--plugin ExAC,${VEP_CACHEDIR}/ExAC.r0.3.1.sites.vep.vcf.gz
--plugin FATHMM,"python ${VEP_CACHEDIR}/fathmm.py"
--plugin G2P,file=${VEP_CACHEDIR}/SkinG2P.csv
--plugin GeneSplicer,$GS/bin/genesplicer,$GS/human,context=200
--custom file=${VEP_CACHEDIR}/gnomad.genomes.v4.0.sites.noVEP.vcf.gz,\ short_name=gnomADg,format=vcf,type=exact,coords=0,\ fields=AF_afr%AF_amr%AF_asj%AF_eas%AF_fin%AF_nfe%AF_oth
--hgvs --plugin HGVSReferenceBase
--dir_plugins ${VEP_CACHEDIR}/Plugins/loftee_GRCh38 \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh38,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/human_ancestor.fa.gz,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/loftee.sql,\ gerp_bigwig:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw
--plugin LoFtool
--plugin MaxEntScan,${VEP_CACHEDIR}/MaxEntScan
--plugin MTR,${VEP_CACHEDIR}/mtrflatfile_2.0.txt.gz
--plugin OpenTargets,file=${VEP_CACHEDIR}//OTGenetics.tsv.gz
--plugin Phenotypes,file=${VEP_CACHEDIR}/Plugins/Phenotypes.pm_human_111_GRCh38.gvf.gz
--plugin REVEL,${VEP_CACHEDIR}/revel_GRCh38_1.3.tsv.gz
--plugin SpliceAI,snv=${VEP_CACHEDIR}/spliceai_scores.masked.snv.hg38.vcf.gz,indel=${VEP_CACHEDIR}/spliceai_scores.masked.indel.hg38.vcf.gz
--plugin SpliceVault,file=${VEP_CACHEDIR}//SpliceVault_data_GRCh38.tsv.gz
--plugin StructuralVariantOverlap,file=${VEP_CACHEDIR}/gnomad_v2_sv.sites.vcf.gz
--plugin VARITY,file=${VEP_CACHEDIR}//varity_all_predictions_GRCh38.tsv.gz
Some of the plugins have multiple versions of reference files.
NOTE: The name of the variation file for the Phenotypes plugin depends on the version of VEP being used. In the above example, VEP/111 is being used, and the variation file is ${VEP_CACHEDIR}/Plugins/Phenotypes.pm_homo_sapiens_111_GRCh38.gvf.gz. Please run the command ls -l ${VEP_CACHEDIR} to see all the reference files available.
For more information about plugins, type
perldoc $VEP_CACHEDIR/Plugins/[name].pm
where [name] is the name of the plugin.
The LOFTEE (Loss-Of-Function Transcript Effect Estimator) plugin is a bit trickier than others, and is not well documented. It is available for VEP versions >= 101. Here are examples for annotating GRCh37 and GRCh38 aligned VCF files:
ml VEP/101 export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh37 vep \ --offline --cache --dir ${VEP_CACHEDIR} \ --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh37.vcf \ --species human --assembly GRCh37 --fasta ${VEP_CACHEDIR}/GRCh37.fa \ --output_file GRCh37.txt --stats_file GRCh37_summary.txt \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh37,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/phylocsf_gerp.sql,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh37/human_ancestor.fa.gz >& GRCh37.out export PERL5LIB=$PERL5LIB:${VEP_CACHEDIR}/Plugins/loftee_GRCh38 vep \ --offline --cache --dir ${VEP_CACHEDIR} \ --input_file ${VEP_EXAMPLES}/homo_sapiens_GRCh38.vcf \ --species human --assembly GRCh38 --fasta ${VEP_CACHEDIR}/GRCh38.fa \ --output_file GRCh38.txt --stats_file GRCh38_summary.txt \ --plugin LoF,loftee_path:${VEP_CACHEDIR}/Plugins/loftee_GRCh38,\ human_ancestor_fa:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/human_ancestor.fa.gz,\ conservation_file:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/loftee.sql,\ gerp_bigwig:${VEP_CACHEDIR}/Plugins/loftee_GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw >& GRCh38.out
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load VEP [user@cn3144 ~]$ ln -s $VEP_EXAMPLES/homo_sapiens_GRCh38.vcf . [user@cn3144 ~]$ vep -i homo_sapiens_GRCh38.vcf -o test.out --offline --cache --dir $VEP_CACHEDIR --species human --assembly GRCh38 --fasta $VEP_CACHEDIR/GRCh38.fa [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Create a batch input file (e.g. VEP.sh). For example:
#!/bin/bash module load VEP vep -i trial1.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] VEP.sh
Create a swarmfile (e.g. VEP.swarm). For example:
vep -i trial1.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial1.out vep -i trial2.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial2.out vep -i trial3.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial3.out vep -i trial4.vcf --offline --cache --dir $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --output trial4.out
By default, vep will write to the same output file ("variant_effect_output.txt") unless directed to do otherwise using the --output option. For swarms of multiple runs, be sure to include this option.
Submit this job using the swarm command.
swarm -f VEP.swarm [-g #] [-t #] --module VEPwhere
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module VEP | Loads the VEP module for each subjob in the swarm |