Biowulf High Performance Computing at the NIH
ANNOVAR

Description

ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome builds hg18 and hg19, as well as mouse, worm, fly, yeast, and many others).

Citation

If you use ANNOVAR, please cite:

How to Use

There are multiple versions of ANNOVAR available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail annovar

To select a module, type

module load annovar/[ver]

where [ver] is the version of choice. This will set your $PATH variable, as well as $ANNOVAR_HOME and $ANNOVAR_DATA.
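
Before running anything, it can help to confirm that the module actually set these variables. The following is a minimal sketch; the function name check_annovar_env is made up for this illustration and is not part of ANNOVAR or the module system.

```shell
# Hypothetical helper (not part of ANNOVAR): report whether the variables
# set by "module load annovar" are present in the current shell.
check_annovar_env() {
    status=0
    for name in ANNOVAR_HOME ANNOVAR_DATA; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set; load an annovar module first" >&2
            status=1
        fi
    done
    return "$status"
}

check_annovar_env || echo "ANNOVAR environment is incomplete"
```

The function returns a non-zero status when either variable is missing, so it can also be used as a guard at the top of a batch script.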

ANNOVAR takes text-based input files, where each line corresponds to one variant. On each line, the first five space- or tab-delimited columns represent the chromosome, start position, end position, reference nucleotides, and observed nucleotides. Here is the example file $ANNOVAR_HOME/example/ex1.avinput:

1	948921	948921	T	C	comments: rs15842, a SNP in 5' UTR of ISG15
1	1404001	1404001	G	T	comments: rs149123833, a SNP in 3' UTR of ATAD3C
1	5935162	5935162	A	T	comments: rs1287637, a splice site variant in NPHP4
1	162736463	162736463	C	T	comments: rs1000050, a SNP in Illumina SNP arrays
1	84875173	84875173	C	T	comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1	13211293	13211294	TC	-	comments: rs59770105, a 2-bp deletion
1	11403596	11403596	-	AT	comments: rs35561142, a 2-bp insertion
1	105492231	105492231	A	ATAAA	comments: rs10552169, a block substitution
1	67705958	67705958	G	A	comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2	234183368	234183368	A	G	comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16	50745926	50745926	C	T	comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16	50756540	50756540	G	C	comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16	50763778	50763778	-	C	comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13	20763686	20763686	G	-	comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13	20797176	21105944	0	-	comments: a 342kb deletion encompassing GJB6, associated with hearing loss
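
The column layout described above can be checked mechanically before annotation. The sketch below is an illustration only, not an ANNOVAR tool. It writes a three-line sample (two lines abbreviated from ex1.avinput plus one deliberately truncated line) to a scratch file named sample.avinput, then uses a hypothetical check_avinput function to flag lines with fewer than five fields or with an end position smaller than the start position. Those two checks are assumptions about what is worth catching early; ANNOVAR performs its own validation.

```shell
# Build a small sample: two valid lines and one line that is missing
# the observed-nucleotides column.
printf '1\t948921\t948921\tT\tC\tcomments: rs15842\n' > sample.avinput
printf '1\t13211293\t13211294\tTC\t-\tcomments: rs59770105\n' >> sample.avinput
printf '1\t5935162\t5935162\tA\n' >> sample.avinput

# Hypothetical helper: awk splits on runs of spaces or tabs by default,
# which matches the space- or tab-delimited rule for the first five columns.
check_avinput() {
    awk '
        NF < 5  { printf "line %d: expected at least 5 columns, found %d\n", NR, NF; bad++; next }
        $3 < $2 { printf "line %d: end position %s is before start %s\n", NR, $3, $2; bad++ }
        END     { printf "%d problem line(s)\n", bad + 0 }
    ' "$1"
}

check_avinput sample.avinput
```

For the sample above, only the truncated third line is reported.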

Reference files are pre-installed in $ANNOVAR_DATA/{build}, where {build} can be either hg18 or hg19. If other builds are needed, contact staff@hpc.nih.gov.

At the command line, type

[helix]$ cp $ANNOVAR_HOME/example/ex1.avinput .
[helix]$ annotate_variation.pl --geneanno --dbtype refGene --buildver hg19 ex1.avinput $ANNOVAR_DATA/hg19

table_annovar.pl

The table_annovar.pl script allows annotate_variation.pl to be run on a single input file against multiple databases simultaneously, using multiple CPUs. Here is an example:

table_annovar.pl ex1.avinput $ANNOVAR_DATA/hg19 \
  --tempdir /path/to/temporary/directory \
  --thread 4 \
  --buildver hg19 \
  --outfile ex1.out \
  --remove \
  --protocol gene,avsift,ljb26_all,dbnsfp30a,cg46,dbscsnv11,cosmic64,cosmic70,exac03,exac03nontcga,1000g2015aug_all,1000g2012apr_all,snp138,avsnp147,clinvar_20160302 \
  --operation g,f,f,f,f,f,f,f,f,f,f,f,f,f,f \
  --nastring ''

Type table_annovar.pl --help for more information about running it.
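
table_annovar.pl expects one operation code per protocol entry, so with long lists like the ones above it is easy to drop or add an item by accident. The sketch below counts the comma-separated entries in each list before the job is submitted. The variable names and the count_fields helper are hypothetical, introduced only for this illustration.

```shell
# Same protocol and operation lists as in the table_annovar.pl example.
protocol='gene,avsift,ljb26_all,dbnsfp30a,cg46,dbscsnv11,cosmic64,cosmic70,exac03,exac03nontcga,1000g2015aug_all,1000g2012apr_all,snp138,avsnp147,clinvar_20160302'
operation='g,f,f,f,f,f,f,f,f,f,f,f,f,f,f'

# Hypothetical helper: count the comma-separated entries in a string.
count_fields() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

n_protocol=$(count_fields "$protocol")
n_operation=$(count_fields "$operation")

if [ "$n_protocol" -eq "$n_operation" ]; then
    echo "OK: $n_protocol protocols, $n_operation operations"
else
    echo "Mismatch: $n_protocol protocols but $n_operation operations" >&2
fi
```

Keeping the two lists in shell variables also makes it possible to pass them to the command as --protocol "$protocol" --operation "$operation" instead of retyping them.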

Biowulf Cluster Use

sbatch

Create an sbatch file (script.sh):

#!/bin/bash
module load annovar
annotate_variation.pl --geneanno --dbtype gene --buildver hg19 ex1.avinput $ANNOVAR_DATA/hg19

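Then submit the script. As written, it does not pass a --thread option, so no extra CPUs need to be requested:

sbatch script.sh

If the script instead runs a multithreaded command (for example, table_annovar.pl with --thread 8), supply --cpus-per-task so that the same number of CPUs is allocated on a single node:

sbatch --cpus-per-task=8 script.sh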

swarm

The easiest way to run ANNOVAR with multiple VCF files is via swarm. Create a file containing these lines:

convert2annovar.pl -format vcf4 file1.vcf > file1.inp; annotate_variation.pl --geneanno --dbtype gene --buildver hg19 file1.inp $ANNOVAR_DATA/hg19
convert2annovar.pl -format vcf4 file2.vcf > file2.inp; annotate_variation.pl --geneanno --dbtype gene --buildver hg19 file2.inp $ANNOVAR_DATA/hg19
convert2annovar.pl -format vcf4 file3.vcf > file3.inp; annotate_variation.pl --geneanno --dbtype gene --buildver hg19 file3.inp $ANNOVAR_DATA/hg19
convert2annovar.pl -format vcf4 file4.vcf > file4.inp; annotate_variation.pl --geneanno --dbtype gene --buildver hg19 file4.inp $ANNOVAR_DATA/hg19

Then submit with the --module option:

swarm -f swarmfile --module annovar
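
When there are many VCF files, the swarm file can be generated rather than typed by hand. The sketch below writes one line per input in the same form as the lines above. The input names file1.vcf through file4.vcf and the output name swarmfile are placeholders. $ANNOVAR_DATA sits inside a single-quoted format string, so it is written literally and expanded only when the job runs with the annovar module loaded.

```shell
# Start with an empty swarm file.
: > swarmfile

# Placeholder input names; replace with your own VCF files
# (for example, a glob such as *.vcf).
for vcf in file1.vcf file2.vcf file3.vcf file4.vcf; do
    inp="${vcf%.vcf}.inp"    # file1.vcf -> file1.inp
    # The %s slots take the file names; $ANNOVAR_DATA stays unexpanded.
    printf 'convert2annovar.pl -format vcf4 %s > %s; annotate_variation.pl --geneanno --dbtype gene --buildver hg19 %s $ANNOVAR_DATA/hg19\n' \
        "$vcf" "$inp" "$inp" >> swarmfile
done

cat swarmfile
```

This sketch assumes file names without spaces or shell metacharacters; names containing them would need quoting inside the generated lines.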

Notes On Reference Files

Some of the reference files for ANNOVAR are updated on a regular basis. The environment variable $ANNOVAR_DATA points to the reference files as they existed when ANNOVAR was last updated, so some of those files may not be current. To use the most up-to-date reference files, use /fdb/annovar/current as the base directory. For example,

annotate_variation.pl --geneanno --dbtype refGene --buildver hg19 ex1.avinput /fdb/annovar/current/hg19

Alternatively, the environment variable $ANNOVAR_DATA_CURRENT can be used instead:

annotate_variation.pl --geneanno --dbtype refGene --buildver hg19 ex1.avinput $ANNOVAR_DATA_CURRENT/hg19

Please note that the reference files in /fdb/annovar/current are subject to change. This means that identical ANNOVAR jobs run on different days may give different results. For more information, contact staff@hpc.nih.gov.
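
Because the contents of /fdb/annovar/current can change between runs, it may be worth recording which reference files a job actually saw. The sketch below is one possible approach, not a Biowulf-provided tool: a hypothetical function snapshot_refs writes the directory name, the current date, and an ls -l listing (file sizes and modification times) to a manifest file that can be kept with the results. The demonstration runs against a scratch directory with stand-in file names; on the cluster the first argument would be a directory such as $ANNOVAR_DATA_CURRENT/hg19.

```shell
# Hypothetical helper: record sizes and modification times of the files in
# a reference directory so that a later run can be compared against it.
snapshot_refs() {
    ref_dir=$1
    manifest=$2
    {
        echo "# reference directory: $ref_dir"
        echo "# recorded: $(date)"
        ls -l "$ref_dir"
    } > "$manifest"
}

# Demonstration on a scratch directory with stand-in file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/hg19_refGene.txt" "$demo_dir/hg19_refGeneMrna.fa"
snapshot_refs "$demo_dir" refs_manifest.txt
cat refs_manifest.txt
```

Comparing two manifests with diff then shows which reference files changed between runs; the "recorded" line will always differ.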

Documentation