High-Performance Computing at the NIH
Oncotator on NIH HPC Systems

Oncotator (http://www.broadinstitute.org/cancer/cga/oncotator) is a tool for annotating genomic point mutations (SNPs/SNVs) and indels. It is primarily intended for human genome variant callsets, and only data sources relevant to cancer researchers are provided. However, the tool can technically be used to annotate variant callsets from any organism, and the Broad Institute website has instructions on how to prepare custom data sources for inclusion in the process.

Program Environment

Before using Oncotator, you must load the oncotator environment module, along with the modules it depends on, into your shell environment. The easiest way to do this is with the module commands, as in the example below:

[user@biowulf]$ module avail oncotator                   (see what versions are available)

-------------------- /usr/local/lmod/modulefiles --------------------
oncotator/ (D)

   (D):  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

[user@biowulf]$ module load oncotator

[user@biowulf]$ module list
Currently Loaded Modules

  1) oncotator/

Loading the Oncotator module sets the following environment variables:
$ONCOTATOR_DATASOURCE: points to the latest Oncotator database bundle, e.g. /fdb/oncotator/oncotator_v1_ds_April052016
$ONCOTATOR_TESTDATA: points to the Oncotator test files and data, which are used in the examples on this page.
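
Before submitting jobs, it can be worth confirming that both variables are set in the current shell. The following is a minimal sketch and not part of the Oncotator module itself; the function name check_oncotator_env is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: verify that the variables set by "module load oncotator"
# exist and point to directories.
check_oncotator_env() (
    # The body runs in a subshell, so the loop variables do not leak out.
    status=0
    for name in ONCOTATOR_DATASOURCE ONCOTATOR_TESTDATA; do
        eval "value=\${$name:-}"           # look up the variable by name
        if [ -z "$value" ]; then
            echo "$name is not set; run 'module load oncotator' first" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name does not point to a directory: $value" >&2
            status=1
        else
            echo "$name=$value"
        fi
    done
    exit "$status"                         # becomes the function's return status
)
```

Calling check_oncotator_env after "module load oncotator" prints both paths and returns zero; a non-zero status suggests the module is not loaded in this shell.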

Transcript override lists

The Broad Institute highly recommends that you use one of the transcript override lists discussed below, especially with clinical applications of Oncotator. When running Oncotator, provide one of the UniProt Exact Match files with the -c parameter.
  1. UniProt Exact Match For GENCODE v19

    Gives selection priority to transcripts with protein sequences that match the UniProt protein sequence exactly.

    On Biowulf, the file can be found in the Oncotator database directory, at "$ONCOTATOR_DATASOURCE/tx_exact_uniprot_matches.txt".

  2. UniProt Exact Match + Clinical for GENCODE v19

    Gives priority to known clinical protein changes. On Biowulf, the file can be found in the Oncotator database directory, at "$ONCOTATOR_DATASOURCE/tx_exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt".

    The file is a modification of the UniProt Exact Match for GENCODE v19. For more details, please see the PowerPoint presentation tx_selection_results_LTLedits.pptx.
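
As a sketch of how an override list might be passed on the command line, the function below wraps the oncotator call used elsewhere on this page and adds the -c parameter. The function name annotate_with_override and its argument order are hypothetical, and the genome build is assumed to default to hg19, as in the other examples on this page.

```shell
# Hypothetical wrapper: annotate one file using the UniProt Exact Match list.
# Usage: annotate_with_override INPUT OUTPUT [GENOME_BUILD]
annotate_with_override() {
    if [ "$#" -lt 2 ]; then
        echo "usage: annotate_with_override INPUT OUTPUT [GENOME_BUILD]" >&2
        return 2
    fi
    # For the clinical list, swap in tx_exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt.
    oncotator -v --no-multicore \
        --db-dir="$ONCOTATOR_DATASOURCE" \
        -c "$ONCOTATOR_DATASOURCE/tx_exact_uniprot_matches.txt" \
        "$1" "$2" "${3:-hg19}"
}
```

For example, after loading the module, "annotate_with_override /path/to/file1.vcf Output1.tsv" would run Oncotator on one file with the override list applied.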

Batch job on Biowulf

Set up a batch script along the lines of the script below. This script uses the test input data provided with Oncotator.

#!/bin/bash
#  this file is called test.sh

module load oncotator
mkdir -p /data/$USER/oncotator
cd /data/$USER/oncotator
#  arguments: input file, output file, genome build (hg19, as in the swarm examples)
oncotator -v --no-multicore --db-dir=$ONCOTATOR_DATASOURCE   \
	$ONCOTATOR_TESTDATA/testdata/maflite/Patient0.snp.maf.txt  \
	exampleOutput.tsv   \
	hg19

Submit with:

sbatch test.sh
This job will produce some warnings in the output. See here for an explanation.

Some Oncotator runs may require more than the default 4 GB of memory. If a job fails because it runs out of memory, you can increase the allocation with the command below, replacing # with the number of gigabytes to request:

sbatch --mem=#g  test.sh

Running Oncotator with swarm

Set up a swarm command file along the following lines:

oncotator -v --no-multicore --db-dir=$ONCOTATOR_DATASOURCE   /path/to/file1.vcf  Output1.tsv   hg19
oncotator -v --no-multicore --db-dir=$ONCOTATOR_DATASOURCE   /path/to/file2.vcf  Output2.tsv   hg19
oncotator -v --no-multicore --db-dir=$ONCOTATOR_DATASOURCE   /path/to/file3.vcf  Output3.tsv   hg19

Submit with:

swarm -f  swarm_command_file --module oncotator/
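
With many input files, the swarm command file can be generated rather than written by hand. The following is a minimal sketch under stated assumptions: the function name make_swarm_file is hypothetical, inputs are assumed to end in .vcf and to have no spaces in their paths, and each output is named after its input. $ONCOTATOR_DATASOURCE is written literally so that it is expanded when each swarm command runs.

```shell
# Hypothetical helper: write one oncotator command per .vcf file in a directory.
# Usage: make_swarm_file VCF_DIR SWARM_FILE
make_swarm_file() {
    vcf_dir=$1
    swarm_file=$2
    : > "$swarm_file"                      # start with an empty command file
    for vcf in "$vcf_dir"/*.vcf; do
        [ -e "$vcf" ] || continue          # skip the unmatched glob pattern
        base=$(basename "$vcf" .vcf)
        # Single quotes keep $ONCOTATOR_DATASOURCE unexpanded in the file.
        printf 'oncotator -v --no-multicore --db-dir=$ONCOTATOR_DATASOURCE %s %s.tsv hg19\n' \
            "$vcf" "$base" >> "$swarm_file"
    done
}
```

For example, "make_swarm_file /path/to swarm_command_file" would produce a file in the same form as the one above, with one line per VCF file found in /path/to.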