High-Performance Computing at the NIH
ContEst on Biowulf and Helix

ContEst is a tool (and method) for estimating the amount of cross-sample contamination in next-generation sequencing data. Using a Bayesian framework, contamination levels are estimated from array-based genotypes and sequencing reads.

The most recent version of ContEst has been merged into GATK.

References

http://bioinformatics.oxfordjournals.org/content/early/2011/07/29/bioinformatics.btr446.abstract

Note 1:

The example files used below can be copied to your own area from the /usr/local/apps/contest/example directory.
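The copy might look like the following sketch. The source path is as given above; the destination under /data/$USER is an assumption for illustration, and temporary directories stand in for both so the commands can be tried anywhere:

```shell
# Stage the ContEst example files into your own area.
# On Helix the source is /usr/local/apps/contest/example and a typical
# destination is a directory under /data/$USER; temporary directories
# stand in for both here so this sketch runs anywhere.
src=$(mktemp -d)    # stands in for /usr/local/apps/contest/example
dest=$(mktemp -d)   # stands in for /data/$USER
touch "$src/chr20_sites.bam" "$src/hg00142.vcf"   # placeholder contents
cp -r "$src" "$dest/example"
ls "$dest/example"
```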

Running on Helix

Sample session:

helix$ module load contest
helix$ java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
		-I example/chr20_sites.bam \
		-R example/human_g1k_v37.fasta \
		-B:pop,vcf example/hg19_population_stratified_af_hapmap_3.3.vcf \
		-T Contamination -B:genotypes,vcf example/hg00142.vcf \
		-BTI genotypes -o /data/$USER/contamination_results_chr20_2.txt

Submitting a single batch job

1. Create a script file containing lines similar to those below. Modify the program and data paths before running.

#!/bin/bash 

module load contest
cd /data/$USER/somewhere
java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
		-I example/chr20_sites.bam \
		-R example/human_g1k_v37.fasta \
		-B:pop,vcf example/hg19_population_stratified_af_hapmap_3.3.vcf \
		-T Contamination -B:genotypes,vcf example/hg00142.vcf \
		-BTI genotypes -o /data/$USER/contamination_results_chr20_2.txt
....
....

2. Submit the script on Biowulf.

$ sbatch myscript

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g., /data/$USER/cmdfile). Here is a sample file:

cd /data/user/run1/; java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar [options]
cd /data/user/run2/; java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar [options]
........
cd /data/user/run10/; java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar [options]
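Rather than typing each line by hand, a command file like the sample above can be generated with a short loop. The run-directory layout and the [options] placeholder follow the sample; the temporary output file here is just a stand-in for your real command file:

```shell
# Generate one ContEst command per run directory (run1..run10),
# matching the sample command file above. A temporary file stands in
# for the real /data/$USER/cmdfile.
cmdfile=$(mktemp)
for i in $(seq 1 10); do
    echo "cd /data/user/run$i/; java -Xmx2g -jar \$CONTESTJARPATH/ContEst.jar [options]" >> "$cmdfile"
done
wc -l < "$cmdfile"
```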

The -f flag is required to specify the swarm file name.

Submit the swarm job:

$ swarm -f swarmfile --module contest

To request more memory (the default is 1.5 GB per command line in the swarm file), use the -g flag, and at the same time change the -Xmx value in your commands to match (-Xmx2g becomes -Xmx10g in this example):

$ swarm -g 10 -f swarmfile --module contest
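Keeping the -g value and the JVM heap in sync can be automated with a sed one-liner. This is a sketch that assumes the swarm file contains the literal string -Xmx2g; a temporary copy stands in for the real file:

```shell
# Bump the JVM heap in every line of the swarm file to match swarm -g 10.
# A temporary file with one sample line stands in for the real swarm file.
swarmfile=$(mktemp)
echo 'cd /data/user/run1/; java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar [options]' > "$swarmfile"
sed -i 's/-Xmx2g/-Xmx10g/g' "$swarmfile"
grep -c 'Xmx10g' "$swarmfile"
```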

For more information on running swarm, see swarm.html

 

Running an interactive job

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.

[user@biowulf]$ sinteractive 

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load contest

[user@pXXXX]$ cd /data/user/run1/; java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar [options]

[user@pXXXX]$ exit

[user@biowulf]$ 

Documentation

http://www.broadinstitute.org/cancer/cga/contest_run