High-Performance Computing at the NIH
Conpair on Biowulf & Helix

Conpair: concordance and contamination estimator for tumor–normal pairs

Conpair is a fast and robust method designed for human tumor–normal studies. It performs concordance verification (i.e., it confirms that the tumor and normal samples come from the same individual) as well as cross-individual contamination estimation in whole-genome and whole-exome sequencing experiments. Importantly, the contamination estimate for tumor samples is not affected by copy-number changes, and the method can detect contamination levels as low as 0.1%.


Running on Helix

helix$ module load conpair
helix$ cd /data/$USER/dir
helix$ run_gatk_pileup_for_sample.py -h
Usage: run_gatk_pileup_for_sample.py [options]

Program to run GATK Pileup on a single sample

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -B BAM, --bam=BAM     BAMFILE [mandatory field]
  -O OUTFILE, --outfile=OUTFILE
                        OUTPUT FILE (PILEUP) [mandatory field]
  -D CONPAIR_DIR, --conpair_dir=CONPAIR_DIR
                        CONPAIR DIR [$CONPAIR_DIR by default]
  -R REFERENCE, --reference=REFERENCE
                        REFERENCE GENOME [GRCh37 by default]
  -M MARKERS, --markers=MARKERS
                        MARKER FILE [GRCh37-default]
  -G GATK, --gatk=GATK  GATK JAR [$GATK by default]
  -J JAVA, --java=JAVA  PATH to JAVA [java by default]
  -t TEMP_DIR_JAVA, --temp_dir_java=TEMP_DIR_JAVA
                        temporary directory to set -Djava.io.tmpdir
  -m XMX_JAVA, --xmx_java=XMX_JAVA
                        Xmx java memory setting [default: 12g]
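A complete Conpair run generates a pileup for each sample of the pair and then compares them. The sketch below assumes that the verify_concordance.py and estimate_tumor_normal_contamination.py scripts from the Conpair distribution are on the PATH after loading the module, as run_gatk_pileup_for_sample.py is; the BAM and pileup filenames are placeholders.

helix$ run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
helix$ run_gatk_pileup_for_sample.py -B NORMAL_bam -O NORMAL_pileup
helix$ verify_concordance.py -T TUMOR_pileup -N NORMAL_pileup
helix$ estimate_tumor_normal_contamination.py -T TUMOR_pileup -N NORMAL_pileup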

Running a single batch job on Biowulf

Set up a batch script along the following lines.


#!/bin/bash
set -e

cd /data/$USER/mydir
module load conpair

run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup

Submit to the batch system with:

sbatch --mem=12g myscript

The script's default Java memory setting (-m/--xmx_java) is 12 GB, so --mem is set to 12g to match.

Running a swarm of batch jobs on Biowulf

Set up a swarm command file (e.g. /data/$USER/cmdfile). Here is a sample file:

cd /data/$USER/mydir1; run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
cd /data/$USER/mydir2; run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
cd /data/$USER/mydir3; run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup [...]

Submit this job with

swarm -g 12 -f cmdfile --module conpair

-g: memory required per command, in GB. The script's default Java memory setting is 12 GB.
--module: load the conpair module, which sets up the required environment variables for each job.
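When the sample directories follow a regular layout, the command file can be generated with a short shell loop instead of by hand. This is a sketch only: the mydir1..mydir3 directory names and the BAM/pileup filenames are hypothetical placeholders to adapt to your own data.

```shell
#!/bin/bash
# Build a swarm command file with one pileup command per sample directory.
# The /data/$USER/mydir1..mydir3 layout is a hypothetical example.
base=/data/$USER
: > cmdfile
for i in 1 2 3; do
    echo "cd $base/mydir$i; run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup" >> cmdfile
done
```

The resulting cmdfile is then submitted with swarm as shown above.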

Running an interactive job on Biowulf

Users may occasionally need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.

[user@biowulf]$ sinteractive --mem=12g 
      salloc.exe: Granted job allocation 1528

[user@pxxx]$ module load conpair

[user@pxxx]$ cd /data/$USER/run1

[user@pxxx]$ run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
[user@pxxx]$ exit

[user@biowulf]$