High-Performance Computing at the NIH
THetA on Biowulf

Tumor Heterogeneity Analysis (THetA) is an algorithm used to estimate tumor purity and clonal/subclonal copy number aberrations simultaneously from high-throughput DNA sequencing data.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=14g --cpus-per-task=6
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load theta
[user@cn3144 ~]$ cp -r /usr/local/apps/theta/TEST_DATA/example/ .
[user@cn3144 ~]$ RunTHetA example/Example.intervals \
                        --NUM_PROCESSES=$((SLURM_CPUS_PER_TASK - 1)) \
                        --TUMOR_FILE example/TUMOR_SNP.formatted.txt \
                        --NORMAL_FILE example/NORMAL_SNP.formatted.txt
Arguments are:
        Query File: example/Example.intervals
        k: 3
        tau: 2
        Output Directory: ./
        Output Prefix: Example
        Num Processes: 5
        Graph extension: .pdf

Valid sample for THetA analysis:
        Ratio Deviation: 0.1
        Min Fraction of Genome Aberrated: 0.05
        Program WILL cluster intervals.
Reading in query file...

[user@cn3144 ~]$ ls -lh
drwxrwx---  2 user group 4.0K May  3  2016 example
drwxrwx---  9 user group 4.0K Aug  2  2017 Example_2_cluster_data
drwxrwx--- 10 user group 4.0K Aug  2  2017 Example_3_cluster_data
-rw-rw----  1 user group  20K Aug  2  2017 Example_assignment.png
-rw-rw----  1 user group 1.7K Aug  2  2017 Example.BEST.results
-rw-rw----  1 user group 155K Aug  2  2017 Example_by_chromosome.png
-rw-rw----  1 user group  24K Aug  2  2017 Example_classifications.png
-rw-rw----  1 user group  18K Aug  2  2017 Example.n2.graph.pdf
-rw-rw----  1 user group 1.7K Aug  2  2017 Example.n2.results
-rw-rw----  1 user group 3.6K Aug  2  2017 Example.n2.withBounds
-rw-rw----  1 user group  19K Aug  2  2017 Example.n3.graph.pdf
-rw-rw----  1 user group 1.8K Aug  2  2017 Example.n3.results
-rw-rw----  1 user group 3.6K Aug  2  2017 Example.n3.withBounds
-rw-rw----  1 user group  251 Aug  2  2017 Example.RunN3.bash

The analysis will create a number of files including some graphs. For example, the following shows one of the models (2 components):

[Figure: THetA model n=2]

In addition to RunTHetA, several other tools are included in this package:

helix$ ls /usr/local/apps/theta/0.7/bin
|-- [  274]  CreateExomeInput
|-- [ 294K]  getAlleleCounts.jar
|-- [  14K]  runBICSeqToTHetA.jar
`-- [  260]  RunTHetA
helix$ java -jar $THETA_JARPATH/runBICSeqToTHetA.jar
Error! Incorrect number of arguments.

Program: BICSeqToTHetA
USAGE (src): java BICSeqToTHetA <INPUT_FILE> [Options]
USAGE (jar): java -jar BICSeqToTHetA <INPUT_FILE> [Options]
<INPUT_FILE> [String]
         A file output by BIC-Seq.
         Prefix for all output files.
-MIN_LENGTH [Integer]
         The minimum length of intervals to keep.

For a more detailed manual see

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. THetA.sh). For example:

#! /bin/bash
module load theta || exit 1

nproc=$((SLURM_CPUS_PER_TASK - 1))
RunTHetA example/Example.intervals \
  --NUM_PROCESSES=$nproc \
  --TUMOR_FILE example/TUMOR_SNP.formatted.txt \
  --NORMAL_FILE example/NORMAL_SNP.formatted.txt

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=14g THetA.sh
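
The batch script above derives the process count from SLURM_CPUS_PER_TASK. If the script is run with a single CPU, or outside a Slurm allocation where the variable is unset, the subtraction yields 0 or -1, which is unlikely to be a useful value. A minimal sketch of a guard follows, assuming at least one worker process is always wanted; the helper name theta_nproc is hypothetical and is not part of THetA or Biowulf.

```shell
#!/bin/bash
# theta_nproc [CPUS]  (hypothetical helper, not provided by THetA)
# Prints the value to pass to --NUM_PROCESSES: one less than the number
# of allocated CPUs, as in the examples above, but never below 1.
# CPUS defaults to $SLURM_CPUS_PER_TASK, or 1 if that is unset.
theta_nproc() {
    local cpus=${1:-${SLURM_CPUS_PER_TASK:-1}}
    local n=$((cpus - 1))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Usage inside the batch script:
#   nproc=$(theta_nproc)
#   RunTHetA example/Example.intervals --NUM_PROCESSES=$nproc ...
```

With --cpus-per-task=6 the helper prints 5, matching the arithmetic used in the script above.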

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. THetA.swarm). For example:

RunTHetA sample1/Example.intervals --NUM_PROCESSES=$((SLURM_CPUS_PER_TASK-1)) \
  --TUMOR_FILE sample1/TUMOR_SNP.formatted.txt --NORMAL_FILE sample1/NORMAL_SNP.formatted.txt
RunTHetA sample2/Example.intervals --NUM_PROCESSES=$((SLURM_CPUS_PER_TASK-1)) \
  --TUMOR_FILE sample2/TUMOR_SNP.formatted.txt --NORMAL_FILE sample2/NORMAL_SNP.formatted.txt

Submit this job using the swarm command.

swarm -f THetA.swarm -g 14 -t 6 --module theta
-g # Number of gigabytes of memory required for each process (1 line in the swarm command file).
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module theta Loads the theta module for each subjob in the swarm.
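
For more than a handful of samples, the swarmfile can be generated rather than written by hand. A minimal sketch follows, assuming each sample directory contains Example.intervals, TUMOR_SNP.formatted.txt and NORMAL_SNP.formatted.txt as in the example above, and that directory names contain no spaces. The function name make_theta_swarm is hypothetical.

```shell
#!/bin/bash
# make_theta_swarm OUTFILE DIR...  (hypothetical helper)
# Writes one RunTHetA command line per sample directory to OUTFILE.
# Assumes the file names used in the swarm example above.
make_theta_swarm() {
    local out=$1
    shift
    : > "$out"                      # create or truncate the swarmfile
    local d
    for d in "$@"; do
        d=${d%/}                    # drop a trailing slash, if any
        # The single-quoted format string keeps $((SLURM_CPUS_PER_TASK-1))
        # literal, so it is expanded inside each subjob rather than when
        # the swarmfile is generated.
        printf 'RunTHetA %s/Example.intervals --NUM_PROCESSES=$((SLURM_CPUS_PER_TASK-1)) --TUMOR_FILE %s/TUMOR_SNP.formatted.txt --NORMAL_FILE %s/NORMAL_SNP.formatted.txt\n' \
            "$d" "$d" "$d" >> "$out"
    done
}

# Usage:
#   make_theta_swarm THetA.swarm sample1 sample2
#   swarm -f THetA.swarm -g 14 -t 6 --module theta
```

Each sample is written on a single line, so the tumor and normal files always come from the same directory.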