Biowulf High Performance Computing at the NIH
PyLOH on Biowulf

PyLOH integrates somatic copy number alterations and loss of heterozygosity in tumor/normal sample pairs to deconvolve the contributions of tumor purity and tumor ploidy. In addition to the tumor and normal BAM files, it can optionally take a genome segmentation in BED format produced by one of the automated segmentation packages.

PyLOH contains three modules:

  1. preprocess: Preprocess the tumor-normal BAM pair to create paired count files and processed segmentation files. This step either uses whole chromosomes as segments or takes a genome segmentation produced by an external tool (an example segment file is shown below this list).
  2. run_model: Using the output of the preprocessing step, estimate tumor purity and the copy number of each segment.
  3. BAF_heatmap: Produce heatmap visualizations of the B allele frequencies (BAF) in the tumor and normal samples for each segment.
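
For reference, a segmentation file in BED format lists one segment per line as chromosome, start, and end. A minimal illustrative example (the coordinates below are made up, not from the test data):

chr1    0           121000000
chr1    125000000   249250621
chr2    0           92000000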


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=6 --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load pyloh
[user@cn3144]$ cp -a $PYLOH_TEST_DATA/aln .
[user@cn3144]$ REFERENCE_GENOME=/fdb/GATK_resource_bundle/hg18/Homo_sapiens_assembly18.fasta
[user@cn3144]$ PyLOH.py preprocess \
                    --segments_bed aln/segments.bed \
                    --min_depth 20 \
                    --min_base_qual 10 \
                    --min_map_qual 10 \
                    --process_num $SLURM_CPUS_PER_TASK \
                    $REFERENCE_GENOME aln/normal.bam aln/tumor.bam TvN
[...snip...]
[user@cn3144]$ PyLOH.py run_model --allelenumber_max 2 --max_iters 100 --stop_value 1e-7 TvN
[...snip...]

[user@cn3144]$ # set the matplotlib backend to a non-interactive backend
[user@cn3144]$ MPLBACKEND=agg PyLOH.py BAF_heatmap TvN

[Figure: example BAF heatmaps for segments with a copy number of 1 (left) and 2 (right).]

The test data used here was simulated with 90% tumor purity. The tumor purity estimated by PyLOH is:

[user@cn3144]$ cat TvN.PyLOH.purity
Optimum baseline copy number : 2
Maximum copy number of each allele : 2
Tumor purity : 0.901

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Preprocessing

Create a batch script similar to the following example to run the preprocessing step. Note that this batch script uses the example data from $PYLOH_TEST_DATA.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.3 || exit 1

REFERENCE_GENOME=/fdb/GATK_resource_bundle/hg18/Homo_sapiens_assembly18.fasta
if [[ ! -d aln ]]; then
    cp -a $PYLOH_TEST_DATA/aln . || exit 1
fi

PyLOH.py preprocess \
    --segments_bed aln/segments.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK \
    $REFERENCE_GENOME aln/normal.bam aln/tumor.bam TvN

Submit to the queue with sbatch:

sbatch --mem=4g --cpus-per-task=10 --time=06:00:00 preprocess.sh

Run model

Create a batch script similar to the following example to run the model fitting step. Note that this script uses the output of the preprocessing step.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.3 || exit 1

PyLOH.py run_model \
    --allelenumber_max 2 \
    --max_iters 100 \
    --stop_value 1e-7 \
    TvN

Submit to the queue with sbatch:

sbatch --mem=4G run_model.sh
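
Because run_model depends on the output of the preprocessing job, the two submissions can be chained with a Slurm job dependency so that model fitting starts only if preprocessing completes successfully. A minimal sketch:

# capture the job ID of the preprocessing job, then make run_model wait on it
jobid=$(sbatch --parsable --mem=4g --cpus-per-task=10 --time=06:00:00 preprocess.sh)
sbatch --dependency=afterok:$jobid --mem=4G run_model.sh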

Create the BAF heatmap graph

This step makes use of the python package matplotlib to draw heat maps of BAF in tumor vs normal samples for each segment. Since matplotlib defaults to an interactive backend, we create a matplotlibrc file that switches to a non-interactive backend (Agg) for this script to work in batch mode.
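
Alternatively, setting the environment variable MPLBACKEND=agg on the command line, as in the interactive example above, selects the non-interactive backend without creating a matplotlibrc file.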

Create a batch script similar to the following example to run the BAF heatmap plotting step.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.3 || exit 1
created_rc=0
if [[ ! -e matplotlibrc ]]; then
    echo "backend: Agg" > matplotlibrc
    created_rc=1
fi

PyLOH.py BAF_heatmap TvN
# remove the matplotlibrc file only if this script created it
if (( created_rc )); then
    rm matplotlibrc
fi

Submit to the queue with sbatch:

sbatch --mem=4G baf_heatmap.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. PyLOH.swarm). For example:

PyLOH.py preprocess \
    /path/to/genome/reference.fasta \
    aln/normal1.bam \
    aln/tumor1.bam \
    TvN1 \
    --segments_bed aln/segments1.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK
PyLOH.py preprocess \
    /path/to/genome/reference.fasta \
    aln/normal2.bam \
    aln/tumor2.bam \
    TvN2 \
    --segments_bed aln/segments2.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK
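
With many tumor/normal pairs, the swarm file can be generated with a small shell loop instead of being written by hand. A minimal sketch, assuming the normal${i}.bam / tumor${i}.bam / segments${i}.bed naming used above; $SLURM_CPUS_PER_TASK is escaped so that swarm expands it in each subjob:

# write one PyLOH preprocess command per line to PyLOH.swarm
for i in 1 2; do
    echo "PyLOH.py preprocess /path/to/genome/reference.fasta" \
         "aln/normal${i}.bam aln/tumor${i}.bam TvN${i}" \
         "--segments_bed aln/segments${i}.bed" \
         "--min_depth 20 --min_base_qual 10 --min_map_qual 10" \
         "--process_num \$SLURM_CPUS_PER_TASK"
done > PyLOH.swarm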

Submit this job using the swarm command.

swarm -f PyLOH.swarm -g 4 -t 10 --module pyloh
where
-g # Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file)
--module pyloh Loads the pyloh module for each subjob in the swarm