High-Performance Computing at the NIH
PyLOH on Biowulf & Helix

Description

PyLOH integrates somatic copy number alterations and loss of heterozygosity in paired tumor/normal samples to deconvolve the contributions of tumor purity and tumor ploidy. In addition to the tumor and normal BAM files, it can optionally take as input a genome segmentation in BED format produced by an automated segmentation package.

PyLOH contains three modules:

  1. preprocess: Preprocesses the tumor/normal BAM pair to create paired count files and processed segmentation files. This step uses either whole chromosomes as segments or a segmentation of the genome produced by an external tool.
  2. run_model: Uses the output of the preprocessing step to estimate tumor purity and the copy number of each segment.
  3. BAF_heatmap: Produces a heatmap visualization of the B allele frequencies (BAF) in the tumor and normal samples for each segment.

There may be multiple versions of PyLOH available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail pyloh 

To select a module use

module load pyloh/[version]

where [version] is the version of choice.

PyLOH preprocessing can be run using multiple processes. Make sure to match the number of CPUs requested with the number of processes passed via --process_num.
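Inside a batch job, Slurm exports the allocated CPU count in $SLURM_CPUS_PER_TASK, so the two numbers can be kept in sync automatically. A minimal sketch (the fallback to 1 is only for testing outside a Slurm job):

```shell
# Use the Slurm-allocated CPU count when available; fall back to 1
# when running outside a Slurm job (e.g. on a login node for testing).
NPROC=${SLURM_CPUS_PER_TASK:-1}
echo "preprocessing with $NPROC processes"
```

The batch examples below pass $SLURM_CPUS_PER_TASK directly to --process_num, which achieves the same effect.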

On Helix

PyLOH preprocessing requires too many compute resources to be run on Helix in a reasonable amount of time. Please use an interactive session or submit a batch job instead.

Batch job on Biowulf

Preprocessing

Create a batch script similar to the following example to run the preprocessing step. Note that this batch script uses the example data from /usr/local/apps/pyloh/TEST_DATA.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.2 || exit 1

REFERENCE_GENOME=/fdb/GATK_resource_bundle/hg18/Homo_sapiens_assembly18
if [[ ! -d aln ]]; then
    cp -r /usr/local/apps/pyloh/TEST_DATA/aln . || exit 1
fi

PyLOH.py preprocess \
    $REFERENCE_GENOME.fasta \
    aln/normal.bam \
    aln/tumor.bam \
    TvN \
    --segments_bed aln/segments.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK

Submit to the queue with sbatch:

b2$ sbatch --mem=4G --cpus-per-task=10 --time=06:00:00 preprocess.sh

Run model

Create a batch script similar to the following example to run the model building step. Note that this script uses the output of the preprocessing script above.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.2 || exit 1

PyLOH.py run_model \
    TvN \
    --allelenumber_max 2 \
    --max_iters 100 \
    --stop_value 1e-7

Submit to the queue with sbatch:

b2$ sbatch --mem=4G run_model.sh

Create the BAF heatmap graph

This step makes use of the python package matplotlib to draw heatmaps of BAF in tumor vs. normal samples for each segment. Since matplotlib defaults to an interactive backend, we create a matplotlibrc file that switches to the non-interactive Agg backend so that this script works in batch mode.
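As an alternative to writing a matplotlibrc file, recent matplotlib versions (1.5 and later) also honor the MPLBACKEND environment variable, which should apply equally to matplotlib-based tools such as PyLOH.py BAF_heatmap. A minimal sketch:

```shell
# Select the non-interactive Agg backend via the environment instead of
# a matplotlibrc file; this affects any matplotlib-based program started
# from this shell.
export MPLBACKEND=Agg
echo "$MPLBACKEND"
```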

Create a batch script similar to the following example to run the BAF heatmap plotting step.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90

module load pyloh/1.4.2 || exit 1
if [[ ! -e matplotlibrc ]]; then
    echo "backend: Agg" > matplotlibrc
fi

PyLOH.py BAF_heatmap TvN
rm matplotlibrc

Submit to the queue with sbatch:

b2$ sbatch --mem=4G baf_heatmap.sh

The test data used here was simulated based on 90% tumor purity, and the estimated tumor purity is

b2$ cat TvN.PyLOH.purity
Optimum baseline copy number : 2
Maximum copy number of each allele : 2
Tumor purity : 0.901

Below are two examples of BAF heatmaps for segments with a copy number of 1 (left) and 2 (right), respectively.

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example to preprocess a number of samples in parallel:

PyLOH.py preprocess \
    /path/to/genome/reference.fasta \
    aln/normal1.bam \
    aln/tumor1.bam \
    TvN1 \
    --segments_bed aln/segments1.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK
PyLOH.py preprocess \
    /path/to/genome/reference.fasta \
    aln/normal2.bam \
    aln/tumor2.bam \
    TvN2 \
    --segments_bed aln/segments2.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK

Submit to the queue with swarm:

b2$ swarm -f pyloh_preprocess.swarm -g 4 -t 10

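For more than a handful of samples, a swarm file like the one above can be generated with a small shell loop. A sketch, assuming the same hypothetical sample numbering and file layout as the example (note the escaped $SLURM_CPUS_PER_TASK, so it is expanded inside each swarm subjob rather than at generation time):

```shell
# Generate one preprocess command per tumor/normal pair; each line of the
# .swarm file becomes one swarm subjob. Sample names and paths here are
# placeholders mirroring the example above.
for i in 1 2; do
    echo "PyLOH.py preprocess /path/to/genome/reference.fasta" \
         "aln/normal${i}.bam aln/tumor${i}.bam TvN${i}" \
         "--segments_bed aln/segments${i}.bed" \
         "--min_depth 20 --min_base_qual 10 --min_map_qual 10" \
         "--process_num \$SLURM_CPUS_PER_TASK"
done > pyloh_preprocess.swarm
```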
Interactive job on Biowulf

Allocate an interactive session with sinteractive and use PyLOH as described above:

b2$ sinteractive --mem=4G --cpus-per-task=10
node$ cp -r /usr/local/apps/pyloh/TEST_DATA/aln .
node$ PyLOH.py preprocess \
    /path/to/genome/reference.fasta \
    aln/normal1.bam \
    aln/tumor1.bam \
    TvN1 \
    --segments_bed aln/segments1.bed \
    --min_depth 20 \
    --min_base_qual 10 \
    --min_map_qual 10 \
    --process_num $SLURM_CPUS_PER_TASK
[...snip...]
node$ exit
b2$