High-Performance Computing at the NIH
HLA-PRG-LA on Helix/Biowulf

HLA*PRG:LA stands for "HLA*PRG, linear approximation". It approximates the graph alignment process by starting from linear sequence alignments. This reduces the per-sample resource requirements for HLA typing to 30 GB of RAM and 30 CPU hours while still producing highly accurate calls. HLA*PRG:LA was developed by Alexander Dilthey at NHGRI. [Description of the algorithm] [Git source]

On Helix

HLA-PRG-LA is designed to run on multiple CPUs and uses a significant amount of memory, so it is not suitable for Helix.

Batch job on Biowulf

In this batch script, the input data is copied to local scratch on the allocated node. All I/O, including temporary files, takes place on local disk. At the end of the job, the final output directory ('myfile/hla' in this example) is copied back to the user's /data area.

Create a batch input file along the following lines:

#!/bin/bash
# this file is called HLA.bat

# work in the local scratch directory allocated to this job
cd /lscratch/$SLURM_JOBID || exit 1
module load HLA-PRG-LA

# stage the input on local disk and index it
cp /data/$USER/myfile.cram .
samtools index myfile.cram

# reserve one of the allocated CPUs for overhead
cpus=$(( SLURM_CPUS_PER_TASK - 1 ))
echo "Running on $cpus CPUs"
HLA-PRG-LA.pl --BAM myfile.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID myfile --maxThreads $cpus --workingDir .

# copy output from /lscratch back to /data area
cp -r myfile/hla  /data/$USER/

Submit this job to the batch system:

sbatch --cpus-per-task=32 --mem=100g --gres=lscratch:100 --time=1-00:00:00 HLA.bat
--cpus-per-task=32     Allocate 32 CPUs. BWA and other programs used by HLA-PRG-LA run best if one CPU is reserved for overhead, so within the job '--maxThreads' is set to one less than the number of allocated CPUs.
--mem=100g             Allocate 100 GB of memory in total. You may need to adjust this up or down, depending on your job.
--gres=lscratch:100    Allocate 100 GB of local disk. All temporary files are written to this local disk.
--time=1-00:00:00      Set the walltime limit to 1 day.
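The "one CPU less than allocated" rule can be wrapped in a small helper. The following is a minimal sketch, not part of HLA-PRG-LA or the Biowulf environment. The function name max_threads is hypothetical, and the floor of one thread (used when only one CPU is allocated or SLURM_CPUS_PER_TASK is unset) is an assumption added here so that '--maxThreads' never receives 0.

```shell
# Hypothetical helper (not provided by HLA-PRG-LA or Slurm):
# print the thread count to pass to --maxThreads.
# Argument 1 (optional): number of allocated CPUs; defaults to
# $SLURM_CPUS_PER_TASK, or 1 if that variable is unset.
max_threads() {
    allocated="${1:-${SLURM_CPUS_PER_TASK:-1}}"
    # reserve one CPU for overhead
    threads=$(( allocated - 1 ))
    # assumption: never go below a single thread
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}
```

Inside the batch script this could replace the explicit arithmetic, for example by passing --maxThreads "$(max_threads)" to HLA-PRG-LA.pl.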
Interactive job on Biowulf

Allocate an interactive session and run HLA-PRG-LA there. In the example below, 100 GB of local disk is requested on the allocated node, HLA-PRG-LA is run on that local disk, and at the end of the session the output directory is copied back to the user's /data area.

Note that the maximum walltime for an interactive session is 36 hours.

biowulf% sinteractive --cpus-per-task=32 --mem=100g --gres=lscratch:100 --time=24:00:00
salloc.exe: Pending job allocation 43094756
salloc.exe: job 43094756 queued and waiting for resources
salloc.exe: job 43094756 has been allocated resources
salloc.exe: Granted job allocation 43094756
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3678 are ready for job

cn3678% cd /lscratch/$SLURM_JOBID

cn3678% module load HLA-PRG-LA

cn3678% cp /data/$USER/myfile.cram .

cn3678% samtools index myfile.cram

cn3678% cpus=$(( SLURM_CPUS_PER_TASK - 1 ))

cn3678% HLA-PRG-LA.pl --BAM myfile.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID myfile --maxThreads $cpus --workingDir .
[+] Loading HLA-PRG-LA f0833ed on cn3406
[+] Loading gcc 4.7.4 ...
[+] Loading boost libraries v1.59 ...
[+] Loading bamtools 2d7685d on cn3406
[+] Loading BWA 0.7.12 ...
[+] Loading samtools 1.3 ...
[+] Loading Zlib 1.2.8 ...
Using 31 CPUS

Identified paths:
    samtools_bin: /usr/local/apps/samtools/1.3/bin/samtools
    bwa_bin: /usr/local/apps/bwa/0.7.12/bwa
    java_bin: /usr/bin/java
    picard_sam2fastq_bin: /usr/local/apps/picard/1.119/SamToFastq.jar
    General working directory: /lscratch/43090316
    Sample-specific working directory: /lscratch/43090316/NA12878

Extract reads from 534 regions...
Extract unmapped reads...

cn3678% cp -r myfile /data/$USER/

cn3678% exit
salloc.exe: Relinquishing job allocation 43094756
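
Because local scratch is tied to the job allocation, it can help to confirm that the run produced output before copying it back to /data. The following is a minimal sketch of such a guard. The function name copy_results is hypothetical, and "the directory exists and is not empty" is only an assumed stand-in for a real success check; it does not verify that HLA typing completed.

```shell
# Hypothetical helper: copy a results directory to a destination,
# but only if the directory exists and contains at least one entry.
# Argument 1: results directory (e.g. myfile/hla)
# Argument 2: destination directory (e.g. /data/$USER)
copy_results() {
    src="$1"
    dest="$2"
    if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
        echo "No output found in $src; nothing copied" >&2
        return 1
    fi
    # create the destination if needed, then copy the whole directory
    mkdir -p "$dest" && cp -r "$src" "$dest"/
}
```

In the examples on this page it would be called as copy_results myfile/hla /data/$USER in place of the plain cp -r command.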