hap.py


a set of programs based on htslib to benchmark variant calls against gold standard truth datasets

Hap.py is a tool to compare diploid genotypes at haplotype level. Rather than comparing VCF records row by row, hap.py will generate and match alternate sequences in a superlocus. A superlocus is a small region of the genome (sized between 1 and around 1000 bp) that contains one or more variants.

Web sites

hap.py Interactive job
back to top

hap.py is installed in a Singularity container and is not suitable for use on Helix. Please run hap.py on a Biowulf compute node instead.

In this example, the user allocates an interactive session on a compute node and then runs a test using example data (user input in bold)

[user@biowulf ~]$ sinteractive -c 16 --mem=200g
salloc.exe: Pending job allocation 38693900
salloc.exe: job 38693900 queued and waiting for resources
salloc.exe: job 38693900 has been allocated resources
salloc.exe: Granted job allocation 38693900
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1234 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn1234 ~]$ mkdir -p /data/$USER/hap.py-test

[user@cn1234 ~]$ cd /data/$USER/hap.py-test

[user@cn1234 ~]$ module load hap.py
[+] Loading hap.py 0.3.7 on cn1311
[+] Loading singularity 2.2.1 on cn1311

[user@cn1234 ~]$ cp -r $HAPPY_HOME/example .

[user@cn1234 ~]$ hap.py \
    example/happy/PG_NA12878_chr21.vcf.gz \
    example/happy/NA12878_chr21.vcf.gz \
    -f example/happy/PG_Conf_chr21.bed.gz \
    -o test
WARNING: Bind file source does not exist on host: /etc/resolv.conf
[W] overlapping records at chr21:10993857 for sample 0
[W] Symbolic / SV ALT alleles at chr21:15847469
[W] Variants that overlap on the reference allele: 144
[W] Variants that have symbolic ALT alleles: 14
[I] Total VCF records:         65402
[I] Non-reference VCF records: 65402
[W] overlapping records at chr21:24024261 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records:         101524
[I] Non-reference VCF records: 101524
Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         8937      7839      1098        11812       343       3520     45       0.877140          0.958635        0.298002         0.916079                     NaN                     NaN                   1.357991                   1.457627
INDEL   PASS         8937      7550      1387         9971       283       1964     30       0.844803          0.964656        0.196971         0.900760                     NaN                     NaN                   1.357991                   1.239305
  SNP    ALL        52494     52125       369        90092       582      37348    107       0.992971          0.988966        0.414554         0.990964                2.082614                1.745874                   1.594335                   3.132586
    SNP   PASS        52494     46920      5574        48078       143        992      8       0.893816          0.996963        0.020633         0.942576                2.082614                2.089282                   1.594335                   1.487599

Running a single hap.py job on Biowulf
back to top

Set up a batch script along the following lines:

#!/bin/bash
# file called myjob.bat
cd /data/$USER/hap.py-test
module load hap.py 
hap.py \
    example/happy/PG_NA12878_chr21.vcf.gz \
    example/happy/NA12878_chr21.vcf.gz \
    -f example/happy/PG_Conf_chr21.bed.gz \
    -o test
    --threads $SLURM_CPUS_PER_TASK

Submit this job with:

[user@biowulf ~]$ sbatch --cpus-per-task=16 --mem=120g myjob.bat

For more information on submitting jobs to slurm, see Job Submission in the Biowulf User Guide.

Running a swarm of hap.py jobs on Biowulf
back to top

Sample swarm command file

# --------file myjobs.swarm----------
hap.py example/happy/PG_chr1.vcf.gz chr1.vcf.gz -f PG_Conf_chr1.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK 
hap.py example/happy/PG_chr2.vcf.gz chr2.vcf.gz -f PG_Conf_chr2.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
hap.py example/happy/PG_chr3.vcf.gz chr3.vcf.gz -f PG_Conf_chr3.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
....
hap.py example/happy/PG_chrN.vcf.gz chrN.vcf.gz -f PG_Conf_chrN.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
# -----------------------------------

Submit this set of runs to the batch system by typing

[user@biowulf ~]$ swarm --module hap.py --threads-per-process 16 --gb-per-process 120 -f myjobs.swarm

For details on using swarm see Swarm on Biowulf.

Documentation
back to top