|
hap.py is a tool for comparing diploid genotypes at the haplotype level. Rather than comparing VCF records row by row, hap.py generates and matches alternate sequences within each superlocus. A superlocus is a small region of the genome (between 1 and roughly 1000 bp) that contains one or more variants.
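As a quick illustration of why haplotype-level matching matters, the sketch below (hypothetical positions and alleles, not taken from the example data) builds the same alternate sequence from two different VCF encodings. A row-by-row comparison would report these as a mismatch, while a haplotype comparison reconstructs an identical alternate sequence from both:

```shell
# Hypothetical superlocus: reference bases "AC".
ref="AC"
# Encoding 1: a SNP A->G at the first base, then an insertion C->CTT at the second.
hap1=$(printf '%s' "$ref" | sed -e 's/^A/G/' -e 's/C$/CTT/')
# Encoding 2: a single complex record replacing AC with GCTT.
hap2="GCTT"
# Both encodings yield the same alternate haplotype sequence.
[ "$hap1" = "$hap2" ] && echo "identical alternate sequence: $hap1"
```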
hap.py is installed in a Singularity container and is not suitable for use on Helix. Please run hap.py on a Biowulf compute node instead.
In this example, the user allocates an interactive session on a compute node and then runs a test using example data (user input in bold):
[user@biowulf ~]$ sinteractive -c 16 --mem=200g
salloc.exe: Pending job allocation 38693900
salloc.exe: job 38693900 queued and waiting for resources
salloc.exe: job 38693900 has been allocated resources
salloc.exe: Granted job allocation 38693900
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1234 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn1234 ~]$ mkdir -p /data/$USER/hap.py-test
[user@cn1234 ~]$ cd /data/$USER/hap.py-test
[user@cn1234 ~]$ module load hap.py
[+] Loading hap.py 0.3.7 on cn1311
[+] Loading singularity 2.2.1 on cn1311
[user@cn1234 ~]$ cp -r $HAPPY_HOME/example .
[user@cn1234 ~]$ hap.py \
    example/happy/PG_NA12878_chr21.vcf.gz \
    example/happy/NA12878_chr21.vcf.gz \
    -f example/happy/PG_Conf_chr21.bed.gz \
    -o test
WARNING: Bind file source does not exist on host: /etc/resolv.conf
[W] overlapping records at chr21:10993857 for sample 0
[W] Symbolic / SV ALT alleles at chr21:15847469
[W] Variants that overlap on the reference allele: 144
[W] Variants that have symbolic ALT alleles: 14
[I] Total VCF records:         65402
[I] Non-reference VCF records: 65402
[W] overlapping records at chr21:24024261 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records:         101524
[I] Non-reference VCF records: 101524

Benchmarking Summary:
 Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL    ALL        8937     7839     1098       11812      343      3520    45      0.877140         0.958635       0.298002        0.916079                    NaN                    NaN                  1.357991                  1.457627
INDEL   PASS        8937     7550     1387        9971      283      1964    30      0.844803         0.964656       0.196971        0.900760                    NaN                    NaN                  1.357991                  1.239305
  SNP    ALL       52494    52125      369       90092      582     37348   107      0.992971         0.988966       0.414554        0.990964               2.082614               1.745874                  1.594335                  3.132586
  SNP   PASS       52494    46920     5574       48078      143       992     8      0.893816         0.996963       0.020633        0.942576               2.082614               2.089282                  1.594335                  1.487599
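The summary table above is also written to disk as comma-separated output under the `-o` prefix (for this run, `test.summary.csv`, assuming hap.py's default output naming). A quick way to pull out the headline metrics is with awk; the here-doc below reproduces a few columns of the run above so the snippet is self-contained, but with a real run you would read `test.summary.csv` instead:

```shell
# Self-contained stand-in for test.summary.csv, using values from the run above.
cat > summary.csv <<'EOF'
Type,Filter,METRIC.Recall,METRIC.Precision,METRIC.F1_Score
INDEL,ALL,0.877140,0.958635,0.916079
INDEL,PASS,0.844803,0.964656,0.900760
SNP,ALL,0.992971,0.988966,0.990964
SNP,PASS,0.893816,0.996963,0.942576
EOF
# Print variant type, filter, and F1 score for each row.
awk -F, 'NR>1 {printf "%-6s %-5s F1=%s\n", $1, $2, $5}' summary.csv
```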
Set up a batch script along the following lines:
#!/bin/bash
# file called myjob.bat

cd /data/$USER/hap.py-test
module load hap.py
hap.py \
    example/happy/PG_NA12878_chr21.vcf.gz \
    example/happy/NA12878_chr21.vcf.gz \
    -f example/happy/PG_Conf_chr21.bed.gz \
    -o test --threads $SLURM_CPUS_PER_TASK
Submit this job with:
[user@biowulf ~]$ sbatch --cpus-per-task=16 --mem=120g myjob.bat
For more information on submitting jobs to slurm, see Job Submission in the Biowulf User Guide.
Sample swarm command file
# --------file myjobs.swarm----------
hap.py example/happy/PG_chr1.vcf.gz chr1.vcf.gz -f PG_Conf_chr1.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
hap.py example/happy/PG_chr2.vcf.gz chr2.vcf.gz -f PG_Conf_chr2.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
hap.py example/happy/PG_chr3.vcf.gz chr3.vcf.gz -f PG_Conf_chr3.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
....
hap.py example/happy/PG_chrN.vcf.gz chrN.vcf.gz -f PG_Conf_chrN.bed.gz -o outdir --threads $SLURM_CPUS_PER_TASK
# -----------------------------------
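Rather than typing one line per chromosome, a command file like this can be generated with a loop. This is only a sketch: the file names follow the hypothetical pattern in the example above, and a per-chromosome output prefix (`-o out_chr${c}`) is used so the runs do not overwrite each other's output files. The backslash keeps `$SLURM_CPUS_PER_TASK` literal in the file so swarm expands it at run time:

```shell
# Write one hap.py command per autosome into myjobs.swarm.
for c in $(seq 1 22); do
    echo "hap.py example/happy/PG_chr${c}.vcf.gz chr${c}.vcf.gz -f PG_Conf_chr${c}.bed.gz -o out_chr${c} --threads \$SLURM_CPUS_PER_TASK"
done > myjobs.swarm
# Sanity check: one line per chromosome.
wc -l myjobs.swarm
```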
Submit this set of runs to the batch system by typing:
[user@biowulf ~]$ swarm --module hap.py --threads-per-process 16 --gb-per-process 120 -f myjobs.swarm
For details on using swarm see Swarm on Biowulf.