Biowulf High Performance Computing at the NIH
purge_haplotigs on Biowulf

purge_haplotigs is a pipeline for curating heterozygous diploid genome assemblies: it uses read-depth analysis and contig alignments to identify allelic contigs (haplotigs) in the primary assembly and reassign them.

References:

Documentation
purge_haplotigs wiki: https://bitbucket.org/mroachawri/purge_haplotigs/wiki

Important Notes
Module Name: purge_haplotigs (see the modules page for more information)

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session based on the tutorial at https://bitbucket.org/mroachawri/purge_haplotigs/wiki/Tutorial (user input follows the shell prompt):

[user@biowulf]$ sinteractive --mem 8g --gres lscratch:10
salloc.exe: Pending job allocation 3106949
salloc.exe: job 3106949 queued and waiting for resources
salloc.exe: job 3106949 has been allocated resources
salloc.exe: Granted job allocation 3106949
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3101 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[user@cn3101 ~]$ module load purge_haplotigs
[+] Loading bedtools  2.27.1 
[+] Loading samtools 1.8  ... 
[+] Loading blast 2.7.1+  ... 
[+] Loading LASTZ  1.04.00  on cn3101 
[+] Loading gcc  7.2.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[+] Loading openmpi 3.0.0  for GCC 7.2.0 
[+] Loading R 3.5.0_build2 
[+] Loading purge_haplotigs, version 0~20180529.e6fffc9... 
[user@cn3101 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3101 3106949]$ wget \
 https://zenodo.org/record/841398/files/cns_h_ctg.fasta \
 https://zenodo.org/record/841398/files/cns_p_ctg.aligned.sd.bam{,.bai} \
 https://zenodo.org/record/841398/files/cns_p_ctg.fasta{,.fai} \
 https://zenodo.org/record/841398/files/mapping.sh \
 https://zenodo.org/record/841398/files/README \
 https://zenodo.org/record/841398/files/tutorial_output.tar.gz
[user@cn3101 3106949]$ ln -s /fdb/purge_haplotigs/* .
[user@cn3101 3106949]$ purge_haplotigs readhist cns_p_ctg.aligned.sd.bam
[12-06-2018 16:45:33] CHECKING bedtools... OK
[12-06-2018 16:45:33] CHECKING Rscript... OK
[12-06-2018 16:45:33] CHECKING samtools... OK
[12-06-2018 16:45:33] INFO: ALL DEPENDENCIES OK
[12-06-2018 16:45:33] RUNNING: bedtools genomecov -ibam cns_p_ctg.aligned.sd.bam -max 200 > tmp_purge_haplotigs/cns_p_ctg.aligned.sd.bam.genecov.temp 2> tmp_purge_haplotigs/STDERR/bedtools.genomecov.stderr  &&  mv tmp_purge_haplotigs/cns_p_ctg.aligned.sd.bam.genecov.temp cns_p_ctg.aligned.sd.bam.genecov
[12-06-2018 16:47:27] FINISHED: bedtools genomecov -ibam cns_p_ctg.aligned.sd.bam -max 200 > tmp_purge_haplotigs/cns_p_ctg.aligned.sd.bam.genecov.temp 2> tmp_purge_haplotigs/STDERR/bedtools.genomecov.stderr  &&  mv tmp_purge_haplotigs/cns_p_ctg.aligned.sd.bam.genecov.temp cns_p_ctg.aligned.sd.bam.genecov
[12-06-2018 16:47:27] RUNNING: grep genome cns_p_ctg.aligned.sd.bam.genecov | awk '{ print $2 "," $3 }' > tmp_purge_haplotigs/MISC/cns_p_ctg.aligned.sd.bam.histogram.csv
[12-06-2018 16:47:27] FINISHED: grep genome cns_p_ctg.aligned.sd.bam.genecov | awk '{ print $2 "," $3 }' > tmp_purge_haplotigs/MISC/cns_p_ctg.aligned.sd.bam.histogram.csv
[12-06-2018 16:47:27] RUNNING: /usr/local/apps/purge_haplotigs/0~20180529.e6fffc9/scripts/../scripts//gen_histogram.Rscript tmp_purge_haplotigs/MISC/cns_p_ctg.aligned.sd.bam.histogram.csv cns_p_ctg.aligned.sd.bam.histogram.png 2> tmp_purge_haplotigs/STDERR/gen_histogram.stderr
[12-06-2018 16:47:29] FINISHED: /usr/local/apps/purge_haplotigs/0~20180529.e6fffc9/scripts/../scripts//gen_histogram.Rscript tmp_purge_haplotigs/MISC/cns_p_ctg.aligned.sd.bam.histogram.csv cns_p_ctg.aligned.sd.bam.histogram.png 2> tmp_purge_haplotigs/STDERR/gen_histogram.stderr
[12-06-2018 16:47:29] INFO: 
purge_haplotigs readhist has finished! 

Check 'cns_p_ctg.aligned.sd.bam.histogram.png' to observe where your haploid and diploid peaks are
and choose your low, midpoint, and high cutoffs (check the example histogram png 
files in this git to give you an idea). You will need 'cns_p_ctg.aligned.sd.bam.gencov' and the 
cutoffs for the next step 'purge_haplotigs contigcov'

[user@cn3101 3106949]$ purge_haplotigs  contigcov  -i cns_p_ctg.aligned.sd.bam.genecov  -o coverage_stats.csv  -l 20  -m 75  -h 190

Analysis finished successfully! Contig coverage stats saved to 'coverage_stats.csv'. 
[user@cn3101 3106949]$ purge_haplotigs purge  -g cns_p_ctg.fasta  -c coverage_stats.csv  -b cns_p_ctg.aligned.sd.bam  -t 4  -a 60
[12-06-2018 16:51:14] CHECKING bash... OK
[12-06-2018 16:51:14] CHECKING bedtools... OK
[12-06-2018 16:51:14] CHECKING blastn... OK
[12-06-2018 16:51:14] CHECKING lastz... OK
[12-06-2018 16:51:14] CHECKING makeblastdb... OK
[12-06-2018 16:51:14] CHECKING samtools... OK
[12-06-2018 16:51:14] INFO: All required dependencies OK
[12-06-2018 16:51:14] CHECKING windowmasker... OK
[12-06-2018 16:51:14] INFO: windowmasker available for blastn hit search
[12-06-2018 16:51:15] INFO: 
Beginning Pipeline

PARAMETERS:
Genome fasta:           cns_p_ctg.fasta
Coverage csv:           coverage_stats.csv
Bam file:               cns_p_ctg.aligned.sd.bam
Threads:                4
Cutoff, alignment:      60 %
Cutoff, repeat:         250 %
Cutoff, suspect:        40 %
Out prefix:             curated
Coverage window len:    9000 bp
Window step dist:       3000 bp
Blastn parameters:      -evalue 0.00001 -max_target_seqs 50 -max_hsps 1000 -word_size 28 -culling_limit 10
Lastz paramters:        --gfextend --chain --gapped --seed=match14 --allocate:traceback=200.0M

Running using command:
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -b cns_p_ctg.aligned.sd.bam -t 4 -a 60


[12-06-2018 16:51:15] INFO: reading in genome index file: cns_p_ctg.fasta.fai
[12-06-2018 16:51:15] INFO: reading in contig coverage stats file: coverage_stats.csv
[12-06-2018 16:51:15] INFO: "mincing" genome
[12-06-2018 16:51:15] INFO: Getting windowed read depth coverage for each contig
[12-06-2018 16:51:15] RUNNING: bedtools makewindows -g cns_p_ctg.fasta.fai -w 9000 -s 3000 > tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed.tmp 2> tmp_purge_haplotigs/STDERR/bedtools.mkwind.stderr  &&  mv tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed.tmp tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed
[12-06-2018 16:51:15] FINISHED: bedtools makewindows -g cns_p_ctg.fasta.fai -w 9000 -s 3000 > tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed.tmp 2> tmp_purge_haplotigs/STDERR/bedtools.mkwind.stderr  &&  mv tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed.tmp tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed
[12-06-2018 16:51:15] RUNNING: bedtools multicov -bams cns_p_ctg.aligned.sd.bam -bed tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed > tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov.tmp 2> tmp_purge_haplotigs/STDERR/bedtools.mcov.stderr  &&  mv tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov.tmp tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov
[12-06-2018 16:56:38] FINISHED: bedtools multicov -bams cns_p_ctg.aligned.sd.bam -bed tmp_purge_haplotigs/COV/cns_p_ctg.fasta.windows.bed > tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov.tmp 2> tmp_purge_haplotigs/STDERR/bedtools.mcov.stderr  &&  mv tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov.tmp tmp_purge_haplotigs/COV/cns_p_ctg.fasta.mcov
[12-06-2018 16:56:39] INFO: preparing blastdb
[12-06-2018 16:56:39] RUNNING: makeblastdb -in tmp_purge_haplotigs/assembly.fasta -dbtype nucl -out tmp_purge_haplotigs/BLASTDB/assembly 2>&1 1> tmp_purge_haplotigs/STDERR/makeblastdb.stderr
[12-06-2018 16:56:40] FINISHED: makeblastdb -in tmp_purge_haplotigs/assembly.fasta -dbtype nucl -out tmp_purge_haplotigs/BLASTDB/assembly 2>&1 1> tmp_purge_haplotigs/STDERR/makeblastdb.stderr
[12-06-2018 16:56:40] INFO: Performing blastn hit search
[12-06-2018 16:57:00] INFO: Finished blastn hit search
[12-06-2018 16:57:00] INFO: preparing blastn hit summary
[12-06-2018 16:57:00] INFO: 

###

RUNNING PURGING PASS 1

###

[12-06-2018 16:57:00] INFO: getting contig hits from blastn output
[12-06-2018 16:57:00] INFO: Running lastz analysis on blastn hits
[12-06-2018 16:58:31] INFO: Checking contig assignments for conflicts
[12-06-2018 16:58:31] INFO: conflict: 000010F and it's match 000070F both flagged for reassignment
[12-06-2018 16:58:31] INFO: keeping longer contig 000010F
[12-06-2018 16:58:31] INFO: conflict: 000024F and it's match 000032F both flagged for reassignment
[12-06-2018 16:58:31] INFO: keeping longer contig 000024F
[12-06-2018 16:58:31] INFO: Logging reassignments and checking for convergence
[12-06-2018 16:58:31] INFO: convergence not reached, more passes needed
[12-06-2018 16:58:31] INFO: 

###

RUNNING PURGING PASS 2

###

[12-06-2018 16:58:31] INFO: getting contig hits from blastn output
[12-06-2018 16:58:31] INFO: Running lastz analysis on blastn hits
[12-06-2018 16:59:30] INFO: Checking contig assignments for conflicts
[12-06-2018 16:59:30] INFO: Logging reassignments and checking for convergence
[12-06-2018 16:59:30] INFO: convergence reached!
[12-06-2018 16:59:30] INFO: 

###

GENERATING OUTPUT

###

[12-06-2018 16:59:30] INFO: writing contig association paths
[12-06-2018 16:59:30] INFO: writing the reassignment table and new assembly files
[12-06-2018 16:59:30] INFO: 

###

PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY!

###
[user@cn3101 3106949]$ cp curated.haplotigs.fasta reassigned.haplotigs.fasta
[user@cn3101 3106949]$ cat cns_h_ctg.fasta >> curated.haplotigs.fasta
[user@cn3101 3106949]$ exit
salloc.exe: Relinquishing job allocation 3106949
[user@biowulf ~]$
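In the session above, the contigcov cutoffs (-l 20 -m 75 -h 190) were read off the histogram PNG produced by the readhist step. When the PNG cannot be viewed directly (note the skipped X11 forwarding in the session), the depth,count CSV that readhist leaves under tmp_purge_haplotigs/MISC/ can be inspected numerically instead. A minimal sketch; the peak-listing heuristic below is purely illustrative and not part of purge_haplotigs:

```shell
# Rough numeric peek at the read-depth histogram: list the most common
# depths (ignoring depth 0 and the -max 200 overflow bin) as a guide to
# where the haploid and diploid coverage peaks sit.
peak_depths() {
    awk -F, '$1 > 0 && $1 < 200 { print $2, $1 }' "$1" \
        | sort -rn | head \
        | awk '{ print "depth", $2, "count", $1 }'
}

# CSV written by "purge_haplotigs readhist" in the session above
csv=tmp_purge_haplotigs/MISC/cns_p_ctg.aligned.sd.bam.histogram.csv
if [ -f "$csv" ]; then peak_depths "$csv"; fi
```

The haploid peak should sit near half the depth of the diploid peak; -l goes below the haploid peak, -h above the diploid peak, and -m in the trough between them.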

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. purge_haplotigs.sh). For example:

#!/bin/sh
set -e
module load purge_haplotigs

purge_haplotigs  readhist  cns_p_ctg.aligned.sd.bam

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] purge_haplotigs.sh
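The batch script above runs only the readhist step. Once the cutoffs have been read off the histogram, the remaining steps can be scripted as well; a sketch using the tutorial's file names and cutoff values (substitute your own cutoffs and inputs):

```shell
#!/bin/sh
set -e
module load purge_haplotigs

# Coverage stats and purging; the -l/-m/-h cutoffs below are the tutorial
# values and must be chosen from your own readhist histogram.
purge_haplotigs contigcov -i cns_p_ctg.aligned.sd.bam.genecov \
    -o coverage_stats.csv -l 20 -m 75 -h 190
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv \
    -b cns_p_ctg.aligned.sd.bam -t $SLURM_CPUS_PER_TASK -a 60
```

Match -t to the sbatch --cpus-per-task allocation so the purge step uses the CPUs it was given.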
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. purge_haplotigs.swarm). For example:

purge_haplotigs readhist sample1.bam
purge_haplotigs readhist sample2.bam
purge_haplotigs readhist sample3.bam
purge_haplotigs readhist sample4.bam

Submit this job using the swarm command.

swarm -f purge_haplotigs.swarm [-g #] [-t #] --module purge_haplotigs
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module purge_haplotigs Loads the purge_haplotigs module for each subjob in the swarm
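For many samples, the swarmfile can be generated rather than typed out by hand; a small sketch (the gen_swarm helper and BAM names are illustrative):

```shell
# Emit one "purge_haplotigs readhist" line per BAM given as an argument.
gen_swarm() {
    for bam in "$@"; do
        echo "purge_haplotigs readhist $bam"
    done
}

# Build the swarmfile from all BAMs in the current directory.
gen_swarm *.bam > purge_haplotigs.swarm
```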