Biowulf High Performance Computing at the NIH
purge_haplotigs on Biowulf

purge_haplotigs is a pipeline to help with curating heterozygous diploid genome assemblies.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session based on the tutorial at https://bitbucket.org/mroachawri/purge_haplotigs/wiki/Tutorial (user input in bold):

[user@biowulf]$ sinteractive --mem 8g --gres lscratch:10
salloc.exe: Pending job allocation 3106949
salloc.exe: job 3106949 queued and waiting for resources
salloc.exe: job 3106949 has been allocated resources
salloc.exe: Granted job allocation 3106949
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3101 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[user@cn3101 ~]$ module load purge_haplotigs
[+] Loading bedtools  2.27.1 
[+] Loading samtools 1.9  ... 
[+] Loading minimap2, version 2.15... 
[+] Loading mummer  4.0.0beta2  on cn3142 
[+] Loading gcc  7.2.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[+] Loading openmpi 3.0.0  for GCC 7.2.0 
[+] Loading R 3.5.0_build2 
[+] Loading purge_haplotigs, version 1.0.4... 
[user@cn3101 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3101 3106949]$ ln -s /fdb/purge_haplotigs/* .
[user@cn3101 3106949]$ purge_haplotigs readhist -b cns_p_ctg.aligned.sd.bam  -g cns_p_ctg.fasta
    [25-01-2019 15:34:13] CHECKING bedtools... OK
    [25-01-2019 15:34:13] CHECKING Rscript... OK
    [25-01-2019 15:34:13] CHECKING samtools... OK
    [25-01-2019 15:34:13] INFO: ALL DEPENDENCIES OK


    [25-01-2019 15:34:13] INFO: Beginning read-depth histogram generation
    [25-01-2019 15:34:13] INFO: running genome coverage analysis on cns_p_ctg.aligned.sd.bam
    [25-01-2019 15:40:03] INFO: generating histogram
    [25-01-2019 15:40:03] INFO: generating png image
    [25-01-2019 15:40:04] INFO: 

    Pipeline finished! Your histogram is saved to: cns_p_ctg.aligned.sd.bam.histogram.png

    [25-01-2019 15:40:04] INFO: 
    Check your histogram to observe where your haploid and diploid peaks are
    and choose your low, midpoint, and high cutoffs (check the example histogram png 
    in the readme). You will need 'cns_p_ctg.aligned.sd.bam.gencov' and the cutoffs for the next 
    step: 'purge_haplotigs contigcov'

[user@cn3101 3106949]$ purge_haplotigs contigcov -i cns_p_ctg.aligned.sd.bam.gencov  -l 10  -m 70  -h 190

Analysis finished successfully! Contig coverage stats saved to 'coverage_stats.csv'.
[user@cn3101 3106949]$ purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv
[25-01-2019 15:42:17] CHECKING bash... OK
[25-01-2019 15:42:17] CHECKING bedtools... OK
[25-01-2019 15:42:17] CHECKING minimap2... OK
[25-01-2019 15:42:17] CHECKING samtools... OK
[25-01-2019 15:42:17] INFO: 
Beginning Pipeline

PARAMETERS:
Genome fasta:           cns_p_ctg.fasta
Coverage csv:           coverage_stats.csv
Produce dotplots:       FALSE
Threads:                4
I/O intense jobs:       4
Cutoff, alignment:      70 %
Cutoff, repeat:         250 %
Cutoff, suspect:        5 %
Out prefix:             curated
minimap2 parameters:    '-p 1e-5 -f 0.001 -N 100000'

Running using command:
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv


[25-01-2019 15:42:17] INFO: 

###

PREPARATION

###

[25-01-2019 15:42:17] INFO: Reading cns_p_ctg.fasta.fai
[25-01-2019 15:42:17] INFO: Reading coverage_stats.csv
[25-01-2019 15:42:17] INFO: "Un-concatenating" genome
[25-01-2019 15:42:18] INFO: Building minimap2 index of assembly
[25-01-2019 15:42:21] INFO: Preparing to run minimap2
[25-01-2019 15:42:21] INFO: Performing minimap2 alignments
[25-01-2019 15:42:25] INFO: Indexing minimap2 alignments
[25-01-2019 15:42:25] INFO: Finished minimap2 alignments
[25-01-2019 15:42:25] INFO: Reading index of minimap2 alignments
[25-01-2019 15:42:25] INFO: Preparing contig hit summary
[25-01-2019 15:42:26] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:26] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment
[25-01-2019 15:42:26] INFO: 	Keeping longer contig 000010F
[25-01-2019 15:42:26] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment
[25-01-2019 15:42:26] INFO: 	Keeping longer contig 000024F
[25-01-2019 15:42:26] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment
[25-01-2019 15:42:26] INFO: 	Keeping longer contig 000046F
[25-01-2019 15:42:26] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment
[25-01-2019 15:42:26] INFO: 	Keeping longer contig 000062F
[25-01-2019 15:42:26] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment
[25-01-2019 15:42:26] INFO: 	Keeping longer contig 000071F
[25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:26] INFO: Convergence not reached, more passes needed
[25-01-2019 15:42:26] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:26] INFO: Convergence not reached, more passes needed
[25-01-2019 15:42:26] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:26] INFO: Convergence reached!
[25-01-2019 15:42:26] INFO: Checking for over-purging
[25-01-2019 15:42:26] INFO: Fixing over-purged contigs
[25-01-2019 15:42:26] INFO: 

###

GENERATING OUTPUT

###

[25-01-2019 15:42:26] INFO: Writing contig association paths
[25-01-2019 15:42:26] INFO: Writing the reassignment table and new assembly files
[25-01-2019 15:42:27] INFO: 

###

PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY!

###

[user@cn3101 3106949]$ purge_haplotigs purge  -g cns_p_ctg.fasta  -c coverage_stats.csv  -a 60
[25-01-2019 15:42:38] CHECKING bash... OK
[25-01-2019 15:42:38] CHECKING bedtools... OK
[25-01-2019 15:42:38] CHECKING minimap2... OK
[25-01-2019 15:42:38] CHECKING samtools... OK
[25-01-2019 15:42:38] INFO: 
Beginning Pipeline

PARAMETERS:
Genome fasta:           cns_p_ctg.fasta
Coverage csv:           coverage_stats.csv
Produce dotplots:       FALSE
Threads:                4
I/O intense jobs:       4
Cutoff, alignment:      60 %
Cutoff, repeat:         250 %
Cutoff, suspect:        5 %
Out prefix:             curated
minimap2 parameters:    '-p 1e-5 -f 0.001 -N 100000'

Running using command:
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60


[25-01-2019 15:42:38] INFO: 

###

PREPARATION

###

[25-01-2019 15:42:38] INFO: Reading cns_p_ctg.fasta.fai
[25-01-2019 15:42:38] INFO: Reading coverage_stats.csv
[25-01-2019 15:42:38] INFO: Reusing minimap2 alignments from previous run
[25-01-2019 15:42:38] INFO: Reading index of minimap2 alignments
[25-01-2019 15:42:38] INFO: Reusing hit summary file from previous run
[25-01-2019 15:42:38] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:38] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment
[25-01-2019 15:42:38] INFO: 	Keeping longer contig 000010F
[25-01-2019 15:42:38] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment
[25-01-2019 15:42:38] INFO: 	Keeping longer contig 000024F
[25-01-2019 15:42:38] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment
[25-01-2019 15:42:38] INFO: 	Keeping longer contig 000046F
[25-01-2019 15:42:38] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment
[25-01-2019 15:42:38] INFO: 	Keeping longer contig 000062F
[25-01-2019 15:42:38] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment
[25-01-2019 15:42:38] INFO: 	Keeping longer contig 000071F
[25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:38] INFO: Convergence not reached, more passes needed
[25-01-2019 15:42:38] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:38] INFO: Convergence not reached, more passes needed
[25-01-2019 15:42:38] INFO: Reading contig hits from hit summary
[25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts
[25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:42:38] INFO: Convergence reached!
[25-01-2019 15:42:38] INFO: Checking for over-purging
[25-01-2019 15:42:38] INFO: Fixing over-purged contigs
[25-01-2019 15:42:38] INFO: 

###

GENERATING OUTPUT

###

[25-01-2019 15:42:38] INFO: Writing contig association paths
[25-01-2019 15:42:38] INFO: Writing the reassignment table and new assembly files
[25-01-2019 15:42:39] INFO: 

###

PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY!

###

[teacher@cn3142 18957422]$ purge_haplotigs purge  -g cns_p_ctg.fasta  -c coverage_stats.csv  -a 60  -d  -b cns_p_ctg.aligned.sd.bam
[25-01-2019 15:42:46] CHECKING bash... OK
[25-01-2019 15:42:46] CHECKING bedtools... OK
[25-01-2019 15:42:46] CHECKING minimap2... OK
[25-01-2019 15:42:46] CHECKING samtools... OK
[25-01-2019 15:42:46] CHECKING Rscript... OK
[25-01-2019 15:42:46] INFO: 
Beginning Pipeline

PARAMETERS:
Genome fasta:           cns_p_ctg.fasta
Coverage csv:           coverage_stats.csv
Produce dotplots:       TRUE
Bam file:               cns_p_ctg.aligned.sd.bam
Min cov window len:     5000 bp
max ctg cov windows:    200
Threads:                4
I/O intense jobs:       4
Cutoff, alignment:      60 %
Cutoff, repeat:         250 %
Cutoff, suspect:        5 %
Out prefix:             curated
minimap2 parameters:    '-p 1e-5 -f 0.001 -N 100000'

Running using command:
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 -d -b cns_p_ctg.aligned.sd.bam


[25-01-2019 15:42:46] INFO: 

###

PREPARATION

###

[25-01-2019 15:42:46] INFO: Reading cns_p_ctg.fasta.fai
[25-01-2019 15:42:46] INFO: Reading coverage_stats.csv
[25-01-2019 15:42:46] INFO: Getting windowed read-depth for each contig
[25-01-2019 15:44:44] INFO: Generating log2(read-depth / average read-depth) coverages for plotting later
[25-01-2019 15:44:44] INFO: Reusing minimap2 alignments from previous run
[25-01-2019 15:44:44] INFO: Reading index of minimap2 alignments
[25-01-2019 15:44:44] INFO: Reusing hit summary file from previous run
[25-01-2019 15:44:44] INFO: Reading contig hits from hit summary
[25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts
[25-01-2019 15:44:44] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment
[25-01-2019 15:44:44] INFO: 	Keeping longer contig 000010F
[25-01-2019 15:44:44] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment
[25-01-2019 15:44:44] INFO: 	Keeping longer contig 000024F
[25-01-2019 15:44:44] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment
[25-01-2019 15:44:44] INFO: 	Keeping longer contig 000046F
[25-01-2019 15:44:44] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment
[25-01-2019 15:44:44] INFO: 	Keeping longer contig 000062F
[25-01-2019 15:44:44] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment
[25-01-2019 15:44:44] INFO: 	Keeping longer contig 000071F
[25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:44:44] INFO: Convergence not reached, more passes needed
[25-01-2019 15:44:44] INFO: Reading contig hits from hit summary
[25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts
[25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:44:44] INFO: Convergence not reached, more passes needed
[25-01-2019 15:44:44] INFO: Reading contig hits from hit summary
[25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits
[25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts
[25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence
[25-01-2019 15:44:44] INFO: Convergence reached!
[25-01-2019 15:44:44] INFO: Checking for over-purging
[25-01-2019 15:45:24] INFO: Fixing over-purged contigs
[25-01-2019 15:45:24] INFO: 

###

GENERATING OUTPUT

###

[25-01-2019 15:45:24] INFO: Writing contig association paths
[25-01-2019 15:45:24] INFO: Writing the reassignment table and new assembly files
[25-01-2019 15:45:24] INFO: 

###

PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY!

###

[user@cn3101 3106949]$ exit
salloc.exe: Relinquishing job allocation 3106949
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. purge_haplotigs.sh). For example:

#!/bin/sh
set -e
module load purge_haplotigs

purge_haplotigs  readhist  cns_p_ctg.aligned.sd.bam

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] purge_haplotigs.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. purge_haplotigs.swarm). For example:

purge_haplotigs readhist sample1.bam
purge_haplotigs readhist sample2.bam
purge_haplotigs readhist sample3.bam
purge_haplotigs readhist sample4.bam

Submit this job using the swarm command.

swarm -f purge_haplotigs.swarm [-g #] [-t #] --module purge_haplotigs
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module purge_haplotigs Loads the purge_haplotigs module for each subjob in the swarm