purge_haplotigs on Biowulf
purge_haplotigs is a pipeline to help with curating heterozygous diploid genome assemblies.
References:
- Purge Haplotigs: Synteny Reduction for Third-gen Diploid Genome Assemblies. Michael J Roach, Simon A Schmidt, Anthony R Borneman. bioRxiv 286252; doi:10.1101/286252
Documentation
Important Notes
- Module Name: purge_haplotigs (see the modules page for more information)
- Example data in /fdb/purge_haplotigs
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session based on the tutorial at https://bitbucket.org/mroachawri/purge_haplotigs/wiki/Tutorial (user input in bold):
[user@biowulf]$ sinteractive --mem 8g --gres lscratch:10 salloc.exe: Pending job allocation 3106949 salloc.exe: job 3106949 queued and waiting for resources salloc.exe: job 3106949 has been allocated resources salloc.exe: Granted job allocation 3106949 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3101 are ready for job srun: error: x11: no local DISPLAY defined, skipping [user@cn3101 ~]$ module load purge_haplotigs [+] Loading bedtools 2.27.1 [+] Loading samtools 1.9 ... [+] Loading minimap2, version 2.15... [+] Loading mummer 4.0.0beta2 on cn3142 [+] Loading gcc 7.2.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [+] Loading openmpi 3.0.0 for GCC 7.2.0 [+] Loading R 3.5.0_build2 [+] Loading purge_haplotigs, version 1.0.4... [user@cn3101 ~]$ cd /lscratch/$SLURM_JOB_ID [user@cn3101 3106949]$ ln -s /fdb/purge_haplotigs/* . [user@cn3101 3106949]$ purge_haplotigs readhist -b cns_p_ctg.aligned.sd.bam -g cns_p_ctg.fasta [25-01-2019 15:34:13] CHECKING bedtools... OK [25-01-2019 15:34:13] CHECKING Rscript... OK [25-01-2019 15:34:13] CHECKING samtools... OK [25-01-2019 15:34:13] INFO: ALL DEPENDENCIES OK [25-01-2019 15:34:13] INFO: Beginning read-depth histogram generation [25-01-2019 15:34:13] INFO: running genome coverage analysis on cns_p_ctg.aligned.sd.bam [25-01-2019 15:40:03] INFO: generating histogram [25-01-2019 15:40:03] INFO: generating png image [25-01-2019 15:40:04] INFO: Pipeline finished! Your histogram is saved to: cns_p_ctg.aligned.sd.bam.histogram.png [25-01-2019 15:40:04] INFO: Check your histogram to observe where your haploid and diploid peaks are and choose your low, midpoint, and high cutoffs (check the example histogram png in the readme). You will need 'cns_p_ctg.aligned.sd.bam.gencov' and the cutoffs for the next step: 'purge_haplotigs contigcov' [user@cn3101 3106949]$ purge_haplotigs contigcov -i cns_p_ctg.aligned.sd.bam.gencov -l 10 -m 70 -h 190 Analysis finished successfully! Contig coverage stats saved to 'coverage_stats.csv'. [user@cn3101 3106949]$ purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv [25-01-2019 15:42:17] CHECKING bash... OK [25-01-2019 15:42:17] CHECKING bedtools... OK [25-01-2019 15:42:17] CHECKING minimap2... OK [25-01-2019 15:42:17] CHECKING samtools... OK [25-01-2019 15:42:17] INFO: Beginning Pipeline PARAMETERS: Genome fasta: cns_p_ctg.fasta Coverage csv: coverage_stats.csv Produce dotplots: FALSE Threads: 4 I/O intense jobs: 4 Cutoff, alignment: 70 % Cutoff, repeat: 250 % Cutoff, suspect: 5 % Out prefix: curated minimap2 parameters: '-p 1e-5 -f 0.001 -N 100000' Running using command: purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv [25-01-2019 15:42:17] INFO: ### PREPARATION ### [25-01-2019 15:42:17] INFO: Reading cns_p_ctg.fasta.fai [25-01-2019 15:42:17] INFO: Reading coverage_stats.csv [25-01-2019 15:42:17] INFO: "Un-concatenating" genome [25-01-2019 15:42:18] INFO: Building minimap2 index of assembly [25-01-2019 15:42:21] INFO: Preparing to run minimap2 [25-01-2019 15:42:21] INFO: Performing minimap2 alignments [25-01-2019 15:42:25] INFO: Indexing minimap2 alignments [25-01-2019 15:42:25] INFO: Finished minimap2 alignments [25-01-2019 15:42:25] INFO: Reading index of minimap2 alignments [25-01-2019 15:42:25] INFO: Preparing contig hit summary [25-01-2019 15:42:26] INFO: Reading contig hits from hit summary [25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:26] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment [25-01-2019 15:42:26] INFO: Keeping longer contig 000010F [25-01-2019 15:42:26] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment [25-01-2019 15:42:26] INFO: Keeping longer contig 000024F [25-01-2019 15:42:26] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment [25-01-2019 15:42:26] INFO: Keeping longer contig 000046F [25-01-2019 15:42:26] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment [25-01-2019 15:42:26] INFO: Keeping longer contig 000062F [25-01-2019 15:42:26] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment [25-01-2019 15:42:26] INFO: Keeping longer contig 000071F [25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:26] INFO: Convergence not reached, more passes needed [25-01-2019 15:42:26] INFO: Reading contig hits from hit summary [25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:26] INFO: Convergence not reached, more passes needed [25-01-2019 15:42:26] INFO: Reading contig hits from hit summary [25-01-2019 15:42:26] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:26] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:26] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:26] INFO: Convergence reached! [25-01-2019 15:42:26] INFO: Checking for over-purging [25-01-2019 15:42:26] INFO: Fixing over-purged contigs [25-01-2019 15:42:26] INFO: ### GENERATING OUTPUT ### [25-01-2019 15:42:26] INFO: Writing contig association paths [25-01-2019 15:42:26] INFO: Writing the reassignment table and new assembly files [25-01-2019 15:42:27] INFO: ### PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY! ### [user@cn3101 3106949]$ purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 [25-01-2019 15:42:38] CHECKING bash... OK [25-01-2019 15:42:38] CHECKING bedtools... OK [25-01-2019 15:42:38] CHECKING minimap2... OK [25-01-2019 15:42:38] CHECKING samtools... OK [25-01-2019 15:42:38] INFO: Beginning Pipeline PARAMETERS: Genome fasta: cns_p_ctg.fasta Coverage csv: coverage_stats.csv Produce dotplots: FALSE Threads: 4 I/O intense jobs: 4 Cutoff, alignment: 60 % Cutoff, repeat: 250 % Cutoff, suspect: 5 % Out prefix: curated minimap2 parameters: '-p 1e-5 -f 0.001 -N 100000' Running using command: purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 [25-01-2019 15:42:38] INFO: ### PREPARATION ### [25-01-2019 15:42:38] INFO: Reading cns_p_ctg.fasta.fai [25-01-2019 15:42:38] INFO: Reading coverage_stats.csv [25-01-2019 15:42:38] INFO: Reusing minimap2 alignments from previous run [25-01-2019 15:42:38] INFO: Reading index of minimap2 alignments [25-01-2019 15:42:38] INFO: Reusing hit summary file from previous run [25-01-2019 15:42:38] INFO: Reading contig hits from hit summary [25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:38] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment [25-01-2019 15:42:38] INFO: Keeping longer contig 000010F [25-01-2019 15:42:38] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment [25-01-2019 15:42:38] INFO: Keeping longer contig 000024F [25-01-2019 15:42:38] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment [25-01-2019 15:42:38] INFO: Keeping longer contig 000046F [25-01-2019 15:42:38] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment [25-01-2019 15:42:38] INFO: Keeping longer contig 000062F [25-01-2019 15:42:38] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment [25-01-2019 15:42:38] INFO: Keeping longer contig 000071F [25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:38] INFO: Convergence not reached, more passes needed [25-01-2019 15:42:38] INFO: Reading contig hits from hit summary [25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:38] INFO: Convergence not reached, more passes needed [25-01-2019 15:42:38] INFO: Reading contig hits from hit summary [25-01-2019 15:42:38] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:42:38] INFO: Checking contig assignments for conflicts [25-01-2019 15:42:38] INFO: Logging reassignments and checking for convergence [25-01-2019 15:42:38] INFO: Convergence reached! [25-01-2019 15:42:38] INFO: Checking for over-purging [25-01-2019 15:42:38] INFO: Fixing over-purged contigs [25-01-2019 15:42:38] INFO: ### GENERATING OUTPUT ### [25-01-2019 15:42:38] INFO: Writing contig association paths [25-01-2019 15:42:38] INFO: Writing the reassignment table and new assembly files [25-01-2019 15:42:39] INFO: ### PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY! ### [teacher@cn3142 18957422]$ purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 -d -b cns_p_ctg.aligned.sd.bam [25-01-2019 15:42:46] CHECKING bash... OK [25-01-2019 15:42:46] CHECKING bedtools... OK [25-01-2019 15:42:46] CHECKING minimap2... OK [25-01-2019 15:42:46] CHECKING samtools... OK [25-01-2019 15:42:46] CHECKING Rscript... OK [25-01-2019 15:42:46] INFO: Beginning Pipeline PARAMETERS: Genome fasta: cns_p_ctg.fasta Coverage csv: coverage_stats.csv Produce dotplots: TRUE Bam file: cns_p_ctg.aligned.sd.bam Min cov window len: 5000 bp max ctg cov windows: 200 Threads: 4 I/O intense jobs: 4 Cutoff, alignment: 60 % Cutoff, repeat: 250 % Cutoff, suspect: 5 % Out prefix: curated minimap2 parameters: '-p 1e-5 -f 0.001 -N 100000' Running using command: purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 -d -b cns_p_ctg.aligned.sd.bam [25-01-2019 15:42:46] INFO: ### PREPARATION ### [25-01-2019 15:42:46] INFO: Reading cns_p_ctg.fasta.fai [25-01-2019 15:42:46] INFO: Reading coverage_stats.csv [25-01-2019 15:42:46] INFO: Getting windowed read-depth for each contig [25-01-2019 15:44:44] INFO: Generating log2(read-depth / average read-depth) coverages for plotting later [25-01-2019 15:44:44] INFO: Reusing minimap2 alignments from previous run [25-01-2019 15:44:44] INFO: Reading index of minimap2 alignments [25-01-2019 15:44:44] INFO: Reusing hit summary file from previous run [25-01-2019 15:44:44] INFO: Reading contig hits from hit summary [25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts [25-01-2019 15:44:44] INFO: CONFLICT: 000010F and it's match 000070F both flagged for reassignment [25-01-2019 15:44:44] INFO: Keeping longer contig 000010F [25-01-2019 15:44:44] INFO: CONFLICT: 000024F and it's match 000032F both flagged for reassignment [25-01-2019 15:44:44] INFO: Keeping longer contig 000024F [25-01-2019 15:44:44] INFO: CONFLICT: 000046F and it's match 000050F both flagged for reassignment [25-01-2019 15:44:44] INFO: Keeping longer contig 000046F [25-01-2019 15:44:44] INFO: CONFLICT: 000062F and it's match 000072F both flagged for reassignment [25-01-2019 15:44:44] INFO: Keeping longer contig 000062F [25-01-2019 15:44:44] INFO: CONFLICT: 000071F and it's match 000075F both flagged for reassignment [25-01-2019 15:44:44] INFO: Keeping longer contig 000071F [25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence [25-01-2019 15:44:44] INFO: Convergence not reached, more passes needed [25-01-2019 15:44:44] INFO: Reading contig hits from hit summary [25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts [25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence [25-01-2019 15:44:44] INFO: Convergence not reached, more passes needed [25-01-2019 15:44:44] INFO: Reading contig hits from hit summary [25-01-2019 15:44:44] INFO: Performing pairwise comparisons on contig hits [25-01-2019 15:44:44] INFO: Checking contig assignments for conflicts [25-01-2019 15:44:44] INFO: Logging reassignments and checking for convergence [25-01-2019 15:44:44] INFO: Convergence reached! [25-01-2019 15:44:44] INFO: Checking for over-purging [25-01-2019 15:45:24] INFO: Fixing over-purged contigs [25-01-2019 15:45:24] INFO: ### GENERATING OUTPUT ### [25-01-2019 15:45:24] INFO: Writing contig association paths [25-01-2019 15:45:24] INFO: Writing the reassignment table and new assembly files [25-01-2019 15:45:24] INFO: ### PURGE HAPLOTIGS HAS COMPLETED SUCCESSFULLY! ### [user@cn3101 3106949]$ exit salloc.exe: Relinquishing job allocation 3106949 [user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. purge_haplotigs.sh). For example:
#!/bin/sh set -e module load purge_haplotigs purge_haplotigs readhist cns_p_ctg.aligned.sd.bam
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] purge_haplotigs.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. purge_haplotigs.swarm). For example:
purge_haplotigs readhist sample1.bam purge_haplotigs readhist sample2.bam purge_haplotigs readhist sample3.bam purge_haplotigs readhist sample4.bam
Submit this job using the swarm command.
swarm -f purge_haplotigs.swarm [-g #] [-t #] --module purge_haplotigswhere
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module purge_haplotigs | Loads the purge_haplotigs module for each subjob in the swarm |