High-Performance Computing at the NIH
Khmer on Biowulf

The khmer software (http://khmer.readthedocs.org/en/v2.0) is a set of command-line tools for working with DNA shotgun sequencing data from genomes, transcriptomes, metagenomes, and single cells. khmer can make de novo assemblies faster, and sometimes better. It can also identify (and fix) problems with shotgun data. It is primarily aimed at short-read sequencing data such as that produced by the Illumina platform, and it takes a k-mer-centric approach to sequence analysis.


Loading the khmer module environment

Before using khmer, you must add the khmer environment module and the other modules it uses into your shell environment. This is most easily done by using the module commands, as in the example below:

[user@biowulf]$ module avail khmer                   (see what versions are available)

-------------------------------------------- /usr/local/lmod/modulefiles ---------------------------------------------
   khmer/1.4.1    khmer/2.0 (D)

[user@biowulf]$ module load khmer                    (load the default version)
[user@biowulf]$
[user@biowulf]$ module list                          (see which modules are loaded)
Currently Loaded Modulefiles:
 1) khmer/2.0
[user@biowulf]$
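
As a quick sanity check after loading the module, a script can confirm that a khmer command is actually on the PATH before starting real work. The sketch below is illustrative only: the function name check_khmer is hypothetical, and it assumes the khmer scripts are exposed by name (normalize-by-median.py is used as the probe).

```shell
#!/bin/bash
# Sketch: confirm a khmer script is on PATH before starting real work.
# normalize-by-median.py is only the probe; any khmer script name would do.
check_khmer() {
    if command -v normalize-by-median.py >/dev/null 2>&1; then
        echo "khmer scripts found"
    else
        echo "khmer not found; run: module load khmer"
        return 1
    fi
}

# Report the result without aborting the shell if khmer is missing.
check_khmer || true
```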


Running khmer commands

A list of khmer commands is shown below separated into functional groupings.
See Khmer command-line interface for details of command-line options.


k-mer counting and abundance filtering
1. load-into-counting.py — Build a k-mer counting table from the given sequences.
2. abundance-dist.py — Calculate abundance distribution of the k-mers using a pre-made k-mer counting table.
3. abundance-dist-single.py — Calculate the abundance distribution of k-mers from a single sequence file.
4. filter-abund.py — Trim sequences at a minimum k-mer abundance.
5. filter-abund-single.py — Trim sequences at a minimum k-mer abundance (in-memory version).
6. trim-low-abund.py — Trim low-abundance k-mers using a streaming algorithm.
7. count-median.py — Compute k-mer summary statistics for sequences.
8. count-overlap.py — Count the overlap k-mers, that is, the k-mers that appear in both of two sequence datasets.
Partitioning
9. do-partition.py — Load, partition, and annotate FAST[AQ] sequences
10. load-graph.py — Load sequences into the compressible graph format plus optional tagset.
11. partition-graph.py — Partition a sequence graph based upon waypoint connectivity
12. merge-partition.py — Merge partition map '.pmap' files.
13. annotate-partitions.py — Annotate sequences with partition IDs.
14. extract-partitions.py — Separate sequences that are annotated with partitions into grouped files.
15. make-initial-stoptags.py — Find an initial set of highly connected k-mers.
16. find-knots.py — Find all highly connected k-mers.
17. filter-stoptags.py — Trim sequences at stoptags.
Digital normalization
18. normalize-by-median.py — Do digital normalization (remove mostly redundant sequences)
Read handling: interleaving, splitting, etc.
19. extract-long-sequences.py — Extract FASTQ or FASTA sequences longer than specified length (default: 200 bp).
20. extract-paired-reads.py — Take a mixture of reads and split into pairs and orphans.
21. fastq-to-fasta.py — Converts FASTQ format (.fq) files to FASTA format (.fa).
22. interleave-reads.py — Produce interleaved files from R1/R2 paired files
23. readstats.py — Display summary statistics for one or more FASTA/FASTQ files.
24. sample-reads-randomly.py — Uniformly subsample sequences from a collection of files
25. split-paired-reads.py — Split interleaved reads into two files, left and right.
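
To show how two of the commands above fit together, here is a minimal dry-run sketch of building a k-mer counting table and then computing its abundance distribution. The file names (reads.fa, reads.ct, reads.hist), the k-mer size of 20, and the run helper are assumptions made for illustration; the argument order follows the khmer 2.0 usage as best understood here, so check each script's --help before relying on it.

```shell
#!/bin/bash
# Dry-run sketch: print the commands instead of running them.
# Set DRY_RUN=0 (with the khmer module loaded) to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reads="reads.fa"    # hypothetical input sequence file
table="reads.ct"    # hypothetical counting-table file
hist="reads.hist"   # hypothetical histogram output file

# Step 1: build the k-mer counting table (-k sets the k-mer size).
run load-into-counting.py -k 20 "$table" "$reads"
# Step 2: compute the k-mer abundance distribution from that table.
run abundance-dist.py "$table" "$reads" "$hist"
```

With the default DRY_RUN=1, the script only echoes each command line prefixed with "+", which makes it safe to review before submitting a real job.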


Using a khmer program on the biowulf cluster

Before running a khmer command on the biowulf cluster, you must set up the khmer environment by running the command "module load khmer". In particular, this means that:

  1. For an interactive node session, you must run "module load khmer" before attempting any khmer commands.
  2. For a single khmer batch job, you must include "module load khmer" in your qsub script before any line running a khmer job.
  3. For a swarm of khmer jobs, you must include "module load khmer" in your swarm command file before any line running a khmer job.
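
The swarm case can be sketched as a small script that writes one command line per input file. Because each line of a swarm command file runs as its own job, this sketch puts "module load khmer" at the start of every line. The sample file names, the output naming, and the normalize-by-median.py options (-k 20 -C 20) are placeholders chosen for illustration, not site recommendations.

```shell
#!/bin/bash
# Sketch: generate a swarm command file with one khmer job per input file.
# File names and khmer options below are placeholders, not site defaults.
swarm_file="khmer.swarm"
: > "$swarm_file"    # start with an empty command file

for reads in sample1.fq sample2.fq sample3.fq; do
    # Each line loads the module first, then runs one khmer command.
    echo "module load khmer; normalize-by-median.py -k 20 -C 20 -o ${reads%.fq}.keep.fq $reads" >> "$swarm_file"
done

# Show the generated command file for review before submitting it.
cat "$swarm_file"
```

The generated file can be inspected before it is passed to swarm; see the swarm documentation for the submission options.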

Running a single Khmer batch job on Biowulf

(See the section of the same name on the samtools application page.)

Running a swarm of Khmer jobs

(See the section of the same name on the samtools application page.)

For more information about running swarm, see swarm.html.

Running an interactive Khmer job on Biowulf

(See the section of the same name on the samtools application page.)


Documentation
  1. The khmer software for advanced biological sequencing data analysis
  2. Command-line usage for khmer tools
  3. Setting khmer memory usage
  4. Partitioning large data sets (50m+ reads)
  5. KnownIssues
  6. How to get help
    • For issues with HPC@NIH contact staff@hpc.nih.gov
    • For issues with the khmer application, see the khmer authors' tips in "How to get help"