High-Performance Computing at the NIH
KMC on Biowulf & Helix

Description

KMC is a program for creating and accessing databases of k-mer counts from FASTQ or FASTA files.

There may be multiple versions of KMC available. An easy way of selecting the version is to use modules. To see the modules available, type:

module avail KMC 

To select a module, use:

module load KMC/[version]

where [version] is the version of choice.

KMC is a multithreaded application. Make sure the number of threads (-t) is matched to the number of CPUs allocated to the job.

Environment variables set

Documentation

On Helix

KMC can be resource intensive. When running on Helix, please limit the number of threads to 2 (-t2) and avoid analyzing large input files. Note that kmc seems to start 2 more threads than requested, which should be taken into account when allocating CPUs for batch jobs (see below).

helix$ module load kmc
helix$ kmc
K-Mer Counter (KMC) ver. 2.2.0 (2015-04-15)
Usage:
 kmc [options] <input_file_name> <output_file_name> <working_directory>
 kmc [options] <@input_file_names> <output_file_name> <working_directory>
Parameters:
  input_file_name - single file in FASTQ format (gziped or not)
  @input_file_names - file name with list of input files in FASTQ format (gziped or not)
Options:
  -v - verbose mode (shows all parameter settings); default: false
  -k<len> - k-mer length (k from 10 to 256; default: 25)
  -m<size> - max amount of RAM in GB (from 1 to 1024); default: 12
  -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
  -p<par> - signature length (5, 6, 7, 8); default: 7
  -f<a/q/m> - input in FASTA format (-fa), FASTQ format (-fq) or mulit FASTA (-fm); default: FASTQ
  -q[value] - use Quake's compatible counting with [value] representing lowest quality (default: 33)
  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
  -cs<value> - maximal value of a counter (default: 255)
  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)
  -b - turn off transformation of k-mers into canonical form
  -r - turn on RAM-only mode 
  -n<value> - number of bins 
  -t<value> - total number of threads (default: no. of CPU cores)
  -sf<value> - number of FASTQ reading threads
  -sp<value> - number of splitting threads
  -sr<value> - number of sorter threads
  -so<value> - number of threads per single sorter
Example:
kmc -k27 -m24 NA19238.fastq NA.res \data\kmc_tmp_dir\
kmc -k27 -q -m24 @files.lst NA.res \data\kmc_tmp_dir\

helix$ kmc -t2 \
  /usr/local/apps/kmc/TEST_DATA/ENCFF001KPB.fastq.gz \
  ENCFF001KPB.kmer \
  $(mktemp -d /scratch/XXXXX)
1st stage: 19.0231s
2nd stage: 4.78721s
Total    : 23.8103s
Tmp size : 161MB

Stats:
   No. of k-mers below min. threshold :     66106376
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :     79254698
   No. of unique counted k-mers       :     13148322
   Total no. of k-mers                :    108850013
   Total no. of reads                 :      9157799
   Total no. of super-k-mers          :     18252841
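
The usage text above also lists a second invocation form, in which the first argument is @ followed by the name of a file that lists the input files. The following is a minimal sketch of building such a list. It assumes one file name per line (the help text does not spell out the list format), and make_kmc_list, /path/to/fastq and combined.kmer are hypothetical names used only for illustration. The kmc call itself is left as a comment.

```shell
# make_kmc_list is a hypothetical helper (not part of KMC): it writes the
# names of all *.fastq.gz files in a directory to a list file, one per line.
make_kmc_list() {
    local dir=$1 list=$2
    local f
    : > "$list"                       # start with an empty list file
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue       # skip the unexpanded pattern if nothing matches
        printf '%s\n' "$f" >> "$list"
    done
}

# Usage on Helix (assumed paths; keep -t2 as recommended above):
#   make_kmc_list /path/to/fastq files.lst
#   kmc -t2 @files.lst combined.kmer "$(mktemp -d /scratch/XXXXX)"
```

File names containing newlines would break the one-name-per-line assumption, so this sketch is only suitable for conventionally named files.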

Batch job on Biowulf

Create a batch script like the one below. Because kmc seems to start 2 more threads than requested, the script passes two fewer threads to -t than the number of allocated CPUs to avoid overloading the node.

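#!/bin/bash
#SBATCH --mem=16G
#SBATCH --gres=lscratch:10
# Abort on the first failing command
set -e

# Job-specific local scratch directory, used as the kmc working directory
tmp=/lscratch/${SLURM_JOB_ID}

module load kmc
# Two fewer threads than allocated CPUs; SLURM_CPUS_PER_TASK is only set
# when --cpus-per-task is given at submission. -m15 -sm caps memory
# strictly at 15 GB, below the 16 GB allocation.
kmc -t$(( SLURM_CPUS_PER_TASK - 2 )) -m15 -sm \
  /usr/local/apps/kmc/TEST_DATA/ENCFF001KPB.fastq.gz \
  ENCFF001KPB.kmer \
  "${tmp}"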

Submit the script to the queue. With 6 CPUs allocated, kmc is started with -t4:

biowulf$ sbatch --cpus-per-task=6 kmc_batch.sh
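
The thread and memory settings in the batch script follow a simple rule: two fewer threads than allocated CPUs, and a memory limit 1 GB below the allocation. The sketch below shows one way this arithmetic could be wrapped so that it does not produce a zero or negative value for small allocations. The function names kmc_threads and kmc_mem_gb are hypothetical, and the use of SLURM_MEM_PER_NODE (reported by Slurm in MB when --mem is requested) is an assumption about how the job was submitted.

```shell
# Hypothetical helpers (not part of KMC or Slurm) for deriving kmc options
# from the job allocation.

# kmc_threads CPUS: two fewer threads than allocated CPUs, but never below 1.
kmc_threads() {
    local cpus=${1:-2}
    local t=$(( cpus - 2 ))
    if (( t < 1 )); then t=1; fi
    echo "$t"
}

# kmc_mem_gb MEM_MB: value for -m in GB, leaving 1 GB of headroom below the
# allocation (as in the batch script: --mem=16G with -m15), never below 1.
kmc_mem_gb() {
    local mem_mb=$1
    local gb=$(( mem_mb / 1024 - 1 ))
    if (( gb < 1 )); then gb=1; fi
    echo "$gb"
}

# Inside a batch script (assumes --cpus-per-task and --mem were requested):
#   kmc -t"$(kmc_threads "$SLURM_CPUS_PER_TASK")" \
#       -m"$(kmc_mem_gb "$SLURM_MEM_PER_NODE")" -sm in.fastq.gz out.kmer "$tmp"
```

With the allocation used above (6 CPUs, 16 GB), these helpers yield the same -t4 and -m15 that the batch script uses.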

Swarm of jobs on Biowulf

Create a swarm file:

tmp=/lscratch/${SLURM_JOB_ID} \
  && kmc -t$(( SLURM_CPUS_PER_TASK - 2 )) -m15 -sm 1.fastq.gz 1.kmer ${tmp} 
tmp=/lscratch/${SLURM_JOB_ID} \
  && kmc -t$(( SLURM_CPUS_PER_TASK - 2 )) -m15 -sm 2.fastq.gz 2.kmer ${tmp}

and submit to the queue with

biowulf$ swarm -f kmc_swarm -t6 -g16 --gres=lscratch:10
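
For more than a handful of inputs, the swarm file can be generated rather than written by hand. The following is a minimal sketch that prints one command line per FASTQ file in the same form as the swarm file above. The function name make_kmc_swarm is hypothetical, and the sketch assumes input names end in .fastq.gz and contain no whitespace.

```shell
# make_kmc_swarm is a hypothetical helper (not part of swarm or KMC): it
# prints one swarm command line per FASTQ file given as an argument.
make_kmc_swarm() {
    local fq base
    for fq in "$@"; do
        # Output name: input file name without directory and .fastq.gz suffix
        base=$(basename "$fq" .fastq.gz)
        # Single quotes keep the Slurm variables unexpanded until the job runs.
        printf 'tmp=/lscratch/${SLURM_JOB_ID} && kmc -t$(( SLURM_CPUS_PER_TASK - 2 )) -m15 -sm %s %s.kmer ${tmp}\n' "$fq" "$base"
    done
}

# Usage:
#   make_kmc_swarm *.fastq.gz > kmc_swarm
#   swarm -f kmc_swarm -t6 -g16 --gres=lscratch:10
```

Each generated command is written on a single line, which is equivalent to the backslash-continued form shown above.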

Interactive job on Biowulf

Allocate an interactive session as usual, then load the module and run kmc on the compute node:

biowulf$ sinteractive -c6 --mem=16G
salloc.exe: Granted job allocation 1199941
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0314 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn0314$ module load kmc
cn0314$ kmc ...
cn0314$ exit
salloc.exe: Relinquishing job allocation 1199941
salloc.exe: Job allocation 1199941 has been revoked.
biowulf$