High-Performance Computing at the NIH
kneaddata on Biowulf & Helix


From the KneadData user manual:

KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. In these experiments, samples are typically taken from a host in hopes of learning something about the microbial community on the host. However, sequencing data from such experiments will often contain a high ratio of host to bacterial reads. This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources. Additionally, KneadData can be used for other filtering tasks. For example, if one is trying to clean data derived from a human sequencing experiment, KneadData can be used to separate the human and the non-human reads.

There may be multiple versions of kneaddata available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail kneaddata 

To select a module use

module load kneaddata/[version]

where [version] is the version of choice.

kneaddata is a multithreaded, multiprocess application. Make sure the number of CPUs requested matches the total number of threads used (processes × threads per process).
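As a sketch of how the options combine (the numbers here are illustrative, matching the interactive example below): kneaddata's -p sets the number of processes and -t the number of threads, so their product should not exceed the CPUs allocated from Slurm.

```shell
# Illustrative only: pick -p (processes) and -t (threads) so that
# p * t equals the CPUs allocated with --cpus-per-task.
CPUS=8   # e.g. sinteractive --cpus-per-task=8
P=2      # kneaddata -p
T=4      # kneaddata -t
echo $(( P * T ))   # total threads in flight; should equal $CPUS
```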

Environment variables set

KNEADDATA_DB -- location of the reference databases (used with --reference-db in the examples below)

All dependencies are loaded automatically by the kneaddata module.



Interactive job on Biowulf

Allocate an interactive session with sinteractive and use it as shown below. In this case we will use test data that is an artificial mixture of 1M human exome reads and 1M environmental metagenomic reads. The 50% human reads are treated as artificial contamination and removed:

biowulf$ sinteractive --mem=12g --cpus-per-task=8 --gres=lscratch:10
salloc.exe: Pending job allocation 33247354
salloc.exe: job 33247354 queued and waiting for resources
salloc.exe: job 33247354 has been allocated resources
salloc.exe: Granted job allocation 33247354
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn2692 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

node$ module load kneaddata
[+] Loading java 1.8.0_11 ...
[+] Loading samtools 1.3.1 ...
[+] Loading kneaddata 0.5.4
node$ cd /lscratch/${SLURM_JOB_ID}
node$ # copy test data
node$ # the test data is 50% human, 50% environmental metagenomic data
node$ # and the read names are labelled accordingly
node$ zcat test_R1.fastq.gz \
             | awk '/@meta/ {m++} /@human/ {h++} END {printf("human: %i\nmeta:  %i\n", h, m)}'
human: 1000000
meta:  1000000
node$ # run kneaddata on the paired end test data
node$ kneaddata -i test_R1.fastq.gz -i test_R2.fastq.gz \
  --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
  --output-prefix test --output test_out \
  -p 2 -t 4 --run-fastqc-end 
Final output files created: 

node$ # check the composition of human/metagenome reads in the cleaned data
node$ cat /lscratch/${SLURM_JOB_ID}/test_out/test_paired_1.fastq \
             | awk '/@meta/ {m++} /@human/ {h++} END {printf("human: %i\nmeta:  %i\n", h, m)}'
human: 67619
meta:  933563
node$ exit

So the 50% artificial contamination with human reads was reduced to 7%.
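The 7% figure can be checked from the counts above (67619 human reads out of 67619 + 933563 reads kept in the cleaned paired output):

```shell
# fraction of human reads remaining in the cleaned paired output
awk 'BEGIN { printf("%.1f%%\n", 100 * 67619 / (67619 + 933563)) }'
# prints 6.8%
```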

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is kneaddata.batch

module load kneaddata || exit 1
cd /lscratch/${SLURM_JOB_ID} || exit 1

# fail early if the test data is not present in lscratch
if [[ ! -e test_R1.fastq.gz ]]; then
    echo "test_R1.fastq.gz not found" >&2
    exit 1
fi
rm -rf test_out

kneaddata -i test_R1.fastq.gz -i test_R2.fastq.gz \
  --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
  --output-prefix test --output test_out \
  -p 2 -t 4 --run-fastqc-end

# copy results from lscratch back to permanent storage
cp -r test_out /data/$USER/important_project

Submit to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:10 kneaddata.batch
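The same pattern extends to one job per sample. A minimal sketch, assuming a hypothetical layout where each sample has a pair sample_R1.fastq.gz/sample_R2.fastq.gz under /data/$USER/project and the batch script takes the R1 file as its first argument (neither assumption comes from the examples above):

```shell
# submit one kneaddata job per paired-end sample (hypothetical layout);
# the echo makes this a dry run -- remove it to actually submit
for r1 in /data/$USER/project/*_R1.fastq.gz; do
    echo sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:10 \
        kneaddata.batch "$r1"
done
```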