Biowulf High Performance Computing at the NIH
kaiju on Biowulf

Kaiju classifies sequencing reads taxonomically by matching them against a protein database indexed with the Burrows-Wheeler transform (BWT). It can be run in two modes. MEM (maximum exact match), the default mode, allows only exact matches of the read against the database. Greedy mode allows mismatches, which increases the sensitivity of the classification at the cost of speed.

Kaiju reads the entire database into memory, and the database can be rather large (~50GB in the case of the 'nr + euk' database). It may therefore be advantageous to copy the database to lscratch once and use that copy to classify several samples as part of a single job.
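Because the lscratch allocation has to hold the database copy plus any output, it can help to check the database size before requesting resources. The following is a minimal sketch, not part of kaiju or the Biowulf tooling: `lscratch_gb_for` is a hypothetical helper name, and the default headroom of 5 GiB for output files is an assumption that should be adjusted to the data.

```shell
# Hypothetical helper (not part of kaiju): suggest an lscratch size in GiB
# for copying a database directory to node-local scratch.
#   $1 = database directory
#   $2 = extra headroom in GiB for output files (assumed default: 5)
lscratch_gb_for() {
    local dir="$1"
    local headroom_gb="${2:-5}"
    local kb
    # du -sk reports the total size of the directory in KiB
    kb=$(du -sk "$dir" | cut -f1)
    # round up from KiB to whole GiB, then add the headroom
    echo $(( (kb + 1048575) / 1048576 + headroom_gb ))
}
```

For example, `lscratch_gb_for /fdb/kaiju/2017-05-16/pro_genomes` prints a number that could be used as the value for `--gres=lscratch:` when allocating the job.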



Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=20g --cpus-per-task=6 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load kaiju
[user@cn3144 ~]$ db=/fdb/kaiju/2017-05-16/pro_genomes
[user@cn3144 ~]$ cd /lscratch/${SLURM_JOBID}
[user@cn3144 ~]$ cp ${KAIJU_TEST_DATA}/*.fastq.gz .
[user@cn3144 ~]$ # kaiju in default MEM mode with paired end data
[user@cn3144 ~]$ kaiju -t ${db}/nodes.dmp -f ${db}/kaiju_db.fmi \
                      -z ${SLURM_CPUS_PER_TASK} \
                      -i <(zcat SRR579274_1.fastq.gz) \
                      -j <(zcat SRR579274_2.fastq.gz) \
                      -o SRR579274.kaiju_out
[user@cn3144 ~]$ # output: one line per pair with three columns: (1) Unclassified/Classified, (2) read, (3) taxid
[user@cn3144 ~]$ head SRR579274.kaiju_out
C	SRR579274.1	186802
C	SRR579274.2	186802
U	SRR579274.3	0
C	SRR579274.4	657323
C	SRR579274.5	411463
C	SRR579274.6	853
C	SRR579274.7	717962
C	SRR579274.8	186802
C	SRR579274.9	742738
C	SRR579274.10	411469
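The three-column output shown above can be summarized directly, before running any of the report tools. This is a minimal sketch assuming tab-separated columns as in the listing; `kaiju_class_counts` is a hypothetical helper name and is not part of kaiju.

```shell
# Hypothetical helper (not part of kaiju): count classified (C) and
# unclassified (U) lines in a kaiju output file.
#   $1 = kaiju output file with tab-separated columns: status, read, taxid
kaiju_class_counts() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified\t%d\nunclassified\t%d\n", c, u }
    ' "$1"
}
```

Applied to only the ten lines shown above, `kaiju_class_counts` would report 9 classified and 1 unclassified.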

[user@cn3144 ~]$ ### for kaiju < 1.7.0
[user@cn3144 ~]$ kaijuReport -t ${db}/nodes.dmp -n ${db}/names.dmp \
                       -i SRR579274.kaiju_out -r genus -m 1
[user@cn3144 ~]$ ### for kaiju >= 1.7.0
[user@cn3144 ~]$ kaiju2table -t ${db}/nodes.dmp -n ${db}/names.dmp \
                       -r genus -m 1 SRR579274.kaiju_out

        %	    reads	genus
 7.425068	    18000	Faecalibacterium
 5.914892	    14339	Bacteroides
 5.287474	    12818	Blautia
 4.814744	    11672	Ruminococcus
 3.104504	     7526	Eubacterium
 1.973418	     4784	Dorea
 1.662803	     4031	Anaerostipes
 1.348063	     3268	Roseburia
 1.079935	     2618	Subdoligranulum
 0.014850	       36	Viruses
29.897451	    72478	cannot be assigned to a genus 
11.469256	    27804	belong to a genus with less than 1% of all reads
26.007542	    63048	unclassified

[user@cn3144 ~]$ # kaiju in Greedy mode allowing 2 mismatches with an E-value cutoff of 0.05
[user@cn3144 ~]$ kaiju -t ${db}/nodes.dmp -f ${db}/kaiju_db.fmi \
                      -z ${SLURM_CPUS_PER_TASK} \
                      -a greedy -e 2 -E 0.05 \
                      -i <(zcat SRR579274_1.fastq.gz) \
                      -j <(zcat SRR579274_2.fastq.gz) \
                      -o SRR579274.kaiju_out

The results can also be visualized with Krona:

[user@cn3144 ~]$ module load kronatools
[user@cn3144 ~]$ kaiju2krona -t ${db}/nodes.dmp -n ${db}/names.dmp \
                       -i SRR579274.kaiju_out -o SRR579274.kaiju_krona
[user@cn3144 ~]$ ktImportText -o SRR579274.kaiju.html SRR579274.kaiju_krona
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

You can view the resulting chart here.

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. kaiju.sh) similar to this example:

#! /bin/bash

module load kaiju || exit 1
module load kronatools || exit 1

# database location (same path as in the interactive example)
db=/fdb/kaiju/2017-05-16/pro_genomes

# copy the database to node-local scratch once and reuse it for all samples
cd /lscratch/${SLURM_JOBID} || exit 1
cp ${db}/* .

for sample in sample1 sample2 sample3 sample4; do
    kaiju -t nodes.dmp -f kaiju_db.fmi \
        -z ${SLURM_CPUS_PER_TASK} \
        -i <(zcat /data/$USER/fastq/${sample}_R1.fastq.gz) \
        -j <(zcat /data/$USER/fastq/${sample}_R2.fastq.gz) \
        -o ${sample}.out
    # genus level report filtering genera with less than 1% abundance;
    # keep only the command that matches the loaded kaiju version
    #### for kaiju < 1.7.0
    kaijuReport -t nodes.dmp -n names.dmp \
        -i ${sample}.out -r genus -m 1 -o ${sample}.kaiju_summary
    #### for kaiju >= 1.7.0
    kaiju2table -t nodes.dmp -n names.dmp \
        -r genus -m 1 -o ${sample}.kaiju_summary ${sample}.out
    mv ${sample}.out ${sample}.kaiju_summary /data/$USER/kaiju
done

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:50 kaiju.sh

Note that in this case we copied the database to the node and processed several input files against that same database.

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. kaiju.swarm).

For example:

d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample1_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample1_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample1.out
d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample2_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample2_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample2.out
d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample3_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample3_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample3.out
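Since the swarmfile lines above differ only in the sample name, the file can also be generated with a loop. This is a minimal sketch under the same path assumptions as the example above; `make_kaiju_swarm` is a hypothetical helper name, and it writes each command on a single line rather than using backslash continuations.

```shell
# Hypothetical helper: print one kaiju command line per sample name.
# Variables such as $USER and ${SLURM_CPUS_PER_TASK} are escaped so that
# they are written literally and only expanded when the swarm subjob runs.
make_kaiju_swarm() {
    local d=/fdb/kaiju/2017-05-16/pro_genomes
    local sample
    for sample in "$@"; do
        printf '%s\n' "kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z \${SLURM_CPUS_PER_TASK} -i <(zcat /data/\$USER/fastq/${sample}_R1.fastq.gz) -j <(zcat /data/\$USER/fastq/${sample}_R2.fastq.gz) -o /data/\$USER/kaiju/${sample}.out"
    done
}

# Example: make_kaiju_swarm sample1 sample2 sample3 > kaiju.swarm
```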

Submit this job using the swarm command.

swarm -f kaiju.swarm -g 20 -t 6 --module kaiju

Note that each subjob in the swarm loads the database into memory on its own, so this approach may work best with smaller (especially custom) databases.