Kaiju performs taxonomic classification of sequencing reads using a protein database indexed with the Burrows-Wheeler transform (BWT). It can be run in two modes. MEM (maximum exact match), the default mode, allows only exact matches of the read against the database. Greedy mode allows mismatches, which increases the sensitivity of the classification at the cost of decreased speed.
Kaiju reads the entire database into memory, and the database can be rather large (~50GB in the case of the 'nr + euk' database). When classifying several samples, it may therefore be advantageous to copy the database to lscratch once and run all samples against that copy as part of a single job.
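Because the index is loaded whole, its size on disk gives a rough lower bound for the memory request. The following is a minimal sketch, not a sizing rule: `fmi_size_gib` is a hypothetical helper name, and how much headroom to add on top of the reported size is left to the user.

```shell
# Print the size of a Kaiju index file in whole GiB, rounded up.
# fmi_size_gib is a hypothetical helper, not part of Kaiju.
fmi_size_gib() {
    local bytes
    # wc -c reports the file size in bytes; fail if the file is unreadable
    bytes=$(wc -c < "$1") || return 1
    # integer ceiling division by 1 GiB (1073741824 bytes)
    echo $(( (bytes + 1073741823) / 1073741824 ))
}
```

For example, `fmi_size_gib /fdb/kaiju/2017-05-16/pro_genomes/kaiju_db.fmi` prints a number that can serve as a starting point when choosing `--mem`.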
References:
- P. Menzel, K. L. Ng, and A. Krogh. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 2016, 7:11257.
- Kaiju on GitHub: https://github.com/bioinformatics-centre/kaiju
- Kaiju web server: http://kaiju.binf.ku.dk/server
- Module Name: kaiju (see the modules page for more information)
- Kaiju is a multithreaded application
- Example files in $KAIJU_TEST_DATA
- Reference data in
/fdb/kaiju/
/db
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=20g --cpus-per-task=6 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load kaiju
[user@cn3144 ~]$ db=/fdb/kaiju/2017-05-16/pro_genomes
[user@cn3144 ~]$ cd /lscratch/${SLURM_JOBID}
[user@cn3144 ~]$ cp ${KAIJU_TEST_DATA}/*.fastq.gz .

[user@cn3144 ~]$ # kaiju in default MEM mode with paired end data
[user@cn3144 ~]$ kaiju -t ${db}/nodes.dmp -f ${db}/kaiju_db.fmi \
    -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat SRR579274_1.fastq.gz) \
    -j <(zcat SRR579274_2.fastq.gz) \
    -o SRR579274.kaiju_out

[user@cn3144 ~]$ # output: one line per pair with three columns: (1) Unclassified/Classified, (2) read, (3) taxid
[user@cn3144 ~]$ head SRR579274.kaiju_out
C       SRR579274.1     186802
C       SRR579274.2     186802
U       SRR579274.3     0
C       SRR579274.4     657323
C       SRR579274.5     411463
C       SRR579274.6     853
C       SRR579274.7     717962
C       SRR579274.8     186802
C       SRR579274.9     742738
C       SRR579274.10    411469

[user@cn3144 ~]$ ### for kaiju < 1.7.0
[user@cn3144 ~]$ kaijuReport -t ${db}/nodes.dmp -n ${db}/names.dmp \
    -i SRR579274.kaiju_out -r genus -m 1
[user@cn3144 ~]$ ### for kaiju >= 1.7.0
[user@cn3144 ~]$ kaiju2table -t ${db}/nodes.dmp -n ${db}/names.dmp \
    -r genus -m 1 SRR579274.kaiju_out
        %    reads  genus
-------------------------------------------
 7.425068    18000  Faecalibacterium
 5.914892    14339  Bacteroides
 5.287474    12818  Blautia
 4.814744    11672  Ruminococcus
 3.104504     7526  Eubacterium
 1.973418     4784  Dorea
 1.662803     4031  Anaerostipes
 1.348063     3268  Roseburia
 1.079935     2618  Subdoligranulum
-------------------------------------------
 0.014850       36  Viruses
29.897451    72478  cannot be assigned to a genus
11.469256    27804  belong to a genus with less than 1% of all reads
-------------------------------------------
26.007542    63048  unclassified
[user@cn3144 ~]$ # kaiju in Greedy mode allowing 2 mismatches with an E value cutoff of 0.05
[user@cn3144 ~]$ kaiju -t ${db}/nodes.dmp -f ${db}/kaiju_db.fmi \
    -z ${SLURM_CPUS_PER_TASK} \
    -a greedy -e 2 -E 0.05 \
    -i <(zcat SRR579274_1.fastq.gz) \
    -j <(zcat SRR579274_2.fastq.gz) \
    -o SRR579274.kaiju_out
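To compare runs, for instance MEM against Greedy mode, it can help to count how many reads were classified. The sketch below assumes the tab-separated three-column output described in the session (status, read, taxid); `kaiju_class_summary` is a hypothetical helper name, not a Kaiju tool.

```shell
# Count classified (C) and unclassified (U) reads in a Kaiju output file.
# kaiju_class_summary is a hypothetical helper, not part of Kaiju.
kaiju_class_summary() {
    awk -F'\t' '
        $1 == "C" { c++ }    # first column flags classified reads
        $1 == "U" { u++ }    # ... and unclassified reads
        END {
            total = c + u
            pct = (total > 0) ? 100 * c / total : 0
            printf "classified\t%d\n", c
            printf "unclassified\t%d\n", u
            printf "percent_classified\t%.2f\n", pct
        }' "$1"
}
```

It would be called as `kaiju_class_summary SRR579274.kaiju_out`. Since both example commands write to the same output file, the summary has to be taken before the second run overwrites the first.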
The results can also be visualized with Krona:
[user@cn3144 ~]$ module load kronatools
[user@cn3144 ~]$ kaiju2krona -t ${db}/nodes.dmp -n ${db}/names.dmp \
    -i SRR579274.kaiju_out -o SRR579274.kaiju_krona
[user@cn3144 ~]$ ktImportText -o SRR579274.kaiju.html SRR579274.kaiju_krona
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
The resulting chart is written to SRR579274.kaiju.html and can be opened in a web browser.
Create a batch input file (e.g. kaiju.sh) similar to this example:
#! /bin/bash

module load kaiju || exit 1
module load kronatools || exit 1
db=/fdb/kaiju/2017-05-16/pro_genomes

# copy the database to node-local scratch once and reuse it for all samples
cd /lscratch/${SLURM_JOBID} || exit 1
cp ${db}/* .

for sample in sample1 sample2 sample3 sample4; do
    kaiju -t nodes.dmp -f kaiju_db.fmi \
        -z ${SLURM_CPUS_PER_TASK} \
        -i <(zcat /data/$USER/fastq/${sample}_R1.fastq.gz) \
        -j <(zcat /data/$USER/fastq/${sample}_R2.fastq.gz) \
        -o ${sample}.out
    # genus level report filtering genera with less than 1% abundance
    # use only the command that matches the loaded kaiju version
    #### for kaiju < 1.7.0
    # kaijuReport -t nodes.dmp -n names.dmp \
    #     -i ${sample}.out -r genus -m 1 -o ${sample}.kaiju_summary
    #### for kaiju >= 1.7.0
    kaiju2table -t nodes.dmp -n names.dmp \
        -r genus -m 1 -o ${sample}.kaiju_summary ${sample}.out
    mv ${sample}.out ${sample}.kaiju_summary /data/$USER/kaiju
done
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:50 kaiju.sh
Note that in this case we copied the database to the node and processed several input files against that same database.
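Copying the whole database directory is the simplest option. Where scratch space is tight, staging only the three files the commands on this page refer to (nodes.dmp, names.dmp and kaiju_db.fmi) may be enough. The following is a sketch under that assumption; `stage_kaiju_db` is a hypothetical helper name, and a database directory with a different layout would need a different file list.

```shell
# Copy only the files used by the examples on this page to a scratch directory.
# stage_kaiju_db is a hypothetical helper; the file list is an assumption
# based on the commands shown above.
stage_kaiju_db() {
    local src=$1 dest=$2 f
    # verify everything is present before copying anything
    for f in nodes.dmp names.dmp kaiju_db.fmi; do
        [ -f "$src/$f" ] || { echo "missing $src/$f" >&2; return 1; }
    done
    mkdir -p "$dest" || return 1
    for f in nodes.dmp names.dmp kaiju_db.fmi; do
        cp "$src/$f" "$dest/" || return 1
    done
}
```

In a batch script this could replace the `cp` line, e.g. `stage_kaiju_db /fdb/kaiju/2017-05-16/pro_genomes /lscratch/${SLURM_JOBID} || exit 1`.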
Create a swarmfile (e.g. kaiju.swarm).
For example:
d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample1_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample1_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample1.out
d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample2_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample2_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample2.out
d=/fdb/kaiju/2017-05-16/pro_genomes; kaiju -t ${d}/nodes.dmp -f ${d}/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} \
    -i <(zcat /data/$USER/fastq/sample3_R1.fastq.gz) -j <(zcat /data/$USER/fastq/sample3_R2.fastq.gz) \
    -o /data/$USER/kaiju/sample3.out
Submit this job using the swarm command.
swarm -f kaiju.swarm -g 20 -t 6 --module kaiju
Note that this approach may work best with smaller (especially custom) databases, since every subjob loads its own copy of the database into memory.
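For more than a handful of samples, the swarmfile can be generated rather than written by hand. The sketch below writes one single-line kaiju command per sample; `make_kaiju_swarm` is a hypothetical helper name, and the fastq and output locations simply mirror the placeholder paths used in the swarmfile example.

```shell
# Print one kaiju command per sample, suitable for redirecting to a swarmfile.
# make_kaiju_swarm is a hypothetical helper; paths mirror the example above.
make_kaiju_swarm() {
    local db=$1 sample
    shift
    for sample in "$@"; do
        # single quotes keep $USER and ${SLURM_CPUS_PER_TASK} literal,
        # so they are expanded inside each subjob, not when the file is written
        printf 'kaiju -t %s/nodes.dmp -f %s/kaiju_db.fmi -z ${SLURM_CPUS_PER_TASK} -i <(zcat /data/$USER/fastq/%s_R1.fastq.gz) -j <(zcat /data/$USER/fastq/%s_R2.fastq.gz) -o /data/$USER/kaiju/%s.out\n' \
            "$db" "$db" "$sample" "$sample" "$sample"
    done
}
```

A call such as `make_kaiju_swarm /fdb/kaiju/2017-05-16/pro_genomes sample1 sample2 sample3 > kaiju.swarm` produces a file that can be submitted with the swarm command shown above.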