Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. It uses exact matches of kmers against a reference database.
Do not run Kraken with the database in the default /fdb
location or a shared /data
directory. It must be run with the database in /lscratch
or /dev/shm
.
$KRAKEN_TEST_DATA
/fdb/kraken/
which is
also the value of $KRAKEN_DB_PATH
. The standard database
contains genomes of archaea, bacteria, and viruses from NCBI./fdb/kraken
end in _kraken2
./lscratch
. Shrunken databases with
subsets of kmers are provided to accomodate running with smaller memory requirements.
The current standard Kraken 2 database is about 60GB in size./fdb/kraken
since this can cause some severe file system strain.
Database should either be copied to memory or lscratch.Allocate an interactive session and run the program. In the following sample session we will use kraken to build a custom database and run a classification with the custom data base. Shown below is the same example for kraken and kraken2. kraken2 differs from kraken 1 in several important ways.
[user@biowulf]$ sinteractive --mem=24g --cpus-per-task=4 --gres=lscratch:100 salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load kraken/1.1 ### ### build a custom kraken database with some 400+ genomes ### [user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID [user@cn3144 ~]$ mkdir fa [user@cn3144 ~]$ cd fa [user@cn3144 ~]$ wget --quiet --input-file $KRAKEN_TEST_DATA/genomes_for_custom_db.urls [user@cn3144 ~]$ gunzip *.gz [user@cn3144 ~]$ ls -lh | head -5 total 1.3G -rw-r--r-- 1 user group 3.9M May 22 2016 GCA_000006745.1_ASM674v1_genomic.fna -rw-r--r-- 1 user group 2.1M May 26 2016 GCA_000006885.1_ASM688v1_genomic.fna -rw-r--r-- 1 user group 2.0M May 27 2016 GCA_000007045.1_ASM704v1_genomic.fna -rw-r--r-- 1 user group 5.3M Aug 4 2014 GCA_000007825.1_ASM782v1_genomic.fna [user@cn3144 ~]$ cd .. [user@cn3144 ~]$ kraken-build --download-taxonomy --db custom_db [user@cn3144 ~]$ for f in fa/*.fna; do kraken-build --add-to-library $f --db custom_db done Added "fa/GCA_000006745.1_ASM674v1_genomic.fna" to library (custom_db) Added "fa/GCA_000006885.1_ASM688v1_genomic.fna" to library (custom_db) [...snip...] Added "fa/GCA_900129335.1_PP_HGAG_QV30_2SC_T4_genomic.fna" to library (custom_db) Added "fa/GCA_900185995.1_BK1071_genomic.fna" to library (custom_db) [user@cn3144 ~]$ kraken-build --build --threads $SLURM_CPUS_PER_TASK --db custom_db Kraken build set to minimize disk writes. Creating k-mer set (step 1 of 6)... Found jellyfish v1.1.11 Hash size not specified, using '253231360' K-mer set created. [6m11.633s] Skipping step 2, no database reduction requested. Sorting k-mer set (step 3 of 6)... K-mer set sorted. [2m20.480s] Skipping step 4, GI number to seqID map now obsolete. Creating seqID to taxID map (step 5 of 6)... 721 sequences mapped to taxa. [3m48.166s] Setting LCAs in database (step 6 of 6)... Finished processing 721 sequences Database LCAs set. [19m6.065s] Database construction complete. [Total: 31m26.381s] [user@cn3144 ~]$ kraken-build --clean --db custom_db [user@cn3144 ~]$ du -sh custom_db 10G custom_db/ [user@cn3144 ~]$ tree custom_db custom_db/ |-- [user 14K] accmap_file.tmp |-- [user 8.0G] database.idx |-- [user 1.7G] database.kdb `-- [user 4.0K] taxonomy |-- [user 142M] names.dmp `-- [user 109M] nodes.dmp [user@cn3144 ~]$ cp -r custom_db /data/$USER/kraken_databases ### ### Classification ### [user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy . [user@cn3144 ~]$ cd accuracy [user@cn3144 ~]$ ls -lh total 7.5M -rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README -rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa -rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa -rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa -rw-rw-r-- 1 user group 396 Aug 14 09:55 taxonomyDocSum.xslt
Now we will use the in memory temp file system /dev/shm
to store the database for classification. Note that unlike lscratch
,
/dev/shm
is not automatically cleaned at exit and care has to
be taken to clean up explicitly at the end of a job.
[user@cn3144 ~]$ cp -r custom_db /dev/shm [user@cn3144 ~]$ kraken --db /dev/shm/custom_db --fasta-input \ --output HiSeq.kraken HiSeq_accuracy.fa 10000 sequences (0.92 Mbp) processed in 0.422s (1421.8 Kseq/m, 131.15 Mbp/m). 8140 sequences classified (81.40%) 1860 sequences unclassified (18.60%) [user@cn3144 ~]$ head -5 HiSeq.kraken C A_hydrophila_HiSeq.922 380703 101 644:22 380703:1 644:30 0:18 C A_hydrophila_HiSeq.1263 644 101 1236:3 644:1 2:1 644:8 1236:8 644:50 C A_hydrophila_HiSeq.2905 1448139 101 644:8 1448139:51 0:1 644:11 C A_hydrophila_HiSeq.4866 1448139 101 644:61 1448139:10 C A_hydrophila_HiSeq.7009 644 101 0:13 644:14 0:31 644:13 ### CLEANUP [user@cn3144 ~]$ rm -rf /dev/shm/custom_db
The same approach of copying to /dev/shm
can be taken with the databases
from /fdb/kraken
as long as they fit. Note that /dev/shm
as
a whole is limited to 50% of memory and per job by the memory allocation. For example,
using a reduced size 'minikraken' database obtained from the kraken maintainers:
[user@cn3144 ~]$ cp -r /fdb/kraken/20180220_standard_8GB /dev/shm [user@cn3144 ~]$ kraken --db /dev/shm/20180220_standard_8GB --fasta-input \ --output HiSeq.kraken.2 HiSeq_accuracy.fa 10000 sequences (0.92 Mbp) processed in 0.602s (997.0 Kseq/m, 91.96 Mbp/m). 7523 sequences classified (75.23%) 2477 sequences unclassified (24.77%)
Column three is the predicted taxid. Let's get the name and rank of the unique taxids from NCBI eutils and compare them to the predicted taxa:
[user@cn3144 ~]$ cut -f3 HiSeq.kraken.2 | sort -u | grep -wv 0 > taxid.uniq [user@cn3144 ~]$ esum="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" [user@cn3144 ~]$ for taxid in $(cat taxid.uniq); do curl -s -G "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt - done 2> /dev/null > taxa.uniq [user@cn3144 ~]$ join -1 3 -2 1 -t '|' \ <(sort -k3,3 HiSeq.kraken.2 | tr '\t' '|') \ <(paste -d '|' taxid.uniq taxa.uniq | sort -t'|' -k1,1) \ | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \ | sort > HiSeq.kraken.pred [user@cn3144 ~]$ tail -n -100 HiSeq.kraken.pred | head -n 5 X_axonopodis_HiSeq.875312 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,() X_axonopodis_HiSeq.87911 prediction: Xanthomonas citri group,species group() X_axonopodis_HiSeq.881987 prediction: Xanthomonas,genus() X_axonopodis_HiSeq.882508 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,() X_axonopodis_HiSeq.882724 prediction: Xanthomonas,genus() ### CLEANUP [user@cn3144 ~]$ rm -rf /dev/shm/20180220_standard_8GB [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
[user@biowulf]$ sinteractive --mem=36g --cpus-per-task=4 --gres=lscratch:100 salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load kraken/2.1.2 ### ### build a custom kraken database with some 400+ genomes ### [user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID [user@cn3144 ~]$ mkdir fa [user@cn3144 ~]$ cd fa [user@cn3144 ~]$ wget --quiet --input-file $KRAKEN_TEST_DATA/genomes_for_custom_db.urls [user@cn3144 ~]$ gunzip *.gz [user@cn3144 ~]$ ls -lh | head -5 total 1.3G -rw-r--r-- 1 user group 3.9M May 22 2016 GCA_000006745.1_ASM674v1_genomic.fna -rw-r--r-- 1 user group 2.1M May 26 2016 GCA_000006885.1_ASM688v1_genomic.fna -rw-r--r-- 1 user group 2.0M May 27 2016 GCA_000007045.1_ASM704v1_genomic.fna -rw-r--r-- 1 user group 5.3M Aug 4 2014 GCA_000007825.1_ASM782v1_genomic.fna [user@cn3144 ~]$ cd .. [user@cn3144 ~]$ kraken2-build --download-taxonomy --db custom_db [user@cn3144 ~]$ for f in fa/*.fna; do kraken2-build --add-to-library $f --db custom_db done Added "fa/GCA_000006745.1_ASM674v1_genomic.fna" to library (custom_db) Added "fa/GCA_000006885.1_ASM688v1_genomic.fna" to library (custom_db) [...snip...] Added "fa/GCA_900129335.1_PP_HGAG_QV30_2SC_T4_genomic.fna" to library (custom_db) Added "fa/GCA_900185995.1_BK1071_genomic.fna" to library (custom_db) # larger databases would need more threads to build [user@cn3144 ~]$ kraken2-build --build --threads 2 --db custom_db Creating sequence ID to taxonomy ID map (step 1)... Found 721/721 targets, searched through 203908352 accession IDs, search complete. Sequence ID to taxonomy ID map complete. [18.874s] Estimating required capacity (step 2)... Estimated hash table requirement: 258445896 bytes Capacity estimation complete. [22.086s] Building database files (step 3)... Taxonomy parsed and converted. CHT created with 8 bits reserved for taxid. Completed processing of 721 sequences, 1343692302 bp Writing data to disk... complete. Database files completed. [1m7.904s] Database construction complete. [Total: 1m48.906s] [user@cn3144 ~]$ kraken2-build --clean --db custom_db Database disk usage: 34G After cleaning, database uses 247M [user@cn3144 ~]$ tree custom_db custom_db/ |-- [user 246M] hash.k2d |-- [user 48] opts.k2d |-- [user 16K] taxo.k2d `-- [user 16K] unmapped.txt [user@cn3144 ~]$ cp -r custom_db /data/$USER/kraken_databases ### ### Classification ### [user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy . [user@cn3144 ~]$ cd accuracy [user@cn3144 ~]$ ls -lh total 7.5M -rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README -rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa -rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa -rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa -rw-rw-r-- 1 user group 396 Aug 14 09:55 taxonomyDocSum.xslt [user@cn3144 ~]$ kraken2 --db ../custom_db --output HiSeq.kraken HiSeq_accuracy.fa Loading database information... done. 10000 sequences (0.92 Mbp) processed in 0.100s (5993.0 Kseq/m, 552.81 Mbp/m). 8381 sequences classified (83.81%) 1619 sequences unclassified (16.19%) [user@cn3144 ~]$ head -5 HiSeq.kraken C A_hydrophila_HiSeq.922 644 101 644:56 0:3 644:4 0:4 C A_hydrophila_HiSeq.1263 644 101 2:2 644:5 1236:3 2:4 1236:4 644:24 1236:5 644:1 1236:5 644:14 C A_hydrophila_HiSeq.2905 1448139 101 644:7 1448139:28 644:5 1448139:5 644:4 1448139:10 644:8 C A_hydrophila_HiSeq.4866 1448139 101 644:60 9 C A_hydrophila_HiSeq.7009 644 101 0:5 644:19 0:3 644:9 0:18 644:13 [user@cn3144 ~]$ cp -r /fdb/kraken/20210223_standard_16GB_kraken2 .. [user@cn3144 ~]$ kraken2 --db ../20210223_standard_16GB_kraken2 \ --output HiSeq.kraken.2 HiSeq_accuracy.fa Loading database information... done. 10000 sequences (0.92 Mbp) processed in 0.148s (4062.8 Kseq/m, 374.76 Mbp/m). 9033 sequences classified (90.33%) 967 sequences unclassified (9.67%)
Column three is the predicted taxid. Let's get the name and rank of the unique taxids from NCBI eutils and compare them to the predicted taxa:
[user@cn3144 ~]$ cut -f3 HiSeq.kraken.2 | sort -u | grep -wv 0 > taxid.uniq [user@cn3144 ~]$ esum="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" [user@cn3144 ~]$ for taxid in $(cat taxid.uniq); do curl -s -G "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt - done 2> /dev/null > taxa.uniq [user@cn3144 ~]$ join -1 3 -2 1 -t '|' \ <(sort -k3,3 HiSeq.kraken.2 | tr '\t' '|') \ <(paste -d '|' taxid.uniq taxa.uniq | sort -t'|' -k1,1) \ | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \ | sort > HiSeq.kraken.pred [user@cn3144 ~]$ tail -n -100 HiSeq.kraken.pred | head -n 5 X_axonopodis_HiSeq.875312 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,() X_axonopodis_HiSeq.87911 prediction: Xanthomonas citri group,species group() X_axonopodis_HiSeq.881987 prediction: Xanthomonas,genus() X_axonopodis_HiSeq.882508 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,() X_axonopodis_HiSeq.882724 prediction: Xanthomonas,genus() [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Create a batch input file (e.g. kraken.sh) which copies a reduced size standard database
to /lscratch
(or /dev/shm
for version 1) and runs several fastq samples against a local copy of the database.
#! /bin/bash # USAGE: kraken.sh dbname infile [infile ...] module load kraken/1.1 db=$1 dbname=$(basename $1) shift [[ ${1:-none} != "none" ]] || exit 1 # make sure the database in /dev/shm is cleaned up trap 'rm -rf /dev/shm/${dbname}' EXIT [[ -d ${db} ]] && cp -r $db /dev/shm || exit 1 for infile in "$@"; do [[ -f ${infile} ]] \ && kraken --db /dev/shm/${dbname} --threads ${SLURM_CPUS_PER_TASK} \ --fastq-input --gzip-compressed --output ${infile}.kraken done
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=38g kraken.sh \ /fdb/kraken/20171019_standard_32GB sample0[0-9].fastq.gz
#! /bin/bash # USAGE: kraken2.sh dbname infile [infile ...] module load kraken/2.1.2 db=$1 dbname=$(basename $1) shift [[ ${1:-none} != "none" ]] || exit 1 [[ -d ${db} ]] && cp -r $db /lscratch/$SLURM_JOB_ID || exit 1 for infile in "$@"; do [[ -f ${infile} ]] \ && kraken2 --db /lscratch/$SLURM_JOB_ID/${dbname} --threads ${SLURM_CPUS_PER_TASK} \ --output ${infile}.kraken ${infile} done
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=72g kraken2.sh \ /fdb/kraken/20220803_standard_kraken2 sample0[0-9].fastq.gz
Create a swarmfile (e.g. kraken2.swarm) that uses the script above to process several fastq files in each job
kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample0[0-9].fastq.gz kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample1[0-9].fastq.gz kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample2[0-9].fastq.gz
Submit this job using the swarm command.
swarm -f kraken2.swarm -g 72 -t 16 --module kraken/2.1.2where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module kraken | Loads the kraken module for each subjob in the swarm |