KrakenUniq is a system for assigning taxonomic labels to short DNA sequences, typically obtained through metagenomic studies. It uses exact matches of k-mers against a reference database. It is an upgraded version of Kraken 1.
Do not run KrakenUniq with the database in the default /fdb location or a shared /data directory. It must be run with the database in /lscratch or /dev/shm.
Test data is available in $KRAKEN_TEST_DATA and is shared with kraken. Databases are located in /fdb/kraken/, which is also the value of $KRAKEN_DB_PATH. The standard database contains genomes of archaea, bacteria, and viruses from NCBI. Do not run classification directly against the databases in /fdb/kraken, since this can cause severe file system strain; the database should first be copied to memory (/dev/shm) or to /lscratch.

Allocate an interactive session and run the program. In the following sample session we run a minimal classification. See the documentation for Kraken (version 1) for more detailed guidance on custom databases.
[user@biowulf]$ sinteractive --mem=24g --cpus-per-task=4 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load krakenuniq/1.0.4

###
### Copy database into lscratch
###
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ cp -r $KRAKEN_DB/20200428_bacteria_32GB .

###
### Classification
###
[user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy .
[user@cn3144 ~]$ cd accuracy
[user@cn3144 ~]$ ls -lh
total 7.5M
-rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README
-rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa
-rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa
-rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa
-rw-rw-r-- 1 user group  396 Aug 14 09:55 taxonomyDocSum.xslt
[user@cn3144 ~]$ krakenuniq --db ../20200428_bacteria_32GB \
                     --output HiSeq.kraken \
                     HiSeq_accuracy.fa
Taxonomy database not at ./20200428_bacteria_32GB/taxDB - creating it
    .../usr/local/apps/krakenuniq/1.0.4/bin/build_taxdb ./20200428_bacteria_32GB/taxonomy/names.dmp ./20200428_bacteria_32GB/taxonomy/nodes.dmp > ./20200428_bacteria_32GB/taxDB
Building taxonomy index from ./20200428_bacteria_32GB/taxonomy/nodes.dmp and ./20200428_bacteria_32GB/taxonomy/names.dmp. Done, got 2242898 taxa
/usr/local/apps/krakenuniq/1.0.4/bin/classify -d ./20200428_bacteria_32GB/database.kdb -i ./20200428_bacteria_32GB/database.idx -o HiSeq.krakenuniq -a ./20200428_bacteria_32GB/taxDB -p 12
Database ./20200428_bacteria_32GB/database.kdb
Loaded database with 2505397588 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from ./20200428_bacteria_32GB/taxDB. Done.
Writing Kraken output to HiSeq.krakenuniq
10000 sequences (0.92 Mbp) processed in 0.677s (886.5 Kseq/m, 81.78 Mbp/m).
  8586 sequences classified (85.86%)
  1414 sequences unclassified (14.14%)
Finishing up ...

[user@cn3144 ~]$ head -5 HiSeq.krakenuniq
C  A_hydrophila_HiSeq.922   644      101  0:6 1224:1 0:3 1224:1 0:16 1224:1 0:2 1224:1 0:4 1224:1 0:6 642:1 0:8 644:1 0:19
C  A_hydrophila_HiSeq.1263  1236     101  0:2 2:1 0:12 2:1 0:9 1236:1 0:6 1236:2 0:37
C  A_hydrophila_HiSeq.2905  1448139  101  0:8 1448139:1 0:5 1448139:1 0:16 1448139:1 0:24 642:1 0:14
C  A_hydrophila_HiSeq.4866  642      101  0:4 642:2 0:8 642:1 0:6 642:1 0:2 642:2 0:5 642:1 0:5 642:1 0:5 642:1 0:14 642:1 0:4 642:1 0:2 642:1 0:4
C  A_hydrophila_HiSeq.7009  642      101  0:49 571:1 0:14 1224:1 0:4 642:2

### CLEANUP
[user@cn3144 ~]$ rm -rf /dev/shm/custom_db
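The first column of the output marks each read as classified (C) or unclassified (U). A quick way to tally the two categories yourself (a sketch, assuming the HiSeq.krakenuniq file produced above):

```shell
# Tally classified (C) vs unclassified (U) reads in a KrakenUniq output file;
# column 1 of each line is the classification flag.
awk '{n[$1]++} END {for (s in n) printf "%s\t%d\n", s, n[s]}' HiSeq.krakenuniq
```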
The same approach of copying to /dev/shm can be taken with the databases from /fdb/kraken as long as they fit. Note that /dev/shm as a whole is limited to 50% of memory, and per job by the memory allocation. For example, using a reduced-size 'minikraken' database obtained from the kraken maintainers:
[user@cn3144 ~]$ cp -r /fdb/kraken/20180220_standard_8GB /dev/shm
[user@cn3144 ~]$ krakenuniq --db /dev/shm/20180220_standard_8GB --fasta-input \
                     --output HiSeq.kraken.2 HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 0.602s (997.0 Kseq/m, 91.96 Mbp/m).
  7523 sequences classified (75.23%)
  2477 sequences unclassified (24.77%)
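Copies placed in /dev/shm are not cleaned up automatically when the session ends, so remove the database manually when you are finished (the path matches the example above):

```shell
# Remove the database copy from shared memory when done with classification,
# otherwise it continues to consume the node's memory.
rm -rf /dev/shm/20180220_standard_8GB
```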
Create a batch input file (e.g. krakenuniq.sh) which copies a reduced-size standard database to /lscratch or /dev/shm and runs several fastq samples against the local copy of the database.
#!/bin/bash
# USAGE: krakenuniq.sh dbname infile [infile ...]
module load krakenuniq/1.0.4

db=$1
dbname=$(basename $1)
shift
[[ ${1:-none} != "none" ]] || exit 1

# make sure the database in /dev/shm is cleaned up
trap 'rm -rf /dev/shm/${dbname}' EXIT
[[ -d ${db} ]] && cp -r $db /dev/shm || exit 1

for infile in "$@"; do
    [[ -f ${infile} ]] \
        && krakenuniq --db /dev/shm/${dbname} --threads ${SLURM_CPUS_PER_TASK} \
               --output ${infile}.kraken.gz ${infile}
done
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=38g krakenuniq.sh \
       /fdb/kraken/20171019_standard_32GB sample0[0-9].fastq.gz
Create a swarmfile (e.g. krakenuniq.swarm) that uses the script above to process several fastq files in each job:
krakenuniq.sh /fdb/kraken/20200429_standard_16GB sample0[0-9].fastq.gz
krakenuniq.sh /fdb/kraken/20200429_standard_16GB sample1[0-9].fastq.gz
krakenuniq.sh /fdb/kraken/20200429_standard_16GB sample2[0-9].fastq.gz
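With many samples, the swarmfile can be generated rather than written by hand. A minimal sketch following the naming pattern of the lines above (the database path and sample names are illustrative):

```shell
# Generate one swarm line per batch of ten samples (sample0*, sample1*, sample2*).
db=/fdb/kraken/20200429_standard_16GB
for d in 0 1 2; do
    echo "krakenuniq.sh $db sample${d}[0-9].fastq.gz"
done > krakenuniq.swarm
```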
Submit this job using the swarm command.
swarm -f krakenuniq.swarm -g 72 -t 16 --module krakenuniq/1.0.4

where
-g #                | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #                | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module krakenuniq | Loads the krakenuniq module for each subjob in the swarm