KrakenUniq is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It combines the fast k-mer-based classification of Kraken (Wood and Salzberg, Genome Biology 2014) with an efficient algorithm for assessing the coverage of unique k-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --gres=lscratch:100 --mem=128g -c10 --mem-per-cpu=40g [user@cn3329 ~]$ module load krakenuniq [+] Loading kraken 2.0.7-beta [+] Loading ncbi-toolkit 21.0.0 on cn3355 [+] Loading KrakenUniq 0.5.8 ...Creatting a KrakenUniq database involves three steps:
[user@cn3329 user]$ krakenuniq-download -db archaea_db taxonomy Storing taxonomy timestamp [date > archaea_db/taxonomy/timestamp] ... done (took 0s) Extracting nodes file [tar -C archaea_db/taxonomy -zxvf archaea_db/taxonomy/taxdump.tar.gz nodes.dmp 1>&2] ...nodes.dmp done (took 3s) archaea_db/taxonomy/nodes.dmp check [137.46 MB] Extracting names file [tar -C archaea_db/taxonomy -zxvf archaea_db/taxonomy/taxdump.tar.gz names.dmp 1>&2] ...names.dmp done (took 2s) archaea_db/taxonomy/names.dmp check [172.85 MB]2) Install a genomic library:
[user@cn3329 user]$ krakenuniq-download -db archaea_db --threads 10 --dust refseq/archaea Downloading assembly summary file for archaea genomes, and filtering to assembly level Complete_Genome. Downloading archaea genomes: 295/295 ... Found 295 files.3) Build the database:
[user@cn3329 ~]$ krakenuniq-build --db archaea_db --kmer-len 31 --threads 10 --taxids-for-genomes --taxids-for-sequences Found jellyfish v1.1.12 Kraken build set to minimize disk writes. Finding all library files Found 295 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory. Creating k-mer set (step 1 of 6)... Using jellyfish Hash size not specified, using '894851007' K-mer set created. [1m45.475s] Skipping step 2, no database reduction requested. Sorting k-mer set (step 3 of 6)... db_sort: Getting database into memory ...Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8]. Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8]. db_sort: Sorting ... db_sort: Sorting complete - writing database to disk ... K-mer set sorted. [8m21.860s] Creating seqID to taxID map (step 4 of 6).. 477 sequences mapped to taxa. [0.597s] Creating taxDB (step 5 of 6)... Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2118940 taxa taxDB construction finished. [2m23.614s] Building KrakenUniq LCA database (step 6 of 6)... Adding taxonomy IDs for sequences Adding taxonomy IDs for genomes Reading taxonomy index from taxDB. Done. Getting database0.kdb into memory (6.088 GB) ... Done Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8]. Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 477 mappings. Finished processing 477 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping) Writing kmer counts to database.kdb.counts... Writing database from RAM back to database.kdb ... Writing new TaxDB ... LCA database created. [3m12.701s] Creating database summary report database.report.tsv ... /usr/local/Anaconda/envs_app/KrakenUniq/0.5.8/libexec/classify -d /fdb/krakenuniq/./database.kdb -i /fdb/krakenuniq/./database.idx -t 10 -r database.report.tsv -a /fdb/krakenuniq/./taxDB -p 12 /dev/fd/63 Database /fdb/krakenuniq/./database.kdb [user@cn3329 ~]$ ls archaea_db database0.kdb database.jdb database.kraken.tsv library-files.txt taxDB database-build.log database.kdb database.report.tsv seqid2taxid.map taxDB.orig database.idx database.kdb.counts library seqid2taxid.map.orig taxonomyOnce the database is built, for convenience of its subsequent use, update the environment variable KRAKEN_DB_PATH by prepending to it the parent directory of the database. For example:
[user@cn3329 ~]$ export KRAKEN_DB_PATH=$PWD:$KRAKEN_DB_PATHNow, to classify a set of sample sequences (reads), use the commands:
[user@cn3329 ~]$ cp $KU_DATA/* . [user@cn3329 ~]$ krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna /usr/local/Anaconda/envs_app/KrakenUniq/0.5.8/libexec/classify -d /fdb/krakenuniq/archaea_db/database.kdb -i /fdb/krakenuniq/archaea_db/database.idx -t 10 -r res_archaea.tsv -a /fdb/krakenuniq/archaea_db/taxDB -p 12 test_archaea.fna Database /fdb/krakenuniq/archaea_db/database.kdb Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8]. Reading taxonomy index from /fdb/krakenuniq/archaea_db/taxDB. Done. 10 sequences (0.01 Mbp) processed in 0.009s (65.4 Kseq/m, 71.84 Mbp/m). 7 sequences classified (70.00%) 3 sequences unclassified (30.00%) Writing report file to res_archaea.tsv .. Reading genome sizes from /fdb/krakenuniq/archaea_db/database.kdb.counts ... done Setting values in the taxonomy tree ... done Printing classification report ... done Report finished in 0.004 seconds.A classification report stored in file res_archaea.tsv will be produced:
[user@cn3329 ~]$ cat res_archaea.tsv # KrakenUniq v0.5.7 DATE:2019-06-11T11:42:42Z DB:/fdb/krakenuniq/archaea_db DB_SIZE:6536773372 WD:/fdb/krakenuniq # CL:perl /usr/local/apps/krakenuniq/0.5.8/bin/krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna % reads taxReads kmers dup cov taxID rank taxName 30 3 3 6627 1.13 NA 0 no rank unclassified 70 7 0 1003 3 1.841e-06 1 no rank root 70 7 0 1003 3 1.841e-06 131567 no rank cellular organisms 70 7 0 1003 3 1.841e-06 2157 superkingdom Archaea 70 7 0 898 3.02 1.997e-06 28890 phylum Euryarchaeota 70 7 0 786 3.04 1.319e-05 2283794 no rank Methanomada group 70 7 0 727 3.05 1.938e-05 183925 class Methanobacteria 70 7 0 727 3.05 1.938e-05 2158 order Methanobacteriales 70 7 0 671 3.04 1.84e-05 2159 family Methanobacteriaceae 70 7 0 292 3.41 1.955e-05 2160 genus Methanobacterium 40 4 0 92 3.9 3.859e-05 877455 species Methanobacterium lacus 40 4 0 92 3.9 3.859e-05 1000000594 assembly GCF_000191585.1 Methanobacterium lacus strain=AL-21 40 4 4 92 3.9 3.859e-05 1000000595 sequence NC_015216.1 Methanobacterium lacus strain AL-21, complete genome 20 2 0 29 2.93 1.706e-05 1379702 species Methanobacterium sp. MB1 20 2 0 29 2.93 1.706e-05 1000000656 assembly GCF_000499765.1 Methanobacterium sp. MB1 20 2 2 29 2.93 1.706e-05 1000000657 sequence NC_023044.1 Methanobacterium sp. MB1 complete sequence 10 1 1 7 1 2.384e-06 2162 species Methanobacterium formicicumEnd the interactive session:
[user@cn3329 ~]$ exit salloc.exe: Relinquishing job allocation 46116226