KrakenUniq: metagenomics classifier with unique k-mer counting for more specific results

KrakenUniq is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It combines the fast k-mer-based classification of Kraken (Wood and Salzberg, Genome Biology 2014) with an efficient algorithm for assessing the coverage of unique k-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=lscratch:100 --mem=128g -c10 --mem-per-cpu=40g

[user@cn3329 ~]$ module load krakenuniq 
[+] Loading kraken  2.0.7-beta
[+] Loading ncbi-toolkit 21.0.0 on cn3355
[+] Loading KrakenUniq 0.5.8  ...
Creatting a KrakenUniq database involves three steps:
1) installing a taxonomy,
2) installing a genomic library, and
3) building the database.

For example, here are the steps to build a custom archaea database:
1) Install a taxonomy:
[user@cn3329 user]$ krakenuniq-download -db archaea_db taxonomy 
Storing taxonomy timestamp [date > archaea_db/taxonomy/timestamp] ... done (took 0s)
Extracting nodes file [tar -C archaea_db/taxonomy -zxvf archaea_db/taxonomy/taxdump.tar.gz nodes.dmp 1>&2] ...nodes.dmp
 done (took 3s)
archaea_db/taxonomy/nodes.dmp                      check [137.46 MB]
Extracting names file [tar -C archaea_db/taxonomy -zxvf archaea_db/taxonomy/taxdump.tar.gz names.dmp 1>&2] ...names.dmp
 done (took 2s)
archaea_db/taxonomy/names.dmp                      check [172.85 MB]
2) Install a genomic library:
[user@cn3329 user]$ krakenuniq-download -db archaea_db --threads 10 --dust refseq/archaea 
Downloading assembly summary file for archaea genomes, and filtering to assembly level Complete_Genome.
 Downloading archaea genomes:  295/295 ...   Found 295 files.
3) Build the database:
[user@cn3329 ~]$ krakenuniq-build --db archaea_db --kmer-len 31 --threads 10 --taxids-for-genomes --taxids-for-sequences
Found jellyfish v1.1.12
Kraken build set to minimize disk writes.
Finding all library files
Found 295 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '894851007'
K-mer set created. [1m45.475s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ... db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [8m21.860s]
Creating seqID to taxID map (step 4 of 6)..
477 sequences mapped to taxa. [0.597s]
Creating taxDB (step 5 of 6)... 
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2118940 taxa
taxDB construction finished. [2m23.614s]
Building  KrakenUniq LCA database (step 6 of 6)...
 Adding taxonomy IDs for sequences
 Adding taxonomy IDs for genomes
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (6.088 GB) ... Done
Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 477 mappings.
Finished processing 477 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping)
Writing kmer counts to database.kdb.counts...
Writing database from RAM back to database.kdb ...
Writing new TaxDB ...
LCA database created. [3m12.701s]
Creating database summary report database.report.tsv ...
/usr/local/Anaconda/envs_app/KrakenUniq/0.5.8/libexec/classify -d /fdb/krakenuniq/./database.kdb -i /fdb/krakenuniq/./database.idx -t 10 -r database.report.tsv -a /fdb/krakenuniq/./taxDB -p 12 /dev/fd/63
 Database /fdb/krakenuniq/./database.kdb
[user@cn3329 ~]$ ls archaea_db
database0.kdb       database.jdb         database.kraken.tsv  library-files.txt     taxDB
database-build.log  database.kdb         database.report.tsv  seqid2taxid.map       taxDB.orig
database.idx        database.kdb.counts  library              seqid2taxid.map.orig  taxonomy
Once the database is built, for convenience of its subsequent use, update the environment variable KRAKEN_DB_PATH by prepending to it the parent directory of the database. For example:
[user@cn3329 ~]$ export KRAKEN_DB_PATH=$PWD:$KRAKEN_DB_PATH
Now, to classify a set of sample sequences (reads), use the commands:
[user@cn3329 ~]$ cp $KU_DATA/* .
[user@cn3329 ~]$ krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10  test_archaea.fna 
/usr/local/Anaconda/envs_app/KrakenUniq/0.5.8/libexec/classify -d /fdb/krakenuniq/archaea_db/database.kdb -i /fdb/krakenuniq/archaea_db/database.idx -t 10 -r res_archaea.tsv -a /fdb/krakenuniq/archaea_db/taxDB -p 12 test_archaea.fna
 Database /fdb/krakenuniq/archaea_db/database.kdb
Loaded database with 544731025 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from /fdb/krakenuniq/archaea_db/taxDB. Done.
10 sequences (0.01 Mbp) processed in 0.009s (65.4 Kseq/m, 71.84 Mbp/m).
  7 sequences classified (70.00%)
  3 sequences unclassified (30.00%)
Writing report file to res_archaea.tsv  ..
Reading genome sizes from /fdb/krakenuniq/archaea_db/database.kdb.counts ... done
Setting values in the taxonomy tree ... done
Printing classification report ...  done
Report finished in 0.004 seconds.
A classification report stored in file res_archaea.tsv will be produced:
[user@cn3329 ~]$ cat res_archaea.tsv
# KrakenUniq v0.5.7 DATE:2019-06-11T11:42:42Z DB:/fdb/krakenuniq/archaea_db DB_SIZE:6536773372 WD:/fdb/krakenuniq
# CL:perl /usr/local/apps/krakenuniq/0.5.8/bin/krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna

%       reads   taxReads        kmers   dup     cov     taxID   rank    taxName
30      3       3       6627    1.13    NA      0       no rank unclassified
70      7       0       1003    3       1.841e-06       1       no rank root
70      7       0       1003    3       1.841e-06       131567  no rank   cellular organisms
70      7       0       1003    3       1.841e-06       2157    superkingdom        Archaea
70      7       0       898     3.02    1.997e-06       28890   phylum        Euryarchaeota
70      7       0       786     3.04    1.319e-05       2283794 no rank         Methanomada group
70      7       0       727     3.05    1.938e-05       183925  class             Methanobacteria
70      7       0       727     3.05    1.938e-05       2158    order               Methanobacteriales
70      7       0       671     3.04    1.84e-05        2159    family                Methanobacteriaceae
70      7       0       292     3.41    1.955e-05       2160    genus                   Methanobacterium
40      4       0       92      3.9     3.859e-05       877455  species                   Methanobacterium lacus
40      4       0       92      3.9     3.859e-05       1000000594      assembly                            GCF_000191585.1 Methanobacterium lacus strain=AL-21
40      4       4       92      3.9     3.859e-05       1000000595      sequence                              NC_015216.1 Methanobacterium lacus strain AL-21, complete genome
20      2       0       29      2.93    1.706e-05       1379702 species                   Methanobacterium sp. MB1
20      2       0       29      2.93    1.706e-05       1000000656      assembly                            GCF_000499765.1 Methanobacterium sp. MB1
20      2       2       29      2.93    1.706e-05       1000000657      sequence                              NC_023044.1 Methanobacterium sp. MB1 complete sequence
10      1       1       7       1       2.384e-06       2162    species                   Methanobacterium formicicum
End the interactive session:
[user@cn3329 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226