High-Performance Computing at the NIH
kraken on Biowulf

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. It uses exact matches of k-mers against a reference database.
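As a toy illustration of the k-mer idea (this is not kraken itself, just a bash sketch), the exact substrings of length k that kraken looks up in its database can be enumerated like this:

```shell
# Enumerate all k-mers of a short sequence; kraken looks up each such
# substring in its reference database and assigns the LCA of the matches.
seq=ACGTACGA
k=4
for ((i=0; i<=${#seq}-k; i++)); do
    echo "${seq:i:k}"
done
```

An 8 bp sequence yields 8 - 4 + 1 = 5 overlapping 4-mers.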


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. In the following sample session we will use kraken to build a custom database and run a classification against it.

[user@biowulf]$ sinteractive --mem=16g --cpus-per-task=4 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load kraken

###
### build a custom kraken database with some 400+ genomes
###
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir fa
[user@cn3144 ~]$ cd fa
[user@cn3144 ~]$ wget --quiet --input-file $KRAKEN_TEST_DATA/genomes_for_custom_db.urls
[user@cn3144 ~]$ gunzip *.gz
[user@cn3144 ~]$ ls -lh | head -5
total 1.3G
-rw-r--r-- 1 user group 3.9M May 22  2016 GCA_000006745.1_ASM674v1_genomic.fna
-rw-r--r-- 1 user group 2.1M May 26  2016 GCA_000006885.1_ASM688v1_genomic.fna
-rw-r--r-- 1 user group 2.0M May 27  2016 GCA_000007045.1_ASM704v1_genomic.fna
-rw-r--r-- 1 user group 5.3M Aug  4  2014 GCA_000007825.1_ASM782v1_genomic.fna

[user@cn3144 ~]$ cd ..
[user@cn3144 ~]$ kraken-build --download-taxonomy --db custom_db
[user@cn3144 ~]$ for f in fa/*.fna; do
                       kraken-build --add-to-library $f --db custom_db
                    done
Added "fa/GCA_000006745.1_ASM674v1_genomic.fna" to library (custom_db)
Added "fa/GCA_000006885.1_ASM688v1_genomic.fna" to library (custom_db)
[...snip...]
Added "fa/GCA_900129335.1_PP_HGAG_QV30_2SC_T4_genomic.fna" to library (custom_db)
Added "fa/GCA_900185995.1_BK1071_genomic.fna" to library (custom_db)

[user@cn3144 ~]$ kraken-build --build --threads 4 --db custom_db \
                        --jellyfish-hash-size=5000M
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
K-mer set created. [6m11.633s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [2m20.480s]
Skipping step 4, GI number to seqID map now obsolete.
Creating seqID to taxID map (step 5 of 6)...
721 sequences mapped to taxa. [3m48.166s]
Setting LCAs in database (step 6 of 6)...
Finished processing 721 sequences
Database LCAs set. [19m6.065s]
Database construction complete. [Total: 31m26.381s]

[user@cn3144 ~]$ kraken-build --clean --db custom_db
[user@cn3144 ~]$ du -sh custom_db
10G     custom_db/
[user@cn3144 ~]$ tree custom_db
custom_db/
|-- [user    14K]  accmap_file.tmp
|-- [user   8.0G]  database.idx
|-- [user   1.7G]  database.kdb
`-- [user   4.0K]  taxonomy
    |-- [user   142M]  names.dmp
    `-- [user   109M]  nodes.dmp

[user@cn3144 ~]$ cp -r custom_db /data/$USER/kraken_databases

###
### Classification
###
[user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy .
[user@cn3144 ~]$ cd accuracy
[user@cn3144 ~]$ ls -lh
total 7.5M
-rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README
-rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa
-rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa
-rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa
-rw-rw-r-- 1 user group  396 Aug 14 09:55 taxonomyDocSum.xslt

Now we will use the in-memory temporary file system /dev/shm to store the database for classification. Note that unlike lscratch, /dev/shm is not automatically cleaned up at exit, so care must be taken to remove the database explicitly at the end of a job.

[user@cn3144 ~]$ cp -r custom_db /dev/shm
[user@cn3144 ~]$ kraken --db /dev/shm/custom_db --fasta-input \
                        --output HiSeq.kraken HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 0.422s (1421.8 Kseq/m, 131.15 Mbp/m).
  8140 sequences classified (81.40%)
  1860 sequences unclassified (18.60%)

[user@cn3144 ~]$ head -5 HiSeq.kraken
C       A_hydrophila_HiSeq.922  380703  101     644:22 380703:1 644:30 0:18
C       A_hydrophila_HiSeq.1263 644     101     1236:3 644:1 2:1 644:8 1236:8 644:50
C       A_hydrophila_HiSeq.2905 1448139 101     644:8 1448139:51 0:1 644:11
C       A_hydrophila_HiSeq.4866 1448139 101     644:61 1448139:10
C       A_hydrophila_HiSeq.7009 644     101     0:13 644:14 0:31 644:13
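The five tab-separated columns of the kraken output are the classification flag (C/U), the read ID, the assigned taxid, the read length, and the per-k-mer LCA mappings. A quick way to tally classified vs. unclassified reads with awk (shown here on a small mock file; in the session above the real file is HiSeq.kraken):

```shell
# Mock kraken output in the same five-column format, then an awk tally of
# the classification flag in column 1.
printf 'C\tread1\t644\t101\t644:70\nU\tread2\t0\t101\t0:70\nC\tread3\t380703\t101\t380703:70\n' > demo.kraken
awk -F'\t' '{n[$1]++} END {printf "classified: %d  unclassified: %d\n", n["C"], n["U"]}' demo.kraken
```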

### CLEANUP
[user@cn3144 ~]$ rm -rf /dev/shm/custom_db
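In scripts, the explicit cleanup above can be made automatic with a bash EXIT trap, which fires even if a later command fails or the job is cancelled. A minimal sketch of the pattern (the demo directory under $TMPDIR stands in for the /dev/shm copy, and the subshell stands in for the script):

```shell
# Register the trap before creating the copy so cleanup is guaranteed.
demo="${TMPDIR:-/tmp}/custom_db_demo"    # stands in for /dev/shm/custom_db
(
    trap 'rm -rf "$demo"' EXIT           # runs when this shell exits
    mkdir -p "$demo"                     # stands in for: cp -r custom_db /dev/shm
    # ... kraken --db "$demo" ... would run here ...
)                                        # subshell exits -> trap removes $demo
[ -e "$demo" ] && echo "left behind" || echo "cleaned up"
```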

The same approach of copying to /dev/shm can be taken with the databases from /fdb/kraken, as long as they fit. Note that /dev/shm as a whole is limited to 50% of the node's memory, and within a job by the job's memory allocation. For example, using a reduced-size 'minikraken' database obtained from the kraken maintainers:

[user@cn3144 ~]$ cp -r /fdb/kraken/20180220_standard_8GB /dev/shm
[user@cn3144 ~]$ kraken --db /dev/shm/20180220_standard_8GB --fasta-input \
                        --output HiSeq.kraken.2 HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 0.602s (997.0 Kseq/m, 91.96 Mbp/m).
  7523 sequences classified (75.23%)
  2477 sequences unclassified (24.77%)
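Before copying a database, it can be worth checking that it will fit in the space available in /dev/shm. A sketch of such a check (the mktemp directory stands in for a real database directory from /fdb/kraken):

```shell
# Compare a database's size in bytes to the free space in /dev/shm.
db=$(mktemp -d)                                     # stands in for a database dir
dd if=/dev/zero of="$db/database.kdb" bs=1M count=1 status=none
need=$(du -sb "$db" | cut -f1)                      # bytes required
avail=$(df --output=avail -B1 /dev/shm | tail -n1 | tr -d ' ')
if [ "$need" -lt "$avail" ]; then
    echo "database fits in /dev/shm"
else
    echo "database too large for /dev/shm" >&2
fi
rm -rf "$db"
```

Remember that within a job the usable space is further capped by the job's memory allocation, not just by what df reports.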

Column three is the predicted taxid. Let's get the name and rank of the unique taxids from NCBI eutils and compare them to the predicted taxa:

[user@cn3144 ~]$ cut -f3 HiSeq.kraken.2 | sort -u | grep -wv 0 > taxid.uniq
[user@cn3144 ~]$ esum="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
[user@cn3144 ~]$ for taxid in $(cat taxid.uniq); do
                        curl -s -G "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt -
                    done 2> /dev/null > taxa.uniq
[user@cn3144 ~]$ join -1 3 -2 1 -t '|' \
                        <(sort -k3,3 HiSeq.kraken.2 | tr '\t' '|') \
                        <(paste -d '|' taxid.uniq taxa.uniq | sort -t'|' -k1,1) \
                    | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \
                    | sort > HiSeq.kraken.pred
[user@cn3144 ~]$ tail -n -100 HiSeq.kraken.pred | head -n 5
X_axonopodis_HiSeq.875312 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.87911 prediction: Xanthomonas citri group,species group()
X_axonopodis_HiSeq.881987 prediction: Xanthomonas,genus()
X_axonopodis_HiSeq.882508 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.882724 prediction: Xanthomonas,genus()

### CLEANUP
[user@cn3144 ~]$ rm -rf /dev/shm/20180220_standard_8GB
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. kraken.sh) which copies a reduced size standard database to /dev/shm and runs several fastq samples against the in-memory database. Note that it uses a bash exit trap to ensure cleanup.

#! /bin/bash
# USAGE: kraken.sh dbname infile [infile ...]

module load kraken/1.1
db=$1
dbname=$(basename "$1")
shift
# require at least one input file
[[ ${1:-none} != "none" ]] || exit 1

# make sure the database in /dev/shm is cleaned up
trap 'rm -rf "/dev/shm/${dbname}"' EXIT

[[ -d ${db} ]] && cp -r "$db" /dev/shm || exit 1

for infile in "$@"; do
    [[ -f ${infile} ]] \
    && kraken --db "/dev/shm/${dbname}" --threads "${SLURM_CPUS_PER_TASK}" \
        --fastq-input --gzip-compressed --output "${infile}.kraken" "${infile}"
done

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=38g kraken.sh \
    /fdb/kraken/20171019_standard_32GB sample0[0-9].fastq.gz

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. kraken.swarm) that uses the script above to process several fastq files in each job

kraken.sh /fdb/kraken/20180220_standard_32GB sample0[0-9].fastq.gz
kraken.sh /fdb/kraken/20180220_standard_32GB sample1[0-9].fastq.gz
kraken.sh /fdb/kraken/20180220_standard_32GB sample2[0-9].fastq.gz

Submit this job using the swarm command.

swarm -f kraken.swarm -g 38 -t 16 --module kraken/1.1
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module kraken/1.1 Loads the kraken module for each subjob in the swarm