kraken on Biowulf

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. It uses exact matches of k-mers against a reference database.

Important Notes
Do not run Kraken with the database in the default /fdb location or a shared /data directory. It must be run against a copy of the database in /lscratch or /dev/shm.
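
The general pattern used throughout the examples below is to copy the database to node-local storage first and point kraken at that local copy. A minimal sketch (reads.fa and out.kraken are placeholder names; the database directory is one of those under /fdb/kraken):

# copy the database to local scratch, then classify against the local copy
cp -r /fdb/kraken/20180220_standard_8GB /lscratch/$SLURM_JOB_ID
kraken --db /lscratch/$SLURM_JOB_ID/20180220_standard_8GB --fasta-input \
    --output out.kraken reads.fa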

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. In the following sample session we will use kraken to build a custom database and run a classification against that custom database. The same example is shown below for both kraken and kraken2; kraken2 differs from kraken 1 in several important ways, most visibly in its much smaller on-disk database format.

[user@biowulf]$ sinteractive --mem=24g --cpus-per-task=4 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load kraken/1.1

###
### build a custom kraken database with some 400+ genomes
###
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir fa
[user@cn3144 ~]$ cd fa
[user@cn3144 ~]$ wget --quiet --input-file $KRAKEN_TEST_DATA/genomes_for_custom_db.urls
[user@cn3144 ~]$ gunzip *.gz
[user@cn3144 ~]$ ls -lh | head -5
total 1.3G
-rw-r--r-- 1 user group 3.9M May 22  2016 GCA_000006745.1_ASM674v1_genomic.fna
-rw-r--r-- 1 user group 2.1M May 26  2016 GCA_000006885.1_ASM688v1_genomic.fna
-rw-r--r-- 1 user group 2.0M May 27  2016 GCA_000007045.1_ASM704v1_genomic.fna
-rw-r--r-- 1 user group 5.3M Aug  4  2014 GCA_000007825.1_ASM782v1_genomic.fna

[user@cn3144 ~]$ cd ..
[user@cn3144 ~]$ kraken-build --download-taxonomy --db custom_db
[user@cn3144 ~]$ for f in fa/*.fna; do
                       kraken-build --add-to-library $f --db custom_db
                    done
Added "fa/GCA_000006745.1_ASM674v1_genomic.fna" to library (custom_db)
Added "fa/GCA_000006885.1_ASM688v1_genomic.fna" to library (custom_db)
[...snip...]
Added "fa/GCA_900129335.1_PP_HGAG_QV30_2SC_T4_genomic.fna" to library (custom_db)
Added "fa/GCA_900185995.1_BK1071_genomic.fna" to library (custom_db)

[user@cn3144 ~]$ kraken-build --build --threads $SLURM_CPUS_PER_TASK --db custom_db
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '253231360'
K-mer set created. [6m11.633s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [2m20.480s]
Skipping step 4, GI number to seqID map now obsolete.
Creating seqID to taxID map (step 5 of 6)...
721 sequences mapped to taxa. [3m48.166s]
Setting LCAs in database (step 6 of 6)...
Finished processing 721 sequences
Database LCAs set. [19m6.065s]
Database construction complete. [Total: 31m26.381s]

[user@cn3144 ~]$ kraken-build --clean --db custom_db
[user@cn3144 ~]$ du -sh custom_db
10G     custom_db/
[user@cn3144 ~]$ tree custom_db
custom_db/
|-- [user    14K]  accmap_file.tmp
|-- [user   8.0G]  database.idx
|-- [user   1.7G]  database.kdb
`-- [user   4.0K]  taxonomy
    |-- [user   142M]  names.dmp
    `-- [user   109M]  nodes.dmp

[user@cn3144 ~]$ cp -r custom_db /data/$USER/kraken_databases
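
Note that cp -r creates the target if it does not already exist, in which case kraken_databases itself would end up being the copy of custom_db. To keep the database as a subdirectory, create the target directory first:

# make sure the target directory exists so that custom_db is copied into it
mkdir -p /data/$USER/kraken_databases
cp -r custom_db /data/$USER/kraken_databases/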

###
### Classification
###
[user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy .
[user@cn3144 ~]$ cd accuracy
[user@cn3144 ~]$ ls -lh
total 7.5M
-rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README
-rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa
-rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa
-rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa
-rw-rw-r-- 1 user group  396 Aug 14 09:55 taxonomyDocSum.xslt

Now we will use the in-memory temporary filesystem /dev/shm to store the database for classification. Note that, unlike /lscratch, /dev/shm is not automatically cleaned up at the end of a job, so care has to be taken to remove the database copy explicitly before exiting.
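
In a script, the cleanup can be guaranteed with a shell EXIT trap registered before the copy is made; the kraken.sh batch script further below uses the same pattern. A minimal sketch:

# remove the /dev/shm copy of the database automatically when the script exits
trap 'rm -rf /dev/shm/custom_db' EXIT
cp -r custom_db /dev/shm

In the interactive session below, the copy is instead removed by hand at the end.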

[user@cn3144 ~]$ cp -r custom_db /dev/shm
[user@cn3144 ~]$ kraken --db /dev/shm/custom_db --fasta-input \
                        --output HiSeq.kraken HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 0.422s (1421.8 Kseq/m, 131.15 Mbp/m).
  8140 sequences classified (81.40%)
  1860 sequences unclassified (18.60%)

[user@cn3144 ~]$ head -5 HiSeq.kraken
C       A_hydrophila_HiSeq.922  380703  101     644:22 380703:1 644:30 0:18
C       A_hydrophila_HiSeq.1263 644     101     1236:3 644:1 2:1 644:8 1236:8 644:50
C       A_hydrophila_HiSeq.2905 1448139 101     644:8 1448139:51 0:1 644:11
C       A_hydrophila_HiSeq.4866 1448139 101     644:61 1448139:10
C       A_hydrophila_HiSeq.7009 644     101     0:13 644:14 0:31 644:13

### CLEANUP
[user@cn3144 ~]$ rm -rf /dev/shm/custom_db

The same approach of copying to /dev/shm can be taken with the databases from /fdb/kraken, as long as they fit. Note that /dev/shm as a whole is limited to 50% of the node's memory, and within a job by the job's memory allocation (a quick way to check is sketched after the example below). For example, using a reduced-size 'minikraken' database obtained from the kraken maintainers:

[user@cn3144 ~]$ cp -r /fdb/kraken/20180220_standard_8GB /dev/shm
[user@cn3144 ~]$ kraken --db /dev/shm/20180220_standard_8GB --fasta-input \
                        --output HiSeq.kraken.2 HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 0.602s (997.0 Kseq/m, 91.96 Mbp/m).
  7523 sequences classified (75.23%)
  2477 sequences unclassified (24.77%)
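
To check in advance whether a database from /fdb/kraken will fit, compare its size to the job's memory allocation before copying. A minimal sketch (SLURM_MEM_PER_NODE is set by Slurm, in MB, when memory is requested per node):

# compare the database size (MB) with the memory allocated to this job (MB)
dbsize_mb=$(du -sm /fdb/kraken/20180220_standard_8GB | cut -f1)
echo "database: ${dbsize_mb} MB, job allocation: ${SLURM_MEM_PER_NODE:-unset} MB"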

Column three is the predicted taxid. Let's get the name and rank of the unique taxids from NCBI eutils and compare them to the predicted taxa:

[user@cn3144 ~]$ cut -f3 HiSeq.kraken.2 | sort -u | grep -wv 0 > taxid.uniq
[user@cn3144 ~]$ esum="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
[user@cn3144 ~]$ for taxid in $(cat taxid.uniq); do
                        curl -s -G "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt -
                    done 2> /dev/null > taxa.uniq
[user@cn3144 ~]$ join -1 3 -2 1 -t '|' \
                        <(sort -k3,3 HiSeq.kraken.2 | tr '\t' '|') \
                        <(paste -d '|' taxid.uniq taxa.uniq | sort -t'|' -k1,1) \
                    | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \
                    | sort > HiSeq.kraken.pred
[user@cn3144 ~]$ tail -n -100 HiSeq.kraken.pred | head -n 5
X_axonopodis_HiSeq.875312 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.87911 prediction: Xanthomonas citri group,species group()
X_axonopodis_HiSeq.881987 prediction: Xanthomonas,genus()
X_axonopodis_HiSeq.882508 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.882724 prediction: Xanthomonas,genus()

### CLEANUP
[user@cn3144 ~]$ rm -rf /dev/shm/20180220_standard_8GB
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

The same example, this time with kraken2:

[user@biowulf]$ sinteractive --mem=36g --cpus-per-task=4 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load kraken/2.1.2

###
### build a custom kraken database with some 400+ genomes
###
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir fa
[user@cn3144 ~]$ cd fa
[user@cn3144 ~]$ wget --quiet --input-file $KRAKEN_TEST_DATA/genomes_for_custom_db.urls
[user@cn3144 ~]$ gunzip *.gz
[user@cn3144 ~]$ ls -lh | head -5
total 1.3G
-rw-r--r-- 1 user group 3.9M May 22  2016 GCA_000006745.1_ASM674v1_genomic.fna
-rw-r--r-- 1 user group 2.1M May 26  2016 GCA_000006885.1_ASM688v1_genomic.fna
-rw-r--r-- 1 user group 2.0M May 27  2016 GCA_000007045.1_ASM704v1_genomic.fna
-rw-r--r-- 1 user group 5.3M Aug  4  2014 GCA_000007825.1_ASM782v1_genomic.fna

[user@cn3144 ~]$ cd ..
[user@cn3144 ~]$ kraken2-build --download-taxonomy --db custom_db
[user@cn3144 ~]$ for f in fa/*.fna; do
                       kraken2-build --add-to-library $f --db custom_db
                    done
Added "fa/GCA_000006745.1_ASM674v1_genomic.fna" to library (custom_db)
Added "fa/GCA_000006885.1_ASM688v1_genomic.fna" to library (custom_db)
[...snip...]
Added "fa/GCA_900129335.1_PP_HGAG_QV30_2SC_T4_genomic.fna" to library (custom_db)
Added "fa/GCA_900185995.1_BK1071_genomic.fna" to library (custom_db)

# larger databases would need more threads to build
[user@cn3144 ~]$ kraken2-build --build --threads 2 --db custom_db
Creating sequence ID to taxonomy ID map (step 1)...
Found 721/721 targets, searched through 203908352 accession IDs, search complete.
Sequence ID to taxonomy ID map complete. [18.874s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 258445896 bytes
Capacity estimation complete. [22.086s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 8 bits reserved for taxid.
Completed processing of 721 sequences, 1343692302 bp
Writing data to disk...  complete.
Database files completed. [1m7.904s]
Database construction complete. [Total: 1m48.906s]

[user@cn3144 ~]$ kraken2-build --clean --db custom_db
Database disk usage: 34G
After cleaning, database uses 247M
[user@cn3144 ~]$ tree custom_db
custom_db/
|-- [user   246M]  hash.k2d
|-- [user     48]  opts.k2d
|-- [user    16K]  taxo.k2d
`-- [user    16K]  unmapped.txt


[user@cn3144 ~]$ cp -r custom_db /data/$USER/kraken_databases

###
### Classification
###
[user@cn3144 ~]$ cp -r $KRAKEN_TEST_DATA/accuracy .
[user@cn3144 ~]$ cd accuracy
[user@cn3144 ~]$ ls -lh
total 7.5M
-rw-rw-r-- 1 user group 2.2K Aug 14 09:55 ACCURACY.README
-rw-rw-r-- 1 user group 1.2M Aug 14 09:55 HiSeq_accuracy.fa
-rw-rw-r-- 1 user group 1.8M Aug 14 09:55 MiSeq_accuracy.fa
-rw-rw-r-- 1 user group 4.6M Aug 14 09:55 simBA5_accuracy.fa
-rw-rw-r-- 1 user group  396 Aug 14 09:55 taxonomyDocSum.xslt

[user@cn3144 ~]$ kraken2 --db ../custom_db --output HiSeq.kraken HiSeq_accuracy.fa
Loading database information... done.
10000 sequences (0.92 Mbp) processed in 0.100s (5993.0 Kseq/m, 552.81 Mbp/m).
  8381 sequences classified (83.81%)
  1619 sequences unclassified (16.19%)


[user@cn3144 ~]$ head -5 HiSeq.kraken
C       A_hydrophila_HiSeq.922  644     101     644:56 0:3 644:4 0:4
C       A_hydrophila_HiSeq.1263 644     101     2:2 644:5 1236:3 2:4 1236:4 644:24 1236:5 644:1 1236:5 644:14
C       A_hydrophila_HiSeq.2905 1448139 101     644:7 1448139:28 644:5 1448139:5 644:4 1448139:10 644:8
C       A_hydrophila_HiSeq.4866 1448139 101     644:60 9
C       A_hydrophila_HiSeq.7009 644     101     0:5 644:19 0:3 644:9 0:18 644:13

[user@cn3144 ~]$ cp -r /fdb/kraken/20210223_standard_16GB_kraken2 ..
[user@cn3144 ~]$ kraken2 --db ../20210223_standard_16GB_kraken2 \
                        --output HiSeq.kraken.2 HiSeq_accuracy.fa
Loading database information... done.
10000 sequences (0.92 Mbp) processed in 0.148s (4062.8 Kseq/m, 374.76 Mbp/m).
  9033 sequences classified (90.33%)
  967 sequences unclassified (9.67%)

Column three is the predicted taxid. Let's get the name and rank of the unique taxids from NCBI eutils and compare them to the predicted taxa:

[user@cn3144 ~]$ cut -f3 HiSeq.kraken.2 | sort -u | grep -wv 0 > taxid.uniq
[user@cn3144 ~]$ esum="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
[user@cn3144 ~]$ for taxid in $(cat taxid.uniq); do
                        curl -s -G "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt -
                    done 2> /dev/null > taxa.uniq
[user@cn3144 ~]$ join -1 3 -2 1 -t '|' \
                        <(sort -k3,3 HiSeq.kraken.2 | tr '\t' '|') \
                        <(paste -d '|' taxid.uniq taxa.uniq | sort -t'|' -k1,1) \
                    | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \
                    | sort > HiSeq.kraken.pred
[user@cn3144 ~]$ tail -n -100 HiSeq.kraken.pred | head -n 5
X_axonopodis_HiSeq.875312 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.87911 prediction: Xanthomonas citri group,species group()
X_axonopodis_HiSeq.881987 prediction: Xanthomonas,genus()
X_axonopodis_HiSeq.882508 prediction: Xanthomonas phaseoli pv. dieffenbachiae LMG 695,()
X_axonopodis_HiSeq.882724 prediction: Xanthomonas,genus()

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. kraken.sh below for kraken 1, or kraken2.sh for kraken2) which copies a reduced-size standard database to /dev/shm (kraken 1) or /lscratch (kraken2) and runs several fastq samples against the local copy of the database.

#! /bin/bash
# USAGE: kraken.sh dbname infile [infile ...]

module load kraken/1.1
db=$1
dbname=$(basename $1)
shift
[[ ${1:-none} != "none" ]] || exit 1

# make sure the database in /dev/shm is cleaned up
trap 'rm -rf /dev/shm/${dbname}' EXIT

[[ -d ${db} ]] && cp -r $db /dev/shm || exit 1

for infile in "$@"; do
    [[ -f ${infile} ]] \
    && kraken --db /dev/shm/${dbname} --threads ${SLURM_CPUS_PER_TASK} \
        --fastq-input --gzip-compressed --output ${infile}.kraken ${infile}
done

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=38g kraken.sh \
    /fdb/kraken/20171019_standard_32GB sample0[0-9].fastq.gz

The equivalent kraken2 script (kraken2.sh) copies the database to /lscratch instead:

#! /bin/bash
# USAGE: kraken2.sh dbname infile [infile ...]

module load kraken/2.1.2
db=$1
dbname=$(basename $1)
shift
[[ ${1:-none} != "none" ]] || exit 1
    
[[ -d ${db} ]] && cp -r $db /lscratch/$SLURM_JOB_ID || exit 1

for infile in "$@"; do
    [[ -f ${infile} ]] \
    && kraken2 --db /lscratch/$SLURM_JOB_ID/${dbname} --threads ${SLURM_CPUS_PER_TASK} \
        --output ${infile}.kraken ${infile}
done

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=72g --gres=lscratch:100 kraken2.sh \
    /fdb/kraken/20220803_standard_kraken2 sample0[0-9].fastq.gz
  
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. kraken2.swarm) that uses the script above to process several fastq files in each job:

kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample0[0-9].fastq.gz
kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample1[0-9].fastq.gz
kraken2.sh /fdb/kraken/20220803_standard_kraken2 sample2[0-9].fastq.gz

Submit this job using the swarm command.

swarm -f kraken2.swarm -g 72 -t 16 --gres=lscratch:100 --module kraken/2.1.2
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file)
--gres=lscratch:# Amount of local scratch space (in GB) required for each process (1 line in the swarm command file)
--module kraken/2.1.2 Loads the kraken module for each subjob in the swarm