kraken on Biowulf & Helix

Description

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. It uses exact matches of k-mers against a reference database.

There may be multiple versions of kraken available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail kraken 

To select a module use

module load kraken/[version]

where [version] is the version of choice.

Kraken is a multithreaded application. Make sure to match the number of CPUs requested with the number of threads.
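
For example, in a batch script the thread count can be tied directly to the allocated CPUs via the $SLURM_CPUS_PER_TASK variable (a minimal sketch; input.fa and out.kraken are placeholder names):

#SBATCH --cpus-per-task=16
kraken --db $KRAKEN_DB/standard --fasta-input \
    --threads $SLURM_CPUS_PER_TASK \
    --output out.kraken input.fa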

Environment variables set

KRAKEN_DB=/fdb/kraken

References

Wood DE and Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.

Documentation

Kraken home page: http://ccb.jhu.edu/software/kraken/

Databases

The kraken scripts used to generate the classification databases had to be patched to use the old RefSeq genomes archive, since they rely on GI-to-taxonomy mappings. For bacteria, for example, that is genomes/archive/old_refseq/Bacteria/all.fna.tar.gz on the NCBI FTP site, which does not include all RefSeq bacterial genome sequences and is no longer updated.

Kraken databases used to classify sequences can be found in /fdb/kraken, which is also stored in the KRAKEN_DB environment variable.

The standard database contains genomes of archaea, bacteria, and viruses from NCBI. There are also separate databases for viruses and bacteria.
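
To see which databases are currently available, list the database directory (a quick check; the standard and dated virus-only databases mentioned in this document are examples of what may be present):

helix$ ls $KRAKEN_DB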

Kraken accesses the large database file many times. Therefore it is advantageous to use the smallest possible database and allocate sufficient memory for the operating system to cache the whole database in memory. In addition, to avoid straining the shared file systems, it is best to copy the database to local disk on a node and then run multiple kraken jobs against that database. Example code for how to do this is given below.

On Helix

Only small-scale jobs should be run on Helix. Set up the environment for kraken:

helix$ module avail kraken
------------------- /usr/local/lmod/modulefiles ------------------------
   kraken/0.10.5-beta

helix$ module load kraken
[+] Loading kraken 0.10.5-beta
helix$ echo $KRAKEN_DB
/fdb/kraken
helix$ kraken -h
Usage: kraken [options] <filename(s)>

Options:
  --db NAME               Name for Kraken DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --fasta-input           Input is FASTA format
  --fastq-input           Input is FASTQ format
  --gzip-compressed       Input is gzip compressed
  --bzip2-compressed      Input is bzip2 compressed
  --quick                 Quick operation (use first hit or hits)
  --min-hits NUM          In quick op., number of hits req'd for classification
                          NOTE: this is ignored if --quick is not specified
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --only-classified-output
                          Print no Kraken output for unclassified sequences
  --preload               Loads DB into memory before classification
  --paired                The two filenames provided are paired-end reads
  --check-names           Ensure each pair of reads have names that agree
                          with each other; ignored if --paired is not specified
  --help                  Print this message
  --version               Print version information

If none of the *-input or *-compressed flags are specified, and the 
file is a regular file, automatic format detection is attempted.

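For example, a gzip-compressed FASTQ file could be classified by combining the flags above (a sketch; reads.fastq.gz and reads.kraken are hypothetical file names):

helix$ kraken --db /fdb/kraken/standard --fastq-input --gzip-compressed \
    --threads 4 --output reads.kraken reads.fastq.gz
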
Test a set of sequences with known classifications against the standard database. The data for this is found in /usr/local/apps/kraken/TEST_DATA:

helix$ cp -r /usr/local/apps/kraken/TEST_DATA/accuracy . && cd accuracy
helix$ ls -lh
total 7.5M
-rw-rw-r-- 1 wresch staff 2.2K Aug 14 09:55 ACCURACY.README
-rw-rw-r-- 1 wresch staff 1.2M Aug 14 09:55 HiSeq_accuracy.fa
-rw-rw-r-- 1 wresch staff 1.8M Aug 14 09:55 MiSeq_accuracy.fa
-rw-rw-r-- 1 wresch staff 4.6M Aug 14 09:55 simBA5_accuracy.fa
-rw-rw-r-- 1 wresch staff  396 Aug 14 09:55 taxonomyDocSum.xslt
helix$ kraken --db /fdb/kraken/standard --fasta-input \
  --output HiSeq.kraken HiSeq_accuracy.fa
10000 sequences (0.92 Mbp) processed in 187.989s (3.2 Kseq/m, 0.29 Mbp/m).
  7891 sequences classified (78.91%)
  2109 sequences unclassified (21.09%)
helix$ head HiSeq.kraken
C       A_hydrophila_HiSeq.922  380703  101     642:23 1224:14 642:5 380703:1 0:28
C       A_hydrophila_HiSeq.1263 642     101     2:19 1236:36 642:15 2:1
C       A_hydrophila_HiSeq.2905 380703  101     380703:8 0:52 644:9 642:2
C       A_hydrophila_HiSeq.4866 642     101     642:46 0:25
C       A_hydrophila_HiSeq.7009 380703  101     0:13 380703:14 0:31 1224:9 644:4

Column three is the predicted taxid. Let's get the name and rank of each unique taxid from NCBI eutils so the predictions can be compared with the known classifications:

helix$ cut -f3 HiSeq.kraken | sort -u | grep -wv 0 > taxid.uniq
helix$ esum="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
helix$ for taxid in `cat taxid.uniq`; do
 curl -s "${esum}?db=taxonomy&id=${taxid}" | xsltproc taxonomyDocSum.xslt -
done > taxa.uniq
helix$ join -1 3 -2 1 -t '|' \
     <(sort -k3,3 HiSeq.kraken | tr '\t' '|') \
     taxa.uniq \
  | awk -F'|' '{printf("%s prediction: %s(%s)\n", $3, $6, $7)}' \
  | sort > HiSeq.kraken.pred
helix$ tail -n -100 HiSeq.kraken.pred | head -n 5
X_axonopodis_HiSeq.873332 prediction: Xanthomonas(genus)
X_axonopodis_HiSeq.874847 prediction: Xanthomonas oryzae pv. oryzae()
X_axonopodis_HiSeq.875312 prediction: Xanthomonas(genus)
X_axonopodis_HiSeq.87911 prediction: Xanthomonas fuscans subsp. fuscans(subspecies)
X_axonopodis_HiSeq.881987 prediction: Xanthomonas(genus)

Browse the HiSeq.kraken.pred file to compare the actual and predicted taxonomic classifications.
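
As a quick summary, the predicted ranks can be tallied with awk (a sketch that relies on the name(rank) format shown above):

helix$ awk -F'[()]' '{print $2}' HiSeq.kraken.pred | sort | uniq -c | sort -rn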

Batch job on Biowulf

Create a batch script similar to the following to run a number of kraken jobs against a given database. This particular script uses a virus-only database and takes as its argument a file listing the FASTA files to be processed. It copies the database to local disk before running jobs with GNU parallel.

#! /bin/bash
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --gres=lscratch:200
#SBATCH --mail-type=BEGIN,END,FAIL

################################################################################
#                                  Functions                                   #
################################################################################

function log() {
    local level=$1
    shift
    echo "${level}:$(date '+%F %T'):$*" >&2
}
function info() {
    log INFO "$@"
}
function error() {
    log ERR "$@"
}
function fail() {
    log ERR "$@"
    exit 1
}

################################################################################
#                                    Setup                                     #
################################################################################

# database to be used - adapt to your needs
KDB_PATH=/fdb/kraken
KDB_NAME=20160111_viruses

if [[ $# -ne 1 ]]; then
    echo "USAGE:"
    echo "  sbatch kraken_batch.sh seqfile_list"
    echo ""
    echo "where seqfile_list is a file listing input files (one per line)"
    exit 1
fi
if [[ -e $1 ]]; then
    info "will process $(wc -l < "$1") files in this run"
    input_files="$1"
else
    fail "'$1' does not exist"
fi

# load modules
module load kraken/0.10.5-beta || fail "unable to load kraken module"
module load parallel || fail "unable to load gnu parallel module"

# copy the standard kraken db to local disk

if [[ ! -d /lscratch/${SLURM_JOBID}/${KDB_NAME} ]]; then
    info "START: copy kraken database '${KDB_PATH}/${KDB_NAME}'"
    cp -r ${KDB_PATH}/${KDB_NAME} /lscratch/${SLURM_JOBID} \
        || fail "Unable to copy kraken db to local disk"
    info "DONE:  copy kraken database"
else
    info "SKIP:  local kraken database already exists"
fi

# set up a temp dir for gnu parallel
tmpd=$(mktemp -d /lscratch/${SLURM_JOBID}/XXXX)
info "gnu parallel temp dir: ${tmpd}"


################################################################################
#                                   Run jobs                                   #
################################################################################
# run kraken on the sequence files provided on the command line

cmd="kraken --db /lscratch/${SLURM_JOBID}/${KDB_NAME} --fasta-input --threads=8 --output {/.} {} &> {/.}.log"
info "command: ${cmd}"

info "START: gnu parallel run"
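# run 4 kraken processes at a time; at --threads=8 each this matches the
# 32 CPUs requested with --cpus-per-task above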
parallel -j4 \
    --tmpdir=${tmpd} \
    --tagstring {/.} \
    --joblog=./parallel.log \
    --keep-order \
    --verbose \
    -a ${input_files} \
    "$cmd"
info "DONE:  gnu parallel run"

and submit it to the queue with

biowulf$ sbatch kraken_batch.sh file_listing_target_files

Interactive job on Biowulf

Allocate an interactive session and work as described above for Helix. The same cautions about using databases on networked file systems apply here. Please copy the database to be used to local disk first:

biowulf$ sinteractive --cpus-per-task=16 --mem=80G --gres=lscratch:200
salloc.exe: Granted job allocation 1296827
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0021 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn0021$ module load kraken
cn0021$ cp -r $KRAKEN_DB/standard /lscratch/$SLURM_JOBID
cn0021$ kraken --db /lscratch/$SLURM_JOBID/standard --fasta-input \
  --threads $SLURM_CPUS_PER_TASK \
  --output HiSeq.kraken HiSeq_accuracy.fa
cn0021$ exit
biowulf$