High-Performance Computing at the NIH
jellyfish on Biowulf & Helix


Jellyfish counts k-mers in FASTA or FASTQ files. The counts are saved in a binary format that can be queried or dumped to a text-based format.
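To make the idea of k-mer counting concrete, here is a minimal sketch in plain shell and awk. It is an illustration only and says nothing about how jellyfish works internally; the function name count_kmers and the toy sequence are hypothetical.

```shell
#!/bin/bash
# Illustration only: count the k-mers of a single sequence with awk.
# count_kmers is a hypothetical helper, not part of jellyfish.
count_kmers() {
    local k=$1 seq=$2
    printf '%s\n' "$seq" | awk -v k="$k" '
        {
            # slide a window of width k across the sequence
            for (i = 1; i + k - 1 <= length($0); i++)
                n[substr($0, i, k)]++
        }
        END { for (m in n) print m, n[m] }
    ' | sort
}

# 3-mers of ACGACGT: ACG occurs twice, CGA, CGT and GAC once each
count_kmers 3 ACGACGT
```

This approach is only practical for tiny inputs; for real sequencing data use jellyfish itself.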

There may be multiple versions of jellyfish available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail jellyfish 

To select a module use

module load jellyfish/[version]

where [version] is the version of choice.

jellyfish is a multithreaded application. Make sure the number of threads passed to jellyfish (-t) matches the number of CPUs requested for the job.
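One way to keep the two in sync is to derive the thread count from the Slurm allocation instead of hard-coding it. A minimal sketch, assuming the job runs under Slurm, which sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; the variable name threads and the fallback of 2 are illustrative choices, not site defaults.

```shell
#!/bin/bash
# Use the CPU count allocated by Slurm; fall back to 2 threads
# (an arbitrary illustrative default) when the variable is unset,
# for example when testing outside of a job.
threads=${SLURM_CPUS_PER_TASK:-2}
echo "running jellyfish with $threads threads"
# The value would then be passed on as: jellyfish count -t "$threads" ...
```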

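Environment variables set

The module sets, among others, $JELLYFISH_TEST_DATA, the directory holding the example data used below.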

Interactive job on Biowulf

Allocate an interactive session with sinteractive and use jellyfish as described above:

biowulf$ sinteractive --mem=20g --cpus-per-task=6
node$ module load jellyfish
[+] Loading jellyfish 2.2.6
node$ cp $JELLYFISH_TEST_DATA/SRR786692.fastq.gz .
node$ # count 8-mers in the fastq file; without -o the counts go to mer_counts.jf
node$ # jellyfish does not natively read compressed data - decompress on the fly
node$ jellyfish count -t $SLURM_CPUS_PER_TASK \
        -m 8 -s 100M -C <(zcat SRR786692.fastq.gz)
node$ jellyfish histo -t $SLURM_CPUS_PER_TASK mer_counts.jf > hist
node$ head hist
32 1
33 1
51 1
53 1
56 1
61 1
69 2
78 1
79 2
83 1

node$ jellyfish query mer_counts.jf GCGGCCGC
node$ jellyfish dump -L 2 -o mer_counts.fa mer_counts.jf
node$ head mer_counts.fa

node$ exit

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is jellyfish.batch

module load jellyfish/2.2.6 || exit 1
jellyfish count -m 21 -s 100M -o SRR786692.jf \
    -t $SLURM_CPUS_PER_TASK -C <(zcat SRR786692.fastq.gz)

Submit to the queue with sbatch:

biowulf$ sbatch --mem=15g --cpus-per-task=10 jellyfish.batch
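
The histogram written by jellyfish histo in the interactive example has two columns: an occurrence count and the number of distinct k-mers seen that many times. The following is a minimal sketch of summarizing such a file with awk; the function name summarize_hist is hypothetical, and the sample input reuses the first three histogram lines from the interactive example.

```shell
#!/bin/bash
# summarize_hist is a hypothetical helper. It reads a two-column histogram
# (count, number of distinct k-mers with that count) on stdin and reports
# the number of distinct k-mers and the total number of k-mer occurrences.
summarize_hist() {
    awk '
        { distinct += $2; total += $1 * $2 }
        END { printf "distinct=%d total=%d\n", distinct, total }
    '
}

# first three lines of the example histogram
printf '32 1\n33 1\n51 1\n' | summarize_hist
# prints: distinct=3 total=116
```

On a real histogram file the call would be summarize_hist < hist.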