High-Performance Computing at the NIH
jellyfish on Biowulf & Helix

Description

Jellyfish counts k-mers in fasta or fastq files. The k-mer counts are saved in a binary format that can be queried or dumped to a text-based format.

There may be multiple versions of jellyfish available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail jellyfish 

To select a module use

module load jellyfish/[version]

where [version] is the version of choice.

jellyfish is a multithreaded application. Make sure the number of threads passed with -t matches the number of CPUs requested for the job.
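
One way to keep the thread count in step with the allocation is to read it from the Slurm environment instead of hard-coding it. The following is a minimal sketch: pick_threads is a hypothetical helper name, and the fallback to a single thread when SLURM_CPUS_PER_TASK is unset (for example, outside a job) is an assumption, not site policy.

```shell
# Sketch: take the jellyfish thread count from the Slurm allocation.
# pick_threads is a hypothetical helper; the fallback of 1 thread when
# SLURM_CPUS_PER_TASK is unset (outside a job) is an assumption.
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(pick_threads)
# Only print the command here; the real call would pass -t "$threads".
echo "would run: jellyfish count -t ${threads} ..."
```

Inside a job started with --cpus-per-task=6, this prints a command line with -t 6.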

Environment variables set

References

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive and run jellyfish as shown below:

biowulf$ sinteractive --mem=20g --cpus-per-task=6
node$ module load jellyfish
[+] Loading jellyfish 2.2.6
node$ cp $JELLYFISH_TEST_DATA/SRR786692.fastq.gz .
node$ # count 8-mers in fastq file
node$ # jellyfish does not natively read compressed data - uncompress on the fly
node$ jellyfish count -t $SLURM_CPUS_PER_TASK \
        -m 8 -s 100M -C <(zcat SRR786692.fastq.gz)
node$ jellyfish histo -t $SLURM_CPUS_PER_TASK mer_counts.jf > hist
node$ head hist
32 1
33 1
51 1
53 1
56 1
61 1
69 2
78 1
79 2
83 1

node$ jellyfish query mer_counts.jf GCGGCCGC
GCGGCCGC 1446
node$ jellyfish dump -L 2 -o mer_counts.fa mer_counts.jf
node$ head mer_counts.fa
>25859
AAAAAAAA
>1277
CGAACCAC
>2811
AAGGAACG
>4573
GTTTTTCA
>4210
CTAACACA

node$ exit
biowulf$
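
Each line of the histogram file written by jellyfish histo holds a count followed by the number of k-mers that occur that many times. As a rough sketch of how such a file could be summarized, the hypothetical helper summarize_hist below adds up the number of distinct k-mers and the total number of k-mer occurrences. The sample file is made up for illustration and is not the hist file from the session above.

```shell
# Sketch: summarize a two-column histogram (count, number of k-mers).
# summarize_hist is a hypothetical helper, not part of jellyfish.
summarize_hist() {
    awk '{ distinct += $2; total += $1 * $2 }
         END { printf "distinct=%d total=%d\n", distinct, total }' "$1"
}

# Made-up sample in the same format as the histo output.
sample_hist=$(mktemp)
printf '32 1\n33 1\n69 2\n' > "$sample_hist"
summarize_hist "$sample_hist"    # prints: distinct=4 total=203
```

On a real run, the same function would be called as summarize_hist hist.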

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is jellyfish.batch

module load jellyfish/2.2.6 || exit 1
jellyfish count -m 21 -s 100M -o SRR786692.jf \
    -t $SLURM_CPUS_PER_TASK -C <(zcat SRR786692.fastq.gz)

Submit to the queue with sbatch:

biowulf$ sbatch --mem=15g --cpus-per-task=10 jellyfish.batch
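
Because the batch script streams the compressed file through zcat, a truncated or corrupt input would only surface once the job is running. A quick check before submitting might look like the sketch below. check_fastq_gz is a hypothetical helper name, and the read count assumes four lines per FASTQ record (no wrapped sequence lines).

```shell
# Sketch: sanity-check a gzipped FASTQ file before submitting a job.
# check_fastq_gz is a hypothetical helper, not part of jellyfish.
check_fastq_gz() {
    # gzip -t tests the integrity of the compressed stream
    gzip -t < "$1" || return 1
    # Assumes 4 lines per FASTQ record (unwrapped sequences)
    gzip -dc < "$1" | awk 'END { printf "reads=%d\n", NR / 4 }'
}

# Made-up single-read file for illustration.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$demo_dir/sample.fastq.gz"
check_fastq_gz "$demo_dir/sample.fastq.gz"    # prints: reads=1
```

For the example above, the call would be check_fastq_gz SRR786692.fastq.gz, followed by sbatch only if the check succeeds.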