High-Performance Computing at the NIH
mash on Biowulf & Helix

Description

Mash uses MinHash sketching to reduce large sequences to small, representative sketches. Distances between sketches can be computed very rapidly and provide an estimate of average nucleotide identity. Sketches of all the genomes in RefSeq release 70 take up only ~90 MB (at a k-mer size of 16 using 400 hashes).
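The MinHash idea can be illustrated in a few lines of Python. This is a conceptual sketch only, not how mash is implemented (mash uses MurmurHash3 and a binary sketch format; the SHA-1 hash and function names here are illustrative choices). The distance formula is the Mash distance D = -(1/k) * ln(2j / (1 + j)), where j is the estimated Jaccard index:

```python
import hashlib
import math

def sketch(seq, k=16, s=400):
    """Bottom-s MinHash sketch: the s smallest hash values over all k-mers."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = sorted(int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers)
    return set(hashes[:s])

def jaccard_estimate(sk1, sk2, s=400):
    """Estimate the Jaccard index from the s smallest hashes of the union."""
    merged = sorted(sk1 | sk2)[:s]
    return sum(1 for h in merged if h in sk1 and h in sk2) / len(merged)

def mash_distance(j, k=16):
    """Mash distance: D = -(1/k) * ln(2j / (1 + j)); 1.0 when no hashes are shared."""
    if j <= 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k
```

Identical sequences give a Jaccard estimate of 1 and a distance of 0; sequences sharing no k-mers give a distance of 1.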

On Helix

Set up the environment to put the single executable (mash) on the path

helix$ module load mash

Some example data is included in the TEST_DATA directory inside the application directory.

helix$ td=/usr/local/apps/mash/TEST_DATA
helix$ ls $td
BA000007.2.fna  CP009273.1.fna  NC_000913.2.fna  RefSeqGenomes_V70.msh  SRR292770_1.fastq.gz

Estimate the distance between two E. coli genomes

helix$ mash dist $td/BA000007.2.fna $td/NC_000913.2.fna
/usr/local/apps/mash/TEST_DATA/BA000007.2.fna   /usr/local/apps/mash/TEST_DATA/NC_000913.2.fna  0.0222766       0       456/1000

Each output line shows the reference sequence, the query sequence, the distance estimate, the p-value, and the number of matching hashes out of the sketch size.
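Because the output is tab-separated, it is easy to post-process. A small Python helper (hypothetical, not part of mash) that parses one output line into its five fields:

```python
def parse_mash_dist(line):
    """Parse one tab-separated line of 'mash dist' output.

    Fields: reference, query, distance, p-value, matching/total hashes.
    """
    ref, query, dist, p, hashes = line.rstrip("\n").split("\t")
    matched, total = hashes.split("/")
    return {
        "reference": ref,
        "query": query,
        "distance": float(dist),
        "p_value": float(p),
        "shared_hashes": (int(matched), int(total)),
    }
```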

Sketches do not have to be recalculated for each comparison; they can be precomputed and stored. For example, we can sketch the two genomes from above

helix$ mash sketch -o 2coli.msh $td/BA000007.2.fna $td/NC_000913.2.fna
Sketching /usr/local/apps/mash/TEST_DATA/BA000007.2.fna...
Sketching /usr/local/apps/mash/TEST_DATA/NC_000913.2.fna...
Writing to 2coli.msh...
helix$ ll 2coli.msh
-rw-r--r-- 1 user group  16K Dec  4 09:56 2coli.msh

So the sketches for roughly 10 MB of genome sequence take up only 16 kB. Now we can compare a query against the precomputed sketches with

helix$ mash dist 2coli.msh $td/CP009273.1.fna
BA000007.2.fna   CP009273.1.fna   0.0222766      0       456/1000
NC_000913.2.fna  CP009273.1.fna   0              0      1000/1000

Batch job on Biowulf

Create a batch script similar to the following

#! /bin/bash

function fail {
  echo "$@" >&2
  exit 1
}

module load mash || fail "could not load mash module"
# -i: sketch individual sequences; -k 21 / -s 1000: k-mer size and sketch size;
# -l: treat the input file as a list of sequence file paths
mash sketch -i -k 21 -s 1000 -o coli.msh -l list_of_ecoli_genomes || fail "mash sketch failed"

Submit the job to the queue

b2$ sbatch mash_batch_script.sh

Swarm of jobs on Biowulf

Mash can efficiently concatenate sketch files with mash paste. Therefore, one way of parallelizing mash is to run a swarm of jobs, each sketching a single genome, and then merge all the sketches into a single file. For example, create the following swarm command file

mash sketch -o NC_000913.2.msh /usr/local/apps/mash/TEST_DATA/NC_000913.2.fna
mash sketch -o BA000007.2.msh /usr/local/apps/mash/TEST_DATA/BA000007.2.fna
mash sketch -o CP009273.1.msh /usr/local/apps/mash/TEST_DATA/CP009273.1.fna
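For more than a handful of genomes, a swarm file like the one above can be generated with a short script. This is a hypothetical helper (the .fna suffix, paths, and output name are assumptions), not part of mash or swarm:

```python
from pathlib import Path

def write_sketch_swarm(genome_dir, swarm_path="mash_sketch.swarm"):
    """Write one 'mash sketch' command per FASTA (.fna) file in genome_dir."""
    commands = [
        f"mash sketch -o {fasta.stem}.msh {fasta}"
        for fasta in sorted(Path(genome_dir).glob("*.fna"))
    ]
    Path(swarm_path).write_text("\n".join(commands) + "\n")
    return commands
```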

Submit the jobs to the queue with

b2$ swarm -f mash_sketch.swarm -J mash --module mash

And create a concatenation job with

#! /bin/bash
module load mash || exit 1
mash paste ecoli NC_000913.2.msh BA000007.2.msh CP009273.1.msh

And use a job dependency to run this paste job after all the individual sketch jobs finish

b2$ sbatch --dependency=singleton -J mash paste.sh
b2$ squeue -u $USER
        JOBID PARTITION  NAME  USER ST   TIME  NODES NODELIST(REASON)
7323657_[0-2]      norm  mash  user PD   0:00      1 (None)
      7323660      norm  mash  user PD   0:00      1 (Dependency)

Interactive job on Biowulf

Allocate an interactive session and then use mash as described above

b2$ sinteractive
salloc.exe: Pending job allocation 7323689
salloc.exe: job 7323689 queued and waiting for resources
salloc.exe: job 7323689 has been allocated resources
salloc.exe: Granted job allocation 7323689
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1516 are ready for job
cn1516$ module load mash
cn1516$ mash dist ecoli.msh /usr/local/apps/mash/TEST_DATA/NZ_CP007025.1.fna