High-Performance Computing at the NIH
Datamash on Biowulf & Helix

Description

datamash is a command-line program that performs basic numeric, textual, and statistical operations on textual input files.

Multiple versions may be available on our systems. The easiest way to select a version is with environment modules. To see the available modules, type:

module avail datamash

To select a module, use:

module load datamash/[version]

where [version] is the version of choice.

Examples

/usr/local/apps/datamash/examples

Environment variables set
Documentation

https://www.gnu.org/software/datamash/

Interactive Job on Biowulf

Allocate an interactive session with sinteractive and run datamash as shown below:

biowulf$ sinteractive --mem=10g
salloc.exe: Pending job allocation 38978697
[...snip...]
salloc.exe: Nodes cn2273 are ready for job
node$ module load datamash
[+] Loading datamash
node$ datamash --sort --headers groupby 2 mean 3 sstdev 3 < scores_h.txt
[...snip...]
node$ exit
biowulf$
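
To see what the groupby command in the session above computes, the following minimal sketch builds a small tab-separated file and runs the same command on it. The file name scores_demo.txt and its contents are made up for illustration and only assume the layout the command implies: a header line, the grouping key in column 2, and numeric values in column 3.

```shell
#!/bin/bash
# Hypothetical demo input: header line plus four tab-separated records.
# printf reuses the format string for every group of three arguments.
printf '%s\t%s\t%s\n' \
    Name Major Score \
    Ann  Arts  80 \
    Bob  Arts  90 \
    Cy   Math  70 \
    Di   Math  90 > scores_demo.txt

# Group by column 2 (Major), then report the mean and the sample
# standard deviation of column 3 (Score) for each group.
if command -v datamash >/dev/null 2>&1; then
    datamash --sort --headers groupby 2 mean 3 sstdev 3 < scores_demo.txt
else
    echo "datamash not found; run 'module load datamash' first" >&2
fi
```

With this input, datamash should print one header line followed by one line per major. The --sort option is needed because grouping works on adjacent lines, and --headers tells datamash that the first line holds column names.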

Batch job on Biowulf

Create a batch script similar to the following example:

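#!/bin/bash
# this file is file.batch

# stop immediately if the module cannot be loaded
module load datamash || exit 1
# stop if the data directory is not accessible
cd "/data/$USER" || exit 1
# group by column 2; report mean and sample standard deviation of column 3
datamash --sort --headers groupby 2 mean 3 sstdev 3 < scores_h.txt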

Submit to the queue with sbatch:

biowulf$ sbatch file.batch
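
As a variation, resource requests can be placed in the script itself through #SBATCH directives. The sketch below writes such a script with a here-document and checks its syntax before submission. The job name, the memory and time values, and the output file name scores_summary.txt are illustrative assumptions, not recommended settings.

```shell
#!/bin/bash
# Write a batch script with embedded Slurm directives.
# The quoted 'EOF' keeps $USER unexpanded until the job actually runs.
cat > file.batch <<'EOF'
#!/bin/bash
#SBATCH --job-name=datamash_demo
#SBATCH --mem=2g
#SBATCH --time=10:00

module load datamash || exit 1
cd "/data/$USER" || exit 1
# Hypothetical output file; results would otherwise go to the Slurm output file.
datamash --sort --headers groupby 2 mean 3 sstdev 3 < scores_h.txt > scores_summary.txt
EOF

# Parse the script without executing it, to catch syntax errors early.
bash -n file.batch && echo "file.batch: syntax OK"
```

The script is then submitted with sbatch file.batch as before. Options given on the sbatch command line take precedence over the #SBATCH lines.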

Swarm of Jobs on Biowulf

Create a swarmfile (e.g. script.swarm). For example:

# this file is called script.swarm
cd dir1; datamash command 1; datamash command 2
cd dir2; datamash command 1; datamash command 2
cd dir3; datamash command 1; datamash command 2
[...]

Submit this job using the swarm command:

swarm -f script.swarm --module datamash
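
When there are many directories, the swarmfile can be generated with a short loop instead of being written by hand. This is a minimal sketch: the directory names dir1 to dir3, the input file scores_h.txt in each directory, and the output name summary.txt are assumptions for illustration.

```shell
#!/bin/bash
swarmfile=script.swarm

# Start with an empty swarmfile, then append one command line per directory.
: > "$swarmfile"
for d in dir1 dir2 dir3; do
    printf 'cd %s; datamash --sort --headers groupby 2 mean 3 < scores_h.txt > summary.txt\n' "$d" >> "$swarmfile"
done

# Show the result for inspection before submitting it.
cat "$swarmfile"
```

Each line of the generated file becomes one independent job in the swarm, so every line has to change into its own directory before running datamash.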

For more information about swarm, see https://hpc.nih.gov/apps/swarm.html#usage