Biowulf High Performance Computing at the NIH
Online class: Introduction to Biowulf

Hands-On: Bundling a swarm of jobs

In the previous Data Storage hands-on section, you should have copied the class scripts to your /data area. If you skipped or missed that section, type

hpc-classes biowulf
now. This command will copy the scripts and input files used in this online class to your /data area, and will take about 5 minutes.

In the following session, you will resubmit the Blat swarm that you ran earlier, but this time as a swarm bundle. If you're not familiar with Blat, don't worry -- this is just an example. The basic principles of job submission are not specific to Blat.

cd /data/$USER/hpc-classes/biowulf/swarm

# submit the swarm
swarm -f blat.swarm -g 4 --module blat -b 4


How many jobs were created?


1 job array with 8 subjobs (30 commands divided into bundles of 4, rounded up: seven subjobs run 4 commands and one runs 2). Each subjob corresponds to up to 4 lines in the blat.swarm file, and those blat runs execute sequentially on the same allocated CPUs.
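The subjob count is just ceiling division of the number of command lines by the bundle factor. A minimal sketch (the line count of 30 is taken from this exercise; in practice you would get it from 'wc -l < blat.swarm'):

```shell
# Compute the number of subjobs for a bundled swarm: ceiling(lines / bundle).
lines=30      # number of command lines in blat.swarm (from this exercise)
bundle=4      # the -b 4 bundle factor
subjobs=$(( (lines + bundle - 1) / bundle ))   # integer ceiling division
echo "$subjobs subjobs"                        # -> 8 subjobs
```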

Will the subjobs take more or less time than if submitted without the '-b 4' flag?


This is actually more complicated than you might think.

Each subjob runs 4 commands sequentially, so it could take up to 4 times as long as a single job submitted without the '-b 4' flag. However, blat is a pattern-matching program that reads the file /fdb/genome/hg19/chr10.fa. With a bundled swarm, this file is cached in memory on the node after the first run, so the 2nd, 3rd, and 4th runs will be faster (this is especially true for a larger database file such as the entire human genome, /fdb/genome/hg19/chr_all.fa, which is 3 GB). Thus, if I/O is a major factor in the job, and the jobs are set up correctly (sufficient memory), a bundled swarm could take less total time than an unbundled swarm.
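The caching effect can be sketched with back-of-the-envelope arithmetic. The read and compute times below are made-up assumptions for illustration, not measured Blat timings:

```shell
# Illustrative only: assumed (not measured) timings for one blat run.
read_secs=60      # assumed time to read the database file from disk
compute_secs=30   # assumed compute time per blat run
runs=4            # the -b 4 bundle factor

# Without caching, each of the 4 runs pays the full read cost.
uncached_total=$(( runs * (read_secs + compute_secs) ))

# Bundled on one node: the file is cached in memory after the first read,
# so runs 2-4 skip the read cost.
bundled_total=$(( read_secs + runs * compute_secs ))

echo "uncached: ${uncached_total}s, bundled: ${bundled_total}s"
```

With these assumed numbers the bundled subjob finishes in half the uncached sequential time, which is why I/O-heavy bundles can come out ahead.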

Another factor to consider is how busy the cluster is. If there are lots of free resources, all the subjobs of an unbundled swarm may start up right away and complete quickly. But if the cluster is busy, only a few of the jobs may run at a time. In that case, you may be better off bundling your swarm, so that you have only a few subjobs and they all start quickly.

Will all the subjobs take the same amount of time?


One of the subjobs has only 2 commands, so it will run faster than the subjobs that have 4 commands. You can check the time taken by each subjob with 'jobhist jobid'. If all the subjobs take about the same amount of time, that indicates that the major factor in the walltime is the job's Input/Output rather than the processing time.

What are the advantages and disadvantages of increasing the bundle factor, e.g. '-b 10'?


Advantage: if the cluster is very busy, a smaller number of jobs will get scheduled faster.
Disadvantage: each subjob runs more commands sequentially, so it will take longer.
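The tradeoff can be sketched for this exercise's 30 commands, comparing '-b 4' with '-b 10' (subjob count is ceiling division; worst-case subjob length grows with the bundle factor):

```shell
# Compare bundle factors for a 30-command swarm (this exercise's blat.swarm).
commands=30
for b in 4 10; do
  subjobs=$(( (commands + b - 1) / b ))
  echo "-b $b -> $subjobs subjobs, each running up to $b commands sequentially"
done
```

So '-b 10' cuts the swarm from 8 subjobs to 3, at the cost of each subjob running up to 10 commands back-to-back.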