Biowulf User Guide
Information about job submission, job management and job monitoring on the NIH HPC Biowulf cluster.

Acknowledgement/Citation

The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the NIH Intramural Research Program. If you publish research that involved significant use of Biowulf, please cite the cluster. Suggested citation text:

This work utilized the computational resources of the NIH HPC Biowulf cluster. (http://hpc.nih.gov)

Connecting, Passwords, Email, Disk

Use 'ssh biowulf.nih.gov' to connect from the command line. See Connecting to the NIH HPC systems.

Your username and password are your NIH login username and password, as on Helix.

Computationally demanding or memory intensive processes are not permitted on the Biowulf login node. See Interactive jobs below.

Email from the Biowulf batch system goes to user@helix.nih.gov, which you probably have forwarded to your main email address.

Your /home, /data, and shared space are set up exactly the same on Helix and Biowulf. See the storage section for details.

Slurm Job Submission Summary

A summary of Biowulf job submission is available for download or printing (PDF).

Job Submission

Use the 'sbatch' command to submit a batch script.

Important sbatch flags:

--partition=partname Job to run on partition 'partname'. (default: 'norm')
--ntasks=# Number of tasks (processes) to be run
--cpus-per-task=# Number of CPUs required for each task (e.g. '8' for an 8-way multithreaded job)
--ntasks-per-core=1 Do not use hyperthreading (this flag typically used for parallel jobs)
--mem=#g Memory required for the job (Note the g (GB) in this option)
--exclusive Allocate the node exclusively
--no-requeue | --requeue Whether or not the job should be requeued if an allocated node hangs.
--error=/path/to/dir/filename Location of stderr file (by default, slurm######.out in the submitting directory)
--output=/path/to/dir/filename Location of stdout file (by default, slurm######.out in the submitting directory)
--license=idl:6 Request 6 IDL licenses (Minimum necessary for an instance of IDL)

More useful flags and environment variables are detailed in the sbatch manpage, which can be read on the system by invoking man sbatch.
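These flags can also be placed at the top of the job script itself as #SBATCH directives, which sbatch reads before the first executable line; flags given on the command line override them. A minimal sketch (the module, program, and file names are hypothetical):

#!/bin/bash
#SBATCH --cpus-per-task=8         # 8 CPUs for a multithreaded program
#SBATCH --mem=16g                 # 16 GB of memory (note the 'g')
#SBATCH --time=12:00:00           # 12 hour walltime
#SBATCH --output=myjob_%j.out     # stdout file; %j is replaced by the job ID

module load myprogram             # hypothetical module
myprogram --threads $SLURM_CPUS_PER_TASK input.txt > output.txt

Such a script can then be submitted with a plain 'sbatch jobscript'.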

Single-threaded batch job
[biowulf ~] sbatch jobscript
This job will be allocated 2 CPUs and 4 GB of memory.

Multi-threaded batch job

[biowulf ~] sbatch --cpus-per-task=# jobscript
The above job will be allocated '#' CPUs, and (# * 2) GB of memory. e.g. with --cpus-per-task=4, the default memory allocation is 8 GB of memory.

You should use the Slurm environment variable $SLURM_CPUS_PER_TASK within your script to specify the number of threads to the program. For example, to run a Novoalign job with 8 threads, set up a batch script like this:

#!/bin/bash

module load novocraft
novoalign -c $SLURM_CPUS_PER_TASK  -f s_1_sequence.txt -d celegans -o SAM > out.sam

and submit with:

sbatch --cpus-per-task=8  jobscript

Note: when jobs are submitted without explicitly specifying the number of CPUs per task, the $SLURM_CPUS_PER_TASK environment variable is not set.
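If the same script is sometimes submitted without --cpus-per-task, you can fall back to the default allocation of 2 CPUs using ordinary bash parameter expansion (a minimal sketch; 'myprogram' and its --threads flag are hypothetical):

#!/bin/bash

# use the Slurm-provided CPU count if set, otherwise assume the default 2 CPUs
THREADS=${SLURM_CPUS_PER_TASK:-2}
myprogram --threads $THREADS input.txt > output.txt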

Allocating more memory:

[biowulf ~] sbatch --mem=#g jobscript
The above job will be allocated # GB of memory. Note the g (GB) following the memory specification. Without this addition the job will allocate # MB of memory. Unless your job uses very little memory this will likely cause it to fail.

Exclusively allocating nodes

Add --exclusive to your sbatch command line to exclusively allocate a node. Note that the batch system will still limit you to the requested CPUs and memory, or to the default 2 CPUs and 4 GB if you do not specifically request CPUs and memory.

Auto-threading apps

Programs that 'auto-thread' (i.e. attempt to use all available CPUs on a node) should be run with the --exclusive flag. This will give you an exclusively allocated node with at least 16 CPUs.

A major change in the new cluster is that the batch system will allocate by core, rather than by node. Thus, your job may be allocated 4 cores and 8 GB of memory on a node which has 16 cores and 32 GB of memory. Other jobs may be utilizing the remaining 12 cores and 24 GB of memory, so that your jobs may not have exclusive use of the node. Slurm will not allow any job to utilize more memory or cores than were allocated.

The default Slurm allocation is 1 physical core (2 CPUs) and 4 GB of memory. For any jobs that require more memory or CPU, you need to specify these requirements when submitting the job. Examples:

Slurm on Biowulf
Command Allocation
sbatch jobscript 2 CPUs, 4 GB memory on a shared node.
sbatch --mem=8g 2 CPUs, 8 GB memory on a shared node.
sbatch --mem=8g --cpus-per-task=4 4 CPUs, 8 GB memory on a shared node.
sbatch --mem=24g --cpus-per-task=16 16 CPUs, 24 GB memory on a shared node.
sinteractive --mem=Mg --cpus-per-task=C interactive job with C CPUs and M GB of memory on a shared node.

Note: add --exclusive if you want the node allocated exclusively.

Swarm

Job arrays can be submitted on Biowulf using swarm. e.g.

swarm -g G -t T -f swarmfile --module afni
will submit a swarm job with each command (a single line in the swarm command file) allocated T CPUs (for T threads) and G GB of memory. You can use the environment variable $SLURM_CPUS_PER_TASK within the swarm command file to specify the number of threads to the program, as in the sketch below. See the swarm webpage for details.
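For example, a swarm command file is just a text file with one command per line; each line becomes an independent subjob with the resources given by -g and -t (a minimal sketch; 'myprogram', its -t flag, and the file names are hypothetical):

myprogram -t $SLURM_CPUS_PER_TASK sample1.fastq > sample1.out
myprogram -t $SLURM_CPUS_PER_TASK sample2.fastq > sample2.out
myprogram -t $SLURM_CPUS_PER_TASK sample3.fastq > sample3.out

submitted with, e.g., 'swarm -g 4 -t 4 -f swarmfile'.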

Parallel Jobs

Parallel (MPI) jobs that run on more than 1 node: Use the environment variable $SLURM_NTASKS within the script to specify the number of MPI processes. For example:

#!/bin/bash

module load meme
cd /data/$USER/mydir
meme infile params -p $SLURM_NTASKS

Submit with, for example:

sbatch --ntasks=C --constraint=nodetype --exclusive --ntasks-per-core=1 [--mem-per-cpu=Gg] jobscript
where:
--ntasks=C number of tasks (MPI processes) to run
--constraint=nodetype all nodes should be of the same type, e.g. 'x2650'.
--exclusive for jobs with interprocess communication, it is best to allocate the nodes exclusively
--ntasks-per-core=1 Most parallel jobs do better running only 1 process per physical core
[optional] --mem-per-cpu=Gg only needed if each process needs more than the default 2 GB per hyperthreaded core
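For example, to run a 64-process MPI job on exclusively allocated x2650 nodes with one process per physical core (a sketch; adjust the task count and constraint to your application):

sbatch --ntasks=64 --constraint=x2650 --exclusive --ntasks-per-core=1 jobscript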

See the webpage for the application for more details.

Partitions

Biowulf nodes are grouped into partitions. A partition can be specified when submitting a job. The default partition is 'norm'. The freen command can be used to see free nodes and CPUs, and available types of nodes on each partition.

Nodes available to all users
norm the default partition. Restricted to single-node jobs
multinode Intended to be used for large-scale parallel jobs. Single node jobs are not allowed. See here for detailed information.
largemem Large memory nodes. Reserved for jobs with memory requirements that cannot fit on the norm partition.
unlimited Reserved for jobs that require more than the default 10-day walltime. Note that this is a small partition with a low CPUs-per-user limit. Only jobs that absolutely require more than 10 days runtime, that cannot be split into shorter subjobs, or that are a first-time run where the walltime is unknown, should be run on this partition.
quick For jobs < 4 hours long. These jobs will run on the buyin nodes when they are free.
gpu GPU nodes reserved for applications that are built for GPUs.
Buyin nodes
ccr for NCI CCR users
niddk for NIDDK users
nimh for NIMH users
Allocating GPUs

To make use of GPUs, jobs have to be submitted to the gpu partition and specifically request the type and number of GPUs. For example:

# request one k20x GPU
[biowulf ~]$ sbatch --partition=gpu --gres=gpu:k20x:1 script.sh

# request two k20x GPUs
[biowulf ~]$ sbatch --partition=gpu --gres=gpu:k20x:2 script.sh

# request 1 k80 GPU and 8 CPUs on a single K80 node
[biowulf ~]$ sbatch --partition=gpu --cpus-per-task=8 --gres=gpu:k80:1 script.sh

# request all 4 k80 and 56 CPUs on a single K80 node
[biowulf ~]$ sbatch --partition=gpu --cpus-per-task=56 --gres=gpu:k80:4 script.sh

The k20x nodes have 2 GPUs and the K80s have 4 GPUs. The 'freen' command can be used to see the CPUs and memory on each type of node. For example:

biowulf% freen | grep gpu
gpu         66/68        3752/3808         28    56    250g   800g   cpu56,core28,g256,ssd800,x2680,ibfdr,gpuk80
gpu         0/20         284/640           16    32    123g   800g   cpu32,core16,g128,ssd800,x2650,gpuk20x
will show you that the k20x nodes have 16 cores (32 CPUs) and 123 GB of memory, and the K80s have 28 cores (56 CPUs) and 250 GB of memory. If you request a single k20x GPU, you should request no more than 16 CPUs (half the CPUs on the node). Likewise, if you allocate one K80 GPU, you should allocate no more than 14 CPUs (1/4 the CPUs on the node).

The request for the GPU resource is in the form resourceName:resourceType:number.

To allocate a GPU for an interactive session, e.g. to compile a program, use:

[biowulf ~]$ sinteractive --gres=gpu:k20x:1 
To request more than the default 2 CPUs, use
[biowulf ~]$ sinteractive --gres=gpu:k20x:1 --cpus-per-task=8 

Interactive Jobs

To allocate resources for an interactive job, use the sinteractive command. The options are essentially the same as for the sbatch command. e.g.

[biowulf ~]$ sinteractive
salloc.exe: Granted job allocation 22261

[cn0004 ~]$ ...some interactive commands....

[cn0004 ~]$exit
exit
salloc.exe: Relinquishing job allocation 22261
salloc.exe: Job allocation 22261 has been revoked.

[biowulf ~]$

The default 'sinteractive' allocation is 1 core (2 CPUs) and 4 GB of memory. You can request additional resources. e.g.

Command Allocation
sinteractive --cpus-per-task=4 4 CPUs (2 cores) on a single node
sinteractive --constraint=multinode --ntasks=64 --exclusive IB FDR nodes, 2 nodes exclusively allocated
sinteractive --constraint=x2650 --ntasks=16 --ntasks-per-core=1 16 cores on an x2650 node
sinteractive --mem=5g --cpus-per-task=8 8 CPUs and 5 Gigabytes of memory in the norm (default) partition

Use sinteractive -h to see available options. The batchlim command will tell you the limits for interactive jobs.

Note: when interactive jobs are submitted without explicitly specifying the number of CPUs per task, the $SLURM_CPUS_PER_TASK environment variable is not set.

Walltime Limits

Most partitions have walltime limits. Use batchlim to see the default and max walltime limits for each partition.

If no walltime is requested on the command line, the job will be given the default walltime for the partition (as reported by batchlim). To request a specific walltime, use the --time option to sbatch. For example:

sbatch  --time=24:00:00  jobscript
will submit a job to the norm partition, and request a walltime of 24 hrs. If the job runs over 24 hrs it will be killed by the batch system.

To see the walltime limits and current runtimes for jobs, you can use the 'squeue' command.

[user@biowulf ~]$ squeue -O jobid,timelimit,timeused -u  username
JOBID               TIME_LIMIT          TIME
1418444             10-00:00:00         5-05:44:09
1563535             5-00:00:00          1:35:12
1493019             3-00:00:00          2-17:03:27
1501256             5-00:00:00          2-03:08:42
1501257             5-00:00:00          2-03:08:42
1501258             5-00:00:00          2-03:08:42
1501259             5-00:00:00          2-03:08:42
1501260             5-00:00:00          2-03:08:42
1501261             5-00:00:00          2-03:08:42
For many more squeue options, see the squeue man page.

Licenses

Several licensed software products are available on the cluster, including MATLAB, IDL, and Mathematica. Starting in June 2016, MATLAB licenses can only be allocated to interactive jobs. (See this announcement.) To use other licensed software in your batch job, you must specify the --license flag when submitting your job. This flag ensures that the batch system will wait until a license is available before starting the job. If you do not specify this flag, there is a risk that the batch system will start your job, which will then be unable to get a license and will exit immediately.

Example:

sbatch --license=idl:6  jobscript       (request the 6 licenses necessary to run a single instance of IDL)
sinteractive --license=idl:6 	        (interactive job that needs to run IDL)

It is no longer necessary to specify MATLAB licenses when running MATLAB in an interactive session via the sinteractive command. The current availability of licenses can be seen on the Systems Status page, or by typing 'licenses' on the command line.

Deleting Jobs

The scancel command is used to delete jobs. Examples:

scancel  232323				(delete job 232323)
scancel -u username			(delete all jobs belonging to user)
scancel --name=JobName			(delete job with the name JobName)
scancel --state=PENDING                 (delete all PENDING jobs)
scancel --state=RUNNING                 (delete all RUNNING jobs)
scancel --nodelist=cn0005               (delete any jobs running on node cn0005)

Job States

Common job states:

Job State Code Means
R Running
PD Pending (Queued). Some possible reasons:
QOSMaxCpusPerUserLimit (User has reached the maximum allocation)
Dependency (Job is dependent on another job which has not completed)
Resources (currently not enough resources to run the job)
Licenses (job is waiting for a license, e.g. Matlab)
CG Completing
CA Cancelled
F Failed
TO Timeout
NF Node failure

Use the sacct command to check on the states of completed jobs.

Show all your jobs in any state since midnight:

sacct

Show all jobs that failed since midnight

sacct --state f

Show all jobs that failed this month

sacct --state f --starttime 2015-07-01 
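To look at a specific job, you can request individual fields such as the state, exit code, and peak memory usage (the job ID below is hypothetical):

sacct -j 1234567 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS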

Exit codes

The completion status of a job is essentially the exit status of the job script, with all the complications that entails. For example, take the following job script:

#! /bin/bash

module load GATK/2.3.4
GATK -m 5g -T RealignerTargetCreator ...
echo "DONE"

This script tries to load a non-existent GATK version and then calls GATK, which will fail. However, bash by default keeps executing even if commands fail, so the script will eventually print 'DONE'. Since the exit status of a bash script is the exit status of the last command, and echo returns 0 (SUCCESS), the script as a whole will exit with a code of 0, signaling success, and the job state will show COMPLETED, since Slurm uses the exit code to judge whether a job completed successfully.

Similarly, if a command in the middle of the job script were killed for exceeding memory, the rest of the job script would still be executed and could potentially return an exit code of 0 (SUCCESS), resulting again in a state of COMPLETED.

Conversely, in the following example a successful analysis is followed by a command that fails:

#! /bin/bash
module load GATK/3.4.0
GATK -m 5g -T RealignerTargetCreator ...
touch /file/in/non/existing/directory/DONE

Even though the actual analysis (here the GATK call) finished successfully, the last command will fail, resulting in a final state of FAILED for the batch job.

Some defensive bash programming techniques can help ensure that a job script will show a final state of FAILED if anything goes wrong.

Use set -e

Starting a bash script with set -e will tell bash to stop executing the script if a command fails and to signal failure with a non-zero exit code, which will be reflected as a FAILED state in Slurm.

#! /bin/bash
set -e

module load GATK/3.4.0
GATK -m 5g -T RealignerTargetCreator ...
echo "DONE"

One complication with this approach is that some commands return non-zero exit codes even when nothing is actually wrong; for example, grep returns a non-zero exit code when it finds no matches.
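If a command is expected to sometimes return a non-zero exit code that should not abort the script, you can explicitly allow it (a minimal sketch; the pattern and file names are hypothetical):

#!/bin/bash
set -e

# grep exits with a non-zero code when nothing matches; '|| true' keeps set -e from aborting
matches=$(grep -c "some_pattern" input.txt || true)
echo "found $matches matching lines"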

Check errors for individual commands

A more selective approach involves carefully checking the exit codes of the important parts of a job script. This can be done with conventional if/else statements or with the short-circuit evaluation (|| and &&) often seen in shell scripts. For example:

#! /bin/bash

function fail {
    echo "FAIL: $@" >&2
    exit 1  # signal failure
}

module load GATK/3.4.0 || fail "Could not load GATK module"
GATK -m 5g -T RealignerTargetCreator ... || fail "RealignerTargetCreator failed"
echo "DONE"
Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm's job dependencies options. For example, if you have two jobs, Job1.bat and Job2.bat, you can utilize job dependencies as in the example below.

[user@biowulf]$ sbatch Job1.bat
123213

[user@biowulf]$ sbatch --dependency=afterany:123213 Job2.bat
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of '0' means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.

There are several options for the '--dependency' flag that depend on the status of Job1. e.g.

--dependency=afterany:Job1     Job2 will start after Job1 completes with any exit status
--dependency=after:Job1        Job2 will start any time after Job1 starts
--dependency=afterok:Job1      Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1   Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is straightforward, as in the example below:

[user@biowulf]$ sbatch Job1.bat
13205

[user@biowulf]$ sbatch --dependency=afterany:13205 Job2.bat
13206

[user@biowulf]$ sbatch --dependency=afterany:13205 Job3.bat
13207

[user@biowulf]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205                

A job can also be made to depend on the completion of several other jobs, as in the example below:

[user@biowulf]$ sbatch Job1.bat
13201

[user@biowulf]$ sbatch Job2.bat
13202

[user@biowulf]$ sbatch --dependency=afterany:13201,13202 Job3.bat
13203

[user@biowulf]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.bat        R                                  
13202        Job2.bat        R                                  
13203        Job3.bat        PD   afterany:13201,afterany:13202 

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOBID  my_second_job

More detailed examples are shown on a separate page.

Using local disk

Each Biowulf node has some amount of local disk available for use. For new nodes this is generally 800GB of fast solid state storage. For older nodes this is generally a smaller amount of spinning disk. Use the freen command to see how much is available on each node type. For jobs that read/write lots of temporary files during the run, it may be advantageous to use the local disk as scratch or temp space.

The command

sbatch --gres=lscratch:500    jobscript

will allocate 500 GB of local scratch space from the /lscratch directory. Other jobs may allocate space from whatever remains on that node (e.g. the remaining 300 GB on an 800 GB node).

For multi-node jobs, each node will have the amount specified in the command line reserved for the job.

To access the directory allocated to the job, refer to it as /lscratch/$SLURM_JOBID. Users cannot read or write at the top level of /lscratch, but have full read/write access to the /lscratch/$SLURM_JOBID directory set up for the job.

When the job is terminated, all data in /lscratch/$SLURM_JOBID directory will be automatically deleted. Any data that needs to be saved should be copied to your /data directory before the job concludes.
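A common pattern is to stage input data into the job's local scratch directory, run the analysis there, and copy the results back to /data before the job ends. A minimal sketch, submitted with 'sbatch --gres=lscratch:100 jobscript' (the program and file names are hypothetical):

#!/bin/bash

cd /lscratch/$SLURM_JOBID                 # job-specific scratch directory
cp /data/$USER/input.bam .                # stage input onto fast local disk
myprogram input.bam > results.txt         # hypothetical analysis step
cp results.txt /data/$USER/               # copy results back before the job ends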

Performance of lscratch will suffer for all users on a node when large numbers of files are created in a single directory. Please avoid this situation by removing files that are no longer needed during the job, or by structuring your data differently (subdirectories, an sqlite3 database, a python shelf, ...).

See also our video tutorial on using lscratch.

Setting TMPDIR

TMPDIR is a near-universally agreed upon environment variable that defines where a program will write temporary files. By default, Unix systems set the value of TMPDIR to /tmp. On the Biowulf cluster, leaving TMPDIR set to /tmp can lead to problems, because /tmp on a compute node is small and shared by all jobs running on that node.

Because of this, users are strongly encouraged to allocate local scratch disk for their jobs and to set TMPDIR to that local scratch space. Because local scratch is not defined until the job begins running, TMPDIR must be set either within the batch script:

#!/bin/bash
export TMPDIR=/lscratch/$SLURM_JOB_ID

... run batch commands here ...

or once an interactive session begins:

[biowulf ~]$ sinteractive --gres=lscratch:5
salloc.exe: Granted job allocation 12345
[cn1234 ~]$ export TMPDIR=/lscratch/$SLURM_JOB_ID
[cn1234 ~]$ ...some interactive commands....

[cn1234 ~]$ exit
Requesting more than one GRES

To request more than one Generic Resource (GRES) like local scratch or GPUs, use the following format:

[biowulf ~]$ sinteractive --constraint=gpuk80 --gres=lscratch:10,gpu:k80:1

Note that this is not the same as specifying the --gres option multiple times, in which case only the last one is honored.

Cluster Status

Cluster status info is available on the System Status page. The partitions page shows free and allocated cores for each partition over the last 24 hrs.

On the command line, freen will report free nodes/cores on the cluster, and batchlim will report the current per-user limits and walltime limits on the partitions.

Monitoring Jobs

squeue will report all jobs on the cluster. squeue -u username will report your running jobs. An in-house variant of squeue is sjobs, which provides the information in a different format. Slurm commands like squeue are very flexible, so that you can easily create your own aliases.

Examples of squeue and sjobs:

[biowulf ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22392     norm meme_sho   susanc PD       0:00      1 (Dependency)
             22393     norm meme_sho   susanc PD       0:00      1 (Dependency)
             22404     norm   stmv-1   susanc  R      10:04      1 cn0414
             22391     norm meme_sho   susanc  R      10:06      1 cn0413

[biowulf ~]$ sjobs
                                                       ................Requested.................
User     JobId  JobName    Part  St    Runtime  Nodes CPUs     Mem        Dependency    Features    Nodelist
susanc   22391  meme_short norm  R      10:09   1     32   1.0GB/cpu                     (null)    cn0413         
susanc   22404  stmv-1     norm  R      10:07   1     32   1.0GB/cpu                     (null)    cn0414         
susanc   22392  meme_short norm  PD      0:00   1      2   1.0GB/cpu   afterany:22391    (null)    (Dependency)   
susanc   22393  meme_short norm  PD      0:00   1      4   1.0GB/cpu   afterany:22392    (null)    (Dependency)   
[More about sjobs]

jobload will report running jobs, the %CPU usage, and the memory usage. See here for details and an example.

jobhist will report the CPU and memory usage of completed jobs. See here for details and an example.

Email notifications

Using the --mail-type=<type> option to sbatch, users can request email notifications from SLURM as certain events occur. Multiple event types can be specified as a comma separated list. For example

[user@biowulf]$ sbatch --mail-type=BEGIN,TIME_LIMIT_90,END batch_script.sh

Available event types:

Event type      Description
BEGIN           Job started
END             Job finished
FAIL            Job failed
REQUEUE         Job was requeued
ALL             BEGIN, END, FAIL, REQUEUE
TIME_LIMIT_50   Job reached 50% of its time limit
TIME_LIMIT_80   Job reached 80% of its time limit
TIME_LIMIT_90   Job reached 90% of its time limit
TIME_LIMIT      Job reached its time limit
Modifying a job after submission

After a job is submitted, some of the submission parameters can be modified using the scontrol command. Examples:

Change the job dependency:

  scontrol update JobId=181766 dependency=afterany:18123                  
Request a matlab license:
  scontrol update JobId=181755 licenses=matlab                                
Job was submitted to the norm partition; move it to the ccr partition:
  scontrol update JobID=181755 partition=ccr QOS=ccr                                  
Walltimes on pending and running jobs can also be increased or decreased using the newwall command. Examples:

Reduce the walltime for job id 12345 to 2 hours.

  newwall --jobid 12345 --time 2:00:00                                 
Increase the walltime for job id 12345 to 8 hours.
  newwall --jobid 12345 --time 8:00:00                                 

See newwall --help for usage details.

Note: Users can only increase walltimes up to the walltime limit of the partition. If you need a walltime beyond the partition limit, contact staff@hpc.nih.gov.