Biowulf User Guide
Information about job submission, job management and job monitoring on the NIH HPC Biowulf cluster.

Acknowledgement/Citation

The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the NIH Intramural Research Program. If you publish research that involved significant use of Biowulf, please cite the cluster. Suggested citation text:

This work utilized the computational resources of the NIH HPC Biowulf cluster. (http://hpc.nih.gov)

Connecting, Passwords, Email, Disk

Use 'ssh biowulf.nih.gov' to connect from the command line. See Connecting to the NIH HPC systems.

Your username and password are your NIH login username and password, as on Helix.

Computationally demanding or memory intensive processes are not permitted on the Biowulf login node. See Interactive jobs below.

Email from the Biowulf batch system goes to user@helix.nih.gov, which you probably have forwarded to your main email address.

Your /home, /data, and shared space are set up exactly the same on Helix and Biowulf. See the storage section for details.

Slurm Job Submission Summary

A summary of Biowulf job submission is available for download or printing (PDF).

Job Submission

Use the 'sbatch' command to submit a batch script.

Important sbatch flags:

--partition=partname Job to run on partition 'partname'. (default: 'norm')
--ntasks=# Number of tasks (processes) to be run
--cpus-per-task=# Number of CPUs required for each task (e.g. '8' for an 8-way multithreaded job)
--ntasks-per-core=1 Do not use hyperthreading (this flag typically used for parallel jobs)
--mem=#g Memory required for the job (Note the g (GB) in this option)
--exclusive Allocate the node exclusively
--no-requeue | --requeue Whether or not the job should be requeued if an allocated node hangs.
--error=/path/to/dir/filename Location of stderr file (by default, slurm######.out in the submitting directory)
--output=/path/to/dir/filename Location of stdout file (by default, slurm######.out in the submitting directory)
--license=idl:6 Request 6 IDL licenses (Minimum necessary for an instance of IDL)

More useful flags and environment variables are detailed in the sbatch manpage, which can be read on the system by invoking man sbatch.
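These flags can also be placed at the top of the job script itself as #SBATCH directives, which sbatch reads before the first executable line; flags given on the command line override them. A minimal sketch (the module, program, and file names are hypothetical):

#!/bin/bash
#SBATCH --cpus-per-task=8         # 8 CPUs for a multithreaded program
#SBATCH --mem=16g                 # 16 GB of memory (note the 'g')
#SBATCH --time=12:00:00           # 12 hour walltime
#SBATCH --output=myjob_%j.out     # stdout file; %j is replaced by the job ID

module load myprogram             # hypothetical module
myprogram --threads $SLURM_CPUS_PER_TASK input.txt > output.txt

Such a script can then be submitted with a plain 'sbatch jobscript'.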

Single-threaded batch job
[biowulf ~] sbatch jobscript
This job will be allocated 2 CPUs and 4 GB of memory.

Multi-threaded batch job

[biowulf ~] sbatch --cpus-per-task=# jobscript
The above job will be allocated '#' CPUs, and (# * 2) GB of memory. e.g. with --cpus-per-task=4, the default memory allocation is 8 GB of memory.

You should use the Slurm environment variable $SLURM_CPUS_PER_TASK within your script to specify the number of threads to the program. For example, to run a Novoalign job with 8 threads, set up a batch script like this:

#!/bin/bash

module load novocraft
novoalign -c $SLURM_CPUS_PER_TASK  -f s_1_sequence.txt -d celegans -o SAM > out.sam

and submit with:

sbatch --cpus-per-task=8  jobscript

Note: when jobs are submitted without explicitly specifying the number of CPUs per task, the $SLURM_CPUS_PER_TASK environment variable is not set.
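If the same script is sometimes submitted without --cpus-per-task, you can fall back to the default allocation of 2 CPUs using ordinary bash parameter expansion (a minimal sketch; 'myprogram' and its --threads flag are hypothetical):

#!/bin/bash

# use the Slurm-provided CPU count if set, otherwise assume the default 2 CPUs
THREADS=${SLURM_CPUS_PER_TASK:-2}
myprogram --threads $THREADS input.txt > output.txt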

Allocating more memory:

[biowulf ~] sbatch --mem=#g jobscript
The above job will be allocated # GB of memory. Note the g (GB) following the memory specification. Without this addition the job will allocate # MB of memory. Unless your job uses very little memory this will likely cause it to fail.

Exclusively allocating nodes

Add --exclusive to your sbatch command line to exclusively allocate a node. Note that the batch system will still limit you to the requested CPUs and memory, or to the default 2 CPUs and 4 GB if you do not specifically request CPUs and memory.

Auto-threading apps

Programs that 'auto-thread' (i.e. attempt to use all available CPUs on a node) should be run with the --exclusive flag. This will give you an exclusively allocated node with at least 16 CPUs.

A major change in the new cluster is that the batch system will allocate by core, rather than by node. Thus, your job may be allocated 4 cores and 8 GB of memory on a node which has 16 cores and 32 GB of memory. Other jobs may be utilizing the remaining 12 cores and 24 GB of memory, so that your jobs may not have exclusive use of the node. Slurm will not allow any job to utilize more memory or cores than were allocated.

The default Slurm allocation is 1 physical core (2 CPUs) and 4 GB of memory. For any jobs that require more memory or CPU, you need to specify these requirements when submitting the job. Examples:

Slurm on Biowulf
Command Allocation
sbatch jobscript 2 CPUs, 4 GB memory on a shared node.
sbatch --mem=8g 2 CPUs, 8 GB memory on a shared node.
sbatch --mem=8g --cpus-per-task=4 4 CPUs, 8 GB memory on a shared node.
sbatch --mem=24g --cpus-per-task=16 16 CPUs, 24 GB memory on a shared node.
sinteractive --mem=Mg --cpus-per-task=C interactive job with C CPUs and M GB of memory on a shared node.

Note: add --exclusive if you want the node allocated exclusively.

Swarm

Job arrays can be submitted on Biowulf using swarm. e.g.

swarm -g G -t T -f swarmfile --module afni
will submit a swarm job with each command (a single line in the swarm command file) allocated T CPUs (for T threads) and G GB of memory. You can use the environment variable $SLURM_CPUS_PER_TASK within the swarm command file to specify the number of threads to the program, as in the sketch below. See the swarm webpage for details.
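For example, a swarm command file is just a text file with one command per line; each line becomes an independent subjob with the resources given by -g and -t (a minimal sketch; 'myprogram', its -t flag, and the file names are hypothetical):

myprogram -t $SLURM_CPUS_PER_TASK sample1.fastq > sample1.out
myprogram -t $SLURM_CPUS_PER_TASK sample2.fastq > sample2.out
myprogram -t $SLURM_CPUS_PER_TASK sample3.fastq > sample3.out

submitted with, e.g., 'swarm -g 4 -t 4 -f swarmfile'.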

Parallel Jobs

Parallel (MPI) jobs that run on more than 1 node: Use the environment variable $SLURM_NTASKS within the script to specify the number of MPI processes. For example:

#!/bin/bash

module load meme
cd /data/$USER/mydir
meme infile params -p $SLURM_NTASKS

Submit with, for example:

sbatch --ntasks=C --constraint=nodetype --exclusive --ntasks-per-core=1 [--mem-per-cpu=Gg] jobscript
where:
--ntasks=C number of tasks (MPI processes) to run
--constraint=nodetype all nodes should be of the same type, e.g. 'x2650'.
--exclusive for jobs with interprocess communication, it is best to allocate the nodes exclusively
--ntasks-per-core=1 Most parallel jobs do better running only 1 process per physical core
[optional] --mem-per-cpu=Gg only needed if each process needs more than the default 2 GB per hyperthreaded core
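For example, to run a 64-process MPI job on exclusively allocated x2650 nodes with one process per physical core (a sketch; adjust the task count and constraint to your application):

sbatch --ntasks=64 --constraint=x2650 --exclusive --ntasks-per-core=1 jobscript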

See the webpage for the application for more details.

Partitions

Biowulf nodes are grouped into partitions. A partition can be specified when submitting a job. The default partition is 'norm'. The freen command can be used to see free nodes and CPUs, and available types of nodes on each partition.

Nodes available to all users
norm the default partition. Restricted to single-node jobs
multinode Intended to be used for large-scale parallel jobs. Single node jobs are not allowed. See here for detailed information.
largemem Large memory nodes. Reserved for jobs with memory requirements that cannot fit on the norm partition.
unlimited Reserved for jobs that require more than the default 10-day walltime. Note that this is a small partition with a low CPUs-per-user limit. Only jobs that absolutely require more than 10 days runtime, that cannot be split into shorter subjobs, or that are a first-time run where the walltime is unknown, should be run on this partition.
quick For jobs < 4 hours long. These jobs will run on the buyin nodes when they are free.
gpu GPU nodes reserved for applications that are built for GPUs.
Buyin nodes
ccr for NCI CCR users
niddk for NIDDK users
nimh for NIMH users
Allocating GPUs

To make use of GPUs, jobs have to be submitted to the gpu partition and specifically request the type and number of GPUs. For example:

# request one k20x GPU
[biowulf ~]$ sbatch --partition=gpu --gres=gpu:k20x:1 script.sh

# request two k20x GPUs
[biowulf ~]$ sbatch --partition=gpu --gres=gpu:k20x:2 script.sh

# request 1 k80 GPU and 8 CPUs on a single K80 node
[biowulf ~]$ sbatch --partition=gpu --cpus-per-task=8 --gres=gpu:k80:1 script.sh

# request all 4 k80 and 56 CPUs on a single K80 node
[biowulf ~]$ sbatch --partition=gpu --cpus-per-task=56 --gres=gpu:k80:4 script.sh

The k20x nodes have 2 GPUs and the K80s have 4 GPUs. The 'freen' command can be used to see the CPUs and memory on each type of node. For example:

biowulf% freen | grep gpu
gpu         66/68        3752/3808         28    56    250g   800g   cpu56,core28,g256,ssd800,x2680,ibfdr,gpuk80
gpu         0/20         284/640           16    32    123g   800g   cpu32,core16,g128,ssd800,x2650,gpuk20x
will show you that the k20x nodes have 16 cores (32 CPUs) and 123 GB of memory, and the K80s have 28 cores (56 CPUs) and 250 GB of memory. If you request a single k20x GPU, you should request no more than 16 CPUs (half the CPUs on the node). Likewise, if you allocate one K80 GPU, you should allocate no more than 14 CPUs (1/4 the CPUs on the node).

The request for the GPU resource is in the form resourceName:resourceType:number.

To allocate a GPU for an interactive session, e.g. to compile a program, use:

[biowulf ~]$ sinteractive --gres=gpu:k20x:1 
To request more than the default 2 CPUs, use
[biowulf ~]$ sinteractive --gres=gpu:k20x:1 --cpus-per-task=8 

Interactive Jobs

To allocate resources for an interactive job, use the sinteractive command. The options are essentially the same as for the sbatch command. e.g.

[biowulf ~]$ sinteractive
salloc.exe: Granted job allocation 22261

[cn0004 ~]$ ...some interactive commands....

[cn0004 ~]$exit
exit
salloc.exe: Relinquishing job allocation 22261
salloc.exe: Job allocation 22261 has been revoked.

[biowulf ~]$

The default 'sinteractive' allocation is 1 core (2 CPUs) and 4 GB of memory. You can request additional resources. e.g.

Command Allocation
sinteractive --cpus-per-task=4 4 CPUs (2 cores) on a single node
sinteractive --constraint=multinode --ntasks=64 --exclusive IB FDR nodes, 2 nodes exclusively allocated
sinteractive --constraint=x2650 --ntasks=16 --ntasks-per-core=1 16 cores on an x2650 node
sinteractive --mem=5g --cpus-per-task=8 8 CPUs and 5 Gigabytes of memory in the norm (default) partition

Use sinteractive -h to see available options. The batchlim command will tell you the limits for interactive jobs.

Note: when interactive jobs are submitted without explicitly specifying the number of CPUs per task, the $SLURM_CPUS_PER_TASK environment variable is not set.

Walltime Limits

Most partitions have walltime limits. Use batchlim to see the default and max walltime limits for each partition.

If no walltime is requested on the command line, the job will be given the default walltime for the partition (as reported by batchlim). To request a specific walltime, use the --time option to sbatch. For example:

sbatch  --time=24:00:00  jobscript
will submit a job to the norm partition, and request a walltime of 24 hrs. If the job runs over 24 hrs it will be killed by the batch system.

To see the walltime limits and current runtimes for jobs, you can use the 'squeue' command.

[user@biowulf ~]$ squeue -O jobid,timelimit,timeused -u  username
JOBID               TIME_LIMIT          TIME
1418444             10-00:00:00         5-05:44:09
1563535             5-00:00:00          1:35:12
1493019             3-00:00:00          2-17:03:27
1501256             5-00:00:00          2-03:08:42
1501257             5-00:00:00          2-03:08:42
1501258             5-00:00:00          2-03:08:42
1501259             5-00:00:00          2-03:08:42
1501260             5-00:00:00          2-03:08:42
1501261             5-00:00:00          2-03:08:42
For many more squeue options, see the squeue man page.

Licenses

Several licensed software products are available on the cluster, including MATLAB, IDL, and Mathematica. Starting in June 2016, MATLAB licenses can only be allocated to interactive jobs. (See this announcement.) To use other licensed software in your batch job, you must specify the --license flag when submitting your job. This flag ensures that the batch system will wait until a license is available before starting the job. If you do not specify this flag, there is a risk that the batch system will start your job, which will then be unable to get a license and will exit immediately.

Example:

sbatch --license=idl:6  jobscript       (request the 6 licenses necessary to run a single instance of IDL)
sinteractive --license=idl:6 	        (interactive job that needs to run IDL)

It is no longer necessary to specify MATLAB licenses when running MATLAB in an interactive session via the sinteractive command. The current availability of licenses can be seen on the Systems Status page, or by typing 'licenses' on the command line.

Deleting Jobs

The scancel command is used to delete jobs. Examples:

scancel  232323				(delete job 232323)
scancel -u username			(delete all jobs belonging to user)
scancel --name=JobName			(delete job with the name JobName)
scancel --state=PENDING                 (delete all PENDING jobs)
scancel --state=RUNNING                 (delete all RUNNING jobs)
scancel --nodelist=cn0005               (delete any jobs running on node cn0005)

Job States

Common job states:

Job State Code Means
R Running
PD Pending (Queued). Some possible reasons:
QOSMaxCpusPerUserLimit (User has reached the maximum allocation)
Dependency (Job is dependent on another job which has not completed)
Resources (currently not enough resources to run the job)
Licenses (job is waiting for a license, e.g. Matlab)
CG Completing
CA Cancelled
F Failed
TO Timeout
NF Node failure

Use the sacct command to check on the states of completed jobs.

Show all your jobs in any state since midnight:

sacct

Show all jobs that failed since midnight

sacct --state f

Show all jobs that failed this month

sacct --state f --starttime 2015-07-01 
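To look at a specific job, you can request individual fields such as the state, exit code, and peak memory usage (the job ID below is hypothetical):

sacct -j 1234567 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS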

Exit codes

The completion status of a job is essentially the exit status of the job script, with all the complications that entails. For example, take the following job script:

#! /bin/bash

module load GATK/2.3.4
GATK -m 5g -T RealignerTargetCreator ...
echo "DONE"

This script tries to load a non-existent GATK version and then calls GATK, which will fail. However, bash by default keeps executing even if commands fail, so the script will eventually print 'DONE'. Since the exit status of a bash script is the exit status of the last command, and echo returns 0 (SUCCESS), the script as a whole will exit with a code of 0, signaling success, and the job state will show COMPLETED, since Slurm uses the exit code to judge whether a job completed successfully.

Similarly, if a command in the middle of the job script were killed for exceeding memory, the rest of the job script would still be executed and could potentially return an exit code of 0 (SUCCESS), resulting again in a state of COMPLETED.

Conversely, in the following example a successful analysis is followed by a command that fails:

#! /bin/bash
module load GATK/3.4.0
GATK -m 5g -T RealignerTargetCreator ...
touch /file/in/non/existing/directory/DONE

Even though the actual analysis (here the GATK call) finished successfully, the last command will fail, resulting in a final state of FAILED for the batch job.

Some defensive bash programming techniques can help ensure that a job script will show a final state of FAILED if anything goes wrong.

Use set -e

Starting a bash script with set -e will tell bash to stop executing the script if a command fails and to signal failure with a non-zero exit code, which will be reflected as a FAILED state in Slurm.

#! /bin/bash
set -e

module load GATK/3.4.0
GATK -m 5g -T RealignerTargetCreator ...
echo "DONE"

One complication with this approach is that some commands return non-zero exit codes even when nothing is actually wrong; for example, grep returns a non-zero exit code when it finds no matches.
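If a command is expected to sometimes return a non-zero exit code that should not abort the script, you can explicitly allow it (a minimal sketch; the pattern and file names are hypothetical):

#!/bin/bash
set -e

# grep exits with a non-zero code when nothing matches; '|| true' keeps set -e from aborting
matches=$(grep -c "some_pattern" input.txt || true)
echo "found $matches matching lines"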

Check errors for individual commands

A more selective approach involves carefully checking the exit codes of the important parts of a job script. This can be done with conventional if/else statements or with the short-circuit evaluation (|| and &&) often seen in shell scripts. For example:

#! /bin/bash

function fail {
    echo "FAIL: $@" >&2
    exit 1  # signal failure
}

module load GATK/3.4.0 || fail "Could not load GATK module"
GATK -m 5g -T RealignerTargetCreator ... || fail "RealignerTargetCreator failed"
echo "DONE"
Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm's job dependencies options. For example, if you have two jobs, Job1.bat and Job2.bat, you can utilize job dependencies as in the example below.

[user@biowulf]$ sbatch Job1.bat
123213

[user@biowulf]$ sbatch --dependency=afterany:123213 Job2.bat
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of '0' means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.

There are several options for the '--dependency' flag that depend on the status of Job1. e.g.

--dependency=afterany:Job1     Job2 will start after Job1 completes with any exit status
--dependency=after:Job1        Job2 will start any time after Job1 starts
--dependency=afterok:Job1      Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1   Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is straightforward, as in the example below:

[user@biowulf]$ sbatch Job1.bat
13205

[user@biowulf]$ sbatch --dependency=afterany:13205 Job2.bat
13206

[user@biowulf]$ sbatch --dependency=afterany:13205 Job3.bat
13207

[user@biowulf]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205                

A job can also be made to depend on the completion of several other jobs, as in the example below:

[user@biowulf]$ sbatch Job1.bat
13201

[user@biowulf]$ sbatch Job2.bat
13202

[user@biowulf]$ sbatch --dependency=afterany:13201,13202 Job3.bat
13203

[user@biowulf]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.bat        R                                  
13202        Job2.bat        R                                  
13203        Job3.bat        PD   afterany:13201,afterany:13202 

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOBID  my_second_job

More detailed examples are shown on a separate page.

Using local disk

Each Biowulf node has some amount of local disk available for use. For new nodes this is generally 800GB of fast solid state storage. For older nodes this is generally a smaller amount of spinning disk. Use the freen command to see how much is available on each node type. For jobs that read/write lots of temporary files during the run, it may be advantageous to use the local disk as scratch or temp space.

The command

sbatch --gres=lscratch:500    jobscript

will allocate 500 GB of local scratch space from the /lscratch directory. Other jobs may allocate space from whatever remains on that node (e.g. the remaining 300 GB on an 800 GB node).

For multi-node jobs, each node will have the amount specified in the command line reserved for the job.

To access the directory allocated to the job, refer to it as /lscratch/$SLURM_JOBID. Users cannot read or write at the top level of /lscratch, but have full read/write access to the /lscratch/$SLURM_JOBID directory set up for the job.

When the job is terminated, all data in /lscratch/$SLURM_JOBID directory will be automatically deleted. Any data that needs to be saved should be copied to your /data directory before the job concludes.
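A common pattern is to stage input data into the job's local scratch directory, run the analysis there, and copy the results back to /data before the job ends. A minimal sketch, submitted with 'sbatch --gres=lscratch:100 jobscript' (the program and file names are hypothetical):

#!/bin/bash

cd /lscratch/$SLURM_JOBID                 # job-specific scratch directory
cp /data/$USER/input.bam .                # stage input onto fast local disk
myprogram input.bam > results.txt         # hypothetical analysis step
cp results.txt /data/$USER/               # copy results back before the job ends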

Performance of lscratch will suffer for all users on a node when large numbers of files are created in a single directory. Please avoid this situation by removing files that are no longer needed during the job, or by structuring your data differently (subdirectories, an sqlite3 database, a python shelf, ...).

See also our video tutorial on using lscratch.

Setting TMPDIR

TMPDIR is a near-universally agreed upon environment variable that defines where a program will write temporary files. By default, Unix systems set the value of TMPDIR to /tmp. On the Biowulf cluster, leaving TMPDIR set to /tmp can lead to problems, because /tmp on a compute node is small and shared by all jobs running on that node.

Because of this, users are strongly encouraged to allocate local scratch disk for their jobs and to set TMPDIR to that local scratch space. Because local scratch is not defined until the job begins running, TMPDIR must be set either within the batch script:

#!/bin/bash
export TMPDIR=/lscratch/$SLURM_JOB_ID

... run batch commands here ...

or once an interactive session begins:

[biowulf ~]$ sinteractive --gres=lscratch:5
salloc.exe: Granted job allocation 12345
[cn1234 ~]$ export TMPDIR=/lscratch/$SLURM_JOB_ID
[cn1234 ~]$ ...some interactive commands....

[cn1234 ~]$ exit
Requesting more than one GRES

To request more than one Generic Resource (GRES) like local scratch or GPUs, use the following format:

[biowulf ~]$ sinteractive --constraint=gpuk80 --gres=lscratch:10,gpu:k80:1

Note that this is not the same as specifying the --gres option multiple times, in which case only the last one is honored.

Cluster Status

Cluster status info is available on the System Status page. The partitions page shows free and allocated cores for each partition over the last 24 hrs.

On the command line, freen will report free nodes/cores on the cluster, and batchlim will report the current per-user limits and walltime limits on the partitions.

Monitoring Jobs

squeue will report all jobs on the cluster. squeue -u username will report your running jobs. An in-house variant of squeue is sjobs, which provides the information in a different format. Slurm commands like squeue are very flexible, so that you can easily create your own aliases.

Examples of squeue and sjobs:

[biowulf ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22392     norm meme_sho   susanc PD       0:00      1 (Dependency)
             22393     norm meme_sho   susanc PD       0:00      1 (Dependency)
             22404     norm   stmv-1   susanc  R      10:04      1 cn0414
             22391     norm meme_sho   susanc  R      10:06      1 cn0413

[biowulf ~]$ sjobs
                                                       ................Requested.................
User     JobId  JobName    Part  St    Runtime  Nodes CPUs     Mem        Dependency    Features    Nodelist
susanc   22391  meme_short norm  R      10:09   1     32   1.0GB/cpu                     (null)    cn0413         
susanc   22404  stmv-1     norm  R      10:07   1     32   1.0GB/cpu                     (null)    cn0414         
susanc   22392  meme_short norm  PD      0:00   1      2   1.0GB/cpu   afterany:22391    (null)    (Dependency)   
susanc   22393  meme_short norm  PD      0:00   1      4   1.0GB/cpu   afterany:22392    (null)    (Dependency)   
[More about sjobs]

jobload will report running jobs, the %CPU usage, and the memory usage. See here for details and an example.

jobhist will report the CPU and memory usage of completed jobs. See here for details and an example.

Email notifications

Using the --mail-type=<type> option to sbatch, users can request email notifications from SLURM as certain events occur. Multiple event types can be specified as a comma separated list. For example

[user@biowulf]$ sbatch --mail-type=BEGIN,TIME_LIMIT_90,END batch_script.sh

Available event types:

Event type      Description
BEGIN           Job started
END             Job finished
FAIL            Job failed
REQUEUE         Job was requeued
ALL             BEGIN, END, FAIL, REQUEUE
TIME_LIMIT_50   Job reached 50% of its time limit
TIME_LIMIT_80   Job reached 80% of its time limit
TIME_LIMIT_90   Job reached 90% of its time limit
TIME_LIMIT      Job reached its time limit
Modifying a job after submission

After a job is submitted, some of the submission parameters can be modified using the scontrol command. Examples:

Change the job dependency:

  scontrol update JobId=181766 dependency=afterany:18123                  
Request a matlab license:
  scontrol update JobId=181755 licenses=matlab                                
Job was submitted to the norm partition; move it to the ccr partition:
  scontrol update JobID=181755 partition=ccr QOS=ccr                                  
Walltimes on pending and running jobs can also be increased or decreased using the newwall command. Examples:

Reduce the walltime for job id 12345 to 2 hours.

  newwall --jobid 12345 --time 2:00:00                                 
Increase the walltime for job id 12345 to 8 hours.
  newwall --jobid 12345 --time 8:00:00                                 

See newwall --help for usage details.

Note: Users can only increase walltimes up to the walltime limit of the partition. If you need a walltime beyond the partition limit, contact staff@hpc.nih.gov.