NIH HPC News & Announcements
Changes to Slurm memory limit enforcement on Biowulf
Date: 26 September 2017 12:09:47
From: Tim Miller
A configuration change has been made to the Slurm batch system so that it
can accurately measure the shared memory usage of a job. In the past, jobs
running applications that used large amounts of shared memory could be
killed because of an incorrect memory calculation.
This change means that only the particular process that exceeds the memory
allocation will be killed, whereas in the past the entire job would have
been killed. A batch script that runs a series of commands will therefore
continue past the killed process. Unless the killed command was the last
one in the script, the Slurm status of the job will show as 'COMPLETED'
rather than 'FAILED', even though one or more commands did not run to
completion.
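For example, consider a hypothetical three-step batch script (the step
names are illustrative):

    #!/bin/bash
    step1.sh    # runs to completion
    step2.sh    # suppose this process is killed for exceeding the memory allocation
    step3.sh    # still runs under the new behavior

If step2.sh is killed, step3.sh still executes, and because the last
command in the script finishes normally, Slurm reports the job as
COMPLETED even though the pipeline is incomplete.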
Users who run batch scripts or swarm subcommands that execute pipelines of
multiple dependent commands should do the following:
- Use "set -e" as the first command in pipelined scripts or swarm subjobs
to ensure that the entire batch script will fail if a single subcommand
fails.
- Carefully monitor jobs via the jobhist command or the HPC user
dashboard at https://hpc.nih.gov/dashboard to ensure that no process in a
job was killed for exceeding the memory allocation.
- (For advanced users with knowledge of shell scripting) Test the exit
status of each command and take appropriate action within the script if a
particular process fails; a sketch of this approach follows this list. In
bash, the exit status of the previous command is stored in the "$?" shell
variable; generally, a value of 0 indicates that the command succeeded,
and any other value indicates a failure.
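A minimal sketch of the exit-status approach, using hypothetical command
and file names:

    #!/bin/bash
    # Run the first step of the pipeline.
    first_step input.dat > intermediate.dat
    # $? holds the exit status of the command above; non-zero means failure.
    if [ $? -ne 0 ]; then
        echo "first_step failed; aborting job" >&2
        exit 1
    fi
    # This point is reached only if first_step succeeded.
    second_step intermediate.dat > results.dat

Exiting the script with a non-zero status also ensures that Slurm records
the job as FAILED rather than COMPLETED.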
An example of a script that runs multiple dependent commands would be one
that first copies data to local scratch, then processes it, and finally
copies the results back from local scratch to the user's data directory.
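A sketch of such a script, assuming local scratch space has been
allocated to the job (e.g., with --gres=lscratch); the processing command
my_analysis and the file names are hypothetical:

    #!/bin/bash
    # Abort the whole script if any command fails, so Slurm
    # records the job as FAILED instead of COMPLETED.
    set -e

    # Copy input data to local scratch.
    cp /data/$USER/input.dat /lscratch/$SLURM_JOB_ID/

    # Process the data on local scratch.
    cd /lscratch/$SLURM_JOB_ID
    my_analysis input.dat > results.dat

    # Copy the results back to the user's data directory.
    cp results.dat /data/$USER/

If the processing step is killed for exceeding the memory allocation,
"set -e" stops the script before the final copy, so no partial results
are written back.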
As always, please contact staff@hpc.nih.gov with any questions.