NIH HPC News & Announcements
HPC/Biowulf Users - Testing/Monitoring Your Jobs
Date: 11 August 2021 08:08:10
From: NIH HPC Systems Staff
As the number of Biowulf users increases and ever more demanding
workflows are submitted to the cluster, resource utilization has become
a pressing issue.
Allocating more memory than needed is wasteful and often results in
longer queue times for both your own and other users' jobs. Allocating
too little memory results in wasted cpu cycles when jobs end prematurely
due to exceeding memory limits, and can destabilize nodes and file
systems. Allocating 1 gpu and an unnecessarily large amount of memory
can effectively make the other 3 gpus on the node unavailable.
You should _never_ submit a large number of jobs or a large swarm
without understanding the cpu, gpu, and memory requirements of those
jobs. Instead, first submit a single job, or a test swarm of no more
than 3 subjobs, and monitor the job(s).
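For example, a small test swarm could be submitted like this (the swarm file name, commands, and resource values below are illustrative; substitute your own workflow and adjust after monitoring):

```shell
# test.swarm -- a 3-command test swarm file (contents shown as comments;
# 'myprog' and the input files are placeholders for your own commands):
#   myprog --in sample1.fastq
#   myprog --in sample2.fastq
#   myprog --in sample3.fastq

# Submit the 3-subjob test swarm with an initial resource guess:
# -g = GB of memory per subjob, -t = cpus (threads) per subjob.
swarm -f test.swarm -g 8 -t 4
```

Once the test subjobs have run, compare their actual peak memory and cpu usage against the requested values, adjust -g and -t accordingly, and only then submit the full swarm.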
Several tools are available to help you determine the cpu and memory
requirements of your jobs: 'jobload', 'jobhist', 'dashboard_cli', and
the graphical User Dashboard at https://hpc.nih.gov/dashboard.
Please also see
https://hpc.nih.gov/training/intro_biowulf/hands-on-monitoring.html
for ways you can monitor the memory utilization of your jobs.
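As a quick sketch, these tools can be run from the Biowulf login node as follows (the job ID below is a placeholder):

```shell
# While the job is running: check current cpu and memory load
jobload -j 12345678

# After the job completes: compare allocated vs. peak resource usage
jobhist 12345678

# Query your job records from the command line
dashboard_cli jobs
```

If 'jobhist' shows peak memory well below the allocation, reduce the memory request for subsequent jobs; if a job was killed for exceeding its limit, increase it.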
In the future, users who habitually waste cluster cpus, gpus, or memory
will have restrictions placed on their accounts.
We are always here to answer questions about good job allocation
practices at staff@hpc.nih.gov.
########################################################################
Please contact staff@hpc.nih.gov with any questions about the NIH HPC Systems