NIH HPC News & Announcements
Biowulf filesystem problems and job loss
Date: 16 December 2019 13:12:57
From: Susan Chacko
Due to the loss of a cluster InfiniBand core switch, all GPFS
filesystems were briefly unavailable to the Biowulf compute nodes. In
many cases this will have caused running jobs to fail. The GPFS
filesystems are now available again and stable.
Please check all your running jobs and resubmit if necessary.
Indications that a job was affected are:
- the error 'Stale file handle' in the Slurm output file.
- the job failed between 12:45 pm and 1:05 pm with no obvious error.
- the Biowulf dashboard shows the job CPU/memory utilization dropping to
zero during this period.
- 'squeue' lists the job as running but 'jobload' shows 0% CPU usage.
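The first indication above can be checked in bulk. The following is a minimal sketch, not an official NIH HPC tool: it assumes your job output files are in a single directory and follow the default Slurm naming pattern slurm-<jobid>.out. The function name find_affected_outputs and its parameters are hypothetical; adjust the pattern if your jobs write output elsewhere.

```python
from pathlib import Path

# Error text quoted in the announcement as a sign of an affected job.
MARKER = "Stale file handle"


def find_affected_outputs(directory=".", pattern="slurm-*.out"):
    """Return the output files in `directory` that contain MARKER.

    `pattern` is an assumption: it matches the default Slurm output
    file name and must be changed if jobs used a custom output path.
    """
    affected = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file holds binary data.
        with path.open("r", errors="replace") as handle:
            if any(MARKER in line for line in handle):
                affected.append(path)
    return affected


if __name__ == "__main__":
    # Print one affected output file per line for the current directory.
    for path in find_affected_outputs():
        print(path)
```

A file listed by this script only shows that the error text appeared in the job output. Jobs that failed silently still need to be checked against the other indications.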
We sincerely apologize for the disruption. Please let us know if you
have questions.
NIH HPC Staff.
########################################################################
Please contact staff@hpc.nih.gov with any questions about the NIH HPC Systems