The freen command gives an instantaneous report of free nodes, CPUs, and GPUs on the cluster (in the example below, only a subset of the features is displayed, for clarity). Note: the example below does not describe the current status of free nodes, CPUs, or partitions on Biowulf; it is just an example. To see the current status, type 'freen' on Biowulf.
biowulf$ freen
                                                .......Per-Node Resources.......
Partition   FreeNds    FreeCPUs    FreeGPUs   Cores  CPUs  GPUs   Mem    Disk   Features
--------------------------------------------------------------------------------------------------------
norm*         1/525    5380/29254     -         28    56    -    248g    800g   cpu56,core28,g256 [...]
norm*       434/487   14720/15584     -         16    32    -    121g    800g   cpu32,core16,g128 [...]
norm*       181/527   15858/29512     -         28    56    -    246g    400g   cpu56,core28,g256 [...]
unlimited      1/14      156/418      -         16    32    -    121g    800g   cpu32,core16,g128 [...]
unlimited       0/4        0/116      -         28    56    -    246g    400g   cpu56,core28,g256 [...]
multinode    94/605    6982/33880     -         28    56    -    248g    800g   cpu56,core28,g256 [...]
multinode     1/190      32/6080      -         16    32    -     58g    800g   cpu32,core16,g64 [...]
multinode    14/535     784/29960     -         28    56    -    246g    400g   cpu56,core28,g256 [...]
gpu            1/22      662/704    23/44       16    32    2    121g    800g   [...] x2650,gpuk20x
gpu           25/48    2130/2688   105/192      28    56    4    121g    800g   [...] x2680,ibfdr,gpup100
gpu            0/68    1788/3808    36/272      28    56    4    248g    800g   [...] x2680,ibfdr,gpuk80
gpu             7/7      392/392    28/28       28    56    4    121g    800g   [...] x2680,ibfdr,gpuv100
huygens         1/1        56/56      4/4       28    56    4    248g    800g   [...] x2680,gpuk80,huygens
largemem        0/4       16/256      -         32    64    -   1005g    800g   cpu64,core32,g1024 [...]
largemem        0/4      480/576      -         72   144    -   3025g    800g   cpu144,core72,g3072 [...]
largemem       0/20    1044/2880      -         72   144    -   1510g    800g   cpu144,core72,g1536 [...]
nimh           0/64    1128/2048      -         16    32    -    121g    800g   cpu32,core16,g128 [...]
ccr            0/72     382/4032      -         28    56    -    248g    800g   cpu56,core28,g256 [...]
ccr           0/116     376/6496      -         28    56    -    246g    400g   cpu56,core28,g256 [...]
ccrlcb         1/42    1164/2352      -         28    56    -    246g    400g   cpu56,core28,g256 [...]
ccrclin         4/4      224/224      -         28    56    -    246g    400g   cpu56,core28,g256 [...]
ccrgpu         1/16      256/896     4/64       28    56    4    246g    400g   cpu56,core28,g256 [...]
ccrlcbgpu       3/8      428/448    27/32       28    56    4    246g    400g   cpu56,core28,g256 [...]
quick          0/64    1128/2048      -         16    32    -    121g    800g   cpu32,core16,g128 [...]
quick          0/72     382/4032      -         28    56    -    248g    800g   cpu56,core28,g256 [...]
quick          0/56       0/1792      -         16    32    -    121g    400g   cpu32,core16,g128 [...]
quick         0/116     376/6496      -         28    56    -    246g    400g   cpu56,core28,g256 [...]
quick          1/16      256/896     4/64       28    56    4    246g    400g   cpu56,core28,g256 [...]
quick         51/57    1774/1824      -         16    32    -     58g    400g   cpu32,core16,g64 [...]
quick       335/335    8040/8040      -         12    24    -     19g    100g   cpu24,core12,g24 [...]
quick       277/277    4432/4432      -          8    16    -     19g    200g   cpu16,core8,g24 [...]
quick         10/10      160/160      -          8    16    -     66g    200g   cpu16,core8,g72 [...]
student        1/19      572/608    20/38       16    32    2    121g    800g   cpu32,core16,g128 [...]
student     280/306    9360/9792      -         16    32    -    121g    800g   cpu32,core16,g128 [...]

The freen output has one line for each unique type of node in a partition. For example, there are 2 types of nodes in the ccr (NCI_CCR) partition, differing in their processor type or amount of memory, so there are 2 lines for 'ccr' in the freen output.
Explanation of the columns:
Col. | Title | Explanation
1 | Partition | Name of the partition. The asterisk next to 'norm' indicates that it is the default partition.
2 | FreeNds | The number of nodes that are completely free in this partition, e.g. 1/256 indicates that only 1 of the partition's 256 nodes is completely unallocated.
3 | FreeCPUs | The number of free CPUs available in this partition. Since the batch system allocates by core (2 CPUs), there may be free CPUs available on many different nodes, in addition to the free CPUs on the entirely unallocated nodes counted in Column 2.
4 | FreeGPUs | The number of free GPUs available in this partition, where applicable.
5 | Cores | The number of physical cores on a node in this partition. This may help determine the maximum number of threads that an application should run.
6 | CPUs | The number of CPUs on a node in this partition. This number counts hyperthreaded CPUs.
7 | GPUs | The number of GPUs on a node in this partition, where applicable.
8 | Mem | Memory per node for this partition.
9 | Disk | Local disk space for nodes in this partition. This may determine the maximum local disk that can be allocated in a job using lscratch.
10 | Features | The batch system 'features' associated with nodes in this partition. They are only important if you require a particular type of processor (e.g. for parallel jobs that should all run on the same processor type) or a node-locked license such as acemd.
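The FreeNds and FreeCPUs columns use a free/total format, which makes freen output easy to post-process. As a sketch, the free CPUs can be totaled per partition with awk; a few freen-style sample lines are inlined here (via a hypothetical freen_sample function) since the real command only exists on Biowulf:

```shell
# freen_sample stands in for `freen` output: partition, FreeNds, FreeCPUs.
freen_sample() {
cat <<'EOF'
norm 1/525 5380/29254
norm 434/487 14720/15584
gpu 25/48 2130/2688
EOF
}
# Split the used/total FreeCPUs field on '/' and sum the free part per partition.
freen_sample | awk '{split($3, a, "/"); free[$1] += a[1]} END {for (p in free) print p, free[p]}'
```

On Biowulf itself, `freen | awk ...` with the appropriate column numbers would work the same way.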
The nodetype command returns the attributes of a given node.
biowulf$ nodetype cn2039
compute core28 cpu56 g256 ibfdr phase2 pod3 rack_w16 ssd400 x2695
Note: If two or more arguments are passed to `nodetype`, the extra arguments will be interpreted as attributes, and no output will be printed. If the supplied attributes are found on the node, then `nodetype` will give an exit status of 0. If one or more of the attributes is not found on the node, then an exit status of 1 will be given.
$ ( nodetype cn3118 ibfdr && echo yes ) || echo no
yes
$ ( nodetype cn3118 ibfdr x2680 && echo yes ) || echo no
yes
$ ( nodetype cn3118 ibfdr x2680 bogus && echo yes ) || echo no
no
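Because nodetype signals matches through its exit status, it can drive branching in a job script. The sketch below is illustrative, not a Biowulf recipe: pick_transport and the feature choice are hypothetical, and the command -v guard lets the snippet fall through cleanly on systems without nodetype:

```shell
# Hypothetical: choose a code path based on whether the current node
# has a given feature. nodetype exits 0 only if all listed features match.
pick_transport() {
    # Guard with command -v so this sketch also runs off-cluster.
    if command -v nodetype >/dev/null 2>&1 && nodetype "$1" ibfdr; then
        echo "using InfiniBand FDR transport"
    else
        echo "falling back to TCP transport"
    fi
}
pick_transport cn3118
```

Inside a batch job, the node name would typically come from $SLURMD_NODENAME rather than being hard-coded.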
The batchlim command shows the current limits per user per partition. Note: the example below does not describe the current limits on Biowulf; it is just an example of the batchlim output. To see the current limits, type 'batchlim' on Biowulf, or see batch limits on the system status page (NIH-only).
biowulf$ batchlim
Max jobs per user: 4000
Max array size: 1001

Partition      MaxCPUsPerUser   DefWalltime   MaxWalltime
---------------------------------------------------------------
norm                 2048       02:00:00      10-00:00:00
multinode            2048       08:00:00      10-00:00:00
  turbo qos          4096       08:00:00
interactive            64       08:00:00      1-12:00:00    (2 simultaneous jobs)
quick                6144       02:00:00      02:00:00
largemem              128       04:00:00      10-00:00:00
gpu                   128       10-00:00:00   10-00:00:00   (8 GPUs per user)
unlimited             128       UNLIMITED     UNLIMITED
ccr                  3072       04:00:00      10-00:00:00
ccrclin               224       04:00:00      10-00:00:00
ccrlcb               1024       04:00:00      10-00:00:00
niddk                1024       04:00:00      10-00:00:00
nimh                 1024       04:00:00      10-00:00:00
Explanation of the columns:
Partition | Partition name.
MaxCPUsPerUser | Max. simultaneous CPUs that the batch system will allocate to a single user. Once the max has been allocated to a user, the remaining jobs belonging to that user will stay in the queue until some of their jobs complete.
DefWalltime | Default walltime for this partition. If no walltime is specified when submitting the job, this is the walltime allocated. The output example above indicates that the default walltime for the 'norm' partition is 2 hrs. Format: DD-HH:MM:SS (Days-Hours:Mins:Secs).
MaxWalltime | Max walltime that can be requested for a job in this partition. In the example above, the maximum walltime that can be requested on 'norm' is 10 days. Format: DD-HH:MM:SS (Days-Hours:Mins:Secs).
Priority | Partition priority. The scheduling algorithm uses a number of factors to determine which jobs get to run first; one of these is the partition. Interactive jobs get a higher priority and are scheduled earlier, but in most other cases this is irrelevant for users.
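Walltimes in the DD-HH:MM:SS format can be converted to seconds for scripting, e.g. to compare a job's elapsed time against a partition limit. A minimal bash sketch (wall_to_sec is a hypothetical helper, not a Biowulf command):

```shell
# Convert a Slurm walltime string (DD-HH:MM:SS, HH:MM:SS, or MM:SS) to seconds.
wall_to_sec() {
    local d=0 t=$1 a b c
    case $t in *-*) d=${t%%-*}; t=${t#*-};; esac   # peel off the days part
    IFS=: read -r a b c <<< "$t"
    if [ -z "$c" ]; then c=$b; b=$a; a=0; fi        # MM:SS form has no hours
    # 10# forces base-10 so fields like "08" are not read as octal
    echo $(( 10#$d*86400 + 10#$a*3600 + 10#$b*60 + 10#$c ))
}
wall_to_sec 10-00:00:00    # the norm partition's MaxWalltime
wall_to_sec 02:00:00       # the norm partition's DefWalltime
```

This does not handle the special value UNLIMITED; a real script would special-case it.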
squeue will report all jobs on the cluster. squeue -u username will report only that user's jobs. An in-house variant of squeue is sjobs, which provides the same information in a different format. Slurm commands like squeue are very flexible, so you can easily create your own aliases.
Examples of squeue and sjobs:
[biowulf ~]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
22392      norm meme_sho     user PD  0:00      1 (Dependency)
22393      norm meme_sho     user PD  0:00      1 (Dependency)
22404      norm   stmv-1     user  R 10:04      1 cn0414
22391      norm meme_sho     user  R 10:06      1 cn0413

[biowulf ~]$ sjobs
                                 ................Requested.................
User  JobId  JobName     Part  St  Runtime  Nodes  CPUs  Mem        Dependency      Features  Nodelist
user  22391  meme_short  norm  R   10:09    1      32    1.0GB/cpu                  (null)    cn0413
user  22404  stmv-1      norm  R   10:07    1      32    1.0GB/cpu                  (null)    cn0414
user  22392  meme_short  norm  PD  0:00     1      2     1.0GB/cpu  afterany:22391  (null)    (Dependency)
user  22393  meme_short  norm  PD  0:00     1      4     1.0GB/cpu  afterany:22392  (null)    (Dependency)
Sjobs output:
User, JobId and JobName | Self-explanatory.
Part | The partition the job was submitted to (for pending jobs) or is running in (for running jobs).
Runtime | How long the job has been running. Reported as 0 for pending jobs.
Nodes, CPUs, Mem | Requested resources for the job.
Dependency | Whether the job is dependent on another job's completion. [More info about job dependencies]
Features | Any features that were requested for this job, e.g. processor type.
Nodelist | Nodes on which the job is running (running jobs only). For pending jobs, this column displays the reason why the job is pending, if known.
If sjobs does not meet your needs, and there are particular columns in the 'squeue' output that you would like to see, it is trivial to set up an alias that reports your preferred fields. Below is an example of the default squeue output, and setting up an alias that reports specific desired fields.
biowulf$ squeue --me
             JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
          20481173      norm viruses2 helixapp PD  0:00      1 (Resources)

biowulf$ alias sq="squeue --me -o '%18i %10j %10P %20S %20e'"
biowulf$ sq
JOBID              NAME       PARTITION  START_TIME           END_TIME
20481173           viruses2.p norm       2017-06-27T07:32:28  2017-06-27T11:32:28
sq reports that the batch system expects to start the job on June 27, 2017, at 7:32 am. Based on the default or requested walltime for this job (4 hrs), it is expected to end at 11:32 am of the same day. Note that the expected start and end time may not always be calculable by the batch system, depending on the state of the batch queue.
See the Slurm squeue webpage for a full list of available squeue parameters.
jobload will report running jobs, their %CPU usage, and their memory usage.
usage: jobload [-h] (-u USER | -n NODE | -j JOBID) [-v]

query SLURM by user, node or job ID

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  query jobs by user
  -n NODE, --node NODE  query jobs by node
  -j JOBID, --jobid JOBID
                        query jobs by job id
  -v, --verbose         increase output verbosity
Example:
[someuser@biowulf]$ jobload -u someuser
     JOBID            TIME             NODES  CPUS   THREADS   LOAD      MEMORY
             Elapsed / Wall                   Alloc   Active           Used/Alloc
  20451459   05:27:10 / 1-00:00:00   cn0335      6        5     83%   1.3 GB/4.9 GB
Explanation of the columns:
Jobid | The batch job ID.
Time | Elapsed: how long the job has been running (5 hrs, 27 mins, 10 secs in the example above). Wall: the requested walltime for this job (1 day, 00 hours, 00 mins, 00 secs in the example above). If a job is getting close to its walltime, you can extend the walltime with the newwall command.
Nodes | The nodes that have been allocated to this job. In the example above, the job is running on cn0335.
CPUS Alloc | The number of CPUs that have been allocated to this job. This cannot be changed after the job has started.
THREADS Active | The number of active threads running on the allocated CPUs. Ideally, threads would match allocated CPUs 1-to-1. If the number of threads is significantly more than the allocated CPUs, the job is overloaded and should probably be stopped and resubmitted with a larger CPU request. In the example above, the job is running 5 threads on 6 allocated CPUs, so all is well.
LOAD | Ratio of active threads to allocated CPUs.
MEMORY Used/Alloc | Memory used and allocated. If a process uses more memory than is allocated, the batch system will kill the job. In the example above, the job is using 1.3 GB of an allocated 4.9 GB.
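The LOAD column is simply active threads divided by allocated CPUs, expressed as a percentage. A one-line sketch of the same calculation (load_ratio is a hypothetical helper, shown only to make the arithmetic concrete):

```shell
# LOAD = 100 * active threads / allocated CPUs, rounded to a whole percent.
load_ratio() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.0f%%\n", 100 * t / c }'
}
load_ratio 5 6    # the jobload example above: 5 threads on 6 CPUs
```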
jobhist will report the CPU allocation and memory usage of completed jobs. Data is stored for at least a month after a job completes.
Example:
biowulf% jobhist 17500
JobId              : 17500
User               : user
Submitted          : 20150504 13:16:26
Command Path       : /spin1/users/someuser/meme
Submission Command : sbatch --ntasks=28 --time=01:30:00 --constraint=x2600 --exclusive meme_short.slurm

Partition   State       AllocNodes   AllocCPUs   Walltime   MemReq      MemUsed   Nodelist
norm        COMPLETED        1           32      00:15:55   0.8GB/cpu   2.0GB     p1028
Jobid | The batch job ID.
User | The user who submitted this job.
Submitted | Timestamp of job submission. In the example above, the job was submitted on 4 May 2015 at 1:16:26 pm.
Command Path | The directory from which the job was submitted. By default, the output file slurm-###### will appear in this directory (slurm-17500 in this example).
Submission Command | The command that was used to submit this job. In the example above, 28 tasks (--ntasks=28), a specific processor type (--constraint=x2600), exclusive node allocation (--exclusive), and a walltime of 1 hr 30 mins (--time=01:30:00) were requested.
Partition | The partition that the job ran in (in this case, 'norm').
State | The exit state of the job, e.g. COMPLETED, FAILED, etc. In this case, the batch system thinks that the job completed successfully.
AllocNodes | The number of nodes on which the job ran (in this case, 1 node).
AllocCPUs | The number of CPUs on which the job ran (in this case, 32 CPUs).
Walltime | How long the job ran. This value can be used to decide on the walltime for future jobs. In the example above, a walltime of 90 mins was requested, but the job completed in about 16 mins. Based on this, future similar jobs could be submitted with walltime requests of 20 or 30 mins.
MemReq | Allocated memory. In this case no specific memory was requested (i.e. no --mem flag on the submission command line), so the job was allocated the default memory of 0.8 GB/cpu = 25.6 GB total.
MemUsed | Memory used. In this case, the job used 2.0 GB of memory. Based on this, future such jobs could be submitted with --mem=3g.
Nodelist | Node(s) on which the job ran.
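The MemUsed value can be turned into a future --mem request mechanically, e.g. by rounding the measured peak up to the next GB and adding 1 GB of headroom. A sketch of that rule of thumb (suggest_mem is a hypothetical helper; adjust the headroom to taste):

```shell
# Given peak memory used in GB, suggest a --mem value: ceil(used) + 1 GB headroom.
suggest_mem() {
    awk -v used="$1" 'BEGIN {
        printf "--mem=%dg\n", int(used) + (used > int(used)) + 1
    }'
}
suggest_mem 2.0    # the jobhist example above: 2.0 GB used
```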
'newwall' allows users to increase the walltime of their jobs, up to the partition walltime limit.
biowulf% sbatch sleep.bat
20276034

biowulf% sjobs -u user
User  JobId     JobName    Part  St  Reason  Runtime  Walltime  Nodes  CPUs  Memory   Dependency  Nodelist
=============================================================================================================
user  20276034  sleep.bat  norm  R   ---     0:10     2:00:00   1      2     2GB/cpu              cn3164
=============================================================================================================

biowulf% newwall -j 20276034 -t 4:00:00
Sucessfully updated 1 job(s).

biowulf% sjobs -u user
User  JobId     JobName    Part  St  Reason  Runtime  Walltime  Nodes  CPUs  Memory   Dependency  Nodelist
=============================================================================================================
user  20276034  sleep.bat  norm  R   ---     0:30     4:00:00   1      2     2GB/cpu              cn3164
=============================================================================================================
jobdata will return a large amount of data for a single job. It pulls from multiple sources and can be a bit slow. It can print its output as JSON, and can colorize status values.
Usage: /usr/local/bin/jobdata [ options ]

Options:
  -h, --help          print options list
  --color             colorize output
  --human             give human-readable timestamps
  --show-time-series  get time-series data
  --show-scripts      print script contents
  --json              print data as JSON
  --pretty            pretty-print JSON

Description:
  Report data for a single job. The jobid can be a single integer
  (e.g. 55555555) or a jobarray-formatted string (e.g. 55555555_55).

For additional information, type 'man jobdata' at the prompt.
Examples:
biowulf% jobdata 12345678
account             group
alloc_node          cn4304
avg_cpus            2
avg_mem             73728
command             /data/user/processing/batch/run.sh
comment
cores_per_socket
cpus_per_task       2
dependency
elapsed             1506
eligible_time       1671986133
end_time            1672072567
gres
job_id              12345678
job_name            run.sh
jobid               12345678
jobidarray          12345678
jobidraw            12345678
licenses
max_cpu_used        1
max_cpu_used_node   cn1234
max_cpu_used_time   1672004200
max_dth_used        0
max_dth_used_node   cn1234
max_dth_used_time   1672004200
max_gpu_used        0
max_gpu_used_node   cn1234
max_gpu_used_time   1672004200
max_mem_used        4823
max_mem_used_node   cn1234
max_mem_used_time   1672004264
max_rss             0
metastate           running
node_list           cn1234
nodes               cn1234
ntasks_per_core
ntasks_per_node
ntasks_per_socket
parentjobid
partition           norm
pn_cpus
pn_mem              73728
pn_tmp
priority            20068
qos                 global
queued              34
sbatch_cmd          sbatch --cpus-per-task 2 /data/user/processing/batch/run.sh
shared              0
sockets_per_node
start_time          1671986167
state               RUNNING
state_reason
stateraw            1
std_err             /data/user/processing/batch/run.err
std_in              /dev/null
std_out             /data/user/processing/batch/run.out
submit_time         1671986133
swarm_cmdline
threads_per_core
time_limit          86400
total_cpus          2
total_gpus
total_mem           73728
total_nodes         1
type                regular
uid                 66666
user                user
work_dir            /data/user/processing/batch

biowulf% jobdata --show-time-series 12345678
 timestamp   node     cpus   memory   gpus   dthr
-------------------------------------------------
1671945396   cn1234      2      875      0      0
1671945428   cn1234      1      612      0      0
1671945460   cn1234      2      876      0      0
1671945492   cn1234      2      904      0      0
1671945524   cn1234     11      886      0      8
1671945556   cn1234      0      881      0      0
1671945588   cn1234      0      697      0      0
1671945621   cn1234      0      569      0      0
1671945653   cn1234      0      535      0      0
1671945686   cn1234      1       38      0      0
1671945718   cn1234      1       38      0      0

biowulf% jobdata --show-time-series --human 12345678
      timestamp        node     cpus     memory     gpus   dthr
----------------------------------------------------------
2022-12-25T00:16:36    cn1234      2   875.000 MB      0      0
2022-12-25T00:17:08    cn1234      1   612.000 MB      0      0
2022-12-25T00:17:40    cn1234      2   876.000 MB      0      0
2022-12-25T00:18:12    cn1234      2   904.000 MB      0      0
2022-12-25T00:18:44    cn1234     11   886.000 MB      0      8
2022-12-25T00:19:16    cn1234      0   881.000 MB      0      0
2022-12-25T00:19:48    cn1234      0   697.000 MB      0      0
2022-12-25T00:20:21    cn1234      0   569.000 MB      0      0
2022-12-25T00:20:53    cn1234      0   535.000 MB      0      0
2022-12-25T00:21:26    cn1234      1    38.000 MB      0      0
2022-12-25T00:21:58    cn1234      1    38.000 MB      0      0
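The time-series rows are plain whitespace-delimited columns, so tools like awk can summarize them, e.g. to find the peak memory sample. A sketch with a few sample rows inlined via a hypothetical jobdata_ts function (the real data would come from jobdata --show-time-series):

```shell
# jobdata_ts stands in for `jobdata --show-time-series JOBID` data rows:
# timestamp node cpus memory(MB) gpus dthr
jobdata_ts() {
cat <<'EOF'
1671945396 cn1234 2 875 0 0
1671945524 cn1234 11 886 0 8
1671945686 cn1234 1 38 0 0
EOF
}
# Track the maximum of the memory column (field 4).
jobdata_ts | awk '$4 > max { max = $4 } END { print "peak MB:", max }'
```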
dashboard_cli is a command-line interface to data in the dashboard. Current values for CPU, memory, GPU, and threads in D state for running jobs are cached, so the values may be as much as one minute old. By default only pending, running, and recently finished jobs are displayed. After 10 days, finished jobs are shifted to a slower-response archival table and can be accessed by including the --archive option. Queries can be filtered by jobid, node, partition, qos, reservation, job state, and other criteria. The ordering of the rows can be changed. The output can be customized and displayed in plain text, tab-delimited, HTML, and JSON formats.
biowulf% dashboard_cli jobs -h
Usage: dashboard_cli jobs [ options ]

Description: Display job data from the dashboard

Options:

  Selection filters:

    -u,--user        choose another user (staff and root only)
    -j,--jobid       choose jobs, using one of these formats:
                       NNNNNNNN     - a single regular job
                       NNNNNNNN_NNN - a single subjob from an array
                       NNNNNNNN_    - an entire array of subjobs
    --joblist        provide a comma-delimited list of jobids
    -n,--node        show jobs that allocated this node
    --nodelist       show jobs that allocated a node from nodelist
    --partition      show jobs in this partition
    --qos            show jobs that ran with this qos
    --jobname        match jobname
    --state          show jobs with the given state
    --pending        show only pending jobs
    --running        show only running jobs
    --ended          show only ended jobs
    --zeroload       show jobs with zero load (requires --running)
    --overload       show jobs more than 1.5x cpu alloc (requires --running)
    --elapsed-min, --elapsed-max    show only jobs whose elapsed time is within this range
    --queued-min, --queued-max      show only jobs whose queued time is within this range
    --node-min, --node-max          show only jobs whose node counts are within this range
    --mem-max, --mem-min            show jobs whose allocated memory is within this range
    --mem-util-max, --mem-util-min  show jobs whose memory utilization is within this range
    --cpu-max, --cpu-min            show jobs whose allocated cpus are within this range
    --cpu-util-max, --cpu-util-min  show jobs whose cpu utilization is within this range
    --gpu-max, --gpu-min            show jobs whose allocated gpus are within this range
    --gpu-util-max, --gpu-util-min  show jobs whose gpu utilization is within this range
    --gpu-only       show jobs that allocated gpus
    --gpu-type       show jobs with specific gpu type (k20x,k80,p100,v100,v100x)
    --mem-over       show jobs that had overloaded memory
    --bad-swarm      show job arrays whose subjobs finished too quickly
    --is-active      return exit code of 0 if one or more jobs in a selected set
                     is PENDING or RUNNING, otherwise 1 (NOTE: the exit code will
                     be > 1 if an error occurs, see below)
    --since date     show jobs submitted since this date
    --until date     show jobs submitted up to and including this date
    --start-since    like --since, show jobs started since this date
    --start-until    like --until, show jobs started up to and including this date
    --end-since      like --since, show jobs ended since this date
    --end-until      like --until, show jobs ended up to and including this date
    --running-since, show jobs that were running between two time points,
    --running-until  both options must be given
    --archive        select jobs up to 1 year back (older jobs are stored in an
                     inaccessible format)

    Dates can be given in the following formats:

      Sat May 17 09:52:10 2008
      Sat May 17 09:52:10
      May 17 09:52:10 2008
      May 17 2008
      May 17
      5/17
      5/17/08
      5/17 2008
      2008-05-17
      lastyear
      lastmonth
      yesterday
      today
      forever
      1211032330
      -24h
      -2d
      -100m

    --fields         select fields to return
    --add-fields     include additional fields beyond the default

    Default fields:

      jobid state submit_time partition nodes cpus mem timelimit gres
      dependency queued_time state_reason start_time elapsed_time end_time
      cpu_max mem_max eval

    All fields:

      jobid jobidarray jobname user state submit_time state_reason state_desc
      priority partition qos reservation nodes cpus gpus mem timelimit gres
      dependency alloc_node command work_dir std_in std_out std_err start_time
      queued_time nodelist end_time elapsed_time exit_code cpu_cur cpu_min
      cpu_max cpu_avg cpu_util mem_cur mem_min mem_max mem_avg mem_util
      gpu_cur gpu_min gpu_max gpu_avg gpu_util D_cur D_min D_max D_avg eval
      comment

  Output formatting:

    --compact        fit as much data on screen as possible
    --order          comma-delimited list of fields to order by (ascending)
    --desc           requires --order, order descending
    --allfields      print out all fields
    --vertical       print data in vertical format
    --json           print output as JSON
    --HTML           print output as HTML table
    --tab            print data in simple tab-delimited format
    --noheader       don't show header rows and bars
    --raw            give raw values:
                       dates as seconds since epoch
                       time in seconds
                       memory as MB
    --null           use this string for null values (default = '-')

  Staff Only:

    --commands       query bonus commands view (can take much longer for search)
    --cmd            filter for a given command (15 character max)

  Other:

    -h, --help       print options list
    -d, --debug      run in debug mode
    -v, --verbose    increase verbosity level (0,1,2,3)

Exit codes:

  0 - no error
  1 - job(s) not active
  2 - database error
  3 - incorrect input
  4 - bad output
Here is the default example output:
biowulf% dashboard_cli jobs
jobid     state      submit_time          partition    nodes  cpus  mem    timelimit  gres          dependency  queued_time  state_reason  start_time           elapsed_time  end_time             cpu_max  mem_max  eval
===============================================================================================================================================================================================================================
10367055  COMPLETED  2021-03-14T19:00:08  norm         1      16    32 GB  16:00:00   lscratch:200  -           1:47         -             2021-03-14T19:01:55  2:33          2021-03-14T19:04:28  2        105 MB   -
10414438  COMPLETED  2021-03-15T06:00:58  interactive  1      4     3 GB   8:00:00    lscratch:10   -           0:21         -             2021-03-15T06:01:19  2:21          2021-03-15T06:03:40  1        6 MB     -
10419656  COMPLETED  2021-03-15T06:58:14  interactive  1      4     3 GB   8:00:00    -             -           0:24         -             2021-03-15T06:58:38  1:59:03       2021-03-15T08:57:41  5        223 MB   -
10429847  COMPLETED  2021-03-15T10:00:59  interactive  1      8     6 GB   8:00:00    -             -           0:01         -             2021-03-15T10:01:00  4:30          2021-03-15T10:05:30  -        -        -
You can select a time range and customize the output fields:
biowulf% dashboard_cli jobs --since -5d --fields jobid,partition,nodes,cpus,mem,gpus,state,elapsed_time,cpu_util,mem_util,gpu_util
jobid     partition    nodes  cpus  mem     gpus  state      elapsed_time  cpu_util  mem_util  gpu_util
=========================================================================================================
59635708  norm         1      16    32 GB         COMPLETED  4:15          0.0%      0.0%
59637407  norm         1      16    32 GB         CANCELLED
59637409  norm         1      16    32 GB         CANCELLED
59643816  interactive  1      1     768 MB        CANCELLED
59651107  interactive  1      4     3 GB          TIMEOUT    8:00:28
59652154  gpu          2      12    94 GB   8     COMPLETED  0:04
59652155  multinode    32     256   375 GB        COMPLETED  0:04
59652156  multinode    32     256   375 GB        CANCELLED  5:02          93.5%     16.4%
You can select a specific partition and order the rows by a field:
biowulf% dashboard_cli jobs --since forever --partition gpu --order queued_time --desc
jobid    state      submit_time          partition  nodes  cpus  mem    timelimit  gres         dependency  queued_time  state_reason  start_time           elapsed_time  end_time             cpu_max  mem_max  eval
=======================================================================================================================================================================================================================
9607388  COMPLETED  2021-03-05T08:31:43  gpu        2      12    94 GB  20:00      gpu:k80:4    -           9:46:26      -             2021-03-05T18:18:09  1:04          2021-03-05T18:19:13  9        2 GB     -
9607387  COMPLETED  2021-03-05T08:31:39  gpu        2      12    94 GB  20:00      gpu:k80:4    -           9:41:44      -             2021-03-05T18:13:23  1:08          2021-03-05T18:14:31  9        1 GB     -
9606589  COMPLETED  2021-03-05T07:10:33  gpu        2      8     62 GB  10:00      gpu:k80:4    -           11:08        -             2021-03-05T07:21:41  0:52          2021-03-05T07:22:33  7        520 MB   -
9606600  COMPLETED  2021-03-05T07:11:05  gpu        2      8     62 GB  10:00      gpu:k80:4    -           10:36        -             2021-03-05T07:21:41  0:52          2021-03-05T07:22:33  64       3 GB     -
9607226  COMPLETED  2021-03-05T08:11:37  gpu        2      12    94 GB  4:00:00    gpu:v100x:4  -           8:10         -             2021-03-05T08:19:47  0:09          2021-03-05T08:19:56  -        -        -
Get a list of all jobs that did not complete successfully in a given timeframe:
biowulf% dashboard_cli jobs --since 2021-04-14 --ended --fields jobid,partition,state | grep -v COMPLETED
jobid     partition  state
==================================
12705504  gpu        CANCELLED
12706060  norm       FAILED
12708353  norm       TIMEOUT
12709134  norm       FAILED
12708701  norm       FAILED
12708716  norm       FAILED
12709229  norm       FAILED
12708483  quick      FAILED
12708260  norm       FAILED
12708781  norm       FAILED
12708786  norm       FAILED
12708279  norm       FAILED
12708799  norm       FAILED
A single job can be dumped in full detail:
biowulf% dashboard_cli jobs --jobid 9606610 --allfields --vertical
jobid:        9606610
jobidarray:   -
jobname:      Refine3D_k80_003
user:         nobody
state:        COMPLETED
submit_time:  2021-03-05T07:11:29
state_reason: -
state_desc:   -
priority:     967374
partition:    gpu
qos:          global
reservation:  -
nodes:        2
cpus:         12
gpus:         8
mem:          94 GB
timelimit:    4:00:00
gres:         gpu:k80:4
dependency:   -
alloc_node:   biowulf2-e0
command:      /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.sh
work_dir:     /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003
std_in:       /dev/null
std_out:      /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.out
std_err:      /spin/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.err
start_time:   2021-03-05T07:17:47
queued_time:  6:18
nodelist:     cn4179,cn4190
end_time:     2021-03-05T09:44:05
elapsed_time: 2:26:18
exit_code:    0
cpu_cur:      4
cpu_min:      8
cpu_max:      9
cpu_avg:      9.00
cpu_util:     75.0%
mem_cur:      11 GB
mem_min:      5 GB
mem_max:      34 GB
mem_avg:      23 GB
mem_util:     24.7%
gpu_cur:      4
gpu_min:      4
gpu_max:      8
gpu_avg:      7.98
gpu_util:     99.8%
D_cur:        0
D_min:        0
D_max:        5
D_avg:        0
eval:         -
comment:      -
The output can be compacted to fit within a screen, and current values for cpu, memory, gpu, and threads in D state can be displayed:
biowulf% dashboard_cli jobs --compact --running --fields jobid,partition,qos,elapsed_time,nodes,cpus,cpu_cur,cpu_max,cpu_avg,D_cur,D_avg,mem,mem_cur,mem_max,mem_avg,gpus,gpu_cur,gpu_max,gpu_avg,nodelist
jobid    partition  qos     etime       n  c   cc  cx  ca     dc  da  m      mc    mx    ma    g   gc  gx  ga     nodelist
==================================================================================================================================
9593505  gpu        global  1-15:56:48  4  32  30  33  25.00  0   1   480GB  23GB  27GB  22GB  16  17  17  17.00  cn[2349-2352]
9593563  gpu        global  1-14:34:46  3  24  23  26  20.00  0   1   360GB  17GB  21GB  16GB  12  13  13  13.00  cn[2392-2394]
A job can be monitored continuously using --is-active in a bash script:
while dashboard_cli jobs --is-active --jobid 32535313
do
    echo "job is pending or running"
    sleep 30
done
echo job has finished
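Since --is-active distinguishes "not active" (exit 1) from errors (exit > 1), a script can handle the three cases separately. In this sketch, job_status is a hypothetical wrapper, and false stands in for the real dashboard_cli call so the snippet runs anywhere:

```shell
# Hypothetical wrapper mapping dashboard_cli --is-active exit codes to messages.
job_status() {
    # Stand-in for: dashboard_cli jobs --is-active --jobid "$1"
    # `false` simulates exit code 1 (job not active) so the sketch is runnable.
    false
    case $? in
        0) echo "active" ;;       # job is PENDING or RUNNING
        1) echo "finished" ;;     # no selected job is active
        *) echo "query error" ;;  # exit codes 2-4: database/input/output errors
    esac
}
job_status 32535313
```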
The whereami command is a small utility that provides some information about where you are.
biowulf$ whereami
## You are on a login node

biowulf.nih.gov is a login node for the Biowulf cluster. It should only be
used to submit jobs. Many modules are not available here and any compute
intensive processes or file transfers will get killed.

- If you want to do file transfers or larger file manipulation please use
  helix.nih.gov, an sinteractive session, or an OpenOndemand desktop session
- For data processing please use an sinteractive session, a batch job, or
  one of the OpenOndemand interactive sessions.

See https://hpc.nih.gov/systems/

biowulf$ sinteractive --mem=10g --cpus-per-task=2 --gres=lscratch:10,gpu:1
salloc: Pending job allocation 24022264
salloc: job 24022264 queued and waiting for resources
salloc: job 24022264 has been allocated resources
salloc: Granted job allocation 24022264
salloc: Waiting for resource configuration
salloc: Nodes cn3088 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
error: unable to open file /tmp/slurm-spank-x11.24022264.0
slurmstepd: error: x11: unable to read DISPLAY value

node$ whereami -f pretty
┌ You are in an sinteractive session ───────────────────────────────────────┐
│                                                                           │
│ Interactive sessions are for exploratory data analysis, interactive work, │
│ debugging, development, and testing. They can also be used for jupyter    │
│ notebooks, rstudio-server, and tools that require GUIs.                   │
│                                                                           │
│ Session summary on cn3088:                                                │
│                                                                           │
│   JobId:          24022264                                                │
│   CPUs:           2                                                       │
│   Memory:         10G                                                     │
│   GPUs:           1 k80                                                   │
│   lscratch:       10G                                                     │
│   Remaining time: 7h59m34s                                                │
│                                                                           │
│ See https://hpc.nih.gov/docs/userguide.html#int                           │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

node$ whereami -f short
sinteractive
The getfacl_path command can be used to display permissions for a given path in great detail.
biowulf% getfacl_path -h
Usage: /usr/local/bin/getfacl_path [ options ]

Options:
  -p, --path     path
  -h, --help     print options list
  -v, --verbose  increase verbosity level (0,1,2,3)

Description:
  Display access permissions for a single path

Please see https://hpc.nih.gov/storage/acls.html for info about ACLs.

NOTE: Default ACLs are abbreviated as DACL
Example:
biowulf% getfacl_path -p /data/MISC123/projects/ABCD/20230101/data
lrwxrwxrwx. root  root     /data -> ./spin1/USERS1
drwxr-xr-x. root  root     /spin1/
drwxrwxr-x. root  root     └── USERS1/
lrwxrwxrwx. root  root         └── MISC123 -> /gs11/users/MISC123
lrwxrwxrwx. root  root             └── gs11 -> /gpfs/gsfs11
drwxr-xr-x. root  root     /gpfs/
drwxr-xr-x. root  root     └── gsfs11/
drwxr-xr-x. root  root         └── users/
drwxr-x---+ user1 MISC123          └── MISC123/
                                       (ACL) r-x users: user1,user2,user3,user4,user5, + 24 more!
drwxr-xr-x+ user1 MISC123              └── projects/
                                           (ACL) r-x users: user2,user3,user4
drwxr-xr-x+ user1 MISC123                  └── ABCD/
                                               (ACL) r-x users: user2
                                               (ACL) r-x users: user3
                                               (DACL) r-x users: user3
drwxr-xr-x+ user1 MISC123                      └── 20230101/
                                                   (ACL) r-x users: user2
                                                   (ACL) r-x users: user3
                                                   (DACL) r-x users: user3
drwxr-xr-x+ user1 MISC123                          └── data/
                                                       (ACL) r-x users: user3
                                                       (DACL) r-x users: user3
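A report like this walks every component of the path and inspects its permissions. A rough, generic sketch of that walk using standard GNU stat (walk_perms is hypothetical and shows only mode, owner, and group, not ACLs):

```shell
# Print mode, owner, group, and name for every component of an absolute path,
# similar in spirit to getfacl_path's per-directory walk.
walk_perms() {
    local cur= part parts
    IFS=/ read -ra parts <<< "${1#/}"   # split "/usr/bin" into (usr bin)
    for part in "${parts[@]}"; do
        cur="$cur/$part"
        stat -c '%A %U %G %n' "$cur" 2>/dev/null || break
    done
}
walk_perms /usr/bin
```

The real tool additionally follows symlinks and prints POSIX ACL entries, which stat alone does not report.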
The setfacl_path command can be used to set ACLs for a single path.
biowulf% setfacl_path -h
Usage: /usr/local/bin/setfacl_path [ options ]

Options:
  -p, --path     path
  -a, --acl      acl to set (e.g. '-m u:nobody:rX')
  -d, --dryrun   don't actually make the change
  -h, --help     print options list
  -v, --verbose  increase verbosity level (0,1,2,3)

Description:
  Set access permissions for a single path

Please see https://hpc.nih.gov/storage/acls.html for info about ACLs.

NOTE 1: Default ACLs are abbreviated as DACL
NOTE 2: Any file or directory owned by root is not changed
NOTE 3: Symbolic links are not changed
Example:
biowulf% setfacl_path -p /data/MISC123/projects/git/web/favicon.ico -a '-m u:nobody:rX' -v 2
===========================================================================================
/data/MISC123/projects/git/web/favicon.ico
===========================================================================================
setfacl -m u:nobody:rX '/gpfs/gsfs11/users/MISC123'
setfacl -m u:nobody:rX '/gpfs/gsfs11/users/MISC123/projects/git'
setfacl -m u:nobody:rX '/gpfs/gsfs11/users/MISC123/projects/git/web'
setfacl -m u:nobody:rX '/gpfs/gsfs11/users/MISC123/projects/git/web/favicon.ico'
===========================================================================================
lrwxrwxrwx. root  root     /data -> ./spin1/USERS1
drwxr-xr-x. root  root     /spin1/
drwxrwxr-x. root  root     └── USERS1/
lrwxrwxrwx. root  root         └── MISC123 -> /gs11/users/MISC123
lrwxrwxrwx. root  root             └── gs11 -> /gpfs/gsfs11
drwxr-xr-x. root  root     /gpfs/
drwxr-xr-x. root  root     └── gsfs11/
drwxr-xr-x. root  root         └── users/
drwxr-x---+ user1 MISC123          └── MISC123/
                                       (ACL) r-x users: user1,user2,user3,user4,user5, + 25 more!
lrwxrwxrwx. user1 MISC123              └── projects
                                           (ACL) r-x users: user2,user3,user4,nobody
drwxr-xr-x+ user1 MISC123              └── git/
                                           (ACL) r-x nobody
drwxr-xr-x+ user1 MISC123                  └── web/
                                               (ACL) r-x nobody
-rw-r--r--+ user1 MISC123                      └── favicon.ico
                                                   (ACL) r-- nobody
The highlightUnicode command will find lines in a file that contain Unicode characters and attempt to highlight those characters.
Example:
biowulf% highlightUnicode unicode.txt
1:This is a file that contains unicode characters '소녀시대' That snuck in somehow
2:Strange space character: [ ] em space U+2003
4:Another funny character '█'
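On systems without highlightUnicode, GNU grep can locate (though not prettily highlight) such lines by matching any non-ASCII byte. A rough stand-in sketch, with a sample file created on the fly so the snippet runs anywhere:

```shell
# Rough stand-in for highlightUnicode: report line numbers of lines containing
# any byte outside the ASCII range, using GNU grep's PCRE mode (-P).
printf 'plain ascii line\nem space:[\xe2\x80\x83] U+2003\n' > /tmp/unicode_demo.txt
grep -nP '[^\x00-\x7F]' /tmp/unicode_demo.txt
```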
The highlightTabs command will highlight tab characters in a file.
Example:
biowulf% highlightTabs tab_delimit.txt
1:This➤file➤has➤tab delimited➤fields➤for➤searching
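A rough stand-in for highlightTabs using standard sed: replace each tab with a visible marker character (the ➤ marker mimics the output above; any character would do):

```shell
# Make tab characters visible by substituting a marker for each one.
printf 'col1\tcol2\tcol3\n' | sed 's/\t/➤/g'
```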