Biowulf High Performance Computing at the NIH

Biowulf Utilities

Utilities developed in-house for use on the NIH Biowulf cluster.
freen
The 'freen' command gives an instantaneous report of free nodes, CPUs, and GPUs on the cluster. (In the example below, only a subset of the features is displayed, for clarity.)

Note: The example below does not describe the current status of free nodes, CPUs, or partitions on Biowulf; it is just an example. To see the current status, type 'freen' on Biowulf.

biowulf$ freen

                                                     .......Per-Node Resources.......
Partition   FreeNds      FreeCPUs          FreeGPUs  Cores  CPUs  GPUs   Mem   Disk  Features
--------------------------------------------------------------------------------------------------------
norm*       1/525        5380/29254        -            28    56     -   248g   800g cpu56,core28,g256 [...]
norm*       434/487      14720/15584       -            16    32     -   121g   800g cpu32,core16,g128 [...]
norm*       181/527      15858/29512       -            28    56     -   246g   400g cpu56,core28,g256 [...]
unlimited   1/14         156/418           -            16    32     -   121g   800g cpu32,core16,g128 [...]
unlimited   0/4          0/116             -            28    56     -   246g   400g cpu56,core28,g256 [...]
multinode   94/605       6982/33880        -            28    56     -   248g   800g cpu56,core28,g256 [...]
multinode   1/190        32/6080           -            16    32     -    58g   800g cpu32,core16,g64  [...]
multinode   14/535       784/29960         -            28    56     -   246g   400g cpu56,core28,g256 [...]
gpu         1/22         662/704           23/44        16    32     2   121g   800g [...] x2650,gpuk20x
gpu         25/48        2130/2688         105/192      28    56     4   121g   800g [...] x2680,ibfdr,gpup100
gpu         0/68         1788/3808         36/272       28    56     4   248g   800g [...] x2680,ibfdr,gpuk80
gpu         7/7          392/392           28/28        28    56     4   121g   800g [...] x2680,ibfdr,gpuv100
huygens     1/1          56/56             4/4          28    56     4   248g   800g [...] x2680,gpuk80,huygens
largemem    0/4          16/256            -            32    64     -  1005g   800g cpu64,core32,g1024 [...]
largemem    0/4          480/576           -            72   144     -  3025g   800g cpu144,core72,g3072 [...]
largemem    0/20         1044/2880         -            72   144     -  1510g   800g cpu144,core72,g1536 [...]
nimh        0/64         1128/2048         -            16    32     -   121g   800g cpu32,core16,g128 [...]
ccr         0/72         382/4032          -            28    56     -   248g   800g cpu56,core28,g256 [...]
ccr         0/116        376/6496          -            28    56     -   246g   400g cpu56,core28,g256 [...]
ccrlcb      1/42         1164/2352         -            28    56     -   246g   400g cpu56,core28,g256 [...]
ccrclin     4/4          224/224           -            28    56     -   246g   400g cpu56,core28,g256 [...]
ccrgpu      1/16         256/896           4/64         28    56     4   246g   400g cpu56,core28,g256 [...]
ccrlcbgpu   3/8          428/448           27/32        28    56     4   246g   400g cpu56,core28,g256 [...]
quick       0/64         1128/2048         -            16    32     -   121g   800g cpu32,core16,g128 [...]
quick       0/72         382/4032          -            28    56     -   248g   800g cpu56,core28,g256 [...]
quick       0/56         0/1792            -            16    32     -   121g   400g cpu32,core16,g128 [...]
quick       0/116        376/6496          -            28    56     -   246g   400g cpu56,core28,g256 [...]
quick       1/16         256/896           4/64         28    56     4   246g   400g cpu56,core28,g256 [...]
quick       51/57        1774/1824         -            16    32     -    58g   400g cpu32,core16,g64 [...]
quick       335/335      8040/8040         -            12    24     -    19g   100g cpu24,core12,g24 [...]
quick       277/277      4432/4432         -             8    16     -    19g   200g cpu16,core8,g24 [...]
quick       10/10        160/160           -             8    16     -    66g   200g cpu16,core8,g72 [...]
student     1/19         572/608           20/38        16    32     2   121g   800g cpu32,core16,g128 [...]
student     280/306      9360/9792         -            16    32     -   121g   800g cpu32,core16,g128 [...]

The freen output has one line for each unique type of node in a partition. For example, there are 2 types of nodes in the ccr (NCI_CCR) partition, differing in their processor type or amount of memory, so there are 2 lines for 'ccr' in the freen output.

Explanation of the columns:

Col. Title      Explanation
1    Partition  Name of the partition. The asterisk next to 'norm' indicates that it is the default partition.
2    FreeNds    The number of nodes that are completely free in this partition, e.g. 1/525 indicates that only 1 of 525 nodes is completely unallocated.
3    FreeCPUs   The number of free CPUs available in this partition. Since the batch system allocates by core (2 CPUs), there may be free CPUs available on many partially allocated nodes, in addition to the free CPUs on the entirely unallocated nodes in Column 2.
4    FreeGPUs   The number of free GPUs available in this partition, where applicable.
5    Cores      The number of physical cores on a node in this partition. This may help determine the maximum number of threads that an application should run.
6    CPUs       The number of CPUs on a node in this partition. This number counts hyperthreaded CPUs.
7    GPUs       The number of GPUs on a node in this partition, where applicable.
8    Mem        Memory per node in this partition.
9    Disk       Local disk space on nodes in this partition. This determines the max local disk that can be allocated in a job using lscratch.
10   Features   The batch system 'features' associated with nodes in this partition. They are only important if you require a particular type of processor (e.g. for parallel jobs that should all run on the same processor type) or a node-locked license such as acemd.
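
Because freen output is plain text, it is easy to filter with standard tools. Below is a minimal sketch that lists partitions which still have completely free nodes; the hard-coded variable stands in for the first columns of live freen output (on Biowulf you would pipe the real command, e.g. freen | awk '...'):

```shell
# Hypothetical snapshot of the Partition and FreeNds columns of freen output;
# on Biowulf, pipe the live command instead: freen | awk '...'
freen_output='norm*       1/525        5380/29254
gpu         25/48        2130/2688
largemem    0/4          16/256'

# Column 2 is FreeNds as free/total; print partitions with at least 1 free node.
echo "$freen_output" | awk '{ split($2, f, "/"); if (f[1] > 0) print $1, f[1] }'
```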

batchlim

The batchlim command shows the current limits per user per partition.

Note: The example below does not describe the current limits on Biowulf -- it is just an example of the batchlim output. To see the current limits, type 'batchlim' on Biowulf, or see the batch limits on the system status page (NIH-only).

biowulf$ batchlim


Max jobs per user: 4000
Max array size:    1001

Partition        MaxCPUsPerUser     DefWalltime     MaxWalltime
---------------------------------------------------------------
norm                     2048         02:00:00     10-00:00:00
multinode                2048         08:00:00     10-00:00:00
     turbo qos           4096                         08:00:00
interactive                64         08:00:00      1-12:00:00 (3 simultaneous jobs)
quick                    6144         02:00:00        02:00:00
largemem                  128         04:00:00     10-00:00:00
gpu                       128      10-00:00:00     10-00:00:00 (8 GPUs per user)
unlimited                 128        UNLIMITED       UNLIMITED
ccr                      3072         04:00:00     10-00:00:00
ccrclin                   224         04:00:00     10-00:00:00
ccrlcb                   1024         04:00:00     10-00:00:00
niddk                    1024         04:00:00     10-00:00:00
nimh                     1024         04:00:00     10-00:00:00

Explanation of the columns:

Partition       Partition name.
MaxCPUsPerUser  Max. simultaneous CPUs that the batch system will allocate to a single user. Once the max has been allocated to a user, that user's remaining jobs will stay in the queue until some jobs complete.
DefWalltime     Default walltime for this partition. If no walltime is specified when submitting the job, this is the default walltime allocation. The example output above indicates that the default walltime for the 'norm' partition is 2 hrs.
                Format: DD-HH:MM:SS (Days, Hours, Mins, Secs)
MaxWalltime     Max walltime that can be requested for a job in this partition. In the example above, the maximum walltime that can be requested on the 'norm' partition is 10 days.
                Format: DD-HH:MM:SS (Days, Hours, Mins, Secs)
Priority        Partition priority. The scheduling algorithm uses a number of factors to determine which jobs run first; one of these is the partition. Interactive jobs get a higher priority and are scheduled earlier, but in most other cases this is irrelevant for users.
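
For comparing these limits in scripts, the walltime strings can be converted to seconds. A small sketch with a hypothetical helper function (not a Biowulf utility), handling both the DD-HH:MM:SS and HH:MM:SS forms:

```shell
# Convert a Slurm walltime string (DD-HH:MM:SS or HH:MM:SS) to seconds.
# awk treats zero-padded fields like "02" as plain decimal numbers.
to_seconds() {
  echo "$1" | awk -F'[-:]' '{ if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4;
                              else         print $1*3600  + $2*60   + $3 }'
}

to_seconds 10-00:00:00   # 10 days -> 864000
to_seconds 02:00:00      # 2 hours -> 7200
```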

sjobs

squeue reports all jobs on the cluster; squeue -u username reports your own jobs. An in-house variant of squeue is sjobs, which presents the information in a different format. Slurm commands like squeue are very flexible, so you can easily create your own aliases.

Examples of squeue and sjobs:

[biowulf ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22392     norm meme_sho   user   PD       0:00      1 (Dependency)
             22393     norm meme_sho   user   PD       0:00      1 (Dependency)
             22404     norm   stmv-1   user    R      10:04      1 cn0414
             22391     norm meme_sho   user    R      10:06      1 cn0413

[biowulf ~]$ sjobs
                                                       ................Requested.................
User     JobId  JobName    Part  St    Runtime  Nodes CPUs     Mem        Dependency    Features    Nodelist
user     22391  meme_short norm  R      10:09   1     32   1.0GB/cpu                     (null)    cn0413         
user     22404  stmv-1     norm  R      10:07   1     32   1.0GB/cpu                     (null)    cn0414         
user     22392  meme_short norm  PD      0:00   1      2   1.0GB/cpu   afterany:22391    (null)    (Dependency)   
user     22393  meme_short norm  PD      0:00   1      4   1.0GB/cpu   afterany:22392    (null)    (Dependency)   

sjobs output:

User, JobId and JobName are self-explanatory.
Part        The partition the job was submitted to (for pending jobs) or is running on (for running jobs).
Runtime     How long the job has been running. Reported as 0:00 for pending jobs.
Nodes, CPUs, Mem    Requested resources for the job.
Dependency  Whether the job is dependent on another job's completion. [More info about job dependencies]
Features    Any features requested for this job, e.g. processor type.
Nodelist    Nodes on which the job is running (running jobs only). For pending jobs, this column displays the reason why the job is pending, if known.

If sjobs does not meet your needs and there are particular columns in the squeue output that you would like to see, it is trivial to set up an alias that reports your preferred fields. Below is an example of the default squeue output, followed by setting up an alias that reports specific fields.

biowulf$ squeue -u helixapp
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          20481173      norm viruses2 helixapp PD       0:00      1 (Resources)
	  
biowulf$ alias mysq="squeue -u helixapp -o '%18i %10j %10P %20S %20e'"

biowulf$ mysq
JOBID              NAME       PARTITION  START_TIME          END_TIME
20481173           viruses2.p norm       2017-06-27T07:32:28 2017-06-27T11:32:28

'mysq' reports that the batch system expects to start the job on June 27, 2017, at 7:32 am. Based on the default or requested walltime for this job (4 hrs), it is expected to end at 11:32 am the same day. Note that the expected start and end times may not always be calculable by the batch system, depending on the state of the batch queue.

See the Slurm squeue webpage for a full list of available squeue parameters.

jobload

jobload reports running jobs along with their %CPU usage and memory usage.

usage: jobload [-h] (-u USER | -n NODE | -j JOBID) [-v]

query SLURM by user, node or job ID

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  query jobs by user
  -n NODE, --node NODE  query jobs by node
  -j JOBID, --jobid JOBID
                        query jobs by job id
  -v, --verbose         increase output verbosity
  

Example:

  [someuser@biowulf]$ jobload -u someuser
           JOBID            TIME            NODES   CPUS  THREADS   LOAD             MEMORY
                     Elapsed / Wall                Alloc   Active                Used/Alloc
        20451459    05:27:10 / 1-00:00:00  cn0335      6        5    83%      1.3 GB/4.9 GB

Explanation of the columns:

Jobid    The batch job ID.
Time     Elapsed: how long the job has been running (5 hrs, 27 mins, 10 secs in the example above).
         Wall: the requested walltime for this job (1 day in the example above).
         If a job is getting close to its walltime, you can extend the walltime with the newwall command.
Nodes    The nodes that have been allocated to this job. In the example above, the job is running on cn0335.
CPUs (alloc)      The number of CPUs that have been allocated to this job. This cannot be changed after the job has started.
Threads (active)  The number of active threads running on the allocated CPUs. Ideally, threads and CPUs would be a 1-to-1 match. If the number of threads is significantly more than the allocated CPUs, the job is overloaded and should probably be stopped and resubmitted with a larger CPU request. In the example above, the job is running 5 threads on 6 allocated CPUs, so all is well.
Load     Ratio of active threads to allocated CPUs, shown as a percentage.
Memory (Used/Alloc)  Memory used and allocated. If the process uses more memory than is allocated, the batch system will kill the job. In the example above, the job is using 1.3 GB of an allocated 4.9 GB.
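
The LOAD column is simply active threads divided by allocated CPUs. The 83% figure in the jobload example above can be reproduced with shell arithmetic:

```shell
# LOAD = active threads / allocated CPUs, shown as a percentage
# (numbers taken from the jobload example: 5 threads on 6 CPUs).
threads=5
cpus=6
printf 'load: %d%%\n' $(( threads * 100 / cpus ))   # load: 83%
```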

jobhist

jobhist reports the allocated CPUs and the memory usage of completed jobs. Data is stored for at least a month after the job completes. Example:

biowulf% jobhist 17500

JobId              : 17500
User               : user  
Submitted          : 20150504 13:16:26
Command Path       : /spin1/users/someuser/meme
Submission Command : sbatch --ntasks=28 --time=01:30:00 --constraint=x2600 --exclusive meme_short.slurm

Partition     State      AllocNodes     AllocCPUs     Walltime         MemReq      MemUsed    Nodelist
      norm  COMPLETED          1           32         00:15:55       0.8GB/cpu       2.0GB    p1028               
      

Jobid The batch job ID
User The user who submitted this job.
Submitted Timestamp of job submission. In the example above, the job was submitted on 4 May 2015 at 1:16:26 pm.
Command Path The directory from which the job was submitted. By default, the output file slurm-###### will appear in this directory. (slurm-17500 in this example).
Submission Command The command that was used to submit this job. In the example above, 28 tasks were requested (--ntasks=28), a specific processor type (--constraint=x2600) and exclusive node allocation were requested. A walltime of 1 hr, 30 mins was requested.
Partition The partition that the job ran on (In this case, 'norm')
State The exit state of the job, e.g. COMPLETED, FAILED, etc. In this case, the batch system thinks that the job completed successfully.
AllocNodes The number of nodes on which the job ran. (In this case, 1 node)
AllocCPUs The number of CPUs on which the job ran. (In this case, 32 CPUs)
Walltime How long the job ran for. This value can be used to decide on the walltime for future jobs. In the example above, a walltime of 90 mins was requested, but the job completed in 15 mins. Based on this, future similar jobs could be submitted with walltime requests of 20 or 30 mins.
MemReq Allocated memory. In this case no specific memory was requested (i.e. no --mem flag in the submission command line) so the job was allocated the default memory of 0.8 GB/cpu = 25.6 GB total.
MemUsed Memory used. In this case, 2.0 GB of memory was used by the job. Based on this, future such jobs could be submitted with --mem=3g.
Nodelist Node(s) on which the job ran.
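
When choosing --mem for a resubmission based on the MemUsed value, a common convention is to pad the observed peak by about 50% and round up to a whole GB. A hypothetical sketch (the padding factor is an assumption, not a Biowulf recommendation):

```shell
# Pad the MemUsed value reported by jobhist (here 2.0 GB, as in the example
# above) by ~50% and round up to a whole GB for the next --mem request.
used_gb=2.0
awk -v u="$used_gb" 'BEGIN { gb = u * 1.5; printf "--mem=%dg\n", (gb == int(gb)) ? gb : int(gb) + 1 }'   # --mem=3g
```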

newwall

'newwall' allows users to increase the walltime of their jobs, up to the partition walltime limit.

biowulf% sbatch sleep.bat
20276034

biowulf% sjobs -u user
User    JobId     JobName    Part  St  Reason  Runtime  Walltime  Nodes  CPUs  Memory   Dependency Nodelist
=============================================================================================================
user  20276034  sleep.bat  norm  R   ---        0:10   2:00:00      1     2  2GB/cpu              cn3164
=============================================================================================================

biowulf% newwall -j 20276034 -t 4:00:00
Sucessfully updated 1 job(s).

biowulf% sjobs -u user
User    JobId     JobName    Part  St  Reason  Runtime  Walltime  Nodes  CPUs  Memory   Dependency Nodelist
=============================================================================================================
user  20276034  sleep.bat  norm  R   ---        0:30   4:00:00      1     2  2GB/cpu              cn3164
=============================================================================================================

jobdata

'jobdata' returns a large amount of data for a single job. It pulls from multiple sources and can be a bit slow. It can print its output as JSON and can colorize the status output.

Usage: /usr/local/bin/jobdata [ options ] 

Options:

  -h, --help           print options list

  --color              colorize output
  --human              give human-readable timestamps
  --show-time-series   get time-series data
  --show-scripts       print script contents
  --json               print data as JSON
  --pretty             pretty-print JSON

Description:

  Report data for a single job.  The jobid can be a single integer (e.g.
  55555555) or as a jobarray formatted string (e.g. 55555555_55).

For additional information, type 'man jobdata' at the prompt.

dashboard_cli

'dashboard_cli' is a command-line interface to data in the dashboard. Current values for CPU, memory, GPU, and threads in D state for running jobs are cached, so the values may be up to one minute old. By default, only pending, running, and recently finished jobs are displayed. After 10 days, finished jobs are moved to a slower-response archive table, and can be accessed by including the --archive option. Queries can be filtered by jobid, node, partition, qos, reservation, job state, and other criteria. The ordering of the columns can be changed, and the output can be customized and displayed in plain text, tab-delimited, HTML, and JSON formats.

biowulf% dashboard_cli jobs -h

Usage: dashboard_cli jobs [ options ]

Description:

  Display job data from the dashboard

Options:

  Selection filters:

  -u,--user         choose another user (staff and root only)
  -j,--jobid        choose jobs, using one of these formats:
                      NNNNNNNN     - a single regular job
                      NNNNNNNN_NNN - a single subjob from an array
                      NNNNNNNN_    - an entire array of subjobs
  --joblist         provide a comma-delimited list of jobids
  -n,--node         show jobs that allocated this node
  --nodelist        show jobs that allocated a node from nodelist
  --partition       show jobs in this partition
  --qos             show jobs that ran with this qos
  --jobname         match jobname
  --state           show jobs with the given state
  --pending         show only pending jobs
  --running         show only running jobs
  --ended           show only ended jobs
  --zeroload        show jobs with zero load (requires --running)
  --overload        show jobs more than 1.5x cpu alloc (requires --running)
  --elapsed-min, --elapsed-max
                    show only jobs whose elapsed time are within this range
  --queued-min, --queued-max
                    show only jobs whose queued time are within this range
  --node-min, --node-max
                    show only jobs whose node counts are within this range
  --mem-max, --mem-min
                    show jobs whose allocated memory are within this range
  --mem-util-max, --mem-util-min
                    show jobs whose memory utilization are within this range
  --cpu-max, --cpu-min
                    show jobs whose allocated cpus are within this range
  --cpu-util-max, --cpu-util-min
                    show jobs whose cpu utilization are within this range
  --gpu-max, --gpu-min
                    show jobs whose allocated gpus are within this range
  --gpu-util-max, --gpu-util-min
                    show jobs whose gpu utilization are within this range
  --gpu-only        show jobs that allocated gpus
  --gpu-type        show jobs with specific gpu type (k20x,k80,p100,v100,v100x)
  --mem-over        show jobs that had overloaded memory
  --bad-swarm       show job arrays whose subjobs finished too quickly
  --is-active       return exit code of 0 if one or more jobs in a selected set
                    is PENDING or RUNNING, otherwise 1 (NOTE: the exit code will
                    be > 1 if an error occurs, see below)

  --since date      show jobs submitted since this date
  --until date      show jobs submitted up to and including this date
  --start-since     like --since, show jobs started since this date
  --start-until     like --until, show jobs started up to and including this date
  --end-since       like --since, show jobs ended since this date
  --end-until       like --until, show jobs ended up to and including this date
  --running-since,  show jobs that were running between two time points,
  --running-until     both options must be given

  --archive         pull from WAAAYYY back

  Dates can be given in the following formats:

    Sat May 17 09:52:10 2008  Sat May 17 09:52:10       May 17 09:52:10 2008      
    May 17 2008               May 17                    5/17                      
    5/17/08                   5/17                      2008-05-17                
    lastyear                  lastmonth                 yesterday                 
    today                     forever                   1211032330                
    -24h                      -2d                       -100m                     

  --fields           select fields to return
  --add-fields       include additional fields beyond the default

  Default fields:

    jobid         state         submit_time   partition     nodes         
    cpus          mem           timelimit     gres          dependency    
    queued_time   state_reason  start_time    elapsed_time  end_time      
    cpu_max       mem_max       eval          

  All fields:

    jobid         jobidarray    jobname       user          state         
    submit_time   state_reason  state_desc    priority      partition     
    qos           reservation   nodes         cpus          gpus          
    mem           timelimit     gres          dependency    alloc_node    
    command       work_dir      std_in        std_out       std_err       
    start_time    queued_time   nodelist      end_time      elapsed_time  
    exit_code     cpu_cur       cpu_min       cpu_max       cpu_avg       
    cpu_util      mem_cur       mem_min       mem_max       mem_avg       
    mem_util      gpu_cur       gpu_min       gpu_max       gpu_avg       
    gpu_util      D_cur         D_min         D_max         D_avg         
    eval          comment       

  Output formatting:

  --compact         fit as much data on screen as possible
  --order           comma-delimited list of fields to order by (ascending)
  --desc            requires --order, order descending
  --allfields       print out all fields
  --vertical        print data in vertical format
  --json            print output as JSON
  --HTML            print output as HTML table
  --tab             print data in simple tab-delimited format
  --noheader        don't show header rows and bars
  --raw             give raw values:
                      dates as seconds since epoch
                      time in seconds
                      memory as MB
  --null            use this string for null values (default = '-')
 
  Other:

  -h, --help        print options list
  -d, --debug       run in debug mode
  -v, --verbose     increase verbosity level (0,1,2,3)

  Exit codes:

    0 - no error
    1 - job(s) not active
    2 - database error
    3 - incorrect input
    4 - bad output

Here is an example of the default output:

biowulf% dashboard_cli jobs

jobid     state      submit_time          partition    nodes  cpus  mem     timelimit  gres           dependency  queued_time  state_reason    start_time           elapsed_time  end_time             cpu_max  mem_max  eval
===============================================================================================================================================================================================================================
10367055  COMPLETED  2021-03-14T19:00:08  norm             1    16   32 GB   16:00:00  lscratch:200   -                  1:47  -               2021-03-14T19:01:55          2:33  2021-03-14T19:04:28        2   105 MB  -
10414438  COMPLETED  2021-03-15T06:00:58  interactive      1     4    3 GB    8:00:00  lscratch:10    -                  0:21  -               2021-03-15T06:01:19          2:21  2021-03-15T06:03:40        1     6 MB  -
10419656  COMPLETED  2021-03-15T06:58:14  interactive      1     4    3 GB    8:00:00  -              -                  0:24  -               2021-03-15T06:58:38       1:59:03  2021-03-15T08:57:41        5   223 MB  -
10429847  COMPLETED  2021-03-15T10:00:59  interactive      1     8    6 GB    8:00:00  -              -                  0:01  -               2021-03-15T10:01:00          4:30  2021-03-15T10:05:30        -        -  -

You can select a time range and customize the output fields:

biowulf% dashboard_cli jobs --since -5d --fields jobid,partition,nodes,cpus,mem,gpus,state,elapsed_time,cpu_util,mem_util,gpu_util

jobid     partition    nodes  cpus  mem     gpus  state      elapsed_time  cpu_util  mem_util  gpu_util
=========================================================================================================
59635708  norm             1    16   32 GB        COMPLETED          4:15      0.0%      0.0%
59637407  norm             1    16   32 GB        CANCELLED
59637409  norm             1    16   32 GB        CANCELLED
59643816  interactive      1     1  768 MB        CANCELLED
59651107  interactive      1     4    3 GB        TIMEOUT         8:00:28
59652154  gpu              2    12   94 GB  8     COMPLETED          0:04
59652155  multinode       32   256  375 GB        COMPLETED          0:04
59652156  multinode       32   256  375 GB        CANCELLED          5:02     93.5%     16.4%

You can select a specific partition and order the columns based on fields:

biowulf% dashboard_cli jobs --since forever --partition gpu --order queued_time --desc

jobid    state      submit_time          partition  nodes  cpus  mem    timelimit  gres         dependency  queued_time  state_reason  start_time           elapsed_time  end_time             cpu_max  mem_max  eval
=======================================================================================================================================================================================================================
9607388  COMPLETED  2021-03-05T08:31:43  gpu            2    12  94 GB      20:00  gpu:k80:4    -               9:46:26  -             2021-03-05T18:18:09          1:04  2021-03-05T18:19:13        9     2 GB  -
9607387  COMPLETED  2021-03-05T08:31:39  gpu            2    12  94 GB      20:00  gpu:k80:4    -               9:41:44  -             2021-03-05T18:13:23          1:08  2021-03-05T18:14:31        9     1 GB  -
9606589  COMPLETED  2021-03-05T07:10:33  gpu            2     8  62 GB      10:00  gpu:k80:4    -                 11:08  -             2021-03-05T07:21:41          0:52  2021-03-05T07:22:33        7   520 MB  -
9606600  COMPLETED  2021-03-05T07:11:05  gpu            2     8  62 GB      10:00  gpu:k80:4    -                 10:36  -             2021-03-05T07:21:41          0:52  2021-03-05T07:22:33       64     3 GB  -
9607226  COMPLETED  2021-03-05T08:11:37  gpu            2    12  94 GB    4:00:00  gpu:v100x:4  -                  8:10  -             2021-03-05T08:19:47          0:09  2021-03-05T08:19:56        -        -  -

Get a list of all jobs that did not complete successfully in a given timeframe:

biowulf% dashboard_cli jobs --since 2021-04-14 --ended --fields jobid,partition,state | grep -v COMPLETED
jobid     partition    state
==================================
12705504  gpu          CANCELLED
12706060  norm         FAILED
12708353  norm         TIMEOUT
12709134  norm         FAILED
12708701  norm         FAILED
12708716  norm         FAILED
12709229  norm         FAILED
12708483  quick        FAILED
12708260  norm         FAILED
12708781  norm         FAILED
12708786  norm         FAILED
12708279  norm         FAILED
12708799  norm         FAILED
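
Listings like the one above pipe naturally into further shell tools, e.g. to tally failed jobs by state. A sketch using a hard-coded stand-in for the state column (on Biowulf you would pipe the live dashboard_cli command instead):

```shell
# Stand-in for something like:
#   dashboard_cli jobs --since 2021-04-14 --ended --fields state --noheader
states='CANCELLED
FAILED
TIMEOUT
FAILED
FAILED'

# Tally each non-COMPLETED state, most frequent first.
echo "$states" | grep -v COMPLETED | sort | uniq -c | sort -rn
```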

A single job can be dumped in full detail:

biowulf% dashboard_cli jobs --jobid 9606610 --allfields --vertical

jobid: 9606610
jobidarray: -
jobname: Refine3D_k80_003
user: nobody
state: COMPLETED
submit_time: 2021-03-05T07:11:29
state_reason: -
state_desc: -
priority: 967374
partition: gpu
qos: global
reservation: -
nodes: 2
cpus: 12
gpus: 8
mem: 94 GB
timelimit: 4:00:00
gres: gpu:k80:4
dependency: -
alloc_node: biowulf2-e0
command: /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.sh
work_dir: /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003
std_in: /dev/null
std_out: /spin1/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.out
std_err: /spin/users/nobody/private/apps/RELION/tests/210305/Refine3D_k80_3.1.2/003/RUN/run.err
start_time: 2021-03-05T07:17:47
queued_time: 6:18
nodelist: cn4179,cn4190
end_time: 2021-03-05T09:44:05
elapsed_time: 2:26:18
exit_code: 0
cpu_cur: 4
cpu_min: 8
cpu_max: 9
cpu_avg: 9.00
cpu_util: 75.0%
mem_cur: 11 GB
mem_min: 5 GB
mem_max: 34 GB
mem_avg: 23 GB
mem_util: 24.7%
gpu_cur: 4
gpu_min: 4
gpu_max: 8
gpu_avg: 7.98
gpu_util: 99.8%
D_cur: 0
D_min: 0
D_max: 5
D_avg: 0
eval: -
comment: -

The output can be compacted to fit within a screen, and the current values for CPU, memory, GPU, and threads in D state can be displayed:

biowulf% dashboard_cli jobs --compact --running --fields jobid,partition,qos,elapsed_time,nodes,cpus,cpu_cur,cpu_max,cpu_avg,D_cur,D_avg,mem,mem_cur,mem_max,mem_avg,gpus,gpu_cur,gpu_max,gpu_avg,nodelist

jobid     partition  qos     etime       n  c   cc  cx  ca     dc  da  m      mc    mx    ma    g   gc  gx  ga     nodelist
==================================================================================================================================
9593505   gpu        global  1-15:56:48  4  32  30  33  25.00   0   1  480GB  23GB  27GB  22GB  16  17  17  17.00  cn[2349-2352]
9593563   gpu        global  1-14:34:46  3  24  23  26  20.00   0   1  360GB  17GB  21GB  16GB  12  13  13  13.00  cn[2392-2394]

A job can be monitored continuously using --is-active in a bash script:

while dashboard_cli jobs --is-active --jobid 32535313
do
    echo "job is pending or running"
    sleep 30
done
echo job has finished