Biowulf High Performance Computing at the NIH
TensorRT on Biowulf

NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded, or automotive product platforms.
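The "lower precision" calibration mentioned above (TensorRT's fp16 and int8 modes) trades a small amount of numerical accuracy for speed and memory. As a rough illustration, and without using the TensorRT API itself, the sketch below casts float32 values to float16 with NumPy and measures the round-trip error; the values are drawn in [1, 2) purely to stay within fp16's normal range:

```python
import numpy as np

# Hypothetical float32 "weights", used only for illustration
rng = np.random.default_rng(0)
weights_fp32 = rng.uniform(1.0, 2.0, 10_000).astype(np.float32)

# Cast to half precision, as TensorRT's --fp16 mode does for eligible kernels
weights_fp16 = weights_fp32.astype(np.float16)

# Round-trip error stays small relative to the values themselves
rel_err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)) / weights_fp32
print(f"max relative error: {rel_err.max():.2e}")  # well below 1e-3 for fp16
print(f"memory: {weights_fp32.nbytes} -> {weights_fp16.nbytes} bytes")  # halved
```

Real int8 deployment additionally needs a calibration dataset (the `--calib` option below) to pick per-tensor scale factors; this sketch only shows why reduced precision is usually acceptable.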

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ cd /lscratch/$SLURM_JOBID

[user@cn3144 ~]$ module load tensorrt
[+] Loading libarchive  3.3.2 
[+] Loading singularity  on cn3144 
[+] Loading tensorrt  18.09 

[user@cn3144 ~]$ giexec
Mandatory params:
  --deploy=      Caffe deploy file
  OR --uff=      UFF file
  --output=      Output blob name (can be specified multiple times)

Mandatory params for onnx:
  --onnx=        ONNX Model file

Optional params:
  --uffInput=,C,H,W Input blob names along with their dimensions for UFF parser
  --model=       Caffe model file (default = no model, random weights used)
  --batch=N            Set batch size (default = 1)
  --device=N           Set cuda device to N (default = 0)
  --iterations=N       Run N iterations (default = 10)
  --avgRuns=N          Set avgRuns to N - perf is measured as an average of avgRuns (default=10)
  --percentile=P       For each iteration, report the percentile time at P percentage (0<=P<=100, with 0 representing min, and 100 representing max; default = 99.0%)
  --workspace=N        Set workspace size in megabytes (default = 16)
  --fp16               Run in fp16 mode (default = false). Permits 16-bit kernels
  --int8               Run in int8 mode (default = false). Currently no support for ONNX model.
  --verbose            Use verbose logging (default = false)
  --engine=      Generate a serialized TensorRT engine
  --calib=       Read INT8 calibration cache file.  Currently no support for ONNX model.
  --useDLA=N           Enable execution on DLA for all layers that support dla. Value can range from 1 to N, where N is the number of dla engines on the platform.
  --allowGPUFallback   If --useDLA flag is present and if a layer can't run on DLA, then run on GPU. 
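The `--iterations`, `--avgRuns`, and `--percentile` options combine as follows: each of the N iterations is timed as the average of `avgRuns` back-to-back runs, and the reported figure is the value at percentile P across those iterations. A minimal pure-Python sketch of that reporting scheme (the `run_once` workload here is a stand-in, not a real TensorRT engine):

```python
import time

def measure(run_once, iterations=10, avg_runs=10, percentile=99.0):
    """Mimic giexec's reporting: each of `iterations` samples is the mean
    wall time of `avg_runs` back-to-back runs; return the sample at the
    requested percentile (0 = min, 100 = max)."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        for _ in range(avg_runs):
            run_once()
        samples.append((time.perf_counter() - start) / avg_runs)
    samples.sort()
    # index of the percentile element, clamped to the valid range
    idx = min(int(len(samples) * percentile / 100.0), len(samples) - 1)
    return samples[idx]

# Example: time a trivial stand-in workload
latency = measure(lambda: sum(range(10_000)))
print(f"p99 latency: {latency * 1e3:.3f} ms")
```

Averaging inside each sample smooths out scheduler jitter, while the percentile across samples exposes tail latency, which is usually what matters for inference serving.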

[user@cn3144 ~]$ cp -r /usr/local/apps/tensorrt/TEST_DATA/mnist .

[user@cn3144 ~]$ sample_mnist --datadir=/lscratch/$SLURM_JOBID/mnist/
Building and running a GPU inference engine for MNIST

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@#-:.-=@@@@@@@@@@@@@@
@@@@@%=     . *@@@@@@@@@@@@@
@@@@%  .:+%%% *@@@@@@@@@@@@@
@@@@+=#@@@@@# @@@@@@@@@@@@@@
@@@@@@@@@@@%  @@@@@@@@@@@@@@
@@@@@@@@@@@: *@@@@@@@@@@@@@@
@@@@@@@@@@- .@@@@@@@@@@@@@@@
@@@@@@@@@:  #@@@@@@@@@@@@@@@
@@@@@@@@:   +*%#@@@@@@@@@@@@
@@@@@@@%         :+*@@@@@@@@
@@@@@@@@#*+--.::     +@@@@@@
@@@@@@@@@@@@@@@@#=:.  +@@@@@
@@@@@@@@@@@@@@@@@@@@  .@@@@@
@@@@@@@@@@@@@@@@@@@@#. #@@@@
@@@@@@@@@@@@@@@@@@@@#  @@@@@
@@@@@@@@@%@@@@@@@@@@- +@@@@@
@@@@@@@@#-@@@@@@@@*. =@@@@@@
@@@@@@@@ .+%%%%+=.  =@@@@@@@
@@@@@@@@           =@@@@@@@@
@@@@@@@@*=:   :--*@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: 
2: 
3: **********
4: 
5: 
6: 
7: 
8: 
9: 

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. tensorrt.sh). For example:

#!/bin/bash
set -e
module load tensorrt
# copy the bundled MNIST test data to local scratch
cp -r /usr/local/apps/tensorrt/TEST_DATA/mnist /lscratch/$SLURM_JOBID/
sample_mnist --datadir=/lscratch/$SLURM_JOBID/mnist/
# replace tensorrt.py with your own TensorRT script
python3 tensorrt.py

Submit this job using the Slurm sbatch command.

sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 tensorrt.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. tensorrt.swarm). For example:

python3 tensorrt1.py
python3 tensorrt2.py
python3 tensorrt3.py
python3 tensorrt4.py

Submit this job using the swarm command.

swarm -f tensorrt.swarm -g 20 -t 14 --partition=gpu --gres=gpu:k80:1,lscratch:10 --module tensorrt
where
-g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
--module tensorrt Loads the tensorrt module for each subjob in the swarm
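When a swarm has many subjobs, writing the swarmfile by hand gets tedious. A small sketch that generates the four-line swarmfile shown above (the `tensorrtN.py` names are placeholders for your own scripts):

```python
from pathlib import Path

# One swarm line per input script; adjust n_jobs and the command template
# to match your own workload.
n_jobs = 4
lines = [f"python3 tensorrt{i}.py" for i in range(1, n_jobs + 1)]
Path("tensorrt.swarm").write_text("\n".join(lines) + "\n")
print(Path("tensorrt.swarm").read_text(), end="")
```

The generated file is then submitted with the same swarm command as above.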