Biowulf High Performance Computing at the NIH
DeepLab: Semantic Image Segmentation with Deep Learning

DeepLab is a semantic image segmentation tool. It uses deep convolutional networks, dilated (a.k.a. atrous) convolution, and fully connected conditional random fields.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=40g --gres=gpu:v100,lscratch:10
[user@cn4466 ~]$ module load DeepLab/20180816
Download the source code from GitHub:
[user@cn4466 ~]$ git clone https://github.com/tensorflow/models.git
[user@cn4466 ~]$ cd models/research/deeplab
[user@cn4466 ~]$ ln -s . deeplab
Start the DeepLab container environment:
[user@cn4466 ~]$ deeplab
Singularity Deeplab_gpu_TF-1.10.sqsh:deeplab>
Run the testing script:
Singularity Deeplab_gpu_TF-1.10.sqsh:deeplab> bash local_test.sh 
...
2018-08-21 13:17:20.736914: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2018-08-21 13:17:20.737005: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.37.0
ok
testForwardpassDeepLabv3plus (__main__.DeeplabModelTest) ... ok
testScaleDimensionOutput (__main__.DeeplabModelTest) ... ok
testWrongDeepLabVariant (__main__.DeeplabModelTest) ... ok
test_session (__main__.DeeplabModelTest)
Returns a TensorFlow Session for use in executing tests. ... ok

----------------------------------------------------------------------
Ran 5 tests in 20.310s

OK
Uncompressing VOCtrainval_11-May-2012.tar
Removing the color map in ground truth annotations...
Converting PASCAL VOC 2012 dataset...
...
>> Converting image 366/1464 shard 0
>> Converting image 732/1464 shard 1
>> Converting image 1098/1464 shard 2
>> Converting image 1464/1464 shard 3
>> Converting image 729/2913 shard 0
>> Converting image 1458/2913 shard 1
>> Converting image 2187/2913 shard 2
>> Converting image 2913/2913 shard 3
>> Converting image 363/1449 shard 0
>> Converting image 726/1449 shard 1
>> Converting image 1089/1449 shard 2
>> Converting image 1449/1449 shard 3
...
Resolving dtn06-e0 (dtn06-e0)... 10.1.200.242
Connecting to dtn06-e0 (dtn06-e0)|10.1.200.242|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 460058541 (439M) [application/x-tar]
Saving to: 'deeplabv3_pascal_train_aug_2018_01_04.tar.gz'

deeplabv3_pascal_train_ 100%[===============================>] 438.75M  8.65MB/s    in 52s
...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /scratch/denisovga/DeepLab/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global_step/sec: 0.00750979
INFO:tensorflow:Recording summary at step 5.
INFO:tensorflow:global step 10: loss = 0.2374 (98.614 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 1449
INFO:tensorflow:Eval batch size 1 and num batch 1449
...
INFO:tensorflow:Found new checkpoint at /scratch/denisovga/DeepLab/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Graph was finalized.
...
INFO:tensorflow:Restoring parameters from /scratch/denisovga/DeepLab/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-08-21-13:41:13
INFO:tensorflow:Evaluation [144/1449]
INFO:tensorflow:Evaluation [288/1449]
INFO:tensorflow:Evaluation [432/1449]
INFO:tensorflow:Evaluation [576/1449]
INFO:tensorflow:Evaluation [720/1449]
INFO:tensorflow:Evaluation [864/1449]
INFO:tensorflow:Evaluation [1008/1449]
INFO:tensorflow:Evaluation [1152/1449]
INFO:tensorflow:Evaluation [1296/1449]
INFO:tensorflow:Evaluation [1440/1449]
INFO:tensorflow:Evaluation [1449/1449]
...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Restoring parameters from /scratch/denisovga/DeepLab/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Visualizing batch 1 / 1449
INFO:tensorflow:Visualizing batch 2 / 1449
INFO:tensorflow:Visualizing batch 3 / 1449
...

End the interactive session:
[user@cn4466 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
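
The clone and symlink steps in the sample session fail if they are repeated in the same directory, because git will not clone into an existing non-empty directory and ln -s will not overwrite an existing link. A minimal sketch of a re-runnable version is given below. The function name prepare_deeplab_tree is hypothetical and is not provided by the DeepLab module; the working directory is passed in by the caller.

```shell
#!/bin/bash
# Sketch only: prepare_deeplab_tree is a hypothetical helper, not part of
# the DeepLab module. It repeats the clone and symlink steps from the
# sample session, but skips any step that has already been done.
prepare_deeplab_tree() {
    local workdir="$1"
    cd "$workdir" || return 1

    # Clone only if the repository directory is not already present.
    if [ ! -d models ]; then
        git clone https://github.com/tensorflow/models.git || return 1
    fi

    cd models/research/deeplab || return 1

    # Create the self-referencing link only if nothing named deeplab exists.
    if [ ! -e deeplab ] && [ ! -L deeplab ]; then
        ln -s . deeplab || return 1
    fi

    # Print the directory the caller has been left in.
    pwd
}
```

For example, calling prepare_deeplab_tree /scratch/$USER before starting the container leaves the shell in models/research/deeplab whether or not the repository was cloned earlier.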
Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. deeplab.sh). For example:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:1,lscratch:10
#SBATCH --mem=40g
module load DeepLab/20180816
export SINGULARITY_BINDPATH="/data/$USER,/fdb,/scratch,/lscratch"
cd /scratch/$USER
git clone https://github.com/tensorflow/models.git 
cd models/research/deeplab
ln -s . deeplab
deeplab
bash local_test.sh

Submit this job using the Slurm sbatch command.

sbatch deeplab.sh 
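
After submission it is often useful to capture the job ID and check the queue. The sketch below assumes the standard Slurm sbatch --parsable option and squeue -j option. The function name submit_and_show and the SBATCH_CMD and SQUEUE_CMD variables are hypothetical; the variables exist only so that the commands can be replaced when testing outside the cluster.

```shell
#!/bin/bash
# Sketch only: submit_and_show, SBATCH_CMD and SQUEUE_CMD are hypothetical
# names introduced for illustration.
submit_and_show() {
    local script="$1"
    local jobid

    # --parsable makes sbatch print only the job ID, optionally followed
    # by ";clustername".
    jobid=$("${SBATCH_CMD:-sbatch}" --parsable "$script") || return 1

    # Drop the cluster name if one was appended.
    jobid="${jobid%%;*}"
    echo "Submitted job $jobid"

    # Show the current state of the submitted job.
    "${SQUEUE_CMD:-squeue}" -j "$jobid"
}
```

On a login node this would be called as submit_and_show deeplab.sh.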
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. deeplab.swarm). Each line of a swarmfile is run as a separate job, so all commands for one run go on a single line. For example:

cd /scratch/$USER; git clone https://github.com/tensorflow/models.git; cd models/research/deeplab; ln -s . deeplab; deeplab; bash local_test.sh

Submit this job using the swarm command.

swarm -f deeplab.swarm --module DeepLab/20180816 -g 40 --partition=gpu --gres gpu:p100:1,lscratch:10
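
Because swarm treats every line of the swarmfile as an independent command, several runs need one line each, and each run needs its own directory so that the clones do not collide. The sketch below writes such a file. The function name write_swarmfile and the run names are hypothetical, and the command sequence on each line simply repeats the steps shown above.

```shell
#!/bin/bash
# Sketch only: write_swarmfile and the run names passed to it are
# hypothetical. Each output line holds the full command sequence for one run.
write_swarmfile() {
    local outfile="$1"
    shift

    # Start with an empty file.
    : > "$outfile"

    local run
    for run in "$@"; do
        # \$USER is escaped so that it is expanded when the job runs,
        # not when the swarmfile is written.
        printf '%s\n' "mkdir -p /scratch/\$USER/$run; cd /scratch/\$USER/$run; git clone https://github.com/tensorflow/models.git; cd models/research/deeplab; ln -s . deeplab; deeplab; bash local_test.sh" >> "$outfile"
    done
}
```

For example, write_swarmfile deeplab.swarm run1 run2 run3 produces a three-line swarmfile that can be submitted with the swarm command shown above.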