Biowulf High Performance Computing at the NIH
Deep Learning on Biowulf

Deep learning frameworks such as TensorFlow, Keras, PyTorch, and Caffe2 are available through the centrally installed python modules. Other frameworks, such as MXNet, can be installed in a user's personal conda environment. This page guides you through using the different deep learning frameworks on Biowulf in interactive sessions and via sbatch submission (and, by extension, swarm jobs).

The examples use Python 3.6 and should work with the other available Python versions (2.7 and 3.5) unless otherwise stated. For each framework, a python interpreter is first used to import the library and run a few simple commands. A GitHub repository of the framework's tutorials is then cloned, and an example script, usually basic image-classification training such as CIFAR-10 or MNIST, is run from it.
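
All of the examples below run inside a Slurm GPU allocation. A quick, framework-independent way to confirm that a job actually received GPUs is to inspect the CUDA_VISIBLE_DEVICES variable that the scheduler sets for GPU jobs. This is a stdlib-only sketch; `allocated_gpus` is a name chosen here for illustration, not a Biowulf utility:

```python
# Stdlib-only sketch: inside a Slurm GPU job, the scheduler exports
# CUDA_VISIBLE_DEVICES with the indices of the granted GPUs, e.g. "0" or "0,1".
import os

def allocated_gpus(env=None):
    """Return the list of GPU indices visible to the current job."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in visible.split(",") if i.strip()]

# Simulated environment; inside a real job just call allocated_gpus().
print(allocated_gpus({"CUDA_VISIBLE_DEVICES": "0,1"}))  # -> [0, 1]
```

If the list is empty inside a job that was supposed to have a GPU, check the --gres option of your allocation before debugging the framework itself.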

TensorFlow

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load python/3.6
[+] Loading python 3.6  ...

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> import tensorflow.contrib.eager as tfe
>>> 
>>> tf.enable_eager_execution()
>>> 
>>> print("TensorFlow version: {}".format(tf.VERSION))
TensorFlow version: 1.8.0
>>> print("Eager execution: {}".format(tf.executing_eagerly()))
Eager execution: True
>>> quit()

[user@cn3144 ~]$ git clone https://github.com/tensorflow/models.git

[user@cn3144 ~]$ python models/tutorials/image/cifar10/cifar10_train.py --max_steps=1000
/usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-06-19 14:44:52.770583: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-19 14:44:53.296730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-06-19 14:44:53.296807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-19 14:44:53.622526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-19 14:44:53.622592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-06-19 14:44:53.622612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-06-19 14:44:53.623037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11489 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0, compute capability: 3.7)
2018-06-19 14:44:56.478164: step 0, loss = 4.68 (334.1 examples/sec; 0.383 sec/batch)
2018-06-19 14:44:56.939725: step 10, loss = 4.64 (2773.2 examples/sec; 0.046 sec/batch)
2018-06-19 14:44:57.235208: step 20, loss = 4.63 (4332.1 examples/sec; 0.030 sec/batch)
2018-06-19 14:44:57.503400: step 30, loss = 4.38 (4772.4 examples/sec; 0.027 sec/batch)
2018-06-19 14:44:57.770756: step 40, loss = 4.36 (4787.6 examples/sec; 0.027 sec/batch)
2018-06-19 14:44:58.032283: step 50, loss = 4.36 (4894.4 examples/sec; 0.026 sec/batch)
[...]
2018-06-19 14:45:25.860690: step 990, loss = 2.38 (4733.6 examples/sec; 0.027 sec/batch)

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/3.6
cd /data/$USER/deeplearning
python models/tutorials/image/cifar10/cifar10_train.py --max_steps=1000
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419
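
The resource requests can equivalently be embedded in the batch script as #SBATCH directives, so the job can be submitted with a plain `sbatch submit.sh`. The directives are standard Slurm; the values mirror the sbatch command line above (a sketch, adjust paths to your own setup):

```shell
# Write a batch script carrying its own resource requests as #SBATCH directives
# (equivalent to passing them on the sbatch command line).
cat > submit.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:k80:1,lscratch:10
#SBATCH --mem=20g
#SBATCH --cpus-per-task=14
module load python/3.6
cd /data/$USER/deeplearning
python models/tutorials/image/cifar10/cifar10_train.py --max_steps=1000
EOF
```

Command-line options to sbatch override the corresponding directives in the script, which is convenient for one-off changes such as switching GPU type.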

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Keras

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load python/3.6
[+] Loading python 3.6  ...

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from keras.models import Sequential
/usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> model = Sequential()
>>> from keras.layers import Dense
>>> model.add(Dense(units=64, activation='relu', input_dim=100))
>>> model.add(Dense(units=10, activation='softmax'))
>>> quit()

[user@cn3144 ~]$ git clone https://github.com/keras-team/keras.git

[user@cn3144 ~]$ sed -i 's/epochs = 200/epochs = 2/' keras/examples/cifar10_resnet.py

[user@cn3144 ~]$ python keras/examples/cifar10_resnet.py
[...]
Using TensorFlow backend.
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 25s 0us/step
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
y_train shape: (50000, 1)
2018-06-19 15:02:16.063830: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-19 15:02:16.600836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-06-19 15:02:16.600903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-19 15:02:16.921253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-19 15:02:16.921315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-06-19 15:02:16.921330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-06-19 15:02:16.921747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11489 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0, compute capability: 3.7)
Learning rate:  0.001
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 32, 32, 3)    0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 32, 32, 16)   448         input_1[0][0]                    
__________________________________________________________________________________________________
[...]
Epoch 00002: val_acc improved from 0.41990 to 0.62410, saving model to /spin1/users/teacher/deeplearning/saved_models/cifar10_ResNet20v1_model.002.h5
10000/10000 [==============================] - 4s 368us/step
Test loss: 1.2348264940261842
Test accuracy: 0.6241

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/3.6
cd /data/$USER/deeplearning
python keras/examples/cifar10_resnet.py
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419
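
To run several such trainings at once, the same commands can be expressed as a swarm file, one command line per subjob. The swarm flags below follow Biowulf's swarm utility (-g gigabytes of memory and -t threads per subjob); the file name and the choice of commands are just examples:

```shell
# One command per line; swarm turns each line into its own batch subjob.
cat > deeplearning.swarm <<'EOF'
cd /data/$USER/deeplearning && python keras/examples/cifar10_resnet.py
cd /data/$USER/deeplearning && python models/tutorials/image/cifar10/cifar10_train.py --max_steps=1000
EOF
# Submit (commented out here):
# swarm -f deeplearning.swarm -g 20 -t 14 --partition=gpu --gres=gpu:k80:1 --module python/3.6
```

Each line gets its own GPU allocation, so a swarm of n lines requests n GPUs in total.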

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Keras for R

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load cuDNN/7.0/CUDA-9.0 CUDA/9.0 R/3.5.0 python/3.5
[+] Loading cuDNN 7.0  libraries... 
[+] Loading CUDA Toolkit  9.0.176  ... 
[+] Loading gcc  7.2.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[+] Loading openmpi 3.0.0  for GCC 7.2.0 
[+] Loading R 3.5.0_build2 
[+] Loading python 3.5  ... 

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning/R

[user@cn3144 ~]$ cd /data/$USER/deeplearning/R

[user@cn3144 ~]$ R
R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

 > library(keras)
 > library(tensorflow)
 > model <- keras_model_sequential()
/usr/local/Anaconda/envs/py3.5/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
 >

[user@cn3144 ~]$ git clone https://github.com/rstudio/keras.git

[user@cn3144 ~]$ Rscript keras/vignettes/examples/mnist_cnn.R
/usr/local/Anaconda/envs/py3.5/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
x_train_shape: 60000 28 28 1 
60000 train samples
10000 test samples
Train on 48000 samples, validate on 12000 samples
Epoch 1/12
2018-09-12 15:01:18.244091: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-12 15:01:18.515350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8a:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-09-12 15:01:18.515420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-12 15:01:20.166555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-12 15:01:20.166611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-12 15:01:20.166628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-12 15:01:20.167006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11491 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:8a:00.0, compute capability: 3.7)
48000/48000 [==============================] - 12s 241us/step - loss: 0.3010 - acc: 0.9058 - val_loss: 0.0674 - val_acc: 0.9808
Epoch 2/12
48000/48000 [==============================] - 8s 157us/step - loss: 0.0984 - acc: 0.9707 - val_loss: 0.0495 - val_acc: 0.9855
[...]
Epoch 12/12
48000/48000 [==============================] - 8s 157us/step - loss: 0.0254 - acc: 0.9920 - val_loss: 0.0394 - val_acc: 0.9894
Test loss: 0.02953377 
Test accuracy: 0.9904 

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load cuDNN/7.0/CUDA-9.0 CUDA/9.0 R/3.5.0 python/3.5
cd /data/$USER/deeplearning/R
Rscript keras/vignettes/examples/mnist_cnn.R
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

PyTorch

Allocate an interactive session with X11 forwarding and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load python/3.6
[+] Loading python 3.6  ...

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> import torch
>>> x = torch.empty(5, 3)
>>> print(x)
tensor([[-6.0496e-13,  1.5305e-41,  2.5110e+13],
        [ 3.0611e-41, -1.0846e-08,  1.5305e-41],
        [-1.0855e-08,  1.5305e-41, -3.1812e-13],
        [ 1.5305e-41, -3.1560e-13,  1.5305e-41],
        [-1.0853e-08,  1.5305e-41, -1.0853e-08]])
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[ 0.6320,  0.4974,  0.1132],
        [ 0.8411,  0.8527,  0.2586],
        [ 0.2586,  0.7206,  0.8066],
        [ 0.5950,  0.4406,  0.8707],
        [ 0.6431,  0.8721,  0.5510]])
>>> quit()
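
Before launching training it is common to pick the device explicitly, using the GPU when one is available and the CPU otherwise. This is the standard PyTorch device-selection idiom, sketched here with an ImportError fallback only so the snippet can be tried in environments without PyTorch:

```python
# Device-selection sketch: prefer the GPU, fall back to the CPU.
try:
    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.rand(5, 3).to(device)  # move a tensor onto the chosen device
    backend = str(device)
except ImportError:
    backend = "cpu"  # PyTorch not installed in this environment
print("computing on:", backend)
```

On a GPU node with the python/3.6 module loaded this reports "cuda"; the same script still runs on a CPU-only node.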

[user@cn3144 ~]$ git clone https://github.com/pytorch/tutorials.git

# Make sure X11 forwarding is working (e.g. by running `xeyes`); the tutorial displays images.

[user@cn3144 ~]$ python tutorials/beginner_source/blitz/cifar10_tutorial.py
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
Files already downloaded and verified
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-teacher'
  dog  bird   cat  ship
[1,  2000] loss: 2.158
[1,  4000] loss: 1.855
[1,  6000] loss: 1.673
[1,  8000] loss: 1.591
[1, 10000] loss: 1.526
[1, 12000] loss: 1.441
[2,  2000] loss: 1.362
[2,  4000] loss: 1.367
[2,  6000] loss: 1.321
[2,  8000] loss: 1.297
[2, 10000] loss: 1.268
[2, 12000] loss: 1.267
Finished Training
GroundTruth:    cat  ship  ship plane
Predicted:    cat  ship   car plane
Accuracy of the network on the 10000 test images: 52 %
Accuracy of plane : 62 %
Accuracy of   car : 70 %
Accuracy of  bird : 16 %
Accuracy of   cat : 49 %
Accuracy of  deer : 49 %
Accuracy of   dog : 50 %
Accuracy of  frog : 36 %
Accuracy of horse : 61 %
Accuracy of  ship : 62 %
Accuracy of truck : 69 %
cuda:0

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/3.6
cd /data/$USER/deeplearning
python tutorials/beginner_source/blitz/cifar10_tutorial.py
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Caffe2

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load python nccl/2.2.13_cuda9.0
[+] Loading python 2.7  ... 
[+] Loading CUDA Toolkit  9.0.176  ... 
[+] Loading NCCL for cuda 9.0  2.2.13  ... 

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ python
>>> from caffe2.python import workspace, model_helper
>>> import numpy as np
>>> # Create random tensor of three dimensions
... x = np.random.rand(4, 3, 2)
>>> print(x)
[[[0.76794446 0.10728679]
  [0.21646403 0.55296556]
  [0.34029796 0.86811885]]

 [[0.44665673 0.11730879]
  [0.66358693 0.60259659]
  [0.64326743 0.04888184]]

 [[0.74584444 0.39408433]
  [0.80037294 0.6543183 ]
  [0.56033315 0.05735479]]

 [[0.96530297 0.22466116]
  [0.94904576 0.92072679]
  [0.4773546  0.7318874 ]]]
>>> print(x.shape)
(4, 3, 2)
>>> 
>>> workspace.FeedBlob("my_x", x)
True
>>> 
>>> x2 = workspace.FetchBlob("my_x")
>>> print(x2)
[[[0.76794446 0.10728679]
  [0.21646403 0.55296556]
  [0.34029796 0.86811885]]

 [[0.44665673 0.11730879]
  [0.66358693 0.60259659]
  [0.64326743 0.04888184]]

 [[0.74584444 0.39408433]
  [0.80037294 0.6543183 ]
  [0.56033315 0.05735479]]

 [[0.96530297 0.22466116]
  [0.94904576 0.92072679]
  [0.4773546  0.7318874 ]]]
>>> quit()
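
The FeedBlob/FetchBlob pair above is a round trip through the Caffe2 workspace, and the fetched array should equal the one fed in. A compact sanity check of that round trip, with a plain copy standing in where caffe2 is not installed so the check itself runs anywhere:

```python
# Round-trip check: the blob fetched from the workspace should equal the
# array that was fed in.
import numpy as np

x = np.random.rand(4, 3, 2)
try:
    from caffe2.python import workspace
    workspace.FeedBlob("my_x", x)
    x2 = workspace.FetchBlob("my_x")
except ImportError:
    x2 = x.copy()  # caffe2 not available: stand-in for the fetched blob
assert np.array_equal(x, x2)
print("round trip ok:", x2.shape)
```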


[user@cn3144 ~]$ git clone https://github.com/caffe2/caffe2.git

[user@cn3144 ~]$ python caffe2/caffe2/python/models/resnet_test.py
No handlers could be found for logger "caffe2.python.net_drawer"
net_drawer will not run correctly. Please install the correct dependencies.
/usr/local/Anaconda/envs/py2.7/lib/python2.7/site-packages/caffe2/python/hypothesis_test_util.py:75: HypothesisDeprecationWarning: 
The min_satisfying_examples setting has been deprecated and disabled, due to
overlap with the filter_too_much healthcheck and poor interaction with the
max_examples setting.
[...]
I0619 15:57:50.653511 48188 memonger.cc:236] Remapping 126 using 5 shared blobs.
INFO:memonger:Memonger memory optimization took 0.0219659805298 secs
I0619 15:57:51.410488 48188 operator.cc:167] Engine CUDNN is not available for operator Conv.
I0619 15:57:51.410671 48188 operator.cc:167] Engine CUDNN is not available for operator Relu.
I0619 15:57:51.410712 48188 operator.cc:167] Engine CUDNN is not available for operator MaxPool.
[...]
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
I0619 15:58:16.026832 48188 memonger.cc:236] Remapping 138 using 34 shared blobs.
I0619 15:58:16.026859 48188 memonger.cc:239] Memonger saved approximately : 320.812 MB.
INFO:memonger:Memonger memory optimization took 0.0328068733215 secs
before: 880 after: 776
.
----------------------------------------------------------------------
Ran 3 tests in 39.758s

OK

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python nccl/2.2.13_cuda9.0
cd /data/$USER/deeplearning
python caffe2/caffe2/python/models/resnet_test.py
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

MXNet

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
--2018-04-23 15:04:10--  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving dtn04-e0... 10.1.200.240
Connecting to dtn04-e0|10.1.200.240|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 58304693 (56M) [application/x-sh]
Saving to: “Miniconda3-latest-Linux-x86_64.sh”
2018-04-23 15:04:11 (118 MB/s) - “Miniconda3-latest-Linux-x86_64.sh” saved [58304693/58304693]

[user@cn3144 ~]$ bash Miniconda3-latest-Linux-x86_64.sh -p /data/$USER/deeplearning/conda -b
PREFIX=/data/teacher/deeplearning/conda
installing: python-3.6.5-hc3d631a_2 ...
Python 3.6.5 :: Anaconda, Inc.
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
[...]
installation finished.

[user@cn3144 ~]$ source /data/$USER/deeplearning/conda/etc/profile.d/conda.sh

[user@cn3144 ~]$ conda activate base

(base) [user@cn3144 deeplearning]$ which python
/data/teacher/deeplearning/conda/bin/python

(base) [user@cn3144 deeplearning]$ module load CUDA
[+] Loading CUDA Toolkit  9.0.176  ...

(base) [user@cn3144 deeplearning]$ pip install mxnet-cu90
Collecting mxnet-cu90
[...]
Successfully installed graphviz-0.8.3 mxnet-cu90-1.2.0 numpy-1.14.5

[user@cn3144 ~]$ python
>>> import mxnet as mx
>>> a = mx.nd.ones((2, 3), mx.gpu())
>>> b = a * 2 + 1
>>> b.asnumpy()
array([[3., 3., 3.],
       [3., 3., 3.]], dtype=float32)
>>> quit()
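
MXNet makes the compute device an explicit context argument: mx.gpu() requires the CUDA build installed above (mxnet-cu90), while mx.cpu() works in any build. A sketch of the same 2*ones+1 computation with a selectable context; the numpy branch is only a fallback for environments without MXNet:

```python
# Context-selection sketch for the computation shown above.
def double_plus_one(use_gpu=False):
    try:
        import mxnet as mx
        ctx = mx.gpu() if use_gpu else mx.cpu()
        return (mx.nd.ones((2, 3), ctx) * 2 + 1).asnumpy()
    except ImportError:
        import numpy as np
        return np.ones((2, 3)) * 2 + 1  # same arithmetic without MXNet

print(double_plus_one())  # 2x3 array filled with 3.0
```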

(base) [user@cn3144 deeplearning]$ git clone https://github.com/apache/incubator-mxnet.git

(base) [user@cn3144 deeplearning]$ cd incubator-mxnet/example/image-classification

(base) [user@cn3144 deeplearning]$ python train_mnist.py --network mlp --num-epochs 2
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, optimizer='sgd', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[...]
INFO:root:Epoch[1] Batch [700]	Speed: 44104.51 samples/sec	accuracy=0.968906
INFO:root:Epoch[1] Batch [800]	Speed: 45402.42 samples/sec	accuracy=0.971562
INFO:root:Epoch[1] Batch [900]	Speed: 45352.02 samples/sec	accuracy=0.970781
INFO:root:Epoch[1] Train-accuracy=0.970439
INFO:root:Epoch[1] Time cost=1.328
INFO:root:Epoch[1] Validation-accuracy=0.965267

Note the gpus=None in the start-up log above: this run trained on the CPU. To use the allocated GPU, pass its index explicitly, e.g. python train_mnist.py --network mlp --num-epochs 2 --gpus 0.

[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load CUDA
source /data/$USER/deeplearning/conda/etc/profile.d/conda.sh
conda activate base
cd /data/$USER/deeplearning/incubator-mxnet/example/image-classification
python train_mnist.py --network mlp --num-epochs 2 --gpus 0
EOF

[user@cn3144 ~]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
3940419

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$