Deep learning frameworks such as TensorFlow, Keras, and PyTorch are available through the centrally installed python module, and Flux is available through the centrally installed julialang module. Other frameworks, such as MXNet, can be installed into a user's personal conda environment. This page will guide you through using these deep learning frameworks on Biowulf in interactive sessions and via sbatch submission (and, by extension, swarm jobs).
For each framework, an interpreter (python or julia) is started to import the library and run a few simple commands. In addition, a GitHub repository containing the framework's tutorials is cloned, and an example script, usually basic image-classification training such as CIFAR-10 or MNIST, is run from that repository.
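Before working through the framework-specific examples, it can be useful to confirm that your session actually sees a GPU. The snippet below is a minimal sketch, not part of the original examples: it prints the device id(s) exposed to the job and, when TensorFlow or PyTorch happens to be importable, what the framework itself reports.

# gpu-check.py -- quick sanity check; run inside an interactive GPU session
import os

# The allocated GPU id(s) are exposed through this environment variable.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

try:
    import tensorflow as tf
    print("TensorFlow GPUs:", tf.config.list_physical_devices("GPU"))
except ImportError:
    pass

try:
    import torch
    print("PyTorch CUDA available:", torch.cuda.is_available())
except ImportError:
    pass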
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 58344035
salloc.exe: job 58344035 queued and waiting for resources
salloc.exe: job 58344035 has been allocated resources
salloc.exe: Granted job allocation 58344035
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4172 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4172 ~]$ module load cuDNN/7.6.5/CUDA-10.1 python/3.7
[+] Loading cuDNN/7.6.5/CUDA-10.1 libraries...
[+] Loading python 3.7 ...

[user@cn4172 ~]$ mkdir -p /data/${USER}/deeplearning/tensorflow-example
[user@cn4172 ~]$ cd /data/${USER}/deeplearning/tensorflow-example
[user@cn4172 tensorflow-example]$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print("TensorFlow version: {}".format(tf.__version__))
TensorFlow version: 2.1.0
>>> quit()

[user@cn4172 tensorflow-example]$ cat >hello-tflow.py<<'EOF'
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
EOF

[user@cn4172 tensorflow-example]$ python hello-tflow.py
2020-05-19 12:25:29.207263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-19 12:25:29.238367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.92GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-19 12:25:29.239507: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[...]
10000/10000 [==============================] - 1s 55us/sample - loss: 0.0787 - accuracy: 0.9762

[user@cn4172 tensorflow-example]$ exit
exit
salloc.exe: Relinquishing job allocation 58344035
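To confirm from within a script that TensorFlow found the allocated GPU, and optionally to keep it from reserving all of the card's memory at startup, a few lines can be added near the top. This is a minimal sketch against the TF 2.x configuration API; it is not part of hello-tflow.py:

import tensorflow as tf

# List the GPUs TensorFlow can see; with --gres=gpu:k80:1 this should
# report exactly one physical device.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

# Optional: allocate GPU memory on demand instead of all at once.
# Must run before any tensors are placed on the device.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)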
Now let's submit the job to Biowulf's Slurm scheduler using sbatch.
[user@biowulf ~]$ cd /data/${USER}/deeplearning/tensorflow-example
[user@biowulf tensorflow-example]$ cat >submit.sh<<'EOF'
#!/bin/bash
module load cuDNN/7.6.5/CUDA-10.1 python/3.7
python hello-tflow.py
EOF

[user@biowulf tensorflow-example]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
58345698

[user@biowulf tensorflow-example]$ tail -n 1 slurm-58345698.out
10000/10000 [==============================] - 1s 55us/sample - loss: 0.0754 - accuracy: 0.9769
[user@biowulf tensorflow-example]$
Please see the TensorBoard page for visualizing TensorFlow training.
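As a sketch of what that involves, a TensorBoard callback can be attached to the model.fit() call in hello-tflow.py. The log directory below is a hypothetical example; any writable path under your data directory works:

import tensorflow as tf

# Write event files that TensorBoard can later read and visualize.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="/data/user/deeplearning/tensorflow-example/logs",  # hypothetical path
    histogram_freq=1,
)

# In hello-tflow.py, pass the callback to fit(); nothing else changes:
# model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])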
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 58344035
salloc.exe: job 58344035 queued and waiting for resources
salloc.exe: job 58344035 has been allocated resources
salloc.exe: Granted job allocation 58344035
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4172 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4172 ~]$ module load cuDNN/7.6.5/CUDA-10.1 julialang/1.5.0
[+] Loading cuDNN/7.6.5/CUDA-10.1 libraries...
[+] Loading julialang 1.5.0 ...

[user@cn4172 ~]$ mkdir -p /data/${USER}/deeplearning/flux-example
[user@cn4172 ~]$ cd /data/${USER}/deeplearning/flux-example
[user@cn4172 flux-example]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.0 (2020-08-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> ]

(v1.5) pkg> add CUDA
  Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
  Resolving package versions...
[...]
  [052768ef] + CUDA v2.3.0

(v1.5) pkg> add Flux
[...]

(v1.5) pkg> add Parameters
[...]

(v1.5) pkg> add MLDatasets
[...]

(v1.5) pkg> add Statistics
[...]

(v1.5) pkg> status
Status `/home/user/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.3.0
  [587475ba] Flux v0.11.2
  [eb30cadb] MLDatasets v0.5.3
  [10745b16] Statistics
  [d96e819e] Parameters v0.12.1

julia> exit()

[user@cn4172 flux-example]$ git clone https://github.com/FluxML/model-zoo.git
Cloning into 'model-zoo'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (166/166), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 2938 (delta 66), reused 90 (delta 27), pack-reused 2772
Receiving objects: 100% (2938/2938), 1.96 MiB | 12.63 MiB/s, done.
Resolving deltas: 100% (1498/1498), done.

[user@cn4172 flux-example]$ cd model-zoo/vision/mnist
[user@cn4172 mnist]$ julia mlp.jl
[ Info: CUDA is on
┌ Warning: Your Tesla K80 GPU does not meet the minimal required compute capability (3.7.0 < 5.0).
│ Some functionality might not work. For a fully-supported set-up, please use an older version of CUDA.jl
└ @ CUDA ~/.julia/packages/CUDA/YeS8q/src/state.jl:251
[ Info: Epoch 1
loss_all(train_data, m) = 2.3814766f0
loss_all(train_data, m) = 2.3615136f0
loss_all(train_data, m) = 2.342074f0
loss_all(train_data, m) = 2.323052f0
loss_all(train_data, m) = 2.3044188f0
[...]
loss_all(train_data, m) = 0.31868845f0
loss_all(train_data, m) = 0.31832847f0
loss_all(train_data, m) = 0.31800961f0
accuracy(train_data, m) = 0.9143038093777877
accuracy(test_data, m) = 0.9177315848214287

[user@cn4172 mnist]$ exit
exit
salloc.exe: Relinquishing job allocation 58344035
Now let's submit the Flux job to Biowulf's Slurm scheduler using sbatch.
[user@biowulf ~]$ cd /data/${USER}/deeplearning/flux-example/model-zoo/vision/mnist
[user@biowulf mnist]$ cat >submit.sh<<'EOF'
#!/bin/bash
module load cuDNN/7.6.5/CUDA-10.1 julialang/1.5.0
julia mlp.jl
EOF

[user@biowulf mnist]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
58345698

[user@biowulf mnist]$ tail -n 2 slurm-58345698.out
accuracy(train_data, m) = 0.9135415505129348
accuracy(test_data, m) = 0.9174386160714286
[user@biowulf mnist]$
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:p100:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 45138535
salloc.exe: job 45138535 queued and waiting for resources
salloc.exe: job 45138535 has been allocated resources
salloc.exe: Granted job allocation 45138535
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn2350 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn2350 ~]$ module load python/3.7
[+] Loading python 3.7 ...
[user@cn2350 ~]$ mkdir -p /data/$USER/deeplearning
[user@cn2350 ~]$ cd /data/$USER/deeplearning
[user@cn2350 deeplearning]$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from keras.models import Sequential
Using TensorFlow backend.
>>> model = Sequential()
>>> from keras.layers import Dense
>>> model.add(Dense(units=64, activation='relu', input_dim=100))
WARNING: Logging before flag parsing goes to stderr.
W1227 18:04:50.156235 46912496418496 deprecation.py:506] From /usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
>>> model.add(Dense(units=10, activation='softmax'))
>>> quit()

[user@cn2350 deeplearning]$ git clone https://github.com/keras-team/keras.git
Cloning into 'keras'...
remote: Enumerating objects: 32987, done.
remote: Total 32987 (delta 0), reused 0 (delta 0), pack-reused 32987
Receiving objects: 100% (32987/32987), 13.02 MiB | 22.88 MiB/s, done.
Resolving deltas: 100% (24114/24114), done.

[user@cn2350 deeplearning]$ cd keras/ && git checkout 7a39b6c6 && cd .. #ensure version for tutorial
Note: checking out '7a39b6c6'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 7a39b6c... Fix too many values to unpack error (#13511)

[user@cn2350 deeplearning]$ python keras/examples/mnist_cnn.py
Using TensorFlow backend.
[...]
Epoch 12/12
60000/60000 [==============================] - 3s 50us/step - loss: 0.0262 - accuracy: 0.9922 - val_loss: 0.0241 - val_accuracy: 0.9917
Test loss: 0.02409471721524369
Test accuracy: 0.9916999936103821

[user@cn2350 deeplearning]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/3.7
python keras/examples/cifar10_resnet.py
EOF

[user@cn2350 deeplearning]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
45138623

[user@cn2350 deeplearning]$ exit
exit
salloc.exe: Relinquishing job allocation 45138535
[user@biowulf ~]$
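The example scripts above do not show how to persist the trained network; once a batch job ends, anything not written to disk is lost. The sketch below uses the standard Keras HDF5 serialization calls; the tiny model and the file name are illustrative only, not taken from mnist_cnn.py:

from keras.models import Sequential, load_model
from keras.layers import Dense

# Stand-in for a trained network; in practice this would be the model
# built and fitted by a script such as mnist_cnn.py.
model = Sequential([Dense(10, activation='softmax', input_dim=784)])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Save architecture, weights, and optimizer state to one HDF5 file...
model.save('mnist_model.h5')  # hypothetical file name

# ...and restore it later, e.g. in a follow-up job or notebook.
restored = load_model('mnist_model.h5')
restored.summary()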
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 45255667
salloc.exe: job 45255667 queued and waiting for resources
salloc.exe: job 45255667 has been allocated resources
salloc.exe: Granted job allocation 45255667
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4175 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4175 ~]$ module load cuDNN/7.6.5/CUDA-10.2 CUDA/10.2 R/3.6.1 python/3.7
[+] Loading cuDNN/7.6.5/CUDA-10.2 libraries...
[+] Loading CUDA Toolkit 10.2.89 ...
[+] Loading gcc 9.2.0 ...
[+] Loading GSL 2.6 for GCC 9.2.0 ...
[-] Unloading gcc 9.2.0 ...
[+] Loading gcc 9.2.0 ...
[+] Loading openmpi 3.1.4 for GCC 9.2.0
[+] Loading ImageMagick 7.0.8 on cn4175
[+] Loading HDF5 1.10.4
[+] Loading pandoc 2.9.1 on cn4175
[+] Loading R 3.6.1
[+] Loading python 3.7 ...

[user@cn4175 ~]$ mkdir -p /data/$USER/deeplearning/R
[user@cn4175 ~]$ cd /data/$USER/deeplearning/R
[user@cn4175 R]$ R

R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(keras)
> library(tensorflow)
> model <- keras_model_sequential()
> quit()
Save workspace image? [y/n/c]: n

[user@cn4175 R]$ git clone https://github.com/rstudio/keras.git
Cloning into 'keras'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 29481 (delta 0), reused 1 (delta 0), pack-reused 29475
Receiving objects: 100% (29481/29481), 28.35 MiB | 33.41 MiB/s, done.
Resolving deltas: 100% (25960/25960), done.

[user@cn4175 R]$ cd keras/ && git checkout e3f62ae2 && cd .. #force specific commit
Note: checking out 'e3f62ae2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at e3f62ae... Merge pull request #955 from dfalbel/vae-examples

[user@cn4175 R]$ Rscript keras/vignettes/examples/mnist_cnn.R
x_train_shape: 60000 28 28 1
60000 train samples
10000 test samples
[...]
2019-12-30 19:33:06.852550: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-30 19:33:06.890161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8a:00.0
2019-12-30 19:33:06.891298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[...]
Epoch 12/12
48000/48000 [==============================] - 7s 153us/sample - loss: 0.0262 - acc: 0.9916 - val_loss: 0.0398 - val_acc: 0.9907
Test loss: 0.03063129
Test accuracy: 0.9907

[user@cn4175 R]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load cuDNN/7.6.5/CUDA-10.2 CUDA/10.2 R/3.6.1 python/3.7
cd /data/$USER/deeplearning/R
Rscript keras/vignettes/examples/mnist_cnn.R
EOF

[user@cn4175 R]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
45256090

[user@cn4175 R]$ exit
exit
salloc.exe: Relinquishing job allocation 45255667
Allocate an interactive session with X11 forwarding and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 45391577
salloc.exe: job 45391577 queued and waiting for resources
salloc.exe: job 45391577 has been allocated resources
salloc.exe: Granted job allocation 45391577
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4210 are ready for job

[user@cn4210 ~]$ xeyes #make sure that x11 is enabled and working
^C
[user@cn4210 ~]$ module load python/3.7
[+] Loading python 3.7 ...
[user@cn4210 ~]$ mkdir -p /data/$USER/deeplearning
[user@cn4210 ~]$ cd /data/$USER/deeplearning
[user@cn4210 deeplearning]$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> import torch
>>> x = torch.empty(5, 3)
>>> print(x)
tensor([[1.4013e-45, 7.0371e+28, 0.0000e+00],
        [0.0000e+00, 1.9247e+13, 3.0611e-41],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.3452e-43, 0.0000e+00, 0.0000e+00]])
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.3291, 0.7965, 0.2630],
        [0.3921, 0.4740, 0.3053],
        [0.3313, 0.5913, 0.1922],
        [0.3985, 0.6349, 0.9997],
        [0.3966, 0.3017, 0.4237]])
>>> quit()

[user@cn4210 deeplearning]$ git clone https://github.com/pytorch/tutorials.git
Cloning into 'tutorials'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 28364 (delta 0), reused 2 (delta 0), pack-reused 28361
Receiving objects: 100% (28364/28364), 563.09 MiB | 40.53 MiB/s, done.
Resolving deltas: 100% (20031/20031), done.

[user@cn4210 deeplearning]$ cd tutorials/ && git checkout c87836d4 && cd .. #ensure same starting point
Note: checking out 'c87836d4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at c87836d... Merge pull request #765 from pytorch/pr-run-options

[user@cn4210 deeplearning]$ python tutorials/beginner_source/blitz/cifar10_tutorial.py
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170500096it [00:02, 71187107.19it/s]
Files already downloaded and verified
plane horse ship bird
[1, 2000] loss: 2.233
[1, 4000] loss: 1.887
[1, 6000] loss: 1.672
[1, 8000] loss: 1.579
[1, 10000] loss: 1.510
[1, 12000] loss: 1.484
[2, 2000] loss: 1.389
[2, 4000] loss: 1.367
[2, 6000] loss: 1.344
[2, 8000] loss: 1.306
[2, 10000] loss: 1.313
[2, 12000] loss: 1.256
Finished Training
GroundTruth: cat ship ship plane
Predicted: cat ship ship ship
Accuracy of the network on the 10000 test images: 55 %
Accuracy of plane : 52 %
Accuracy of car : 71 %
Accuracy of bird : 23 %
Accuracy of cat : 29 %
Accuracy of deer : 47 %
Accuracy of dog : 45 %
Accuracy of frog : 76 %
Accuracy of horse : 52 %
Accuracy of ship : 79 %
Accuracy of truck : 72 %
cuda:0

[user@cn4210 deeplearning]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/3.7
python tutorials/beginner_source/blitz/cifar10_tutorial.py
EOF

[user@cn4210 deeplearning]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
45392901

[user@cn4210 deeplearning]$ exit
exit
salloc.exe: Relinquishing job allocation 45391577
[user@biowulf ~]$
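The cuda:0 printed at the end comes from the tutorial's final section, which only selects the device; training actually runs on the GPU only after the model and every minibatch are moved there explicitly. The following is a minimal sketch of that standard pattern, with a stand-in linear layer where the tutorial's net would go:

import torch
import torch.nn as nn

# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

# Move the model's parameters to the device once (in the tutorial:
# net.to(device))...
net = nn.Linear(32, 10).to(device)

# ...and move each minibatch there inside the training loop (in the
# tutorial: inputs, labels = inputs.to(device), labels.to(device)).
inputs = torch.randn(4, 32).to(device)
outputs = net(inputs)
print(outputs.shape)  # torch.Size([4, 10])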
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 45394059
salloc.exe: job 45394059 queued and waiting for resources
salloc.exe: job 45394059 has been allocated resources
salloc.exe: Granted job allocation 45394059
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4187 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4187 ~]$ mkdir -p /data/$USER/deeplearning
[user@cn4187 ~]$ cd /data/$USER/deeplearning
[user@cn4187 deeplearning]$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
--2020-01-02 12:09:44--  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Proxy request sent, awaiting response... 200 OK
Length: 71785000 (68M) [application/x-sh]
Saving to: ‘Miniconda3-latest-Linux-x86_64.sh’

100%[============================================================>] 71,785,000  54.1MB/s   in 1.3s

2020-01-02 12:09:46 (54.1 MB/s) - ‘Miniconda3-latest-Linux-x86_64.sh’ saved [71785000/71785000]

[user@cn4187 deeplearning]$ bash Miniconda3-latest-Linux-x86_64.sh -p /data/$USER/deeplearning/conda -b
PREFIX=/data/user/deeplearning/conda
Unpacking payload ...
Collecting package metadata (current_repodata.json): done
Solving environment: done
[...]
installation finished.

[user@cn4187 deeplearning]$ source /data/$USER/deeplearning/conda/etc/profile.d/conda.sh
[user@cn4187 deeplearning]$ conda activate base
(base) [user@cn4187 deeplearning]$ which python #ensure that you are using your local installation
/data/user/deeplearning/conda/bin/python
(base) [user@cn4187 deeplearning]$ module load CUDA/10.1
[+] Loading CUDA Toolkit 10.1.105 ...
(base) [user@cn4187 deeplearning]$ pip install mxnet-cu101
Collecting mxnet-cu101
[...]
Successfully installed graphviz-0.8.4 mxnet-cu101-1.5.1.post0 numpy-1.18.0

(base) [user@cn4187 deeplearning]$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> a = mx.nd.ones((2, 3), mx.gpu())
>>> b = a * 2 + 1
>>> b.asnumpy()
array([[3., 3., 3.],
       [3., 3., 3.]], dtype=float32)
>>> quit()

(base) [user@cn4187 deeplearning]$ git clone https://github.com/apache/incubator-mxnet.git
Cloning into 'incubator-mxnet'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 109119 (delta 1), reused 0 (delta 0), pack-reused 109114
Receiving objects: 100% (109119/109119), 74.94 MiB | 30.01 MiB/s, done.
Resolving deltas: 100% (73917/73917), done.
Checking out files: 100% (4068/4068), done.

(base) [user@cn4187 deeplearning]$ cd incubator-mxnet/example/image-classification/
(base) [user@cn4187 image-classification]$ git checkout 06aec8aa #ensure the same starting point
Note: checking out '06aec8aa'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 06aec8a... [CI] Re-enable testing with numpy 1.18 (#17200)

(base) [user@cn4187 image-classification]$ python train_mnist.py --network mlp --num-epochs 2
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, use_imagenet_data_augmentation=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[...]
INFO:root:Epoch[1] Batch [800-900] Speed: 46124.04 samples/sec accuracy=0.965156
INFO:root:Epoch[1] Train-accuracy=0.966768
INFO:root:Epoch[1] Time cost=1.310
INFO:root:Epoch[1] Validation-accuracy=0.962779

(base) [user@cn4187 image-classification]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load CUDA/10.1
source /data/$USER/deeplearning/conda/etc/profile.d/conda.sh
conda activate base
python train_mnist.py --network mlp --num-epochs 2
EOF

(base) [user@cn4187 image-classification]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
45394383

(base) [user@cn4187 image-classification]$ exit
exit
salloc.exe: Relinquishing job allocation 45394059
[user@biowulf ~]$
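Note that gpus=None in the Namespace printed above means train_mnist.py ran on the CPU; the script accepts a --gpus flag (e.g. --gpus 0) to use the allocated card. In your own MXNet code, the device is chosen through a context object, as in this minimal sketch:

import mxnet as mx

# Pick the GPU context when one is available, otherwise the CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
print('Using context:', ctx)

# Arrays created with this context live, and compute, on that device.
a = mx.nd.ones((2, 3), ctx)
print((a * 2 + 1).asnumpy())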
Allocate an interactive session and run the program. (Note: The example below was obtained from the Quickstart Tutorial by Deniz Yuret at https://github.com/denizyuret/Knet.jl/blob/master/tutorial/15.quickstart.ipynb.) Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 58344035
salloc.exe: job 58344035 queued and waiting for resources
salloc.exe: job 58344035 has been allocated resources
salloc.exe: Granted job allocation 58344035
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4172 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4172 ~]$ module load cuDNN/7.6.5/CUDA-10.1 julialang/1.5.0
[+] Loading cuDNN/7.6.5/CUDA-10.1 libraries...
[+] Loading julialang 1.5.0 ...

[user@cn4172 ~]$ mkdir -p /data/${USER}/deeplearning/knet-example
[user@cn4172 ~]$ cd /data/${USER}/deeplearning/knet-example
[user@cn4172 knet-example]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.0 (2020-08-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> ]

(v1.5) pkg> add Knet
[...]

(v1.5) pkg> add MLDatasets
[...]

(v1.5) pkg> add IterTools
[...]

(v1.5) pkg> status
Status `/home/user/.julia/environments/v1.5/Project.toml`
  [c8e1da08] IterTools v1.3.0
  [1902f260] Knet v1.4.5
  [eb30cadb] MLDatasets v0.5.3

julia> using Knet, MLDatasets, IterTools
[ Info: Precompiling Knet [1902f260-5fb4-5aff-8c31-6271790ab950]
Downloading artifact: CUDA110
Downloading artifact: CUDNN_CUDA110
Downloading artifact: CUTENSOR_CUDA110
[ Info: Precompiling MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458]
[ Info: Precompiling IterTools [c8e1da08-722c-5040-9ed9-7db0dc04731e]

julia> struct Conv; w; b; f; end

julia> (c::Conv)(x) = c.f.(pool(conv4(c.w, x) .+ c.b))

julia> Conv(w1,w2,cx,cy,f=relu) = Conv(param(w1,w2,cx,cy), param0(1,1,cy,1), f);

julia> struct Dense; w; b; f; end

julia> (d::Dense)(x) = d.f.(d.w * mat(x) .+ d.b)

julia> Dense(i::Int,o::Int,f=relu) = Dense(param(o,i), param0(o), f);

julia> struct Chain; layers; Chain(args...)=new(args); end

julia> (c::Chain)(x) = (for l in c.layers; x = l(x); end; x)

julia> (c::Chain)(x,y) = nll(c(x),y)

julia> xtrn,ytrn = MNIST.traindata(Float32); ytrn[ytrn.==0] .= 10

julia> xtst,ytst = MNIST.testdata(Float32); ytst[ytst.==0] .= 10

julia> dtrn = minibatch(xtrn, ytrn, 100; xsize=(size(xtrn,1),size(xtrn,2),1,:))

julia> dtst = minibatch(xtst, ytst, 100; xsize=(size(xtst,1),size(xtst,2),1,:));

julia> LeNet = Chain(Conv(5,5,1,20), Conv(5,5,20,50), Dense(800,500), Dense(500,10,identity))

julia> progress!(adam(LeNet, ncycle(dtrn,10)))
[100.00%, 6000/6000, 01:10/01:10, 85.94i/s]

julia> accuracy(LeNet, dtst)
┌ Warning: accuracy(model,data; o...) is deprecated, please use accuracy(model; data=data, o...)
└ @ Knet.Ops20 ~/.julia/packages/Knet/C0PoK/src/ops20/loss.jl:205
0.9903

julia> exit()

[user@cn4172 knet-example]$ exit
exit
salloc.exe: Relinquishing job allocation 58344035
Now let's submit the Knet example above as a job to Biowulf's Slurm scheduler using sbatch. First save the Julia commands from the session above (from the 'using Knet, MLDatasets, IterTools' line through the accuracy(LeNet, dtst) call) into a script named knet-example.jl in the knet-example directory, then submit:
[user@biowulf]$ cat >submit.sh<<'EOF'
#!/bin/bash
module load cuDNN/7.6.5/CUDA-10.1 julialang/1.5.0
julia knet-example.jl
EOF

[user@biowulf]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
58345698

[user@biowulf]$ tail -n 2 slurm-58345698.out
[100.00%, 6000/6000, 01:00/01:00, 99.31i/s]
┌ Warning: accuracy(model,data; o...) is deprecated, please use accuracy(model; data=data, o...)
└ @ Knet.Ops20 ~/.julia/packages/Knet/C0PoK/src/ops20/loss.jl:205
[user@biowulf]$
NOTE: Caffe2 is not available for Python 3 on Biowulf, and its source code is now part of the PyTorch GitHub repository. This is therefore a legacy example.
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive --gres=gpu:k80:1,lscratch:10 --mem=20g -c14
salloc.exe: Pending job allocation 45397884
salloc.exe: job 45397884 queued and waiting for resources
salloc.exe: job 45397884 has been allocated resources
salloc.exe: Granted job allocation 45397884
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4210 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4210 ~]$ module load python/2.7 CUDA/10.2
[+] Loading python 2.7 ...
--------------------------------------------------------------------------------
Support for Python 2.7 will officially end January 1, 2020 though one more
release is planned for mid April 2020. See
https://www.python.org/dev/peps/pep-0373/
Therefore, on April 15, 2020 Python 3 will become the default Python module on
biowulf. The python/2.7 module will continue to be available after this date
but not as the default.
Please update your code and workflows.
--------------------------------------------------------------------------------
[+] Loading CUDA Toolkit 10.2.89 ...

[user@cn4210 ~]$ mkdir -p /data/$USER/deeplearning
[user@cn4210 ~]$ cd /data/$USER/deeplearning
[user@cn4210 deeplearning]$ python
Python 2.7.15 |Anaconda, Inc.| (default, Oct 10 2018, 21:32:13)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from caffe2.python import workspace, model_helper
>>> import numpy as np
>>> x = np.random.rand(4, 3, 2)
>>> print(x)
[[[0.73718637 0.47582559]
  [0.20259583 0.73612329]
  [0.43679271 0.36409541]]

 [[0.47364977 0.03001404]
  [0.51150512 0.574326  ]
  [0.29767152 0.92993194]]

 [[0.73750268 0.10525092]
  [0.29873771 0.18084976]
  [0.2605085  0.71325083]]

 [[0.96754395 0.13209623]
  [0.54818848 0.46042724]
  [0.06515803 0.53438246]]]
>>> print(x.shape)
(4, 3, 2)
>>> workspace.FeedBlob("my_x", x)
True
>>> x2 = workspace.FetchBlob("my_x")
>>> print(x2)
[[[0.73718637 0.47582559]
  [0.20259583 0.73612329]
  [0.43679271 0.36409541]]

 [[0.47364977 0.03001404]
  [0.51150512 0.574326  ]
  [0.29767152 0.92993194]]

 [[0.73750268 0.10525092]
  [0.29873771 0.18084976]
  [0.2605085  0.71325083]]

 [[0.96754395 0.13209623]
  [0.54818848 0.46042724]
  [0.06515803 0.53438246]]]
>>> quit()

[user@cn4210 deeplearning]$ git clone https://github.com/caffe2/caffe2.git
Cloning into 'caffe2'...
remote: Enumerating objects: 230451, done.
remote: Total 230451 (delta 0), reused 0 (delta 0), pack-reused 230451
Receiving objects: 100% (230451/230451), 392.83 MiB | 35.82 MiB/s, done.
Resolving deltas: 100% (211854/211854), done.

[user@cn4210 deeplearning]$ cd caffe2/ && git checkout v0.8.1 && cd .. #get latest non-deprecated release
Checking out files: 100% (1267/1267), done.
Note: checking out 'v0.8.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 32f023f... Add conv layer and layer tests

[user@cn4210 deeplearning]$ python caffe2/caffe2/python/models/resnet_test.py
[...]
[I memonger.cc:239] Memonger saved approximately : 343.904 MB.
INFO:memonger:Memonger memory optimization took 0.0177371501923 secs
before: 880 after: 776
.
----------------------------------------------------------------------
Ran 3 tests in 19.429s

OK

[user@cn4210 deeplearning]$ cat > submit.sh <<'EOF'
#!/bin/bash
module load python/2.7 CUDA/10.2
python caffe2/caffe2/python/models/resnet_test.py
EOF

[user@cn4210 deeplearning]$ sbatch --partition=gpu --gres=gpu:k80:1,lscratch:10 --mem=20g -c14 submit.sh
45397997

[user@cn4210 deeplearning]$ exit
exit
salloc.exe: Relinquishing job allocation 45397884
[user@biowulf ~]$
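For completeness, the FeedBlob/FetchBlob workspace shown above also executes operators: an operator definition names its input and output blobs, and running it populates the outputs in the same workspace. A minimal sketch using the legacy Caffe2 Python API (the Relu example is illustrative, not taken from resnet_test.py):

import numpy as np
from caffe2.python import core, workspace

# Feed an input blob into the global workspace.
workspace.FeedBlob('X', np.random.randn(2, 3).astype(np.float32))

# Define a single Relu operator reading blob 'X' and writing blob 'Y',
# then execute it once.
op = core.CreateOperator('Relu', ['X'], ['Y'])
workspace.RunOperatorOnce(op)

print(workspace.FetchBlob('Y'))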
Please see the multi-GPU deep learning page for training with more than 1 GPU.