Multinode Deep Learning on Biowulf

Training on very large datasets can be sped up substantially by using multiple GPUs, on a single node or across several nodes. Examples of multi-GPU training with TensorFlow and PyTorch are shown here. This page will guide you through using the different deep learning frameworks on Biowulf in interactive sessions and via sbatch submission (and, by extension, swarm jobs).

Important Notes

Single-node Multi-GPU TensorFlow

Allocate an interactive session with 4 GPUs. Sample session follows. (User input in bold.)

[user@biowulf ~]$ sinteractive --gres=gpu:k80:4,lscratch:10 --mem=100g -c56 
salloc.exe: Pending job allocation 58766834
salloc.exe: job 58766834 queued and waiting for resources
salloc.exe: job 58766834 has been allocated resources
salloc.exe: Granted job allocation 58766834
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4216 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
While the central Python installation includes TensorFlow, it does not include the tensorflow-datasets package needed to run this example, so we will create a conda environment. This assumes that you have already installed Miniconda according to these instructions.
[user@cn4216 ~]$ source /data/${USER}/conda/etc/profile.d/conda.sh 

[user@cn4216 ~]$ conda create -y -n tensorflow \
    python=3.7 tensorflow-gpu==2.1.0 tensorflow-datasets==1.2.0 
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /data/user/conda/envs/tensorflow

  added / updated specs:
    - python=3.7
    - tensorflow-datasets==1.2.0
    - tensorflow-gpu==2.1.0

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  _tflow_select      pkgs/main/linux-64::_tflow_select-2.1.0-gpu
  absl-py            pkgs/main/linux-64::absl-py-0.9.0-py37_0
[...snip...]
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate tensorflow
#
# To deactivate an active environment, use
#
#     $ conda deactivate

[user@cn4216 ~]$ conda activate tensorflow
Clone the TensorFlow models GitHub repository.
(tensorflow)[user@cn4216 ~]$ mkdir -p /data/${USER}/deeplearning

(tensorflow)[user@cn4216 ~]$ cd /data/${USER}/deeplearning

(tensorflow)[user@cn4216 deeplearning]$ git clone https://github.com/tensorflow/models.git && cd models
Cloning into 'models'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 36106 (delta 0), reused 1 (delta 0), pack-reused 36100
Receiving objects: 100% (36106/36106), 520.11 MiB | 43.63 MiB/s, done.
Resolving deltas: 100% (24044/24044), done.
Checking out files: 100% (2531/2531), done.

(tensorflow)[user@cn4216 models]$ git checkout v2.1.0 # match tensorflow version
Checking out files: 100% (2577/2577), done.
Note: checking out 'v2.1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at d7a2673... Removing research/samples/tutorials models
Set up some work directories, prepare the environment, and train the model.

We will use the MNIST example that has been rewritten for TensorFlow v2. It accepts the --distribution_strategy and --num_gpus flags, which control how training is distributed and how many GPUs are used. Run with the --helpfull flag to see a complete list of options and arguments.
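
For reference, --distribution_strategy=mirrored corresponds to TensorFlow's tf.distribute.MirroredStrategy, which keeps a replica of the model on every visible GPU and all-reduces the gradients after each step. Below is a minimal sketch of that pattern written directly against the Keras API; the toy model is illustrative and is not taken from mnist_main.py.

import tensorflow as tf

# Replicate the model across all GPUs visible to this process;
# gradients are averaged with an all-reduce after every step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (model weights, optimizer state) must be created
# inside the strategy scope so each GPU gets its own copy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit() splits each global batch of 256 across the replicas.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=256)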

(tensorflow)[user@cn4216 models]$ export PYTHONPATH=`pwd`:$PYTHONPATH

(tensorflow)[user@cn4216 models]$ mkdir /lscratch/${SLURM_JOB_ID}/{model,data}

(tensorflow)[user@cn4216 models]$ export MODEL_DIR=/lscratch/${SLURM_JOB_ID}/model \
    DATA_DIR=/lscratch/${SLURM_JOB_ID}/data \
    GPU_N=4 

(tensorflow)[user@cn4216 models]$ python official/vision/image_classification/mnist_main.py \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR \
    --train_epochs=10 \
    --distribution_strategy=mirrored \
    --num_gpus=$GPU_N \
    --download 
2020-05-27 18:04:43.374880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-27 18:04:43.532007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:83:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.92GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-27 18:04:43.533659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:84:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.92GiB deviceMemoryBandwidth: 223.96GiB/s
[...snip...]
2020-05-27 18:04:45.413738: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-27 18:04:45.423735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1, 2, 3
2020-05-27 18:04:45.423844: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-27 18:04:45.429595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-27 18:04:45.429624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1 2 3
2020-05-27 18:04:45.429679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N Y Y Y
2020-05-27 18:04:45.429699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   Y N Y Y
2020-05-27 18:04:45.429723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 2:   Y Y N Y
2020-05-27 18:04:45.429733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 3:   Y Y Y N
2020-05-27 18:04:45.445009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11531 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0, compute capability: 3.7)
[...snip...]
I0527 18:05:59.210283 46912496418496 mnist_main.py:165] Run stats:
{'accuracy_top_1': 0.974609375, 'eval_loss': 0.08393807543648614, 'loss': 0.10950823462214963, 'training_accuracy_top_1': 0.9673019647598267}

(tensorflow)[user@cn4216 models]$ exit
exit
salloc.exe: Relinquishing job allocation 58766834
salloc.exe: Job allocation 58766834 has been revoked.

[user@biowulf ~]$
Here is a slightly more intensive example that uses the same Python environment to train a ResNet model on the CIFAR-10 dataset. We will submit this job to the batch scheduler instead of running it in an interactive session. Create a submission script.
[user@biowulf ~]$ mkdir resnet_job && cd resnet_job

[user@biowulf resnet_job]$ cat >resnet.sh<<'EOF'
#!/bin/bash
# set up directories and download data
mkdir /lscratch/${SLURM_JOB_ID}/{model,data}
cd /lscratch/${SLURM_JOB_ID}/data
wget https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar xf cifar-10-binary.tar.gz
# set up the environment
source /data/${USER}/conda/etc/profile.d/conda.sh
conda activate tensorflow
cd /data/${USER}/deeplearning/models
git checkout v2.1.0
export PYTHONPATH=`pwd`:$PYTHONPATH \
    MODEL_DIR=/lscratch/${SLURM_JOB_ID}/model \
    DATA_DIR=/lscratch/${SLURM_JOB_ID}/data/cifar-10-batches-bin \
    GPU_N=4
# train the model
python official/vision/image_classification/resnet_cifar_main.py \
    --num_gpus=$GPU_N \
    --batch_size=128 \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR
EOF
Now submit the job and check the output.
[user@biowulf resnet_job]$ sbatch --partition=gpu --gres=gpu:k80:4,lscratch:10 --mem=20g -c14 resnet.sh
58772301

[user@biowulf resnet_job]$ tail -n 1 slurm-58772301.out
{'accuracy_top_1': 0.17698317766189575, 'eval_loss': 6.45149180216667, 'loss': 2.8785855941283396, 'training_accuracy_top_1': 0.2016225904226303, 'step_timestamp_log': ['BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp'], 'train_finish_time': 1590623301.954509, 'avg_exp_per_second': 605.1630839054761}

Benchmarks

Depending on the dataset, training may scale well to multiple GPUs. Using the relatively small CIFAR-10 dataset, the number of images processed per second increases as more K80 GPUs are added. On the other hand, V100 performance maxes out even with a single GPU, and adding more GPUs may even slightly degrade performance.




[Plot: CIFAR-10 training throughput (images/sec) vs. number of GPUs]

However, good scaling is achieved with the much larger ImageNet dataset. In this example, all three types of GPUs show good scaling. Note that the ImageNet dataset has already been saved locally in /fdb/imagenet.

These results were obtained using the same python environment and GitHub repo described above with the following command:

[user@biowulf models]$ python official/vision/image_classification/resnet_imagenet_main.py \
    --num_gpus=$GPU_N \
    --batch_size=128 \
    --model_dir=/lscratch/$SLURM_JOB_ID \
    --data_dir=/fdb/imagenet



[Plot: ImageNet training throughput (images/sec) vs. number of GPUs for all three GPU types]

Single-node Multi-GPU PyTorch

Allocate an interactive session with 4 GPUs and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:4,lscratch:10 --mem=100g -c56
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load python/3.8
[+] Loading python 3.8  ...

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

Here we will run the multi-GPU data parallelism example from the PyTorch tutorials.

[user@cn3144 ~]$ python
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:46:39)
[GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count() ### number of GPU devices
4
>>> exit()

[user@cn3144 ~]$ wget https://pytorch.org/tutorials/_downloads/39aeec3cba2dd8363e78683704cabea7/data_parallel_tutorial.py

[user@cn3144 ~]$ python data_parallel_tutorial.py
Let's use 4 GPUs!
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
	In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2])
	In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2])
	In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2])
	In Model: input size torch.Size([1, 5]) output size torch.Size([1, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
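
The tutorial script relies on torch.nn.DataParallel, which replicates the model on every visible GPU, scatters each input batch across the replicas, and gathers the outputs back on the first GPU; that is why a batch of 30 above is split into chunks of roughly 8 per GPU. A minimal sketch of the same pattern (the one-layer model is a stand-in, not the tutorial's):

import torch
import torch.nn as nn

# Any nn.Module works here; a single linear layer keeps the sketch short.
model = nn.Linear(5, 2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Replicate the model on all visible GPUs; each forward pass
    # scatters the batch, runs the replicas, and gathers the outputs.
    model = nn.DataParallel(model)
model = model.to(device)

inputs = torch.randn(30, 5).to(device)   # one batch of 30 samples
outputs = model(inputs)                  # split ~8 per GPU on 4 GPUs
print("Outside: input size", inputs.size(), "output size", outputs.size())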

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Multi-node Multi-GPU TensorFlow using Horovod

Allocate an interactive session and run the program. In this example, Horovod is run from a pre-built Docker container pulled with Singularity.
Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:4,lscratch:50 --mem=200g -c56
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ module load singularity
[+] Loading singularity  3.4.2  on cn0618

[user@cn3144 ~]$ singularity pull docker://horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a
 25.45 MiB / 25.45 MiB [====================================================] 0s
Copying blob sha256:251f5509d51d9e4119d4ffb70d4820f8e2d7dc72ad15df3ebd7cd755539e40fd
 34.54 KiB / 34.54 KiB [====================================================] 0s
Copying blob sha256:8e829fe70a46e3ac4334823560e98b257234c23629f19f05460e21a453091e6d
 848 B / 848 B [============================================================] 0s
Copying blob sha256:6001e1789921cf851f6fb2e5fe05be70f482fe9c2286f66892fe5a3bc404569c
[...]
2019/11/07 15:06:03  info unpack layer: sha256:fa6badf8ba54dfbaea36c253aae54b186d5caf80e24b59806844dafc72f44bf2
INFO:    Creating SIF file...
INFO:    Build complete: horovod_0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6.sif

[user@cn3144 ~]$ singularity run --nv horovod_0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6.sif \
mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --oversubscribe \
python /examples/tensorflow_synthetic_benchmark.py --batch-size 64
[Deprecation warnings...]
Running benchmark...
Iter #0: 52.2 img/sec per GPU
Iter #1: 52.2 img/sec per GPU
Iter #2: 52.1 img/sec per GPU
Iter #3: 52.1 img/sec per GPU
Iter #4: 52.0 img/sec per GPU
Iter #5: 52.1 img/sec per GPU
Iter #6: 52.1 img/sec per GPU
Iter #7: 52.1 img/sec per GPU
Iter #8: 52.1 img/sec per GPU
Iter #9: 52.1 img/sec per GPU
Img/sec per GPU: 52.1 +-0.1
Total img/sec on 4 GPU(s): 208.5 +-0.4
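
The benchmark script inside the container follows the standard Horovod recipe: one process per GPU, each process pins its local GPU, wraps its optimizer in hvd.DistributedOptimizer so gradients are averaged across all ranks, and broadcasts the initial weights from rank 0. A minimal tf.keras sketch of that recipe is shown below; it is illustrative only (the container itself ships TensorFlow 1.14, whose setup code differs slightly) and would be launched with mpirun or horovodrun, one process per GPU.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; MPI supplies the rank/size information.
hvd.init()

# Pin this process to its local GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the
# optimizer so gradients are all-reduced after every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

(x, y), _ = tf.keras.datasets.mnist.load_data()
model.fit(x / 255.0, y, batch_size=64, epochs=1,
          # Broadcast rank 0's initial weights so all workers start identical.
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          # Only rank 0 prints the progress bar.
          verbose=1 if hvd.rank() == 0 else 0)

To scale the same container beyond one node, the mpirun launch is moved into a Slurm batch script, as in the following example.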


[user@cn3144 ~]$ cat > submit.sh <<'EOF'
#!/bin/bash 
#SBATCH --constraint=gpup100
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=14
#SBATCH --mem=50g
#SBATCH --gres=gpu:p100:2
#SBATCH --time 30

module load singularity openmpi/4.0.1/cuda-10.1/gcc-7.4.0 openmpi/4.0.1/gcc-7.4.0
## Ideally openmpi and cuda need to match with the container (as much as possible)
## The container has openmpi 4.0 and cuda 10

mpirun -np $SLURM_NTASKS \
-bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -x NCCL_DEBUG=INFO \
singularity run --nv horovod_0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6.sif \
python /examples/tensorflow_synthetic_benchmark.py \
--batch-size 128 --num-batches-per-iter 32 --num-iters 20

### For better performance with V100 nodes, add --fp16-allreduce and increase the batch-size
EOF

[user@cn3144 ~]$ sbatch submit.sh
41487901

[user@cn3144 ~]$ tail slurm-41487901.out
Iter #12: 204.6 img/sec per GPU
Iter #13: 206.3 img/sec per GPU
Iter #14: 206.4 img/sec per GPU
Iter #15: 204.9 img/sec per GPU
Iter #16: 206.3 img/sec per GPU
Iter #17: 206.6 img/sec per GPU
Iter #18: 207.0 img/sec per GPU
Iter #19: 205.9 img/sec per GPU
Img/sec per GPU: 205.9 +-1.4
Total img/sec on 4 GPU(s): 823.6 +-5.4

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Multi-node Multi-GPU PyTorch with PyTorch Lightning

PyTorch Lightning distributed training works best when submitted as a Slurm batch script.
Sample session:

[user@biowulf]$ sinteractive

[user@cn3144]$ cat > torch_test.py << EOF
import os
import sys
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy

class AutoEncoder(pl.LightningModule):
    # Simple autoencoder
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    # Define one step of training loop
    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    # Configure optimizer in LightningModule for auto-device function
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

def main():
    # Make the model large enough to take appreciable time to check distribution
    encoder = nn.Sequential(nn.Linear(28 * 28, 2048), nn.ReLU(), nn.Linear(2048, 64))
    decoder = nn.Sequential(nn.Linear(64, 2048), nn.ReLU(), nn.Linear(2048, 28 * 28))

    # init the autoencoder model
    autoencoder = AutoEncoder(encoder, decoder)

    # init the dataset
    dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
    train_loader = utils.data.DataLoader(dataset, batch_size=64)

    # build the multi-node trainer, using DDP over NCCL
    # the node/device geometry needs to match your slurm batch script
    # devices is per-node
    nodes, gpus = int(sys.argv[1]), int(sys.argv[2])
    trainer = pl.Trainer(max_epochs=100,
            strategy=DDPStrategy(find_unused_parameters=False),
            accelerator="gpu",
            num_nodes=nodes,
            devices=gpus)
            
    # train the model, automatically moving all data, models, and
    # optimizers to the appropriate devices
    trainer.fit(model=autoencoder, train_dataloaders=train_loader)

if __name__ == "__main__":
    main()

EOF


[user@cn3144]$ cat > torch_test.sh << EOF
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --mem=16g
#SBATCH --gres=gpu:k80:4
#SBATCH --time 00:30:00

# Load the correct Python module
module load python/3.8

# Use srun to start up the specified number of tasks on each node for distribution
# pytorch-lightning autodetects Slurm environment and configures NCCL
srun python torch_test.py 2 4
EOF


[user@cn3144]$ module load python/3.8
[+] Loading python 3.8  ...

[user@cn3144]$ pip install --user pytorch-lightning
...snip...
Successfully installed pyDeprecate-0.3.2 pytorch-lightning-1.7.3 tensorboard-2.10.0 torchmetrics-0.9.3 typing-extensions-4.3.0

# Pre-download MNIST dataset since simultaneous setup from the distributed training script has race conditions
[user@cn3144]$ python -c "import os; from torchvision.datasets import MNIST; from torchvision.transforms import ToTensor;  MNIST(os.getcwd(), download=True, transform=ToTensor())"

[user@cn3144]$ sbatch torch_test.sh
46630586

[user@cn3144]$ tail slurm-46630586.out
Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
...snip...
Epoch 0:  38%|███▊      | 45/118 [00:02<00:^C, 17.05it/s, loss=0.032, v_num=4.66e+7]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46630586

[user@biowulf ~]$

Expect the output in the Slurm log file to look a bit garbled, since the default PyTorch Lightning progress output is designed for an interactive console.