Multinode Deep Learning on Biowulf

Training on multiple GPUs, on a single node or across several nodes, can speed up the training of models on very large datasets. Examples of multi-GPU training with TensorFlow and PyTorch are shown here. This page will guide you through the use of the different deep learning frameworks on Biowulf using interactive sessions and sbatch submission (and by extension swarm jobs).

Important Notes

Single-node Multi-GPU TensorFlow

Allocate an interactive session with 4 GPUs. Sample session follows. (User input in bold.)

[user@biowulf ~]$ sinteractive --gres=gpu:k80:4,lscratch:10 --mem=100g -c56 
While the central python installation has tensorflow, it does not contain the tensorflow-datasets package needed to run this example. We will create a conda environment. This assumes that you have already installed miniconda according to these instructions.
[user@cn4216 ~]$ source /data/${USER}/conda/etc/profile.d/ 

[user@cn4216 ~]$ conda create -y -n tensorflow \
    python=3.7 tensorflow-gpu==2.1.0 tensorflow-datasets==1.2.0 
## Package Plan ##

  environment location: /data/user/conda/envs/tensorflow

  added / updated specs:
    - python=3.7
    - tensorflow-datasets==1.2.0
    - tensorflow-gpu==2.1.0

The following NEW packages will be INSTALLED:

[user@cn4216 ~]$ conda activate tensorflow
Clone the tensorflow models github repository.
(tensorflow)[user@cn4216 ~]$ mkdir -p /data/${USER}/deeplearning

(tensorflow)[user@cn4216 ~]$ cd /data/${USER}/deeplearning

(tensorflow)[user@cn4216 deeplearning]$ git clone && cd models
Set up some work directories, prepare the environment, and train the model.

We will use the MNIST example that has been rewritten for TensorFlow v2. Its --distribution_strategy and --num_gpus flags control how training is distributed and how many GPUs are used. Run with the --helpfull flag to see a complete list of options and arguments.

(tensorflow)[user@cn4216 models]$ export PYTHONPATH=`pwd`:$PYTHONPATH

(tensorflow)[user@cn4216 models]$ mkdir /lscratch/${SLURM_JOB_ID}/{model,data}

(tensorflow)[user@cn4216 models]$ export MODEL_DIR=/lscratch/${SLURM_JOB_ID}/model \
    DATA_DIR=/lscratch/${SLURM_JOB_ID}/data \

(tensorflow)[user@cn4216 models]$ python official/vision/image_classification/ \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR \
    --train_epochs=10 \
    --distribution_strategy=mirrored \
    --num_gpus=$GPU_N \
I0527 18:05:59.210283 46912496418496] Run stats:
{'accuracy_top_1': 0.974609375, 'eval_loss': 0.08393807543648614, 'loss': 0.10950823462214963, 'training_accuracy_top_1': 0.9673019647598267}

(tensorflow)[user@cn4216 models]$ exit
salloc.exe: Relinquishing job allocation 58766834
salloc.exe: Job allocation 58766834 has been revoked.

[user@biowulf ~]$
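
The mirrored strategy used in the session above is a form of synchronous data parallelism: each GPU holds a copy of the model, processes its own slice of every batch, and the per-GPU gradients are averaged before the weights are updated. The following is a minimal pure-Python sketch of that idea, not TensorFlow's implementation. The one-parameter model and the names mse_grad and mirrored_grad are hypothetical and chosen only for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def mirrored_grad(w, xs, ys, num_replicas):
    """Emulate one synchronous data-parallel step on num_replicas devices."""
    if len(xs) % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    shard = len(xs) // num_replicas
    grads = []
    for r in range(num_replicas):
        lo, hi = r * shard, (r + 1) * shard
        # each replica sees only its own slice of the global batch
        grads.append(mse_grad(w, xs[lo:hi], ys[lo:hi]))
    # stand-in for the all-reduce step: average the per-replica gradients
    return sum(grads) / num_replicas


xs = [float(i) for i in range(1, 9)]   # global batch of 8 examples
ys = [2.0 * x for x in xs]             # targets generated from y = 2x
full = mse_grad(0.5, xs, ys)
split = mirrored_grad(0.5, xs, ys, num_replicas=4)
print(full, split)
```

Because every replica receives a shard of the same size, the averaged gradient matches the full-batch gradient. Adding GPUs therefore changes how quickly a batch is processed, not the result of the update step.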
Here is a slightly more intensive example that uses the same Python environment and trains the ResNet model on the CIFAR-10 dataset. We will submit this job to the batch scheduling system instead of running it in an interactive session. Create a submission script.
[user@biowulf ~]$ mkdir resnet_job && cd resnet_job

[user@biowulf resnet_job]$ cat ><<'EOF'
# set up directories and download data
mkdir /lscratch/${SLURM_JOB_ID}/{model,data}
cd /lscratch/${SLURM_JOB_ID}/data
tar xf cifar-10-binary.tar.gz
# set up the environment
source /data/${USER}/conda/etc/profile.d/
conda activate tensorflow
cd /data/${USER}/deeplearning/models
git checkout v2.1.0
    MODEL_DIR=/lscratch/${SLURM_JOB_ID}/model \
    DATA_DIR=/lscratch/${SLURM_JOB_ID}/data/cifar-10-batches-bin \
# train the model
python official/vision/image_classification/ \
    --num_gpus=$GPU_N \
    --batch_size=128 \
    --model_dir=$MODEL_DIR \
Now submit the job and check the output.
[user@biowulf resnet_job]$ sbatch --partition=gpu --gres=gpu:k80:4,lscratch:10 --mem=20g -c14

[user@biowulf resnet_job]$ tail -n 1 slurm-58772301.out
{'accuracy_top_1': 0.17698317766189575, 'eval_loss': 6.45149180216667, 'loss': 2.8785855941283396, 'training_accuracy_top_1': 0.2016225904226303, 'step_timestamp_log': ['BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp'], 'train_finish_time': 1590623301.954509, 'avg_exp_per_second': 605.1630839054761}


Whether training scales well to multiple GPUs depends on the dataset. With the relatively small CIFAR-10 dataset, the number of images processed per second increases as more K80 GPUs are added. On the other hand, V100 performance is already at its maximum with 1 GPU, and adding more GPUs may even slightly degrade performance.
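
One way to quantify this kind of scaling is to compare the images-per-second figure reported at the end of each run (avg_exp_per_second in the log above) against the single-GPU value. The sketch below computes speedup and parallel efficiency. The throughput numbers are hypothetical placeholders, not measurements from Biowulf, and scaling_report is an illustrative name.

```python
def scaling_report(throughput):
    """Return {n_gpus: (speedup, efficiency)} relative to the 1-GPU run.

    throughput maps a GPU count to images processed per second.
    """
    base = throughput[1]
    report = {}
    for n_gpus, rate in sorted(throughput.items()):
        speedup = rate / base
        # efficiency of 1.0 means perfect linear scaling
        report[n_gpus] = (speedup, speedup / n_gpus)
    return report


# hypothetical images/sec values, NOT measured on Biowulf
example = {1: 100.0, 2: 190.0, 4: 340.0}
for n, (speedup, efficiency) in scaling_report(example).items():
    print(f"{n} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency that drops sharply as GPUs are added suggests the job is limited by something other than GPU compute, such as data loading, as in the V100 case described above.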


However, good scaling is achieved using the much larger ImageNet dataset. In this example, all three types of GPUs show good scaling. Note that the ImageNet dataset is already available locally in /fdb/imagenet.

These results were obtained using the same python environment and GitHub repo described above with the following command:

[user@biowulf models]$ python official/vision/image_classification/ \
    --num_gpus=$GPU_N \
    --batch_size=128 \
    --model_dir=/lscratch/$SLURM_JOB_ID \


Single-node Multi-GPU PyTorch

Allocate an interactive session with 4 GPUs and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:4,lscratch:10 --mem=100g -c56
[user@cn3144 ~]$ module load python/3.8
[+] Loading python 3.8  ...

[user@cn3144 ~]$ mkdir -p /data/$USER/deeplearning

[user@cn3144 ~]$ cd /data/$USER/deeplearning

Here we will run the multi-GPU parallelism example in PyTorch.

[user@cn3144 ~]$ python
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:46:39)
[GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
>>> torch.cuda.device_count() ### number of GPU devices
>>> exit()

[user@cn3144 ~]$ wget

[user@cn3144 ~]$ python
Let's use 4 GPUs!
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
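
The interactive check with torch.cuda.device_count() can be complemented in batch scripts by inspecting the environment before any framework is imported. This sketch assumes that Slurm exports CUDA_VISIBLE_DEVICES for the GPUs allocated through --gres, which is its usual behavior but should be confirmed on the cluster. The helper name visible_gpu_count is hypothetical.

```python
import os


def visible_gpu_count(environ=None):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction is in place,
    so the count cannot be inferred from the environment alone).
    """
    env = os.environ if environ is None else environ
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # an empty string means that no devices are visible
    return len([dev for dev in value.split(",") if dev.strip()])


print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))
```

Called without arguments, the function reads the real environment of the job, so a script can stop early if the count differs from the number of GPUs it was written for.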
Multi-node Multi-GPU Tensorflow using Horovod

Allocate an interactive session and run the program. In this example, Horovod is run from a Singularity container pulled from Docker Hub.
Sample session:

[user@biowulf]$ sinteractive --gres=gpu:k80:4,lscratch:50 --mem=200g -c56
[user@cn3144 ~]$ cd /data/$USER/deeplearning

[user@cn3144 ~]$ module load singularity
[+] Loading singularity  3.4.2  on cn0618

[user@cn3144 ~]$ singularity pull docker://horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
[user@cn3144 ~]$ singularity run --nv horovod_0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6.sif \
mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --oversubscribe \
python /examples/ --batch-size 64
[Deprecation warnings...]
Running benchmark...
Total img/sec on 4 GPU(s): 208.5 +-0.4

[user@cn3144 ~]$ cat > <<'EOF'
#SBATCH --constraint=gpup100
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=14
#SBATCH --mem=50g
#SBATCH --gres=gpu:p100:2
#SBATCH --time 30

module load singularity openmpi/4.0.1/cuda-10.1/gcc-7.4.0 openmpi/4.0.1/gcc-7.4.0
## Ideally openmpi and cuda need to match with the container (as much as possible)
## The container has openmpi 4.0 and cuda 10

mpirun -np $SLURM_NTASKS \
-bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -x NCCL_DEBUG=INFO \
singularity run --nv horovod_0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6.sif \
python /examples/ \
--batch-size 128 --num-batches-per-iter 32 --num-iters 20

### For better performance with V100 nodes, add --fp16-allreduce and increase the batch-size

[user@cn3144 ~]$ sbatch

[user@cn3144 ~]$ tail slurm-41487901.out
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
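
When changing the batch-script flags, it can help to work out how much data the benchmark will process. The sketch below does that arithmetic for the values used above (--batch-size 128, --num-batches-per-iter 32, --num-iters 20, 4 tasks). It assumes that the batch size is applied per process, which is the usual convention in Horovod examples but is not confirmed by this page. benchmark_volume is a hypothetical helper name.

```python
def benchmark_volume(batch_size, num_batches_per_iter, num_iters, num_workers):
    """Work out the data volume of a data-parallel benchmark run.

    Assumes batch_size is per worker (one worker per GPU).
    """
    global_batch = batch_size * num_workers
    images_per_iter = global_batch * num_batches_per_iter
    return {
        "global_batch": global_batch,
        "images_per_iter": images_per_iter,
        "total_images": images_per_iter * num_iters,
    }


# values taken from the batch script above: 2 nodes x 2 GPUs = 4 workers
volume = benchmark_volume(batch_size=128, num_batches_per_iter=32,
                          num_iters=20, num_workers=4)
print(volume)
```

Dividing total_images by the wall time of the benchmark loop gives a rough cross-check on the reported "Total img/sec" figure.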
Multi-node Multi-GPU PyTorch with PyTorch-Lightning

PyTorch-Lightning distributed training works best when submitted as a Slurm batch script.
Sample session:

[user@biowulf]$ sinteractive

[user@cn3144]$ cat > << EOF
import os
import sys
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy

class AutoEncoder(pl.LightningModule):
    # Simple autoencoder
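    def __init__(self, encoder, decoder):
        # nn.Module requires the parent initializer to run before submodules are assigned
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder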

    # Define one step of training loop
    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    # Configure optimizer in LightningModule for auto-device function
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

def main():
    # Make the model large enough to take appreciable time to check distribution
    encoder = nn.Sequential(nn.Linear(28 * 28, 2048), nn.ReLU(), nn.Linear(2048, 64))
    decoder = nn.Sequential(nn.Linear(64, 2048), nn.ReLU(), nn.Linear(2048, 28 * 28))

    # init the autoencoder model
    autoencoder = AutoEncoder(encoder, decoder)

    # init the dataset
    dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
    train_loader = utils.data.DataLoader(dataset, batch_size=64)

    # build the multi-node trainer, using DDP over NCCL
    # the node/device geometry needs to match your slurm batch script
    # devices is per-node
    nodes, gpus = int(sys.argv[1]), int(sys.argv[2])
    trainer = pl.Trainer(max_epochs=100,
                         accelerator="gpu",
                         devices=gpus,
                         num_nodes=nodes,
                         strategy=DDPStrategy(process_group_backend="nccl"))

    # train the model, automatically moving all data, models, and
    # optimizers to the appropriate devices
    trainer.fit(model=autoencoder, train_dataloaders=train_loader)

if __name__ == "__main__":
    main()

[user@cn3144]$ cat > << EOF
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --mem=16g
#SBATCH --gres=gpu:k80:4
#SBATCH --time 00:30:00

# Load the correct Python module
module load python/3.8

# Use srun to start up the specified number of tasks on each node for distribution
# pytorch-lightning autodetects Slurm environment and configures NCCL
srun python 2 4

[user@cn3144]$ module load python/3.8
[+] Loading python 3.8  ...

[user@cn3144]$ pip install --user pytorch-lightning
Successfully installed pyDeprecate-0.3.2 pytorch-lightning-1.7.3 tensorboard-2.10.0 torchmetrics-0.9.3 typing-extensions-4.3.0

# Pre-download MNIST dataset since simultaneous setup from the distributed training script has race conditions
[user@cn3144]$ python -c "import os; from torchvision.datasets import MNIST; from torchvision.transforms import ToTensor;  MNIST(os.getcwd(), download=True, transform=ToTensor())"

[user@cn3144]$ sbatch

[user@cn3144]$ tail slurm-46630586.out
Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Epoch 0:  38%|███▊      | 45/118 [00:02<00:^C, 17.05it/s, loss=0.032, v_num=4.66e+7]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46630586

[user@biowulf ~]$

Expect the output in the Slurm log file to look somewhat garbled, since the default PyTorch-Lightning progress output is designed for an interactive console.
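
The node and device geometry passed to the training script (2 nodes, 4 GPUs per node) has to agree with the #SBATCH directives, giving the 8 processes seen in the "MEMBER: n/8" log lines. The following sketch shows how that geometry can be checked and how a task can read its position from the standard Slurm environment variables. It is an illustration only, not how PyTorch-Lightning itself is implemented, and the function names are hypothetical.

```python
def check_geometry(nodes, tasks_per_node, gpus_per_node):
    """Return the world size, requiring one task per GPU on every node."""
    if tasks_per_node != gpus_per_node:
        raise ValueError("--ntasks-per-node must equal the GPUs per node")
    return nodes * tasks_per_node


def slurm_rank_info(environ):
    """Read this task's position in the job from Slurm's environment."""
    return {
        "world_size": int(environ["SLURM_NTASKS"]),
        "global_rank": int(environ["SLURM_PROCID"]),
        "local_rank": int(environ["SLURM_LOCALID"]),
        "node_rank": int(environ["SLURM_NODEID"]),
    }


# geometry from the batch script above: 2 nodes, 4 tasks and 4 GPUs per node
print(check_geometry(nodes=2, tasks_per_node=4, gpus_per_node=4))

# hypothetical environment of the second task on the second node
fake_env = {"SLURM_NTASKS": "8", "SLURM_PROCID": "5",
            "SLURM_LOCALID": "1", "SLURM_NODEID": "1"}
print(slurm_rank_info(fake_env))
```

Inside a real job step started by srun, os.environ can be passed in place of the hand-written dictionary.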