Data Parallelism with PyTorch

This page explains how to distribute an artificial neural network model implemented in PyTorch code using the data parallelism method.

Here, we document the built-in DistributedDataParallel solution, which is the most efficient according to the PyTorch documentation. This is a multi-process parallelism that works equally well in single-node and multi-node configurations.

A practical example is provided as a Notebook at the bottom of the page to give you access to a functional implementation of the explanations below.

Multi-Process Configuration with Slurm

For multi-node configurations, it is necessary to use the multi-processing managed by Slurm (execution via the Slurm command srun).

For single-node configurations, it is possible to use, as indicated in the PyTorch documentation, torch.multiprocessing.spawn. However, it is possible and more practical to use Slurm multi-processing in all cases, whether for single-node or multi-node. This is what we document on this page.

In Slurm, when a script is launched with the srun command, it is automatically distributed across all predefined tasks. For example, if we reserve 4 nodes, requesting 3 GPUs per node, we obtain:

4 nodes, indexed from 0 to 3
3 GPUs/node indexed from 0 to 2 on each node
4 x 3 = 12 processes in total, enabling the execution of 12 tasks with ranks from 0 to 11

Multi-process in Slurm
Illustration of a Slurm reservation of 4 nodes and 3 GPUs per node, i.e., 12 processes.
Inter-node collective communications are managed by the NCCL library.

Here are two example Slurm scripts for Jean-Zay:

For a reservation of N quadri-GPU V100 nodes via the default GPU partition:

#!/bin/bash
#SBATCH --job-name=torch-multi-gpu
#SBATCH --nodes=N              # total number of nodes (N to be defined)
#SBATCH --ntasks-per-node=4    # number of tasks per node (here 4 tasks, i.e., 1 task per GPU)
#SBATCH --gres=gpu:4           # number of GPUs reserved per node (here 4, i.e., all GPUs)
#SBATCH --cpus-per-task=10     # number of cores per task (thus 4x10 = 40 cores, i.e., all cores)
#SBATCH --hint=nomultithread
#SBATCH --time=20:00:00
#SBATCH --output=torch-multi-gpu%j.out
##SBATCH --account=abc@v100

module load pytorch-gpu/py3/2.5.0

srun python myscript.py

For a reservation of N octo-GPU A100 nodes:

#!/bin/bash
#SBATCH --job-name=torch-multi-gpu
#SBATCH --nodes=N            # total number of nodes (N to be defined)
#SBATCH --ntasks-per-node=8  # number of tasks per node (here 8 tasks, i.e., 1 task per GPU)
#SBATCH --gres=gpu:8         # number of GPUs reserved per node (here 8, i.e., all GPUs)
#SBATCH --cpus-per-task=8    # number of cores per task (thus 8x8 = 64 cores, i.e., all cores)
#SBATCH --hint=nomultithread
#SBATCH --time=20:00:00
#SBATCH --output=torch-multi-gpu%j.out
#SBATCH -C a100
##SBATCH --account=abc@a100

module load arch/a100
module load pytorch-gpu/py3/2.5.0

srun python myscript.py

Note

In both examples, the nodes are reserved exclusively. In particular, this gives us access to all the memory on each node.

Implementation of the DistributedDataParallel Solution

To implement the DistributedDataParallel solution in PyTorch, you need to:

Define the environment variables related to the master node to configure inter-node communications (required even in single-node) :
- MASTER_ADDR, the IP address or hostname of the node corresponding to task 0 (the first in the node list). If you are in a single-node configuration, the value localhost is sufficient.
- MASTER_PORT, a random port number. To avoid conflicts and by convention, use a port number between 10001 and 20000 (for example, 12345).
idr_torch
On Jean Zay, a library idr_torch has been developed by IDRIS to automatically define the MASTER_ADDR and MASTER_PORT variables within a running Slurm job. It is available in all our PyTorch environments. To use it, simply import it in your script:
import idr_torch
Once idr_torch is imported, you also have access to the internal variables:
- idr_torch.rank (global rank of the process),
- idr_torch.local_rank (local rank of the process within its node),
- idr_torch.size (number of processes),
- idr_torch.cpus_per_task (number of CPU cores allocated per process).
info
idr_torch is available on the IDRIS GitHub.
Initialise the distributed environment (i.e., the communicator) by calling the init_process_group function:
```
import idr_torch
import torch.distributed as dist

dist.init_process_group(backend='nccl',
                        init_method='env://',
                        world_size=idr_torch.size,
                        rank=idr_torch.rank)
```
- The backend indicates the library that will be used for communications. The possible backends are NCCL, GLOO, and MPI. NCCL is the communication library implemented by NVIDIA and is therefore strongly recommended on Jean Zay;
- init_method='env://' tells PyTorch to look for information about the master node in the environment (environment variables MASTER_ADDR and MASTER_PORT, see 1.);
- world_size corresponds to the number of GPUs allocated for execution;
- rank corresponds to the global rank of the current process.
Transfer the model to the memory of the GPU associated with the current process, identified by its local rank (local_rank):
```
torch.cuda.set_device(idr_torch.local_rank)
gpu = torch.device("cuda")
model = model.to(gpu)
```

Retrieve in a new variable ddp_model the version of the model associated with the current process:

from torch.nn.parallel import DistributedDataParallel as DDP
[...]
ddp_model = DDP(model, device_ids=[idr_torch.local_rank])

Use a distributed sampler to split the batches into mini-batches across the different GPUs at each iteration. PyTorch provides the DistributedSampler class for this purpose, which takes as arguments the input dataset, the number of available GPUs, the global rank of the current process, and the shuffling order:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                num_replicas=idr_torch.size,
                                                                rank=idr_torch.rank,
                                                                shuffle=True)

important

The DistributedSampler class handles shuffling at the beginning of each epoch. To ensure the random seed varies correctly, you must inform the sampler of epoch changes:

for epoch in range(n_epochs):
    train_sampler.set_epoch(epoch)

Adapt the DataLoader to use the distributed sampler and a batch size corresponding to the mini-batch:

batch_size_per_gpu = global_batch_size // idr_torch.size
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size_per_gpu,
                                           shuffle=False, # shuffling delegated to sampler
                                           sampler=train_sampler)

Note

In this case, the shuffling step is delegated to the distributed sampler.

Send the mini-batches and corresponding labels to the GPUs memory:

for (images, labels) in train_loader:
    images = images.to(gpu)
    labels = labels.to(gpu)

Saving and Loading Checkpoints

It is possible to implement checkpoints during distributed training on GPUs.

Saving

The model is replicated on each GPU, but saving checkpoints can be performed by a single GPU to limit write operations. By convention, the GPU with rank 0 is used:

if idr_torch.rank == 0:
    torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

Thus, the checkpoint will contain the information from the GPU with rank 0, which is then saved in a format specific to distributed models.

Loading

At the start of training, loading a checkpoint is first performed by the CPU, then the information is sent to the GPU.

By default and by convention, this transfer is done to the memory location that was used during the save step: in our example, only GPU 0 would load the model into memory.

To ensure the information is communicated to all GPUs, you must use the map_location argument of the torch.load loading function to redirect memory storage.

In the example below, the map_location argument orders a redirection of memory storage to the local rank GPU. Since this function is called by all GPUs, each GPU correctly loads the checkpoint into its own memory:

map_location = {'cuda:%d' % 0: 'cuda:%d' % idr_torch.local_rank} # remap storage from GPU 0 to local GPU
ddp_model.load_state_dict(torch.load(CHECKPOINT_PATH), map_location=map_location) # load checkpoint

Note

If, as in the PyTorch tutorial, a checkpoint is loaded immediately after a save, it is necessary to call the dist.barrier() method before loading. Calling dist.barrier() synchronises the GPUs, ensuring that the checkpoint save by the rank 0 GPU is completed before the other GPUs attempt to load it.

Distributed Validation

The validation step, executed after each epoch or after a fixed number of training iterations, can be distributed across all GPUs involved in model training. When data parallelism is used and the validation dataset is substantial, this distributed validation solution across GPUs appears to be the most efficient and fastest.

Here, the challenge is to compute metrics (loss, accuracy, etc.) per batch and per GPU, then weight and average them across the entire validation dataset.

To do this, you need to:

Load the validation data in the same way as the training data, but without random transformations such as data augmentation or shuffling (see the documentation on data loading for distributed training in PyTorch):

# validation dataset loading (ImageNet for example)
val_dataset = torchvision.datasets.ImageNet(root=root, split='val', transform=val_transform)

# define distributed sampler for validation
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset,
                                                              num_replicas=idr_torch.size,
                                                              rank=idr_torch.rank,
                                                              shuffle=False)

# define dataloader for validation
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                         batch_size=batch_size_per_gpu,
                                         shuffle=False,
                                         sampler=val_sampler)

Switch from "training" mode to "validation" mode to disable certain training-specific features that are costly and unnecessary here:
- model.eval() to switch the model to "validation" mode and disable dropout and batchnorm management, etc.
- with torch.no_grad() to skip gradient computation
- optionally, with autocast() to use AMP (mixed precision)
Evaluate the model and compute the metric per batch in the usual way (here, we use the example of loss calculation; the same applies to other metrics):
- outputs = model(val_images) followed by loss = criterion(outputs, val_labels)
Weight and accumulate the metric per GPU:
- val_loss += loss * val_images.size(0) / N where val_images.size(0) is the batch size and N is the total size of the validation dataset. Since batches may not all be the same size (the last batch is sometimes smaller), it is preferable to use the value val_images.size(0) here.
Sum the weighted averages of the metric across all GPUs:
- dist.all_reduce(val_loss, op=dist.ReduceOp.SUM) to sum the metric values computed by each GPU and communicate the result to all GPUs. This operation involves inter-GPU communications.

Example after loading validation data

model.eval()                                          # switch into validation mode
val_loss = torch.Tensor([0.]).to(gpu)                 # initialise val_loss value
N = len(val_dataset)                                  # get validation dataset length

for val_images, val_labels in val_loader:             # loop over validation batches
    val_images = val_images.to(gpu, non_blocking=True) # transfer images and labels to GPUs
    val_labels = val_labels.to(gpu, non_blocking=True)

    with torch.no_grad():                             # deactivate gradient computation
        with autocast():                              # activate AMP
            outputs = model(val_images)               # evaluate model
            loss = criterion(outputs, val_labels)     # compute loss

    val_loss += loss * val_images.size(0) / N         # cumulate weighted mean per GPU

dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)       # sum weighted means and broadcast value to each GPU

model.train()                                         # switch again into training mode

SyncBatchNorm for Data Parallelism

BatchNorm layers allow for faster model training by having it converge quicker towards a better optimum. See the reference paper on layer normalisation

BatchNormalization layers apply a transform which maintains previous layer outputs mean at 0 and standard deviation at 1. In other words, they compute normalisation factors in order to normalise the output of every layer (or only some layers) of the model. Those factors are learned during training: at every step (the pass of a single batch), the BatchNormalization layer also learns the mean and standard deviation of the batch (for each dimension). The combination of these factors allows the mean and standard deviation to stay close to 0 and 1, respectively.

info

Batch Normalization behaves differently during training, validation, and inference. Therefore, it is important to indicate to the model its current state (training or validation).

During training, the layer normalises its outputs using the mean and standard deviation of the input batch. More precisely, the layer returns (batch - mean(batch)) / (var(batch) + epsilon) * gamma + beta, where:

epsilon, a small constant to avoid division by zero,
gamma, a learned (trained) parameter updated via gradient calculation during backpropagation, initialised to 1,
beta, a learned (trained) parameter updated via gradient calculation during backpropagation, initialised to 0.

During inference or validation, the layer normalises its outputs using the trained gamma and beta, along with the moving_mean and moving_var factors: (batch - moving_mean) / (moving_var + epsilon) * gamma + beta.

moving_mean and moving_var are non-trainable factors, but they are updated at each batch iteration during training according to the following method:

moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
moving_var = moving_var * momentum + var(batch) * (1 - momentum)

SyncBatchNorm Layers

In data parallelism, a replica of the model is loaded onto each device (GPU). These replicas are intended to be completely equivalent across all devices. However, with Batch Normalization, and because each parallelised device processes different mini-batches, the normalisation statistics are likely to diverge, particularly the moving_mean and moving_var variables.

If the mini-batch sizes per GPU are sufficiently large, this divergence may be considered acceptable. However, it is recommended and sometimes necessary to replace BatchNorm layers with SyncBatchNorm layers.

SyncBatchNorm layers enable synchronisation across devices (during data parallelism) for the computation of normalisation statistics.

SyncBatchNorm in PyTorch

A SyncBatchNorm layer is defined as follows in the model architecture:

syncBN_layer = torch.nn.SyncBatchNorm(num_features, eps=1e-05, momentum=0.1, affine=True,
                            track_running_stats=True, process_group=None)

However, using the convert_sync_batchnorm method, it is also possible to convert an existing model by transforming all BatchNorm layers into SyncBatchNorm layers.

sync_bn_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# only single gpu per process is currently supported
ddp_sync_bn_model = torch.nn.parallel.DistributedDataParallel(
                        sync_bn_model,
                        device_ids=[args.local_rank],
                        output_device=args.local_rank)

warning

The convert_sync_batchnorm method must be applied before converting the model to a DDP model.

Application Example

Multi-GPU, Multi-Node Execution with `DistributedDataParallel`

An example can be found in $DSDIR/examples_IA/Torch_parallel/Example_DataParallelism_Pytorch-eng.ipynb on Jean-Zay; it uses the MNIST database and a simple dense network. The example is a Notebook that allows you to create an execution script. It should be copied to your personal space (ideally in your $WORK).

cp $DSDIR/examples_IA/Torch_parallel/Example_DataParallelism_PyTorch-eng.ipynb $WORK

You can then run the Notebook from a Jean Zay frontend machine by selecting a PyTorch kernel (see our documentation on accessing JupyterHub for more information on using Notebooks on Jean Zay).

Documentation and Sources

PyTorch Guide

Multi-Process Configuration with Slurm​

Implementation of the DistributedDataParallel Solution​

Saving and Loading Checkpoints​

Saving​

Loading​

Distributed Validation​

Example after loading validation data​

SyncBatchNorm for Data Parallelism​

SyncBatchNorm Layers​

SyncBatchNorm in PyTorch​

Application Example​

Multi-GPU, Multi-Node Execution with DistributedDataParallel​

Documentation and Sources​

Multi-Process Configuration with Slurm

Implementation of the DistributedDataParallel Solution

Saving and Loading Checkpoints

Saving

Loading

Distributed Validation

Example after loading validation data

SyncBatchNorm for Data Parallelism

SyncBatchNorm Layers

SyncBatchNorm in PyTorch

Application Example

Multi-GPU, Multi-Node Execution with `DistributedDataParallel`

Documentation and Sources