Data Loading for Distributed Training in PyTorch

This page provides a practical guide to managing Datasets and DataLoaders for distributed training in PyTorch. It addresses the challenges introduced in the main data loading page.

This page covers:

Datasets (predefined and custom)
Input data transformation tools (predefined and custom)
DataLoaders

The documentation concludes with a complete example of optimised data loading, as well as a practical implementation on Jean Zay using a Jupyter Notebook.

Preliminary remark

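This documentation does not cover IterableDataset objects, which handle datasets that are read as a stream rather than accessed by index (for example, when the size or structure of the dataset is not known in advance).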

Datasets

Predefined Datasets in PyTorch

PyTorch provides a set of predefined Datasets in the torchvision, torchaudio, and torchtext libraries. These libraries handle the creation of Dataset objects for the standard datasets listed in their official documentation.

Loading a dataset is done via the datasets module of the relevant library. For example, the ImageNet image dataset can be loaded using torchvision as follows:

import os
import torchvision

# load imagenet dataset stored in DSDIR
root = os.environ['DSDIR']+'/imagenet'
imagenet_dataset = torchvision.datasets.ImageNet(root=root)

Most of the time, it is possible to distinguish between training and validation data at loading time. For example, for the ImageNet dataset:

import os
import torchvision

# load imagenet dataset stored in DSDIR
root = os.environ['DSDIR']+'/imagenet'

## load data for training
imagenet_train_dataset = torchvision.datasets.ImageNet(root=root,
                                                       split='train')
## load data for validation
imagenet_val_dataset = torchvision.datasets.ImageNet(root=root,
                                                     split='val')

Each loading function then offers dataset-specific features (data quality, extracting a subset of the data, etc.). Please consult the official documentation for more details.

Remark

The torchvision library contains a generic loading class, torchvision.datasets.ImageFolder. It is suitable for any image dataset, provided it is stored in a specific directory layout (see the official documentation for more details).

Important

Some functions allow downloading datasets online using the download=True argument. We remind you that Jean Zay compute nodes do not have internet access, so such operations must be performed in advance from a login node or a pre/post-processing node. We also remind you that large public datasets are already available in Jean Zay's shared DSDIR space. This space can be enriched upon request to IDRIS support (assist@idris.fr).

Custom Datasets

It is possible to create custom Dataset classes by defining three characteristic methods:

__init__ initialises the variable containing the data to be processed
__len__ returns the length of the dataset
__getitem__ returns the data corresponding to a given index

For example:

from torch.utils.data import Dataset

class myDataset(Dataset):

    def __init__(self, data):
        # Initialise dataset from source dataset
        self.data = data

    def __len__(self):
        # Return length of the dataset
        return len(self.data)

    def __getitem__(self, idx):
        # Return one element of the dataset according to its index
        return self.data[idx]
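
As a usage sketch, a Dataset written this way can be measured and indexed like a Python sequence. The class is repeated so that the block runs on its own, and the tensor contents are purely illustrative:

```python
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Same three-method pattern as above, repeated to keep the block self-contained."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# illustrative data: 5 samples with 3 features each
samples = torch.arange(15, dtype=torch.float32).reshape(5, 3)
dataset = MyDataset(samples)

n_samples = len(dataset)   # calls __len__
first = dataset[0]         # calls __getitem__
```

Because the DataLoader only relies on these two calls, any storage backend (tensor, list, file index) can sit behind self.data.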

Transformations

Predefined transformations in PyTorch

The torchvision, torchtext, and torchaudio libraries offer a range of pre-implemented transformations, accessible via the transforms module of each library. These transformations are listed in the official documentation.

Transformation instructions are carried by the Dataset object. It is possible to combine different types of transformations using the transforms.Compose() function. For example, to resize all images in the ImageNet dataset:

import os
import torchvision

# define list of transformations to apply
data_transform = torchvision.transforms.Compose([torchvision.transforms.Resize((300,300)),
                                                 torchvision.transforms.ToTensor()])

# load imagenet dataset and apply transformations
root = os.environ['DSDIR']+'/imagenet'
imagenet_dataset = torchvision.datasets.ImageNet(root=root,
                                                 transform=data_transform)

Remark

The transforms.ToTensor() transformation converts a PIL image or NumPy array to a tensor.

To apply transformations to a custom Dataset, you need to modify it accordingly, for example as follows:

from torch.utils.data import Dataset

class myDataset(Dataset):

    def __init__(self, data, transform=None):
        # Initialise dataset from source dataset
        self.data = data
        self.transform = transform

    def __len__(self):
        # Return length of the dataset
        return len(self.data)

    def __getitem__(self, idx):
        # Return one element of the dataset according to its index
        x = self.data[idx]

        # apply transformation if requested
        if self.transform:
            x = self.transform(x)

        return x
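
The following sketch shows the effect of the transform argument on such a Dataset. The normalise function and the sample values are hypothetical and only serve the illustration. The transformation is applied when an element is accessed, and the source data is left untouched:

```python
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        # apply transformation if requested
        if self.transform:
            x = self.transform(x)
        return x


def normalise(x):
    # hypothetical transformation: rescale values from [0, 255] to [0, 1]
    return x / 255.0


raw = torch.tensor([[0.0, 255.0], [51.0, 102.0]])
dataset = MyDataset(raw, transform=normalise)

raw_sample = raw[1]        # source data, unchanged
transformed = dataset[1]   # transformation applied on access
```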

Custom Transformations

It is also possible to create custom transformations by defining callable objects and passing them directly to transforms.Compose(). For example, you can define addition (Add) and multiplication (Mult) transformations as follows:

from torchvision import transforms

# define Add transformation
class Add(object):
    def __init__(self, value):
        self.value = value

    def __call__(self, sample):
        # add a constant to the data
        return sample + self.value

# define Mult transformation
class Mult(object):
    def __init__(self, value):
        self.value = value

    def __call__(self, sample):
        # multiply the data by a constant
        return sample * self.value

# define list of transformations to apply
data_transform = transforms.Compose([Add(2),Mult(3)])

DataLoaders

A DataLoader object is a wrapper around a Dataset object that structures the data (creating batches), pre-processes it (shuffling, transformations), and distributes it to GPUs for the training phase.

The DataLoader is an object of the torch.utils.data.DataLoader class:

import torch

# define DataLoader for a given dataset
dataloader = torch.utils.data.DataLoader(dataset)
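
As an illustration, iterating over a DataLoader yields one batch per iteration. The sketch below uses a small synthetic TensorDataset and an arbitrary batch_size of 4. Neither value comes from this page:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# illustrative dataset: 10 samples of 3 features, with integer labels
features = torch.randn(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# batch_size=4 is an arbitrary value chosen for the illustration
dataloader = DataLoader(dataset, batch_size=4)

# each iteration returns a (features, labels) pair stacked along dimension 0
batch_shapes = [tuple(x.shape) for x, y in dataloader]
```

With 10 samples and batches of 4, the last batch only contains the 2 remaining samples.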

Optimising Data Loading Parameters

The configurable arguments for the DataLoader class are as follows:

DataLoader(dataset,
           shuffle=False,
           sampler=None, batch_sampler=None, collate_fn=None,
           batch_size=1, drop_last=False,
           num_workers=0, worker_init_fn=None, persistent_workers=False,
           pin_memory=False, timeout=0,
           prefetch_factor=2
          )

Random Processing of Input Data

The shuffle=True argument makes the DataLoader draw the samples in a random order at each epoch. Note: the shuffle and sampler arguments are mutually exclusive, so if you are using a distributed sampler, shuffling must be delegated to the sampler (see next point).
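
A minimal single-process sketch of shuffling follows. The seeded generator is optional and is only used here to make the order reproducible. Whatever the order, each sample is visited exactly once per epoch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))

# a seeded generator makes the shuffled order reproducible (optional)
generator = torch.Generator().manual_seed(0)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)

# collect the sample values seen during one epoch
seen = [int(v) for (batch,) in dataloader for v in batch]
```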

Data Distribution Across Multiple Processes for Distributed Training

The sampler argument allows you to specify the type of dataset sampling you want to implement. To distribute data across multiple processes, use the DistributedSampler sampler provided by the torch.utils.data.distributed module in PyTorch. For example:

import torch
import idr_torch # IDRIS package available in all PyTorch modules to interface with Slurm

# define distributed sampler
data_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                               shuffle=True,
                                                               num_replicas=idr_torch.size,
                                                               rank=idr_torch.rank)

This sampler takes as arguments the shuffle flag, the number of available processes num_replicas, and the rank of the current process rank. The shuffling step is delegated to the sampler so it can be processed in parallel. The number of processes and the rank are determined from the Slurm environment in which the training script was launched. Here, we use the idr_torch library to retrieve this information. This library is developed by IDRIS and is available in all PyTorch modules on Jean Zay.
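
The way the sampler splits the data can be sketched without launching a distributed job, because num_replicas and rank are passed explicitly. In the example below, world_size is a hypothetical stand-in for idr_torch.size, and one sampler per rank is built in the same process only to compare them. The call to set_epoch() reseeds the shuffle. Without it, the same ordering is reused at every epoch:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))
world_size = 2  # hypothetical number of processes, normally idr_torch.size

# one sampler per rank, built here only for comparison; a real training
# process creates a single sampler with its own rank
samplers = [
    DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    for rank in range(world_size)
]

shards_per_epoch = []
for epoch in range(2):
    shards = []
    for sampler in samplers:
        # reseed the shuffle so that the order changes between epochs
        sampler.set_epoch(epoch)
        shards.append(list(sampler))  # dataset indices assigned to this rank
    shards_per_epoch.append(shards)
```

At each epoch, the index lists of the two ranks are disjoint and together cover the whole dataset.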

Remark

The DistributedSampler sampler is suitable for the torch.nn.parallel.DistributedDataParallel data parallelism strategy, which we document on this page.

Optimising Resource Usage During Training

The batch size is defined by the batch_size argument. An optimal batch size ensures efficient use of compute resources, i.e., maximising GPU memory usage and evenly distributing the workload across GPUs.

It may happen that the amount of input data is not a multiple of the requested batch size. In this case, to prevent the DataLoader from generating an 'incomplete' batch with the last extracted data and thus avoid an imbalance in the workload between GPUs, you can instruct it to ignore this last batch using the drop_last=True argument. However, this may represent a loss of information that must be estimated in advance.
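
The information loss caused by drop_last=True can be estimated with simple arithmetic. The helper below is a hypothetical illustration and not part of PyTorch, and the dataset and batch sizes are arbitrary:

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches produced in one epoch (hypothetical helper)."""
    if drop_last:
        return n_samples // batch_size          # incomplete batch discarded
    return math.ceil(n_samples / batch_size)    # incomplete batch kept


n_samples, batch_size = 1000, 128   # illustrative values

kept = batches_per_epoch(n_samples, batch_size, drop_last=False)
dropped = batches_per_epoch(n_samples, batch_size, drop_last=True)

# samples that are never seen during an epoch when drop_last=True
lost_samples = n_samples - dropped * batch_size
lost_fraction = lost_samples / n_samples
```

In distributed training, the same reasoning applies to the share of the dataset seen by each process.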

Overlapping Data Transfers and Computation

It is possible to optimise batch transfers from CPU to GPU by overlapping data transfers with computation.

One optimisation is to preload the next batches to be processed during training. The number of batches preloaded by each worker process is controlled by the prefetch_factor argument, which only applies when num_workers is greater than 0. With worker processes enabled, the default value is 2, which is suitable in most cases.

Another optimisation is to have the DataLoader store the batches in pinned (page-locked) memory on the CPU with pin_memory=True. This strategy avoids an extra copy step during CPU-to-GPU transfers. It also enables the use of the asynchronous mechanism non_blocking=True when calling transfer functions such as .to() or .cuda().
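
A minimal sketch of the pinned memory and asynchronous transfer combination is given below. It assumes a synthetic dataset and falls back to the CPU when no GPU is available, in which case pinned memory is not requested and the transfer is a no-op:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

dataset = TensorDataset(torch.randn(16, 3), torch.arange(16))

# pinned memory is only requested when a GPU is present
dataloader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

device_types = []
for x, y in dataloader:
    # from pinned memory, these copies can overlap with GPU computation
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    device_types.append(x.device.type)
```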

Speeding Up Data Preprocessing (Transformations)

Data preprocessing (transformations) is a CPU-intensive step. To speed it up, you can parallelise the operations across multiple CPUs using the DataLoader's multiprocessing functionality. The number of processes involved is specified by the num_workers argument.

The persistent_workers=True argument keeps the worker processes alive throughout training, thus avoiding their reinitialisation at each epoch. This time saving, however, can imply significant RAM usage, especially if multiple DataLoaders are used.
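
One possible way to choose these values is to derive them from the Slurm allocation. The helper below is a hypothetical sketch. It assumes the job was submitted with --cpus-per-task, so that Slurm exports SLURM_CPUS_PER_TASK, and it assigns every allocated core to a worker, which is a choice to be tuned. The persistent_workers and prefetch_factor options are only set when worker processes exist, since the DataLoader rejects them when num_workers is 0:

```python
import os


def worker_settings(default_workers=0):
    """Derive DataLoader keyword arguments from the Slurm allocation.

    Hypothetical helper, not part of PyTorch or idr_torch.
    """
    num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", default_workers))
    settings = {"num_workers": num_workers}
    if num_workers > 0:
        # these two options are only accepted when worker processes exist
        settings["persistent_workers"] = True
        settings["prefetch_factor"] = 2
    return settings


# usage: DataLoader(dataset, batch_size=..., **worker_settings())
settings = worker_settings()
```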

Complete Example of Optimised Data Loading

Here is a complete example of optimised ImageNet dataset loading for distributed training on Jean Zay:

import os
import torch
import torchvision
import idr_torch # IDRIS package available in all PyTorch modules to interface with Slurm

# define list of transformations to apply
data_transform = torchvision.transforms.Compose([torchvision.transforms.Resize((300,300)),
                                                 torchvision.transforms.ToTensor()])

# load imagenet dataset and apply transformations
root = os.environ['DSDIR']+'/imagenet'
imagenet_dataset = torchvision.datasets.ImageNet(root=root,
                                                 transform=data_transform)

# define distributed sampler
data_sampler = torch.utils.data.distributed.DistributedSampler(imagenet_dataset,
                                                               shuffle=True,
                                                               num_replicas=idr_torch.size,
                                                               rank=idr_torch.rank
                                                              )

# define DataLoader
batch_size = 128                       # adjust batch size according to the amount of GPU memory
drop_last = True                       # set to False if it represents important information loss
num_workers = 4                        # adjust number of CPU workers per process
persistent_workers = True              # set to False if CPU RAM must be released
pin_memory = True                      # optimise CPU to GPU transfers
non_blocking = True                    # activate asynchronism to speed up CPU/GPU transfers
prefetch_factor = 2                    # adjust number of batches to preload

dataloader = torch.utils.data.DataLoader(imagenet_dataset,
                                         sampler=data_sampler,
                                         batch_size=batch_size,
                                         drop_last=drop_last,
                                         num_workers=num_workers,
                                         persistent_workers=persistent_workers,
                                         pin_memory=pin_memory,
                                         prefetch_factor=prefetch_factor
                                        )

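# target device (assumes the GPU of this process was selected beforehand)
gpu = torch.device('cuda')

# loop over batches
for i, (images, labels) in enumerate(dataloader):
    images = images.to(gpu, non_blocking=non_blocking)
    labels = labels.to(gpu, non_blocking=non_blocking)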

Practical Implementation on Jean Zay

To put the above documentation into practice and see the benefits of each feature offered by the PyTorch DataLoader, you can retrieve the notebook_data_preprocessing_pytorch-eng.ipynb notebook from DSDIR. For example, to copy it to your WORK directory:

cp $DSDIR/examples_IA/Torch_parallel/notebook_data_preprocessing_pytorch-eng.ipynb $WORK

You can then run the Notebook from a Jean Zay login node (see our JupyterHub access documentation for more information on using Notebooks on Jean Zay).

Official Documentation

https://pytorch.org/docs/stable/data.html
