Jean Zay: Database loading for the distributed learning of a model

This documentation describes how datasets and dataloaders are managed in PyTorch and TensorFlow, covering the entire chain that delivers data to the model batch by batch during the training of a neural network.

For this documentation, we are in a context adapted to Jean Zay:

  • With large models and large datasets (images, 3D images, videos, text, audio, etc.): data loading is done on the CPUs, and the data are then transferred batch by batch to the GPUs for model training or inference.
  • With data parallelism: if you do not use this type of parallelism, you can ignore the distribution step.

The sequence of input data loading and pre-processing is schematised below:

 [Figure: Complete data loading chain for the distributed training of a model]

1/ The Data Set is the complete set of data. It is generally available as a structured set of files.

2/ The Structure Discovery step identifies the structure of the data set. At this point, the data set has not yet been loaded into memory. What is recovered here is the list of file paths or, more generally, the list of data pointers together with their labels, which are needed to compute the loss function in supervised learning.
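As an illustration of structure discovery, here is a minimal sketch in plain Python. It assumes a hypothetical directory layout `root/<class_name>/<file>` in which the sub-directory name is the label; the function name `discover_structure` and the file names are invented for the example. No file content is read: only paths and labels are collected.

```python
import tempfile
from pathlib import Path


def discover_structure(root):
    """Return a sorted list of (file_path, label) pairs without reading any file.

    Assumes a hypothetical layout root/<class_name>/<file>, where the
    sub-directory name serves as the label.
    """
    samples = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(p for p in class_dir.iterdir() if p.is_file()):
            samples.append((str(file_path), class_dir.name))
    return samples


# Build a tiny throwaway data set (empty files) to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("cat", "dog"):
        (Path(tmp) / label).mkdir()
        for i in range(2):
            (Path(tmp) / label / f"img_{i}.jpg").touch()
    samples = discover_structure(tmp)

labels = [label for _, label in samples]
print(len(samples), labels)
```

The resulting list of pointers is small compared with the data itself, which is what makes the following steps (shuffle, distribution, batch) cheap to run before anything is loaded.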

3/ Shuffle is the step of randomly permuting the data set. It is necessary for training and is performed at the beginning of every epoch. The data pointers are stored in a memory buffer before being permuted. Ideally, the buffer should be large enough to hold all the data pointers. However, memory constraints do not allow this most of the time. A smaller buffer, which can contain ~1,000 or ~10,000 pointers, is therefore considered acceptable and without impact on the quality of the training. Comment: The shuffle is not performed for inference or validation.
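The buffered shuffle described above can be sketched as follows in plain Python. This is an illustration of the idea, not the implementation used by any framework (in TensorFlow, the `buffer_size` argument of `tf.data.Dataset.shuffle` plays a comparable role). The function name `buffered_shuffle` and the seeds are assumptions made for the example.

```python
import random


def buffered_shuffle(pointers, buffer_size, seed=None):
    """Yield the items of `pointers` in a locally shuffled order.

    At most `buffer_size` pointers are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for pointer in pointers:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(pointer)
            continue
        # Buffer is full: emit a random element and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = pointer
    # Input exhausted: emit what remains in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


pointers = range(20)
# A different seed at each epoch gives a different permutation.
epoch_0 = list(buffered_shuffle(pointers, buffer_size=5, seed=0))
epoch_1 = list(buffered_shuffle(pointers, buffer_size=5, seed=1))
print(epoch_0)
```

With a buffer smaller than the data set, an item can only move a limited distance from its original position, which is why a buffer of a few thousand pointers is a compromise rather than a full permutation.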

4/ Distribution is the data parallelisation step, in which the data are distributed across the different GPUs. Ideally, the distribution is done after the shuffle. In practice, it is generally done before the shuffle (or after the batch) to avoid inter-process communications. This is considered acceptable and without apparent impact on the quality of the training.
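A minimal sketch of distribution done before the shuffle, in plain Python: each process takes every `world_size`-th pointer starting at its own `rank`, then shuffles its shard locally, so no inter-process communication is needed. The names `shard_for_rank` and `shuffle_shard`, the seed formula, and the choice to drop the uneven tail are assumptions of this example; framework tools such as PyTorch's `DistributedSampler` or `tf.data.Dataset.shard` handle these details in their own way.

```python
import random


def shard_for_rank(pointers, rank, world_size):
    """Return the pointers assigned to process `rank` among `world_size` processes.

    The tail that cannot be split evenly is dropped so that every process
    gets the same number of samples (an assumption of this sketch).
    """
    usable = len(pointers) - len(pointers) % world_size
    return pointers[rank:usable:world_size]


def shuffle_shard(shard, epoch, rank):
    """Shuffle one shard locally, with a seed that changes at every epoch."""
    rng = random.Random(1000 * epoch + rank)
    shuffled = list(shard)
    rng.shuffle(shuffled)
    return shuffled


pointers = list(range(10))
world_size = 4  # hypothetical number of GPUs (one process per GPU)
shards = [shard_for_rank(pointers, rank, world_size) for rank in range(world_size)]
epoch_0 = [shuffle_shard(shard, epoch=0, rank=rank) for rank, shard in enumerate(shards)]
print(shards)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each process only ever sees its own shard, so the permutation is local to a GPU rather than global over the data set.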

5/ Batch is the sampling step, in which the data set is cut into batches of a fixed size. An iterative loop is then run over all these batches. Comment: With data parallelism, it is necessary to distinguish between the “batch size”, defined during the pre-processing of the input data, and the “mini-batch size”, which corresponds to the portion of data sent to each GPU during the training.
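To make the distinction between the two sizes concrete, here is a small sketch in plain Python. It assumes that each batch is split evenly between the GPUs, so that batch size = mini-batch size × number of GPUs; the function names and the values (8 and 4) are chosen only for the example.

```python
def make_batches(pointers, batch_size):
    """Cut the data set into consecutive batches of `batch_size` items.

    A final incomplete batch is dropped (a choice made for this sketch).
    """
    n_full = len(pointers) // batch_size
    return [pointers[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


def split_into_mini_batches(batch, n_gpus):
    """Split one batch into `n_gpus` mini-batches of equal size."""
    mini_batch_size = len(batch) // n_gpus
    return [batch[g * mini_batch_size:(g + 1) * mini_batch_size] for g in range(n_gpus)]


pointers = list(range(32))
batch_size, n_gpus = 8, 4  # hypothetical values

batches = make_batches(pointers, batch_size)
mini_batches = split_into_mini_batches(batches[0], n_gpus)
print(len(batches), mini_batches)  # 4 [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Here one training iteration consumes a batch of 8 samples in total, while each of the 4 GPUs processes a mini-batch of 2.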

6/ Load and Transform is the step in which the batch is loaded into memory, decoded and transformed (pre-processing). This is the step which requires the most CPU computing time and is, therefore, an obvious bottleneck. Certain transformation operations are deterministic, that is, they generate the same transformed data in every epoch, whereas others are linked to data augmentation and are randomized. Deterministic operations can be performed before the loop over the epochs in order to avoid redoing the computations at each iteration. For this, you can store the transformed data in a cache (in this case, the storage is done during the first epoch, which will be longer than the following ones). You can also create a custom input file format, such as TFRecords, before the training. Randomized operations must be performed at each iteration. One possible optimization is to transfer these computations onto the GPUs. Solutions such as DALI propose doing this for the data augmentation of images. GPUs are much faster than CPUs for this type of processing.
Comment: Our experience on Jean Zay has shown us that, by optimizing the multiprocessing and the prefetch (see the following point), the CPUs are rarely a bottleneck for the data pre-processing. To our knowledge, therefore, transferring operations to the GPUs is not imperative on Jean Zay.
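The split between cached deterministic operations and per-iteration randomized operations can be sketched in plain Python as follows. The two transform functions are hypothetical stand-ins (a real pipeline would decode and resize an image, then apply an augmentation), and the in-memory dictionary is only one possible form of cache.

```python
import random

decode_calls = 0  # counts how often the expensive deterministic step really runs


def deterministic_transform(pointer):
    """Hypothetical stand-in for decoding/resizing: same output on every call."""
    global decode_calls
    decode_calls += 1
    return [pointer, pointer + 1, pointer + 2]


def random_augmentation(sample, rng):
    """Hypothetical stand-in for data augmentation: random flip of the sample."""
    return sample[::-1] if rng.random() < 0.5 else sample


cache = {}


def load_and_transform(pointer, rng):
    """Run the deterministic part once per sample, the random part every time."""
    if pointer not in cache:
        # Only reached during the first epoch, which is therefore slower.
        cache[pointer] = deterministic_transform(pointer)
    return random_augmentation(cache[pointer], rng)


rng = random.Random(0)
pointers = [0, 10, 20]
for epoch in range(3):
    epoch_samples = [load_and_transform(p, rng) for p in pointers]

print(decode_calls)  # 3: one deterministic transform per sample, not per epoch
```

Such a cache is only practical when the transformed data fit in the available memory, or on sufficiently fast storage.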

7/ The Prefetch creates a FIFO (First In, First Out) queue of batches waiting to be loaded onto the GPUs. This functionality makes it possible to overlap transfers and computations, and greatly accelerates the data loading chain when the GPUs are more heavily loaded than the CPUs. When using the GPUs on Jean Zay, the prefetch is strongly advised.
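The principle of the prefetch can be sketched with a bounded FIFO queue filled by a background thread, in plain Python. This is a conceptual illustration under simplifying assumptions (a single producer thread, no error propagation), not what the frameworks do internally; in practice you would rely on `tf.data.Dataset.prefetch` in TensorFlow or on the `num_workers` and `prefetch_factor` arguments of the PyTorch `DataLoader`. The names `prefetch` and `slow_loader` are invented for the example.

```python
import queue
import threading
import time

_END = object()  # sentinel marking the end of the stream


def prefetch(batch_iterable, buffer_size=2):
    """Yield batches while a background thread prepares the following ones."""
    fifo = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batch_iterable:
                fifo.put(batch)  # blocks while the queue is full
        finally:
            fifo.put(_END)  # always signal the end to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = fifo.get()  # blocks until a batch is ready
        if batch is _END:
            return
        yield batch


def slow_loader(n_batches):
    """Hypothetical loader: each batch takes some time to prepare."""
    for i in range(n_batches):
        time.sleep(0.01)
        yield [i, i]


consumed = []
for batch in prefetch(slow_loader(5), buffer_size=2):
    # While this batch is being "computed" here, the next one is already loading.
    consumed.append(batch)

print(consumed)
```

The queue preserves the order of the batches, and its bounded size limits how much memory the waiting batches can occupy.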