SyncBN layer for data parallelism

BatchNorm layers

The BatchNorm layers enable a significant acceleration of model training by helping the model converge more rapidly towards a better optimum. See the paper on Batch Normalization (Ioffe & Szegedy, 2015).

The BatchNormalization layer applies a transformation that maintains the mean of its outputs close to 0 and their standard deviation close to 1. In other words, the layer computes the factors needed to normalize the outputs of each layer (or of certain layers) of the neural network. Some of these factors are learned during the training step. At each batch iteration, the layer also computes the batch mean and standard deviation (per dimension). The combination of all these factors keeps the output mean near 0 and the output standard deviation close to 1.

Batch Normalization behaves differently during training than during inference or validation. It is important, therefore, to indicate to the model its state (training or evaluation).
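
In PyTorch, this is done by switching the model between training and evaluation modes. A minimal sketch (the model below is a placeholder):

import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.BatchNorm1d(4))

model.train()   # training mode: normalize with batch statistics, update moving statistics
# ... training loop ...

model.eval()    # evaluation mode: normalize with the stored moving_mean / moving_var
with torch.no_grad():
    predictions = model(torch.randn(8, 16))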

During training, the layer normalizes its outputs by using the mean and standard deviation of the current batch. More precisely, the layer returns (batch - mean(batch)) / sqrt(var(batch) + epsilon) * gamma + beta with:

  • epsilon, a small constant to avoid division by 0
  • gamma, a learned scale factor (trained by gradient calculation during backpropagation), initialized at 1
  • beta, a learned offset factor (trained by gradient calculation during backpropagation), initialized at 0
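
This formula can be checked against PyTorch's implementation; a minimal sketch (note that the normalization uses the biased batch variance):

import torch

torch.manual_seed(0)
x = torch.randn(8, 4)            # batch of 8 samples with 4 features
bn = torch.nn.BatchNorm1d(4)     # gamma initialized at 1, beta at 0
bn.train()

out = bn(x)

# Manual computation of the training-mode formula above
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
manual = (x - mean) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(out, manual, atol=1e-6))  # True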

During inference or validation, the layer normalizes its outputs by using the moving_mean and moving_var factors in addition to the trained gammas and betas: (batch - moving_mean) / sqrt(moving_var + epsilon) * gamma + beta.
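
The inference-mode formula can be verified in the same way; a minimal sketch:

import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(4)
bn(torch.randn(8, 4))            # one training step to populate the moving statistics
bn.eval()

x = torch.randn(8, 4)
out = bn(x)

# Manual computation of the inference-mode formula above
manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(out, manual, atol=1e-6))  # True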

moving_mean and moving_var are not trained factors; they are updated at each batch iteration during the training according to the following rule:

  • moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
  • moving_var = moving_var * momentum + var(batch) * (1 - momentum)
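
Note that frameworks differ in how they define momentum: the formulas above follow the Keras convention (momentum close to 1), whereas PyTorch weights the new batch statistic by momentum (default 0.1), i.e. moving_mean = (1 - momentum) * moving_mean + momentum * mean(batch). A minimal check of the PyTorch convention:

import torch

bn = torch.nn.BatchNorm1d(4, momentum=0.1)   # training mode by default
x = torch.randn(8, 4)
old_mean = bn.running_mean.clone()
bn(x)

# PyTorch weights the *new* batch statistic by momentum
expected = (1 - 0.1) * old_mean + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # True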

SyncBatchNorm layers

During data parallelism, a replica of the model is loaded on each device (GPU). These replicas are supposed to remain strictly equivalent across devices. However, with BatchNormalization, since a different mini-batch passes through each parallelized GPU, the factors are likely to diverge, notably the moving_mean and moving_var variables.
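
This divergence can be illustrated on a single machine by feeding two copies of the same BatchNorm layer with different mini-batches; a minimal sketch simulating two replicas on CPU:

import copy
import torch

bn = torch.nn.BatchNorm1d(4)
replica_a, replica_b = copy.deepcopy(bn), copy.deepcopy(bn)

# Each "GPU" sees a different mini-batch
replica_a(torch.randn(8, 4))
replica_b(torch.randn(8, 4))

# The moving statistics of the two replicas have diverged
print(torch.allclose(replica_a.running_mean, replica_b.running_mean))  # False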

If the mini-batch size per GPU is large enough, this divergence can be acceptable. However, it is advised, and sometimes necessary, to replace the BatchNorm layers with SyncBN layers.

The SyncBN layers synchronize the calculation of the normalization factors across the devices during data parallelism.

SyncBN in PyTorch

A SyncBN layer in the model architecture is defined as follows:

syncBN_layer = torch.nn.SyncBatchNorm(num_features, eps=1e-05, momentum=0.1,
                                      affine=True, track_running_stats=True,
                                      process_group=None)
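
By default (process_group=None), the statistics are synchronized over all the processes. A process subgroup can be passed to restrict the synchronization; a minimal sketch, assuming the default process group has already been initialized and that ranks 0 and 1 should share their statistics:

import torch
import torch.distributed as dist

# Assumption: dist.init_process_group(...) has already been called
subgroup = dist.new_group(ranks=[0, 1])
syncBN_layer = torch.nn.SyncBatchNorm(64, process_group=subgroup)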

Alternatively, with the convert_sync_batchnorm method, it is possible to convert an existing model by transforming all of its BatchNorm layers into SyncBatchNorm layers. It is necessary to apply convert_sync_batchnorm before wrapping the model into a DDP model.

import torch

# network is a model containing nn.BatchNorm layers
sync_bn_network = torch.nn.SyncBatchNorm.convert_sync_batchnorm(network)
# only a single GPU per process is currently supported
ddp_sync_bn_network = torch.nn.parallel.DistributedDataParallel(
                        sync_bn_network,
                        device_ids=[args.local_rank],
                        output_device=args.local_rank)

Source: PyTorch documentation

SyncBN in TensorFlow

For TensorFlow, the SyncBN layers are still at an experimental stage. Please refer directly to the TensorFlow documentation.
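
For reference, a minimal sketch with the experimental Keras layer (the exact layer name and its availability depend on the TensorFlow version; tf.keras.layers.experimental.SyncBatchNormalization is assumed here, under a tf.distribute strategy):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(16,)),
        # Experimental layer synchronizing the batch statistics across replicas
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.Dense(10),
    ])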