Tensor Cores
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.
Introduction
All computing units are designed to support arithmetic and logical operations. However, some of these operations require far more elementary calculations than others. Matrix/tensor multiplication and convolution are among the most computationally intensive, and they are also used very frequently in AI.
Simply multiplying two 4x4 matrices requires 64 multiplications and 48 additions. Furthermore, the computational cost of this type of operation grows very rapidly as the size and number of dimensions of the matrices increase [1].
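The 4x4 figures follow directly from the definition of the matrix product: each of the n x n output entries is a dot product of length n, which costs n multiplications and n - 1 additions. The short Python sketch below checks those figures and shows how quickly the totals grow. The function name `matmul_op_count` is ours, chosen for illustration, and it counts operations for the naive algorithm only.

```python
def matmul_op_count(n):
    """Return (multiplications, additions) for a naive n x n matrix product."""
    multiplications = n * n * n      # n products for each of the n*n outputs
    additions = n * n * (n - 1)      # n - 1 sums for each of the n*n outputs
    return multiplications, additions


for n in (4, 16, 256, 4096):
    mults, adds = matmul_op_count(n)
    print(f"{n:>4} x {n:<4} -> {mults:,} multiplications, {adds:,} additions")
```

For n = 4 this gives 64 multiplications and 48 additions, matching the figures quoted above.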
To address this issue, NVIDIA has offered a dedicated computing unit since its Volta architecture: the Tensor Core.
The Tensor Core is a hardware unit specialised for tensor multiplication. Whereas CUDA cores must perform a series of elementary operations (multiplications and additions), a tensor core performs the same matrix multiply-and-add in a single operation. The time savings can be very significant, depending on the precision used and the GPU in question: a speed-up of 2x to 3x for the vast majority of Deep Learning models.
Using and benefiting from Tensor Core acceleration
As indicated in the previous paragraph, using tensor cores for multiplication operations provides a significant performance gain. It is therefore an essential step for anyone who wishes to optimise their training (or inference) times.
Jean Zay has NVIDIA V100, A100 and H100 GPUs. All three models have tensor cores. However, the usage and implementation differ between the three models.
Tensor cores generally operate at reduced precision (a hardware constraint). The lower the precision, the faster the calculation executes. The available precisions depend on the architecture and the GPU model.
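The price of lower precision is fewer bits per value, and therefore coarser rounding. As a rough illustration, the sketch below uses only the Python standard library (`struct` formats `'f'` for FP32 and `'e'` for FP16) to round a value to each format. The helper name `round_to` is ours, and the emulation runs on the CPU: it illustrates the number formats, not tensor core behaviour.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to a struct format: 'f' = FP32, 'e' = FP16."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


value = 0.1
for name, fmt in (("FP32", "f"), ("FP16", "e")):
    stored = round_to(fmt, value)
    size = struct.calcsize(fmt)
    print(f"{name}: {size} bytes, 0.1 stored as {stored!r}, "
          f"error {abs(stored - value):.2e}")

# A small increment next to 1.0 is lost in FP16 but kept in FP32
print("FP16 keeps 1.0 + 1e-4:", round_to("e", 1.0 + 1e-4) != 1.0)
print("FP32 keeps 1.0 + 1e-4:", round_to("f", 1.0 + 1e-4) != 1.0)

# A very small value underflows to zero in FP16
print("1e-8 in FP16:", round_to("e", 1e-8))
```

Lost increments and underflow of this kind are the reason why converting a whole model to a low precision needs care.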
Here are some examples of precisions (and associated performance) available on the tensor cores of Jean Zay's GPUs:
Performance in TFLOPS by precision and GPU
| GPU \ Precision | FP64 | TF32 | FP16 | BF16 | INT8 | FP8 |
|---|---|---|---|---|---|---|
| V100 | x | x | 112 | x | x | x |
| A100 | 19.5 | 156 | 312 | x | 624 | x |
| H100 | 51.2 | 378 | 756 | 756 | 1513 | 1513 |
Usage on NVIDIA V100 GPUs:
The V100 (Volta architecture) is the first GPU to implement tensor cores (640 per GPU). By hardware design, these tensor cores compute only in Float16 precision, whereas most AI models are in Float32. To benefit from the acceleration provided by tensor cores, you therefore need to reduce the precision of the model variables [2].
To do this, there are two possible solutions:
- Convert all the elements of the model to Float16
- Use AMP (Automatic Mixed Precision): highly recommended
Usage on NVIDIA A100 GPUs:
The A100 GPUs, more recent and more powerful than the V100s, have the 3rd generation of tensor cores (432 per GPU). These support the commonly used precisions as well as new ones. The A100s can also manage the use of their tensor cores themselves: without any additional code, the hardware converts certain variables so that calculations can run on the tensor cores, and you benefit from the acceleration without any effort.
This automatic mode has its limits, however: although tensor cores are used automatically, the extent of that use remains relatively low.
AMP is therefore still recommended. It makes more extensive use of tensor cores than code without any explicit indication, and it has other advantages, particularly regarding memory usage.
Usage on NVIDIA H100 GPUs:
The latest Jean Zay extension is equipped with NVIDIA H100 GPUs. More recent and more powerful still, they have the 4th generation of tensor cores (528 per GPU). The H100s support the same precisions as the A100s, and their tensor cores also offer BF16 and FP8.
