Tensor Cores

⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

Introduction

All computing units are designed to support arithmetic and logical operations. Some of these operations, however, require far more elementary calculations than others. Matrix/tensor multiplication and convolution are among the most computationally intensive, and they are also used very often in AI.

Simply multiplying two 4x4 matrices requires 64 multiplication operations and 48 addition operations. Furthermore, the computational complexity of this type of operation increases very rapidly as the size and dimensions of the matrix increase¹.
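These counts follow from the schoolbook algorithm: each of the n² output entries needs n multiplications and n − 1 additions. A minimal Python sketch that reproduces the figures (the function name `matmul_op_count` is purely illustrative and does not come from any library):

```python
def matmul_op_count(n: int) -> tuple[int, int]:
    """Return (multiplications, additions) for the schoolbook product of two n x n matrices."""
    multiplications = n * n * n      # n products for each of the n*n output entries
    additions = n * n * (n - 1)      # n - 1 sums to combine those n products
    return multiplications, additions


for n in (4, 8, 16):
    mults, adds = matmul_op_count(n)
    print(f"{n}x{n}: {mults} multiplications, {adds} additions")
```

Doubling the matrix size multiplies the number of operations by roughly eight, which is why dedicated hardware pays off quickly.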

To address this issue, NVIDIA has offered a specific computing unit since its Volta architecture: the Tensor Core.

This is a hardware unit specialised for tensor multiplication. Whereas CUDA cores must perform a series of elementary operations (multiplications and additions), a tensor core performs them in a single operation. The time savings can be very significant, depending on the precision used and the GPU in question: a speed-up of 2x to 3x is typical for the vast majority of Deep Learning models.

Figure: comparison of a Pascal CUDA core and a Volta Tensor Core. Whereas the CUDA core performs the matrix multiplication vector by vector, the Tensor Core performs all these operations in parallel, with an execution speed 12 times faster.

Using and benefiting from Tensor Core acceleration

As indicated above, using tensor cores for multiplication operations provides a significant performance gain. It is therefore an essential step for anyone who wishes to optimise training (or inference) times.

Jean Zay has NVIDIA V100, A100 and H100 GPUs. All three models have tensor cores. However, the usage and implementation differ between the three models.

Tensor cores generally operate at reduced precision, a constraint imposed by the hardware. The lower the precision, the faster the calculation executes. The available precisions depend on the architecture and GPU model.
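To make the cost of a lower precision concrete, the following sketch stores the same value in 32 and in 16 bits using only Python's standard `struct` module. It is an illustration on the CPU, not a measurement on a GPU, and the helper name `round_trip` is hypothetical:

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
as_fp32 = round_trip(value, "<f")   # 32 bits: 8 exponent bits, 23 mantissa bits
as_fp16 = round_trip(value, "<e")   # 16 bits: 5 exponent bits, 10 mantissa bits

print(f"FP32: {as_fp32!r}  (error {abs(as_fp32 - value):.2e})")
print(f"FP16: {as_fp16!r}  (error {abs(as_fp16 - value):.2e})")

# FP16 also has a much smaller range: its largest finite value is 65504.
try:
    struct.pack("<e", 70000.0)
except OverflowError as exc:
    print("70000.0 does not fit in FP16:", exc)
```

Fewer bits mean less data to move and compute on, at the price of a larger rounding error and a narrower range of representable values.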

Here are some examples of precisions (and associated performance) available on the tensor cores of Jean Zay's GPUs:

Tensor core performance in TFLOPS by precision and GPU (x: precision not available on the tensor cores of that GPU)
GPU \ Precision	FP64	TF32	FP16	BF16	INT8	FP8
V100	x	x	112	x	x	x
A100	19.5	156	312	312	624	x
H100	51.2	378	756	756	1513	1513
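As a rough reading aid for the table above, the sketch below converts the FP16 peak figures into a best-case duration for one large matrix product. It assumes about 2·n³ floating-point operations for an n x n product and perfect use of the hardware, so real timings will be higher; the names `PEAK_FP16_TFLOPS` and `ideal_matmul_seconds` are illustrative only.

```python
# Peak tensor-core throughput in TFLOPS, copied from the FP16 column of the table above.
PEAK_FP16_TFLOPS = {"V100": 112, "A100": 312, "H100": 756}


def ideal_matmul_seconds(n: int, peak_tflops: float) -> float:
    """Lower bound on the time to multiply two n x n matrices at peak throughput."""
    flops = 2 * n**3                       # about n^3 multiplications plus n^3 additions
    return flops / (peak_tflops * 1e12)    # 1 TFLOPS = 1e12 operations per second


n = 16384
for gpu, tflops in PEAK_FP16_TFLOPS.items():
    millis = ideal_matmul_seconds(n, tflops) * 1e3
    print(f"{gpu}: at best {millis:.1f} ms for a {n}x{n} FP16 product")
```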

Usage on NVIDIA V100 GPUs:

The V100, based on the Volta architecture, is the first GPU to include Tensor Cores (640 per GPU). By hardware design, these tensor cores only perform calculations in Float16 precision. Most AI models are in Float32, so to benefit from the acceleration provided by tensor cores you need to reduce the precision of the model variables².

To do this, there are two possible solutions:

Convert all the elements of the model to Float16
Use AMP (automatic mixed precision), which is highly recommended
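As an illustration of the second option, here is a minimal sketch of one AMP training step. It assumes PyTorch, which this page does not prescribe; the model, the random data and the helper name `train_step_amp` are placeholders. On a machine without a GPU the sketch falls back to BF16 autocast on the CPU with loss scaling disabled, only so that it stays runnable.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
# FP16 is the precision accepted by the V100 tensor cores; the CPU fallback uses BF16.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(64, 8).to(device)          # placeholder model, weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Loss scaling protects small FP16 gradients from underflow; it does nothing when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)


def train_step_amp(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, eligible operations (e.g. matrix products) run in reduced precision.
    with torch.autocast(device_type=device, dtype=amp_dtype):
        outputs = model(inputs)
        loss = nn.functional.mse_loss(outputs.float(), targets)
    scaler.scale(loss).backward()            # backward pass on the (possibly scaled) loss
    scaler.step(optimizer)                   # unscales gradients, skips the step on inf/NaN
    scaler.update()                          # adapts the scale factor for the next step
    return loss.item()


inputs = torch.randn(32, 64, device=device)  # random placeholder batch
targets = torch.randn(32, 8, device=device)
print("loss:", train_step_amp(inputs, targets))
```

The model weights remain in Float32 and only selected operations are executed in reduced precision, which is what makes this approach safer than converting the whole model to Float16. Recent PyTorch releases may emit a deprecation warning suggesting `torch.amp.GradScaler` instead of `torch.cuda.amp.GradScaler`.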

Usage on NVIDIA A100 GPUs:

The A100 GPUs, more recent and more powerful than the V100s, have the 3rd generation of tensor cores (432 per GPU). These tensor cores support the commonly used precisions as well as new ones. The A100s can also engage their tensor cores on their own: without any additional code, the hardware converts certain variables so that calculations run on tensor cores. You benefit from some acceleration without any effort.

warning

This automatic mechanism has its limits: although automatic use of tensor cores is observed, it remains relatively low.

tip

AMP is still recommended. It makes more extensive use of tensor cores than the automatic mechanism alone. AMP also has other advantages, particularly in terms of memory usage.
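To make the automatic conversion on the A100 more concrete: it most plausibly corresponds to the TF32 format listed in the performance table (an assumption, since this section does not name the format). TF32 keeps the 8 exponent bits of FP32 but only 10 mantissa bits. The pure-Python sketch below approximates that loss of precision by truncating the extra bits, whereas real hardware may round instead; the function name `to_tf32` is hypothetical.

```python
import struct


def to_tf32(value: float) -> float:
    """Approximate TF32 by zeroing the 13 lowest mantissa bits of an FP32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits &= 0xFFFFE000                 # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 3.14159265):
    fp32 = struct.unpack("<f", struct.pack("<f", value))[0]
    print(f"FP32 {fp32!r} -> TF32 {to_tf32(value)!r}")
```

The range of representable values is unchanged, which is why such a conversion can be applied to Float32 models without code changes, at the cost of a few digits of accuracy.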

Usage on NVIDIA H100 GPUs:

The latest extension of Jean Zay has NVIDIA H100 GPUs. Even more recent and powerful, they have the 4th generation of tensor cores (528 per GPU). The H100s support all the precisions of the A100s and add FP8 on their tensor cores.

Footnotes

¹ Multiplying two 8x8 matrices involves 512 multiplications and 448 additions.
² The term precision refers to the number of bits used to store the variable.
