Jean Zay: TensorFlow and PyTorch profiling tools

Profiling is an indispensable step in code optimisation. Its goal is to identify the execution steps which cost the most in time or memory, and to visualise how the workload is distributed between GPUs and CPUs.

Profiling an execution is very time-consuming. It is therefore generally limited to one or two iterations of your training steps.

The TensorFlow profiling tool

TensorFlow includes a profiling functionality called « TensorFlow Profiler ».

The TensorFlow Profiler requires TensorFlow and TensorBoard version 2.2 or higher. On Jean Zay, it is available with TensorFlow 2.2.0 (or higher) by loading the suitable module. For example:

$ module load tensorflow-gpu/py3/2.2.0

Instrumentation of your TensorFlow code for profiling

The code must include a TensorBoard callback as explained on the page TensorBoard visualisation tool for TensorFlow and PyTorch.

Visualisation of the TensorFlow code profile

The TensorFlow Profiler results can be viewed in TensorBoard, in the PROFILE tab. Access to TensorBoard is described here.

The TensorBoard PROFILE tab opens onto the « Overview Page ». This displays a summary of the performance in execution time of the different calculation steps. With this, you can quickly know if it is the training, data loading, or data preprocessing which consumes the most execution time.

On the Trace Viewer page, you can view a more detailed description of the execution sequence, distinguishing between the operations executed on GPUs and on CPUs. For example:

In this example, the GPUs (top) are idle most of the time compared with the CPUs (bottom). The blocks of colour show that the GPUs are only used at the end of each step, while the CPUs are used regularly on certain threads. Performance could certainly be improved by a better distribution of work between GPUs and CPUs.

The PyTorch profiling tool

PyTorch includes a profiling functionality called « PyTorch Profiler ».

Instrumentation of your PyTorch code for profiling

In the PyTorch code, you must:

  • Import the profiler.
from torch.profiler import profile, record_function, ProfilerActivity
  • Then, wrap the call to the training function in the profiler.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("training_function"):
        train()

Comments:

  • ProfilerActivity.CUDA: Records the CUDA events (linked to the GPUs).
  • with record_function("$NAME"): Attaches a tag (a label with a name) to a block of code. It is therefore also useful to tag the groups of sub-functions inside the training function. For example:
def train():
    for epoch in range(1, num_epochs+1):
        for i_step in range(1, total_step+1):
            # Obtain the batch.
            with record_function("load input batch"):
                images, captions = next(iter(data_loader))
            with record_function("Training step"):
                # The forward pass which produces `outputs` is not shown here.
                loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
  • Beginning with PyTorch version 1.6.0, it is possible to profile the CPU and GPU memory footprint by passing the profile_memory=True parameter to profile.

Visualisation of the PyTorch code profile

It is not yet possible, in the current versions, to visualise the PyTorch Profiler results using the PROFILE functionality of TensorBoard. Below, we suggest other solutions documented in PyTorch.

Visualisation of a profiling table

To display the profile results after the training function has executed, run the following line:

print(prof.key_averages().table(sort_by="cpu_time", row_limit=10))

You will then obtain a list of all the functions, whether tagged automatically or tagged by you (via record_function), in descending order of average CPU time per call (the « CPU time avg » column).

-----------------------------  ---------------  ---------------  ---------------  ---------------
Name                           Self CPU total   CPU total        CPU time avg     Number of Calls
-----------------------------  ---------------  ---------------  ---------------  ---------------
training_function              1.341s           62.089s          62.089s          1
load input batch               57.357s          58.988s          14.747s          4
Training step                  1.177s           1.212s           303.103ms        4
EmbeddingBackward              51.355us         3.706s           231.632ms        16
embedding_backward             30.284us         3.706s           231.628ms        16
embedding_dense_backward       3.706s           3.706s           231.627ms        16
move to GPU                    5.967ms          546.398ms        136.599ms        4
stack                          760.467ms        760.467ms        95.058ms         8
BroadcastBackward              4.698ms          70.370ms         8.796ms          8
ReduceAddCoalesced             22.915ms         37.673ms         4.709ms          8
-----------------------------  ---------------  ---------------  ---------------  ---------------

The « Self CPU total » column corresponds to the time spent in the function itself, not in its sub-functions.
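The difference between a « self » time and a total time can be illustrated with a minimal sketch in plain Python. The call tree and its timings below are hypothetical and are not taken from a real profile. The sketch assumes each entry stores a total time which includes everything it calls, and subtracts the totals of the direct children to get the self time.

```python
# Hypothetical call tree: "total" is the time in seconds spent in a function,
# including everything it calls. All numbers are made up for illustration.
call_tree = {
    "name": "training_function",
    "total": 10.0,
    "children": [
        {"name": "load input batch", "total": 6.0, "children": []},
        {"name": "Training step", "total": 3.0, "children": [
            {"name": "forward", "total": 1.0, "children": []},
            {"name": "backward", "total": 1.5, "children": []},
        ]},
    ],
}


def self_times(node):
    """Return {name: self time}, with self = total - sum of direct children totals."""
    children_total = sum(child["total"] for child in node["children"])
    result = {node["name"]: node["total"] - children_total}
    for child in node["children"]:
        result.update(self_times(child))
    return result


times = self_times(call_tree)
for name, seconds in times.items():
    print(f"{name:20s} self = {seconds:.1f}s")
```

In this sketch, the self times of all the entries add up to the total time of the top-level entry, which is why a large total with a small self time points to the sub-functions rather than to the function itself.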

The « Number of Calls » column shows the number of times a function was called during the profiled execution.

Here, we see that the image input load time (load input batch) is much larger than the neural network training (Training step). Therefore, optimisation must target batch loading.
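To quantify this observation, the short calculation below expresses the « CPU total » of the two tagged blocks as a share of training_function. The values are copied from the table above. The variable names are chosen for this illustration only.

```python
# "CPU total" values copied from the profiling table, in seconds.
cpu_total = {
    "training_function": 62.089,
    "load input batch": 58.988,
    "Training step": 1.212,
}

whole_run = cpu_total["training_function"]

# Share of the whole training function spent in each tagged block, in percent.
shares = {
    name: 100.0 * seconds / whole_run
    for name, seconds in cpu_total.items()
    if name != "training_function"
}

for name, percent in shares.items():
    print(f"{name}: {percent:.1f}% of training_function")
```

Batch loading accounts for roughly 95% of the CPU time of the training function, compared with about 2% for the training step.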

Visualisation of the profiling with the Chromium trace tool

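To display a Trace Viewer equivalent to that of TensorBoard, you can also generate a « json » trace file with the export_chrome_trace method of the profiler (the file name below is only an example):

prof.export_chrome_trace("trace.json")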

This trace file can be viewed with the Chromium project trace tool. From a Chrome (or Chromium) browser, enter the following address in the URL bar, then load the trace file:

chrome://tracing

Here, we see the CPU and GPU usage distinctly, as with TensorBoard. The CPU functions are shown on top and the GPU functions on the bottom. There are 5 CPU tasks and 4 GPU tasks. Each block of colour represents a function or a sub-function. The trace shows the end of the loading of a data batch, followed by a training iteration. In the lower part, we see the forward, backward and synchronisation steps executed across multiple GPUs.
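When no browser is at hand, a trace file of this kind can also be summarised with a few lines of standard Python. The sketch below is only an illustration: it assumes the Chrome Trace Event layout (complete events marked "ph": "X", with "ts" and "dur" in microseconds) and runs on a small synthetic trace rather than on a real PyTorch export. The exact fields written by PyTorch may vary between versions, and the function name total_duration_by_name is hypothetical.

```python
import json
from collections import defaultdict

# Tiny synthetic trace in the Chrome Trace Event "JSON object" layout.
# "ph": "X" marks a complete event; "ts" and "dur" are in microseconds.
synthetic_trace = json.dumps({
    "traceEvents": [
        {"name": "load input batch", "ph": "X", "ts": 0, "dur": 900, "pid": 1, "tid": 1},
        {"name": "Training step", "ph": "X", "ts": 900, "dur": 100, "pid": 1, "tid": 1},
        {"name": "load input batch", "ph": "X", "ts": 1000, "dur": 850, "pid": 1, "tid": 1},
        {"name": "process_name", "ph": "M", "pid": 1, "args": {"name": "python"}},
    ]
})


def total_duration_by_name(trace_text):
    """Sum the duration (microseconds) of complete events, grouped by name."""
    data = json.loads(trace_text)
    # The format allows either a bare list of events or an object wrapping them.
    events = data if isinstance(data, list) else data.get("traceEvents", [])
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":  # skip metadata and other event types
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


totals = total_duration_by_name(synthetic_trace)
for name, duration in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {duration} us")
```

To apply the same function to a real file, read its content with open(...).read() and pass the resulting string instead of synthetic_trace.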

Official documentation