Jean Zay: TensorFlow and PyTorch profiling tools
Profiling is an indispensable step in code optimisation. Its goal is to identify the execution steps which are the most costly in time or memory, and to visualise the workload distribution between GPUs and CPUs.
Profiling an execution is itself a time-consuming operation. For this reason, it is generally done on only one or two iterations of your training steps.
The TensorFlow profiling tool
TensorFlow includes a profiling functionality called "TensorFlow Profiler".
The TensorFlow Profiler requires TensorFlow and TensorBoard versions 2.2 or higher. On Jean Zay, it is available in the TensorFlow 2.2.0 (or higher) versions by loading the suitable module. For example:
$ module load tensorflow-gpu/py3/2.2.0
Instrumentation of your TensorFlow code for profiling
The code must include a TensorBoard callback as explained on the page TensorBoard visualisation tool for TensorFlow and PyTorch.
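As an illustration, a minimal sketch of such an instrumentation is given below. The model, data, and log directory are hypothetical placeholders; only the profile_batch parameter of the Keras TensorBoard callback is specific to profiling (here it limits the profile to a single batch, in line with the advice above):

import tensorflow as tf

# Hypothetical minimal model and data, for illustration only
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = tf.random.normal((256, 32))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int64)

# TensorBoard callback: profile_batch=2 profiles only the second batch
# (the default), keeping the profiled region down to a single training step
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/profile",
                                             profile_batch=2)

model.fit(x, y, batch_size=32, epochs=1, callbacks=[tb_callback])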
Visualisation of the TensorFlow code profile
Visualisation of the TensorFlow Profiler is possible via TensorBoard, in the PROFILE tab. Access to TensorBoard is described here.
The TensorBoard PROFILE tab opens onto the "Overview Page". This displays a summary of the execution-time performance of the different calculation steps, so you can quickly see whether it is the training, the data loading, or the data preprocessing which consumes the most execution time.
On the Trace Viewer page, you can view a more detailed description of the execution sequence, distinguishing between the operations executed on GPUs and on CPUs. For example:
In this example, we see that the GPUs (top) are idle most of the time compared to the CPUs (bottom). The blocks of colour show that the GPUs are only used at the end of each step, while the CPUs are used regularly on certain threads. An optimisation is certainly possible through a better distribution of work between GPUs and CPUs.
The PyTorch profiling tool
PyTorch includes a profiling functionality called "PyTorch Profiler".
Instrumentation of your PyTorch code for profiling
In the PyTorch code, you must:
- Import the profiler.
from torch.profiler import profile, record_function, ProfilerActivity
- Then, invoke the profiler as a context manager around the execution of the training function.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("training_function"):
        train()
Comments:
- ProfilerActivity.CUDA: allows recovering the CUDA events (linked to the GPUs).
- with record_function("$NAME"): allows adding a decorator (a tag associated with a name) to a block of functions. It is therefore also useful to add decorators in the training function for the sets of sub-functions. For example:
def train():
    for epoch in range(1, num_epochs+1):
        for i_step in range(1, total_step+1):
            # Obtain the batch.
            with record_function("load input batch"):
                images, captions = next(iter(data_loader))
            ...
            with record_function("Training step"):
                ...
                loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
                ...
- Beginning with PyTorch version 1.6.0, it is possible to profile the CPU and GPU memory footprint by adding the profile_memory=True argument to profile, as shown in the sketch below.
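For illustration, here is a minimal sketch of the previous call with memory profiling enabled; profile_memory is the actual PyTorch parameter, and "self_cpu_memory_usage" is one of the sort keys accepted by table():

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True) as prof:  # profile_memory requires PyTorch >= 1.6.0
    with record_function("training_function"):
        train()

# Display the operations with the largest CPU memory footprint
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))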
Visualisation of the PyTorch code profile
It is not yet possible, in the current versions, to visualise the PyTorch Profiler results using the PROFILE functionality of TensorBoard. Below, we suggest other solutions documented in PyTorch.
Visualisation of a profiling table
To display the profile results after the training function has executed, run the following line:
print(prof.key_averages().table(sort_by="cpu_time", row_limit=10))
You will then obtain a list of all the functions tagged automatically, or tagged by yourself (via decorators), in descending order of total CPU time.
-------------------------------  ---------------  ------------  -------------  ---------------
Name                             Self CPU total   CPU total     CPU time avg   Number of Calls
-------------------------------  ---------------  ------------  -------------  ---------------
training_function                1.341s           62.089s       62.089s        1
load input batch                 57.357s          58.988s       14.747s        4
Training step                    1.177s           1.212s        303.103ms      4
EmbeddingBackward                51.355us         3.706s        231.632ms      16
embedding_backward               30.284us         3.706s        231.628ms      16
embedding_dense_backward         3.706s           3.706s        231.627ms      16
move to GPU                      5.967ms          546.398ms     136.599ms      4
stack                            760.467ms        760.467ms     95.058ms       8
BroadcastBackward                4.698ms          70.370ms      8.796ms        8
ReduceAddCoalesced               22.915ms         37.673ms      4.709ms        8
-------------------------------  ---------------  ------------  -------------  ---------------
The "Self CPU total" column corresponds to the time spent in the function itself, not in its sub-functions.
The "Number of Calls" column contains the number of times a function was called.
Here, we see that the image batch load time (load input batch) is much larger than the neural network training itself (Training step). Therefore, optimisation must target batch loading.
Visualisation of the profiling with the Chromium trace tool
To display a Trace Viewer equivalent to that of TensorBoard, you can also generate a JSON trace file with the following line:
prof.export_chrome_trace("trace.json")
This trace file is viewable with the Chromium project trace tool. From a Chrome (or Chromium) browser, enter the following address in the URL bar:
about:tracing
Here, we see the CPU and GPU usage distinctly, as with TensorBoard: the CPU functions are shown on top and the GPU functions on the bottom. There are 5 CPU tasks and 4 GPU tasks. Each block of colour represents a function or a sub-function. We are at the end of a data batch load, followed by a training iteration. In the lower part, we see the forward and backward passes and the synchronisation executed across multiple GPUs.
Official documentation
- Profiling on TensorBoard with TensorFlow: https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras
- Profiling a PyTorch code: https://pytorch.org/tutorials/recipes/recipes/profiler.html, https://pytorch.org/docs/stable/autograd.html