This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.
Nsight Compute
We invite you to consult the best practices for code profiling for general advice on performance analysis on Jean Zay.
Description
Nsight Compute is an NVIDIA tool dedicated to the fine-grained profiling of CUDA kernels.
Unlike Nsight Systems, which provides a global timeline view, Nsight Compute works at the kernel level and allows precise identification of GPU bottlenecks.
It provides detailed metrics on:
- SM occupancy and warp efficiency;
- memory usage (global memory, L2 cache, shared memory);
- stalls and throughput limitations (compute-bound vs memory-bound).
It can be used via the command line (ncu) to collect measurements, and then via a graphical interface (ncu-ui) to explore reports, compare executions, and guide kernel optimisations.
Installed Versions
The module command provides access to the various versions of Nsight Compute.
To display the available versions:
$ module avail nvidia-nsight-compute
nvidia-nsight-compute/2020.3.1 nvidia-nsight-compute/2022.1.0 nvidia-nsight-compute/2023.3.1.0
nvidia-nsight-compute/2021.3.0 nvidia-nsight-compute/2022.4.0 nvidia-nsight-compute/2024.3.2.0
Usage
To use, for example, version 2024.3.2.0 of Nsight Compute, you need to load the corresponding module:
$ module load nvidia-nsight-compute/2024.3.2.0
Once the module is loaded, usage is in two steps:
- Run your application with ncu (data collection);
- Visualise/analyse the reports with ncu-ui.
Command Line Execution (CLI)
Data collection is done with ncu in your Slurm scripts.
- For the ncu command to be recognised, the appropriate module must be loaded beforehand.
- Nsight Compute profiling can significantly slow down execution. It is recommended to profile reduced and targeted cases.
- To see all options, use ncu --help.
Here is an example of a submission script for a multi-GPU MPI code (4 processes, 4 GPUs), with one report file per MPI rank:
#!/bin/bash
#SBATCH --job-name=nsight_compute
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=10
##SBATCH --exclusive -C prof
#SBATCH --hint=nomultithread
#SBATCH --time=00:20:00
module load ...
module load nvidia-nsight-compute/2024.3.2.0
set -x
# Example: one .ncu-rep report per MPI process
srun bash -c 'ncu -f -o ncu_report_rank${SLURM_PROCID} ./my_bin_exe'
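When only one rank needs to be examined, a small wrapper can limit the profiling overhead to that rank. The following is a minimal sketch, not part of the original script: the function name run_rank, the PROFILE_RANK and NCU variables, and the file name ncu_wrapper.sh are hypothetical. SLURM_PROCID is the rank variable already used in the script.

```shell
#!/bin/bash
# ncu_wrapper.sh (hypothetical name): profile a single MPI rank, run the others unprofiled.

run_rank() {
    local rank="${SLURM_PROCID:-0}"   # rank set by srun; defaults to 0 outside Slurm
    if [ "$rank" -eq "${PROFILE_RANK:-0}" ]; then
        # NCU can be overridden (e.g. for a dry run); it defaults to the real ncu binary.
        "${NCU:-ncu}" -f -o "ncu_report_rank${rank}" "$@"
    else
        "$@"                          # other ranks run the application as is
    fi
}

# Only act when a command is given, e.g. ./ncu_wrapper.sh ./my_bin_exe
if [ "$#" -gt 0 ]; then
    run_rank "$@"
fi
```

Assuming the wrapper is executable, the srun line of the script would become srun ./ncu_wrapper.sh ./my_bin_exe, and only the rank selected by PROFILE_RANK (0 by default) produces a report.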
Avoid profiling all iterations of a large production case. Instead, target a representative portion (reduced dataset, critical phase, targeted kernels).
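The targeting advice above can be sketched with three ncu options: --kernel-name (restrict collection to a named kernel), --launch-skip (ignore the first matching launches, for example warm-up iterations) and --launch-count (stop after a given number of profiled launches). The helper name ncu_targeted_cmd, the kernel name my_kernel and the values 10 and 5 are placeholders chosen for illustration; check ncu --help for the options available in the version you load.

```shell
#!/bin/bash
# Sketch: build a targeted ncu command string for use with `srun bash -c`.
# Arguments: kernel name, launches to skip, launches to profile, executable.
ncu_targeted_cmd() {
    local kernel="$1" skip="$2" count="$3" exe="$4"
    # ${SLURM_PROCID} is kept literal (single-quoted format string) so that
    # each MPI task expands it itself when the command is run by srun.
    printf 'ncu -f -o ncu_report_rank${SLURM_PROCID} --kernel-name %s --launch-skip %s --launch-count %s %s\n' \
        "$kernel" "$skip" "$count" "$exe"
}

# Dry run: print the command that would be passed to srun.
ncu_targeted_cmd my_kernel 10 5 ./my_bin_exe
```

In a submission script, the printed command could be launched with srun bash -c "$(ncu_targeted_cmd my_kernel 10 5 ./my_bin_exe)".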
Visualisation/Analysis of Results (GUI)
The reports generated by ncu are .ncu-rep files.
To open them, run ncu-ui with the report file as argument. On Jean Zay, ncu-ui requires an SSH connection with X11 forwarding (e.g. ssh -X):
ncu-ui ncu_report_rank0.ncu-rep
The graphical interface may be slow with X11 forwarding from a login node. You can use a visualisation node or transfer the reports to your local machine.
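When the graphical interface is too slow, a first look at a report is also possible in the terminal: ncu can read a saved report with --import and print its details page as text. The sketch below converts every per-rank report of a directory into a text file. The function name summarise_reports and the NCU override variable are hypothetical, and the report naming follows the submission script shown earlier.

```shell
#!/bin/bash
# Sketch: write a text summary next to each ncu_report_rank*.ncu-rep file.
summarise_reports() {
    local dir="${1:-.}" rep
    for rep in "$dir"/ncu_report_rank*.ncu-rep; do
        [ -e "$rep" ] || continue     # no report found: skip the unexpanded pattern
        # NCU can be overridden for a dry run; it defaults to the real ncu binary.
        "${NCU:-ncu}" --import "$rep" --page details > "${rep%.ncu-rep}.txt"
    done
}
```

After loading the module, summarise_reports . produces ncu_report_rank0.txt, ncu_report_rank1.txt, and so on, which can be read with any pager or text editor.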
Documentation
The complete documentation is available on the NVIDIA website.