⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

Compiling with CUDA-aware MPI and GPUDirect

For optimal performance, CUDA-aware OpenMPI libraries supporting GPUDirect are available on Jean Zay.

These MPI libraries allow communications whose send and receive buffers are allocated in GPU memory. With GPUDirect support, these transfers go directly from GPU to GPU, without an intermediate copy to CPU memory, whenever possible.

warning

GPUDirect cannot be used on Jean Zay by codes that use multiple GPUs per MPI process.

Code Compilation

The code must be compiled with one of the CUDA-aware OpenMPI libraries available on Jean Zay.

After loading the compiler you wish to use, you must load one of the following modules:

$ module avail openmpi/*-cuda
----- /lustre/fshomisc/sup/hpe/pub/modules-idris-env4/modulefiles/linux-rhel9-skylake_avx512 ------
openmpi/3.1.4-cuda  openmpi/4.0.2-cuda  openmpi/4.1.0-cuda  openmpi/4.1.5-cuda
openmpi/3.1.6-cuda  openmpi/4.0.4-cuda  openmpi/4.1.1-cuda  openmpi/4.1.6-cuda
openmpi/4.0.1-cuda  openmpi/4.0.5-cuda  openmpi/4.1.4-cuda  openmpi/4.1.8-cuda
   
$ module load openmpi/4.0.4-cuda

If OpenMPI is not available for the desired compiler, an error message will be displayed. Do not hesitate to contact support at assist@idris.fr to request a new installation if necessary.

To list the compilers for which a given CUDA-aware OpenMPI version is available, use the command module show openmpi/<version>-cuda. For example:

$ module show openmpi/4.1.8-cuda
------------------------------------------------------------------------------------
/lustre/fshomisc/sup/hpe/pub/modules-idris-env4/modulefiles/linux-rhel9-skylake_avx512/openmpi/4.1.8-cuda:

module-whatis   {An open source Message Passing Interface implementation.}
prereq          intel-compilers/2021.9.0 nvidia-compilers/25.1 gcc/14.2.0 intel-oneapi-compilers/2023.1.0
conflict        openmpi
conflict        intel-mpi

Available software environment(s):
- intel-compilers/2021.9.0 cuda/12.8.0
- nvidia-compilers/25.1 cuda/12.8.0
- gcc/14.2.0 cuda/12.8.0
- intel-oneapi-compilers/2023.1.0 cuda/12.8.0

If you want to use this module with another software environment,
please contact the support team.
------------------------------------------------------------------------------------

The "Available software environment(s):" section lists the compilers that are compatible with the specified library.

Compilation is done using the OpenMPI wrappers. For example:

mpifort source.f90

mpicc source.c

mpic++ source.C

No special compilation options are required. For more information on compiling GPU codes, see the documentation pages on the NVIDIA/PGI compilers and on OpenACC code compilation.

Code Adaptation

On the V100 or A100 partitions of Jean Zay, the CUDA-aware MPI GPUDirect feature requires CUDA or OpenACC and MPI to be initialised in a specific order in the code:

1. Initialisation of CUDA or OpenACC;
2. Selection of the GPU that each MPI process should use (binding step);
3. Initialisation of MPI.

If this initialisation order is not followed, execution may fail with the following error: CUDA failure: cuCtxGetDevice().

info

This adaptation is essential on the partitions that rely on the OmniPath interconnect, i.e. the V100 and A100 partitions of Jean Zay. It can also be applied on the H100 partition (InfiniBand network), but it is not required there.

Below are example Fortran and C routines that initialise OpenACC before MPI, followed by a CUDA example.

OpenACC Example

warning

This example only works when exactly one MPI process per GPU is allocated.

Fortran Version

init_acc.f90
#ifdef _OPENACC
subroutine initialisation_openacc

    USE openacc
    implicit none

    character(len=6) :: local_rank_env
    integer          :: local_rank_env_status, local_rank

    ! Initialise OpenACC
    !$acc init

    ! Get the local rank of the process from the environment variable set
    ! by Slurm: MPI_Comm_rank cannot be used yet because this routine is
    ! called BEFORE MPI is initialised
    call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)

    if (local_rank_env_status == 0) then
        read(local_rank_env, *) local_rank
        ! Select the GPU to use through OpenACC
        call acc_set_device_num(local_rank, acc_get_device_type())
    else
        print *, "Error: unable to determine the local rank of the process"
        stop 1
    end if
end subroutine initialisation_openacc
#endif

Example of usage:

init_acc_mpi.f90
! Initialise OpenACC...
#ifdef _OPENACC
  call initialisation_openacc
#endif
! ... before initialising MPI
  call mpi_init(code)

C Version

init_acc.c
#ifdef _OPENACC
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

void initialisation_openacc()
{
    char* local_rank_env;
    int local_rank;

    /* Initialise OpenACC */
    #pragma acc init

    /* Get the local rank of the process from the environment variable set
       by Slurm: MPI_Comm_rank cannot be used yet because this routine is
       called BEFORE MPI is initialised */
    local_rank_env = getenv("SLURM_LOCALID");

    if (local_rank_env) {
        local_rank = atoi(local_rank_env);
        /* Select the GPU to use through OpenACC */
        acc_set_device_num(local_rank, acc_get_device_type());
    } else {
        printf("Error: unable to determine the local rank of the process\n");
        exit(1);
    }
}
#endif

Example of usage:

init_acc_mpi.c
#ifdef _OPENACC
/* Initialise OpenACC... */
initialisation_openacc();
#endif
/* ... before initialising MPI */
MPI_Init(&argc, &argv);
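
The C routine above converts SLURM_LOCALID with atoi, which returns 0 for malformed text and would therefore silently select GPU 0. As a minimal sketch, assuming stricter validation is wanted, the conversion could be done with strtol instead. The helper names parse_local_rank and get_slurm_local_rank are hypothetical and are not part of the original example.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption, not from the original page):
   converts the text of SLURM_LOCALID into a non-negative int.
   Returns 0 on success, -1 if the text is missing or malformed. */
int parse_local_rank(const char *text, int *local_rank)
{
    char *end = NULL;
    long value;

    /* A missing or empty variable cannot be converted */
    if (text == NULL || *text == '\0') {
        return -1;
    }

    errno = 0;
    value = strtol(text, &end, 10);

    /* Reject overflow, trailing characters and negative values */
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX) {
        return -1;
    }

    *local_rank = (int)value;
    return 0;
}

/* Hypothetical wrapper: reads SLURM_LOCALID from the environment */
int get_slurm_local_rank(int *local_rank)
{
    return parse_local_rank(getenv("SLURM_LOCALID"), local_rank);
}
```

In initialisation_openacc, a non-zero return value from get_slurm_local_rank would take the same error branch as a missing variable.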

CUDA Example

warning

This example only works when exactly one MPI process per GPU is allocated.

init_cuda.c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void initialisation_cuda()
{
    char* local_rank_env;
    int local_rank;
    cudaError_t cudaRet;

    /* Get the local rank of the process from the environment variable set
       by Slurm: MPI_Comm_rank cannot be used yet because this routine is
       called BEFORE MPI is initialised */
    local_rank_env = getenv("SLURM_LOCALID");

    if (local_rank_env) {
        local_rank = atoi(local_rank_env);
        /* Select the GPU to use for each MPI process */
        cudaRet = cudaSetDevice(local_rank);
        /* cudaSetDevice returns a runtime API status, so compare with cudaSuccess */
        if (cudaRet != cudaSuccess) {
            printf("Error: cudaSetDevice failed\n");
            exit(1);
        }
    } else {
        printf("Error: unable to determine the local rank of the process\n");
        exit(1);
    }
}

Example of usage:

init_cuda_mpi.c
/* Initialise CUDA... */
initialisation_cuda();
/* ... before initialising MPI */
MPI_Init(&argc, &argv);
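
The warning at the start of this example (exactly one MPI process per GPU) can be partly turned into a run-time check. The sketch below is plain C and takes the device count as a parameter so that it does not depend on CUDA; in a real code the count could come from cudaGetDeviceCount (CUDA runtime) or acc_get_num_devices (OpenACC). It assumes that each process sees all the GPUs allocated on its node, as the use of the local rank as device number on this page implies. The function name device_for_local_rank is hypothetical.

```c
/* Hypothetical helper (an assumption, not from the original page):
   returns the device index for this process, or -1 when the local rank
   has no matching GPU, i.e. more processes than GPUs on the node. */
int device_for_local_rank(int local_rank, int device_count)
{
    if (device_count <= 0 || local_rank < 0 || local_rank >= device_count) {
        return -1;
    }
    /* One process per GPU: the local rank is used as the device index */
    return local_rank;
}
```

This check only detects the case where there are more processes than GPUs on a node. It does not detect GPUs left unused because fewer processes were started.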

Code Execution

At run time, use the module command to load the same MPI library that was used to compile the code, then launch the executable with srun.

CUDA-aware and GPUDirect support is then enabled by default, with no further action required.

For more information on submitting multi-GPU jobs using MPI, see the page on executing a multi-GPU MPI CUDA-aware and GPUDirect job in batch.
