Jean Zay: CUDA-aware MPI and GPUDirect

For optimal performance, the Cuda-aware OpenMPI libraries supporting the GPUDirect are available on Jean Zay.

These MPI libraries effectuate communications using send and receive buffers allocated on the GPU memory. Thanks to the GPUDirect support, these transfers are done directly from GPU to GPU whenever possible, without an intermediate copy on the CPU memory.

Code compilation

It is necessary to compile the code by using one of the CUDA-aware MPI libraries available on Jean Zay.

After loading the compiler you want to use, one of the following modules should be loaded:

$ module avail openmpi/*-cuda
-------------- /gpfslocalsup/pub/modules-idris-env4/modulefiles/linux-rhel8-skylake_avx512 --------------
openmpi/3.1.4-cuda openmpi/3.1.6-cuda openmpi/4.0.2-cuda openmpi/4.0.4-cuda
 
$ module load openmpi/4.0.4-cuda

If OpenMPI has not been compiled yet for the chosen compiler, an error message will be displayed. Please contact the support team which can install a new version.

You can use the module show openmpi/<version> command to display the list of compilers for which OpenMPI has already been compiled. For example:

$ module show openmpi/4.0.2-cuda
------------------------------------------------------------------
/gpfslocalsup/pub/modules-idris-env4/modulefiles/linux-rhel8-skylake_avx512/openmpi/4.0.2-cuda:
 
module-whatis   {An open source Message Passing Interface implementation.}
prereq          intel-compilers/19.0.4 pgi/20.1 pgi/19.10 gcc/10.1.0 gcc/8.3.1
conflict        openmpi
conflict        intel-mpi
 
Available software environment(s):
- intel-compilers/19.0.4
- pgi/20.1
- pgi/19.10
- gcc/10.1.0
- gcc/8.3.1
 
If you want to use this module with another software environment,
please contact the support team.
-------------------------------------------------------------------

The compilation is done by using OpenMPI wrappers:

$ mpifort source.f90
 
$ mpicc source.c
 
$ mpic++ source.C

Since no particular option is necessary for the compilation, you may refer to the GPU compilation section of the index for more information on code compiling using the GPUs.

Adaptation du code

The utilisation of the CUDA-aware MPI GPUDirect functionality on Jean Zay requires an accurate initialisation order for CUDA or OpenACC and MPI in the code :

  1. Initialisation of CUDA or OpenACC
  2. Choice of the GPU which each MPI process should use (binding step)
  3. Initialisation of MPI.

Caution: Random crashes can occur if this initialisation order is not respected.

Therefore, a slight adjustment of your code could be needed in order for you to use this functionality on Jean Zay.

You will find below a subroutine example in Fortran and in C which allows the OpenACC initialisation before the MPI initialisation, then a CUDA example.

Important: This example only works when exactly one MPI process is assigned per GPU.

OpenACC example

Fortran version

init_acc.f90
#ifdef _OPENACC
subroutine initialisation_openacc
 
    USE openacc
 
    character(len=6) :: local_rank_env
    integer          :: local_rank_env_status, local_rank
 
    ! Initialisation of OpenACC
    !$acc init
 
    ! Recovery of the local rank of the process via the environment variable
    ! set by Slurm, as MPI_Comm_rank cannot be used here
    ! because this routine is used BEFORE the initialisation of MPI
    call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)
 
    if (local_rank_env_status == 0) then
        read(local_rank_env, *) local_rank
        ! Define the GPU to use via OpenACC
        call acc_set_device_num(local_rank, acc_get_device_type())
    else
        print *, "Error: impossible to determine the local rank of the process"
        stop 1
    end if
end subroutine initialisation_openacc
#endif

Example of usage:

init_acc_mpi.f90
! Initialise OpenACC...
#ifdef _OPENACC
  call initialisation_openacc
#endif
! ... before initialising MPI
  call mpi_init(code)

C Version

init_acc.c
#ifdef _OPENACC
void initialisation_openacc()
{
    char* local_rank_env;
    int local_rank;
 
    /* Initialisation of OpenACC */
    #pragma acc init
 
    /* Recovery of the local rank of the process via the environment variable
       set by Slurm, as MPI_Comm_rank cannot be used here because this routine
       is used BEFORE the initialisation of MPI*/
 
    local_rank_env = getenv("SLURM_LOCALID");
 
    if (local_rank_env) {
        local_rank = atoi(local_rank_env);
        /* Define the GPU to use via OpenACC */
        acc_set_device_num(local_rank, acc_get_device_type());
    } else {
        printf("Error : impossible to determine the local rank of the process\n");
        exit(1);
    }
}
#endif

Example of usage :

init_acc_mpi.c
#ifdef _OPENACC
/* Initialise OpenACC... */
initialisation_openacc();
#endif
/* ... before initialising MPI */
MPI_Init(&argc, &argv);

CUDA example

init_cuda.c
#include <cuda.h>
#include <cuda_runtime.h>
 
void initialisation_cuda()
{
    char* local_rank_env;
    int local_rank;
    cudaError_t cudaRet;
 
     /* Recovery of the local rank of the process via the environment variable
        set by Slurm, as  MPI_Comm_rank cannot be used here because this routine
        is used BEFORE the initialisation of MPI*/
    local_rank_env = getenv("SLURM_LOCALID");
 
    if (local_rank_env) {
        local_rank = atoi(local_rank_env);
        /* Define the GPU to use for each MPI process */
        cudaRet = cudaSetDevice(local_rank);
        if(cudaRet != CUDA_SUCCESS) {
            printf("Erreur: cudaSetDevice has failed\n");
            exit(1);
        }
    } else {
        printf("Error : impossible to determine the local rank of the process\n");
        exit(1);
    }
}

Example of usage :

init_cuda_mpi.c
/* Initialise CUDA... */
initialisation_cuda();
/* ... before initialising MPI */
MPI_Init(&argc, &argv);

Execution of the code

For code execution, you need to be sure to load the same MPI library as the one used for the code compilation by using the module command, and then use the srun command to begin the execution.

The support of CUDA-aware and GPUDirect is thus activated by default without any supplementary operation.

For more information about submission of multi-GPU jobs using MPI, consult the page Execution of a multi-GPU Cuda-aware MPI and GPU job in batch.