Jean Zay: Usage of CUDA MPS

Introduction

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA programming interface. The MPS execution architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to use the Hyper-Q capabilities of the latest NVIDIA GPUs. Hyper-Q allows CUDA kernels from several processes to be executed concurrently on the same GPU, which can improve performance when the GPU compute capacity is underused by a single application process.
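
On Jean Zay, the MPS service is managed by Slurm (see the Usage section below), so you do not need to start it yourself. For reference only, on a machine where MPS is not handled by the job scheduler, the control daemon is typically started and stopped with the nvidia-cuda-mps-control utility; the lines below are a minimal sketch of that manual usage.

    # start the MPS control daemon (one per GPU node); it spawns an MPS server
    # on demand, through which the CUDA contexts of client processes are funneled
    nvidia-cuda-mps-control -d

    # ... launch the CUDA/MPI application processes here ...

    # stop the MPS control daemon once the application has finished
    echo quit | nvidia-cuda-mps-control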

Usage

CUDA MPS is included by default in the various CUDA modules available to users.
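
For example, the CUDA modules installed on the machine can be listed and loaded with the usual module commands (the module names and versions returned by module avail depend on the software environment currently deployed):

$ module avail cuda
$ module load cuda      # or an explicit version listed by module avail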

For a multi-GPU MPI batch job, CUDA MPS can be activated with the -C mps option. The node must then be reserved exclusively via the --exclusive option.

  • For an execution via the default gpu partition (nodes with 40 physical cores and 4 GPUs) using only one node:

    mps_multi_gpu_mpi.slurm
    #!/bin/bash
    #SBATCH --job-name=gpu_cuda_mps_multi_mpi     # name of job
    #SBATCH --ntasks=40                   # total number of MPI tasks
    #SBATCH --ntasks-per-node=40          # number of MPI tasks per node (all physical cores)
    #SBATCH --gres=gpu:4                  # number of GPUs per node (all GPUs)
    #SBATCH --cpus-per-task=1             # number of cores per task 
    # /!\ Caution: In Slurm vocabulary, "multithread" refers to hyperthreading.
    #SBATCH --hint=nomultithread           # hyperthreading deactivated
    #SBATCH --time=00:10:00                # maximum execution time requested (HH:MM:SS)
    #SBATCH --output=gpu_cuda_mps_multi_mpi%j.out # name of output file
    #SBATCH --error=gpu_cuda_mps_multi_mpi%j.out  # name of error file (here, common with the output)
    #SBATCH --exclusive               # exclusively reserves the node 
    #SBATCH -C mps                     # the MPS is activated  
     
    # cleans out modules loaded in interactive and inherited by default
    module purge
     
    # loads modules
    module load ...
     
    # echo of launched commands
    set -x
     
    # code execution: the 4 GPUs of the node are shared by the 40 MPI tasks via MPS
    srun ./executable_mps_multi_gpu_mpi

Submit script via the sbatch command:

$ sbatch mps_multi_gpu_mpi.slurm
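
The job can then be followed with the standard Slurm commands, for example:

$ squeue -u $USER              # list your queued and running jobs
$ scontrol show job <job_id>   # detailed information about a given job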

Comments:

  • Similarly, you can execute your job on an entire node of the gpu_p2 partition (nodes with 24 physical cores and 8 GPUs) by specifying:
    #SBATCH --partition=gpu_p2            # GPU partition requested
    #SBATCH --ntasks=24                   # total number of MPI tasks
    #SBATCH --ntasks-per-node=24          # number of MPI tasks per node (all physical cores)
    #SBATCH --gres=gpu:8                  # number of GPUs per node (all GPUs)
    #SBATCH --cpus-per-task=1             # number of cores per task
  • Be careful: even if you use only part of the node, it must be reserved in exclusive mode. In particular, this means that the entire node is billed.
  • We recommend that you compile and execute your codes in the same environment by loading the same modules.
  • In this example, we assume that the executable_mps_multi_gpu_mpi executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.
  • The computation output file, gpu_cuda_mps_multi_mpi<job_number>.out, is also found in the submission directory. It is created at the start of the job execution: editing or modifying it while the job is running can disrupt the execution.
  • The module purge is made necessary by Slurm's default behaviour: any modules loaded in your environment at the moment you launch sbatch are passed to the submitted job, making the execution of your job dependent on what you have done previously.
  • To avoid errors in the automatic task distribution, we recommend using srun to execute your code instead of mpirun. This guarantees a distribution which conforms to the specifications of the resources you requested in the submission file.
  • Jobs have resources defined in Slurm by default, per partition and per QoS (Quality of Service). You can modify the limits or specify another partition and / or QoS as shown in our documentation detailing the partitions and QoS.
  • For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project accounting (hours allocation of the project) to which the computing hours of the job should be charged, as indicated in our documentation detailing the computing hours accounting (see the example after this list).
  • We strongly recommend that you consult our documentation detailing the computing hours accounting to ensure that the hours consumed by your jobs are deducted from the correct accounting.
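
As an illustration, the accounting can be selected directly in the submission script with the --account option. The value my_project@v100 below is only a placeholder; the exact value to use for your allocation is given in the computing hours accounting documentation.

    #SBATCH --account=my_project@v100     # hypothetical example: charge the job to the GPU hours of "my_project"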

Documentation

Official documentation from NVIDIA