Jean Zay: Execution of a multi-GPU CUDA-aware MPI and GPUDirect job in batch

Jobs are managed on all of the nodes by the Slurm software.

To submit a multi-GPU CUDA-aware MPI and GPUDirect job in batch on Jean Zay, you must create a submission script modelled on one of the two examples given below.

IMPORTANT: The code must be executed with the same CUDA-aware OpenMPI library as the one used to compile it. Moreover, using the GPUDirect functionality of CUDA-aware MPI on Jean Zay requires a precise initialisation order of CUDA or OpenACC and MPI in the code. Please refer to the page CUDA-aware MPI and GPUDirect.
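
Before submitting, you can check that the OpenMPI library loaded in your environment was indeed built with CUDA-aware support. A minimal check, assuming the ompi_info utility of the loaded OpenMPI module is available in your PATH:

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

A value of true indicates a CUDA-aware build; if it is false, switch to the CUDA-aware variant of the OpenMPI module before compiling and executing your code.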

  • For an execution using 3 GPUs (on the same node) of the default gpu partition:

    multi_gpu_mpi_cuda-aware.slurm
    #!/bin/bash
    #SBATCH --job-name=multi_gpu_mpi_cuda-aware     # name of job
    # Other partitions are usable by activating/uncommenting
    # one of the following directives:
    ##SBATCH -C v100-16g                 # uncomment to target only 16GB V100 GPU
    ##SBATCH -C v100-32g                 # uncomment to target only 32GB V100 GPU
    ##SBATCH --partition=gpu_p2          # uncomment for gpu_p2 partition (32GB V100 GPU)
    ##SBATCH -C a100                     # uncomment for gpu_p5 partition (80GB A100 GPU)
    # Here, reservation of 3x10=30 CPUs (for 3 tasks) and 3 GPUs (1 GPU per task) on a single node:
    #SBATCH --nodes=1                    # number of nodes
    #SBATCH --ntasks-per-node=3          # number of MPI tasks per node (= number of GPUs per node)
    #SBATCH --gres=gpu:3                 # number of GPUs per node (max 8 with gpu_p2, gpu_p5)
    # The number of CPUs per task must be adapted according to the partition used. Since only
    # one GPU per task is reserved here (i.e. 1/4 or 1/8 of the GPUs of the node, depending
    # on the partition), it is best to reserve 1/4 or 1/8 of the CPUs of the node for each task:
    #SBATCH --cpus-per-task=10           # number of cores per task (1/4 of the node here)
    ##SBATCH --cpus-per-task=3           # number of cores per task for gpu_p2 (1/8 of the 8-GPUs node)
    ##SBATCH --cpus-per-task=8           # number of cores per task for gpu_p5 (1/8 of the 8-GPUs node)
    # /!\ Caution: In Slurm vocabulary, "multithread" refers to hyperthreading
    #SBATCH --hint=nomultithread         # hyperthreading deactivated
    #SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
    #SBATCH --output=multi_gpu_mpi%j.out # name of output file
    #SBATCH --error=multi_gpu_mpi%j.out  # name of error file (here, common with the output file)
     
    # Clean out the modules loaded in interactive mode and inherited by default
    module purge
     
    # Uncomment the following module command if you are using the "gpu_p5" partition
    # to have access to the modules compatible with this partition.
    #module load cpuarch/amd
     
    # Loads modules
    module load ...
     
    # Echo of launched commands
    set -x
     
    # For the "gpu_p5" partition, the code must be compiled with the compatible modules.
    # Code execution
    srun ./executable_multi_gpu_mpi_cuda-aware
  • For an execution using 8 GPUs (i.e. 2 complete nodes) of the default gpu partition:

    multi_gpu_mpi_cuda-aware.slurm
    #!/bin/bash
    #SBATCH --job-name=multi_gpu_mpi_cuda-aware     # name of job
    # Other partitions are usable by activating/uncommenting
    # one of the following directives:
    ##SBATCH -C v100-16g                 # uncomment to target only 16GB V100 GPU
    ##SBATCH -C v100-32g                 # uncomment to target only 32GB V100 GPU
    ##SBATCH --partition=gpu_p2          # uncomment for gpu_p2 partition (32GB V100 GPU)
    ##SBATCH -C a100                     # uncomment for gpu_p5 partition (80GB A100 GPU)
    # Here, reservation of 8x10=80 CPUs (4 tasks per node) and 8 GPUs (4 GPUs per node) on 2 nodes:
    #SBATCH --ntasks=8                   # total number of MPI tasks
    #SBATCH --ntasks-per-node=4          # number of MPI tasks per node (= number of GPUs per node)
    #SBATCH --gres=gpu:4                 # number of GPUs per node (max 8 with gpu_p2, gpu_p5)
    # The number of CPUs per task must be adapted according to the partition used. Since only
    # one GPU per task is reserved here (i.e. 1/4 or 1/8 of the GPUs of the node, depending
    # on the partition), it is best to reserve 1/4 or 1/8 of the CPUs of the node for each task:
    #SBATCH --cpus-per-task=10           # number of cores per task (a quarter of the node here)
    ##SBATCH --cpus-per-task=3           # number of cores per task for gpu_p2 (1/8 of the 8-GPUs node)
    ##SBATCH --cpus-per-task=8           # number of cores per task for gpu_p5 (1/8 of the 8-GPUs node)
    # /!\ Caution: In Slurm vocabulary, "multithread" refers to hyperthreading. 
    #SBATCH --hint=nomultithread         # hyperthreading deactivated 
    #SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
    #SBATCH --output=multi_gpu_mpi%j.out # name of output file
    #SBATCH --error=multi_gpu_mpi%j.out  # name of error file (here, common with the output file)
     
    # Clean out the modules loaded in interactive mode and inherited by default
    module purge
     
    # Uncomment the following module command if you are using the "gpu_p5" partition
    # to have access to the modules compatible with this partition.
    #module load cpuarch/amd
     
    # Loads modules
    module load ...
     
    # Echo of launched commands
    set -x
     
    # For the "gpu_p5" partition, the code must be compiled with the compatible modules.
    # Code execution
    srun ./executable_multi_gpu_mpi_cuda-aware

Submit the script via the sbatch command:

$ sbatch multi_gpu_mpi_cuda-aware.slurm
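
The module load ... line in the scripts above stands for the environment providing the same CUDA-aware OpenMPI library as the one used to compile the code. As an illustration only, it could look like the following, where the module names and versions are hypothetical (use module avail to list the modules actually installed and pick the ones matching your compilation):

$ module purge
$ module load nvidia-compilers/23.9 cuda/12.1 openmpi/4.1.5-cuda   # hypothetical module names/versions
$ module list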

Comments:

  • Similarly, you can execute your job on the gpu_p2 partition by specifying --partition=gpu_p2 and --cpus-per-task=3.
  • Similarly, you can execute your job on the gpu_p5 partition by specifying -C a100 and --cpus-per-task=8. Warning: the modules accessible by default are not compatible with this partition; you must first load the cpuarch/amd module to be able to list and load the compatible modules. For more information, see Modules compatible with the gpu_p5 partition.
  • We recommend that you compile and execute your code in the same environment by loading the same modules.
  • In this example, we assume that the executable_multi_gpu_mpi_cuda-aware executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.
  • The computation output file multi_gpu_mpi<job_number>.out is also found in the submission directory. It is created at the very beginning of the job execution: editing or modifying it while the job is running can disrupt the execution.
  • The module purge is necessary because of Slurm default behaviour: Any modules which are loaded in your environment at the moment when you launch sbatch will be passed to the submitted job, thereby making your job execution dependent on what you have done previously.
  • To avoid errors from the automatic task allocation, we recommend that you use srun to execute your code instead of mpirun: This will guarantee a distribution consistent with the resources requested in your submission file.
  • All jobs have resources defined in Slurm per partition and per QoS (Quality of Service) by default. You can modify the limits by specifying another partition and/or QoS as shown in our documentation detailing the partitions and QoS.
  • For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project accounting (hours allocation of the project) on which to charge the computing hours of the job, as indicated in our documentation detailing the project hours management (see the example after this list).
  • We strongly recommend that you consult our documentation detailing the project hours management to ensure that the hours consumed by your jobs are deducted from the correct accounting.
  • The error CUDA failure: cuCtxGetDevice() during execution probably means that the initialisation order for CUDA or OpenACC and MPI was not respected, as indicated on the page CUDA-aware MPI and GPUDirect.
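
Regarding the project accounting mentioned above, the accounting can be selected directly in the submission script or at submission time. A minimal sketch, assuming a hypothetical project code abc charged on the V100 GPU accounting:

#SBATCH --account=abc@v100           # abc = hypothetical project code; use @a100 for the A100 (gpu_p5) accounting

or, equivalently, on the command line:

$ sbatch --account=abc@v100 multi_gpu_mpi_cuda-aware.slurm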