Jean Zay: Execution of a single-GPU job in batch

Jobs are managed on all of the nodes by the Slurm software.

To submit a single-GPU job in batch on Jean Zay, you must first create a submission script:

  • For a job with 1 GPU in the default gpu partition:

    single_gpu.slurm
    #!/bin/bash
    #SBATCH --job-name=single_gpu        # name of job
    # Other partitions are usable by activating/uncommenting
    # one of the following directives:
    ##SBATCH -C v100-16g                 # uncomment to target only 16GB V100 GPU
    ##SBATCH -C v100-32g                 # uncomment to target only 32GB V100 GPU
    ##SBATCH --partition=gpu_p2          # uncomment for gpu_p2 partition (32GB V100 GPU)
    ##SBATCH -C a100                     # uncomment for gpu_p5 partition (80GB A100 GPU)
    # Here, reservation of 10 CPUs (for 1 task) and 1 GPU on a single node:
    #SBATCH --nodes=1                    # we request one node
    #SBATCH --ntasks-per-node=1          # with one task per node (= number of GPUs here)
    #SBATCH --gres=gpu:1                 # number of GPUs per node (max 8 with gpu_p2, gpu_p5)
    # The number of CPUs per task must be adapted according to the partition used. Since
    # only one GPU is reserved here (i.e. 1/4 or 1/8 of the GPUs of the node depending on the partition),
    # the ideal is to reserve 1/4 or 1/8 of the CPUs of the node for the single task:
    #SBATCH --cpus-per-task=10           # number of cores per task (1/4 of the CPUs of the 4-GPU node)
    ##SBATCH --cpus-per-task=3           # number of cores per task for gpu_p2 (1/8 of the 8-GPU node)
    ##SBATCH --cpus-per-task=8           # number of cores per task for gpu_p5 (1/8 of the 8-GPU node)
    # /!\ Caution, "multithread" in Slurm vocabulary refers to hyperthreading.
    #SBATCH --hint=nomultithread         # hyperthreading is deactivated
    #SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
    #SBATCH --output=gpu_single%j.out    # name of output file
    #SBATCH --error=gpu_single%j.out     # name of error file (here, in common with the output file)
     
    # Clean out the modules loaded in interactive mode and inherited by default
    module purge
     
    # Uncomment the following module command if you are using the "gpu_p5" partition
    # to have access to the modules compatible with this partition.
    #module load cpuarch/amd
     
    # Loading of modules
    module load ...
     
    # Echo of launched commands
    set -x
     
    # For the "gpu_p5" partition, the code must be compiled with the compatible modules.
    # Code execution
    ./single_gpu_exe

    To launch a Python script, it is necessary to replace the last line with:

    # Code execution
    python -u script_mono_gpu.py

    Comment: The Python option -u (= unbuffered) deactivates the buffering of standard output, which is performed automatically by Slurm.

Next, submit this script via the sbatch command:

$ sbatch single_gpu.slurm
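
Once the job has been submitted, it can be followed with the standard Slurm commands. A minimal sketch, in which <job_id> stands for the job ID returned by sbatch:

$ squeue -u $USER              # list your pending and running jobs
$ scontrol show job <job_id>   # show detailed information about the job
$ scancel <job_id>             # cancel the job if needed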

Comments:

  • To execute your job on the gpu_p2 partition, you have to specify --partition=gpu_p2 and --cpus-per-task=3.
  • To execute your job on the gpu_p5 partition, you have to specify -C a100 and --cpus-per-task=8. Warning: the modules accessible by default are not compatible with this partition; you must first load the cpuarch/amd module in order to list and load the compatible modules. For more information, see Modules compatible with gpu_p5 partition. A recap of the directives for both partitions is given after this list.
  • We recommend that you compile and execute your code in the same environment by loading the same modules.
  • In this example, we assume that the single_gpu_exe executable file is located in the submission directory, i.e. the directory in which the sbatch command is entered.
  • By default, Slurm buffers the standard output of a Python script, which can result in a significant delay between the script execution and the appearance of the output in the logs. To deactivate this buffering, add the option -u (= unbuffered) to the python call. Alternatively, setting the PYTHONUNBUFFERED environment variable to 1 in your submission script (export PYTHONUNBUFFERED=1) has the same effect (see the sketch after this list). This variable is set by default in the virtual environments installed on Jean Zay by the support team.
  • The computation output file gpu_single<job_number>.out is also found in the submission directory. It is created at the very beginning of the job execution: editing or modifying it while the job is running can disrupt the execution.
  • The module purge is made necessary by the default Slurm behaviour: any modules loaded in your environment at the moment you launch sbatch are passed to the submitted job.
  • All jobs have resources defined in Slurm per partition and per QoS (Quality of Service) by default. You can modify the limits by specifying another partition and/or QoS, as shown in our documentation detailing the partitions and QoS.
  • For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project accounting (hours allocation of the project) to which the computing hours of the job should be charged, as indicated in our documentation detailing the project hours management (see the sketch after this list).
  • We strongly recommend that you consult our documentation detailing the project hours management to ensure that the hours consumed by your jobs are deducted from the correct accounting.
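
Recap of the two partition-specific adjustments mentioned in the comments above; this is only a sketch of what changes for each partition, and the corresponding lines already appear, commented out, in single_gpu.slurm:

    # For the gpu_p2 partition (32GB V100 GPU):
    #SBATCH --partition=gpu_p2
    #SBATCH --cpus-per-task=3            # 1/8 of the CPUs of the 8-GPU node
     
    # For the gpu_p5 partition (80GB A100 GPU):
    #SBATCH -C a100
    #SBATCH --cpus-per-task=8            # 1/8 of the CPUs of the 8-GPU node
    # ... plus, after the "module purge", loading of the cpuarch/amd module:
    module load cpuarch/amd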
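
As an example of the PYTHONUNBUFFERED alternative mentioned above, a minimal sketch of the end of the submission script, reusing the script_mono_gpu.py name from the Python example:

    # Deactivate the buffering of Python standard output (equivalent to the -u option)
    export PYTHONUNBUFFERED=1
     
    # Code execution
    python script_mono_gpu.py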
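
Likewise, a sketch of how another QoS and/or accounting can be requested, either as directives in single_gpu.slurm or on the sbatch command line; <qos_name> and <project_account> are placeholders whose valid values are given in the documentation pages referenced above:

    #SBATCH --qos=<qos_name>             # QoS to use instead of the default one
    #SBATCH --account=<project_account>  # project accounting to be charged

or, equivalently:

$ sbatch --qos=<qos_name> --account=<project_account> single_gpu.slurm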