
Jean Zay: Memory allocation with Slurm on GPU partitions

The Slurm options --mem, --mem-per-cpu and --mem-per-gpu are currently disabled on Jean Zay because they do not allow you to properly configure the memory allocation per node of your job. The memory allocation per node is automatically determined by the number of reserved CPUs per node.

To adjust the amount of memory per node allocated to your job, you must adjust the number of CPUs reserved per task/process (in addition to the number of tasks and/or GPUs) by specifying the following option in your batch scripts, or when using salloc in interactive mode:

--cpus-per-task=...      # --cpus-per-task=1 by default 

Important: By default, --cpus-per-task=1. Therefore, if you do not modify this value as explained below, you will not have access to all of the memory that could otherwise be available per reserved task/GPU, and you risk quickly running into memory overflows in the processes executing on the CPUs.

The maximum value that can be specified in --cpus-per-task=... depends on the number of processes/tasks requested per node (--ntasks-per-node=...) and on the node profile (total number of cores per node), which itself depends on the partition used.
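
For illustration, the options mentioned above can be combined as in the following sketch, either with salloc in interactive mode or as directives in a batch script (the values shown are only examples; the appropriate --cpus-per-task value for each partition is detailed below, and your usual project and partition options should be added):

# Interactive allocation (illustrative values):
salloc --nodes=1 --ntasks-per-node=1 --gres=gpu:1 --cpus-per-task=10 --hint=nomultithread

# Equivalent directives in a batch script (illustrative values):
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks-per-node=1      # number of tasks (processes) per node
#SBATCH --gres=gpu:1             # number of GPUs per node
#SBATCH --cpus-per-task=10       # number of CPU cores per task (determines the memory per node)
#SBATCH --hint=nomultithread     # reserve physical cores (hyperthreading deactivated)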

Note that memory overflows can also occur on the GPUs themselves: each GPU has its own dedicated memory, whose size depends on the partition used.

On the default gpu partition

Each node of the default gpu partition offers 156 GB of usable memory and 40 CPU cores. The memory allocation is computed automatically on the basis of:

  • 156/40 = 3.9 GB per reserved CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

Each compute node of the default gpu partition contains 4 GPUs and 40 CPU cores, so you can reserve 1/4 of the node memory per GPU by requesting 10 CPUs (i.e. 1/4 of the 40 cores) per GPU:

--cpus-per-task=10     # reserves 1/4 of the node memory per GPU (default gpu partition)

In this way, you have access to 3.9*10 = 39 GB of node memory per GPU if hyperthreading is deactivated (if not, half of that memory).
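
For example, a job reserving a single GPU and a quarter of the node memory on the default gpu partition could be submitted with a script such as the following sketch (the job name, output file and execution line are purely illustrative; add your usual project options):

#!/bin/bash
#SBATCH --job-name=gpu_example       # illustrative job name
#SBATCH --output=gpu_example%j.out   # illustrative output file
#SBATCH --nodes=1                    # 1 node
#SBATCH --ntasks-per-node=1          # 1 task (process) per node
#SBATCH --gres=gpu:1                 # 1 of the 4 GPUs of the node
#SBATCH --cpus-per-task=10           # 10 of the 40 cores, i.e. 39 GB of node memory
#SBATCH --hint=nomultithread         # reserve physical cores (hyperthreading deactivated)

srun ./my_executable                 # illustrative execution line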

Note: You can request more than 39 GB of memory per GPU if necessary (i.e. if a process needs more memory). However, the job will then be overcharged: Slurm allocates additional GPU resources which will not be used, so the GPU hours consumed by the job are counted as if you had reserved more GPUs without using them, with no benefit for the computation time (see the comments at the bottom of the page).

On the gpu_p2 partition

The gpu_p2 partition is divided into two subpartitions:

  • The gpu_p2s subpartition with 360 GB usable memory per node
  • The gpu_p2l subpartition with 720 GB usable memory per node

As each node of this partition contains 24 CPU cores, the memory allocation is automatically determined on the basis of:

  • 360/24 = 15 GB per reserved CPU core on the gpu_p2s partition if hyperthreading is deactivated (Slurm option --hint=nomultithread)
  • 720/24 = 30 GB per reserved CPU core on the gpu_p2l partition if hyperthreading is deactivated

Each compute node of the gpu_p2 partition contains 8 GPUs and 24 CPU cores, so you can reserve 1/8 of the node memory per GPU by requesting 3 CPUs (i.e. 1/8 of the 24 cores) per GPU:

--cpus-per-task=3    # reserves 1/8 of the node memory per GPU (gpu_p2 partition)

In this way, you have access to:

  • 15*3 = 45 GB of node memory per GPU on the gpu_p2s partition
  • 30*3 = 90 GB of node memory per GPU on the gpu_p2l partition

if hyperthreading is deactivated (if not, half of that memory).
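
For example, a job reserving a single GPU and one eighth of the node memory on the gpu_p2s subpartition could use directives such as the following sketch (the partition name and execution line are illustrative; adapt them to your case and add your usual project options):

#SBATCH --partition=gpu_p2s      # illustrative: gpu_p2s, gpu_p2l or gpu_p2 depending on the memory needed
#SBATCH --nodes=1                # 1 node
#SBATCH --ntasks-per-node=1      # 1 task (process) per node
#SBATCH --gres=gpu:1             # 1 of the 8 GPUs of the node
#SBATCH --cpus-per-task=3        # 3 of the 24 cores, i.e. 45 GB (gpu_p2s) or 90 GB (gpu_p2l) per GPU
#SBATCH --hint=nomultithread     # reserve physical cores (hyperthreading deactivated)

srun ./my_executable             # illustrative execution line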

Note: You can request more than 45 GB (with gpu_p2s) or 90 GB (with gpu_p2l) of memory per GPU if necessary (i.e. if a process needs more memory). However, the job will then be overcharged: Slurm allocates additional GPU resources which will not be used, so the GPU hours consumed by the job are counted as if you had reserved more GPUs without using them, with no benefit for the computation time (see the comments at the bottom of the page).

On the gpu_p5 partition

Each node of the gpu_p5 partition offers 468 GB of usable memory and 64 CPU cores. The memory allocation is therefore computed automatically on the basis of:

  • 468/64 = 7.3 GB per reserved CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

Each compute node of the gpu_p5 partition contains 8 GPUs and 64 CPU cores, so you can reserve 1/8 of the node memory per GPU by requesting 8 CPUs (i.e. 1/8 of the 64 cores) per GPU:

--cpus-per-task=8     # reserves 1/8 of the node memory per GPU (gpu_p5 partition)

In this way, you have access to 8*7.3 = 58 GB of node memory per GPU if hyperthreading is deactivated (if not, half of that memory).
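
For example, a job reserving a single GPU and one eighth of the node memory on the gpu_p5 partition could use directives such as the following sketch (the directive selecting the gpu_p5 nodes and your usual project options must be added; the execution line is illustrative):

#SBATCH --nodes=1                # 1 node
#SBATCH --ntasks-per-node=1      # 1 task (process) per node
#SBATCH --gres=gpu:1             # 1 of the 8 GPUs of the node
#SBATCH --cpus-per-task=8        # 8 of the 64 cores, i.e. 58 GB of node memory per GPU
#SBATCH --hint=nomultithread     # reserve physical cores (hyperthreading deactivated)

srun ./my_executable             # illustrative execution line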

Note: You can request more than 58 GB of memory per GPU if necessary (i.e. if a process needs more memory). However, the job will then be overcharged: Slurm allocates additional GPU resources which will not be used, so the GPU hours consumed by the job are counted as if you had reserved more GPUs without using them, with no benefit for the computation time (see the comments at the bottom of the page).

Comments

  • You can reserve more memory per GPU by increasing the value of --cpus-per-task, as long as the request does not exceed the total amount of memory available on the node. Important: The computing hours are counted proportionately. For example, if you reserve 1 GPU on the default gpu partition with the options --ntasks=1 --gres=gpu:1 --cpus-per-task=20, the invoice will be equivalent to a job running on 2 GPUs because of the option --cpus-per-task=20 (a sketch of this accounting rule is given after this list).
  • If you reserve a node in exclusive mode, you have access to the entire memory capacity of the node, regardless of the value of --cpus-per-task. The invoice will be the same as for a job running on an entire node.
  • The amount of memory allocated to your job can be seen by running the command:
    $ scontrol show job $JOBID     # look for the value of the "mem" field

    Important: While the job is in the wait queue (PENDING), Slurm estimates the memory allocated to the job on the basis of logical cores. Therefore, if you have reserved physical cores (with --hint=nomultithread), the value shown can be half of the expected value. It is updated and becomes correct when the job starts.

  • To reserve resources on the prepost partition, you may refer to: Memory allocation with Slurm on CPU partitions. The GPU available on each node of the prepost partition is allocated to you automatically, without needing to specify the --gres=gpu:1 option.
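
The sketch announced in the first comment above gives a rough idea of the proportional accounting. It is not an official IDRIS tool, simply an illustration for the default gpu partition, where 10 reserved cores correspond to one GPU:

# Illustrative estimate of the GPU count used for accounting (default gpu partition: 40 cores / 4 GPUs)
gpus_requested=1
cpus_per_task=20
cores_per_gpu=10                                                   # 40 cores / 4 GPUs
billed=$(( (cpus_per_task + cores_per_gpu - 1) / cores_per_gpu ))  # ceiling division
if [ "$billed" -lt "$gpus_requested" ]; then billed=$gpus_requested; fi
echo "Accounted as ${billed} GPU(s)"                               # -> 2 GPU(s), as in the example above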