This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.
GPU Partitions
Available GPU Partitions
All DARI projects (Dynamic Access (DA), Regular Access (RA), ...) with GPU hours have access to Slurm partitions defined on Jean Zay.
V100 Partition: gpu_p13 (default)
Projects with GPU V100 hours have access by default to a partition allowing the use of all types of quadri-GPU accelerated nodes with 16GB or 32GB of memory.
The default execution time is 10 minutes; the maximum is 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for V100 below).
This partition includes both Nvidia V100 GPUs with 16 GB of memory and Nvidia V100 GPUs with 32 GB of memory. If you wish to limit yourself to a single type of GPU, you must specify this by adding one of the following Slurm directives to your scripts:
- #SBATCH -C v100-16g to select nodes with GPUs with 16 GB of memory (i.e. gpu_p3),
- #SBATCH -C v100-32g to select nodes with GPUs with 32 GB of memory (i.e. gpu_p1).
If your job can run on GPUs with either 16 GB or 32 GB of memory, it is preferable not to target a specific type of node (i.e. do not specify the -C v100-16g or -C v100-32g option) to limit the queue waiting time for your jobs.
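As an illustration, a job script for the default partition might look like the sketch below. The job name, the resource values (one GPU, 10 CPU cores, one hour) and the gpu_report helper are assumptions chosen for the example, not values prescribed by this page. The -C line is disabled with a second '#' so that the job can start on either node type.

```shell
#!/bin/bash
# Hypothetical job name and resource values: adapt them to your job.
#SBATCH --job-name=v100_any
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Assumption: one GPU, with 10 of the node's 40 CPU cores (40 cores / 4 GPUs).
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
# Above the 10-minute default, below the maximum of the default QoS.
#SBATCH --time=01:00:00
# Disabled directive: remove one '#' only if 32 GB per GPU is really required.
##SBATCH -C v100-32g

# Print the model and memory of the allocated GPU (illustrative helper).
gpu_report() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    else
        echo "nvidia-smi not found: not running on a GPU node"
    fi
}

gpu_report
```

Run outside a GPU node, the script only prints the fallback message, which makes it easy to check before submission.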
V100 Partition: gpu_p2
The gpu_p2 partition is accessible to all users with GPU V100 hours.
It allows you to run computations on the octo-GPU accelerated nodes of Jean Zay. These nodes are equipped with Nvidia V100 GPUs with 32 GB of memory.
The default execution time is 10 minutes; the maximum is 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for V100 below).
This partition includes nodes with 360 GB of RAM and others with 720 GB. Depending on the amount of memory required by your code, you can target one of the following sub-partitions:
- The gpu_p2s sub-partition gives access to octo-GPU V100 nodes with 360 GB of memory.
- The gpu_p2l sub-partition gives access to octo-GPU V100 nodes with 720 GB of memory.
If your job does not need more than 360 GB of memory, it is preferable not to target a specific type of node: specify only the --partition=gpu_p2 option, rather than a sub-partition, to limit the queue waiting time for your jobs.
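The choice between gpu_p2 and its sub-partitions can be sketched as a small helper. The function name pick_p2_partition is hypothetical, and treating 360 GB and 720 GB as hard cut-offs is an assumption made for the example.

```shell
#!/bin/bash
# Suggest a partition for a job needing the given amount of memory (in GB)
# on the octo-GPU V100 nodes. Hypothetical helper, not an IDRIS tool.
pick_p2_partition() {
    local mem_gb=$1
    if [ "$mem_gb" -le 360 ]; then
        echo "gpu_p2"      # any octo-GPU node fits: shortest expected wait
    elif [ "$mem_gb" -le 720 ]; then
        echo "gpu_p2l"     # only the 720 GB nodes fit
    else
        echo "no gpu_p2 node offers ${mem_gb} GB" >&2
        return 1
    fi
}

pick_p2_partition 200
pick_p2_partition 500
```

The result would then be passed to Slurm as --partition=<name>, either as a directive or on the command line.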
A100 Partition: gpu_p5
The gpu_p5 partition is accessible to all users with GPU A100 hours. It allows you to run computations on the 52 octo-GPU accelerated nodes of Jean Zay, which are equipped with Nvidia A100 GPUs connected by an NVLink SXM4 interconnect and having 80 GB of memory per GPU.
The default execution time is 10 minutes; the maximum is 20 hours (i.e. --time=HH:MM:SS ≤ 20:00:00, see the QoS for A100 below).
To use this partition, you must specify the Slurm directive #SBATCH -C a100 in your scripts.
These nodes are equipped with AMD Milan EPYC 7543 processors (64 cores per node), unlike the other nodes, which have Intel processors. You must therefore load the arch/a100 pre-module (module load arch/a100) before any other module in order to access the compatible modules, and you must recompile your codes specifically for this partition.
H100 Partition: gpu_p6
The gpu_p6 partition is accessible to all users with GPU H100 hours. It allows you to run computations on the 364 quadri-GPU accelerated nodes of Jean Zay, which are equipped with Nvidia H100 GPUs connected by an NVLink SXM5 interconnect and having 80 GB of memory per GPU.
The default execution time is 10 minutes; the maximum is 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for H100 below).
To use this partition, you must specify the Slurm directive #SBATCH -C h100 in your scripts.
These nodes have a specific hardware architecture. You must therefore load the arch/h100 pre-module (module load arch/h100) before any other module in order to access the compatible modules, and you must recompile your codes for this partition.
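The A100 and H100 partitions follow the same pattern: a -C constraint plus an architecture pre-module loaded first. The sketch below prints a minimal job script for either one. The make_job_script function, the ./my_program executable and the resource values are hypothetical placeholders; only the -C and module load lines come from the descriptions above.

```shell
#!/bin/bash
# Print a minimal job script for the A100 or H100 partition.
# Hypothetical generator: resource values are placeholders to adapt.
make_job_script() {
    local kind=$1
    case "$kind" in
        a100|h100) ;;
        *) echo "unsupported GPU type: $kind" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/bash
#SBATCH -C $kind
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# The architecture pre-module must come before any other module.
module load arch/$kind

srun ./my_program
EOF
}

make_job_script h100
```

Redirecting the output to a file (for example make_job_script a100 > job.slurm) gives a script that could be submitted with sbatch once the placeholders are replaced.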
Summary
| CPU | GPU | Corresponding Slurm Option |
|---|---|---|
| 40 CPU + 160 GB usable RAM | 4 GPU V100 + 16 or 32 GB RAM | by default (no option) |
| 40 CPU + 160 GB usable RAM | 4 GPU V100 + 16 GB RAM | -C v100-16g |
| 40 CPU + 160 GB usable RAM | 4 GPU V100 + 32 GB RAM | -C v100-32g |
| 24 CPU + 360 or 720 GB usable RAM | 8 GPU V100 + 32 GB RAM | --partition=gpu_p2 |
| 24 CPU + 360 GB usable RAM | 8 GPU V100 + 32 GB RAM | --partition=gpu_p2s |
| 24 CPU + 720 GB usable RAM | 8 GPU V100 + 32 GB RAM | --partition=gpu_p2l |
| 64 CPU + 468 GB usable RAM | 8 GPU A100 + 80 GB RAM | -C a100 |
| 96 CPU + 468 GB usable RAM | 4 GPU H100 + 80 GB RAM | -C h100 |
The default time limits for partitions are deliberately low. For longer executions, you must specify an execution time limit, which must remain below the maximum allowed for the partition and the QoS used (see below). You must then use:
- either the Slurm directive #SBATCH --time=HH:MM:SS in your job,
- or the --time=HH:MM:SS option of the sbatch, salloc or srun commands.
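Before submitting, a requested --time value can be checked against the maximum of the targeted QoS. The following is a sketch only: both function names are hypothetical, and the maximum is passed in hours by the caller rather than read from Slurm.

```shell
#!/bin/bash
# Convert a Slurm-style HH:MM:SS duration to seconds.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that "08" or "09" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the requested duration does not exceed max_hours.
time_within_limit() {
    local requested=$1 max_hours=$2
    [ "$(hms_to_seconds "$requested")" -le $(( max_hours * 3600 )) ]
}

time_within_limit 30:00:00 100 && echo "30:00:00 fits a 100-hour limit"
time_within_limit 30:00:00 20  || echo "30:00:00 exceeds a 20-hour limit"
```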
The default partition does not need to be specified to be used by jobs requesting GPUs. However, all others must be explicitly specified to be used. For example, for the prepost partition, you can use:
- either the Slurm directive #SBATCH --partition=prepost in your job,
- or the --partition=prepost option of the sbatch, salloc or srun commands.
Any job requesting more than one node runs in exclusive mode: the nodes are not shared. In particular, this means that the billed hours are calculated based on all the reserved nodes, including those that are only partially used.
For example, on the default GPU partition, reserving 5 GPUs (i.e. 1 quadri-GPU node + 1 GPU) results in billing for 8 GPUs (i.e. 2 quadri-GPU nodes). However, the full memory of the reserved nodes is then available (around 160 GB usable per V100 GPU node).
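The arithmetic behind this example can be written as a short helper. The function name billed_gpus is hypothetical, and the single-node branch rests on an assumption not stated on this page: a job that fits on one node is billed only for the GPUs it reserves.

```shell
#!/bin/bash
# Number of GPUs billed for a request, given the number of GPUs per node.
# Assumption: a job that fits on one node is billed for the GPUs it
# reserves; a multi-node job is billed for every GPU of every node.
billed_gpus() {
    local requested=$1 per_node=$2
    if [ "$requested" -le "$per_node" ]; then
        echo "$requested"
    else
        # Round the node count up, then count all GPUs on those nodes.
        local nodes=$(( (requested + per_node - 1) / per_node ))
        echo $(( nodes * per_node ))
    fi
}

billed_gpus 5 4   # prints 8 (2 quadri-GPU nodes)
billed_gpus 3 4   # prints 3
```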
The service partitions archive, compil, prepost and visu, whose usage is not billed, are accessible even if you do not have CPU hours.
Available GPU QoS
For each job submitted to a compute partition (i.e. other than archive, compil, prepost and visu), you can specify a QoS (Quality of Service) that will determine the limits and priority of your job.
To specify a QoS different from the default, you can choose to:
- use the Slurm directive #SBATCH --qos=<chosen_qos> in your job,
- or specify the --qos=<chosen_qos> option to the sbatch, salloc or srun commands,
replacing <chosen_qos> with the name of the desired QoS.
Note that the names of the QoS differ depending on the partitions used.
QoS V100 Partition
- Default QoS for all V100 GPU jobs: qos_gpu-t3
  - maximum duration: 20 hours of elapsed time,
  - maximum of 512 GPUs per job,
  - maximum of 512 GPUs per user (all projects combined),
  - maximum of 512 GPUs per project (all users combined).
- QoS for longer executions, which must be specified to be used (see above): qos_gpu-t4
  - maximum duration: 100 hours of elapsed time,
  - maximum of 16 GPUs per job,
  - maximum of 96 GPUs per user (all projects combined),
  - maximum of 96 GPUs per project (all users combined),
  - maximum of 256 GPUs for all jobs requesting this QoS.
- QoS reserved solely for brief executions carried out as part of code development or execution tests, which must be specified to be used (see above): qos_gpu-dev
  - a maximum of 10 jobs (running or queued) simultaneously per user,
  - maximum duration: 2 hours of elapsed time,
  - maximum of 32 GPUs per job,
  - maximum of 32 GPUs per user (all projects combined),
  - maximum of 32 GPUs per project (all users combined),
  - maximum of 512 GPUs for all jobs requesting this QoS.
Summary Table of Limits on V100 GPU QoS
| QoS | Time Limit | Resource Limit per Job | Limit per User (all projects combined) | Limit per Project (all users combined) | Limit per QoS (all users combined) |
|---|---|---|---|---|---|
| qos_gpu-t3 (default) | 20h | 512 GPUs | 512 GPUs | 512 GPUs | |
| qos_gpu-t4 | 100h | 16 GPUs | 96 GPUs | 96 GPUs | 256 GPUs |
| qos_gpu-dev | 2h | 32 GPUs | 32 GPUs, maximum of 10 jobs (running or queued) simultaneously | 32 GPUs | 512 GPUs |
QoS A100 Partition
- Default QoS for all A100 GPU jobs: qos_gpu_a100-t3
  - maximum duration: 20 hours of elapsed time,
  - maximum of 128 GPUs per job,
  - maximum of 256 GPUs per user (all projects combined),
  - maximum of 256 GPUs per project (all users combined).
- QoS reserved solely for brief executions carried out as part of code development or execution tests, which must be specified to be used (see above): qos_gpu_a100-dev
  - a maximum of 10 jobs (running or queued) simultaneously per user,
  - maximum duration: 2 hours of elapsed time,
  - maximum of 32 GPUs per job,
  - maximum of 32 GPUs per user (all projects combined),
  - maximum of 32 GPUs per project (all users combined),
  - maximum of 128 GPUs for all jobs requesting this QoS.
The A100 partition does not have a QoS allowing the execution of jobs longer than 20h due to the limited number of A100 nodes available.
Summary Table of Limits on A100 GPU QoS
| QoS | Time Limit | Resource Limit per Job | Limit per User (all projects combined) | Limit per Project (all users combined) | Limit per QoS (all users combined) |
|---|---|---|---|---|---|
| qos_gpu_a100-t3 (default) | 20h | 128 GPUs | 256 GPUs | 256 GPUs | |
| qos_gpu_a100-dev | 2h | 32 GPUs | 32 GPUs, maximum of 10 jobs (running or queued) simultaneously | 32 GPUs | 128 GPUs |
QoS H100 Partition
- Default QoS for all H100 GPU jobs: qos_gpu_h100-t3
  - maximum duration: 20 hours of elapsed time,
  - maximum of 512 GPUs per job,
  - maximum of 512 GPUs per user (all projects combined),
  - maximum of 512 GPUs per project (all users combined).
- QoS for longer executions, which must be specified to be used (see above): qos_gpu_h100-t4
  - maximum duration: 100 hours of elapsed time,
  - maximum of 16 GPUs per job,
  - maximum of 64 GPUs per user (all projects combined),
  - maximum of 64 GPUs per project (all users combined),
  - maximum of 192 GPUs for all jobs requesting this QoS.
- QoS reserved solely for brief executions carried out as part of code development or execution tests, which must be specified to be used (see above): qos_gpu_h100-dev
  - a maximum of 10 jobs (running or queued) simultaneously per user,
  - maximum duration: 2 hours of elapsed time,
  - maximum of 32 GPUs per job,
  - maximum of 32 GPUs per user (all projects combined),
  - maximum of 32 GPUs per project (all users combined),
  - maximum of 384 GPUs for all jobs requesting this QoS.
Summary Table of Limits on H100 GPU QoS
| QoS | Time Limit | Resource Limit per Job | Limit per User (all projects combined) | Limit per Project (all users combined) | Limit per QoS (all users combined) |
|---|---|---|---|---|---|
| qos_gpu_h100-t3 (default) | 20h | 512 GPUs | 512 GPUs | 512 GPUs | |
| qos_gpu_h100-t4 | 100h | 16 GPUs | 64 GPUs | 64 GPUs | 192 GPUs |
| qos_gpu_h100-dev | 2h | 32 GPUs | 32 GPUs, maximum of 10 jobs (running or queued) simultaneously | 32 GPUs | 384 GPUs |
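The time and per-job GPU limits of the three QoS families can be combined into a small selection helper. This is a sketch under stated assumptions: pick_qos is a hypothetical function, it checks only the per-job limits (not the per-user, per-project or per-QoS caps), and it never proposes a dev QoS because those are reserved for development and tests.

```shell
#!/bin/bash
# Suggest a QoS for a GPU family, a duration in whole hours and a GPU count,
# using only the time and per-job GPU limits of the QoS tables.
pick_qos() {
    local family=$1 hours=$2 gpus=$3
    local prefix t3_gpus t4_gpus
    case "$family" in
        v100) prefix="qos_gpu";      t3_gpus=512; t4_gpus=16 ;;
        a100) prefix="qos_gpu_a100"; t3_gpus=128; t4_gpus=0  ;;  # no t4 QoS on A100
        h100) prefix="qos_gpu_h100"; t3_gpus=512; t4_gpus=16 ;;
        *) echo "unknown GPU family: $family" >&2; return 1 ;;
    esac
    if [ "$hours" -le 20 ] && [ "$gpus" -le "$t3_gpus" ]; then
        echo "${prefix}-t3"
    elif [ "$hours" -le 100 ] && [ "$gpus" -le "$t4_gpus" ]; then
        echo "${prefix}-t4"
    else
        echo "no QoS allows $gpus GPUs for $hours h on $family" >&2
        return 1
    fi
}

pick_qos v100 48 8     # prints qos_gpu-t4
pick_qos a100 10 64    # prints qos_gpu_a100-t3
```

The returned name is what would be given to --qos; a request such as 48 hours on A100 fails, reflecting the absence of a long-duration QoS on that partition.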