⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

GPU Partitions

Available GPU Partitions

All DARI projects (Dynamic Access (DA), Regular Access (RA), ...) with GPU hours have access to the Slurm partitions defined on Jean Zay.

V100 Partition: gpu_p13 (default)

Projects with V100 GPU hours have default access to a partition that covers all types of quadri-GPU accelerated nodes, with 16 GB or 32 GB of GPU memory.

By default, the execution time is 10 minutes and cannot exceed 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for V100 below).

This partition includes both NVIDIA V100 GPUs with 16 GB of memory and NVIDIA V100 GPUs with 32 GB of memory. If you wish to limit yourself to a single type of GPU, you must specify this by adding one of the following Slurm directives to your scripts:

#SBATCH -C v100-16g to select nodes with GPUs with 16 GB of memory (i.e. gpu_p3)
#SBATCH -C v100-32g to select nodes with GPUs with 32 GB of memory (i.e. gpu_p1)

Note

If your job can run on GPUs with either 16 GB or 32 GB of memory, it is preferable not to target a specific type of node (i.e. do not specify the -C v100-16g or -C v100-32g option) to limit the queue waiting time for your jobs.
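As a minimal sketch, a job header for the default V100 partition might look like the following. The job name, the resource counts and the use of --gres=gpu:1 to request a GPU are illustrative assumptions, not values taken from this page. The -C line is optional and should be dropped when either memory size is acceptable.

```shell
#!/bin/bash
#SBATCH --job-name=v100_example    # hypothetical job name
#SBATCH --nodes=1                  # single node (assumed)
#SBATCH --ntasks-per-node=1        # one task (assumed)
#SBATCH --gres=gpu:1               # one GPU via Slurm generic resources (assumed request)
#SBATCH --time=01:00:00            # explicit limit, above the 10-minute default
#SBATCH -C v100-32g                # optional: restrict to nodes with 32 GB GPUs

# Placeholder payload: replace with your own commands.
echo "Job ${SLURM_JOB_ID:-unknown} running on ${SLURM_JOB_NODELIST:-unknown}"
```

No --partition line is needed here, since gpu_p13 is the default partition for V100 jobs.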

V100 Partition: gpu_p2

The gpu_p2 partition is accessible to all users with GPU V100 hours.

It allows you to run computations on the octo-GPU accelerated nodes of Jean Zay. These nodes are equipped with NVIDIA V100 GPUs with 32 GB of memory.

By default, the execution time is 10 minutes and cannot exceed 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for V100 below).

This partition includes nodes with 360 GB of RAM and others with 720 GB. Depending on the amount of memory required by your code, you can target one of the following sub-partitions:

The gpu_p2s sub-partition gives access to octo-GPU V100 nodes with 360 GB of memory.
The gpu_p2l sub-partition gives access to octo-GPU V100 nodes with 720 GB of memory.

Note

If your job does not need more than 360 GB of memory, it is preferable not to target a specific sub-partition: specify the --partition=gpu_p2 option instead, to limit the queue waiting time for your jobs.
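The choice between the partition and its sub-partitions could be written as in the sketch below. The job name, GPU count and time limit are assumptions made for illustration. A line starting with ##SBATCH is ignored by Slurm, which is used here to keep the alternative visible without enabling it.

```shell
#!/bin/bash
#SBATCH --job-name=p2_example      # hypothetical job name
#SBATCH --partition=gpu_p2         # any octo-GPU V100 node (360 GB or 720 GB of RAM)
##SBATCH --partition=gpu_p2l       # disabled alternative: only the 720 GB nodes
#SBATCH --nodes=1                  # single node (assumed)
#SBATCH --gres=gpu:8               # all 8 GPUs of the node (assumed request)
#SBATCH --time=02:00:00            # explicit limit, above the 10-minute default

# Placeholder payload: replace with your own commands.
echo "Job ${SLURM_JOB_ID:-unknown} running in partition ${SLURM_JOB_PARTITION:-unknown}"
```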

A100 Partition: gpu_p5

The gpu_p5 partition is accessible to all users with GPU A100 hours. It allows you to run computations on the 52 octo-GPU accelerated nodes of Jean Zay, which are equipped with NVIDIA A100 GPUs connected by an NVLink SXM4 interconnect and having 80 GB of memory per GPU.

By default, the execution time is 10 minutes and cannot exceed 20 hours (i.e. --time=HH:MM:SS ≤ 20:00:00, see the QoS for A100 below).

To use this partition, you must specify the Slurm directive #SBATCH -C a100 in your scripts.

Attention

These nodes are equipped with AMD Milan EPYC 7543 processors (64 cores per node) unlike the other nodes which have Intel processors. It is therefore necessary to load the pre-module: module load arch/a100 before any other module to access the compatible modules and to recompile your codes specifically for this partition.

H100 Partition: gpu_p6

The gpu_p6 partition is accessible to all users with GPU H100 hours. It allows you to run computations on the 364 quadri-GPU accelerated nodes of Jean Zay, which are equipped with NVIDIA H100 GPUs connected by an NVLink SXM5 interconnect and having 80 GB of memory per GPU.

By default, the execution time is 10 minutes and cannot exceed 100 hours (i.e. --time=HH:MM:SS ≤ 100:00:00, see the QoS for H100 below).

To use this partition, you must specify the Slurm directive #SBATCH -C h100 in your scripts.

Attention

These nodes have a specific hardware architecture, so it is necessary to load the pre-module: module load arch/h100 before any other module to access the compatible modules and to recompile your codes.
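For a script meant to run on several partitions, the pre-modules named above for the A100 and H100 partitions can be selected from the constraint passed to -C. The helper name premodule_for is hypothetical, and this is only a sketch of one way to organize the choice.

```shell
# Map a Slurm constraint (the value passed to -C) to the pre-module named on this page.
# Prints nothing for V100 constraints, which need no pre-module.
premodule_for() {
    case "$1" in
        a100) echo "arch/a100" ;;
        h100) echo "arch/h100" ;;
        *)    : ;;                 # v100-16g, v100-32g, or no constraint
    esac
}

# Intended use inside a job script (commented out: `module` exists only on the cluster):
#   pre=$(premodule_for h100)
#   [ -n "$pre" ] && module load "$pre"    # must come before any other module load
```

Codes still have to be recompiled for the target partition; the helper only picks the module environment.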

Summary

CPU cores and usable RAM per node	GPUs per node and memory per GPU	Corresponding Slurm option
40 CPU + 160 GB usable RAM	4 GPU V100 + 16 or 32 GB RAM	by default (no option)
40 CPU + 160 GB usable RAM	4 GPU V100 + 16 GB RAM	`-C v100-16g`
40 CPU + 160 GB usable RAM	4 GPU V100 + 32 GB RAM	`-C v100-32g`
24 CPU + 360 or 720 GB usable RAM	8 GPU V100 + 32 GB RAM	`--partition=gpu_p2`
24 CPU + 360 GB usable RAM	8 GPU V100 + 32 GB RAM	`--partition=gpu_p2s`
24 CPU + 720 GB usable RAM	8 GPU V100 + 32 GB RAM	`--partition=gpu_p2l`
64 CPU + 468 GB usable RAM	8 GPU A100 + 80 GB RAM	`-C a100`
96 CPU + 468 GB usable RAM	4 GPU H100 + 80 GB RAM	`-C h100`

Attention

The default time limits for partitions are deliberately low. For longer executions, you must specify an execution time limit, which must remain below the maximum allowed for the partition and the QoS used (see below). You must then use:

either the Slurm directive #SBATCH --time=HH:MM:SS in your job,
or the --time=HH:MM:SS option of the sbatch, salloc or srun commands.
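Before submitting, a requested --time value can be compared with the partition maximum. The sketch below handles only the HH:MM:SS form used on this page; the function names are hypothetical and awk is assumed to be available.

```shell
# Convert an HH:MM:SS duration to seconds (awk reads "08" as decimal 8).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) when the requested time does not exceed the limit.
time_within_limit() {
    [ "$(hms_to_seconds "$1")" -le "$(hms_to_seconds "$2")" ]
}

# Example: 30 h fits the 100 h V100/H100 maximum but not the 20 h A100 maximum.
time_within_limit 30:00:00 100:00:00 && echo "30:00:00 fits within 100:00:00"
time_within_limit 30:00:00 20:00:00  || echo "30:00:00 exceeds 20:00:00"
```

The QoS in use may impose a lower limit than the partition, so both values should be checked.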

The default partition does not need to be specified to be used by jobs requesting GPUs. However, all others must be explicitly specified to be used. For example, for the prepost partition, you can use:

either the Slurm directive #SBATCH --partition=prepost in your job,
or the --partition=prepost option of the sbatch, salloc or srun commands.

Attention

Any job requesting more than one node runs in exclusive mode: the nodes are not shared. In particular, this means that the billed hours are calculated based on all the requisitioned nodes, including those that are only partially used.

For example, on the default GPU partition, reserving 5 GPUs (i.e. 1 quadri-GPU node + 1 GPU) results in billing for 8 GPUs (i.e. 2 quadri-GPU nodes). However, the full memory of the reserved nodes is then available (around 160 GB usable per V100 GPU node).
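The billing rule can be checked with a few lines of arithmetic. This sketch assumes the job uses the smallest number of nodes that holds the requested GPUs, and that a single-node job is counted for the GPUs it reserves, which this page does not spell out. The function name billed_gpus is hypothetical.

```shell
# Number of GPUs counted for billing, following the exclusive-mode rule above.
#   $1 = GPUs requested, $2 = GPUs per node (4 on the default partition)
billed_gpus() {
    requested=$1
    per_node=$2
    nodes=$(( (requested + per_node - 1) / per_node ))   # round up to whole nodes
    if [ "$nodes" -gt 1 ]; then
        echo $(( nodes * per_node ))    # multi-node: every GPU of every node counts
    else
        echo "$requested"               # single node: assumed counted as requested
    fi
}

billed_gpus 5 4    # the example above: 2 nodes, prints 8
```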

Note

The service partitions archive, compil, prepost and visu, whose usage is not billed, are accessible even if you do not have CPU hours.

Available GPU QoS

For each job submitted to a compute partition (i.e. other than archive, compil, prepost and visu), you can specify a QoS (Quality of Service) that will determine the limits and priority of your job.

To specify a QoS different from the default, you can choose to:

use the Slurm directive #SBATCH --qos=<chosen_qos> in your job,
or specify the --qos=<chosen_qos> option to the sbatch, salloc or srun commands,

replacing <chosen_qos> with the name of the desired QoS.

Note that the names of the QoS differ depending on the partitions used.
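As an illustration, a V100 job that needs more than 20 hours could name the longer QoS in its header. The job name, GPU count and duration are assumptions; qos_gpu-t4 is the V100 QoS for long executions described in the next section.

```shell
#!/bin/bash
#SBATCH --job-name=long_v100       # hypothetical job name
#SBATCH --gres=gpu:4               # assumed request: 4 GPUs, under the 16-GPU limit of qos_gpu-t4
#SBATCH --time=48:00:00            # longer than the 20 h allowed by the default QoS
#SBATCH --qos=qos_gpu-t4           # non-default QoS, so it must be named explicitly

# Placeholder payload: replace with your own commands.
echo "Job ${SLURM_JOB_ID:-unknown} submitted with an explicit QoS"
```

Both --qos and --time must be set: naming the QoS raises the ceiling, but the job still gets the 10-minute default unless a time limit is requested.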

QoS V100 Partition

Default QoS for all V100 GPU jobs: qos_gpu-t3
- maximum duration: 20h00 of Elapsed time,
- maximum of 640 GPUs per job,
- maximum of 640 GPUs per user (all projects combined),
- maximum of 640 GPUs per project (all users combined).
QoS for longer executions that must be specified to be used (see above): qos_gpu-t4
- maximum duration: 100h00 of Elapsed time,
- maximum of 16 GPUs per job,
- maximum of 96 GPUs per user (all projects combined),
- maximum of 96 GPUs per project (all users combined),
- maximum of 256 GPUs for all jobs requesting this QoS.
QoS reserved solely for brief executions carried out as part of code development or execution tests and which must be specified to be used (see above): qos_gpu-dev
- a maximum of 10 jobs (running or queued) simultaneously per user,
- maximum duration: 2h00 of Elapsed time,
- maximum of 32 GPUs per job,
- maximum of 32 GPUs per user (all projects combined),
- maximum of 32 GPUs per project (all users combined),
- maximum of 512 GPUs for all jobs requesting this QoS.

QoS	Time Limit	Resource Limit per Job	Limit per User (all projects combined)	Limit per Project (all users combined)	Limit per QoS (all users combined)
qos_gpu-t3 (default)	20h	640 GPUs	640 GPUs	640 GPUs	-
qos_gpu-t4	100h	16 GPUs	96 GPUs	96 GPUs	256 GPUs
qos_gpu-dev	2h	32 GPUs	32 GPUs, maximum of 10 jobs (running or queued) simultaneously	32 GPUs	512 GPUs
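The V100 limits above can be turned into a small selection helper. This is a sketch with a hypothetical function name; it looks only at the duration and the per-job GPU limit, and ignores the per-user, per-project and per-QoS limits.

```shell
# Suggest a V100 QoS from a duration in whole hours and a GPU count.
# Prints the QoS name, or returns 1 when no V100 QoS fits the request.
# qos_gpu-dev is never suggested: it is reserved for development and tests.
v100_qos_for() {
    hours=$1
    gpus=$2
    if [ "$hours" -le 20 ] && [ "$gpus" -le 640 ]; then
        echo "qos_gpu-t3"
    elif [ "$hours" -le 100 ] && [ "$gpus" -le 16 ]; then
        echo "qos_gpu-t4"
    else
        return 1
    fi
}

v100_qos_for 12 64    # prints qos_gpu-t3
v100_qos_for 48 8     # prints qos_gpu-t4
```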

QoS A100 Partition

Default QoS for all A100 GPU jobs: qos_gpu_a100-t3
- maximum duration: 20h00 of Elapsed time,
- maximum of 128 GPUs per job,
- maximum of 256 GPUs per user (all projects combined),
- maximum of 256 GPUs per project (all users combined).
QoS reserved solely for brief executions carried out as part of code development or execution tests and which must be specified to be used (see above): qos_gpu_a100-dev
- a maximum of 10 jobs (running or queued) simultaneously per user,
- maximum duration: 2h00 of Elapsed time,
- maximum of 32 GPUs per job,
- maximum of 32 GPUs per user (all projects combined),
- maximum of 32 GPUs per project (all users combined),
- maximum of 128 GPUs for all jobs requesting this QoS.

Attention

The A100 partition does not have a QoS allowing the execution of jobs longer than 20h due to the limited number of A100 nodes available.

Summary Table of Limits on A100 GPU QoS

QoS	Time Limit	Resource Limit per Job	Limit per User (all projects combined)	Limit per Project (all users combined)	Limit per QoS (all users combined)
qos_gpu_a100-t3 (default)	20h	128 GPUs	256 GPUs	256 GPUs	-
qos_gpu_a100-dev	2h	32 GPUs	32 GPUs, maximum of 10 jobs (running or queued) simultaneously	32 GPUs	128 GPUs

QoS H100 Partition

Default QoS for all H100 GPU jobs: qos_gpu_h100-t3
- maximum duration: 20h00 of Elapsed time,
- maximum of 512 GPUs per job,
- maximum of 512 GPUs per user (all projects combined),
- maximum of 512 GPUs per project (all users combined).
QoS for longer executions that must be specified to be used (see above): qos_gpu_h100-t4
- maximum duration: 100h00 of Elapsed time,
- maximum of 16 GPUs per job,
- maximum of 64 GPUs per user (all projects combined),
- maximum of 64 GPUs per project (all users combined),
- maximum of 192 GPUs for all jobs requesting this QoS.
QoS reserved solely for brief executions carried out as part of code development or execution tests and which must be specified to be used (see above): qos_gpu_h100-dev
- a maximum of 10 jobs (running or queued) simultaneously per user,
- maximum duration: 2h00 of Elapsed time,
- maximum of 32 GPUs per job,
- maximum of 32 GPUs per user (all projects combined),
- maximum of 32 GPUs per project (all users combined),
- maximum of 384 GPUs for all jobs requesting this QoS.

QoS	Time Limit	Resource Limit per Job	Limit per User (all projects combined)	Limit per Project (all users combined)	Limit per QoS (all users combined)
qos_gpu_h100-t3 (default)	20h	512 GPUs	512 GPUs	512 GPUs	-
qos_gpu_h100-t4	100h	16 GPUs	64 GPUs	64 GPUs	192 GPUs
qos_gpu_h100-dev	2h	32 GPUs	32 GPUs, maximum of 10 jobs (running or queued) simultaneously	32 GPUs	384 GPUs
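The per-job GPU limits of all the QoS on this page can be gathered in one lookup, for example to check a request before submission. The function names are hypothetical and only the per-job column of the tables is covered.

```shell
# Per-job GPU limit for each QoS listed on this page.
qos_max_gpus_per_job() {
    case "$1" in
        qos_gpu-t3)        echo 640 ;;
        qos_gpu-t4)        echo 16  ;;
        qos_gpu-dev)       echo 32  ;;
        qos_gpu_a100-t3)   echo 128 ;;
        qos_gpu_a100-dev)  echo 32  ;;
        qos_gpu_h100-t3)   echo 512 ;;
        qos_gpu_h100-t4)   echo 16  ;;
        qos_gpu_h100-dev)  echo 32  ;;
        *)                 return 1 ;;   # unknown QoS name
    esac
}

# Succeed when a request of $2 GPUs fits the per-job limit of QoS $1.
fits_qos() {
    limit=$(qos_max_gpus_per_job "$1") || return 1
    [ "$2" -le "$limit" ]
}

fits_qos qos_gpu_h100-t4 16 && echo "16 GPUs fit qos_gpu_h100-t4"
fits_qos qos_gpu_a100-dev 64 || echo "64 GPUs exceed qos_gpu_a100-dev"
```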
