Getting Started on the Supercomputer

⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

Introduction

Are you new to the Jean Zay supercomputer?

On this page, you will find the main points essential for accessing Jean Zay, making your first connection and submitting your first job.

This page is primarily aimed at new users of IDRIS. It is designed to be concise to ensure a quick start with the supercomputer. For more information on the various points covered, please refer to the full documentation.

For a quick overview of useful Linux, Module and SLURM commands for Jean Zay users, see the cheatsheet.

Do you have a question or request?
IDRIS User Support is available Monday to Thursday from 9 am to 6 pm and Friday from 9 am to 5:30 pm:

📥 assist@idris.fr
☎️ +33 (0)1 69 35 85 55

Presentation of the Jean Zay machine

Jean Zay is a supercomputer composed of five partitions:

one scalar partition (nodes equipped only with CPUs)
and four accelerated partitions (hybrid nodes equipped with both CPUs and GPUs).

All nodes access a shared file system via a very high bandwidth interconnect network.

INFO

A complete hardware description is available on the page The Jean Zay supercomputer.

Here is an overview of the machine's architecture:

[Figure: hardware description of Jean Zay]

All DARI projects with CPU and/or GPU hours have compute partitions defined on Jean Zay. These allow users to choose the type of resource (CPU or GPU) they wish to use. The table below summarises the main characteristics of these partitions.

Partition	Name	CPUs per node	CPU RAM	GPUs per node	GPU RAM
CPU	cpu_p1	40	192 GB	-	-
quadri-GPU V100	p13	40	192 GB	4	16 GB / 32 GB
octo-GPU V100	p2	24	384 GB / 768 GB	8	32 GB
octo-GPU A100	p5	64	512 GB	8	80 GB
quadri-GPU H100	p6	96	512 GB	4	80 GB

INFO

For more information on the different partitions, see the pages on SLURM CPU partitions and/or SLURM GPU partitions.
All projects with CPU and/or GPU hours also have access to partitions dedicated to pre- and post-processing, visualisation, compilation or archiving. On these partitions, computing hours are not deducted from your allocation. For more information on these partitions, see the page SLURM CPU partitions.

Accessing the supercomputer

Requests to open an account on Jean Zay are made on the eDARI portal.

Opening an account requires being attached to a scientific project that has computing hours on the supercomputer. You can either join an existing project, with the agreement of its project leader, or create your own. A scientific project is created by requesting computing hours on the eDARI portal.

Beforehand, we recommend that you consult the GENCI note detailing the terms of access to national resources. You will find, among other things, the conditions and criteria for eligibility to obtain computing hours and an account on the supercomputers.

GENCI (National High-Performance Computing Facility), a public operator, manages the allocation of computing resources for all national centres (CINES, IDRIS and TGCC).

To compute on Jean Zay:

You will need to complete the following three steps:

Create a user account on eDARI
Request an hour allocation (unless joining an existing project)
Request the creation of a Jean Zay user account (with project attachment)

Create a user account on eDARI

First, you must create a user account on https://www.edari.fr/user/login using your institutional email address.

WARNING

This account is solely intended to access your personal space on the eDARI site to carry out all administrative procedures (requests for computing hours, opening accounts on Jean Zay, etc.) which must be done from eDARI. It has no link with any potential Jean Zay user account.

Requesting a computing hour allocation

There are two types of access to resources depending on the number of hours requested:

regular access and
dynamic access.

The main differences between them are summarized below:

Regular Access (AR)	Dynamic Access (AD)
≥ 500k CPU hours / ≥ 50k normalized GPU hours (*)	< 500k CPU hours / < 50k normalized GPU hours (*)
Two allocation sessions per year: in May (application submitted before Feb.) and in Nov. (application submitted before Sept.)	Open all year round: validation of an AD takes a few days
Technical and scientific reviews	No review

(*) 50k normalized GPU hours = 50k V100 hours or 25k A100 hours or 12.5k H100 hours.
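
The footnote implies weights of 1, 2 and 4 normalized hours per V100, A100 and H100 hour respectively (50k = 25k × 2 = 12.5k × 4). As a minimal sketch under that assumption, the hypothetical helpers below (illustrative names, not IDRIS or eDARI tools) convert a GPU request into normalized hours and compare it with the 50k threshold from the table:

```shell
# Hypothetical helpers, not part of any IDRIS tooling.
# Weights are inferred from the footnote: 50k V100 = 25k A100 = 12.5k H100.
normalized_gpu_hours() {
    # $1 = GPU type (v100, a100 or h100), $2 = requested GPU hours (integer)
    case "$1" in
        v100) echo $(( $2 * 1 )) ;;
        a100) echo $(( $2 * 2 )) ;;
        h100) echo $(( $2 * 4 )) ;;
        *)    echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

access_type_for_gpu_request() {
    # Prints AR (regular access) at or above 50k normalized hours, AD (dynamic access) below
    hours=$(normalized_gpu_hours "$1" "$2") || return 1
    if [ "$hours" -ge 50000 ]; then echo "AR"; else echo "AD"; fi
}

normalized_gpu_hours a100 25000          # prints 50000
access_type_for_gpu_request h100 10000   # prints AD (40000 normalized hours)
```

The sketch only covers the GPU threshold; the CPU threshold (500k CPU hours) would be checked separately.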

INFO

For further information on the request and the different types of access, consult the page concerning Hour Requests.

Connection

First connection via SSH

The first connection to Jean Zay must be made via SSH, from an institutional address registered in the IDRIS filters and associated with your computing account:

ssh login@jean-zay.idris.fr

For your first connection, you must use your initial password, which is the concatenation of:

the password randomly generated by IDRIS (sent by email)
and the password you entered when requesting the account (eDARI).

You will be required to change this initial password immediately on first connection (automatic procedure) in order to set your own password. You will find an example of a first connection in Password Management.

ATTENTION

On the first connection, the initial password is requested twice (once for the connection and a second time to change the initial password).
Being immediately disconnected after the new password has been accepted (all authentication tokens updated successfully) is normal.

Once connected, you land on one of Jean Zay's 5 login nodes. These nodes, shared by all users, are dedicated to setting up the computing environment and should not be used for calculations (they have no GPUs).
Unlike the compute nodes, the login nodes have an HTTP/HTTPS proxy that allows data to be downloaded from remote servers (over HTTP/HTTPS, for example with git or wget).

INFO

For more information on connecting to Jean Zay (targeting a specific login node, SSH connections by key or with a certificate, etc.), see the page SSH Access and Shells.

Connection via JupyterHub

The IDRIS teams have set up a JupyterHub solution that provides Jupyter Notebooks and other applications such as VSCode, MLflow or Dask through a web interface, without having to open an SSH session on the machine.

IMPORTANT

Your first connection to Jean Zay must still be made via SSH before you can use JupyterHub.

Managing your data

Disk spaces

Each user has a personal HOME space (unique even for users attached to multiple projects).
In addition, for each project the user participates in, 4 disk spaces with various characteristics are accessible: WORK, SCRATCH, STORE and DSDIR.

IMPORTANT

To store your files, choose the disk space whose characteristics best match your needs. This is essential to avoid saturating your quotas (which makes calculations fail) or losing data (some disk spaces are purged automatically).

The table below summarises their main characteristics. For more information on how to better use your disk spaces, see the page Disk Spaces.

Disk Space	Default Capacity	Usage	Command
$HOME	Quotas: 3 GB / 150k inodes per user	home directory for interactive connections; intended for small files (e.g. configuration files); a single HOME even when the login is attached to several projects	`cd $HOME`
$WORK	Quotas: 5 TB / 500k inodes per project (*)	workspace and permanent storage; designed for large files (e.g. input/output data)	User-specific part: `cd $WORK`; common part accessible to all project users: `cd $ALL_CCFRWORK`
$SCRATCH	Very large safety quotas; 4.6 PB shared by all users	workspace and semi-temporary storage; files not read or modified for 30 days are deleted; optimal read/write performance	User-specific part: `cd $SCRATCH`; common part accessible to all project users: `cd $ALL_CCFRSCRATCH`
$STORE	Quotas: 50 TB / 100k inodes per project (*)	archive space; accessible from the login nodes and the prepost, archive, compil and visu partitions	User-specific part: `cd $STORE`; common part accessible to all project users: `cd $ALL_CCFRSTORE`
$DSDIR	-	space visible to all Jean Zay users; contains models and large public databases; set up by the IDRIS teams; read-only	`cd $DSDIR`

(*) Project quotas can be increased at the request of the project leader or their deputy, via the Extranet interface or by contacting User Support.

💡 Disk quotas

You can check the usage of your disk spaces using one of these two commands:

idr_quota_user for a view of your personal usage as a user;
idr_quota_project for an overview of your project and the consumption of each of its members.

INFO

For more information, see the page on Disk Quotas and Viewing Usage Rates.

💡 Best practices for database management

To avoid saturating your disk spaces, first check whether the model or database you need is already available in DSDIR.
If you must download it yourself (non-public database) and its volume requires placing it in your SCRATCH (very large quotas), keep an archived copy of the database in the STORE, because the SCRATCH is a semi-temporary space. You can then easily restore the database if files have been deleted.

WORK or SCRATCH?
- WORK: your files are never deleted automatically, but read and write performance is lower than on the SCRATCH, and the quotas are more restrictive.
- SCRATCH: very large quotas and better read and write performance, but files not accessed for 30 days are automatically deleted!
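
The advice on keeping an archived copy in the STORE can be sketched with plain tar. The two helpers below are hypothetical (their names and the paths in the usage note are illustrative, not IDRIS tools). Packing a dataset into a single archive also means it counts as one file against the STORE inode quota.

```shell
# Hypothetical helpers, not IDRIS-provided commands.
archive_dataset() {
    # $1 = dataset directory (e.g. under $SCRATCH)
    # $2 = directory receiving the archive (e.g. under $STORE)
    mkdir -p "$2" &&
    tar -czf "$2/$(basename "$1").tar.gz" -C "$(dirname "$1")" "$(basename "$1")"
}

restore_dataset() {
    # $1 = archive file, $2 = directory in which to unpack it
    mkdir -p "$2" &&
    tar -xzf "$1" -C "$2"
}
```

For example, `archive_dataset "$SCRATCH/my_dataset" "$STORE/archives"` would create `$STORE/archives/my_dataset.tar.gz`, and `restore_dataset "$STORE/archives/my_dataset.tar.gz" "$SCRATCH"` would recreate `$SCRATCH/my_dataset` after a purge (`my_dataset` and `archives` are placeholder names).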

INFOS

If you are working on a public database, we can download it for you into the shared disk space DSDIR. The data will then be accessible to all users.
For more information on best practices for managing your data, see the page Databases.

Transferring data between your login machine and Jean Zay

To transfer data between your machine and Jean Zay, you can use the SSH-based commands sftp and scp.

  # Sending a local file to Jean Zay
  scp localSource login@jean-zay.idris.fr:JZDestination

  # Retrieving a file from Jean Zay to the local machine
  scp login@jean-zay.idris.fr:JZSource localDestination

  # Connecting to the remote server via SFTP
  sftp login@jean-zay.idris.fr destination

  # Sending a local file to Jean Zay
  sftp> put localSource JZDestination

  # Retrieving a file from Jean Zay to the local machine
  sftp> get JZSource localDestination

IMPORTANT

For this to work, your machine must be registered in the IDRIS filters or you must go through a registered machine!

INFO

For more information on how to transfer data in batch, see the page Transferring data between IDRIS and your login machine and this cheatsheet.

Computing environment

IDRIS provides a catalogue of tools (virtual environments, compiled libraries, etc.) accessible via the module command.

The `module` command

To load the products installed on Jean Zay, use the module command. The table below summarises the basic module commands.

Action	Module command
display the modules containing the requested package	`idr_module_search <package>`
display the complete catalogue	`module avail`
search for a specific tool	`module avail <package>`
get info on a module	`module show <package>`
load a module	`module load <package>/<version>`
unload a module	`module unload <package>`
display the list of loaded modules	`module list`
start from a clean environment	`module purge`

ATTENTION

To access the modules adapted to the A100 or H100 partition, you must first load one of the following modulefiles:

For the A100 partition: module load arch/a100
For the H100 partition: module load arch/h100
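
As a small sketch of this rule, the hypothetical helper below (not an IDRIS command) maps a target GPU type to the modulefile that must be loaded first. It assumes, following the note above, that only the A100 and H100 partitions need one.

```shell
# Hypothetical helper: which arch modulefile to load before any other module.
arch_module_for() {
    # $1 = target (cpu, v100, a100 or h100); prints nothing when no arch module is needed
    case "$1" in
        a100) echo "arch/a100" ;;
        h100) echo "arch/h100" ;;
        cpu|v100) ;;  # assumption from the note above: no arch module required
        *) echo "unknown target: $1" >&2; return 1 ;;
    esac
}

arch=$(arch_module_for a100)
# On Jean Zay one would then run: module purge; module load "$arch"
echo "module load $arch"   # prints: module load arch/a100
```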

INFO

The list of modules can be extended on request by contacting support via assist@idris.fr.
For more information on using the module command, see the page Modules.

Modules and conda virtual environments

Conda virtual environments pre-installed by IDRIS are accessible via the module command.

The environment is activated automatically (conda activate) when the module is loaded (module load conda).
ATTENTION: it is not deactivated (conda deactivate) when the module is unloaded (module unload conda).

Once the environment is activated, you can list all the Python packages it contains with the commands pip list and conda list.

INFO

We strongly recommend using the environments installed by IDRIS to obtain the best performance, pool resources and avoid saturating your quotas.
Any environment can be extended on request by contacting support via assist@idris.fr.

Modules and compilation

Different compilers and libraries are available on Jean Zay (module avail) and can be activated with the module load command.

ATTENTION

We strongly recommend that you consult our web page on the use of the command module and the management of dependencies between the versions of the libraries and those of the compilers.

INFO

For more information on compilation and the different compilers available, see the dedicated pages.

Job submission

Two working modes are possible:

batch mode
interactive mode

In batch mode you can close your interactive session after submitting a job, whereas in interactive mode you must keep the session open to avoid interrupting the execution.

ATTENTION

Running calculations on the login nodes is strongly discouraged: it can slow them down (or even crash them), which affects all other users connected to the same node!
In addition, the limits set on these nodes (1 CPU per user and 30 minutes of CPU time per process) do not allow for good performance.

INFO

For more information on these two working modes, see the pages Batch Execution and Interactive Execution.

You will find examples below for a quick start.

Batch execution - Example of a SLURM script

Access to computing resources is managed by the Slurm manager for all users.

There are 2 essential steps to working in batch:

Creating the Slurm script: a file containing the Slurm directives for resource reservation and the commands to be executed.
Submitting the job: the Slurm script is submitted to the manager via the Slurm commands sbatch or srun for execution on the requested resources.

INFO

Access to the various hardware partitions of the machine depends on the type of job submitted (CPU or GPU) and the Slurm partition requested for its execution (See SLURM CPU partitions and SLURM GPU partitions for more information).

ATTENTION

The batch mode does not allow the user to intervene during the execution of the script commands (except to interrupt the job). Therefore, file transfers must be done without having to type a password.
The compute nodes have no Internet access, which prohibits any downloading (Git repositories, Python/Conda installation, …) from these nodes. If necessary, downloads must be done from the login nodes or from the pre/post-processing nodes before code execution: either interactively, or via the batch submission of cascading jobs.

Below are examples of scripts for executing an MPI code in HPC and a Python script in AI:

Here is an example of a CPU submission script for a batch MPI job on Jean Zay:

1. Content of the intel_mpi.slurm file:

#!/bin/bash
#SBATCH --job-name=MPIJob                 # job name
#SBATCH --ntasks=80                       # Total number of MPI processes
#SBATCH --ntasks-per-node=40              # Number of MPI processes per node
# /!\ Warning, the following line is misleading but in Slurm vocabulary
# "multithread" refers to hyperthreading.
#SBATCH --hint=nomultithread              # 1 MPI process per physical core (no hyperthreading)
#SBATCH --time=00:10:00                   # Maximum execution time requested (HH:MM:SS)
#SBATCH --output=MPIJob%j.out             # Standard output file
#SBATCH --error=MPIJob%j.out              # Standard error file (here merged with standard output)

# go to the submission directory
cd ${SLURM_SUBMIT_DIR}

# purge interactively loaded and default inherited modules
module purge

# load modules
module load intel-all/19.0.4

# echo launched commands
set -x

# code execution
srun ./exec_mpi

2. Submission of the script via the sbatch command:

sbatch intel_mpi.slurm
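
In the script above, --ntasks=80 combined with --ntasks-per-node=40 means the job spans two nodes. As a quick sanity check, the hypothetical helper below (an illustration, not a Slurm command) computes the node count by ceiling division:

```shell
# Hypothetical helper: number of nodes implied by the task directives.
nodes_needed() {
    # $1 = total MPI tasks (--ntasks), $2 = tasks per node (--ntasks-per-node)
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 80 40   # prints 2
```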

INFO

For more information on job execution, including different examples (MPI, OpenMP, MPMD, CUDA MPS), see the dedicated page.

Here is an example of a script for execution on the octo-GPU A100 partition:

1. In the gpu_a100.slurm file:

#!/bin/bash
#SBATCH --job-name=GPUJob               # job name
#SBATCH --output=GPUJob%j.out           # output file (%j = job ID)
#SBATCH --error=GPUJob%j.err            # error file (%j = job ID)
#SBATCH --constraint=a100               # request A100 80 GB GPUs
#SBATCH --nodes=2                       # reserve 2 nodes
#SBATCH --ntasks=16                     # reserve 16 tasks (or processes)
#SBATCH --gres=gpu:8                    # reserve 8 GPUs per node
#SBATCH --cpus-per-task=8               # reserve 8 CPUs per task (and associated memory)
#SBATCH --time=20:00:00                 # maximum allocation time (HH:MM:SS)
# #SBATCH --qos=qos_gpu_a100-t3         # QoS (default so commented out)
#SBATCH --hint=nomultithread            # disable hyperthreading
#SBATCH --account=xyz@a100              # A100 accounting (xyz to be replaced
                                        # by the project's unix group)

module purge                            # purge default inherited modules

conda deactivate                        # deactivate default inherited environments

module load arch/a100                   # select modules compiled for A100
module load pytorch-gpu/py3/2.3.0       # load modules

set -x                                  # enable commands echo
srun python script.py                   # execute the script

The table below summarizes the SLURM options to use to select a specific hardware partition:

GPU Partition	Corresponding SLURM option
quadri-GPU V100 with 16 or 32 GB RAM	by default (no option)
quadri-GPU V100 with 16 GB RAM	`--constraint v100-16g`
quadri-GPU V100 with 32 GB RAM	`--constraint v100-32g`
octo-GPU V100 (with CPU 384 GB or 768 GB RAM)	`--partition=gpu_p2`
octo-GPU V100 (with CPU 384 GB RAM)	`--partition=gpu_p2s`
octo-GPU V100 (with CPU 768 GB RAM)	`--partition=gpu_p2l`
octo-GPU A100	`--constraint=a100`
quadri-GPU H100	`--constraint=h100`

In the example above, we request two nodes of the gpu_p5 partition (octo-GPU A100 nodes).

#SBATCH --constraint=a100 : Selects the octo-GPU A100 nodes.

[Figure: a node of the gpu_p5 partition]

Each of these two nodes gives us access to:

8 NVIDIA A100 GPUs with 80 GB of memory
64 CPU cores
512 GB of RAM memory

#SBATCH --nodes=2 : Reserves two nodes of the same partition.
#SBATCH --ntasks=16 : The total number of tasks/processes reserved, distributed across the 2 nodes.

TIP

It is often equal to the total number of reserved GPUs because generally 1 GPU is allocated per task/process.

#SBATCH --gres=gpu:8 : The number of reserved GPUs per node. That is 16 GPUs in total because we reserve 2 nodes.
#SBATCH --cpus-per-task=8 : The number of CPU cores reserved per task. Here, there will be 8 tasks per node (16/2) so we reserve 8x8=64 CPUs per node (meaning all CPUs and therefore the memory of each node).
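
The arithmetic behind these directives can be checked with a short sketch. The function below is hypothetical (not a Slurm command) and assumes tasks are spread evenly across nodes, as in the example:

```shell
# Hypothetical helper: derive per-node and total figures from the directives.
summarize_allocation() {
    # $1 = --nodes, $2 = --ntasks, $3 = GPUs per node (--gres=gpu:N), $4 = --cpus-per-task
    tasks_per_node=$(( $2 / $1 ))
    echo "tasks_per_node=$tasks_per_node total_gpus=$(( $1 * $3 )) cpus_per_node=$(( tasks_per_node * $4 ))"
}

summarize_allocation 2 16 8 8
# prints: tasks_per_node=8 total_gpus=16 cpus_per_node=64
```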

TIP

It is recommended to reserve the same proportion of resources per node for CPUs as for GPUs. This avoids overcharging when calculating the computing hours used by the job (see Viewing the consumption of computing hours).

The following table summarizes the ideal proportions per node depending on the partition used:

Partition	Number of GPUs	Number of CPUs
gpu_p13	1 V100	10 CPU cores
gpu_p2	1 V100	3 CPU cores
gpu_p5	1 A100	8 CPU cores
gpu_p6	1 H100	24 CPU cores
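
The table can be turned into a small lookup. The helpers below are a hypothetical sketch (not IDRIS tools) that returns a value for --cpus-per-task given the partition and the number of GPUs each task uses:

```shell
# Hypothetical helpers encoding the CPU-per-GPU ratios from the table above.
cpus_per_gpu() {
    # $1 = Slurm partition name
    case "$1" in
        gpu_p13) echo 10 ;;
        gpu_p2)  echo 3 ;;
        gpu_p5)  echo 8 ;;
        gpu_p6)  echo 24 ;;
        *) echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

recommended_cpus_per_task() {
    # $1 = partition, $2 = number of GPUs used by each task
    ratio=$(cpus_per_gpu "$1") || return 1
    echo $(( ratio * $2 ))
}

recommended_cpus_per_task gpu_p5 1   # prints 8
recommended_cpus_per_task gpu_p6 2   # prints 48
```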

#SBATCH --time=20:00:00 : Maximum execution time. The maximum time limits depend on the partition and QoS used (see QoS below).
#SBATCH --qos=qos_gpu_a100-t3 : For each job submitted on a compute partition (so other than archive, compil, prepost, and visu), you can specify a QoS (Quality of Service) that will determine the limits and priority of your job. There are 3 different types of QoS:
- QoS dev: reserved for short runs (max. 2h) carried out as part of development
- QoS t3: default QoS (max. 20h)
- QoS t4: for longer runs (max. 100h)

TIP

Each QoS assigns a different priority to your job. For short tests (less than 2 hours), prefer the dev QoS.

INFO

For more information on QoS, the different limits defined by each of them, and the various commands to specify them depending on the partition, consult the dedicated page.
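
As a rough sketch, the hypothetical helper below picks a QoS type from the expected run time, using only the maximum durations listed above. It ignores any other limit a QoS may impose, and it returns the type (dev, t3 or t4) rather than a full QoS name, since those depend on the partition.

```shell
# Hypothetical helper: choose a QoS type from the run time alone.
qos_type_for_hours() {
    # $1 = expected run time in whole hours (rounded up)
    if   [ "$1" -le 2 ];   then echo "dev"
    elif [ "$1" -le 20 ];  then echo "t3"
    elif [ "$1" -le 100 ]; then echo "t4"
    else
        echo "run time exceeds the 100h maximum" >&2
        return 1
    fi
}

qos_type_for_hours 1    # prints dev
qos_type_for_hours 20   # prints t3
qos_type_for_hours 48   # prints t4
```
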
#SBATCH --hint=nomultithread : Disables hyperthreading, forcing the use of a single thread per physical core (using real cores and not logical threads).
#SBATCH --account=xyz@a100 : Specifies the accounting for tracking hours. The project must have an A100 allocation to use A100 GPUs. Here xyz must be replaced by the project's Unix group.

INFO

For more information on hour accounting, refer to the Computing hours page.

set -x : Enables the display of executed commands. Each command will be logged.

2. Submission of the Slurm script
Once the submission directives are defined, it is recommended to enter the commands in the following order:

go to the execution directory WORK, SCRATCH or JOBSCRATCH;
copy the input files necessary for the execution;
launch the execution with sbatch gpu_a100.slurm;
optionally copy the result files you wish to keep if you used SCRATCH or JOBSCRATCH.

Tracking the progress of a SLURM job

The table below summarises the main commands for submitting the SLURM script and tracking its progress.

Command	Function
`sbatch <script>`	submit a batch Slurm script
`squeue -u $USER`	track the submission status of your jobs
`scontrol show job <jobid>`	display all parameters of a submitted job
`scancel <jobid>`	cancel the execution of a job
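
On success, sbatch prints a line of the form "Submitted batch job <jobid>". The hypothetical helper below (not a Slurm command) is a minimal sketch of extracting that ID so it can be passed to the tracking commands above. The sample line is hard-coded here for illustration.

```shell
# Hypothetical helper: extract the job ID from the line printed by sbatch.
job_id_from_sbatch_output() {
    # $1 = sbatch output, e.g. "Submitted batch job 123456"
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $NF }'
}

# On Jean Zay the argument would be "$(sbatch <script>)"; a sample line is used here.
job_id=$(job_id_from_sbatch_output "Submitted batch job 123456")
echo "scontrol show job $job_id"   # prints: scontrol show job 123456
```

Alternatively, `sbatch --parsable` prints the job ID without the surrounding text, which avoids the parsing step.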

INFO

You can connect via SSH to the compute nodes assigned to your jobs to monitor the execution of your calculations and check resource usage (top, htop, nvidia-smi, ...): ssh <node-name>

Interactive execution

Any execution in interactive mode requires reserving resources via the Slurm manager. The time it takes to allocate these resources varies depending on the machine load.

ATTENTION

It is impossible to predict when the requested resources will be allocated. If you are not at your machine at that moment, the resources will be reserved for you but remain unused.

From machines declared in the IDRIS filters, you have SSH access to the login nodes. You then have 2 options:

Open a terminal directly on a compute node on which you reserve resources, via the srun command.
- Example reserving one GPU for 1 hour on the default partition:
```
login@jean-zay3:~$ srun --ntasks=1 --gres=gpu:1 --time=1:00:00 ... --pty bash
srun: job 123456 queued and waiting for resources
srun: job 123456 has been allocated resources
login@r13i0n8:~$
```
- You are then connected to the compute node and can execute your code/script:
```
login@r13i0n8:~$ ./script.py
```
- To disconnect:
```
login@r13i0n8:~$ exit
exit
login@jean-zay3:~$
```

ATTENTION
- MPI is not supported in this configuration.
- When the time limit (here 1 hour) is reached, the connection to the compute node is automatically cut. The execution is therefore interrupted prematurely.

Allocate resources with the salloc command, then chain executions on these resources with the srun command.
- Example reserving one GPU for 1 hour on the default partition:
```
login@jean-zay1:~$ salloc --ntasks=1 --gres=gpu:1 <other-options>
salloc: Pending job allocation 654321
salloc: job 654321 queued and waiting for resources
salloc: job 654321 has been allocated resources
salloc: Granted job allocation 654321
```
- Once the allocation is granted, you can chain several executions:
```
login@jean-zay1:~$ srun python script_0.py
...
login@jean-zay1:~$ srun python script_1.py
...
```
- To release the resources:
```
login@jean-zay1:~$ exit
exit
login@jean-zay1:~$ salloc: Relinquishing job allocation 654321
```

ATTENTION
- When the time limit (here 1 hour) is reached, the allocation of the compute node is automatically terminated. Any execution in progress is therefore interrupted prematurely.

INFO

For more information on interactive execution, see the dedicated page.

Further information

IDRIS provides various training courses for users of HPC and AI scientific computing.

🚀 HPC Training
🤖 AI Training

For any questions or requests, IDRIS User Support is available Monday to Thursday from 9 am to 6 pm and Friday from 9 am to 5:30 pm:

📥 assist@idris.fr
☎️ +33 (0)1 69 35 85 55

IDRIS organizes workshops on getting started with the supercomputer and optimizing your computing codes.

👨🏽‍💻 Jean Zay Workshop
