⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

Running a job interactively

On Jean Zay, interactive access to computing resources can be achieved in several ways.

Access to the login node is via an ssh connection:

ssh login@jean-zay.idris.fr

The resources of an interactive node are shared among all users connected to that node: therefore, interactive access on the login node is reserved solely for compiling and debugging scripts.

Warning

On the login nodes, memory (RAM) is limited to 5 GB shared among all of a user's processes, and CPU time is limited to 30 minutes per process, to ensure fair resource sharing.

Any interactive execution of your codes must be performed on the CPU or GPU compute nodes using:

- either the srun command:
  - to obtain a terminal on a CPU or GPU compute node within which you can execute your code,
  - or to directly execute your code on the partition of your choice;
- or the salloc command, to reserve CPU or GPU resources that can be reused for multiple executions.

Note

It is preferable to submit a batch job for computations requiring significant resources (in terms of CPU cores or GPUs or maximum duration) as it will not be affected by a potential loss of connection or crash of the login node.

Obtaining a terminal on a CPU compute node

It is possible to open a terminal directly on a CPU compute node where resources are reserved for you (here 4 cores) using the following command:

srun --pty --ntasks=1 --cpus-per-task=4 --hint=nomultithread [--other-options] bash

Remarks:

The --pty option allows you to obtain an interactive terminal and the command passed to srun is a shell (here bash).
The --hint=nomultithread option ensures the reservation of physical cores (no hyperthreading).
By default, the allocated CPU memory is proportional to the number of reserved cores. For example, if you request 1/4 of the physical cores of a node, you will have access to 1/4 of its memory. You can consult our documentation on this subject: Memory allocation on CPU partitions.
--other-options can contain all the usual Slurm options for configuring jobs (--time=, etc.): see the documentation on batch submission scripts in the Execution/Job Control section.
All reservations have resources defined in Slurm by a partition and a default "Quality of Service" (QoS). You can modify the limits by specifying a partition and/or a QoS, as indicated in our documentation detailing CPU partitions and QoS.
For multi-project accounts, and for accounts with both CPU and GPU hours, you must specify the hour allocation from which the job's computing hours are deducted, as described in our documentation on the management of computing hours. This ensures that the hours consumed by your jobs are charged to the correct allocation.
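As a hedged illustration of how these remarks combine, the sketch below assembles the srun command from a few variables and prints it instead of running it. The 30-minute limit and the account name my_project@cpu are placeholders invented for this example, not values taken from this page; refer to the documentation on the management of computing hours for the exact form of your own allocation.

```shell
#!/bin/sh
# Sketch: assemble an interactive srun command line and print it for review.
# ASSUMPTIONS: the time limit and the account name are hypothetical placeholders.
cores=4
time_limit="00:30:00"        # hypothetical wall-time limit (HH:MM:SS)
account="my_project@cpu"     # hypothetical hour allocation name

cmd="srun --pty --ntasks=1 --cpus-per-task=$cores --hint=nomultithread --time=$time_limit --account=$account bash"

# Print the command so it can be checked before being pasted into the terminal.
echo "$cmd"
```

Printing the command first makes it easy to check the time limit and the allocation before any hours are consumed.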

The terminal is operational after obtaining the allocation:

$ srun --pty --ntasks=1 --cpus-per-task=4 --hint=nomultithread bash
srun: job 1365358 queued and waiting for resources
srun: job 1365358 has been allocated resources
bash-4.2$ hostname
noeud123

You can verify that your interactive job has started using the squeue command, and obtain complete information on the job status with the scontrol show job <job ID> command.
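The job ID needed by scontrol show job is the number that srun prints when the job is queued. A minimal sketch, assuming the message format shown in the transcript above; job_id_from_message is a hypothetical helper written for this example, and the script only prints the inspection commands rather than running them:

```shell
#!/bin/sh
# Sketch: extract the job ID from the message printed by srun (or salloc),
# then print the two inspection commands mentioned above.
# job_id_from_message is a hypothetical helper, not a Slurm command.
job_id_from_message() {
    # Keep only the digits that follow "job " in a line such as
    # "srun: job 1365358 queued and waiting for resources".
    printf '%s\n' "$1" | sed -n 's/^[a-z]*: job \([0-9][0-9]*\) .*/\1/p'
}

job_id=$(job_id_from_message "srun: job 1365358 queued and waiting for resources")

echo "squeue -u \$USER"              # list your own jobs
echo "scontrol show job $job_id"     # full details for this job
```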

Once the terminal is operational, you can launch your executables in the usual way: ./votre_executable.

MPI Execution

If you wish to start an MPI program in this configuration, you must use srun again to launch the execution: srun ./votre_executable_mpi.

Note that hyperthreading is not usable via MPI in this configuration.

To exit interactive mode:

bash-4.2$ exit 

Warning

If you do not exit interactive mode yourself, the maximum allocation duration (default or specified with the --time option) is applied: the job remains on the machine doing nothing and the computing hours will be deducted from the project you specified.

Interactive execution on the CPU partition

If you do not need to open a terminal on a compute node, it is also possible to start the interactive execution of a code on the CPU compute nodes directly from the login node using the following command (here with 4 tasks):

srun --ntasks=4 --hint=nomultithread [--other-options] ./mon_executable

Remarks:

The command passed to srun is then an executable.
The --hint=nomultithread option ensures the reservation of physical cores (no hyperthreading).
By default, the allocated CPU memory is proportional to the number of reserved cores. For example, if you request 1/4 of the physical cores of a node, you will have access to 1/4 of its memory. You can consult our documentation on this subject: Memory allocation with Slurm.
--other-options can contain all the usual Slurm options for configuring jobs (--time=, etc.): see the documentation on batch submission scripts in the Execution/Job Control section.
All reservations have resources defined in Slurm by a partition and a default "Quality of Service" (QoS). You can modify the limits by specifying a partition and/or a QoS, as indicated in our documentation detailing CPU partitions and QoS.
For multi-project accounts, and for accounts with both CPU and GPU hours, you must specify the hour allocation from which the job's computing hours are deducted, as described in our documentation on the management of computing hours. This ensures that the hours consumed by your jobs are charged to the correct allocation.
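The proportional memory rule in the remarks above is simple arithmetic, sketched below. The node size used here (40 cores, 160 GB) is a made-up example chosen so that the numbers divide evenly; it is not a specification of any Jean Zay partition, and mem_share_gb is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: memory share under proportional allocation.
# ASSUMPTION: the node size (40 cores, 160 GB) is a hypothetical example.
mem_share_gb() {
    # $1 = cores requested, $2 = physical cores per node, $3 = node memory in GB
    # Integer arithmetic: the result is rounded down to a whole number of GB.
    echo $(( $3 * $1 / $2 ))
}

# A quarter of the cores gives a quarter of the memory.
mem_share_gb 10 40 160
```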

Obtaining a terminal on a GPU compute node

It is possible to open a terminal directly on a converged compute node where resources are reserved for you (here 1 GPU on the default GPU partition) using the following command:

srun --pty --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 --gres=gpu:1 --hint=nomultithread [--other-options] bash

Remarks:

The --pty option allows you to obtain an interactive terminal and the command passed to srun is a shell (here bash).
The --hint=nomultithread option ensures the reservation of physical cores (no hyperthreading).
The memory allocated for the job is proportional to the number of CPU cores requested. For example, if you request 1/4 of the physical cores of a node, you will have access to 1/4 of its memory. With 1 task per GPU, the --cpus-per-task value to use therefore depends on the partition:
- on the default GPU partition, --cpus-per-task=10 reserves 1/4 of the node's memory per GPU;
- on the gpu_p2 partition (--partition=gpu_p2), --cpus-per-task=3 reserves 1/8 of the node's memory per GPU;
- on the gpu_p5 partition (-C a100), --cpus-per-task=8 reserves 1/8 of the node's memory per GPU;
- on the gpu_p6 partition (-C h100), --cpus-per-task=24 reserves 1/4 of the node's memory per GPU.
This keeps your request consistent with the configuration of the nodes used and avoids being charged more hours than necessary. You can consult our documentation on this subject: Memory allocation with Slurm.
--other-options can contain all the usual Slurm options for configuring jobs (--time=, etc.): see the documentation on batch submission scripts in the Execution/Job Control section.
All reservations have resources defined in Slurm by a partition and a default "Quality of Service" (QoS). You can modify the limits by specifying a partition and/or a QoS, as indicated in our documentation detailing GPU partitions and QoS.
For multi-project accounts, and for accounts with both CPU and GPU hours, you must specify the hour allocation from which the job's computing hours are deducted, as described in our documentation on the management of computing hours. This ensures that the hours consumed by your jobs are charged to the correct allocation.
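The per-partition --cpus-per-task values given in the remarks above can be captured in a small lookup, sketched here for the case of 1 task per GPU. cpus_per_gpu is a hypothetical helper, and the keys "default", "gpu_p2", "gpu_p5" and "gpu_p6" are labels chosen for this example rather than values Slurm itself understands.

```shell
#!/bin/sh
# Sketch: map a GPU partition label to the --cpus-per-task value listed above
# (one task per GPU). cpus_per_gpu is a hypothetical helper.
cpus_per_gpu() {
    case "$1" in
        default) echo 10 ;;   # 1/4 of the node's memory per GPU
        gpu_p2)  echo 3  ;;   # 1/8 of the node's memory per GPU
        gpu_p5)  echo 8  ;;   # 1/8 of the node's memory per GPU (-C a100)
        gpu_p6)  echo 24 ;;   # 1/4 of the node's memory per GPU (-C h100)
        *)       echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

echo "--cpus-per-task=$(cpus_per_gpu gpu_p5)"
```

Keeping the values in one place avoids copying a --cpus-per-task setting from one partition to another, where it would no longer match the node configuration.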

The terminal is operational after obtaining the allocation:

$ srun --pty --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 --gres=gpu:1 --hint=nomultithread bash
srun: job 1369723 queued and waiting for resources
srun: job 1369723 has been allocated resources
bash-4.2$ hostname
noeud234
bash-4.2$ nvidia-smi 
Fri Apr 10 19:09:08 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
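Alongside nvidia-smi, a quick check of how many GPUs the job can see is to inspect the CUDA_VISIBLE_DEVICES variable. This is a sketch under the assumption that Slurm exports that variable for jobs requesting --gres=gpu, which is common but depends on the site configuration; count_gpus is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: count the GPUs listed in a CUDA_VISIBLE_DEVICES-style string.
# ASSUMPTION: the variable is exported inside the job (site dependent).
# count_gpus is a hypothetical helper.
count_gpus() {
    # An empty string means no GPU is visible.
    if [ -z "$1" ]; then
        echo 0
        return 0
    fi
    # Otherwise count the comma-separated entries, e.g. "0,1,2,3" gives 4.
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

count_gpus "${CUDA_VISIBLE_DEVICES:-}"
```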

You can verify that your interactive job has started using the squeue command, and obtain complete information on the job status with the scontrol show job <job ID> command.

Once the terminal is operational, you can launch your executables in the usual way: ./votre_executable.

Warning

MPI is currently not usable in this configuration.

To exit interactive mode, use the exit command:

bash-4.2$ exit 

Warning

If you do not exit interactive mode yourself, the maximum allocation duration (default or specified with the --time option) is applied: the job remains on the machine doing nothing and the corresponding computing hours will be deducted from the project you specified.

Interactive execution on the GPU partition

If you do not need to open a terminal on a GPU compute node, it is also possible to start the interactive execution of a code on the converged compute nodes directly from the login node using the following command (here with 4 GPUs on the default GPU partition):

srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=10 --gres=gpu:4 --hint=nomultithread [--other-options] ./mon_executable

Remarks:

The command passed to srun is then an executable.
The --hint=nomultithread option ensures the reservation of physical cores (no hyperthreading).
The memory allocated for the job is proportional to the number of CPU cores requested. For example, if you request 1/4 of the physical cores of a node, you will have access to 1/4 of its memory. With 1 task per GPU, the --cpus-per-task value to use therefore depends on the partition:
- on the default GPU partition, --cpus-per-task=10 reserves 1/4 of the node's memory per GPU;
- on the gpu_p2 partition (--partition=gpu_p2), --cpus-per-task=3 reserves 1/8 of the node's memory per GPU;
- on the gpu_p5 partition (-C a100), --cpus-per-task=8 reserves 1/8 of the node's memory per GPU;
- on the gpu_p6 partition (-C h100), --cpus-per-task=24 reserves 1/4 of the node's memory per GPU.
This keeps your request consistent with the configuration of the nodes used and avoids being charged more hours than necessary. You can consult our documentation on this subject: Memory allocation with Slurm.
--other-options can contain all the usual Slurm options for configuring jobs (--time=, etc.): see the documentation on batch submission scripts in the Execution/Job Control section.
All reservations have resources defined in Slurm by a partition and a default "Quality of Service" (QoS). You can modify the limits by specifying another partition and/or a QoS, as indicated in our documentation detailing GPU partitions and QoS.
For multi-project accounts, and for accounts with both CPU and GPU hours, you must specify the hour allocation from which the job's computing hours are deducted, as described in our documentation on the management of computing hours. This ensures that the hours consumed by your jobs are charged to the correct allocation.
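In the command above, the number of tasks per node and the number of GPUs requested move together (1 task per GPU). The sketch below makes that explicit by building the command from a single GPU count and printing it without running it; gpu_srun_cmd is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Sketch: build the single-node GPU srun command from one GPU count,
# with one task per GPU. gpu_srun_cmd is a hypothetical helper.
gpu_srun_cmd() {
    # $1 = number of GPUs, $2 = CPU cores per task, $3 = executable
    echo "srun --nodes=1 --ntasks-per-node=$1 --cpus-per-task=$2 --gres=gpu:$1 --hint=nomultithread $3"
}

# 4 GPUs on the default GPU partition, 10 cores per task as described above.
gpu_srun_cmd 4 10 ./mon_executable
```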

Reservation of reusable resources for multiple interactive executions

Each interactive execution started as described in the previous sections corresponds to a different job. Like all jobs, each may wait in the queue for a shorter or longer time if the computing resources are not available.

If you wish to chain several interactive executions, it may be worthwhile to first reserve resources that can be reused for all of them. You then wait for the resources only once, at the time of reservation, and not for each execution.

The reservation of resources is done via the salloc command as illustrated below for 4 tasks.

CPU salloc example:

salloc --ntasks=4 --hint=nomultithread [--other-options]

GPU salloc example:

salloc --nodes=1 --ntasks-per-node=4 --cpus-per-task=10 --gres=gpu:4 --hint=nomultithread [--other-options]

The options are the same as for the srun command (see sections above).

The reservation becomes usable after the allocation of resources:

salloc: Pending job allocation 1367065
salloc: job 1367065 queued and waiting for resources
salloc: job 1367065 has been allocated resources
salloc: Granted job allocation 1367065

You can verify that your reservation is active using the squeue command, and obtain complete information on the job status with the scontrol show job <job ID> command.

You can then start interactive executions using the srun command:

$ srun [--other-options] ./code

Note

If you do not specify any options for the srun command, the options given to salloc (e.g. the number of tasks) are used by default.

Warning

After reserving resources with salloc, you are still connected to the login node (you can verify this using the hostname command). You must use the srun command so that your executions run on the reserved resources.
If you forget to release the reservation, the maximum allocation duration (default or specified with the --time option) is applied and the corresponding computing hours are deducted from the project you specified. You must therefore release it explicitly:

$ exit
exit
salloc: Relinquishing job allocation 1367065 
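Because the shell started by salloc still runs on the login node, it is easy to launch a code without srun or outside any reservation. A minimal guard, assuming that salloc exports the SLURM_JOB_ID variable into the shell it starts; run_in_allocation is a hypothetical wrapper written for this example:

```shell
#!/bin/sh
# Sketch: refuse to launch when no reservation is active.
# ASSUMPTION: SLURM_JOB_ID is set in the shell started by salloc.
# run_in_allocation is a hypothetical wrapper, not a Slurm command.
run_in_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "no active allocation: run salloc first" >&2
        return 1
    fi
    # Inside the reservation: srun reuses the salloc options by default.
    srun "$@"
}

# Usage, from the shell started by salloc:
#   run_in_allocation ./code
```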
