Python scripts for automated execution of GPU jobs

Automated scripts are available to Jean Zay users for executing GPU jobs via the SLURM job manager. They are designed to be used from a notebook opened on a front end in order to run jobs distributed over the GPU compute nodes.

The scripts are developed by the IDRIS support team and installed in all the PyTorch and TensorFlow modules.

Importing the functions:

 from idr_pytools import gpu_jobs_submitter, display_slurm_queue, search_log 

Submission of GPU jobs

The gpu_jobs_submitter function enables submitting GPU jobs to the SLURM queue. It automates the creation of SLURM files that meet our requirements and submits the jobs for execution via the sbatch command.

The automatically created SLURM files can be found in the slurm folder. You can consult them to check the configuration.

Arguments:

  • srun_commands (required): The command to execute with srun; for AI, this is usually a Python script. Example: 'my_script.py -b 64 -e 6 --learning-rate 0.001'. If the first word is a file with a .py extension, the python -u command is automatically added before the script name. It is also possible to pass a list of commands in order to submit more than one job. Example: ['my_script.py -b 64 -e 6 --learning-rate 0.001', 'my_script.py -b 128 -e 12 --learning-rate 0.01'].
  • n_gpu: The number of GPUs to reserve for a job. By default, 1 GPU; 512 GPUs maximum. It is also possible to pass a list of numbers of GPUs. Example: n_gpu=[1, 2, 4, 8, 16]. A job will then be created for each element of the list. If more than one command is specified in the preceding srun_commands argument, each command will be run on all of the requested configurations.
  • module (required if using the modules): Name of the module to load. Only one module name is authorised.
  • singularity (required if using a Singularity container): Name of the SIF image to load. The idrcontmgr command will have previously been applied. See the documentation on using Singularity containers.
  • name: Name of the job. It will be displayed in the SLURM queue and integrated in the log names. By default, the python script name indicated in srun_commands is used.
  • n_gpu_per_task: The number of GPUs associated with a task. By default, 1 GPU per task, in accordance with a data parallelism configuration. However, for model parallelism or for the TensorFlow distribution strategies, more than one GPU must be associated with a task (see the sketch after this list).
  • time_max: The maximum duration of the job. By default: '02:00:00'.
  • qos: The default QoS is 'qos_gpu-t3'. If not using the default QoS, use 'qos_gpu-t4' or 'qos_gpu-dev'.
  • partition: The default partition is 'gpu_p13'. If not using the default partition, use 'gpu_p2', 'gpu_p2l' or 'gpu_p2s'.
  • constraint: 'v100-32g' or 'v100-16g'. When using the default partition, this forces the use of either the 32 GB GPUs or the 16 GB GPUs.
  • cpus_per_task: The number of CPUs to associate with each task. By default: 10 for the default partition or 3 for the gpu_p2 partition. It is advised to leave the default values.
  • exclusive: Forces the exclusive use of a node.
  • account: The GPU hour accounting to use. Required if you have more than one hour accounting and/or project. For more information, you can refer to our documentation on project hours management.
  • verbose: 0 by default. Setting it to 1 adds NVIDIA debugging traces to the logs.
  • email: The email address to which SLURM automatically sends job status reports.
  • addon: Enables adding additional command lines to the SLURM file; for example, 'unset PROXY', or to load a personal environment:
    addon="""source .bashrc
    conda activate myEnv"""
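
Referring to the n_gpu_per_task argument above, a model-parallel submission might look like the following minimal sketch (the grouping of the 8 GPUs into a single task and the job name are illustrative; the module version is taken from the example further below):

    # One task holding the 8 GPUs of a node, e.g. for model parallelism
    jobids = gpu_jobs_submitter('my_script.py',
                                n_gpu=8,
                                n_gpu_per_task=8,
                                module='tensorflow-gpu/py3/2.4.1',
                                name="model_parallel_test")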

Return:

  • jobids: List of the jobids of submitted jobs.

Note for A100:

  • To use the A100 80GB partition with your xxx@a100 account, you only need to specify it with account=xxx@a100. The constraint and the module required for this partition are then automatically added to the generated .slurm file.
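
For instance, a minimal sketch (the script name is a placeholder and 'xxx@a100' stands for your own A100 project account; add the module or singularity argument as for any other submission):

    jobids = gpu_jobs_submitter('my_script.py -b 128 -e 12',
                                n_gpu=8,
                                account='xxx@a100',
                                name="test_a100")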

Example

  • Command launched:
    jobids = gpu_jobs_submitter(['my_script.py -b 64 -e 6 --learning-rate 0.001',
                                'my_script.py -b 128 -e 12 --learning-rate 0.01'],
                                 n_gpu=[1, 2, 4, 8, 16, 32, 64],
                                 module='tensorflow-gpu/py3/2.4.1',
                                 name="Imagenet_resnet101")
  • Display:
    batch job 0: 1 GPUs distributed on 1 nodes with 1 tasks / 1 gpus per node and 3 cpus per task
    Submitted batch job 778296
    Submitted batch job 778297
    batch job 2: 2 GPUs distributed on 1 nodes with 2 tasks / 2 gpus per node and 3 cpus per task
    Submitted batch job 778299
    Submitted batch job 778300
    batch job 4: 4 GPUs distributed on 1 nodes with 4 tasks / 4 gpus per node and 3 cpus per task
    Submitted batch job 778301
    Submitted batch job 778302
    batch job 6: 8 GPUs distributed on 1 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
    Submitted batch job 778304
    Submitted batch job 778305
    batch job 8: 16 GPUs distributed on 2 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
    Submitted batch job 778306
    Submitted batch job 778307
    batch job 10: 32 GPUs distributed on 4 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
    Submitted batch job 778308
    Submitted batch job 778309
    batch job 12: 64 GPUs distributed on 8 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
    Submitted batch job 778310
    Submitted batch job 778312
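
The example above relies on the default partition, QoS and time limit. As a complementary sketch, a submission with non-default resources might look as follows (all values are placeholders chosen for illustration, using only the arguments described above):

    # Placeholder values: force the 32 GB GPUs of the default partition,
    # with a non-default QoS and time limit
    jobids = gpu_jobs_submitter('my_script.py -b 64 -e 6 --learning-rate 0.001',
                                n_gpu=8,
                                module='tensorflow-gpu/py3/2.4.1',
                                constraint='v100-32g',
                                qos='qos_gpu-t4',
                                time_max='40:00:00',
                                name="Imagenet_resnet101_32g")

    # Placeholder values: the same command run from a Singularity image
    # previously registered with idrcontmgr ('my_image.sif' is hypothetical)
    jobids = gpu_jobs_submitter('my_script.py -b 64 -e 6 --learning-rate 0.001',
                                n_gpu=8,
                                singularity='my_image.sif',
                                partition='gpu_p2',
                                name="Imagenet_resnet101_singularity")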

Interactive display of the SLURM queue

The SLURM queue and the waiting jobs can be displayed in a notebook with the following command:

!squeue -u $USER

However, this only shows the queue status at the moment the command is run.

The display_slurm_queue function provides a dynamic display of the queue, refreshed every 5 seconds. The function returns only when the queue is empty, which is useful in a notebook for executing the cells sequentially. If the jobs take too long, you can interrupt the cell execution (with no impact on the SLURM queue) and regain control of the notebook.

Arguments:

  • name: Enables a filter by job name. The queue will only display jobs with this name.
  • timestep: Refresh delay. By default, 5 seconds.
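
For instance, a typical notebook cell submits the jobs and then blocks until the queue is empty (a sketch; the job name and refresh delay are illustrative):

    jobids = gpu_jobs_submitter('my_script.py -b 64 -e 6 --learning-rate 0.001',
                                n_gpu=4,
                                module='tensorflow-gpu/py3/2.4.1',
                                name="Imagenet_resnet101")
    # Blocks until no job named "Imagenet_resnet101" remains in the queue,
    # refreshing the display every 30 seconds
    display_slurm_queue("Imagenet_resnet101", timestep=30)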

Example

  • Command run:
    display_slurm_queue("Imagenet_resnet101")
  • Display:
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                778312    gpu_p2 Imagenet  ssos040 PD       0:00      8 (Priority)
                778310    gpu_p2 Imagenet  ssos040 PD       0:00      8 (Priority)
                778309    gpu_p2 Imagenet  ssos040 PD       0:00      4 (Priority)
                778308    gpu_p2 Imagenet  ssos040 PD       0:00      4 (Priority)
                778307    gpu_p2 Imagenet  ssos040 PD       0:00      2 (Priority)
                778306    gpu_p2 Imagenet  ssos040 PD       0:00      2 (Priority)
                778305    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Priority)
                778304    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Priority)
                778302    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Priority)
                778301    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Priority)
                778296    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Resources)
                778297    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Resources)
                778300    gpu_p2 Imagenet  ssos040 PD       0:00      1 (Resources)
                778299    gpu_p2 Imagenet  ssos040  R       1:04      1 jean-zay-ia828
    

Search log paths

The search_log function finds the paths of the log files of the jobs executed with the gpu_jobs_submitter function.

The log files are named with the following format: '{name}@JZ_{datetime}_{ntasks}tasks_{nnodes}nodes_{jobid}'.

Arguments:

  • name: Enables filtering by job name.
  • contains: Enables filtering by date, number of tasks, number of nodes or jobids. The character '*' enables concatenating more than one filter. Example: contains='2021-02-12_22:*1node'
  • with_err: By default, False. If True, returns a dictionary with the paths of both the output files and the error files listed in chronological order. If False, returns a list with only the paths of the output files listed in chronological order.

Example:

  • Command run:
    paths = search_log("Imagenet_resnet101")
  • Display:
    ['./slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:46_8tasks_4nodes_778096.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:49_8tasks_4nodes_778097.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:53_8tasks_4nodes_778099.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:57_8tasks_8nodes_778102.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:24:04_8tasks_8nodes_778105.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:24:10_8tasks_8nodes_778110.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-07_17:53:49_2tasks_1nodes_778310.out',
     './slurm/log/Imagenet_resnet101@JZ_2021-04-07_17:53:52_2tasks_1nodes_778312.out']
  • Comment: The paths are listed in chronological order. To get only the last two paths, simply use the following command:
    paths[-2:]
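
As a complementary sketch (the date filter is illustrative and matches the log names above), the returned paths can be read directly in the notebook:

    # Output logs of the jobs submitted on a given date
    paths = search_log("Imagenet_resnet101", contains='2021-04-07')
    with open(paths[-1]) as f:   # most recent matching output log
        print(f.read())

    # With with_err=True, a dictionary containing both the output and the
    # error log paths is returned; print it to inspect its structure
    logs = search_log("Imagenet_resnet101", with_err=True)
    print(logs)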