Jean Zay: Execution of multi-step and cascade jobs

Using the notion of steps with Slurm

Some users have developed complex processing chains (data workflows) which consist of chaining jobs that may have different characteristics (number of cores, computation time, required memory). The output files of one job are often used as the input files of the following job, which creates dependency relationships between the jobs. Slurm handles this situation in a simple and effective way: each step is defined in a Slurm job which has its own resources (number of cores, memory, time). A multi-step job consists of defining as many steps as there are jobs to execute, together with the dependency relationships between these steps. In this way, the reserved resources correspond exactly to the resources actually used at each step.
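The examples below mainly use the afterok and afternotok dependency types. As a quick illustration (non-exhaustive; see the Slurm documentation of the --dependency option for the complete list), where job1.slurm, job2.slurm and the $JID_JOB1 variable are only placeholders:

    # Start job2 only if job1 finished successfully (exit code 0)
    sbatch --dependency=afterok:$JID_JOB1 job2.slurm
    # Start job2 only if job1 failed, was cancelled or timed out
    sbatch --dependency=afternotok:$JID_JOB1 job2.slurm
    # Start job2 once job1 has terminated, whatever its final state
    sbatch --dependency=afterany:$JID_JOB1 job2.slurm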

Job chaining

To submit a multi-step job on Jean Zay, you must:

  • Create a Bash script which submits several Slurm jobs (one job per step): when each computation step is submitted, you capture the corresponding JOB_ID, which is then passed as a dependency when submitting the following step. The JOB_ID is the fourth field of the message returned by the sbatch command, hence the use of the cut command (a variant based on sbatch --parsable is sketched after this list).
    In the following example, four steps are submitted; each step (except the first one) depends on the preceding step and will only execute if the preceding step has finished successfully (--dependency=afterok).

    multi_steps.bash
    #!/bin/bash
    JID_JOB1=`sbatch  job1.slurm | cut -d " " -f 4`
    JID_JOB2=`sbatch  --dependency=afterok:$JID_JOB1 job2.slurm | cut -d " " -f 4`
    JID_JOB3=`sbatch  --dependency=afterok:$JID_JOB2 job3.slurm | cut -d " " -f 4`
    sbatch  --dependency=afterok:$JID_JOB3 job4.slurm

    Important: This script is not a Slurm job. It is a Bash script to be launched in the following way:

    $ chmod +x multi_steps.bash
    $ ./multi_steps.bash


  • Write all the steps (jobN.slurm) as if they were independent jobs: each step submitted via the sbatch command is a classic Slurm job, like those described in the documentation available in the index sections Execution/commands of a CPU code or Execution/commands of a GPU code. In this way, you can independently specify the partition, the QoS, the CPU time and the number of nodes required for each step.
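
Note: instead of extracting the fourth field with cut, you can pass the --parsable option to sbatch so that it prints only the job ID (on multi-cluster configurations, the ID may be followed by ';cluster_name'). A sketch of the same chain written this way (the file name multi_steps_parsable.bash is just an example):

    multi_steps_parsable.bash
    #!/bin/bash
    # --parsable makes sbatch print only the job ID, so no cut is needed
    JID_JOB1=$(sbatch --parsable job1.slurm)
    JID_JOB2=$(sbatch --parsable --dependency=afterok:$JID_JOB1 job2.slurm)
    JID_JOB3=$(sbatch --parsable --dependency=afterok:$JID_JOB2 job3.slurm)
    sbatch --dependency=afterok:$JID_JOB3 job4.slurm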

Caution:

  • As these are independent jobs, the value of the $JOBSCRATCH variable will be different for each step. Files that need to be shared between two steps should therefore not be saved in this JOBSCRATCH space, but in the semi-permanent SCRATCH space or in a permanent space such as the WORK (which has a lower bandwidth): see the characteristics of the disk spaces.
  • If one of the chained jobs fails, the remaining jobs stay pending with the reason “DependencyNeverSatisfied”: they will never be executed and you must cancel them manually with the scancel command (a squeue/scancel sketch is given after this list). If you want those jobs to be cancelled automatically instead, add the --kill-on-invalid-dep=yes option when submitting them.
    In the following example, this option is used to run a job (job2.slurm) only if the previous one (job1.slurm) failed (--dependency=afternotok:$JID_JOB1) and to prevent it from remaining pending if the previous job ended successfully (--kill-on-invalid-dep=yes). In addition, the last job (job3.slurm) will be executed if either of the two previous jobs (job1.slurm or job2.slurm) completed successfully (--dependency=afterok:$JID_JOB1?afterok:$JID_JOB2):

    multi_steps.bash
    #!/bin/bash
    JID_JOB1=`sbatch job1.slurm | cut -d " " -f 4`
    JID_JOB2=`sbatch --dependency=afternotok:$JID_JOB1 --kill-on-invalid-dep=yes job2.slurm | cut -d " " -f 4`
    sbatch --dependency=afterok:$JID_JOB1?afterok:$JID_JOB2 job3.slurm
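
To check whether chained jobs are blocked, you can list your pending jobs together with the reason they are waiting; the format string below is only an example:

    # List your jobs with their ID, name, state and pending reason;
    # blocked jobs appear with the reason "DependencyNeverSatisfied"
    squeue -u $USER -o "%.10i %.20j %.10T %r"
    # Cancel a blocked job manually (replace <jobid> with the ID reported by squeue)
    scancel <jobid>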

Examples of multi-step jobs using the STORE

Unarchiving some data from the STORE before running a job

  • Submission script extract_data.slurm for the data preparation step:
extract_data.slurm
#!/bin/bash
#SBATCH --job-name=Extraction # job name
#SBATCH --partition=archive   # we use the "archive" or "prepost" partitions from which the STORE is accessible
#SBATCH --ntasks=1            # the job is sequential
#SBATCH --hint=nomultithread  # 1 process per physical CPU core (no hyperthreading)
#SBATCH --time=02:00:00       # maximum elapsed time (HH:MM:SS)
#SBATCH --output=%x_%j.out    # output and error file (%x = job name, %j = job id)
# If you have multiple projects or various types of computing hours,
# you need to specify the account to use even if you will not be charged
# for jobs using the "prepost" or "archive" partitions.
##SBATCH --account=...
 
# Unarchiving some data from the STORE to the SCRATCH
cd $SCRATCH/mon_calcul
tar -xvf $STORE/monarchive.tar
  • Submission script compute.slurm for the computation step:
compute.slurm
#!/bin/bash
#SBATCH --job-name=Compute    # job name
# Add the Slurm directives needed for your job
  ...
# Computation using the data unarchived from the STORE to the SCRATCH
cd $SCRATCH/mon_calcul
srun ...
  • Commands to execute to chain the jobs:
multi_steps.bash
#!/bin/bash
# Submit data extraction job and save the JobId for the dependency
JOB_ID_EXTRACT_DATA=`sbatch extract_data.slurm | cut -d " " -f 4`
# The computation step is only executed if the data preparation step has run successfully
sbatch --dependency=afterok:$JOB_ID_EXTRACT_DATA compute.slurm
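
The afterok dependency already guarantees that the extraction job ended successfully, but as an extra precaution the compute.slurm script may also verify that the unarchived data is present before starting the computation. A minimal sketch of such a guard, to be placed before the srun command (the directory name mon_calcul comes from the example above):

# Abort the computation step early if the unarchived input data is missing
if [ ! -d "$SCRATCH/mon_calcul" ]; then
    echo "Input data not found in $SCRATCH/mon_calcul" >&2
    exit 1
fi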

Archiving some data to the STORE after a job

  • Submission script compute.slurm for the computation step:
compute.slurm
#!/bin/bash
#SBATCH --job-name=Compute    # job name
# Add the Slurm directives needed for your job
   ...
# Computation producing some data to be archived for long-term storage
cd $SCRATCH/mon_calcul
srun ...
  • Submission script archive_data.slurm for the data archiving step:
archive_data.slurm
#!/bin/bash
#SBATCH --job-name=Archive    # job name
#SBATCH --partition=archive   # we use the "archive" or "prepost" partitions from which the STORE is accessible
#SBATCH --ntasks=1            # the job is sequential
#SBATCH --hint=nomultithread  # 1 process per physical CPU core (no hyperthreading)
#SBATCH --time=02:00:00       # maximum elapsed time (HH:MM:SS)
#SBATCH --output=%x_%j.out    # output and error file (%x = job name, %j = job id)
# If you have multiple projects or various types of computing hours,
# you need to specify the account to use even if you will not be charged
# for jobs using the "prepost" or "archive" partitions.
##SBATCH --account=...
 
# Archive some data from the SCRATCH to the STORE
cd $SCRATCH/mon_calcul
tar -cvf $STORE/monarchive.tar resultats*.h5
  • Commands to execute to chain the jobs:
multi_steps.bash
#!/bin/bash
# Submit computation job and save the JobId for the dependency
JOB_ID_COMPUTE=`sbatch compute.slurm | cut -d " " -f 4`
# The data archiving step is only executed if the computation step has run successfully
sbatch --dependency=afterok:$JOB_ID_COMPUTE archive_data.slurm
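
The two examples can naturally be combined into a single chain in which extraction, computation and archiving each run on the appropriate partition. A possible submission script, reusing the extract_data.slurm, compute.slurm and archive_data.slurm scripts above:

multi_steps.bash
#!/bin/bash
# Step 1: unarchive the input data from the STORE to the SCRATCH
JOB_ID_EXTRACT_DATA=`sbatch extract_data.slurm | cut -d " " -f 4`
# Step 2: run the computation only if the extraction finished successfully
JOB_ID_COMPUTE=`sbatch --dependency=afterok:$JOB_ID_EXTRACT_DATA compute.slurm | cut -d " " -f 4`
# Step 3: archive the results to the STORE only if the computation finished successfully
sbatch --dependency=afterok:$JOB_ID_COMPUTE archive_data.slurm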