Ada, Adapp: Multi-step and cascade jobs

Using the LoadLeveler step concept

Some users have developed complex processing chains (workflows) which involve chaining jobs with different characteristics (number of cores, required time and memory). The output files of one job are often used as the input files of the following job, which creates interdependency relationships between the jobs. LoadLeveler manages this problem simply and effectively through the concept of steps and multi-step jobs. Each step is defined in a sub-job with its own associated resources (cores, memory, time), in other words, its own class. A multi-step job thus defines as many steps as there are sub-jobs to execute, as well as the interdependency relationships between these steps. In this way, at each step, the reserved resources correspond exactly to the resources actually used.
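
As a first illustration, here is a minimal sketch of a multi-step structure (the step names, limits and commands are placeholders): each group of step directives ends with a # @ queue directive, and the shell part selects its commands according to the step being executed.

  #=========== Global directives ===========
  # @ job_name = skeleton
  # @ output = $(job_name).$(step_name).$(jobid)
  # @ error = $(output)

  #=========== Step 1 directives ===========
  # @ step_name = step_one
  # @ job_type = serial
  # @ queue

  #=========== Step 2 directives ===========
  # @ step_name = step_two
  # @ dependency = (step_one == 0)
  # @ job_type = serial
  # @ queue

  case ${LOADL_STEP_NAME} in
    step_one ) echo "commands for step 1" ;;
    step_two ) echo "commands for step 2" ;;
  esac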

Let us take a concrete example: a user needs to launch first a sequential job (e.g. a compilation), then a second, parallel computation job, and finally a third, again sequential, job (post-processing of the previously generated files). It would be regrettable to reserve the cores for the entire job when they are only needed during the parallel execution. This would not only be detrimental to the efficient use of the machine but would also penalize the user, who is billed for the number of cores reserved multiplied by the total Elapsed time of the entire job.
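
To give a rough idea with purely illustrative figures: a classic job reserving 128 cores during 3 hours of Elapsed time is billed 128 × 3 = 384 core-hours, even if the parallel phase lasts only 1 hour. Split into three steps (1 core for 1 hour, 128 cores for 1 hour, 1 core for 1 hour), the same workflow is billed approximately 1 + 128 + 1 = 130 core-hours.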

With a multi-step job, it is sufficient to specify first a sequential step, then a parallel step, and lastly another sequential step: all three steps are contained within a single LoadLeveler job. The resources which would otherwise have been reserved but left unused are no longer reserved and, therefore, are not billed to the project account (example 1).

In a multi-step job, the TMPDIR directory is always permanent: the contents of this directory are retained between the different steps of the job. In addition, there is no inheritance of characteristics (number of cores, Elapsed time, memory) from one step to the next; instead, at the beginning of each new step, the default limits of its execution class apply.
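
As a minimal sketch of this permanence (the file name is a placeholder), a file written into the TMPDIR during one step is still readable in a later step of the same job:

  # In the first step:
  echo "intermediate result" > $TMPDIR/state.txt

  # In a later step of the same multi-step job,
  # the TMPDIR still contains the file:
  cat $TMPDIR/state.txt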

Comments:

  • If the sequential step consists only of transferring files from or towards Ergon using the mfget / mfput commands, then you must use the dedicated class archive, created specifically for this purpose. Routing towards this class is done with the LoadLeveler keyword # @ class = archive in the submission script. It is recommended to keep the default limit values of the archive class: in particular, no time limit should be specified; the maximum limits of the class (30 minutes of CPU time and 20 hours of Elapsed time) are then taken by default. The safeguards built into the mfput / mfget commands protect against occasional problems on Ergon. Use of the archive class is not billed to the users; its utilisation is monitored continuously to prevent any misuse (example 2).
  • If you are not using mfput / mfget, simply use the available standard classes.

The concept of multi-step jobs can also be useful for recovering the files generated in the TMPDIR when job execution is interrupted abruptly because the Elapsed time limit was exceeded. In a classic job (comprising only one step), the files are erased from the TMPDIR as soon as the job is killed; consequently, they are permanently lost. One way of remedying this problem is to create a job consisting of two steps: a first step associated with the code execution, then a second, unconditional step, executed whether or not the preceding step ran correctly, which copies the files into a permanent disk space (HOME, WORKDIR or on Ergon). Because the TMPDIR is permanent during a multi-step job, the files created during the first step are still accessible in the second step (example 3).

Example 1: Standard multi-step job

In this first example, we chain the execution of a sequential program (in a sequential class), the execution of an MPI program on 128 processes (in a parallel class), and then the execution of a sequential post-processing code (in a sequential class). The submission file, for example, is the following:

multi-steps.ll
  #=========== Global directives ===========
  # @ job_name = multi-steps
  # @ output = $(job_name).$(step_name).$(jobid)
  # @ error = $(output)
 
  #=========== Step 1 directives ===========
  #======= Sequential preprocessing ========
  # @ step_name = sequential_preprocessing
  # @ job_type = serial
  # @ wall_clock_limit = 600
  # @ queue
 
  #=========== Step 2 directives ===========
  #============= Parallel step =============
  # @ step_name = parallel_step
  # @ dependency = (sequential_preprocessing == 0)
  #   (executed only if previous step completed without error)
  # @ job_type = parallel
  # @ total_tasks = 128
  # @ wall_clock_limit = 3600
  # @ queue
 
  #=========== Step 3 directives ===========
  #======= Sequential postprocessing =======
  # @ step_name = sequential_postprocessing
  # @ dependency = (parallel_step == 0)
  #   (executed only if previous step completed without error)
  # @ job_type = serial
  # @ wall_clock_limit = 1200
  # @ queue
 
  case ${LOADL_STEP_NAME} in
 
    #============ Step 1 commands ============
    #======= Sequential preprocessing ========
    sequential_preprocessing )
      set -x
      cd $TMPDIR
      cp ${LOADL_STEP_INITDIR}/input.data .
      cp ${LOADL_STEP_INITDIR}/prog_parallel.exe .
      cp ${LOADL_STEP_INITDIR}/preprocess.exe .
      ./preprocess.exe
    ;;
 
    #============ Step 2 commands ============
    #============= Parallel step =============
    parallel_step )
      set -x
      cd $TMPDIR
      poe ./prog_parallel.exe
    ;;
 
    #============ Step 3 commands ============
    #======= Sequential postprocessing =======
    sequential_postprocessing )
      set -x
      cd $TMPDIR
      cp ${LOADL_STEP_INITDIR}/postprocess.exe .
      ./postprocess.exe
      cp output.dat ${LOADL_STEP_INITDIR}/.
    ;;
 
  esac

This job is organised in two sections: the first contains all of the directives for the LoadLeveler workload scheduler, while the second groups the batch commands to be executed in the different steps. The two sections could be interleaved, but this would decrease the readability and, therefore, the comprehensibility of the job.

The LoadLeveler directives of each step must end with a # @ queue directive.

It is also necessary to specify the dependency directives if you want to ensure that the different steps are executed correctly, one after the other. Without these directives, all the steps start independently as soon as the necessary resources are available. In this example, the # @ dependency directive of steps 2 and 3 indicates that these steps must not execute until the preceding step has terminated without error (return value equal to zero).
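
In summary, the dependency expressions used on this page compare the return value of a named step and can be combined (the step names below are placeholders):

  # @ dependency = (step_a == 0)
  #   (executed only if step_a terminated without error)
  # @ dependency = (step_a >= 0)
  #   (executed whatever the return value of step_a, even after an error)
  # @ dependency = (step_a >= 0) && (step_b == 0)
  #   (several conditions combined with the && operator, as in example 2)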

The submission of this job actually creates three sub-jobs, each containing the same shell script but using different resources (cores, memory, time); these sub-jobs are only distinguished by their step names. For each of them to execute different commands, you need to branch on the step name (stored in the LOADL_STEP_NAME variable) using a case structure. The use of the case is simple: just follow the example provided above. However, do not forget to add ;; (double semi-colons) to terminate each list of commands, and esac at the end.

To submit this job containing three steps, go into the directory containing the multi-steps.ll file above and type:

$ llsubmit multi-steps.ll

Note that at the start of each step, the current directory is the one from which the job was submitted: it is there that the output files are written. Furthermore, it is essential to specify a different output file for each step; otherwise, the output of the last step overwrites the preceding outputs (see the output line in the submission file).
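
With the output directive used above, each step thus produces its own output file. For a hypothetical job id 123456, the submission directory would contain at the end of the job:

  multi-steps.sequential_preprocessing.123456
  multi-steps.parallel_step.123456
  multi-steps.sequential_postprocessing.123456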

Example 2: Multi-step job for transferring files to and from Ergon

In this second example, the first step consists of copying files from Ergon into the execution directory; the second, of using these files for the parallel execution of a code on 32 processes; and finally, the third step archives the result file on the Ergon machine. Steps 1 and 3, which correspond to data transfers from and towards Ergon using the mfget / mfput commands, are carried out in the dedicated archive class. The submission file, here called multi-steps-archive.ll, is the following:

multi-steps-archive.ll
  #=========== Global directives ===========
  # @ job_name = multi-steps-archive
  # @ output = $(job_name).$(step_name).$(jobid)
  # @ error = $(output)
 
  #=========== Step 1 directives ===========
  #============== get_file =================
  # @ step_name = get_file
  # @ job_type = serial
  # @ class = archive
  # (don't specify wall_clock_limit with this class)
  # @ queue
 
  #=========== Step 2 directives ===========
  #============= Parallel step =============
  # @ step_name = parallel_step
  # @ dependency = (get_file == 0)
  # (executed only if previous step completed without error)
  # @ job_type = parallel
  # @ total_tasks = 32
  # @ wall_clock_limit = 3600
  # @ queue
 
  #=========== Step 3 directives ===========
  #======= Sequential postprocessing =======
  # @ step_name = put_file
  # @ dependency = (parallel_step >= 0) && (get_file == 0)
  # (executed even if previous step completed with an error)
  # @ job_type = serial
  # @ class = archive
  # (don't specify wall_clock_limit with this class)
  # @ queue
 
  case $LOADL_STEP_NAME in
 
    #============ Step 1 commands ============
    #======= Sequential preprocessing ========
    get_file )
      set -ex
      cd $TMPDIR
      mfget dataset1.in dataset.in
      mfget dataset2.in dataset_bis.in
    ;;
 
    #============ Step 2 commands ============
    #============= Parallel step =============
    parallel_step )
      set -x
      cd $TMPDIR
      cp ${LOADL_STEP_INITDIR}/code_MPI.exe .
      ls -al
      ./code_MPI.exe
    ;;
 
    #============ Step 3 commands ============
    #======= Sequential postprocessing =======
    put_file )
      set -x
      cd $TMPDIR
      mfput dataresult.out result.out
    ;;
 
  esac

Note the use of the set -ex command in the first step (-e aborts the script as soon as a command fails; -x traces each command): it causes the immediate interruption of the step if a command returns an error (return code different from zero). Consequently, given the dependency relationship # @ dependency = (get_file == 0), the second step will not run unless both of the mfget commands executed correctly. Without set -e, the dependency test is evaluated only against the return code of the last command of the preceding step.
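
Here is a minimal shell sketch of the difference (the file names are placeholders):

  # Without set -e, execution continues after a failure and the step
  # returns the code of its LAST command:
  mfget missing.in data.in   # fails (non-zero return code)
  ls -al                     # succeeds: the step exits with 0 and the
                             # dependency test (get_file == 0) passes wrongly

  # With set -e, the step stops at the first failing command and returns
  # its non-zero code, so the dependent step is not started.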

Example 3: Multi-step job for the unconditional copying of result files at the end of execution

This third example uses the TMPDIR permanence specific to multi-step jobs to copy the files created during the first step (executed in the temporary TMPDIR) into a permanent file system. This copy is always carried out, whether the preceding step executed correctly or not (for example, after an abrupt stop for exceeding the Elapsed time limit). The submission file, here called multi-steps-copy.ll, is the following:

multi-steps-copy.ll
  #=========== Global directives ===========
  # @ job_name = multi-steps-copy
  # @ output = $(job_name).$(step_name).$(jobid)
  # @ error = $(output)
 
  #=========== Step 1 directives ===========
  #============= Execution step =============
  # @ step_name = execution_step
  # @ job_type = serial
  # @ parallel_threads = 8
  # @ wall_clock_limit = 3600
  # @ queue
 
  #=========== Step 2 directives ===========
  #=============== Copy step ===============
  # @ step_name = copy_file
  # @ dependency = (execution_step >= 0)
  # (executed even if previous step completed with an error)
  # @ job_type = serial
  # @ queue
 
  case ${LOADL_STEP_NAME} in
    #============ Step 1 commands ============
    #============= Execution step =============
    execution_step )
      set -x
      cd $TMPDIR
      cp ${LOADL_STEP_INITDIR}/code_OpenMP.exe .
      ./code_OpenMP.exe
    ;;
 
    #============ Step 2 commands ============
    #=============== Copy step ===============
    copy_file )
      set -x
      cd $TMPDIR
      cp *.res ${LOADL_STEP_INITDIR}/Output
    ;;
 
  esac

If, for example, the execution of code_OpenMP.exe in Step 1 exceeds the Elapsed time limit set at 3600 seconds and the first step is interrupted, the files created in TMPDIR are not erased, thanks to the TMPDIR permanence in multi-step jobs. It is then possible, during the following step, to copy them into a permanent disk space such as the WORKDIR. The TMPDIR is only erased at the very end of the whole job, after Step 2 completes.

Launching a LoadLeveler job from another LoadLeveler job (cascade jobs)

In general, a batch job must be launched from a permanent disk space; its output is lost if it cannot be written in the place from which the submission was done. Yet in all LoadLeveler jobs, it is recommended to move into the temporary TMPDIR disk space to run the executable files (for performance reasons). Because of this, if such a job wishes to submit another job, it must change directory beforehand: one solution consists of submitting all of the chained jobs from the initial submission directory of the first job; this directory is accessible via the LOADL_STEP_INITDIR variable.

Example with two submission scripts to run two executable files

 $ ls
   prog1 prog2 script1.ll script2.ll
   
script1.ll
    # Elapsed time in seconds per process
    # @ wall_clock_limit = 400
    # Name of this LoadLeveler job
    # @ job_name = Sortie1
    # job Standard Output file name
    # @ output = $(job_name).$(jobid)
    # job Standard Error file name
    # @ error = $(job_name).$(jobid)
    # @ queue
 
    set -x
    cd $TMPDIR
    cp ${LOADL_STEP_INITDIR}/prog1 .
    ./prog1 > resultat1
    cp resultat1 ${LOADL_STEP_INITDIR}
 
    # Launching of the second script which is
    # also found in the submission directory
    cd ${LOADL_STEP_INITDIR}
    llsubmit script2.ll
script2.ll
    # Elapsed time in seconds per process
    # @ wall_clock_limit = 400
    # Name of this LoadLeveler job
    # @ job_name = Sortie2
    # job Standard Output file name
    # @ output = $(job_name).$(jobid)
    # job Standard Error file name
    # @ error = $(job_name).$(jobid)
    # @ queue
 
    set -x
    cd $TMPDIR
    cp ${LOADL_STEP_INITDIR}/prog2 .
    ./prog2 > resultat2
    cp resultat2 ${LOADL_STEP_INITDIR}

To submit the entire chain (script1, which itself submits script2), simply submit the first script:

$ llsubmit script1.ll