Turing:  Code execution in multi-step jobs with file transfers using Ergon

Attention:  This section only concerns jobs that use the Ergon machine for file transfers.  In other cases, refer to the page for multi-step jobs.

Jobs often consist of a phase that retrieves files from Ergon, a parallel computing phase, and finally an archiving phase that copies the result files back to Ergon.

It would be wasteful to monopolize hundreds of Blue Gene/Q cores during the file transfer phases, which are purely sequential.  For this reason, it is mandatory to use LoadLeveler job steps when transferring files with Ergon.

Below is an example of a job that chains data retrieval from Ergon, execution of an MPI programme on 1024 cores, and finally the archiving of results.  The submission script, called job_multi_transfer.ll, is the following:

job_multi_transfer.ll
  #=========== Global directives ===========
  #@ shell    = /bin/bash
  #@ job_name = test_multi-steps
  #@ output   = $(job_name).$(step_name).$(jobid)
  #@ error    = $(output)
 
  #=========== Step 1 directives ===========
  #======= Sequential preprocessing ========
  #@ step_name = sequential_preprocessing
  #@ job_type  = serial
  #@ class     = archive
  #@ queue
 
  #=========== Step 2 directives ===========
  #============= Parallel step =============
  #@ step_name  = parallel_step
  #@ dependency = (sequential_preprocessing == 0)
  # (executed only if previous step completed without error)
  #@ job_type   = bluegene
  #@ bg_size    = 64
  #@ wall_clock_limit = 1:00:00
  #@ queue
 
  #=========== Step 3 directives ===========
  #======= Sequential postprocessing =======
  #@ step_name  = sequential_postprocessing
  #@ dependency = (parallel_step >= 0)
  # (executed even if previous step completed with an error)
  #@ job_type   = serial
  #@ class      = archive
  #@ queue
 
  case $LOADL_STEP_NAME in
 
    #============ Step 1 commands ============
    #======= Sequential preprocessing ========
    sequential_preprocessing )
      set -ex
      cd $tmpdir
 
      mfget input_par/parameters.nml
      mfget inputs/big_data.in
      ;;
 
    #============ Step 2 commands ============
    #============= Parallel step =============
    parallel_step )
      set -x
      cd $tmpdir
      runjob --ranks-per-node 32 --np 2048 --mapping ABCDET : $LOADL_STEP_INITDIR/exe/my_exec my_args
      ;;
 
    #============ Step 3 commands ============
    #======= Sequential postprocessing =======
    sequential_postprocessing )
      set -x
      cd $tmpdir
      mfput big_result.tar outputs/big_result.tar
      ;;
  esac
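
The text mentions 1024 cores while the runjob line requests 2048 processes. The sketch below works through that arithmetic. It assumes the standard Blue Gene/Q layout of 16 compute cores per node, which is not stated on this page; the variable names are illustrative only.

```shell
# Assumption: a Blue Gene/Q compute node has 16 compute cores.
cores_per_node=16
bg_size=64            # compute nodes requested by "#@ bg_size = 64"
ranks_per_node=32     # value passed to "runjob --ranks-per-node"

total_cores=$(( bg_size * cores_per_node ))            # cores reserved by the step
total_ranks=$(( bg_size * ranks_per_node ))            # must match "runjob --np"
ranks_per_core=$(( ranks_per_node / cores_per_node ))  # MPI processes per core

echo "cores=$total_cores ranks=$total_ranks ranks_per_core=$ranks_per_core"
```

Under that assumption, the parallel step reserves 1024 cores and runs 2 MPI processes on each, which gives the 2048 passed to --np.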

In order to submit this job with three steps, go into the directory containing job_multi_transfer.ll and enter:

  llsubmit job_multi_transfer.ll
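
The same script is executed once per step, and the case statement on $LOADL_STEP_NAME selects the commands for the current step. The sketch below is a way to dry-run that dispatch on any machine with bash. The function describe_step is a hypothetical helper that only prints what each step would do; it calls neither mfget, runjob nor mfput.

```shell
# Hypothetical helper: prints a description of a step instead of running it.
describe_step() {
  case "$1" in
    sequential_preprocessing )  echo "mfget inputs from Ergon" ;;
    parallel_step )             echo "runjob on the Blue Gene/Q" ;;
    sequential_postprocessing ) echo "mfput results to Ergon" ;;
    * ) echo "unknown step: $1" >&2; return 1 ;;
  esac
}

# In a real job LoadLeveler sets LOADL_STEP_NAME; here it is set by hand.
for LOADL_STEP_NAME in sequential_preprocessing parallel_step sequential_postprocessing; do
  describe_step "$LOADL_STEP_NAME"
done
```

A step name that does not match any branch returns a non-zero status here, whereas the job script above would silently do nothing for it; adding a default branch to the real script is a design choice, not a requirement.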

The only difference between this job and a multi-step job that does not transfer files with Ergon is an additional LoadLeveler directive in the sequential steps.  When using Ergon for file transfers, it is mandatory to use the following LoadLeveler directive:

  #@ class = archive

This directive specifies that the sequential steps are to be executed in a special class which is strictly reserved for the Ergon file transfer commands mfput and mfget.  This class directive must not be specified in any job that does not use Ergon.

Unlike the standard classes, the archive class can be paused by IDRIS in the event that the Ergon machine is unavailable.  In this way, jobs will not be lost: when the Ergon machine becomes available again, the class is reopened and the jobs continue to run normally.

Please see multi-step jobs (not using Ergon) for further explanations about their usage.