Turing: Execution of a parallel code in batch

The LoadLeveler system is responsible for managing jobs on all the nodes.   In order to submit a batch job, you must begin by writing a submission script.  This step is explained below.  (The submission commands and job follow-up are explained in detail on another page:  control commands for batch jobs.)

Multi-step jobs are addressed on a dedicated page, and multi-step jobs with file transfers on another.

Simple parallel jobs

The following is an example of a job submission to execute a code on 1024 cores (64 compute nodes, each with 16 cores).  Assuming the submission file is called job.ll, the job is submitted via the command:

  llsubmit job.ll

The submission file contains the following lines:

job.ll
  # @ job_name = job_simple
  # @ job_type = BLUEGENE
  # Job standard output file
  # @ output = $(job_name).$(jobid)
  # Job standard error file
  # @ error = $(output)
  # Maximum elapsed time request
  # @ wall_clock_limit = 1:00:00
  # Execution block size 
  # @ bg_size = 64
  # @ queue
 
  # Echo the commands executed by the script
  set -x
 
  # Copy executable and input file to TMPDIR
  # Warning: if you need to transfer large volumes
  # of data, please use a multi-step job
  cp my_code $TMPDIR
  cp data.in $TMPDIR
  cd $TMPDIR
 
  # Run the job with 32 processes per compute node (2 processes per core,
  # maximum allowed: 4 processes per core) for a total of 2048 processes
  runjob --ranks-per-node 32 --np 2048 : ./my_code my_args
 
  # Copy output file to submission directory
  # Warning: if you need to transfer large volumes
  # of data, please use a multi-step job
  # $LOADL_STEP_INITDIR is the submission directory
  cp data.out $LOADL_STEP_INITDIR/

The submission script is separated into two parts:  the first part contains the LoadLeveler directives, which are the lines beginning with # @ (with or without a space between # and @); the second part contains the shell script to be executed.
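For illustration, a minimal sketch of this two-part structure might look as follows (the directive values and the runjob line are placeholders to adapt to your own job):

  # Part 1: LoadLeveler directives (the queue directive must come last)
  # @ job_type = BLUEGENE
  # @ bg_size = 64
  # @ wall_clock_limit = 0:30:00
  # @ queue

  # Part 2: shell script executed when the job starts
  set -x
  runjob --ranks-per-node 16 --np 1024 : ./my_code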

Attention:  If you need to copy or move large volumes of data, use multi-step jobs or, if you are transferring files with Ergon, multi-step jobs with file transfers.

LoadLeveler Directives

A job on the Blue Gene/Q must contain a certain number of LoadLeveler directives in order to run correctly.  (Directive values are case-insensitive, except for file names.)

  • Obligatory directives:
    • # @ job_type = BLUEGENE: specifies the type of job step.  Only BLUEGENE gives access to the compute nodes; SERIAL can be used for pre- or post-processing phases on the front end or for file transfers (see multi-step jobs and multi-step jobs with file transfers).
    • # @ bg_size:  the size of the reserved execution block.  This variable can only take certain values (the system rejects the job otherwise): 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, or 4096 (see this page for the reasons).
    • # @ wall_clock_limit:  maximum duration of the code execution (elapsed, or wall clock, time), given either in the format HH:MM:SS or directly in seconds.
    • # @ queue:  the last LoadLeveler directive; it marks the end of the job step.  (Any further directives are ignored.)
  • Optional directives:
    • # @ bg_connectivity can take the values Torus, Mesh, Either, or a per-dimension choice for the first four dimensions of the 5D torus (for example, Torus Mesh Mesh Torus requests a torus in the 1st and 4th dimensions).  The 5th dimension is never specified because it is always a torus of size 2.  A real 5D torus requires an execution block of at least one midplane (512 compute nodes); with that many nodes you can request a torus in all 5 dimensions, although the waiting time may be longer.  Counting the 5th dimension (always a torus), a maximum of 2 torus dimensions is possible with 64 nodes, 3 with 128 nodes, and 4 with 256 nodes.  The default is Mesh.  Activating the torus can have a significantly positive impact on communication performance (an example header using this directive is sketched after this list).
    • # @ job_name:  the name of your job (free choice).
    • # @ output:  name of the standard output file.
    • # @ error:  name of the standard error file.  To put the standard error in the same file as the standard output, it is sufficient to set this directive to $(output).
    • # @ notification:  if set to complete, you will automatically receive a summary at the end of the execution.
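
The sketch below combines several of these directives in a single job header; it is only an illustration, and the values shown (job name, two-hour limit, midplane-sized block with Torus connectivity) are arbitrary examples to adapt to your own job:

  # @ job_name         = my_run
  # @ job_type         = BLUEGENE
  # @ output           = $(job_name).$(jobid)
  # @ error            = $(output)
  # @ wall_clock_limit = 2:00:00
  # @ bg_size          = 512
  # @ bg_connectivity  = Torus
  # @ notification     = complete
  # @ queue

As required, # @ queue remains the last directive; the shell script to execute follows it.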

This is not an exhaustive list:  you can find all the directives in the IBM documentation.  (Some of them are neither adapted to nor allowed on the IDRIS Turing machine.)

Script

In general, the executed script should not contain sequential phases (except for very rapid operations such as copying a small file or changing directories), so that the reserved compute cores are not left waiting idle; the time wasted is counted against your allocation of hours.  For sequential phases, it is necessary to use multi-step jobs (see multi-step jobs and multi-step jobs with file transfers).

The execution of a parallel code is always done by calling the runjob command.  You must specify the number of MPI processes to run (option --np), whose value must be chosen carefully, as well as the number of processes to run on each compute node (option --ranks-per-node, see here).  The mapping choice (option --mapping) can have an important influence on the performance of MPI communications (see here); the advised value is ABCDET (the default), and this option is not obligatory.  The name of the executable file and the list of its arguments are given at the end, just after the “:” symbol.
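
As an illustrative sketch, the call below combines these options (the executable name my_code and its arguments are placeholders; with a block of 64 compute nodes and 16 ranks per node, --np 1024 fills the block exactly):

  # 16 MPI processes per node on a 64-node block, i.e. 1024 processes in total;
  # the default ABCDET mapping is made explicit here
  runjob --ranks-per-node 16 --np 1024 --mapping ABCDET : ./my_code arg1 arg2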

Useful options for runjob are given on this page.