Ada: Execution of a sequential job in batch

The jobs are managed on all the nodes by the LoadLeveler software.  They are distributed into classes principally according to the Elapsed time, the number of cores, and the memory requested.  You can consult the structure of the batch classes on Ada here.

To submit a batch job from Ada, it is necessary to do the following:

  • Create a submission script. Here is an example stored in the file pg_seq.ll:
  $ more pg_seq.ll

  # @ job_type = serial
  # Maximum Elapsed time of the job hh:mm:ss (here, 1h30min)
  # @ wall_clock_limit=1:30:00
  # Name of the LoadLeveler job
  # @ job_name = Mono
  # Standard output file of the job
  # @ output   = $(job_name).$(jobid)
  # Error output file of the job
  # @ error    = $(job_name).$(jobid)
  # @ queue
  # Echo the commands (Bash shell)
  set -x
  # Go into the temporary directory TMPDIR
  cd $TMPDIR
  # The LOADL_STEP_INITDIR variable is automatically set by
  # LoadLeveler to the directory in which you type the command llsubmit
  cp $LOADL_STEP_INITDIR/a.out .
  # Execution of the program
  ./a.out
  ls -lrt
  • Submit this script (only from Ada) via the command llsubmit:
 $ llsubmit pg_seq.ll
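
Once the job is submitted, llsubmit prints the identifier assigned to it. As a reminder (llq and llcancel are standard LoadLeveler commands mentioned here only as a pointer, not as part of this example), you can follow or cancel the job:

 $ llq -u $USER        # list your jobs and their current status
 $ llcancel <job_id>   # cancel a job if necessary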

Remarks

  • Since 4 March 2014, the MP_USE_BULK_XFER variable has been set to yes by default in order to activate RDMA. This functionality improves the performance of collective communications as well as the computation/communication overlap. However, some codes can show reduced performance when this variable is set to yes. You can deactivate RDMA for your code by setting the variable to no just before the execution of your binary (export MP_USE_BULK_XFER=no or setenv MP_USE_BULK_XFER no); a sketch is given at the end of this section.
  • In this example, we assume that the executable file a.out is found in the submission directory, that is, the directory from which the llsubmit command is entered (this directory is automatically referenced by LoadLeveler in the LOADL_STEP_INITDIR variable).
  • The computation output file Mono.job_number is also found in the submission directory.  It is created at the beginning of the job execution; editing or modifying this file while the job is running can disrupt the job.
  • Memory: The default value is 3.5 GB.  The maximum value that you can request is 20.0 GB via the keyword # @ as_limit = 20.0gb (see the sketch at the end of this section).
  • The Elapsed time limit associated with the keyword wall_clock_limit is relative to the entire job.  It is also possible to limit the CPU time of each command executed in the job with the keyword cpu_limit. The combined use of the two keywords wall_clock_limit and cpu_limit allows you to ensure the execution of the last instructions of a job (those following the execution of your binary file):
# @ job_type = serial
# @ wall_clock_limit=1:00:00
# @ cpu_limit=45:00
# @ job_name = myjob
# @ output = $(job_name).$(jobid)
# @ error  = $(job_name).$(jobid)
# @ environment = MY_JOBID=$(jobid)
# @ queue

set -x

# Copy the files you need for your computation into the TMPDIR:
cp -p ... $TMPDIR
cd $TMPDIR

# Run:
./my_prog

ls -alrt

# Save the result files which interest you into the WORKDIR:
mkdir $WORKDIR/results.${MY_JOBID}
cp ... $WORKDIR/results.${MY_JOBID}

# Since the WORKDIR has no backup, save the important result files in Ergon:
mfput -v ...

The execution of your binary file consumes most of the job's Elapsed and CPU time.  It is, therefore, your binary file which will reach the CPU time limit (set by the keyword cpu_limit).  However, you do not know in advance how long this computing phase will last.  Moreover, the TMPDIR directory is automatically deleted at the end of the job, so you must save your files before this deletion occurs.  To be sure that enough time remains to carry out these final copies, choose a cpu_limit value sufficiently lower than the job's wall_clock_limit value (in the example above, the 45-minute cpu_limit leaves a margin with respect to the one-hour wall_clock_limit). Writing files to disk can be significantly slowed down by the overall load of the machine: the Elapsed time of the my_prog executable will fluctuate, but not its CPU time. It is therefore wise to leave a comfortable margin between the wall_clock_limit and the cpu_limit. It is not possible to give a more precise guideline because the CPU/Elapsed time ratio varies for each executable: you must proceed by successive trials.
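
To illustrate the remarks on memory and on RDMA above, here is a minimal sketch built from the pg_seq.ll example (the 20.0gb value and the a.out binary are only illustrative, not recommendations): the as_limit keyword is added to the job header, and the MP_USE_BULK_XFER deactivation is placed just before the execution of the binary.

# @ job_type = serial
# @ wall_clock_limit=1:30:00
# Request more memory than the 3.5 GB default (20.0 GB is the maximum)
# @ as_limit = 20.0gb
# @ job_name = Mono
# @ output   = $(job_name).$(jobid)
# @ error    = $(job_name).$(jobid)
# @ queue

set -x
cd $TMPDIR
cp $LOADL_STEP_INITDIR/a.out .
# Deactivate RDMA just before the execution of the binary
export MP_USE_BULK_XFER=no
./a.out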