Ada : Executing a hybrid MPI/OpenMP job in batch under the Intel environment

Jobs are managed on all the nodes by the LoadLeveler software. They are distributed into classes mainly according to the requested elapsed time, number of cores and memory. You may consult the structure of the classes on Ada here.
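
As a quick check, the standard LoadLeveler command llclass, run from a front-end node, lists the classes actually configured on the machine (the classes and limits displayed depend on the IDRIS configuration):

$ llclass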

To optimise code performance during executions under an Intel environment, IDRIS automatically sets certain environment variables. In particular, the Intel I_MPI_PIN_DOMAIN variable is set so that the MPI processes and their threads (hybrid jobs) are bound to the physical cores of the machine.
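
If you need to verify or adjust this binding for a particular run, the following sketch can be added to the submission script before the mpirun line; it relies on the standard Intel MPI variables I_MPI_DEBUG and I_MPI_PIN_DOMAIN (adapt the values to your needs):

# Print the process pinning chosen by Intel MPI at start-up
export I_MPI_DEBUG=4
# Only if you need to override the default set by IDRIS:
# give each MPI process a domain of OMP_NUM_THREADS cores
export I_MPI_PIN_DOMAIN=omp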

Attention: A code compiled under intel/2015.2 or intel/2016.2 CANNOT be executed under the IBM environment (via poe): the IBM MPI library currently used is incompatible with the Intel MPI 5.0.xx and 5.1.xx libraries which are part of these new Intel environments.
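
If you are not sure which MPI library an existing binary was linked against, a quick check (assuming dynamic linking) is to list its shared library dependencies:

$ ldd ./a.out | grep -i mpi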

Attention: The intel/2013.0 environment cannot be used for a batch execution (LoadLeveler job) because the associated mpirun command is buggy and returns the following error message:

$ mpirun -np 8 ./a.out
[mpiexec@ada295] HYDT_bscd_ll_launch_procs (./tools/bootstrap/external/ll_launch.c:67): ll does not support user-defined host lists
*** glibc detected *** mpiexec.hydra: munmap_chunk(): invalid pointer: 0x00000000024bb660 ***
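
To avoid this problem, load one of the more recent Intel environments before compiling and submitting, for example:

$ module avail intel        # list the available Intel environments
$ module load intel/2016.2  # load a recent environment
$ module list               # check which modules are currently loaded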

To submit a hybrid MPI + threads (OpenMP or pthreads) job in batch, you must:

  • Create a submission script. The following is an example saved in the intel_mixte.ll file:
intel_mixte.ll
# Arbitrary name of the LoadLeveler job
# @ job_name = Intel_mixte
# Standard job output file
# @ output   = $(job_name).$(jobid)
# Job error output file
# @ error    = $(job_name).$(jobid)
# Type of job
# @ job_type = mpich
# Number of MPI processes requested
# @ total_tasks = 8
# Number of OpenMP/pthread threads per MPI process
# @ nb_threads  = 4
# @ resources = ConsumableCpus($(nb_threads))
# Pass the number of MPI processes to mpirun via NB_TASKS
# and the number of threads per process via OMP_NUM_THREADS
# @ environment = OMP_NUM_THREADS=$(nb_threads); NB_TASKS=$(total_tasks)
# Job time hh:mm:ss (1h30mn here)
# @ wall_clock_limit = 1:30:00
# @ queue
 
# Recommendation: Compile and execute your codes under the same Intel environment.
# If necessary, use the module command to load the appropriate environment.
# For example, if your code was compiled with intel/2016.2, uncomment the following line:
#module load intel/2016.2
 
# Echo the commands as they are executed
set -x
 
# Temporary work directory
cd $TMPDIR
# The LOADL_STEP_INITDIR variable is automatically set by
# LoadLeveler to the directory where the llsubmit command is typed
cp $LOADL_STEP_INITDIR/a.out .
# Maximum STACK memory used by the private variables
# of each thread (default 4 MB, here 16 MB).
export KMP_STACKSIZE=16m
# It is also possible to use OMP_STACKSIZE
# export OMP_STACKSIZE=16M
 
# Execution of a parallel hybrid program (MPI + threads).
mpirun -np $NB_TASKS ./a.out
  • Submit this script via the llsubmit command:
$ llsubmit  intel_mixte.ll
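
Once the job is submitted, the usual LoadLeveler commands can be used to follow it, for example:

$ llq -u $USER          # list your jobs and their states
$ llcancel <job_id>     # cancel a job if necessary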

Comments:

  • We recommend that you compile and execute your codes under the same Intel environment: use the same module load intel/… command at execution as at compilation.
  • In this example, we assume that the a.out executable is located in the submission directory, i.e. the directory in which the llsubmit command is entered. (The LOADL_STEP_INITDIR variable is automatically set by LoadLeveler.)
  • The Intel_mixte.jobid computation output file will also be created in the submission directory: editing or modifying it while the job is running can block the execution.
  • For an Intel type execution, you must specify # @ job_type = mpich (instead of # @ job_type = parallel for IBM poe).
  • The number of MPI processes is indicated by the directive # @ total_tasks =… as for a hybrid IBM job.
  • The number of threads per MPI process is indicated by two directives: # @ nb_threads = … (do not use any name other than nb_threads) and # @ resources = ConsumableCpus($(nb_threads)) (instead of # @ parallel_threads = … for a hybrid IBM job). This trick, permitted by LoadLeveler, allows reusing the nb_threads variable in the other LoadLeveler directives and transmitting its value to the execution environment. In this way, the number of threads per MPI process is defined only once and its value is automatically transferred into the shell instructions of the script, which avoids reserving more resources than are really used.
  • ATTENTION: The number of MPI processes (total_tasks) and the number of threads per MPI process (nb_threads) must be chosen so that the total number of reserved cores (total_tasks * nb_threads) is less than or equal to 2048.
  • The binary is executed via the mpirun command, with the total number of MPI processes passed as a parameter (mpirun -np $NB_TASKS ./a.out).
  • Note that the number of MPI processes per compute node is automatically indicated via the I_MPI_PERHOST environment variable.
  • Memory: The default value is 3.5 GB per reserved core (therefore, per thread). If you request more than 64 cores (total_tasks * nb_threads > 64), you cannot go beyond this 3.5 GB limit. If you reserve 64 or fewer cores, you may request up to 7.0 GB per reserved core via the as_limit keyword. Note that the limit is specified per MPI process, i.e. at most nb_threads * 7.0 GB. For example, if each MPI process generates 4 threads: # @ as_limit = 28.0gb (see the sketch after these comments).
  • The private OpenMP variables are stored in memory zones called STACKs, one associated with each thread; each has a default limit of 4 MB. To go beyond this limit, for example up to 16 MB per thread, use the KMP_STACKSIZE=16m or OMP_STACKSIZE=16M environment variable. Note that when KMP_STACKSIZE is set, OMP_STACKSIZE is automatically adjusted to the same value.
  • If your job contains relatively long sequential commands (pre- or post-processing, transfers or archiving of large files, …), the use of multi-step jobs may be justified.
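
As an illustration of the memory comment above, here is a minimal sketch of the corresponding LoadLeveler directives for an MPI process generating 4 threads, using the 7.0 GB per reserved core limit stated above for jobs of 64 cores or fewer:

# @ nb_threads = 4
# @ resources  = ConsumableCpus($(nb_threads))
# 4 threads * 7.0 GB per reserved core = 28.0 GB per MPI process
# @ as_limit   = 28.0gb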