Ada : MPMD execution of a code coupling in batch under the Intel environment

MPMD (Multiple Program Multiple Data) execution is supported on Ada under the Intel environment. Several different executables are launched and communicate with each other by using MPI. All the MPI processes are included within the same MPI_COMM_WORLD communicator.

By request only, it is possible to access a relatively unrestrictive MPMD mode which allows bypassing certain limits of the “classical” MPMD modes when executing heterogeneous programs.

MPMD coupling of MPI codes

As an example, the following is a batch job which carries out the coupling of 3 MPI codes; 8 MPI processes are generated:

  • The MPI process rank 0 is generated from the executable ./a.out.
  • The 4 MPI processes, ranks 1-4, are generated from ./b.out.
  • The 3 MPI processes, ranks 5-7, are generated from ./c.out.
intel_mpmd_mpi.ll
# @ job_name = intel_mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = mpich
# @ total_tasks = 8
# @ wall_clock_limit = 0:30:00
# @ queue
 
# Recommendation: Compile and execute your codes under the same Intel environment.
# Therefore, if necessary, use the module command to load the appropriate environment.
# For example, if your code is compiled with Intel/2016.2, uncomment the following line:
# module load intel/2016.2
 
# MPMD execution
mpirun -np 1 ./a.out : -np 4 ./b.out : -np 3 ./c.out

MPMD coupling of hybrid MPI/OpenMP codes

MPMD coupling of hybrid MPI/OpenMP codes is carried out in the same way as above, but it is also necessary to specify the number of OpenMP threads to use for each executable. This can be done by putting the option -env OMP_NUM_THREADS=<number_of_threads> before each executable appearing on the command line which defines the coupling.

The intel_mpmd_mpi_openmp.ll batch job below carries out the coupling of 3 MPI/OpenMP codes; 13 MPI processes are generated:

  • The MPI process rank 0 is generated from the executable ./a.out.
  • The 4 MPI processes, ranks 1-4, are generated from ./b.out. Each process generates 2 OpenMP threads.
  • The 8 MPI processes, ranks 5-12, are generated from ./c.out. Each process generates 4 OpenMP threads.

The resources reserved with the directives of the LoadLeveler batch manager are the following:

  • # @ total_tasks = 13 (13 MPI processes)
  • # @ nb_threads = 4 together with # @ resources = ConsumableCpus($(nb_threads)) (4 being the maximum number of threads per MPI process)

In this way, 13 × 4 = 52 cores are reserved (that is, 2 Ada nodes of 32 cores each).

intel_mpmd_mpi_openmp.ll
# @ job_name = intel_mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = mpich
# @ total_tasks = 13
# @ nb_threads  = 4
# @ resources = ConsumableCpus($(nb_threads))
# @ wall_clock_limit = 0:30:00
# @ queue
 
# Recommendation: Compile and execute your codes under the same Intel environment.
# Therefore, if necessary, use the module command to load the appropriate environment.
# For example, if your code is compiled with Intel/2016.2, uncomment the following line:
# module load intel/2016.2
 
# MPMD execution
mpirun -np 1 -env OMP_NUM_THREADS=1 ./a_mixte.out : -np 4 -env OMP_NUM_THREADS=2 ./b_mixte.out : -np 8  -env OMP_NUM_THREADS=4 ./c_mixte.out

Advanced MPMD coupling of heterogeneous codes

The couplings presented above are relatively simple to set up but, as soon as the codes are heterogeneous, coupling them has two major drawbacks:

  • The same quantity of memory (by default, 3.5 GB per thread) is allocated to each executable, with no possibility of going beyond this even if some of the memory remains unused.
  • The same number of cores is reserved for each executable which sometimes causes an overbooking of resources. For example, to couple 16 purely MPI processes with 8 hybrid (MPI/OpenMP) processes, each using 2 OpenMP threads, it would be necessary to reserve (16 + 8) x 2 = 48 cores or 2 Ada nodes whereas only one node would have theoretically been sufficient (16 + 8 x 2 = 32 cores).

To avoid these limitations, we offer (by request only) the possibility of accessing an advanced MPMD mode which allows reserving complete Ada nodes. The user is then responsible for:

  • The distribution of the processes on the assigned nodes.
  • The placement of the processes (and, if applicable, of the threads) on the cores of each node.

In this execution mode, the entire memory of the node (128 GB, except for the large-memory nodes which have 256 GB of RAM) is available without any restriction being applied to the processes. It is the responsibility of the user to ensure that the total memory used by all the processes remains below the quantity of memory available on the node.

Requests for access should be addressed to the IDRIS user support team.

Reserving nodes

The LoadLeveler directive # @ node = N allows reserving N complete nodes. The environment variable LOADL_PROCESSOR_LIST, set by LoadLeveler, gives the list of nodes assigned to the running job.

To reserve the large-memory nodes, it is necessary to add the directive # @ requirements = (Memory > 200000) to the submission script.
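
As a sketch, the header of a submission script reserving 2 complete large-memory nodes could combine these directives as follows (the other directives are the same as in the examples on this page); the list of assigned nodes can then be displayed from within the job:

# @ job_name = intel_mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = mpich
# @ node = 2
# @ requirements = (Memory > 200000)
# @ wall_clock_limit = 0:30:00
# @ queue
 
# Display the list of nodes assigned to the job by LoadLeveler
echo "Assigned nodes: $LOADL_PROCESSOR_LIST"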

Launching the processes

The processes are launched by building a configuration file which is then passed to the mpirun command. In this case, -configfile path/to/the/config/file must be the one and only option of the mpirun command.

Each line of the configuration file describes the running of an executable file on a specific node:

  • The option -host node_name allows specifying the node on which the executable must be launched.
  • The option -n M allows specifying the number M of MPI processes to start on the node chosen for this executable (see the minimal sketch below).
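
For example, a minimal configuration file (without any explicit process pinning) launching 4 MPI processes of ./a.out on the first assigned node and 2 MPI processes of ./b.out on the second one could be built as in the following sketch, assuming that the nodes array has been filled from LOADL_PROCESSOR_LIST as in the complete examples further below:

cat <<EOF > $TMPDIR/configfile
-host ${nodes[0]} -n 4 ./a.out
-host ${nodes[1]} -n 2 ./b.out
EOF
 
mpirun -configfile $TMPDIR/configfile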

The binding of the processes to one core (pure MPI code) or to several cores (hybrid code) is carried out via the I_MPI_PIN_DOMAIN environment variable. This variable can be set for all the nodes prior to the mpirun call, or independently for each node by using the -env option of mpirun (it is not necessary to repeat this option for all the executables launched on the same node).

The I_MPI_PIN_DOMAIN environment variable is a list of the form [Mask1, Mask2, …, MaskP] where:

  • P is the total number of MPI processes launched on the node.
  • MaskI is a hexadecimal mask which defines the set of cores to which the MPI process of rank I on the node is bound: core J is part of the rank I set if bit J of MaskI is set to 1. For a hybrid code, the number of bits set to 1 defines the number of OpenMP threads used. For example, if the mask has the hexadecimal value A (binary 1010), then the executable will be attached to cores 1 and 3.

Attention: The I_MPI_PIN_DOMAIN variable numbers the logical cores (and not the physical cores). Since HyperThreading is activated on Ada, the cores of each compute node (32 physical cores) can therefore be numbered from 0 to 63 (64 logical cores).
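
To avoid computing these hexadecimal masks by hand, a small shell helper can be used. The following is only a sketch (the core_mask function is purely illustrative and not provided by IDRIS): it converts a list of logical core numbers into the corresponding hexadecimal mask.

# Convert a list of logical core numbers into a hexadecimal mask usable in I_MPI_PIN_DOMAIN
core_mask () {
    local mask=0 core
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '%X\n' "$mask"
}
 
core_mask 1 3                        # prints A       (cores 1 and 3, as in the example above)
core_mask 16 17 18 19 20 21 22 23    # prints FF0000  (cores 16 to 23)

The masks obtained in this way are then concatenated, separated by commas, to build the value of I_MPI_PIN_DOMAIN as in the submission scripts below.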

Examples of submission scripts

  • Example using different distributions of the executables on the nodes assigned to the job:
intel_mpmd_avance.ll
# @ job_name = intel_mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = mpich
# @ node = 2
# @ wall_clock_limit = 0:30:00
# @ queue
 
 
# Recommendation: Compile and execute your codes under the same Intel environment.
# Therefore, if necessary, use the module command to load the appropriate environment.
# For example, if your code is compiled with Intel/2016.2, uncomment the following line:
# module load intel/2016.2
 
# Construction of an array containing the names of the different nodes
IFS=', ' read -r -a nodes <<< "$LOADL_PROCESSOR_LIST"
 
# Creation of the configuration file to launch:
# - 46 pure MPI processes "a.out" : 31 on the first node (cores 0 to 30) and 15 on the second node (cores 0 to 14)
# - 2 pure MPI processes "b.out": 1 on the first node (core 31) and 1 on the second node (core 15)
# - 2 hybrid MPI processes "c_mixte.out" on the second node, each one having 8 OpenMP threads  (on cores 16 to 23 and 24 to 31 respectively)
cat <<EOF > $TMPDIR/configfile
-host ${nodes[0]} -n 31 -env I_MPI_PIN_DOMAIN=[1,2,4,8,10,20,40,80,100,200,400,800,1000,2000,4000,8000,10000,20000,40000,80000,100000,200000,400000,800000,1000000,2000000,4000000,8000000,10000000,20000000,40000000,80000000] ./a.out
-host ${nodes[0]} -n 1 ./b.out
-host ${nodes[1]} -n 15 -env I_MPI_PIN_DOMAIN=[1,2,4,8,10,20,40,80,100,200,400,800,1000,2000,4000,8000,FF0000,FF000000] ./a.out
-host ${nodes[1]} -n 1 ./b.out
-host ${nodes[1]} -n 2 ./c_mixte.out
EOF
 
mpirun -configfile $TMPDIR/configfile
  • Simple example using the same distribution of the executables on all the nodes assigned to the job:
intel_mpmd_avance.ll
# @ job_name = intel_mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = mpich
# @ node = 2
# @ wall_clock_limit = 0:30:00
# @ queue
 
# Creation of the configuration file to launch:
# - 2 hybrid MPI processes "a_mixte.out"
# - 1 pure MPI process "b.out"
# - 1 pure MPI process "c.out"
# per compute node.
for host in $LOADL_PROCESSOR_LIST
do
	echo "-host $host -n 2 ./a_mixte.out" >> $TMPDIR/configfile
	echo "-host $host -n 1 ./b.out" >> $TMPDIR/configfile
	echo "-host $host -n 1 ./c.out" >> $TMPDIR/configfile
done
 
# Use the same placement for all the compute nodes:
# - the first "a_mixte.out" process of the node has 2 threads attached to cores 0 and 2 (0x5 = 101 in binary)
# - the second "a_mixte.out" process of the node has 2 threads attached to cores 1 and 3 (0xA = 1010 in binary)
# - the "b.out" process of the node is attached to core 4 (0x10 = 10000 in binary)
# - the "c.out" process of the node is attached to core 5 (0x20 = 100000 in binary)
export I_MPI_PIN_DOMAIN=[5,A,10,20]
 
mpirun -configfile $TMPDIR/configfile
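
These jobs are submitted in the usual way with the LoadLeveler llsubmit command, for example:

llsubmit intel_mpmd_avance.ll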