Turing: MPMD execution of a code coupling in batch

The MPMD (Multiple Program Multiple Data) execution model is supported on Turing: different executables are launched and communicate with each other through MPI. All the MPI processes are contained in the same MPI_COMM_WORLD communicator.

On Turing, this execution model is implemented with the help of a text file (called mapfile in the examples). This file contains two sections, illustrated by the minimal sketch below:

  • A first section specifying the associations between the MPI processes and the executable files, using the directives #mpmdbegin, #mpmdcmd and #mpmdend
  • A second section giving the coordinates of each MPI process within the ABCDET torus: the first line contains the ABCDET coordinates of MPI process 0, the second line those of MPI process 1, and so on.
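
For illustration, a minimal mapfile coupling two single-process executables placed on two different nodes could look as follows (the executable names prog_a.out and prog_b.out are hypothetical):

#mpmdbegin 0-0
#mpmdcmd ./prog_a.out
#mpmdend
#mpmdbegin 1-1
#mpmdcmd ./prog_b.out
#mpmdend
0 0 0 0 0 0
0 0 0 0 1 0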

Example of MPMD coupling of a few MPI processes

The mpmd.ll example describes a 3-node execution of 7 MPI processes generated from 3 MPI/OpenMP codes:

  • The MPI process rank 0, generated from ./a_mixte.out, will run on one node. The execution core coordinates will be A=B=C=D=E=T=0.
  • The 2 MPI processes, ranks 1 and 2, generated from ./b_mixte.out, will run on one node. The execution core coordinates will be: A=B=C=D=0, E=1, T=0..1
  • The 4 MPI processes from ranks 3 to 6, generated from ./c_mixte.out, will run on one node. The execution core coordinates will be: A=B=C=0, D=1, E=0, T=0..3

In the following example, each MPI process can execute up to 16 OpenMP threads because at most 4 MPI processes are placed per node (option --ranks-per-node 4 of the runjob command) and each Blue Gene/Q node provides 64 hardware threads (64/4 = 16).

mpmd.ll
# @ job_name = mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = bluegene
# @ bg_size = 3
# @ wall_clock_limit = 0:30:00
# @ queue
 
# Building the mapfile
cat >mapfile <<EOF
#mpmdbegin 0-0
#mpmdcmd ./a_mixte.out
#mpmdend
#mpmdbegin 1-2
#mpmdcmd ./b_mixte.out
#mpmdend
#mpmdbegin 3-6
#mpmdcmd ./c_mixte.out
#mpmdend
0 0 0 0 0 0
0 0 0 0 1 0
0 0 0 0 1 1
0 0 0 1 0 0
0 0 0 1 0 1
0 0 0 1 0 2
0 0 0 1 0 3
EOF
 
runjob \
       --ranks-per-node 4 \
       --np 7 \
       --mapping mapfile \
       --exe ./a_mixte.out
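
The job is then submitted to LoadLeveler in the usual way (assuming the script was saved under the name mpmd.ll):

llsubmit mpmd.ll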

Example of MPMD coupling of a large number of MPI processes

The difficulty lies in building the mapfile. Numbering the cores within the ABCDET system requires knowing the upper bounds of the A, B, C, D, E and T coordinates. In a batch job, these bounds can be retrieved with the help of the LOADL_BG_SHAPE environment variable, as in the example mpmd-bgsize64.ll. The LOADL_BG_SHAPE variable, automatically set in a batch job, gives the shape of the allocated block.

For example, for bg_size=64, the LOADL_BG_SHAPE variable, expressed in number of nodes in each dimension AxBxCxDxE, can have the value 2x2x4x2x2 (a short capacity calculation is sketched after the list):

  • A, B, D and E vary from 0 to 1.
  • C varies from 0 to 3.
  • T varies from 0 to T_MAX-1, where T_MAX is the number of MPI processes per node, specified in the --ranks-per-node argument of runjob.
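
As an illustration, the total number of MPI slots in the allocated block is the product of these bounds. A minimal bash sketch, assuming LOADL_BG_SHAPE=2x2x4x2x2 and 16 ranks per node:

shape=(${LOADL_BG_SHAPE//x/ })             # e.g. (2 2 4 2 2)
((nodes = shape[0]*shape[1]*shape[2]*shape[3]*shape[4]))
((slots = nodes * 16))                     # 64 nodes * 16 ranks = 1024 MPI slots
echo "$nodes nodes, $slots MPI slots"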

If bg_size>=512, the LOADL_BG_SHAPE variable is expressed in number of midplanes in each dimension AxBxCxD (E is then considered equal to 2). Retrieving the information given by LOADL_BG_SHAPE is handled by the bash function max_abcdet, shown below.

The default numbering of the MPI processes within the ABCDET coordinate system consists of incrementing T (the rightmost letter) first, then E, D, C, B and A, like an odometer. This is carried out by the bash function incr_abcdet, shown below.

In mpmd-bgsize64.ll, we describe an example of the coupling of 769 MPI processes which are generated by 3 different MPI/OpenMP codes:

  • The MPI process rank 0, generated by ./a_mixte.out. It runs on one node.
  • The 512 MPI processes, ranks 1 to 512, generated by ./b_mixte.out. They run on 32 nodes.
  • The 256 MPI processes, ranks 513 to 768, generated by ./c_mixte.out. They run on 16 nodes.

The --ranks-per-node option specifies 16 MPI processes per node: each MPI process can therefore execute up to 4 OpenMP threads (64 hardware threads per node / 16 = 4). In total, 1 + 32 + 16 = 49 of the 64 allocated nodes are occupied.

mpmd-bgsize64.ll
# @ job_name = mpmd
# @ output   = $(job_name).$(jobid)
# @ error    = $(output)
# @ job_type = bluegene
# @ bg_size  = 64
# @ wall_clock_limit = 0:10:00
# @ queue
 
 
max_abcdet () {
# Based on the distribution of nodes or midplanes supplied by LOADL_BG_SHAPE (LoadLeveler),
# calculation of the maximum coordinates in each direction of the 5D torus
# (a_max, b_max, c_max, d_max, e_max, t_max)
    shape=(${LOADL_BG_SHAPE//x/ })
    case ${#shape[*]} in
        5) 
   # LOADL_BG_SHAPE expressed in number of nodes
            ((a_max=shape[0]))
            ((b_max=shape[1]))
            ((c_max=shape[2]))
            ((d_max=shape[3]))
            ((e_max=shape[4]))
            ;;
        4)
   # LOADL_BG_SHAPE expressed in number of midplanes      
            ((a_max=shape[0]*4))
            ((b_max=shape[1]*4))
            ((c_max=shape[2]*4))
            ((d_max=shape[3]*4))
            ((e_max=2))
            ;;
        *) echo " *** ERROR: LOADL_BG_SHAPE=${LOADL_BG_SHAPE} unexpected"
            ;;
    esac
    ((t_max=RUNJOB_RANKS_PER_NODE))
}
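
# Example: with LOADL_BG_SHAPE=2x2x4x2x2 and RUNJOB_RANKS_PER_NODE=16,
# max_abcdet sets a_max=2, b_max=2, c_max=4, d_max=2, e_max=2 and t_max=16.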
 
 
incr_abcdet () {
# Odometer-style incrementation of the 5D torus coordinates (a, b, c, d, e, t)
    ((t=(t+1)%t_max))
    if ((t==0))
    then
	((e=(e+1)%e_max))
	if ((e==0))
	then
	    ((d=(d+1)%d_max))
	    if ((d==0))
	    then
		((c=(c+1)%c_max))
		if ((c==0))
		then
		    ((b=(b+1)%b_max))
		    if ((b==0)) 
		    then
			((a+=1))
		    fi
		fi
	    fi
	fi
    fi
}
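
# Example: with t_max=16 and e_max=2, starting from (0 0 0 0 0 0), successive
# calls run through (0 0 0 0 0 1) ... (0 0 0 0 0 15), (0 0 0 0 1 0) ...
# (0 0 0 0 1 15), then (0 0 0 1 0 0), and so on: T varies fastest, then E, D, C, B, A.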
 
set -x
 
# runjob parameter: the --ranks-per-node option
export RUNJOB_RANKS_PER_NODE=16
 
# 1st section of the mapfile
# Association between the MPI process number and the executable file
cat >mapfile <<EOF
#mpmdbegin 0-0
#mpmdcmd ./a_mixte.out
#mpmdend
#mpmdbegin 1-512
#mpmdcmd ./b_mixte.out
#mpmdend
#mpmdbegin 513-768
#mpmdcmd ./c_mixte.out
#mpmdend
EOF
 
# Calculate the upper bounds of the coordinates within the 5D Torus 
max_abcdet
 
# Initialisation of the coordinates of the 5D Torus (a, b, c, d, e, t)
((a=b=c=d=e=t=0))
 
# 2nd section of the mapfile
# Writing of the coordinates of each MPI process
 
# 1 MPI process rank 0 / 1 node occupied
echo "$a $b $c $d $e $t" >> mapfile
 
# Skip to the next node, as it is a different executable file:
# setting t to t_max-1 and then incrementing makes T wrap around,
# which advances the node coordinates (A, B, C, D, E)
((t=t_max-1))
incr_abcdet
 
# 512 MPI processes ranks 1 to 512 / 32 nodes occupied
for i in {1..512}
do
    echo "$a $b $c $d $e $t" >> mapfile
    incr_abcdet
done
 
# 256 MPI processes ranks 513 to 768 / 16 nodes occupied
for i in {1..256}
do
    echo "$a $b $c $d $e $t" >> mapfile
    incr_abcdet
done
 
# MPMD execution under the control of the mapfile
runjob \
       --ranks-per-node $RUNJOB_RANKS_PER_NODE \
       --np 769 \
       --mapping mapfile \
       --exe ./a_mixte.out

Comments and limitations

  • It is not possible to have several executable files on one node. Only one executable file is launched per node.
  • The number of MPI processes per node (specified in the --ranks-per-node argument of runjob) must correspond to the maximum number of MPI processes placed on any one node.
  • As for an execution with only one executable file, the memory available to each MPI process on every node depends on the --ranks-per-node argument of runjob.
  • When several hybrid MPI/OpenMP executable files are coupled, the number of OpenMP threads must be identical for all of them. If this number is not specified with --envs OMP_NUM_THREADS, it is equal to the number of threads needed to bind all the hardware threads of the node.
  • The executable file specified in the --exe argument of runjob is ignored by the command (the mapfile determines which executables are launched); it must nevertheless be specified.
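
As a sanity check before submission, the number of coordinate lines in the generated mapfile can be compared with the value passed to --np. A minimal bash sketch, assuming the mapfile and rank count of the second example above:

n_ranks=$(grep -cv '^#' mapfile)   # count the coordinate lines (those without an #mpmd directive)
if ((n_ranks != 769)); then
    echo "mapfile contains $n_ranks coordinate lines, 769 expected" >&2
fi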