Turing : Choice of number of processes per compute node and process mapping

Choice of number of processes per compute node

Attention: The choice of the number of processes used per compute node has important consequences for your job accounting (see the notes on job accounting). As a reminder, the billed amount is the number of reserved cores multiplied by the elapsed (wall-clock) time consumed.

Each compute node has 16 cores dedicated to computation (a 17th core is reserved for the operating system). Each core can execute up to 4 hardware threads. A hardware thread can run either a software thread (OpenMP or other) or an MPI process. In summary, each compute node can run up to 64 MPI processes or threads, or a mix of both (for example, 4 MPI processes, each with 16 OpenMP threads).
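As an illustration of the hybrid case mentioned above, a launch of 4 MPI processes per node, each with 16 OpenMP threads, could look like the following sketch (it assumes that OMP_NUM_THREADS is passed to the processes via the --envs option of runjob):

runjob --np 256 --ranks-per-node 4 --envs OMP_NUM_THREADS=16 : ./my_executable my_arg1 my_arg2

On a 64-node block, the 256 processes occupy 4 ranks per node, and the 4 x 16 = 64 threads on each node use all 64 of its hardware threads.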

The Blue Gene/Q cores can execute up to 2 instructions per clock cycle: one floating-point instruction plus one other instruction (an integer operation, a memory load or store, or a control instruction). The floating-point instruction can operate on a vector of 4 double-precision real numbers and thus perform 4 floating-point operations per cycle. Moreover, if these are FMA operations (Fused Multiply-Add, i.e. a*x+b), up to 8 floating-point operations can be performed per cycle.
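To put these figures into perspective, with the 1.6 GHz clock of the Blue Gene/Q cores this corresponds to a theoretical peak of:

8 floating-point operations/cycle x 1.6 GHz = 12.8 Gflop/s per core
12.8 Gflop/s x 16 cores = 204.8 Gflop/s per compute node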

However, the 2 instructions executed in a cycle must come from 2 different hardware threads. Running two or more processes or threads per core can therefore result in performance gains. The benefit is also clear when one thread is blocked waiting for a memory access while another thread still has enough data to keep working. For most applications, running two hardware threads per core appears to be optimal.

The number of processes per node is chosen via the --ranks-per-node option of runjob (see the runjob command). For example:

runjob --np 2048 --ranks-per-node 32 : ./my_executable my_arg1 my_arg2

The accepted values are 1, 2, 4, 8, 16, 32 and 64. The available memory per process is determined by dividing 16 GiB by this value. (In reality, the available memory is slightly less than this because a small part of it is reserved for the operating system.) The following table shows the different possibilities:

--ranks-per-node | Number of MPI processes/node | Max. number of threads per process | Max. memory per MPI process
               1 |                            1 |                                 64 |                      16 GiB
               2 |                            2 |                                 32 |                       8 GiB
               4 |                            4 |                                 16 |                       4 GiB
               8 |                            8 |                                  8 |                       2 GiB
              16 |                           16 |                                  4 |                       1 GiB
              32 |                           32 |                                  2 |                     512 MiB
              64 |                           64 |                                  1 |                     256 MiB
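For example, for a (hypothetical) application in which each MPI process needs about 3 GiB of memory, the largest suitable value is --ranks-per-node 4 (4 GiB per process); on a 64-node block this allows at most 64 x 4 = 256 MPI processes:

runjob --np 256 --ranks-per-node 4 : ./my_executable my_arg1 my_arg2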

Size of the execution block

Each job is launched in a block, which is a grouping of compute nodes. Because of the machine's hardware architecture, a block must contain a multiple of 64 compute nodes (i.e. a multiple of 1024 cores). Attention: Only certain block sizes are authorised: 64, 128, 256, 512, 1024, 2048 or 4096 compute nodes. If a different size is requested, the system will reject the job.

The block size is specified in the job via the LoadLeveler directive bg_size:

# @ bg_size = 64

On Turing, it is also possible to execute jobs using only a fraction of the compute nodes available inside a block (bg_size smaller than 64). In that case, the network resources (5D torus) may be shared with other jobs, which can have an impact on MPI communication performance and on the I/O (there is 1 I/O node for every 64 compute nodes). Only certain sub-block sizes are permitted: 1, 2, 4, 8, 16 or 32 compute nodes.
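For example, a job restricted to a 16-node sub-block would simply declare the following (the rest of the job file is unchanged):

# @ bg_size = 16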

Choice of the number of MPI processes

You must also be careful about how the number of MPI processes is specified in relation to the block size. The number of MPI processes is given to the runjob command via the --np option.

The number of MPI processes must not exceed the resources available in the block you have been allocated: that is, the number of compute nodes multiplied by the number of processes per node (specified with the --ranks-per-node option of runjob). It is possible to use fewer processes than this, but the unused resources will still be counted in your job accounting.
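For instance, with the settings used elsewhere on this page (a 64-node block and 32 processes per node), the maximum is 64 x 32 = 2048 MPI processes, which is exactly what the earlier runjob example requests:

runjob --np 2048 --ranks-per-node 32 : ./my_executable my_arg1 my_arg2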

MPI process mapping

The positioning, or mapping, of the MPI processes on the machine's cores can have a significant influence on communication performance.

The MPI communication network of the Blue Gene/Q is characterized by a 5D torus type topology. This means that each compute node is directly connected to 10 neighbouring nodes (two in each direction).

If all the point-to-point communications on the machine take place between neighbouring processes in the topology, it is possible to approach the theoretical peak performance of this network and thereby obtain very high communication performance and very good scalability.

Figure: the 5D torus

If communications take place between processes that are far apart in the topology, performance will be reduced because both the latency and the traffic on the links between compute nodes increase: a message must traverse several links to reach its destination. It is therefore necessary to try to place communicating processes close to each other in the topology.

The five axes of the topology are designated A, B, C, D and E. There is also a sixth dimension, called T, which corresponds to the different processes inside a compute node.

By default, the numbering of the MPI processes follows an ABCDET mapping. This means that the processes are placed consecutively by varying the last coordinate T first, then E, then D, C, B and A. The recommended mapping is this default ABCDET: it should give good performance for the majority of codes, since processes with close ranks are also physically close to each other.

The mapping is chosen via the --mapping option of runjob. This option accepts either a predefined mapping of the form ABCDET or any of its permutations (TABDCE, BATCDE…), or the name of a mapping file. A mapping file must contain one line per MPI process; each line gives the topology coordinates of the process whose rank corresponds to the line number (one integer per dimension A, B, C, D, E and T, separated by spaces).
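As an illustration, the first lines of a (hypothetical) mapping file could look like this: line 1 places rank 0 on hardware thread T=0 of the node with coordinates A=B=C=D=E=0, line 2 places rank 1 on hardware thread T=1 of the same node, and so on, with one line for every MPI process of the job:

0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 2
0 0 0 0 0 3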

Example of a job submission with a TEDCBA mapping:

job.ll
  # @ shell            = /bin/bash
  # @ job_type         = BLUEGENE
  # @ job_name         = job_tedcba
  # @ output           = $(job_name).$(jobid)
  # @ error            = $(job_name).$(jobid)
  # @ wall_clock_limit = 1:00:00
  # @ bg_size          = 64
  # @ queue
 
  runjob --np 2048 --ranks-per-node 32 --mapping TEDCBA : ./my_executable my_arg1 my_arg2
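To use a mapping file instead of a predefined permutation, the --mapping option is simply given the name of the file (here a hypothetical my_mapfile.txt containing one line of coordinates per MPI process):

runjob --np 2048 --ranks-per-node 32 --mapping my_mapfile.txt : ./my_executable my_arg1 my_arg2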