Turing : Optimisation of MPI communications and scalability

Introduction

The MPI communications library is used by most of the current parallel applications which are able to run on the Blue Gene/Q. The communications, however, are not instantaneous and carry additional costs in execution time and can also limit the scalability of a computation code (number of cores which can be used efficiently).

The objective of this page is to present elements for consideration and possible ways to optimise MPI communications, thereby improving an application's scalability.

Methodology

As explained on the page Introduction to optimisation, it is necessary to begin with the profiling of your application. The profiling is to determine which communications are the most critical and to identify the possible problems of workload balancing between the different processes. This step is absolutely necessary as it is often very difficult to guess where the resources are used the most (estimations are generally incorrect).

Moreover, an initial profiling is very useful for comparing the resulting performance gains.

You must, of course, begin optimisation by attacking the parts which consume the most time and then continue progressively towards the sections which consume less. You should limit your optimisation efforts to the higher end of resource consumption because the potential performance gains decrease rapidly considering the time and work necessary to do the optimisation.

The procedure to follow will be divided into two phases:

  • The first and often the most important optimisation takes place during the design of the application. It concerns choosing the right algorithms.
  • The second phase, after the computation code functions correctly, can take place in several steps:
  1. Carry out a profiling.
  2. Choose a section of code to improve.
  3. Optimise the chosen zone.
  4. Verify that the results are correct.
  5. When you are satisfied with this code section, return to the second step.

During the optimisation, it can be useful to repeat the profiling measures to be sure that your application is functioning as expected.

Algorithms

It is essential to choose the right algorithms during the design phase, even before writing your application. It is necessary to decide which algorithms are the most adapted to your computation code in terms of the type of problem being processed, the size, the scalability (number of processes), and so on. In addition, you should think about developing a general code which can be extended and continue to be used for several more years.

To reduce the time spent in MPI communications, it is often possible to duplicate certain calculations on different processes in order to decrease the frequency of the communications and/or the quantity of data exchanged. This can greatly improve the scalability of your application. However, you must find the right compromise between the volume of communications and the quantity of duplicated operations. Too much replication has an impact on both the quantity of memory used and the execution time on each process.

While a code exists, it is always possible to go back and change the algorithms but it is easier to start out well. A code has a long lifespan and will probably change algorithms over time. Moreover, certain libraries allow changing algorithms easily without having to fundamentally modify your code (e.g. PETSc).

Load balancing

Load balancing between different processes is one of the essential elements for obtaining good scalability. It is absolutely necessary that the quantity of work which each process needs to effectuate be equivalent to all the others. If even one of the processes has a heavier load than the others, all of the other processes will have to wait.

Domain decomposition is usually carried out with adaptive mesh refinement in order to increase calculation resolution in the zones where calculations are more significant . Adding or destroying meshes during the iterations introduces an imbalance. There are means of restoring the load balance by exchanging meshes between processes. Unfortunately, these means are rather complex and costly. Therefore, to find the best compromise between load balance and the cost of the operation, the mesh exchanges should only be done on a limited basis (for example, all the n iterations). Load balancing is often done by distributing the same number of meshes to each process. However, this is not always sufficient. According to the models and algorithms used, calculations on different meshes can have time differences. In this case, it is necessary to divide the meshes so that the calculation times are as similiar as possible between the different processes.

If it is not possible to obtain a perfect balance, it would be better if only a few processes have less work rather than a few processes having more work. In the first case, there would be less waste of calculation time.

The initialisation and finalisation phases can be very time-consuming when the number of processes increases. To improve load balance on massively parallel applications, it is advised to avoid phases where only one (or a few) processes effectuate operations and for which one process distributes (or collects) data from all the other processes. This generally functions perfectly with a few dozen processes but becomes a bottleneck when there are several thousand processes. This type of operation, with one master process collecting or distributing data, is also to be avoided during the execution. If it is judged necessary to proceed in that manner, it is often possible to improve the situation by adding another process level which serves as intermediary between the master process and the slave processes. The communications will then be managed by these intermediary processes with each of them having a limited number of slave processes.

MPI profiling tools can be used to find the load balancing problem.

Computation/communication overlapping

In the development phase, you should write your code using blocking communications. The Blue Gene/Q has a DMA (Direct Memory Access) motor on its 5D torus network. It allows message transmission between the different machine nodes without interrupting the cores. A message can be transferred while the cores are working. Overlapping communications with computations is, therefore, completely possible. For applications suitable for this, it is possible through overlapping to mask a good portion of the time taken by the communications.

The MPI library provides non-blocking procedures for point-to-point communications (such as MPI_Isend and MPI_Irecv). It is strongly advised to use them if it is possible to effectuate operations between the initialisation of a communication and the moment when the transmitted data are used. It must not be forgotten, however, that there are more additional costs than in blocking communications: Non-blocking communications require more calls to MPI procedures and it is necessary to manage the requests. It is also easier to make mistakes.

Non-blocking communications should not be used in too large a quantity, however, as they consume memory resources. If too many requests occur on a process simultaneously, there is a risk of a lack of memory (especially on a Blue Gene/Q machine).

Attention : To completely activate the computation/communications overlapping on Turing, it is necessary to set the environment variable PAMID_THREAD_MULTIPLE at 1 (by adding the option --envs "PAMID_THREAD_MULTIPLE=1" to runjob). However, this option does not function if you launch 64 processes per node.

Process mapping

When there is communication between two processes, their messages pass through the machine network (except if they are in the same compute node). The Blue Gene/Q has a 5D torus network for MPI commmunications which connects each compute node to its 2 neighbours, one in each direction (for a total of 10 neighbours). This network is used by all of the communications.

On the massively parallel machines, the way processes are mapped on the machine can have a considerable impact on the performance. Let us consider the case of the 5D torus through which most of the messages pass. Most of the time, an application alternates the computation and communication phases (with possible overlapping). This means that at certain moments, all of the processes are communicating. If each process must exchange data with one or several other processes far away in the network, the risk of poor performance is elevated. Indeed, the network links are shared and if the processes are not neighbours, the messages will have to traverse several links. Because these links need to manage a multitude of messages simultaneously, this results in a reduced speed for each communication. Another effect of non-proximity is the increase in latencies which particularly affects small-size messages.

To optimise mapping, it is necessary that all communications be made between direct neighbours or, even better, between processes situated on the same compute node.

Process mapping on Turing is static. This means that during the launching of the application, each process is positioned on a given compute core and will not move during the execution. It is possible to impose the position of each MPI process in function of its rank in the MPI_COMM_WORLD on the 5D torus. The procedure to follow is described on the following web page or in the document Blue Gene/Q Application Development.

Other advice

This section identifies some practices for obtaining good MPI performance and scalability. Following this advice can help you to improve your compute code. Some of these points are generic and should give good results on every type of architecture while others are more specific to the architecture of Blue Gene/Q.

  • Use non-blocking point-a-point communications during optimisation (see above section, “Computation/communication overlapping”Computation/communications overlapping).
  • Communicate with neighbours which are nearby in the network (see above section, “Process mapping”Process mapping).
  • Avoid using buffered sends (for example, MPI_Bsend) as this requires supplementary memory copies which are costly in time and in memory occupation. Standard sends (MPI_Send) generally perform better.
  • Avoid using synchronous sends (for example, MPI_Ssend) as these are non-local operations and can result in supplementary latencies. Standard sends (MPI_Send) generally perform better.
  • Avoid using non-contiguous derived datatypes as the MPI library must effectuate packing operations and, therefore, memory copies and the utilisation of a larger amount of memory for storage. These datatypes can also cause non-optimal memory alignment problems.
  • Be careful to limit the number of messages being transferred simultaneously at a certain place (via MPI_Isend, for example). This problem is accentuated for an eager protocol (small messages).
  • Do not saturate certain processes with messages (for example, all the processes sending a message to only one other process). This risks saturating the memory of the receiver process and could also cause a bottleneck in the application.
  • In a non-blocking send (MPI_Isend), never write in the send buffer before verifying that the send has terminated.
  • Limit the number of messages. It is better to send a large message than numerous small ones.
  • Use collective communications as much as possible rather than point-to-point communications (except if all the communications take place with direct neighbours).
  • Avoid synchronisation barriers (MPI_Barrier) as most of them are not necessary.
  • Only communicate when it is necessary. Avoid sending messages containing useless information or messages which have not changed between two iterations.

Hybrid MPI/OpenMP programming

Hybrid MPI/OpenMP programming can increase scalability. The OpenMP/multithread approach can only function inside a shared memory node. On Turing, this allows going up to 64 threads per process as there are 16 cores per compute node and each core can execute up to 4 threads (or processes).

On the other hand, it is completely possible to have an MPI process with several OpenMP threads on each compute node. This approach has several advantages:

  • For a given number of cores, the number of MPI processes is decreased so the number of messages is also decreased. Moreover, the messages are larger and, therefore, the latencies are negligible: The application will spend less time in the communications.
  • Memory needs are reduced because of sharing memory between process threads.

For Hybrid MPI/OpenMP to function well, it is necessary, of course, to have a good OpenMP performance inside of each MPI process.

For more information

The principle documents about the Blue Gene/Q include optimisation and MPI optimisation. These documents can also help you to better understand the Turing functioning.