Turing : Sequential optimisation

Introduction

Sequential optimisation consists of reducing resource usage by the individual processes (execution time or memory usage). Optimisation is carried out on the individual performance of each process and neither during their interactions (see MPI optimisation) nor the I/Os (I/O optimisation).

The goal of this document is to provide you with ideas for reducing the resources consumed by your code. Certain aspects will be illustrated with short examples. It is not possible, however, to completely exhaust the subject.

Method

As explained on the page Introduction to optimisation, you must begin by profiling your application in order to determine which code section uses the most time (or memory) and, therefore, where it is most pertinent to optimise. This step is absolutely necessary as it is often very difficult to guess where the resources are being used the most (errors in this regard are generally the rule).

In addition, an initial profiling will be very useful when comparing performance gains obtained from optimisation.

You should begin by attacking the parts which consume the most, of course, and continue progressively towards the sections which consume less. The potential gains decrease very rapidly in relationship with the time and work needed to carry out optimisation.

The optimisation procedure will be divided into two phases. The first form of optimisation, and often the most important, is done during the design stage; that is, by choosing the correct algorithms. Ideally, it is also at this point where you decide which external libraries you will use.

In the second phase, after the calculation code is functionning correctly, optimisation can be carried out in several steps:

  1. Choose the correct compiling options.
  2. Carry out a profiling.
  3. Choose a code section to improve.
  4. Optimise the chosen zone.
  5. Verify that the computing results are correct.
  6. Verify that there has been a performance gain.
  7. Once you are satisfied with the results of this code section, return to the 3rd point and choose another code section.
  8. Carry out a profiling (and continue with the above steps as needed).

While you are still optimising, it could be useful to redo a profiling to be sure that your application is behaving as expected.

Algorithms and design

During the design phase, before beginning to write your application, it is essential to choose the right algorithms. You need to decide which ones are most adapted to your calculation code in terms of the type of problem treated, the size and the scalability (number of processes). You should not underestimate the size of the problems which will be studied and also try to design a code which can still be used in several years (or even longer).

It is always possible to change the algorithms during the life of a code but it is easier if you make the right choices at the very beginning. A code having a long lifespan will probably have algorithm changes over time. Some libraries allow changing algorithms very simply without having to fundamentally modify your code (for example, by using PETSc).

Libraries

Numerous digital computing and scientific libraries exist. These libraries have generally been highly optimised by the manufacturers or by teams of specialised developers in these domains. It is, therefore, strongly advised to use these rather than to try to reinvent the wheel. Creating everything by yourself often requires a considerable amount of work and it is very difficult to equal the performance of these libraries.

The use of pre-existing libraries is, therefore, crucial to the performance of your calculation code.

The available libraries on Turing are presented on this page.

The IBM Mathematical Acceleration Subsystem (MASS) libraries

MASS libraries can sometimes result in very important performance gains without any effort on your part and without any modification of your sources. These libraries replace the principal classic mathematical operations (cos, sin, exp, sqrt, power, …) by versions which have a much higher performance. The disadvantage is a very slight reduction in the precision of the operations (negligable for most of the applications).

Compiler

Optimisation options

Changing the compiler's optimisation options is the simplest sequential optimisation which can be done. The biggest advantage of this is that you don't need to modify your source code.

The prinipal optimisation options are described on this page. Never forget to verify that your application results are still correct. Aggressive options can modify the behaviour of your application, introduce a bug or generate an incorrect executable file. You should also verify that there has indeed been an improvement in performance as there can be losses.

While you are testing the compiler options, it could be interesting to take a look at compiler documentation. The other compiler options can also prove to be useful.

How to help the compiler

Modern compilers, particularly the Blue Gene/Q compiler, are capable of carrying out many optimisation operations which needed to be done by hand only a few years ago (loop unrolling, loop fusion, calculation of invariants, …). These days, it is generally better to not do this by hand as some of these optimisations are no longer adapted to certain architectures. In addtion, doing this by hand would risk complicating the work of the compiler's optimiser and could provoke performance losses.

The following are ways which can facilitate the work of the compiler. The list is non-exhaustive, however, and the results are not guaranteed. Note that the advice concerning the loops is only to be followed if the loops are called numerous times and when each passage or iteration is relatively short.

  • Avoid doing tests inside a loop.
  • Avoid calling functions or subprograms in the loops.
  • Do not overuse the pointers.
  • Eliminate the GOTOs which prevent the compiler from making branch predictions.

Caches

Generally, access to memory is very slow but the processors have much faster memory caches which serve as intermediaries between the computing units and the main memory. The cache memories have lower latencies and allow storing part of the memory contents closer to the processor. Turing possesses 2 cache levels called L1 and L2. The L1 caches are the closest to the computing cores and are the fastest. The L2 cache which is larger but slower is the last level of cache before going to the main memory.

Optimising the use of caches

Because access to the main memory is extremely costly, it is essential to use the memory caches as much as possible.

In order to take best advantage of the caches, you should, first of all, use the complete contents of each cache line. For example, if you space out the double precision floating points accessed in an array to every 8th element, you will not be using the L1 cache efficiciently. In fact, each of these elements is stored in a different cache line (an L1 cache line has 64 bytes, that is, 8 double precision floating points). In this example, only one element out of 8 of each cache line would be used. The others would have been loaded for nothing. Therefore, you need to try to work on contiguous memory data. This is what we call spatial locality.

You should also use the same data several times. If the data found in the caches is re-used several times, it is not necessary to search for it in the main memory each time you need it. This is what we call temporal locality.

When using the temporal locality, we often work by blocks. Instead of effecting an operation on each element of a large array, followed by another operation on all the same elements of this array, it is more efficient to take a block of the array in the cache, carry out the series of operations on all the elements of the block, go to the following block and carry out the series of operations, and so on.

The worst situation is when memory access is done randomly or done at large intervals (more than 128 bytes between the values) making it very difficult to use the caches and to obtain good performance. In this situation, if the data used requires several operations on each value, a possible optimisation consists of creating intermediate arrays which contain a copy of this data. These arrays should not be too large in order to allow the caches to hold them. The advantage is to be able to load all the data in the cache, carry out operations on it and then to recopy the new data into their original sructures. This type of optimisation only gives performance gains if the number of operations to be done per element is sufficiently large because it requires two additional copying operations.

Optimising the use of caches is done, therefore, by using the spatial and temporal localities when writing algorithms. Only the re-use of data in the caches along with the use of the complete content of the cache lines allows approaching the maximum power of the machine.

Miscellaneous advice

The following is some advice to help improve the sequential performance of a calculation code:

  • Avoid tests inside the calculation loops.
  • Avoid calls to functions or subroutines inside the calculation loops.
  • Avoid indirect references to the memory.
  • Avoid random access to the memory.
  • Avoid datatype conversions.

Memory occupation

In general, code optimisation is carried out using the criteria of execution time. However, on architectures such as the Blue Gene/Q, the memory per computing core is relatively limited (1 GiB, or only 256 MiB is you launch 4 processes per core). Certain applications, therefore, could have problems of memory occupation. Memory consumption optimisation consists of reducing the memory needs of an application.

Reducing memory consumption can sometimes have a negative impact on execution time. In fact, it is usual to create arrays to contain data which will be re-used rather than re-calculated. You must, therefore, choose the best compromise.

To optimise memory occupation, several paths can be explored. There are still numerous codes which create unnecessarily large work arrays. This is frequently the case with codes originally written in Fortran 77 and which do not use the dynamic allocation functions of the memory. The solution here is rather simple: Use the dynamic memory allocation functions and only allocate what is needed.

In addition, many codes are not written homogeneously for each process. For example, a process can distribute or collect data for all of the application's processes. You must avoid this situation because the size of the data structures on a specialised process rapidly becomes prohibitive when the number of processes increases. This is also not good for the scalability of the application as this process can saturate very easily.

Another way to save memory is to use MPI and OpenMP (pthreads) hybrid programming. In fact, the different threads run in the same shared memory space. This allows you to avoid duplicating a part of the data.

The MPI library also consumes memory resources. It is possible to reduce these needs by modifying certain environment variables. You can find a list of these and their characteristics in the annexes of the document, Blue Gene/Q Application Development.

During certain memory usages, and particularly for codes doing many allocation/disallocation cycles of variable sizes, the memory can fragment. In consequence, it may be impossible to allocate data structures even though there is enough memory available. This phenomenon is discussed here.

On Blue Gene/Q, the memory of each node is divided by the number of processes (option -ranks-per-node of the runjob command). Of the 16 GiB, we must subtract the memory reserved for the operating system (16 MiB) and the shared memory space (64 MiB by default if several processes per node). The latter value can be changed if necessary by setting the BG_SHAREDMEMSIZE environment variable but a minimum is required for the MPI library (if insufficient, it will crash). In fact, the default share is not equitable between all the processes. Some will have less memory available than others. It is possible to force an equitable sharing by setting BG_MAPCOMMONHEAP to 1. However, if you do that, the memory protection between the processes will not be as strict. This can make difficult to diagnose some bugs.

Additional information

You can find numerous documents on internet dealing with sequential optimisation. Although these documents usually concern standard processors, the majority of the ideas can be adapted to the Blue Gene/Q.

The principal documents concerning the optimisation of Blue Gene/Q can also help you to better understand the functionning of Turing.