Turing : Performance measurement with libhpcidris

Attention : This functionality is only available starting with version 5.0 of libhpcidris.

Introduction

IDRIS has developed the libhpcidris library for measuring the performance of an application. By instrumenting the code, you can measure the number of floating-point operations per second (FLOP/s), the number of integer operations, certain memory throughputs, the network bandwidth and the input/output rates. The library takes its measurements by reading the hardware counters of each compute node.

The following example shows what is obtained on the standard output when using the subroutines HPCIDRIS_F03_MPI_hardcount_start and HPCIDRIS_F03_MPI_hardcount_stop in Fortran or HPCIDRIS_MPI_hardcount_start and HPCIDRIS_MPI_hardcount_stop in C with a detail level of 0 (HPCIDRIS_LEVEL_DEFAULT) :

-----------------------------------------------------------------------
                PERFORMANCE (libhpcidris version 5.0)
              (C)2009-2014 Philippe WAUTELET, IDRIS-CNRS
-----------------------------------------------------------------------
 Elapsed time: 86.877202s (on process 0)
 Processes: 8192 (16 per node), threads per process: 1
 Reserved nodes (bg_size): 512 (IO nodes: 8)
 Torus dimensions: 4x4x4x4x2, is a torus: 1 1 1 1 1
-------------------------------------------------------------------------------------------
                |        Sum       |          Average         |      Volume    |   %peak  |
-------------------------------------------------------------------------------------------
FLOP/s total    |    2.622TFLOP/s  |  320.069MFLOP/s  /thread |  227.780TFLOP  |    2.50% |
FLOP/s FPU      |    1.562TFLOP/s  |  190.708MFLOP/s  /thread |  135.719TFLOP  |    5.96% |
 NoFMA FPU      |  804.623GFLOP/s  |   98.221MFLOP/s  /thread |   69.899TFLOP  |    6.14% |
 FMA FPU        |  757.658GFLOP/s  |   92.488MFLOP/s  /thread |   65.819TFLOP  |    2.89% |
FLOP/s QPU      |    1.060TFLOP/s  |  129.361MFLOP/s  /thread |   92.061TFLOP  |    1.01% |
 NoFMA QPU      |  531.613GFLOP/s  |   64.894MFLOP/s  /thread |   46.182TFLOP  |    1.01% |
 FMA QPU        |  528.111GFLOP/s  |   64.467MFLOP/s  /thread |   45.878TFLOP  |    0.50% |
FLoat inst/s    |    1.465TFLINS/s |  178.856MFLINS/s /thread |  127.284TFLINS |   11.18% |
Integer op/s    |  897.173GINTOP/s |  109.518MINTOP/s /thread |   77.940TINTOP |    6.84% |
Instructions/s  |    4.957TINST/s  |  605.120MINST/s  /thread |  430.638TINST  |   18.91% |
---------------------------------------------------------------------------------------------
L2 read         |    5.371TiB/s    |  687.446MiB/s    /thread   |   466.562TiB   |    2.82% |
DDR read        |    3.004TiB/s    |    6.008GiB/s    /node     |   260.983TiB   |   15.12% |
DDR write       |    2.727TiB/s    |    5.454GiB/s    /node     |   236.901TiB   |   13.73% |
Network sent    |   20.062GiB/s    |   40.124MiB/s    /node     |     1.702TiB   |    0.21% |
IO reads        |   89.249MiB/s    |   11.156MiB/s    /ION      |     7.571GiB   |    0.29% |
IO writes       |    1.524GiB/s    |  195.052MiB/s    /ION      |   132.379GiB   |    5.11% |
---------------------------------------------------------------------------------------------

Measurements and their meanings

Attention : All the values presented in this section are very subjective and should only be used as rough estimates. The values can deviate considerably from those suggested here depending on the type of application, the data set, etc.

Number of floating point operations per second (FLOP/s)

The number of floating point operations is a measurement of the use of computing units on the real numbers. The operations included here are additions, subtractions, multiplications, divisions and some other more specific operations (e.g., square roots or absolute values).

As on all scalar-processor machines, most applications (even well-optimised ones) are only capable of using a fraction of the theoretical peak performance. An application using between 5% and 10% of this value is generally considered to have very satisfactory performance. However, this depends very much on the type of application, the size of the problem studied, etc.

The values are divided into four groups:

  • NoFMA FPU: Non-FMA (Non-Fused Multiply-Add) and non-vector floating-point operations
  • FMA FPU: FMA (Fused Multiply-Add) and non-vector floating point operations
  • NoFMA QPU: Non-FMA (Non-Fused Multiply-Add) operations using the floating-point vector unit
  • FMA QPU: FMA (Fused Multiply-Add) operations using the floating-point vector unit

The FLOP/s FPU field is the sum of NoFMA FPU and FMA FPU. FLOP/s QPU is the sum of NoFMA QPU and FMA QPU. FLOP/s total is the sum of FLOP/s FPU and FLOP/s QPU.
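
These relations can be checked directly against the sample output above (to within rounding):

FLOP/s FPU   =  804.623 GFLOP/s + 757.658 GFLOP/s ≈ 1.562 TFLOP/s
FLOP/s QPU   =  531.613 GFLOP/s + 528.111 GFLOP/s ≈ 1.060 TFLOP/s
FLOP/s total =    1.562 TFLOP/s +   1.060 TFLOP/s ≈ 2.622 TFLOP/s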

The machine's peak performance is calculated by taking into account the vector unit, which works in parallel on 4 double-precision floating-point numbers. If an application does not use vector operations, the attainable performance is divided by a factor of 4.

Moreover, at each clock cycle and for each floating-point unit, the compute cores are capable of carrying out two floating-point operations when an addition (or subtraction) is combined with a multiplication (a = b + c*d, an operation called Fused Multiply-Add by IBM). If your application does not perform FMA operations, the maximum performance is again divided by a factor of 2.
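
These two factors can be recovered from the sample output: the peak assumes 4 (vector width) × 2 (FMA) = 8 floating-point operations per cycle and per core, and 2.622 TFLOP/s at 2.50% of peak gives about 104.9 TFLOP/s for the 512 reserved nodes, i.e. 204.8 GFLOP/s per node (16 cores × 1.6 GHz × 8 FLOP per cycle).

As an illustration only (the function name is hypothetical), the following C loop has a body matching the FMA pattern a = b + c*d. Whether the compiler actually generates FMA and/or vector (QPU) instructions depends on the compiler and its optimisation options; the counters reported by libhpcidris show which instruction mix was really executed.

/* Illustration: each iteration performs one multiplication and one addition,
   a natural candidate for a single FMA instruction. */
void fma_candidate(int n, double *a, const double *b,
                   const double *c, const double *d)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i] * d[i];
    }
}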

Number of floating point instructions per second (FLoat inst/s)

This measurement gives the number of floating-point instructions executed on the machine per second. Depending on the type of instruction, one instruction can represent several operations (1 operation for a non-vector, non-FMA instruction; 2 for a non-vector FMA instruction; 4 for a vector, non-FMA instruction; 8 for a vector FMA instruction).

The theoretical maximum is one instruction per cycle and per core.

Number of operations on integers (Integer op/s)

This measurement gives the number of instructions on integers (or, equivalently, operations, since the integer units are not vector units) executed on the machine per second.

The theoretical maximum is one instruction per cycle and per core. This is very difficult to reach because memory read and write instructions share the same resources (1 instruction per cycle). For example, updating an integer array with a[i]=a[i]+1 entails 1 read and 1 write per operation; even leaving memory-access considerations aside, this example cannot exceed 1/3 of the peak performance for integer operations.
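
A minimal C sketch of this example (the instruction counts are simplified; the code actually generated by the compiler may differ):

/* Each iteration involves 1 memory read, 1 integer addition and 1 memory write.
   These three instructions compete for the same issue slot (1 instruction per
   cycle), so at most one third of them can be integer operations. */
void increment_all(int n, int *a)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = a[i] + 1;
    }
}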

Number of instructions (Instructions/s)

This measurement gives the number of instructions of all types executed on the machine per second.

The theoretical maximum is 2 instructions per cycle and per core. At each cycle, a core is capable of executing, at the same time, up to one floating-point instruction and one other instruction (integer operation, memory read or write, or other). However, at each cycle, a given process (or thread) is limited to a single instruction. To execute more than one instruction per cycle, it is therefore necessary to overload the cores with several processes (or threads). On the Turing Blue Gene/Q machine, each core can execute up to 4 hardware threads simultaneously.

Read speeds from the L2 cache (L2 read)

The read speed from the Level 2 cache is available. This makes it possible to measure the pressure exerted on the L2 memory cache (there is also a Level 1 memory cache, for which the speeds cannot be measured). A high value (greater than 50% of the performance peak) indicates that your application is heavily soliciting the memory and is probably limited by these speeds if the number of FLOP/s is far from the performance peak.

On the other hand, a low value (less than 10%) combined with a low number of FLOP/s can also indicate that the compute cores are wasting a lot of time waiting for data to be read or written. This can be the case if you make many random memory accesses or if you perform many costly operations (divisions, square roots, …).

Comments:

  • The measured speeds take into account not only the floating-point variables but also the fetching of machine instructions.
  • The measurements of write speeds in the L2, or read and write speeds in the L1, are not available on Turing.

Speeds to and from the main memory (DDR read and DDR write)

The same remarks as made for the L2 cache speeds apply here.

You also need to know that data in the main memory are read and written in blocks of 128 bytes (the size of an L2 cache line). Even if you only need to access a single element of an array, the entire corresponding cache line has to be loaded.
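
As an illustration (hypothetical example; the actual traffic also depends on prefetching and on what is already in the caches), the following C loop reads only one double (8 bytes) out of every 128-byte cache line, yet each access causes the whole line to be transferred:

/* Strided access: 16 doubles = 128 bytes = one L2 cache line, so only
   8 useful bytes are consumed out of every 128 bytes actually loaded. */
double sum_one_element_per_line(int n, const double *a)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i += 16) {
        s += a[i];
    }
    return s;
}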

It is equally important to compare the speeds of the L2 cache and of the main memory. If they are close, it means that the data in the caches are not reused very often and the performance (in terms of FLOP/s) will probably be rather low. On the contrary, if the L2 speed is much higher, this indicates good reuse of the cached data and the performance should be acceptable.

The documentation on sequential optimisation explains the functioning of the caches and how to benefit from them.

Send speed on the 5D torus network (Network sent)

The send speed on the 5D torus network is equal to the sum of the data injected by the compute node into the 5D torus network in the 10 existing directions. Each direction has a maximum speed of 2 GB/s, for a total of 20 GB/s.

These measurements include the data sent by the processes of each compute node, as well as any other data passing through the same node. A message travelling through several compute nodes will therefore be counted several times. The measurements also include the system-specific headers which ensure the routing of the data, as well as the MPI headers.

The reception speed is not available.

Input/output speeds (IO reads and IO writes)

The read and write speeds are given. These include all the I/O: on files; on the standard input, output and error streams; as well as the data transferred via sockets (few Blue Gene/Q applications use sockets).

The counters also include the system headers (a 32-byte header per operation or per block of at most 512 bytes of application data) and the associated verification traffic (if you perform writes, you will also notice some traffic in the read direction, and vice versa).
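
As an order of magnitude, assuming the 32-byte header applies to each 512-byte block, a large transfer split into full 512-byte blocks is counted as at least (512 + 32) / 512 ≈ 1.06 times the application data, i.e. roughly 6% of overhead; small or scattered operations carry a proportionally higher overhead.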

The peak value is given as a function of the number of I/O nodes (one per block of 64 compute nodes), considering the maximum network speed from each block of 64 compute nodes towards its I/O node (2 links at 2 GB/s each). All the tests carried out so far have shown that the speed towards the file systems cannot exceed about 3 GiB/s per I/O node. Keep in mind also that you are limited by the maximum speed of the file servers (50 GB/s for the WORKDIR and the TMPDIR) and that other users share the machine.
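
This can be checked against the sample output above: with 8 I/O nodes, the peak is 8 × 2 × 2 GB/s = 32 GB/s ≈ 29.8 GiB/s, and the measured 1.524 GiB/s of IO writes indeed corresponds to 1.524 / 29.8 ≈ 5.11% of that peak.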

Utilisation

A module is available to set the paths to the library, the include files and the Fortran modules.

module load libhpcidris

Utilisation of libhpcidris in a Fortran program

All the functionalities of performance measurement are available by loading the Fortran module hpcidris (or hpcidris_hardcount). In your Fortran program, you just need to add the following line in all the places where you use this library:

use hpcidris

The available subroutines are:

  • HPCIDRIS_F03_MPI_hardcount_start(sync): Begins performance measurement.
  • HPCIDRIS_F03_MPI_hardcount_stop(thrdbyproc,level,sync) : Stops and displays the performance measurements for all the processes of the application. The argument thrdbyproc must be equal to the number of threads executed per MPI task (1 for a pure MPI application). The detail level of the display is chosen with the argument level:
    • HPCIDRIS_LEVEL_DEFAULT : Summary for all the compute nodes
    • HPCIDRIS_LEVEL_ALLPRC : Detailed outputs for all the compute nodes
    • HPCIDRIS_LEVEL_MISSES : Information about cache misses (Attention: these values are rather difficult to interpret.)
  • HPCIDRIS_F03_MPI_hardcount_init() : Allocates the libhpcidris structures. It is not necessary to call this subroutine before calling HPCIDRIS_F03_MPI_hardcount_start because it will automatically be called if necessary.
  • HPCIDRIS_F03_MPI_hardcount_finalize() : Frees the resources used by the libhpcidris library. This call can be made only once during the entire execution of your program. It is not necessary to call it at the end of the program.

These subroutines are collective and must be called by all the processes of the application (MPI_COMM_WORLD communicator).

The sync argument allows defining the synchronisation policy of the processes just before beginning the measurement:

  • HPCIDRIS_SYNC_DEFAULT : For the default value (forced synchronisation for HPCIDRIS_F03_MPI_hardcount_start and no synchronisation for HPCIDRIS_F03_MPI_hardcount_stop).
  • HPCIDRIS_SYNC_YES : To have synchronisation between all the processes.
  • HPCIDRIS_SYNC_NO : To have no synchronisation.

Utilisation of libhpcidris in a C program

All the functionalities of performance measurement are available by including the hpcidris.h file (or hpcidris_hardcount.h). In your C program, you just need to add the following line everywhere that you use this library:

#include "hpcidris.h"

The available functions are:

  • HPCIDRIS_MPI_hardcount_start(sync) : Begins the performance measurement.
  • HPCIDRIS_MPI_hardcount_stop(thrdbyproc,level,sync) : Stops and displays the performance measurements for all the processes of the application. The thrdbyproc argument must be equal to the number of threads executed per MPI task (1 for a pure MPI application). The detail level of the display is chosen by the argument level :
    • HPCIDRIS_LEVEL_DEFAULT : Summary for all the compute nodes
    • HPCIDRIS_LEVEL_ALLPRC : Detailed outputs for all the compute nodes
    • HPCIDRIS_LEVEL_MISSES : Information about cache misses (Attention: these values are rather difficult to interpret.)
  • HPCIDRIS_MPI_hardcount_init() : Allocates the libhpcidris structures. It is not necessary to call this function before calling HPCIDRIS_MPI_hardcount_start because it will automatically be called if necessary.
  • HPCIDRIS_MPI_hardcount_finalize() : Frees the resources used by the libhpcidris library. This call can be made only once during the entire execution of your program. It is not necessary to call it at the end of the program.

These functions are collective and must be called by all of the processes of the application (MPI_COMM_WORLD communicator).

The sync argument allows defining the synchronisation policy of the processes just before beginning the measurement:

  • HPCIDRIS_SYNC_DEFAULT : For the default value (forced synchronisation for HPCIDRIS_MPI_hardcount_start and no synchronisation for HPCIDRIS_MPI_hardcount_stop).
  • HPCIDRIS_SYNC_YES : To have synchronisation between all the processes.
  • HPCIDRIS_SYNC_NO : To have no synchronisation.
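
For reference, here is a minimal instrumentation sketch in C (pure MPI code, hence thrdbyproc equal to 1). Only the functions and constants documented above are used; the dummy loop is a placeholder for the code section you want to measure.

#include <stdio.h>
#include <mpi.h>
#include "hpcidris.h"

int main(int argc, char **argv)
{
    long i;
    double s = 0.0;

    MPI_Init(&argc, &argv);

    /* Collective call on MPI_COMM_WORLD: start the measurement
       with the default synchronisation policy. */
    HPCIDRIS_MPI_hardcount_start(HPCIDRIS_SYNC_DEFAULT);

    /* Code section to be measured (placeholder computation). */
    for (i = 1; i <= 100000000L; i++)
        s += 1.0 / ((double)i * (double)i);

    /* Collective call: stop and display the summary
       (1 thread per process, default detail level). */
    HPCIDRIS_MPI_hardcount_stop(1, HPCIDRIS_LEVEL_DEFAULT, HPCIDRIS_SYNC_DEFAULT);

    printf("s = %f\n", s);   /* keep the result so the loop is not optimised away */

    MPI_Finalize();
    return 0;
}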

Comments

  • The calls incur a limited additional cost (about 2 microseconds per call). These functionalities are therefore not suited to very fine-grained measurements, as the measured duration would not be precise enough (the other measurements, which are independent of time, remain correct). Moreover, the display is not instantaneous and involves MPI communications between certain processes (1 per node); a slight desynchronisation of the processes can therefore appear. The measurements themselves are not disturbed by the display because they are taken before this operation.
  • All the calls are collective in the MPI context and must be carried out by all the processes of the application (MPI_COMM_WORLD communicator).
  • For reasons inherent to the machine architecture, the performance measurements can only be carried out for a maximum of one thread or process per core at any given moment. If your application overloads the cores (several processes or threads per core), the measurements will be extrapolated. You will therefore obtain a message such as: WARNING: values per thread are extrapolated (only 1 thread per core is instrumented)
  • Every measurement that is started must be stopped once and only once.