Turing : Optimisation of inputs/outputs

Introduction

Every compute code requires input data and the creation of result files (output). In this domaine, massively parallel applications generally have greater needs than more classic applications. More and more software applications are now limited by their input/output performance.

The goal of this document is to provide you with advice and ideas which will help you to improve the management of your inputs/outputs in terms of performance, organisation and volume.

Methodology

As for all performance optimisations, the first question to ask is: Is it useful? Before launching into input/output optimisation, it is necessary to measure the amount of time taken by these operations. If the operations consume only 1% of your elapsed (or clock) time, the work of optimising them is probably not useful unless you think that this time will become critical in the future when you are multiplying the size of the problem to be calculated by 100 or 1000.

After each modification, it is imperative to verify that the output results are still correct and that the performance has not deteriorated.

Attention : Performance measurements of input/output on Turing are difficult to carry out because you do not have an access dedicated to the disks. The WORKDIR and TMPDIR files are shared by all the batch and interactive jobs on Turing and Ada. Therefore, large variations can be expected between two executions, depending on the network and file server loads.

Modifying the way of administrating the I/Os may also be necessary, or simply useful, to resolve problems of code portability between different architectures or problems of file management (e.g. the file formats or too many files).

Before beginning I/O optimisation operations, it is essential to be familiar with the hardware characteristics concerning I/Os on the Blue Gene/Q machine. These are described on the following page: Turing, IBM Blue Gene/Q: Hardware Configuration.

Design

The I/Os are often neglected when designing a compute code. They are fundamental, however, for every application.

At the very beginning of the design phase, it is important to determine which information should be read and written. When using massively parallel applications, it is critical to write only what is necessary. The approach used for the I/Os (see following section) determines the portability and scalability of the application.

It is strongly advised to separate the I/O procedures from the rest of the code. This favours the readability and maintenance of the source code.

Approaches

Numerous methods are possible for effectuating I/Os in parallel applications. Some are only adapted to specific problems and others can be simpler to implement, more portable or give a better performance.

Only one process for the I/Os

In this method, you use only one process for the I/Os.

  • Advantages
    • The method is simple.
    • Very few sequential files are generated.
  • Disadvantages
    • It is intrinsically non-parallel.
    • Does not take advantage of the possibililities of the parallel file system.
    • Does not give a good performance.
    • Not possible to read/write data at high throughput.

This method, therefore, should not be used on Turing except for applications which read or write very little data.

One file per process

All the processes read and write their data independently of each other in separate files.

  • Advantages
    • It is simple to implement.
    • Parallelisation is automatic.
    • It often gives a good performance (sometimes the best performance), when there is a small number of processes.
  • Disadvantages
    • A large number of files is generated.
    • The pre/post processing steps will need to adapt to the large number of files used.
    • If all the processes write simultaneously, there is a high risk of saturating the file servers, thereby impacting all the other jobs on Ada and Turing.

On a machine such as Turing, a massively parallel application with several thousand processes would create a prohibitive number of files to handle. This approach, therefore, is to be avoided.

One file shared by all the processes

It is possible to create only one file (or just a few files) to be shared by all the processes.

  • Advantages
    • A limited number of files in generated.
  • Disadvantages :
    • This approach can be problematic as the system must manage a large number of requests simultaneously, causing a risk of significant performance limitations.
    • In order to avoid two processes writing in the same place at the same time, lock management conflicts can exist, also causing performance losses.
    • When you are writing the application, it is difficult to know where each process must write/read its data.

This method is not advised because of the multiple disadvantages. If you wish to use an approach with only one shared file, please consult the sections which follow on MPI-I/O, Parallel NetCDF and HDF5.

Specialised processes for the I/Os

The best compromise method is to use a group of processes to effectuate the I/Os. This can be done either with a group of compute processes or with a group of processes which share the I/Os of all the compute processes. To improve the load balance, the group of processes effectuating the I/Os can also be included in the compute processes.

The advantage of this method is regrouping all the data into only one (or a few) file(s). It is then possible to have the same number of output files as if the same application were executed sequentially. This simplifies the pre/post processing steps.

When there are several I/O processes, the performance can be good if there are enough processes to saturate the bandwidth of the write/read file system. This approach can also give an excellent performance when each I/O process uses different I/O nodes. In that case, however, the difficulty is to do the process mapping correctly on the Blue Gene/Q. For this purpose, IBM proposes certain MPI extensions (see Blue Gene/Q Application Development). When using MPI extensions, however, be careful of the code portability to architectures other than Blue Gene (if you are using others).

The disadvantages are that the processes which do not participate in calculations are being “wasted” (from this point of view). There is also the problem of implementation which is not always easy (especially when there are multiple I/O processes). In fact, it is necessary to manage the communications for the data transfers to and from these processes and to be careful about their memory occupation. For example, it is dangerous to allocate arrays containing all the data of all the processes on only one process.

Moreover, the same as for the preceding methods, data portability from one machine to another is not assured. The portability is mainly related to the endianness of the data and the size of the different base datatypes (integers, simple or double precision reals, …).

MPI-I/O

MPI-I/O makes it possible to obtain excellent performance for parallel I/Os. In fact, file access is done in a parallel manner by all the MPI processes involved, either individually or collectively. Collective operations generally give the best performance because numerous optimisations become possible (regrouping of requests, data sieving mechanism, access in two phases, …).

MPI-I/O communications occur in much the same way as MPI communications between two processes, but with using derived datatypes.

MPI-I/O allows grouping the data of all the processes (or only of some of them) in only one file. This greatly facilitates file management and also simplifies the pre/post processing steps.

The files generated are not portable except when they are written in external32 mode; however, this is not yet supported on Turing.

You will find more ample information on the use of MPI-I/O in the IDRIS training course materials.

Parallel NetCDF

Parallel NetCDF is a high-performance, parallel input/output library, partially compatible with the NetCDF file format distributed by Unidata (UCAR). This format is often used by the scientific community, particularly for climatology.

Parallel NetCDF was developed independently of the NetCDF group to fill the parallelism void in the original library (which is no longer the case in NetCDF-4). Parallel NetCDF sends calls to MPI-I/O subroutines and, therefore, is dependent on the MPI-I/O performance.

This library gives a very good parallel performance on Turing if it is used correctly. The file format has very few additional instructions for ensuring portability so, theoretically, it is efficient.

Parallel NetCDF has the advantage of creating files which are perfectly portable between different architectures. The files are also auto-documented (for example, all the data are named and have well-defined dimensions). The standard NetCDF tools can be used on the files generated which greatly facilitates the pre/post-processing operations. Attention : The current NetCDF tools are not yet compatible with the new 64-bit file format introduced in Parallel NetCDF 1.1.0. (This format, however, is not used by default.)

Unfortunately, the use of Parallel NetCDF is rather cumbersome (declaration of all the variables and dimensions before being able to begin writing). Also, the available documentation is very limited at this time, particularly for the Fortran interface.

HDF5

HDF5, like Parallel NetCDF, is a high-performance, parallel input/output library. It also consists of calling MPI-I/O subroutines and, therefore, is dependent on the MPI-I/O performance.

The file format is much more complex than that of Parallel NetCDF. It brings many more functionalities but also greater additional costs. Nevertheless, these additional costs remain reasonable (even negligible) for large-sized files.

It is rather difficult to use this format because of the large number of functionalities.

HDF5 creates auto-documented files which are portable between all types of architecture.

Simple pre/post-processing tools are provided. (e.g. You can find the value of a given variable in one simple command line.)

Other libraries

A l'IDRIS, plusieurs autres bibliothèques d'entrées/sorties sont disponibles :

Conseils divers

The following are some ideas and advice which for improving your input/output performance on Turing.

  • Do not open and close the files too frequently as this involves numerous operations on the file system. The best way to proceed is to open a file at the beginning of a program and to leave it open during a sufficiently long lapse of time, until using the file is no longer necessary.
  • Limit the number of files which are open at the same time. For each open file, resources are reserved and managed by the system on the I/O nodes and on the file servers.
  • Open files in the suitable mode. If a file is only to be read, it must be opened in “read-only” mode. Choosing the correct mode allows the system to apply certain optimisations and to allocate only the resources necessary.
  • Do not flush the buffers unless necessary. Flushing is a costly operation.
  • Write/read the data tables/structures with just one call rather than element by element. Ignoring this rule has a very significant negative impact on the performance.
  • Separate the I/O procedures from the rest of the source code. This allows better code readability and maintainability.
  • Separate the metadata from the data. The metadata describes the data, such as the parameters of the calculations carried out, the size of the tables, etc. It is often easier to divide the file contents into two parts: one part containing the metadata and the other part containing the data itself.
  • Create files which are independant from the number of processes. This makes both post-processing and eventual restarts using a different number of processes easier.

For more information

The principal documents about the Blue Gene/Q include input/outputs and can also help you to better understand the functioning of Turing.