Turing, IBM Blue Gene/Q:  Hardware configuration

[Photo of the Blue Gene/Q machine]

  • 98,304 PowerPC A2 cores
  • 96 TiB of memory
  • Cumulative peak performance of 1.258 Pflop/s
  • 2.2 PB file system (shared with Ada)
  • 636 kW electrical consumption

For more details concerning the use of this machine's resources, see the dedicated page.

Detailed description of the machine

Architecture

Turing consists of several machines, each with a specific function:

  • Interactive nodes (also called front-end or log-in nodes)
  • Service nodes
  • The actual computing machine (Blue Gene/Q)

The front-end is the entry point to Turing and the only machine to which users have direct access.  This is where compilation and job submission are done.

The front-end machine runs Linux (Red Hat) and has 32 POWER7 cores clocked at 3 GHz, with 64 GiB of memory.

The service nodes manage the Turing resources (jobs, databases, …).

Blue Gene/Q hardware overview

[Figure: IBM Blue Gene/Q hardware hierarchy]

The computing machine, from the complete configuration down to its basic elements, comprises the following:

  • Six Blue Gene/Q racks (or cabinets)
  • Each rack is divided into two midplanes (32 node cards per rack).
  • Each midplane contains 16 node cards.
  • Each node card holds 32 compute nodes (compute cards).
  • Each compute node has 16 cores.
  • Each core can run 4 threads (or processes).

Each compute node has 16 GiB of memory and a theoretical peak performance of 204.8 Gflop/s (12.8 Gflop/s per core).  The complete configuration provides 6,144 compute nodes, 98,304 cores and 393,216 hardware threads, with a cumulative peak performance of 1.258 Pflop/s and 96 TiB of memory.
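These totals can be checked directly from the hierarchy above. The following is a minimal arithmetic sketch; the per-core peak is modelled here as 8 flops per cycle at 1.6 GHz (12.8 Gflop/s), which is an assumption chosen to match the figures quoted on this page.

    # Rebuild the Turing totals from the hardware hierarchy described above.
    racks = 6
    midplanes_per_rack = 2
    node_cards_per_midplane = 16
    nodes_per_node_card = 32
    cores_per_node = 16
    threads_per_core = 4

    nodes = racks * midplanes_per_rack * node_cards_per_midplane * nodes_per_node_card
    cores = nodes * cores_per_node
    threads = cores * threads_per_core

    peak_per_core = 1.6 * 8                 # GHz x flops/cycle (assumed) = 12.8 Gflop/s
    peak_total_pflops = cores * peak_per_core / 1e6
    memory_tib = nodes * 16 / 1024          # 16 GiB per compute node

    print(nodes, cores, threads)            # 6144 98304 393216
    print(round(peak_total_pflops, 3))      # 1.258
    print(memory_tib)                       # 96.0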

Finally, each rack holds 16 I/O nodes, with 2 connections per I/O node to the compute nodes.

Compute nodes and cores

The principal characteristics of a compute node are given in the following table:

Core:                        64-bit PowerPC A2
Cores per node:              16
Clock speed:                 1.6 GHz
L1 cache (per core):         16 KiB instruction (L1i) + 16 KiB data (L1d)
L2 cache (shared):           32 MiB
Memory:                      16 GiB DDR3
Peak performance per node:   204.8 Gflop/s
Memory bandwidth:            42.6 GB/s
Electrical consumption:      55 W

The peak performance of a single compute core is relatively modest (clock speed of only 1.6 GHz).  This greatly reduces the electrical consumption of each core and, at the same time, makes it possible to multiply the number of cores used simultaneously.  If one assumes that the consumption scales with the cube of the frequency, halving the frequency allows the computing power to be doubled (by quadrupling the number of cores) while the electrical consumption is cut in half.  The performance of the machine is therefore based on a large number of compute cores with a low clock speed, which results in a very low electrical consumption.
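A back-of-the-envelope sketch of this trade-off, under the stated assumption that consumption scales with the cube of the frequency while peak performance scales linearly with it (a toy model, not measured data):

    # Toy model: power per core ~ f**3, peak per core ~ f (arbitrary units).
    def power_per_core(f):
        return f ** 3

    def peak_per_core(f):
        return f

    # Reference design: n cores at frequency f.
    n, f = 1000, 1.0
    ref_peak = n * peak_per_core(f)
    ref_power = n * power_per_core(f)

    # Halve the frequency and quadruple the number of cores.
    n2, f2 = 4 * n, f / 2
    print((n2 * peak_per_core(f2)) / ref_peak)     # 2.0 -> twice the computing power
    print((n2 * power_per_core(f2)) / ref_power)   # 0.5 -> half the electrical consumption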

This type of massively parallel architecture (MPP) requires applications which can be run with a very high level of parallelism (several thousand processes).
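Purely as an illustration of what such a level of parallelism looks like in code (this is not an official Turing example; production codes on this machine are typically C, C++ or Fortran with MPI), here is a minimal SPMD-style sketch using mpi4py. The process count is whatever the batch system launches, typically thousands of ranks.

    # hello_mpi.py -- every MPI process (rank) runs the same program on its own data.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()     # identifier of this process
    size = comm.Get_size()     # total number of MPI processes

    # Each rank sums its own slice of a global range, then the results are combined.
    local_sum = sum(range(rank, 1_000_000, size))
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"{size} processes, global sum = {total}")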

Caches and memory

The different cache levels and the memory are described on a separate page.

Networks

Turing includes four different networks.  The most remarkable one, the 5D (five-dimensional) torus represented in the figure below, is dedicated to communications between the compute nodes.  This network provides excellent performance for MPI communications.

The 5D Torus

[Figure: the 5D torus]

The 5D torus connects each compute node to its 10 neighbouring nodes (2 per dimension, one in each direction, over the 5 dimensions).  Its principal characteristics are:

  • 10 bi-directional links per node (20 unidirectional links x 2 GB/s = 40 GB/s of aggregate bandwidth)
  • A DMA engine which allows very good overlap of communications with computations

This network carries all of the MPI communications as well as the input/output traffic.
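To make the figure of 10 neighbours concrete, here is a small sketch enumerating the neighbours of a node in a 5D torus with wraparound; the dimension sizes are arbitrary and illustrative, not Turing's actual partition shape.

    # Each node of a 5D torus has 2 neighbours per dimension (one in each
    # direction), with wraparound at the edges: 10 neighbours in total.
    def torus_neighbours(coord, shape):
        neighbours = []
        for dim in range(len(shape)):
            for step in (-1, +1):
                n = list(coord)
                n[dim] = (n[dim] + step) % shape[dim]   # wraparound
                neighbours.append(tuple(n))
        return neighbours

    shape = (4, 4, 4, 4, 4)       # illustrative dimension sizes only
    node = (0, 0, 0, 0, 0)
    print(len(torus_neighbours(node, shape)))   # 10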

Other networks

  • 1 bi-directional link at 2 GB/s connecting one compute node (out of each group of 64) to an I/O node
  • JTAG: the service network
  • Clock: distribution of the clock signal

Input/Output (I/O)

Regarding the GPFS file system servers

I/O on Turing is handled by GPFS file system servers. These servers are shared with the Ada computing machine.

The available capacity is about 1 PB for the WORKDIR space and 500 TB for the TMPDIR space. The maximum throughput is 50 GB/s.

Regarding Turing

Specialized nodes are responsible for the input/output operations: there is one dedicated I/O node for each set of 64 compute nodes. Each I/O node is capable of writing (or reading) at a rate of up to 3 GB/s. This means that, in theory, it is possible to saturate the file servers with only 17 I/O nodes (that is, a little more than one Turing rack). If the file system is being used by other applications at the same time (Ada also uses it), the performance will be reduced.
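The figure of 17 I/O nodes follows directly from the rates quoted above; a quick arithmetic check using the values on this page:

    import math

    gpfs_max_rate = 50        # maximum throughput of the shared GPFS servers, in GB/s
    io_node_rate = 3          # read or write rate of a single Turing I/O node, in GB/s
    io_nodes_per_rack = 16

    io_nodes_to_saturate = math.ceil(gpfs_max_rate / io_node_rate)
    racks_needed = io_nodes_to_saturate / io_nodes_per_rack

    print(io_nodes_to_saturate)      # 17
    print(round(racks_needed, 2))    # 1.06, i.e. a little more than one rack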