Turing: Memory fragmentation problem

Symptoms

Memory allocations can fail even though the total amount of free memory is sufficient.

Cause

The operating system used on the Turing compute nodes (called CNK, for Compute Node Kernel) is deliberately very simple, both to interfere as little as possible with the processes running on the Blue Gene/Q and to run as fast as possible.

Unfortunately, the CNK memory manager is very basic and cannot handle memory fragmentation.

Memory can become fragmented when an application repeatedly performs dynamic allocations and deallocations. For example, if you allocate a number of small work arrays until memory is full and then deallocate some of them, holes appear in the address space. When you next request an allocation, the operating system looks for a contiguous free region large enough to hold the new array. If none is found (because your array is larger than the largest hole), the allocation fails even if the total amount of free memory is more than sufficient. A more advanced operating system can allocate non-adjacent memory blocks, or move blocks around to prevent fragmentation. The approach used for memory management on the Blue Gene/Q makes allocations and deallocations very fast, but at the cost of this flexibility.

Advice

Here are some suggestions for writing an application in order to prevent memory fragmentation:

  • Allocate all the work arrays at the same time, at the beginning of the application.
  • Avoid numerous, repeated allocations/deallocations.
  • During the execution of your application, prefer allocations/deallocations of fixed-size arrays.

If you cannot get rid of this phenomenon, try managing part of the memory yourself: allocate one large block at the start of the execution and distribute chunks of it as needed, as sketched below. Be aware that this approach can prove to be quite complex.
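As an illustration, here is a minimal sketch of this technique, assuming a simple bump allocator that hands out chunks of a single pre-allocated region and never frees them individually. The names pool_init, pool_alloc and POOL_SIZE are hypothetical and not part of any Blue Gene/Q API; a real implementation would also need to handle alignment and chunk reuse.

#include <stdlib.h>
 
#define POOL_SIZE ((size_t)512*1024*1024)   /* one 512 MB region, allocated once */
 
static char   *pool      = NULL;   /* start of the pre-allocated region */
static size_t  pool_used = 0;      /* bytes already handed out */
 
/* Allocate the whole pool once, at the beginning of the application. */
int pool_init(void)
{
  pool = malloc(POOL_SIZE);
  return (pool != NULL) ? 0 : -1;
}
 
/* Hand out the next chunk of the pool; free() is never called on
   the chunks, so the underlying region cannot fragment. */
void *pool_alloc(size_t size)
{
  if (pool_used + size > POOL_SIZE) return NULL;   /* pool exhausted */
  void *p = pool + pool_used;
  pool_used += size;
  return p;
}
 
int main(void)
{
  if (pool_init() != 0) return EXIT_FAILURE;
 
  /* Work arrays now come from the pool instead of separate callocs */
  double *a = pool_alloc(1000 * sizeof(double));
  double *b = pool_alloc(2000 * sizeof(double));
  if (a == NULL || b == NULL) return EXIT_FAILURE;
 
  free(pool);   /* one single deallocation, at the very end */
  return EXIT_SUCCESS;
}

Because the underlying region is allocated once and released once, the CNK memory manager never sees the intermediate allocation pattern of the application, so it cannot fragment the memory.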

Example

The following is a small program in C which illustrates the problem of memory fragmentation:

#include <stdio.h>
#include <stdlib.h>
 
#define NBLOCKS 40
#define MB (1024*1024)
 
int main(int argc, char **argv)
{
  int     i;
  size_t  size;
  char   *smallblocks[NBLOCKS], *bigblock;
 
  // Allocate 'small' blocks
  // With 1 rank per node, this uses all the memory (16 GB)
  size = 400*MB;
  for (i = 0; i < NBLOCKS; i++)
  {
    smallblocks[i] = (char *) calloc(size, sizeof(char));
    if (smallblocks[i] == NULL)
    {
      printf("Allocation of smallblocks[%i] failed!\n", i);
      break;
    }
  }
 
  // Deallocate every other small block
  // After deallocation, there should be NBLOCKS/2
  // free holes of 400 MB each
  for (i = 0; i < NBLOCKS; i += 2) free(smallblocks[i]);
 
  // Allocate the big block
  // The total free space should be at least 20*400 MB = 8000 MB,
  // but no single hole is larger than 400 MB
  size = 600*MB;
  bigblock = (char *) calloc(size, sizeof(char));
  if (bigblock == NULL)
  {
    printf("Allocation of bigblock failed!\n");
    return EXIT_FAILURE;
  }
 
  return EXIT_SUCCESS;
}

If we run it with --ranks-per-node 1 on a single process, we obtain the following output during the execution:

runjob --ranks-per-node 1 --np 1 : ./a.out
 
Allocation of bigblock failed!
2012-12-19 01:31:25.961 (WARN ) [0xfff68ea8b10] :6919:ibm.runjob.client.Job: normal termination with status 1 from rank 0

The allocation of a 600 MB memory block failed even though at least 8000 MB were free: the free space consisted of twenty 400 MB holes, none of them large enough to hold a contiguous 600 MB block.
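To illustrate the fixed-size advice given above, here is a variant of the final allocation in the program above. The expected success is an assumption based on the layout just described, not an output captured on the machine: a 400 MB request should fit exactly into one of the freed holes.

  // Variant of the final allocation in the program above:
  // a 400 MB request fits exactly into one of the NBLOCKS/2
  // freed 400 MB holes, so it should succeed where 600 MB failed.
  size = 400*MB;
  bigblock = (char *) calloc(size, sizeof(char));
  if (bigblock == NULL)
  {
    printf("Allocation of bigblock failed!\n");
    return EXIT_FAILURE;
  }

This is exactly why reusing fixed-size arrays works well on CNK: a request that matches the size of a previously freed block can always be satisfied from one of the existing holes.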