Turing: Notes on job accounting

Before submitting a job, it is important to understand the IDRIS job accounting policy for the Turing machine. Under this policy, the total number of reserved cores is counted, even if only a fraction of these resources is actually used during the execution. Furthermore, in certain cases you can end up reserving more cores than you intend to use. This is mainly due to the specific features of the Turing hardware architecture and system, which are explained below.

The IDRIS Blue Gene/Q machine is physically composed of 64 blocks, each made up of one I/O node (necessary to boot the block) and 64 compute nodes (CN). When a code is about to execute, the system allocates the necessary resources and dedicates them to the job for the duration of the execution. Depending on the number of CNs requested by the user, the system reserves a whole number (1, 2, 16, …) of blocks, which are combined to form the execution block. The number of reserved CNs (making up the execution block) is therefore necessarily a multiple of 64. Moreover, the system can only create certain block sizes: 64, 128, 256, 512, 1024, 2048 or 4096 CN (if a different size is requested, the system rounds it up to the next permitted size). It is also possible to execute jobs using only a fraction of the compute nodes available inside a block: this is called a subblock job. In this case, network resources (5D torus) may be shared with other jobs. This can have an impact on MPI communication performance and on input/output performance (one I/O node for 64 compute nodes). Here too, only certain subblock sizes are permitted: 1, 2, 4, 8, 16 or 32 compute nodes.
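The rounding rule described above can be illustrated with a short Python sketch. The permitted sizes come from the text; the function name reserved_compute_nodes is hypothetical, and the sketch assumes that a subblock request is rounded up in the same way as a block request, which the text does not state explicitly.

```python
# Permitted sizes, taken from the paragraph above.
SUBBLOCK_SIZES = (1, 2, 4, 8, 16, 32)
BLOCK_SIZES = (64, 128, 256, 512, 1024, 2048, 4096)


def reserved_compute_nodes(requested: int) -> int:
    """Return the smallest permitted size that can hold `requested` nodes.

    Assumption: subblock requests (fewer than 64 nodes) are rounded up
    in the same way as block requests.
    """
    if requested < 1:
        raise ValueError("at least one compute node must be requested")
    # Both tuples are sorted, so the first size that fits is the smallest.
    for size in SUBBLOCK_SIZES + BLOCK_SIZES:
        if size >= requested:
            return size
    raise ValueError("request exceeds the 4096 compute nodes of the machine")


if __name__ == "__main__":
    for requested in (1, 24, 64, 100, 1500):
        print(requested, "CN requested ->", reserved_compute_nodes(requested), "CN reserved")
```

With this sketch, a request for 100 CN yields a reservation of 128 CN, so 28 nodes would be reserved and accounted for without being requested.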

When the code executes, all or only part of the reserved resources can be used (this is the user's choice), but accounting is based on the resources reserved, whether they are used or not. Thus, for every execution, the number of hours tallied is calculated as follows: Tallied hours = number of reserved cores * duration of execution. For example, a job which requested 64 CN and ran 32 MPI processes for 1 hour will be billed 1024 hours (64 CN * 16 cores per CN * 1 h). This situation should therefore be avoided.
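As an illustration, the accounting formula can be written as a small Python sketch. The function names tallied_hours and core_usage are hypothetical, and the usage figure assumes that each MPI process keeps exactly one core busy (no threads).

```python
CORES_PER_NODE = 16  # from the text: 16 cores per compute node


def tallied_hours(reserved_nodes: int, duration_hours: float) -> float:
    """Tallied hours = number of reserved cores * duration of execution."""
    return reserved_nodes * CORES_PER_NODE * duration_hours


def core_usage(reserved_nodes: int, busy_cores: int) -> float:
    """Fraction of the reserved cores that are actually doing work."""
    return busy_cores / (reserved_nodes * CORES_PER_NODE)


if __name__ == "__main__":
    # Example from the text: 64 CN reserved, 32 MPI processes, 1 hour.
    print("tallied hours:", tallied_hours(64, 1.0))
    # Assumption: one busy core per MPI process.
    print("fraction of reserved cores in use:", core_usage(64, 32))
```

For the example in the text, the job is billed 1024 hours while, under the one-core-per-process assumption, only 32 of the 1024 reserved cores are doing work.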

The number of processes per node used during the execution can also have important consequences for job billing. It is possible to launch 1, 2, 4, 8, 16, 32 or 64 MPI processes per compute node, with respective access to 16 GiB, 8 GiB, 4 GiB, 2 GiB, 1 GiB, 512 MiB or 256 MiB of memory per process. If you need a large amount of memory per MPI process, your application is not multi-threaded (does not use OpenMP, for example), and you launch fewer processes than there are cores (16 cores per node), you will waste computing resources. For example, a job of 1024 MPI processes with 4 processes per node (and therefore 4 GiB per process) will be allocated a block of 256 CN. For a computation lasting 1 hour, you will therefore be billed 4096 hours (256 CN * 16 cores per CN * 1 h). Remember, too, that oversubscribing the cores by placing 2 or 4 processes on each core can sometimes be advantageous on the Blue Gene/Q architecture.
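The trade-off between memory per process and billed hours can be tabulated with a minimal Python sketch. The function names are hypothetical, and the 16 GiB of memory per node is inferred from the per-process figures above. The sketch does not round the node count up to a permitted block or subblock size; for 1024 processes every result already is a permitted size.

```python
import math

CORES_PER_NODE = 16                 # from the text
MEMORY_PER_NODE_MIB = 16 * 1024     # 16 GiB, inferred from the per-process figures
PROCESSES_PER_NODE = (1, 2, 4, 8, 16, 32, 64)  # permitted values, from the text


def memory_per_process_mib(per_node: int) -> int:
    """Memory available to each MPI process, in MiB."""
    if per_node not in PROCESSES_PER_NODE:
        raise ValueError(f"unsupported number of processes per node: {per_node}")
    return MEMORY_PER_NODE_MIB // per_node


def nodes_needed(processes: int, per_node: int) -> int:
    """Compute nodes needed to place all processes (no block-size rounding)."""
    return math.ceil(processes / per_node)


def billed_hours(processes: int, per_node: int, duration_hours: float) -> float:
    """Hours billed for the nodes needed, at 16 cores per node."""
    return nodes_needed(processes, per_node) * CORES_PER_NODE * duration_hours


if __name__ == "__main__":
    # 1024 MPI processes running for 1 hour, for each permitted layout.
    for per_node in PROCESSES_PER_NODE:
        print(
            f"{per_node:2d} processes/node: "
            f"{memory_per_process_mib(per_node):5d} MiB/process, "
            f"{nodes_needed(1024, per_node):4d} CN, "
            f"{billed_hours(1024, per_node, 1.0):7.0f} hours billed"
        )
```

For the same 1024 processes, going from 16 processes per node to 4 quadruples the bill (from 1024 to 4096 hours) in exchange for four times more memory per process.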