Babel : Getting started

Hardware

The Babel machine is an IBM Blue Gene/P system. It has 10 racks, each containing 1024 compute nodes. Each compute node has 4 compute cores running at 850MHz and 2GiB of memory. The total theoretical peak performance is 139Tflops (3.4Gflops per core).

Each rack is divided into two midplanes, each containing 512 compute nodes (2048 cores). The compute nodes are grouped by 64 and each group has 1 I/O node. Therefore, it is only possible to allocate partitions (groups of compute nodes) in multiples of 64 nodes (256 cores). This matters for accounting: even if you use only one core, you will always be charged for a multiple of 256 cores times the elapsed time.

The machine has access to parallel filesystems with a total capacity of nearly 800TB.

Development and production modes

Development mode

The development mode is the default mode when you open your Blue Gene/P project at IDRIS. This mode is reserved for porting your application and doing scalability tests. Once this is done, you have to request the switch to production mode.

Logins in this mode are available to all research teams that have access to the national supercomputer centers. An allocation of 32,768 computing hours is available without scientific evaluation. Only a short description of the scientific project and of the computing approach is required.

By default, you have access to a maximum of 4096 compute cores. However, if you demonstrate good scalability up to this level, it is possible to get access to more cores. It is also possible to get an extension of CPU time (with a total, including the initial allocation, not exceeding 65,536 hours) with justification. These extensions must be requested on our extranet website.

The development mode is open for at most one year.

To request a development project on Babel, use our ebabel service.

Production mode

Production mode is the normal mode for IDRIS projects. For these, hour allocations are made via the usual eDARI interface.

However, you can only switch to production mode and get access to your allocation if you show that your application has good performance and good scalability on the Blue Gene/P architecture.

To request production mode, go to our extranet website.

Login

Logins are available only on the frontend, which is a Power5+ system. There is no direct access to the compute nodes. The frontend is used for cross-compilation, job preparation and submission, and small pre/post-processing.

To log in, it is recommended to use ssh. The address of Babel is babel.idris.fr. Note that access is allowed only from machines declared to IDRIS (see this page to declare new ones).
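For example, from a declared machine (your_login is a placeholder for your Babel login):

ssh your_login@babel.idris.fr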

File transfers can be done via FTP or SFTP (SFTP is not recommended for large amounts of data).
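For example, an SFTP session from a declared machine (the file name below is only a placeholder):

sftp your_login@babel.idris.fr
sftp> put results.tar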

The only supported shell on the login nodes is bash.

File systems

You have access to 3 different filesystems. Each one has its own characteristics and usage.

You can check your disk usage with the quota_u command (with the -w option for the WORKDIR).
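For example (the output of the command is not reproduced here):

your_login@babel1:~> quota_u
your_login@babel1:~> quota_u -w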

HOME

This is a permanent space with daily backups. Its size is limited to 200MiB per group (+ 200MiB for duplication) due to backup constraints. It is recommended for storing small, frequently used files (source files, libraries and important files).

WORKDIR

WORKDIR is a permanent space but without backups. By default, you get 10GiB per project. This value can be increased for DEISA users by contacting the DEISA support, or for normal users via the extranet website.

This is the directory where you can store executables, data files, object files,…

It can be accessed via the $WORKDIR environment variable.

TMPDIR

TMPDIR is a temporary space. It is created automatically when an interactive or batch job starts. It is empty at the beginning of the job and is destroyed at the end (be careful with this). Therefore, do not forget to transfer the files you want to keep to your HOME or WORKDIR (or to the file machine Gaya if you have an account on it).

This is the recommended file system for the execution of your jobs. You should use it for temporary files.

It can be accessed via the $TMPDIR environment variable.
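A typical pattern in a job, sketched below with placeholder file names, is to stage the input data into $TMPDIR, run there, and copy the results back before the job ends:

cd $TMPDIR
cp $WORKDIR/input.dat .      # stage input data (placeholder name)
# ... run your code here ...
cp output.dat $WORKDIR/      # save results before TMPDIR is destroyed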

Compilation

Fortran compilers

To compile Fortran code, you can:

  1. Compile parallel executables for the compute nodes: use the commands mpixlf2003_r, mpixlf95_r, mpixlf90_r, mpixlf77_r.
  2. Compile serial executables for the frontend: use the commands xlf2003/xlf2003_r/f2003, xlf95/xlf95_r/f95, xlf90/xlf90_r/f90, xlf/f77/xlf_r.

Format of the source code:

mpixlf77_r and xlf_r (i.e. the f77 compilers) assume that the code is written in fixed format (old F77 style), whereas the others assume free format. You can set the format explicitly by adding -qfixed or -qfree=f90 at compilation. Fixed format is rarely used nowadays, since the Fortran 90/95 standards made free-format programming possible.

Suffix of the files:

The source files can use the extensions .f03, .f95, .f90 or .f. If the extension is written in capital letters, the preprocessor is called (-qsuffix=f=own_suffix can be used if you want to specify your own extension).

Example:

If prog.f90 is written in free format, you can compile it in 2 different ways:

Babel : mpixlf90_r prog.f90 -o prog
Babel : mpixlf77_r -qfree=f90 prog.f90 -o prog
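Similarly, a capital-letter suffix triggers the preprocessor, and -qsuffix lets you declare your own extension; the file names below are only illustrative:

Babel : mpixlf90_r prog.F90 -o prog
Babel : mpixlf90_r -qsuffix=f=f08 prog.f08 -o prog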

Compiling a hybrid parallel code using MPI and OpenMP in Fortran

To compile a hybrid MPI + OpenMP program (here a file named source.f95), use the command line:

mpixlf95_r -qsmp=omp source.f95

mpixlf95_r is the IBM parallel compiler wrapper (the IBM Fortran compiler combined with the MPI library).

-qsmp=omp: this option tells the compiler to take the OpenMP directives in the source code into account.
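To actually use the OpenMP threads on the compute nodes, the executable must run in SMP or DUAL mode and the number of threads must be passed to the processes. On the BG/P mpirun this is typically done with the -env option; treat the exact syntax below as an assumption to check against the mpirun documentation:

mpirun -mode SMP -np 64 -env "OMP_NUM_THREADS=4" -exe ./my_code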

C/C++ Compilers on Babel

Here are the compilers used for the compilation on the frontend and on the compute nodes of the BG/P:

Language   Frontend          Compute nodes   Source file suffixes
C          xlc_r             mpixlc_r        .c
C++        xlC_r, xlc++_r    mpixlcxx_r      .C, .cxx, .c++, .cc, .cp, .cpp

Example: How to create an executable on the BG/P

Babel : mpixlc_r prog.c -o prog
Babel : mpixlcxx_r prog.C -o prog

Compiling a hybrid parallel code using MPI and OpenMP in C

To compile a hybrid MPI + OpenMP program in C or C++ (here files named source.c and source.C), use the command lines:

mpixlc_r -qsmp=omp source.c
mpixlcxx_r -qsmp=omp source.C

mpixlc_r / mpixlcxx_r are the IBM parallel compiler wrappers (the IBM C/C++ compilers combined with the MPI library).

-qsmp=omp: this option tells the compiler to take the OpenMP directives in the source code into account.

Libraries

Scientific/numerical libraries are available in the /bglocal/pub and /bglocal/prod directories. If one is missing please contact the user support team.

Most of them are available via the module command. To get the list of available modules:

your_login@babel1:~> module avail
------------ /local/pub/Modules/IDRIS/modulefiles/environnement ------------
compilerwrappers/no           compilerwrappers/yes(default)
------------ /local/pub/Modules/IDRIS/modulefiles/compilateurs ------------
c++/ibm/9.0.0.6(default)      fortran/ibm/11.1.0.6(default)
------------ /local/pub/Modules/IDRIS/modulefiles/bibliotheques ------------
arpack/96(default)              mumps/4.7.3(default)
blacs/1.1(default)              netcdf/3.6.2
blas/4.4(default)               netcdf/3.6.3(default)
blassmp/4.4(default)            p3dfft/2.3.2(default)
essl/4.4(default)               parpack/96(default)
esslsmp/4.4(default)            petsc/2.3.3
fftw/2.1.5(default)             petsc/3.0.0-p2/c-real(default)
fftw/3.1.2                      petsc/3.0.0-p2/c-real-debug
fftw/3.2.2                      petsc/3.0.0-p8/babel-real
fftw/3.2.2_fpu                  petsc/3.0.0-p8/babel-real-debug
hdf5/1.8.1                      phdf5/1.8.1
hdf5/1.8.2(default)             phdf5/1.8.2(default)
hdf5/1.8.5                      phdf5/1.8.5
hypre/2.4.0b(default)           pnetcdf/1.0.3
lapack/3.1.1                    pnetcdf/1.1.0
lapack/3.2.2(default)           pnetcdf/1.1.1(default)
mass/4.4(default)               pnetcdf/1.2.0
metis/4.0.1(default)            scalapack/1.8.0(default)
metis_frontend/4.0.1(default)   sundials/2.4.0(default)
------------ /local/pub/Modules/IDRIS/modulefiles/outils ------------
cmake/2.6.4(default)       scalasca/1.2
fpmpi2/2.1f(default)       scalasca/1.3.0
hpm/3.2.5(default)         scalasca/1.3.1(default)
libhpcidris/2.0            subversion/1.4.6
libhpcidris/3.0(default)   subversion/1.6.3(default)
mpip/3.1.2(default)        totalview/8.6.0-1(default)
mpitrace/def(default)      totalview/8.7.0-2
------------ /local/pub/Modules/IDRIS/modulefiles/applications ------------
cpmd/3.13.1                lammps/2010.06.25
cpmd/3.13.2(default)       lammps/2010.09.24
gromacs/3.3.3              namd/2.6(default)
gromacs/4.0.3(default)     namd/2.6.beta
gromacs/4.0.5              namd/2.7b1
gromacs/4.5.1              namd/2.7b2
lammps/2007.06.22          namd/2.7b3
lammps/2008.01.22(default) namd/2.7b4
lammps/2009.03.26
------------ /local/pub/Modules/DEISA/modulefiles/init ------------
deisa

Loading a module sets up the whole environment of the application or library. For a library, you don't have to add the path to the library or include files yourself; it is done automatically. Example:

your_login@babel1:~> module load fftw
(load) FFTW version 2.1.5
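You can then check which modules are loaded and what a given module defines with the standard module sub-commands, for example:

your_login@babel1:~> module list
your_login@babel1:~> module show fftw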

Interactive jobs

See limits in the run limits section.

It is not a real interactive mode: the scheduler creates a script file and submits it as a batch job, but this is hidden from the user.

In this mode, a special command based on mpirun is needed: bgrun.

Example: running a job on 256 cores:

rlab432@babel:/homegpfs/rech/lab/rlab432> bgrun -np 256 -mode VN -exe ./executable

If the user needs more computing resources in terms of memory and number of nodes, using the batch mode is necessary.

Batch jobs

Simple batch job

LoadLeveler is the scheduler or job manager used on the BG/P to run any application in batch mode.

In order to submit your job, you need to write a special script in which you define your computational needs (number of CPUs, memory,…) so that LoadLeveler can set up the environment you are asking for.

Here is a simple example of how to run a code on 256 cores via the scheduler. A script job.ll is created.

Babel: more job.ll

# @ job_name = job_simple
# @ job_type = BLUEGENE
# Standard output file of the job
# @ output = $(job_name).$(jobid)
# Standard error file of the job
# @ error = $(output)
# Maximum requested elapsed time
# @ wall_clock_limit = 1:00:00
# Size of the execution partition
# @ bg_size = 64
# @ queue

# Copy executable and input file to TMPDIR
# Warning: if you need to transfer important volumes
# of data, please use a multi-step job
cp my_code $TMPDIR
cp data.in $TMPDIR
cd $TMPDIR

mpirun -mode VN -np 256 -mapfile TXYZ -exe ./my_code

# Copy output file to submission directory
# Warning: if you need to transfer important volumes
# of data, please use a multi-step job
# $LOADL_STEP_INITDIR is the submission directory
cp data.out $LOADL_STEP_INITDIR/

Then you submit this script via the llsubmit command:

Babel: llsubmit job.ll

N.B.: The script can be divided into two parts:

The first part contains the directives passed to LoadLeveler (lines starting with # @). The second part contains the commands to be executed on the compute nodes.

LoadLeveler directives

A script submitted to LoadLeveler contains special directives; some of them are mandatory, otherwise the job is rejected by the job manager.

Necessary directives:

  • # @ job_type = BLUEGENE: type of the job. Use BLUEGENE if the job has to be executed on the compute nodes; SERIAL is used for pre/post-processing on the frontend
  • # @ bg_size: partition size to be allocated for the job (allowed values: 64, 128, 256, 512, 1024, 2048, 4096, 6144, 8192 and 10240). The number of MPI processes depends on this value. Note that the maximum allowed partition size is 1024 by default. To get access to more compute nodes, you have to prove good scalability.
  • # @ wall_clock_limit: the elapsed time of the job must not exceed this limit (given in seconds or in HH:MM:SS format)
  • # @ queue: last LoadLeveler directive (no directive is taken into account after this one)

Optional directives:

  • # @ bg_connection: can take 3 values: MESH, TORUS or PREFER_TORUS. The default is MESH. MESH is the only valid value for partitions with fewer than 512 compute nodes. For larger partitions, it is recommended to set it to TORUS
  • # @ job_name: name of the job
  • # @ output: name of the output file
  • # @ error: name of the error file. Using $(output) should put any error messages into the output file
  • # @ notification: allows you to receive an email at the end of the run.

A complete description is given by the IBM documentation.

Remarks about the script:

The number of MPI processes (given after -np in the mpirun call) must be consistent with the partition size and the execution mode used (-mode SMP, -mode DUAL or -mode VN).
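For example, with # @ bg_size = 64 (64 compute nodes with 4 cores each), consistent combinations are:

mpirun -mode SMP  -np 64  -exe ./my_code
mpirun -mode DUAL -np 128 -exe ./my_code
mpirun -mode VN   -np 256 -exe ./my_code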

Special attention should be paid to the mapping (-mapfile option; the default value TXYZ is also the recommended value), which can improve the performance of your application by minimizing the communication time between the MPI processes.

When a job runs, the total consumed time depends on the number of reserved cores. If a single core performs long preprocessing steps while N cores are allocated, the idle time of the other N-1 cores is also charged. To avoid this, use a multi-step job.

Multi-steps jobs

Quite often, jobs perform different steps. For example:

  1. Get some files from Gaya/WORKDIR for preprocessing purposes
  2. Run the parallel job on the compute nodes
  3. Put the created files on Gaya/WORKDIR

Therefore, rather than submitting a script in which serial tasks are executed in a parallel environment, it is possible to create a single script in which serial and parallel tasks are separated: serial tasks are executed on the frontend and parallel tasks on the compute nodes.

N.B.: The TMPDIR directory is shared by the different steps (i.e. temporary files can be transferred to/from that location for fast access) and is not destroyed between steps.

Here is an example where data is transferred from Gaya, then the job is run on 256 cores, and finally an archive containing the results of the run is created. The job submission file is the following (here called job_multi.ll):

#=========== Global directives ===========
#@ shell    = /bin/bash
#@ job_name = test_multi-steps
#@ output   = $(job_name).$(step_name).$(jobid)
#@ error    = $(output)

#=========== Step 1 directives ===========
#======= Sequential preprocessing ========
#@ step_name = sequential_preprocessing
#@ job_type  = serial
#@ cpu_limit = 0:15:00
#@ queue

#=========== Step 2 directives ===========
#============= Parallel step =============
#@ step_name  = parallel_step
#@ dependency = (sequential_preprocessing == 0)
# (submit only if previous step completed without error)
#@ job_type   = bluegene
#@ bg_size    = 64
#@ wall_clock_limit = 1:00:00
#@ queue

#=========== Step 3 directives ===========
#======= Sequential postprocessing =======
#@ step_name  = sequential_postprocessing
#@ dependency = (parallel_step >= 0)
# (submit even if previous step completed with an error)
#@ job_type   = serial
#@ cpu_limit  = 0:15:00
#@ queue

case $LOADL_STEP_NAME in

  #============ Step 1 commands ============
  #======= Sequential preprocessing ========
  sequential_preprocessing )
    set -ex
    cd $TMPDIR
    mfget test/src/coucou_MPI.f
    mpixlf90_r coucou_MPI.f -o coucou_MPI.exe

    mfget test/data.tar
    tar xvf data.tar
    ls -l
    ;;

  #============ Step 2 commands ============
  #============= Parallel step =============
  parallel_step )
    set -x
    cd $TMPDIR
    mpirun -mode VN -np 256 -mapfile TXYZ ./coucou_MPI.exe
    ;;

  #============ Step 3 commands ============
  #======= Sequential postprocessing =======
  sequential_postprocessing )
    set -x
    cd $TMPDIR
    tar cvf result.tar *.dat
    mfput result.tar test/result.tar
    ;;
esac

This job is separated into two parts: the first one contains all the directives for the LoadLeveler job manager, and the second one the commands executed by the different steps of the job. These two parts may be mixed, but it is not good practice.

Each LoadLeveler step directive section must end with a #@ queue directive.

#@ dependency: this option creates a dependency between the different steps. In our case, the second step is executed only if the first one completed without errors (return value equal to zero). If you want to execute a step even if the previous one failed, use >= 0 (see the example script). Dependency directives are necessary if you want the steps to execute one after the other; otherwise each step starts independently of the others.

Submitting this job creates three different sub-jobs with the same content. To distinguish between them, the LOADL_STEP_NAME variable is set. You can use branches (case) in your script to execute the right part of it (see the example).

The multi-step job is submitted to LoadLeveler with the llsubmit command:

Babel: llsubmit job_multi.ll

All the steps are executed in the same directory (the one from which you submit the job). Files are therefore written there (unless specified otherwise), which means that each step should use a different name for its output file; otherwise each step will overwrite the previous output file.

In our example, the line #@ output = $(job_name).$(step_name).$(jobid) shows one way to avoid that: the name of the output file depends on the step name.

Tools to control jobs

  • llsubmit : to submit jobs.
  • llq : to see the submission queue and get information about submitted jobs (status,…). The -l and -s options are often useful.
  • llcancel : to cancel a job.
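For example (the job identifier below is a placeholder):

Babel: llq -u your_login
Babel: llcancel babel1.123456.0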

Execution modes

Each compute node can work in one of the three following modes:

  • SMP: the compute node is seen as one quad-processor SMP node. There is only 1 MPI process per compute node, with access to 2GiB of memory. Without multithreading, only one of the 4 physical cores of the node is effectively used. To use the 3 others, a hybrid parallelisation paradigm with OpenMP/pthreads inside each compute node is necessary (limited to 4 threads per MPI process).
  • DUAL: the compute node is seen as two dual-processor SMP nodes. There are 2 MPI processes per compute node, with 1GiB of memory per MPI process. Without multithreading, only two of the 4 physical cores are effectively used. To use the 2 others, a hybrid parallelisation paradigm with OpenMP/pthreads inside each compute node is necessary (limited to 2 threads per MPI process).
  • VN: the compute node is seen as 4 independent single-processor systems. Each one has access to only one quarter of the node memory (512MiB per core). The programming paradigm is limited to pure MPI (no multithreading support). There are 4 MPI processes per compute node.
Mode   MPI processes/node   Max memory/MPI process
VN     4                    512MiB
DUAL   2                    1GiB
SMP    1                    2GiB

The execution mode is selected at submission time via the -mode option of mpirun.

mpirun -mode VN -np 256 -exe ./my_code

Run limits

Your consumption can be checked with the cpt command (updated once a day).

Batch jobs on the frontend (Power5+)

For single-processor executions (cross-compilation, file transfers, small pre/post-processing,…), there are 3 queues:

  • t1: maximum 1h of elapsed time and 15 minutes of CPU time
  • t2: maximum 10h of elapsed time and 2h of CPU time
  • archive: only for file transfers with Gaya (with mfput/mfget commands)

Memory (stack + data):

  • By default: data=2.0GiB, stack=2.0GiB
  • data + stack ≤ 4.0GiB

Batch jobs on Blue Gene/P

   T(h)

       |
   20h +-------+--------+--------+--------+--------+-------+
       |       |        |        |        |        |       |
       | MRt3  | 1Rt3   | 2Rt3   | 4Rt3   | 8Rt3   | 10Rt3 |
       |       |        |        |        |        |       |
   10h +-------+--------+--------+--------+--------+-------+
       |       |        |        |        |        |       |
       | MRt2  | 1Rt2   | 2Rt2   | 4Rt2   | 8Rt2   | 10Rt2 |
       |       |        |        |        |        |       |
    1h +-------+--------+--------+--------+--------+-------+
       | MRt1  | 1Rt1   | 2Rt1   | 4Rt1   | 8Rt1   | 10Rt1 |
     0 +-------+--------+--------+--------+--------+-------+-->
       64      512      1024     2048     4096     8192    10240
                                                 Number of compute nodes
 core -> base unit
 compute node -> 4 cores
 MR -> Half rack: 512 nodes (2048 cores)
 1R  -> 1 rack: 1024 nodes (4096 cores)
 ...
 T(h) : elapsed time in hours

 Memory: 2GiB/node

Remarks:

  • The minimum number of compute nodes that can be reserved is 64 (bg_size directive). This corresponds to a minimum of 256 cores.
  • The number of nodes is always a multiple of 64.
  • Be careful, it is the elapsed time multiplied by the number of reserved cores that is accounted for even if some of them are not used.

Interactive jobs on Blue Gene/P (bgrun command)

  • Max elapsed time: 30min
  • Number of compute nodes: from 64 to 256

Remarks:

  • The minimum number of compute nodes that can be reserved is 64. This corresponds to a minimum of 256 cores.
  • The number of nodes is always a multiple of 64.
  • Be careful, it is the elapsed time multiplied by the number of reserved cores that is accounted for even if some of them are not used.

Interactive limits on the frontend (Power5+)

  • CPU time: 1h
  • Memory: see batch limits

Documentation

See this page to get the IBM documents about the Blue Gene/P (most of them are in English).