The Babel machine is an IBM Blue Gene/P system. It has 10 racks, each containing 1024 compute nodes. A compute node has 4 computing cores running at 850MHz and 2GiB of memory. The total theoretical peak performance is 139Tflops (3.4Gflops per core).
Each rack is divided into two midplanes, each containing 512 compute nodes (2048 cores). The compute nodes are grouped by 64, and each group has 1 I/O node. Therefore, partitions (groups of compute nodes) can only be allocated in multiples of 64 nodes (256 cores). This is important for accounting: even if you use only one core, you will always be accounted for a multiple of 256 cores times the elapsed time.
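As a sketch of the accounting rule above (the helper name below is ours for illustration, not an IDRIS tool), the accounted core-hours can be computed by rounding the requested core count up to the next multiple of 256:

```shell
# Hypothetical helper: accounting rounds the core count up to a
# multiple of 256 (one 64-node group), then multiplies by elapsed hours.
accounted_core_hours() {
    local cores=$1 elapsed_hours=$2
    # Round up to the next multiple of 256 cores
    local billed_cores=$(( (cores + 255) / 256 * 256 ))
    echo $(( billed_cores * elapsed_hours ))
}

accounted_core_hours 1 2      # 1 core for 2h is billed as 256 cores: 512
accounted_core_hours 300 2    # 300 cores need two groups (512 cores): 1024
```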
The machine has access to parallel filesystems with a total capacity of nearly 800TB.
The development mode is the default mode when you open your Blue Gene/P project at IDRIS. This mode is reserved for porting your application and running scalability tests. Once done, you have to ask to switch to production.
Logins in this mode are available to all research teams that have access to the national supercomputer centers. An allocation of 32,768 computing hours is available without scientific evaluation. Only a short description of the scientific project and of the computing approach is required.
By default, you have access to a maximum of 4096 computing cores. However, if you demonstrate good scalability up to this level, it is possible to get access to more cores. It is also possible to get an extension of CPU time (with a total, including the initial allocation, not exceeding 65,536 hours) with justification. These extensions must be requested on our extranet website.
The development mode is open for at most one year.
To ask for a development project on Babel, use our ebabel service.
Production mode is the normal mode for IDRIS projects. For these, allocations of hours are made via the usual eDARI interface.
However, you can only switch to production mode and get access to your allocation if you demonstrate that your application has good performance and good scalability on the Blue Gene/P architecture.
To ask for production mode, go to our extranet website.
Logins are available only on a frontend, which is a Power5+ system. There is no direct access to the compute nodes. The frontend is used for cross-compilation, job preparation and submission, and small pre/post-processing.
To log on, it is recommended to use ssh. The address of Babel is babel.idris.fr. It is important to know that access is allowed only from machines declared to IDRIS (see this page to declare new ones).
File transfer can be done via FTP or SFTP (SFTP is not recommended for large amounts of data).
The only supported shell on the login nodes is bash.
You have access to 3 different filesystems. Each one has its own characteristics and usage.
You can check your disk usage with the quota_u command (with the -w option for the WORKDIR).
This is a permanent space with daily backups. Its size is limited to 200MiB per group (+ 200MiB for duplication) due to backup constraints. It is recommended for storing small, frequently used files (source files, libraries and important files).
WORKDIR is a permanent space but without backups. By default, you get 10GiB per project. This value can be increased for DEISA users by contacting the DEISA support, or for normal users via the extranet website.
This is the directory where you can store executables, data files, object files,...
It can be accessed via the $WORKDIR environment variable.
TMPDIR is a temporary space. It is created automatically when an interactive or batch job starts. It is empty at the beginning of the job and is destroyed at the end (be careful with this). Therefore, do not forget to transfer the files you want to keep to your HOME or WORKDIR (or to the file machine Gaya if you have an account on it).
This is the recommended file system for the execution of your jobs. You should use it for temporary files.
It can be accessed via the $TMPDIR environment variable.
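Because $TMPDIR disappears when the job ends, the save-before-exit pattern can be sketched with stand-in directories (the directories and file names below are illustrative only, created locally for the demonstration):

```shell
# Simulate a job's $TMPDIR and $WORKDIR with throwaway local directories
TMPDIR=$(mktemp -d)
WORKDIR=$(mktemp -d)

cd "$TMPDIR"
echo "final result" > results.out     # stand-in for the job's output file
cp results.out "$WORKDIR/"            # keep a copy: $TMPDIR is destroyed at job end

cat "$WORKDIR/results.out"            # the copy survives the job
```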
In order to compile a Fortran code, you can:
Format of the source code:
mpixlf77_r/xlf_r, i.e. the f77 compilers, assume that the code is written in fixed format (old F77 style), whereas the other compilers assume free format. You can specify the format explicitly by adding -qfixed or -qfree=f90 at compilation. Fixed format is rarely used nowadays, since the Fortran 90/95 standards made free-format programming possible.
Suffix of the files:
The source files can use the extensions .f03, .f95, .f90 or .f. If capital letters are used instead (.F90, etc.), a call to the preprocessor is assumed (-qsuffix=f=own_suffix can be used if you want to specify your own extension).
If prog.f90 is written in a free format then you can compile it in 2 different ways:
Babel : mpixlf90_r prog.f90 -o prog
Babel : mpixlf77_r -qfree=f90 prog.f90 -o prog
In order to compile the following program (source.f95) using MPI and OpenMP, use the command line:
mpixlf95_r -qsmp=omp source.f95
mpixlf95_r is the IBM parallel compiler (the IBM Fortran compiler combined with the MPI library).
-qsmp=omp: this option makes the compiler take into account the OpenMP directives inside the source code.
Here are the compilers used for the compilation on the frontend and on the compute nodes of the BG/P:
|Language||Frontend||Compute nodes||Source file suffix|
|C++||xlC_r, xlc++_r||mpixlcxx_r||.C, .cxx, .c++, .cc, .cp, .cpp|
Example: How to create an executable on the BG/P
Babel : mpixlc_r prog.c -o prog
Babel : mpixlcxx_r prog.C -o prog
In order to compile the following program (source.c) using MPI and OpenMP, use the command line:
mpixlc_r -qsmp=omp source.c
mpixlcxx_r -qsmp=omp source.C
mpixlc_r / mpixlcxx_r are the IBM parallel compilers (the IBM C/C++ compilers combined with the MPI library).
-qsmp=omp: this option makes the compiler take into account the OpenMP directives inside the source code.
Scientific/numerical libraries are available in the /bglocal/pub and /bglocal/prod directories. If one is missing please contact the user support team.
Most of them are available via the module command. To get the list of available modules:
your_login@babel1:~> module avail
------------ /local/pub/Modules/IDRIS/modulefiles/environnement ------------
compilerwrappers/no           compilerwrappers/yes(default)
------------ /local/pub/Modules/IDRIS/modulefiles/compilateurs ------------
c++/ibm/22.214.171.124(default)   fortran/ibm/126.96.36.199(default)
------------ /local/pub/Modules/IDRIS/modulefiles/bibliotheques ------------
arpack/96(default)            mumps/4.7.3(default)
blacs/1.1(default)            netcdf/3.6.2
blas/4.4(default)             netcdf/3.6.3(default)
blassmp/4.4(default)          p3dfft/2.3.2(default)
essl/4.4(default)             parpack/96(default)
esslsmp/4.4(default)          petsc/2.3.3
fftw/2.1.5(default)           petsc/3.0.0-p2/c-real(default)
fftw/3.1.2                    petsc/3.0.0-p2/c-real-debug
fftw/3.2.2                    petsc/3.0.0-p8/babel-real
fftw/3.2.2_fpu                petsc/3.0.0-p8/babel-real-debug
hdf5/1.8.1                    phdf5/1.8.1
hdf5/1.8.2(default)           phdf5/1.8.2(default)
hdf5/1.8.5                    phdf5/1.8.5
hypre/2.4.0b(default)         pnetcdf/1.0.3
lapack/3.1.1                  pnetcdf/1.1.0
lapack/3.2.2(default)         pnetcdf/1.1.1(default)
mass/4.4(default)             pnetcdf/1.2.0
metis/4.0.1(default)          scalapack/1.8.0(default)
metis_frontend/4.0.1(default) sundials/2.4.0(default)
------------ /local/pub/Modules/IDRIS/modulefiles/outils ------------
cmake/2.6.4(default)          scalasca/1.2
fpmpi2/2.1f(default)          scalasca/1.3.0
hpm/3.2.5(default)            scalasca/1.3.1(default)
libhpcidris/2.0               subversion/1.4.6
libhpcidris/3.0(default)      subversion/1.6.3(default)
mpip/3.1.2(default)           totalview/8.6.0-1(default)
mpitrace/def(default)         totalview/8.7.0-2
------------ /local/pub/Modules/IDRIS/modulefiles/applications ------------
cpmd/3.13.1                   lammps/2010.06.25
cpmd/3.13.2(default)          lammps/2010.09.24
gromacs/3.3.3                 namd/2.6(default)
gromacs/4.0.3(default)        namd/2.6.beta
gromacs/4.0.5                 namd/2.7b1
gromacs/4.5.1                 namd/2.7b2
lammps/2007.06.22             namd/2.7b3
lammps/2008.01.22(default)    namd/2.7b4
lammps/2009.03.26
------------ /local/pub/Modules/DEISA/modulefiles/init ------------
deisa
Loading a module will set all the environment of this application or library. If it is a library, you don't have to add the path to the library or include files, it will be done automatically for you. Example:
your_login@babel1:~> module load fftw
(load) FFTW version 2.1.5
See limits in the run limits section.
It is not a true interactive mode: the scheduler creates a script file and submits it as a batch job, but this is hidden from the user.
In order to use this mode, a special command based on mpirun is needed: bgrun.
Example: running a job with 256 cores:
rlab432@babel:/homegpfs/rech/lab/rlab432> bgrun -np 256 -mode VN -exe ./executable
If the user needs more computing resources in terms of memory or number of nodes, using batch mode is necessary.
LoadLeveler is the scheduler or job manager used on the BG/P to run any application in batch mode.
In order to submit your job, you need to write a special script in which you define your computational needs (number of CPUs, memory, etc.) so that LoadLeveler can set up the environment you are asking for.
Here is a simple example on how to run a code with 256 cores via the scheduler. A script job.ll is created.
Babel: more job.ll
# @ job_name = job_simple
# @ job_type = BLUEGENE
# Standard output file of the job
# @ output = $(job_name).$(jobid)
# Standard error file of the job
# @ error = $(output)
# Maximum elapsed time requested
# @ wall_clock_limit = 1:00:00
# Size of the execution partition
# @ bg_size = 64
# @ queue

# Copy executable and input file to TMPDIR
# Warning: if you need to transfer important volumes
# of data, please use a multi-step job
cp my_code $TMPDIR
cp data.in $TMPDIR
cd $TMPDIR
mpirun -mode VN -np 256 -mapfile TXYZ -exe ./my_code
# Copy output file to submission directory
# Warning: if you need to transfer important volumes
# of data, please use a multi-step job
# $LOADL_STEP_INITDIR is the submission directory
cp data.out $LOADL_STEP_INITDIR/
Then you submit this script via the llsubmit command:
Babel: llsubmit job.ll
N.B.: The script can be divided into two different parts:
The first one contains the directives passed to LoadLeveler via #@. The second one contains the command lines to be executed on the compute nodes.
A script submitted to LoadLeveler contains special directives; some of them are mandatory, otherwise the job is rejected by the job manager.
A complete description is given by the IBM documentation.
Remarks about the script:
The number of MPI processes (specified after -np in the call to mpirun) must be consistent with the partition size and the execution mode used (-mode SMP, -mode DUAL, -mode VN).
Special attention should be paid to the mapping (-mapfile option; the default value TXYZ is also the recommended one), which can improve the performance of your application by minimizing the communication time between the MPI processes.
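The consistency rule between -np, the partition size and the execution mode can be sketched as follows (max_np is our own illustrative helper; the ranks per node are 1 for SMP, 2 for DUAL and 4 for VN):

```shell
# Maximum MPI process count for a partition of bg_size nodes in a given mode
max_np() {
    local bg_size=$1 mode=$2
    case $mode in
        SMP)  echo $(( bg_size * 1 )) ;;   # 1 MPI process per node
        DUAL) echo $(( bg_size * 2 )) ;;   # 2 MPI processes per node
        VN)   echo $(( bg_size * 4 )) ;;   # 4 MPI processes per node
    esac
}

max_np 64 VN     # bg_size = 64 in VN mode allows -np up to 256
```

This matches the earlier example, where bg_size = 64 and -mode VN are used with -np 256.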
When a job runs, the total consumed time depends on the number of cores allocated. If a single core performs important preprocessing steps while N cores are allocated, the idle time of the remaining N-1 cores is also accounted. To avoid this, a multi-step job should be created.
Quite often, jobs perform different steps. For example:
Therefore, rather than submitting a script in which serial tasks are executed in a parallel environment, it is possible to create a single script where serial and parallel tasks are separated: serial tasks are executed on the frontend and parallel tasks on the compute nodes.
N.B.: The TMPDIR directory is shared by the different steps (i.e. temporary files can be transferred to/from that location for fast access) and is not destroyed between steps.
Here is an example where data is transferred from Gaya, then the job is run on 256 cores, and finally an archive containing the results of the run is created. The job submission file is the following (we call it multi-steps.ll):
#=========== Global directives ===========
#@ shell = /bin/bash
#@ job_name = test_multi-steps
#@ output = $(job_name).$(step_name).$(jobid)
#@ error = $(output)

#=========== Step 1 directives ===========
#======= Sequential preprocessing ========
#@ step_name = sequential_preprocessing
#@ job_type = serial
#@ cpu_limit = 0:15:00
#@ queue

#=========== Step 2 directives ===========
#============= Parallel step =============
#@ step_name = parallel_step
#@ dependency = (sequential_preprocessing == 0)
#  (submit only if previous step completed without error)
#@ job_type = bluegene
#@ bg_size = 64
#@ wall_clock_limit = 1:00:00
#@ queue

#=========== Step 3 directives ===========
#======= Sequential postprocessing =======
#@ step_name = sequential_postprocessing
#@ dependency = (parallel_step >= 0)
#  (submit even if previous step completed with an error)
#@ job_type = serial
#@ cpu_limit = 0:15:00
#@ queue

case $LOADL_STEP_NAME in

  #============ Step 1 commands ============
  #======= Sequential preprocessing ========
  sequential_preprocessing )
    set -ex
    cd $tmpdir
    mfget test/src/coucou_MPI.f
    mpixlf90_r coucou_MPI.f -o coucou_MPI.exe
    mfget test/data.tar
    tar xvf data.tar
    ls -l
    ;;

  #============ Step 2 commands ============
  #============= Parallel step =============
  parallel_step )
    set -x
    cd $tmpdir
    mpirun -mode VN -np 256 -mapfile TXYZ ./coucou_MPI.exe
    ;;

  #============ Step 3 commands ============
  #======= Sequential postprocessing =======
  sequential_postprocessing )
    set -x
    cd $tmpdir
    tar cvf result.tar *.dat
    mfput result.tar test/result.tar
    ;;
esac
This job is separated into two parts: the first one contains all the directives for the LoadLeveler queue manager, and the second one contains the commands executed by the different steps of the job. These two parts may be mixed, but it is not good practice.
Each LoadLeveler step directive section must be finished by a #@ queue directive.
#@ dependency: this directive creates a dependency between the different steps. In our case, the second step can be executed only if the first one completed without errors (return value equal to zero). If you want to execute a step even if the previous one failed, use >= 0 (see the example script). Dependency directives are necessary if you want the steps to execute one after the other (otherwise all steps start independently of each other).
The submission of this job creates three different sub-jobs with the same content. To distinguish between them, the LOADL_STEP_NAME variable is set. You can use branches (case) in your script to execute the right part of the script (see the example).
The multi-step job is submitted to LoadLeveler with the llsubmit command:
Babel: llsubmit multi-steps.ll
All the steps are executed in the same directory (the one from which you submit the job). Files are therefore written there unless specified otherwise, which means that each step should use a different name for its output file; otherwise each step will overwrite the output of the previous one.
In our example the line: #@ output = $(job_name).$(step_name).$(jobid) shows one way to avoid that. The name of the output file is a variable that depends on the step name.
Each compute node can work in one of the three following modes:
|Mode||MPI processes/node||Max memory/MPI process|
|SMP||1||2GiB|
|DUAL||2||1GiB|
|VN||4||512MiB|
The execution mode is selected at submission time via the -mode option of mpirun.
mpirun -mode VN -np 256 -exe ./my_code
Your consumption can be checked with the cpt command (updated once a day).
For single-processor executions (cross-compilation, file transfers, small pre/post-processing, etc.), there are 3 queues:
Memory (stack + data):
T(h)
  ^
  |
20h +-------+--------+--------+--------+--------+-------+
    |       |        |        |        |        |       |
    | MRt3  |  1Rt3  |  2Rt3  |  4Rt3  |  8Rt3  | 10Rt3 |
    |       |        |        |        |        |       |
10h +-------+--------+--------+--------+--------+-------+
    |       |        |        |        |        |       |
    | MRt2  |  1Rt2  |  2Rt2  |  4Rt2  |  8Rt2  | 10Rt2 |
    |       |        |        |        |        |       |
 1h +-------+--------+--------+--------+--------+-------+
    | MRt1  |  1Rt1  |  2Rt1  |  4Rt1  |  8Rt1  | 10Rt1 |
  0 +-------+--------+--------+--------+--------+-------+-->
   64      512      1024     2048     4096     8192  10240
                    Number of compute nodes

core -> base unit
compute node -> 4 cores
MR -> Half rack: 512 nodes (2048 cores)
1R -> 1 rack: 1024 nodes (4096 cores)
...
T(h): elapsed time in hours
Memory: 2GiB/node
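One way to read the class chart above: the class name is the partition-size bucket (MR, 1R, 2R, 4R, 8R, 10R) combined with the elapsed-time tier (t1 up to 1h, t2 up to 10h, t3 up to 20h). A sketch of this reading (the helper is ours, not an IDRIS command):

```shell
# Pick the batch class from the chart: size bucket x elapsed-time tier
class_for() {
    local nodes=$1 hours=$2 size tier
    if   [ "$nodes" -le 512 ];  then size=MR
    elif [ "$nodes" -le 1024 ]; then size=1R
    elif [ "$nodes" -le 2048 ]; then size=2R
    elif [ "$nodes" -le 4096 ]; then size=4R
    elif [ "$nodes" -le 8192 ]; then size=8R
    else                             size=10R
    fi
    if   [ "$hours" -le 1 ];  then tier=t1
    elif [ "$hours" -le 10 ]; then tier=t2
    else                           tier=t3
    fi
    echo "${size}${tier}"
}

class_for 64 1      # a 64-node, 1-hour job falls in class MRt1
class_for 1024 12   # a full rack for 12 hours falls in class 1Rt3
```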
See this page to get IBM documents about the Blue Gene/P (most of the documents are in English).