Turing : MPITRACE

Description

MPITRACE is a tool developed and supplied by IBM. It allows you to analyse the behaviour of an MPI/OpenMP application and to easily identify its critical parts.

Utilisation

Using MPITRACE involves three steps:

  1. Instrumentation of the application
  2. Execution of the instrumented application
  3. Analysis of the results

Instrumentation

To instrument your MPI application, load the following module before linking:

$ module load mpitrace/mpi
$ mpixlf90_r -o code.exe code.o
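
For reference, a complete compile-and-link sequence could look like the following sketch (assuming a single source file code.f90, as in the hybrid example below); only the link step requires the module to be loaded:

$ module load mpitrace/mpi
$ mpixlf90_r -c code.f90
$ mpixlf90_r -o code.exe code.o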

To instrument your hybrid MPI/OpenMP application, or to obtain detailed information about the input/output performance, load the module before compiling and add the -qsmp option at both the compilation and linking phases:

$ module load mpitrace/smp
$ mpixlf90_r -c -qsmp=omp code.f90
$ mpixlf90_r -qsmp=omp -o code.exe code.o

Execution

For the execution, we advise setting the environment variables PROFILE_VPROF and PROFILE_JIO:

$ runjob --envs PROFILE_VPROF=yes PROFILE_JIO=yes -n 4096 -p 64 --exe code.exe

By default, the profile of process 0 is collected. After the execution, the following files are generated (see the example listing below):

  • mpi_profile.<PID>.0 : text file giving the MPI performance of process 0 (see below)
  • vmon.out.0 : binary file used by the bfdprof tool (see below)
  • hpm_process_summary.<PID>.0 : text file giving information about the hardware counters
  • hpm_job_summary.<PID>.0 : text file summarising the information about the hardware counters
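
For example, for a run whose executable is code.exe, the working directory could contain files of this form after the execution (hypothetical listing; <PID> stands for the process identifier):

$ ls
code.exe                     mpi_profile.<PID>.0
hpm_job_summary.<PID>.0      vmon.out.0
hpm_process_summary.<PID>.0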

Analysis of the MPI communications of process 0

Here is what we can see in the mpi_profile.0 file:

$ cat mpi_profile.0
elapsed time from clock-cycles using freq = 850.0 MHz
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                     1            0.0          0.000
MPI_Comm_rank                     2            0.0          0.000
MPI_Sendrecv                    600        65536.0          1.087
MPI_Bcast                         1            4.0          0.000
MPI_Reduce                       30            8.0          0.019
MPI_Allreduce                   100            8.0          0.174
-----------------------------------------------------------------
total communication time = 1.280 seconds.
total elapsed time       = 4.318 seconds.

-----------------------------------------------------------------
Message size distributions:

MPI_Sendrecv              #calls    avg. bytes      time(sec)
                             400       32768.0          0.141
                             200      131072.0          0.946

MPI_Bcast                 #calls    avg. bytes      time(sec)
                               1           4.0          0.000

MPI_Reduce                #calls    avg. bytes      time(sec)
                              30           8.0          0.019

MPI_Allreduce             #calls    avg. bytes      time(sec)
                             100           8.0          0.174

-----------------------------------------------------------------
Communication summary for all tasks:

  minimum communication time = 1.280 sec for task 0
  median  communication time = 1.283 sec for task 215
  maximum communication time = 1.284 sec for task 9

taskid  xcoord  ycoord  zcoord  procid   total_comm(sec)   avg_hops
     0       0       0       0       0         1.280   21020696.00
     1       0       0       0       1         1.284   21020696.00
...
...
...
   254       3       3       3       2         1.284   21020696.00
   255       3       3       3       3         1.283   21020696.00

MPI tasks sorted by communication time:
taskid  xcoord  ycoord  zcoord  procid    total_comm(sec)  avg_hops
     0       0       0       0       0         1.280   21020696.00
    16       0       1       0       0         1.281   21020696.00
...
...
...
   253       3       3       3       1         1.284   21020696.00
     9       2       0       0       1         1.284   21020696.00
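
A quick way to get an overview is to compare the total communication time with the total elapsed time: in this example, process 0 spends 1.280 seconds out of 4.318 seconds (about 30%) in MPI communications. These two totals can be extracted directly from the file, for instance:

$ grep "^total" mpi_profile.0
total communication time = 1.280 seconds.
total elapsed time       = 4.318 seconds.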

Profiling of the application

The performance analysis can be done with the bfdprof command, which takes as arguments the name of the executable file and the binary file vmon.out.0 generated during the execution.

$ module load mpitrace
$ bfdprof code.exe vmon.out.0 > cpu_profile.0

You can then read the profile in the text file “cpu_profile.0” which notably contains the sampling per source code line (tics):

tics | source 
   58|          D2F(n1m1,1:n2,1:n3) = (F (n1,1:n2,1:n3)+F(n1m2,1:n2,1:n3))*B &
    5|          &                     + (F(n1m3,1:n2,1:n3)+G1(1,1:n2,1:n3))*C + F(n1m1,1:n2,1:n3)*A
   85|          D2F(n1  ,1:n2,1:n3) = (G1(         1,1:n2,1:n3)+F(n1m1,1:n2,1:n3))*B &
    5|          &                     + (F(n1m2,1:n2,1:n3)+G1(2,1:n2,1:n3))*C + F(n1  ,1:n2,1:n3)*A
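
If the profile is long, a rough way to bring the hottest lines to the top is to sort the file on the tic count (a minimal sketch, assuming the tic count is the first field of each sampled line, as in the extract above):

$ sort -rn cpu_profile.0 | head -n 20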

Documentation

For more information: