Turing : TAU

Description

TAU (Tuning and Analysis Utilities) is a graphical performance analysis tool for parallel applications. It was developed by the University of Oregon, LANL (Los Alamos National Laboratory) and the JSC (Jülich Supercomputing Centre). With this tool, the behaviour and performance of an application can be analysed and its critical parts easily identified.

Versions installed

  • TAU 2.21.4 (default version)

Utilisation

The module command gives access to TAU.

Before working with TAU, you must execute the following command:

module load tau

The utilisation of TAU takes place in three steps:

  1. Instrumentation of the application
  2. Execution of the instrumented application
  3. Analysis/visualisation of the results

Automatic instrumentation

TAU functions by modifying your application in order to insert its own measurement procedures. Each application can be instrumented either automatically (selectively or not) or manually. We will only address automatic instrumentation in this document. For the manual procedure, refer to the TAU manuals.

To instrument your application, you just need to replace the compiler name with tau_f90.sh for a Fortran90 code, tau_cc.sh for a C code or tau_cxx.sh for a C++ code:

tau_f90.sh my_code.f90
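
For a code split into several source files, the compile and link steps are wrapped in the same way, since the tau_*.sh scripts pass the usual compiler options through to the underlying compiler. A minimal sketch (the file and executable names are illustrative):

module load tau
tau_f90.sh -c my_code.f90
tau_f90.sh -o my_appli my_code.o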

The functionalities available in the instrumented application are chosen via the environment variable TAU_MAKEFILE. On Turing, it is set by default to $TAU_MAKEFILE_DIR/Makefile.tau-bgqtimers-papi-mpi-compensate-pdt (adapted to pure MPI applications). The available functionalities are as follows (an example of selecting one of them is given after this list):

  • Makefile.tau-bgqtimers-papi-mpi-compensate-pdt : for pure MPI applications
  • Makefile.tau-bgqtimers-papi-mpi-compensate-pdt-openmp : for hybrid MPI/OpenMP applications
  • Makefile.tau-bgqtimers-papi-mpi-compensate-pdt-openmp-opari : for hybrid MPI/OpenMP applications with OpenMP instrumentation directives
  • Makefile.tau-memory-bgqtimers-papi-mpi-compensate-pdt : for pure MPI applications with measurement of the heap memory (dynamically allocated memory) at each function/subroutine call (attention to overhead)
  • Makefile.tau-memory-bgqtimers-papi-mpi-compensate-pdt-openmp : for hybrid MPI/OpenMP applications with measurement of the heap memory (dynamically allocated memory) at each function/subroutine call (attention to overhead)
  • Makefile.tau-memory-bgqtimers-papi-mpi-compensate-pdt-openmp-opari : for hybrid MPI/OpenMP applications with measurement of the heap memory (dynamically allocated memory) at each function/subroutine call (attention to overhead) and with OpenMP instrumentation directives
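
For example, to instrument a hybrid MPI/OpenMP code, the corresponding makefile can be selected before compiling. A minimal sketch (the source file name my_hybrid_code.f90 is illustrative):

export TAU_MAKEFILE=$TAU_MAKEFILE_DIR/Makefile.tau-bgqtimers-papi-mpi-compensate-pdt-openmp
tau_f90.sh my_hybrid_code.f90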

Attention : Using TAU introduces overheads in execution time (not negligible, around 10%, but quite variable from one application to another and dependent on the measurements made), in memory occupation (fairly low impact) and in disk space (at least one file is written per core used).

Selective instrumentation (semi-automatic)

It is possible to select and instrument only the parts of the code which are of interest. Furthermore, this approach allows the instrumentation of certain loops (rather than whole functions/routines) as well as the monitoring of memory allocations/de-allocations and input/output calls of an application.

To do this, you must first (before the compilation) set the environment variable TAU_OPTIONS to -optVerbose -optSelectTauFile=select.tau, where select.tau is the name of the file which stores the instrumentation information for TAU.

export TAU_OPTIONS="-optVerbose -optSelectTauFile=select.tau"

Here is an example of a selective instrumentation file:

BEGIN_EXCLUDE_LIST
# Do not instrument the routines/functions containing INIT in their names
"#INIT#"
END_EXCLUDE_LIST

BEGIN_FILE_EXCLUDE_LIST
# Do not instrument the following files
"not_to_instrument_1.f90"
"not_to_instrument_2.f90"
END_FILE_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
# Instrument the loops of the routines whose names begin with BIGLOOP in the loop_test.f90 file
loops file="loop_test.f90" routine="BIGLOOP#"
# Instrument the I/Os in the OUTPUT routine
io routine="OUTPUT"
# Instrument the allocations/de-allocations in the alloc.f90 file
memory file="alloc.f90"
END_INSTRUMENT_SECTION

The different sections are not obligatory and can be empty. At the beginning of a line, the # symbol marks the start of a comment; inside a routine name, it acts as a wildcard equivalent to * (in TAU, use # and not * in routine names). In Fortran, subroutine names must be written in upper case to be recognised by TAU. More details are available on the TAU website.
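
Putting these steps together, a compilation with selective instrumentation could look like the following sketch (the makefile choice and the file names are illustrative):

export TAU_MAKEFILE=$TAU_MAKEFILE_DIR/Makefile.tau-bgqtimers-papi-mpi-compensate-pdt
export TAU_OPTIONS="-optVerbose -optSelectTauFile=select.tau"
tau_f90.sh -o my_appli my_code.f90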

Execution

Execution is done in the standard way, using the instrumented application. Its behaviour and the measurements made can be controlled via environment variables. The principal variables are:

  • TAU_CALLPATH and TAU_CALLPATH_DEPTH : To establish the call graph (which subroutine called which other subroutine and how much time was spent there). To activate this functionality, set TAU_CALLPATH=1 and set TAU_CALLPATH_DEPTH to the desired depth (2 by default). The higher the value, the higher the overhead.
  • TAU_COMM_MATRIX : Determines the MPI communication matrix of the application (which processes communicate with each other). Activate it by setting it to 1 (0 by default).
  • TAU_METRICS : Selects the hardware performance counters measured via the PAPI library. The list of available counters can be obtained with the papi_avail command (available after loading the papi module). Attention, not all counter combinations are possible (the papi_event_chooser command allows you to check which ones are). Counter names must be separated by the “:” character. It is advised to always include the BGQTIMERS counter so that the elapsed time is also measured.
  • TAU_THROTTLE_NUMCALLS and TAU_THROTTLE_PERCALL : Deactivate the measurements for short and frequently repeated calls in order to reduce the overhead. By default, TAU_THROTTLE_NUMCALLS=100000 and TAU_THROTTLE_PERCALL=10, which deactivates the measurements for calls made at least 100,000 times and lasting less than 10 microseconds each. To never deactivate the measurements, set TAU_THROTTLE_NUMCALLS to 0.
  • PROFILEDIR : Allows changing the directory in which the measurement files are written (the current working directory is used by default).
  • TAU_VERBOSE : Activates the TAU verbose mode (set it to 1 to activate it, 0 by default). Attention, each process will write to the standard output.
  • TAU_TRACE : Enables tracing (temporal evolution of the application) instead of profiling (set it to 1 to activate it, 0 by default). TAU_TRACEDIR allows the trace output to be stored in the directory of your choice. Attention, the traces cannot be visualised with TAU itself; Jumpshot (to be installed on your computer) is one possible visualisation tool (see the TAU documentation).
  • TAU_COMPENSATE : TAU attempts to subtract the costs associated with its own measurements. This option is active by default. It can be deactivated by setting TAU_COMPENSATE to 0.

The following is an example of a job submission:

job.ll
# @ job_name = tau_run
# @ job_type = BLUEGENE
# Standard job output file
# @ output = $(job_name).$(jobid)
# Standard job error file
# @ error = $(output)
# Maximum elapsed time request
# @ wall_clock_limit = 1:00:00
# Execution block size
# @ bg_size = 64
# @ queue
 
runjob --envs "TAU_METRICS=BGQTIMERS" --ranks-per-node 16 --np 1024 : ./my_appli
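
Several TAU variables can be passed to the application at the same time via the --envs options of runjob. The following is only a sketch: the executable name is illustrative and the PAPI_FP_OPS counter should first be checked with papi_avail / papi_event_chooser.

runjob --envs "TAU_METRICS=BGQTIMERS:PAPI_FP_OPS" \
       --envs "TAU_CALLPATH=1" --envs "TAU_CALLPATH_DEPTH=4" \
       --envs "TAU_COMM_MATRIX=1" \
       --ranks-per-node 16 --np 1024 : ./my_appli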

Analysis/visualisation of the results

Analysis of the results is done by using the graphical interface paraprof. To launch it, enter:

module load tau
paraprof

As the instrumentation generates one (or several) files per core, it is advised to first generate a packed version of the results with the command:

paraprof --pack output.ppk

The measurement files (profile.*) can then be deleted and the results visualised with the command:

paraprof output.ppk


As the visualisation is done through a graphical application, it is not always practical to use it directly from the Turing front end. To get around this problem, it is possible to install the visualisation tool paraprof on a Linux PC (or any other UNIX machine).

The pprof command can also be used to obtain a summary of the measurements.
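
For example, a text summary can be saved by running pprof in the directory containing the profile.* files (before deleting them). A minimal sketch:

module load tau
pprof > summary.txt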

Documentation