Turing: SCALASCA

Description

SCALASCA is a graphical performance analysis tool for parallel applications, developed by JSC (Jülich Supercomputing Centre). It analyses application behaviour and helps to identify the most time-consuming parts. The tool is well suited to the study of massively parallel executions.

Versions installed

  • SCALASCA 1.4.3
  • SCALASCA 2.0
  • SCALASCA 2.1 (default version)
  • SCALASCA 2.2

Attention:

  • Scalasca supports MPI applications and hybrid MPI/OpenMP (multithreaded) applications, up to the level of MPI_THREAD_MULTIPLE for a profile and MPI_THREAD_FUNNELED for a trace.
  • SCALASCA 1.4.3 can no longer be used to instrument an application on Turing since the switch to the V1R2M1 driver on 1 April 2014. Moreover, the files generated with this version cannot be read by SCALASCA 2.x. SCALASCA 1.4.3 therefore remains available to analyse the traces created with this version.

Utilisation

Scalasca is made available through the module command. Before working with Scalasca, execute the following command:

module load scalasca

Using Scalasca consists of three steps:

  1. Application instrumentation
  2. Execution of the instrumented application
  3. Analysis/visualisation of the results

Instrumentation

Scalasca works by modifying your application in order to insert its measurement procedures.

All applications can be instrumented either automatically or manually. We will only discuss the automatic instrumentation of a pure MPI application (without OpenMP) in this document. For the manual procedure, refer to the Scalasca manual.

To instrument your application, add the command skin (or scalasca -instrument) in front of the compiler name, separated from it by a space:

skin mpixlf95_r my_code.f95
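
In a build script, the prefix can be made optional so that the same script produces either a normal or an instrumented build. The following is only a minimal sketch: the variable PREP and the function compile_cmd are hypothetical names chosen for illustration, and the function merely prints the command line instead of running it.

```shell
# PREP is a hypothetical variable: set it to "skin" to instrument,
# leave it empty or unset for a normal build.
compile_cmd() {
    # Print the compile command, prefixed by $PREP when it is non-empty.
    if [ -n "${PREP:-}" ]; then
        echo "$PREP mpixlf95_r $*"
    else
        echo "mpixlf95_r $*"
    fi
}

# Example: PREP=skin compile_cmd my_code.f95
```

Keeping the prefix in a single variable makes it easy to return to an uninstrumented build once the analysis is finished.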

Attention:

  • Only the part of the execution between the calls to MPI_Init and MPI_Finalize is measured. Operations outside this region are not included in the measurement.
  • Using Scalasca incurs additional costs in execution time, memory and disc space.

Execution

To run the instrumented application, add the command scan (or scalasca -analyze) just before runjob in your LoadLeveler script.

By default, only a profile is collected. A profile is a summary of the execution.

To obtain a trace of the events rather than a simple profile, use the -t option. A trace is very useful because it allows SCALASCA to identify various performance problems, which are then highlighted when the results are visualised. Be aware, however, that this option greatly increases the disc space required.
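
The choice between a profile and a trace can be kept in one place in a job script. The following is a minimal sketch under stated assumptions: SCAN_MODE and scan_prefix are hypothetical names, and the function only prints the scan prefix (with or without the -t option described above).

```shell
# SCAN_MODE is a hypothetical variable: "trace" adds the -t option,
# any other value (or unset) keeps the default profile collection.
scan_prefix() {
    if [ "${SCAN_MODE:-profile}" = "trace" ]; then
        echo "scan -t"
    else
        echo "scan"
    fi
}

# Example use in a job script (left unquoted on purpose so that
# "scan -t" is split into two words):
# $(scan_prefix) runjob --np 512 : ./my_appli my_args
```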

Each time the instrumented application is executed, the files are written to a directory whose name is generated as follows:

scorep_APPNAME_RANKSPERNODEpNPROCxNTHR_TYPE

where

  • APPNAME = the name of the executable file
  • RANKSPERNODE = the number of processes per node
  • NPROC = the total number of processes
  • NTHR = the number of threads per process
  • TYPE = sum for a profile and trace for a trace
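
To make the naming scheme concrete, the expected directory name can be assembled from its components. This is only a sketch that applies the template above literally; scorep_dir_name is a hypothetical helper, not a Scalasca command.

```shell
# Hypothetical helper: build the expected results directory name.
# Arguments: executable name, ranks per node, total number of processes,
#            threads per process, type (sum or trace).
scorep_dir_name() {
    printf 'scorep_%s_%sp%sx%s_%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example: scorep_dir_name my_appli 8 512 8 sum
# prints:  scorep_my_appli_8p512x8_sum
```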

Attention: If the directory already exists, the execution fails; this prevents earlier results from being overwritten. Before launching a run, therefore, make sure that the corresponding directory does not already exist.
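
This check can be automated at the start of a job script so that the job stops before requesting any measurement. The following is a minimal sketch; check_results_dir is a hypothetical helper and the directory name in the usage comment is only an illustration.

```shell
# Hypothetical helper: succeed only if the results directory
# given as first argument does not exist yet.
check_results_dir() {
    if [ -e "$1" ]; then
        echo "Results directory $1 already exists; rename or remove it first." >&2
        return 1
    fi
    return 0
}

# Example use in a job script, before the scan line:
# check_results_dir scorep_my_appli_8p512x8_sum || exit 1
```

Failing early is preferable to removing the directory automatically, since the earlier results may still be needed.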

The following is an example of a job submission:

job.ll
# @ job_name = scalasca_run
# @ job_type = BLUEGENE
# Standard output of the job
# @ output = $(job_name).$(jobid)
# Standard error of the job
# @ error = $(output)
# Maximum elapsed time requested
# @ wall_clock_limit = 1:00:00
# Execution block size
# @ bg_size = 64
# @ queue
 
module load scalasca
scan runjob --ranks-per-node 8 --envs "OMP_NUM_THREADS=8" --np 512 : ./my_appli my_args

Analysis/visualisation of results

The results are analysed with the graphical interface square (or scalasca -examine). To launch this interface, type:

module load scalasca
square scalasca_output_directory

The interface is divided into three panels: the left panel presents the different measurements carried out, the middle panel gives the call tree, and the right panel shows the topology.

By expanding or collapsing the entries of the left panel, it is possible to obtain a more or less detailed view of the performance. The choices made in this panel are carried over to the two other panels, which makes it possible to identify the most time-consuming parts of the application. If the execution was carried out in trace mode, SCALASCA can also identify certain sources of performance loss (messages sent in the wrong order, load imbalance, …).

Because the visualisation is done via a graphical application, it is not always practical to use it from the Turing front-end. To get around this problem, it is possible to install the visualisation tool CUBE (downloadable from the official SCALASCA site) on a PC running Linux (or on any UNIX machine).

Documentation