Performance

Timings and Flops have been measured on three parallel machines, whose hardware and software characteristics are given below:

Machines                    IBM SP2                  SGI/CRAY T3E                  Fujitsu VPP300
Number of processors        127                      256                           8
Processor                   IBM P2SC thin            Dec Alpha EV5                 Fujitsu Vector
Processor memory            256 MB                   128 MB                        2048 MB
Memory data cache           1 level cache: 128 KB    2 level caches: 8 KB + 96 KB  64 KB scalar data cache
Processor peak performance  500 MFlops/sec           600 MFlops/sec                2200 MFlops/sec
Interconnection network     Omega Multilevel Switch  Bidirectional 3D Torus        Bidirectional Crossbar
                            90 MB/sec                600 MB/sec                    615 MB/sec
Operating system            AIX 4.2                  Unicos/mk 2.0                 UXP/V V10L20
Compiler                    xlf (version 4.1)        f90 (version 3.0)             frt (version L97121)
Optimization options        -O3                      -O3,unroll2,pipeline3         -Of -Wv,-Of

Table 1: Hardware and software characteristics


Our test case is the 2D Laplace problem. We used the NAG FFT to solve the local problem. The table below shows how the elapsed execution time is distributed across the different PMD routines, together with the number of floating point operations per second per processor on each parallel machine:

Machines                      IBM SP2     SGI/CRAY T3E-600    Fujitsu VPP300
Build Schur matrix (sec.)     2979.6      1340.               63.
Factor Schur matrix (sec.)    74.7        183.7               10.
Solve (sec.)                  3.6         1.8                 0.1
Total elapsed time (sec.)     3125.       1525.6              74.
Communication time (sec.)     15.5        0.6                 0.2
Total MFlops/sec./processor   22.         42.                 859.

Table 2: Performance of the 2D Laplace problem on 4 processors. Local mesh size: Nx=801, Ny=601
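
To make the figures in Table 2 easier to compare, the short Python sketch below derives three ratios per machine: the share of elapsed time spent building the Schur matrix, the share spent in communication, and the achieved fraction of the peak rate listed in Table 1. The numbers are copied from the two tables; the dictionary layout and the names `machines`, `summarize` and `summary` are illustrative choices and are not part of PMD.

```python
# Values copied from Table 1 (peak MFlops/sec) and Table 2 (timings in seconds,
# achieved MFlops/sec per processor). The data structure itself is an assumption
# made for this sketch.
machines = {
    "IBM SP2": {
        "peak": 500.0, "build": 2979.6, "factor": 74.7, "solve": 3.6,
        "total": 3125.0, "comm": 15.5, "mflops": 22.0,
    },
    "SGI/CRAY T3E-600": {
        "peak": 600.0, "build": 1340.0, "factor": 183.7, "solve": 1.8,
        "total": 1525.6, "comm": 0.6, "mflops": 42.0,
    },
    "Fujitsu VPP300": {
        "peak": 2200.0, "build": 63.0, "factor": 10.0, "solve": 0.1,
        "total": 74.0, "comm": 0.2, "mflops": 859.0,
    },
}


def summarize(m):
    """Return time shares and the achieved fraction of peak for one machine."""
    return {
        "build_share": m["build"] / m["total"],    # Schur matrix build / total time
        "comm_share": m["comm"] / m["total"],      # communication / total time
        "peak_fraction": m["mflops"] / m["peak"],  # achieved rate / peak rate
    }


summary = {name: summarize(m) for name, m in machines.items()}

for name, s in summary.items():
    print(
        f"{name:18s} build {s['build_share']:6.1%}  "
        f"comm {s['comm_share']:6.2%}  of peak {s['peak_fraction']:6.1%}"
    )
```

With these figures, building the Schur matrix accounts for roughly 85 to 95 percent of the elapsed time on all three machines, and the achieved rate is about 4 percent of peak on the IBM SP2, 7 percent on the T3E and 39 percent on the VPP300.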


The scalability can be evaluated by assuming a fixed global mesh size for any number of subdomains (or processes). The curves below show the elapsed execution time versus the number of processors:

[PMD Scalability at fixed global mesh]

Especially noteworthy is that almost 64 RISC processors are needed to solve the problem as fast as 4 vector processors.

The curves below show timings at fixed local mesh size.

[PMD Scalability at fixed local mesh size]


In this situation, mono-domain solvers usually give timings which grow as N x f(N), whereas the timings here grow as c x N, where the slope "c" remains constant for any global mesh size N. For further details please read this paper.
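
The difference between the two growth laws can be checked numerically. The Python sketch below estimates the slope c by least squares for the model t = c x N and looks at the ratio t/N point by point: the ratio stays flat for linear growth and increases for N x f(N) growth. The timings are synthetic, generated from the two models rather than measured, and the choice f(N) = log2(N) as well as the constants and function names are assumptions made purely for illustration.

```python
import math


def fit_slope(sizes, timings):
    """Least-squares slope c for the model t = c * N (line through the origin)."""
    return sum(n * t for n, t in zip(sizes, timings)) / sum(n * n for n in sizes)


def per_point_slopes(sizes, timings):
    """Ratio t / N at each point; constant if and only if t grows linearly in N."""
    return [t / n for n, t in zip(sizes, timings)]


# Synthetic global mesh sizes and timings (NOT measurements from this page).
sizes = [100_000, 200_000, 400_000, 800_000]
linear = [2.0e-4 * n for n in sizes]                      # model t = c * N
superlinear = [1.0e-5 * n * math.log2(n) for n in sizes]  # model t = N * f(N), f assumed

c = fit_slope(sizes, linear)
ratios_linear = per_point_slopes(sizes, linear)
ratios_super = per_point_slopes(sizes, superlinear)

print(f"fitted slope c = {c:.3e}")
print("t/N, linear model:     ", [f"{r:.3e}" for r in ratios_linear])
print("t/N, N x f(N) model:   ", [f"{r:.3e}" for r in ratios_super])
```

Applied to measured timings, a flat t/N column would support the constant slope described above, while a rising column would indicate the N x f(N) behaviour.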