## Overview

PMD is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

- Download the latest PMD version.
- Before installing PMD, check for MPI (a manufacturer, LAM, or MPICH implementation), MPIBLACS, BLAS, LAPACK, and ScaLAPACK libraries on your system.
- `zcat pmd-(version).tar.gz | tar -xvf -`
- `cd PMD/MakeDep`
- Configure the `Make.inc` file.
- `cd .. ; make` to compile the PMD modules.
- `make examples` to compile and link all the examples.
- Run the examples.

PMD is a parallel Fortran 90 module for solving positive-definite, linear, second-order elliptic operator systems. To access the PMD subroutines, a `USE PMD` statement must be included in all scoping units. The pseudocode example hereafter shows the general calling sequence to the PMD subroutines for solving the whole problem, assuming the local operator matrix and the physical boundary conditions are given:

```fortran
PROGRAM myprog
  USE PMD
  ...
END PROGRAM myprog
```
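The original calling sequence was not preserved in this page, so the sketch below is only illustrative: every `pmd_*` subroutine name is a hypothetical placeholder, not PMD's documented API. The three middle phases mirror the timed steps reported in the benchmark table (building, factoring, and solving the Schur system):

```fortran
PROGRAM myprog
  USE PMD            ! required in every scoping unit that calls PMD
  IMPLICIT NONE
  ! NOTE: all pmd_* names below are hypothetical placeholders,
  ! standing in for PMD's actual entry points.
  CALL pmd_init()          ! set up MPI and the 1D subdomain topology
  CALL pmd_build_schur()   ! build the Schur matrix (influence matrix technique)
  CALL pmd_factor_schur()  ! factor it (direct) or prepare the iterative solver
  CALL pmd_solve()         ! solve the interface problem, then the local problems
  CALL pmd_finalize()
END PROGRAM myprog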

PMD implements a non-overlapping domain decomposition based on the dual Schur complement. The Schur matrix is built using a parallel influence matrix technique. At this time, PMD implements a 1D domain decomposition for 1D and 2D problems. The interface problem can be solved with either a direct or an iterative solver. PMD is built on top of MPI, which handles data exchange between the subdomains. Full source code examples using PMD are presented here.
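For reference, the interface problem of a non-overlapping Schur-complement method takes the following standard algebraic form (a generic sketch; PMD's dual formulation and influence-matrix construction are not spelled out on this page). Partitioning the unknowns into interior (I) and interface (Γ) blocks:

```latex
% Block system with interior (I) and interface (\Gamma) unknowns
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} u_I \\ u_\Gamma \end{pmatrix}
=
\begin{pmatrix} f_I \\ f_\Gamma \end{pmatrix}
% Eliminating u_I yields the interface (Schur) problem solved across subdomains:
S \, u_\Gamma = g,
\qquad S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma},
\qquad g = f_\Gamma - A_{\Gamma I} A_{II}^{-1} f_I .
```

Once the interface values u_Γ are known, each subdomain recovers its interior unknowns independently by solving A_II u_I = f_I − A_IΓ u_Γ, which is where the fast local solver (here, an FFT) is used.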

Timings and Flops have been measured on three parallel machines whose
hardware and software characteristics are given below:

| Machines | IBM SP2 | SGI/CRAY T3E | Fujitsu VPP300 |
|---|---|---|---|
| Processors | 127 | 256 | 8 |
| Processor | IBM P2SC thin | DEC Alpha EV5 | Fujitsu Vector |
| Processor memory | 256 MB | 128 MB | 2048 MB |
| Memory data cache | 1-level cache: 128 KB | 2-level caches: 8 KB + 96 KB | 64 KB scalar data cache |
| Processor peak performance | 500 MFlops | 600 MFlops | 2200 MFlops |
| Interconnection network | Omega multilevel switch | Bidirectional 3D torus | Bidirectional crossbar |
| Network bandwidth | 90 MB/sec | 600 MB/sec | 615 MB/sec |
| Operating system | AIX 4.2 | Unicos/mk 2.0 | UXP/V V10L20 |
| Compiler | xlf (version 4.1) | f90 (version 3.0) | frt (version L97121) |
| Optimization options | -O3 | -O3,unroll2,pipeline3 | -Of -Wv,-Of |

Our test case was the 2D Laplace problem. We used the NAG FFT to solve the local problems. The table below shows how the elapsed execution time is distributed across the different PMD routines, and the number of floating-point operations per second per processor, on each parallel machine:

| Machines | IBM SP2 | SGI/CRAY T3E-600 | Fujitsu VPP300 |
|---|---|---|---|
| Build Schur matrix (sec.) | 2979.6 | 1340. | 63. |
| Factor Schur matrix (sec.) | 74.7 | 183.7 | 10. |
| Solve (sec.) | 3.6 | 1.8 | 0.1 |
| Total elapsed time (sec.) | 3125. | 1525.6 | 74. |
| Communication time (sec.) | 15.5 | 0.6 | 0.2 |
| Total MFlops/processor | 22. | 42. | 859. |

The scalability can be evaluated by fixing the global mesh size while varying the number of subdomains (or processes). The curves below show the elapsed execution time versus the number of processors:

Especially noteworthy is that almost 64 RISC processors are needed to solve the problem as fast as 4 vector processors.

The curves below show timings at fixed local mesh size.

In such a situation, mono-domain solvers usually provide timings which evolve as N x f(N), to be compared with the timings here, which evolve as c x N, where the slope "c" remains constant for any global mesh size N. For further details please read these papers.

- The computational domain must be simply connected.
- Along the domain decomposition axis, physical boundary conditions are assumed to be non-periodic.
- The current version of `PMD` supports 1D domain decomposition only, which means that subdomain interfaces cannot cross.
- The interface normal vectors must be parallel to the domain decomposition axis.
- It is up to the user to compute the first normal derivatives at the interfaces along the domain decomposition axis.
- When using the direct parallel solvers, which are based on the `ScaLAPACK` library, the `BLACS` process grid must be square. This means that if Np denotes the process grid dimension, then the size of the PMD process group must equal Np**2. This restriction does not apply to the parallel iterative solvers (the PCG and Bi-CGstab algorithms).
- The parallel PCG and Bi-CGstab methods have an inconvenient property: the number of iterations to convergence increases as the number of subdomains grows.
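As an illustration of the square-grid restriction for the direct (`ScaLAPACK`-based) solvers, the sketch below checks at startup that the number of MPI processes is a perfect square, i.e. equal to Np**2 for some grid dimension Np. This is a generic MPI check written for this page, not part of PMD's API:

```fortran
program check_square_grid
  use mpi
  implicit none
  integer :: ierr, nprocs, np

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! The direct ScaLAPACK-based solvers need a square Np x Np BLACS grid,
  ! so the size of the PMD process group must equal Np**2.
  np = nint(sqrt(real(nprocs)))
  if (np * np /= nprocs) then
     print *, 'Error: run with a square number of processes (got', nprocs, ')'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  call MPI_Finalize(ierr)
end program check_square_grid
```

With the iterative solvers (PCG, Bi-CGstab) this check is unnecessary, since any process count is accepted.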