Benchmark for the Gromacs 6vxx case

Benchmark description

System

The system consists of a protein in a box of water:

  • Number of atoms:
    • Total = 932,310
    • Solvent = 887,139
    • Protein = 45,156
    • Counter ions = 15
  • Box:
    • cubic, edge length 21.14 nm

Simulation

The simulation is run with Gromacs 2020.1 and has the following characteristics (a sketch of the corresponding .mdp settings is given after the list):

  • Type of simulation: molecular dynamics
  • Ensemble: NVT
  • Number of steps: 500,000
  • Timestep: 2 fs
  • Long-range electrostatics: Particle Mesh Ewald (PME)
  • PME (Coulomb) cutoff: 1.2 nm
  • van der Waals cutoff: 1.1 nm
  • cutoff-scheme: Verlet
  • Temperature coupling: V-rescale
  • Constraints: LINCS on H-bonds
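
For reference, the settings above map onto GROMACS .mdp parameters roughly as in the sketch below. This is a reconstruction from the list, not the actual input file used for the benchmark; thermostat groups, reference temperature and other details are omitted.

cat > 6vxx_nvt_sketch.mdp << 'EOF'
integrator            = md         ; molecular dynamics
nsteps                = 500000     ; 500,000 steps
dt                    = 0.002      ; 2 fs timestep
cutoff-scheme         = Verlet
coulombtype           = PME        ; Particle Mesh Ewald
rcoulomb              = 1.2        ; nm
rvdw                  = 1.1        ; nm
tcoupl                = V-rescale  ; NVT: temperature coupling only
constraints           = h-bonds
constraint-algorithm  = LINCS
EOF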

Benchmark

The benchmark is performed using JUBE.

Gromacs exposes a rather large number of tunable parameters when benchmarking, so it is plausible that a more finely tuned configuration would give better results. We tried to keep the settings as general as possible.

A first benchmark was run for a small number of steps (50,000); the best configurations were then re-run for 500,000 steps to reduce the impact of the slower initial steps.
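
For orientation, a typical JUBE workflow for such a benchmark is sketched below. The specification file name and the output directory are placeholders, not the actual names used here.

# Launch the benchmark defined in the JUBE specification (file name is hypothetical)
jube run gromacs_6vxx.xml
# Once the jobs have finished, collect and display the result table
# (assuming the specification uses "bench_run" as its output directory)
jube analyse bench_run
jube result bench_run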

CPU benchmark

The number of PME/PP tasks is left at its default value. Please note that no large imbalance occurred during these benchmarks; this is something you may want to check when running your own tests, for example as sketched below.
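
One simple way to check this is to look at the load-balancing summary that mdrun writes at the end of its log file. The snippet below is a minimal sketch assuming the log is named 6vxx_nvt.log (the -deffnm value used here); which lines appear depends on the run setup.

# Domain-decomposition load balance (present when several PP ranks are used)
grep -i "load imbalance" 6vxx_nvt.log
# PME mesh vs. force load, i.e. the pme/F indicator reported below
grep -i "pme mesh/force" 6vxx_nvt.log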

Here is an example of the mdrun command for the thread-MPI version of Gromacs:

# 40 thread-MPI ranks with 1 OpenMP thread each (40 physical cores);
# timers are reset at step 300,000 so that only the last 200,000 steps
# are used for the performance measurement.
gmx mdrun -ntmpi 40 -ntomp 1 \
          -dlb yes -update auto -bonded auto \
          -nb auto -pme auto -pmefft auto \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
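
For completeness, an equivalent launch with an MPI-enabled build (gmx_mpi) under Slurm could look like the sketch below; the binary name, the use of srun and the Slurm options are assumptions and are not taken from the benchmark scripts.

srun --ntasks=40 --cpus-per-task=1 \
     gmx_mpi mdrun -ntomp 1 \
                   -dlb yes -update auto -bonded auto \
                   -nb auto -pme auto -pmefft auto \
                   -deffnm 6vxx_nvt -v \
                   -nsteps 500000 -resetstep 300000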

GPU benchmark

The GPU benchmark was run to find the best configuration for the following fractions of a node (a possible resource request for each case is sketched after the list):

  • 1 quarter of a node (1 GPU, 10 physical cores)
  • 1 half node (2 GPUs, 20 physical cores)
  • 1 node (4 GPUs, 40 physical cores)
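
As an illustration only, the requests below sketch how these three cases might be allocated on a Slurm cluster whose nodes have 4 GPUs and 40 physical cores; the scheduler options are assumptions and not taken from the benchmark scripts.

# Quarter node: 1 GPU, 10 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=10 --gres=gpu:1
# Half node: 2 GPUs, 20 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=20 --gres=gpu:2
# Full node: 4 GPUs, 40 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=40 --gres=gpu:4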

The idea was to offload as much of the computation as possible to the GPU, so the following choices were made:

  • Offload PME: the number of PME tasks has to be set to 1 (-npme 1), since PME on GPU does not support domain decomposition.
  • For the thread-MPI version, the newest developments available in the 2020 version are also offloaded to the GPU (please read the GROMACS documentation on these features before doing a production run):
    • Coordinate update and constraints (-update gpu)
    • Bonded interactions (-bonded gpu)

The thread-MPI version is limited to a single node, i.e. a maximum of 4 GPUs here. A large imbalance might therefore occur, but this was still the best choice for performance.

Here is an example of the mdrun command for the GPU thread-MPI version of Gromacs:

# Development features of Gromacs 2020 enabling direct GPU communication:
export GMX_GPU_PME_PP_COMMS=true       # direct PME-PP communication on the GPU
export GMX_FORCE_UPDATE_DEFAULT_GPU=1  # run the update (and constraints) on the GPU by default
export GMX_GPU_DD_COMMS=true           # GPU halo exchange for domain decomposition

# 5 thread-MPI ranks (1 dedicated to PME) with 4 OpenMP threads each on 1 GPU
gmx mdrun -ntmpi 5 -npme 1 -ntomp 4 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
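
For the full node (4 GPUs, 40 physical cores), the best configuration found below uses 4 thread-MPI ranks with 10 OpenMP threads each. The corresponding command, reconstructed from the results table with the same offload settings (a sketch, not the exact script used), would be:

gmx mdrun -ntmpi 4 -npme 1 -ntomp 10 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000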

Results

The performance is measured over the last 200,000 steps (timers are reset at step 300,000 with -resetstep).
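
The ns/day figure of a run can be read from the end of the mdrun log. Assuming the default log name 6vxx_nvt.log (based on the -deffnm value used above), for example:

# Final throughput in ns/day (and walltime per ns)
grep "Performance:" 6vxx_nvt.log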

Benchmark of best results (500,000 steps)

A few indicators, such as the load imbalance and the PME mesh/force (pme/F) ratio, are also reported.

The multithread column indicates whether hyperthreading was used during the computation.

nodes  ntasks  nthreads  multithread    ngpus  time (s)  perf (ns/day)  imbalance  pme/F
1      40      1         nomultithread  0      7082.424  4.88           0.6        0.88
1      5       4         multithread    1      2344.079  14.744         0.8        0.03
1      4       5         nomultithread  2      1478.937  23.368         2.2        0.02
1      4       10        nomultithread  4      851.949   40.566         0.2        0.03

Table of speedups relative to the CPU-only run (e.g. 40.566 / 4.88 ≈ 8.3):

Number of GPUs  perf (ns/day)  Speedup
0               4.88           1.0
1               14.744         3.0
2               23.368         4.8
4               40.566         8.3

First benchmark (50,000 steps)

A performance of -1.0 means that the run did not finish, most of the time because of a segmentation fault in the program.

Some indicators, such as the load imbalance and the pme/F ratio, are shown where they were available.

The multithread column indicates if hyperthreading was used during the computation.

nodes  ntasks  nthreads  multithread    ngpus  time (s)  perf (ns/day)  imbalance  pme/F
1      5       8         nomultithread  0      489.09    3.533          0.5
1      10      4         nomultithread  0      469.594   3.68           0.8
1      8       5         nomultithread  0      437.289   3.952          0.4
1      20      2         nomultithread  0      381.868   4.526          0.7        1.02
1      40      1         nomultithread  0      365.068   4.734          0.7        0.87
1      40      1         nomultithread  4                -1.0
1      40      2         multithread    4                -1.0
1      32      1         nomultithread  4                -1.0
1      32      2         multithread    4                -1.0
1      20      2         nomultithread  4                -1.0
1      20      4         multithread    4                -1.0
1      16      2         nomultithread  4                -1.0                      0.02
1      16      4         multithread    4                -1.0                      0.02
1      12      3         nomultithread  4                -1.0                      0.06
1      12      6         multithread    4                -1.0                      0.06
1      8       4         nomultithread  4                -1.0                      0.05
1      8       8         multithread    4                -1.0                      0.05
1      10      1         nomultithread  1                -1.0                      0.03
1      10      2         multithread    1                -1.0                      0.04
1      20      1         nomultithread  2                -1.0
1      20      2         multithread    2                -1.0
1      10      2         nomultithread  2                -1.0                      0.04
1      10      4         multithread    2                -1.0                      0.04
1      2       5         nomultithread  1      132.988   12.995
1      4       2         nomultithread  1      128.825   13.415         0.7        0.03
1      2       10        multithread    1      125.646   13.754
1      5       2         nomultithread  1      124.236   13.91          0.5        0.03
1      4       4         multithread    1      120.234   14.373         0.3        0.03
1      5       4         multithread    1      117.625   14.692         0.5        0.02
1      2       10        nomultithread  2      107.217   16.118
1      2       20        multithread    2      103.69    16.667
1      4       5         nomultithread  2      75.724    22.822         0.5        0.02
1      4       10        multithread    2      73.035    23.662         3.4        0.03
1      4       10        nomultithread  4      44.863    38.521         0.2        0.03