Benchmark for the Gromacs 6vxx case

Benchmark description

System

The system consists of a protein in a box of water:

  • Number of atoms:
    • Total = 932,310
    • Solvent = 887,139
    • Protein = 45,156
    • Counter ions = 15
  • Box:
    • cubic, edge length 21.14 nm

Simulation

The simulation is run with Gromacs 2020.1 and has the following characteristics (a sketch of the corresponding .mdp settings is given after the list):

  • Type of simulation: molecular dynamics
  • Ensemble: NVT
  • Number of steps: 500,000
  • Timestep: 2 fs
  • Long-range electrostatics: Particle Mesh Ewald (PME)
  • PME (Coulomb) cutoff: 1.2 nm
  • van der Waals cutoff: 1.1 nm
  • cutoff-scheme: Verlet
  • Temperature coupling: V-rescale
  • Constraints: LINCS on H-bonds
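
For reference, the settings above map onto GROMACS .mdp parameters roughly as in the sketch below. This is a reconstruction from the list, not the actual input file used for the benchmark; thermostat groups, reference temperature and other details are omitted.

cat > 6vxx_nvt_sketch.mdp << 'EOF'
integrator            = md         ; molecular dynamics
nsteps                = 500000     ; 500,000 steps
dt                    = 0.002      ; 2 fs timestep
cutoff-scheme         = Verlet
coulombtype           = PME        ; Particle Mesh Ewald
rcoulomb              = 1.2        ; nm
rvdw                  = 1.1        ; nm
tcoupl                = V-rescale  ; NVT: temperature coupling only
constraints           = h-bonds
constraint-algorithm  = LINCS
EOF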

Benchmark

The benchmark is performed using JUBE.

Gromacs exposes a rather large number of tunable parameters when benchmarking, so it is plausible that a more finely tuned configuration would give better results. We tried to keep the settings as general as possible.

A first benchmark was run for a small number of steps (50,000); the best configurations were then re-run for 500,000 steps to reduce the impact of the slower initial steps.
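
For orientation, a typical JUBE workflow for such a benchmark is sketched below. The specification file name and the output directory are placeholders, not the actual names used here.

# Launch the benchmark defined in the JUBE specification (file name is hypothetical)
jube run gromacs_6vxx.xml
# Once the jobs have finished, collect and display the result table
# (assuming the specification uses "bench_run" as its output directory)
jube analyse bench_run
jube result bench_run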

CPU benchmark

The number of PME/PP tasks is left at its default value. Please note that no large imbalance occurred during these benchmarks; this is something you may want to check when running your own tests, for example as sketched below.
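
One simple way to check this is to look at the load-balancing summary that mdrun writes at the end of its log file. The snippet below is a minimal sketch assuming the log is named 6vxx_nvt.log (the -deffnm value used here); which lines appear depends on the run setup.

# Domain-decomposition load balance (present when several PP ranks are used)
grep -i "load imbalance" 6vxx_nvt.log
# PME mesh vs. force load, i.e. the pme/F indicator reported below
grep -i "pme mesh/force" 6vxx_nvt.log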

Here is an example of the mdrun command for the thread-MPI version of Gromacs:

# 40 thread-MPI ranks with 1 OpenMP thread each (40 physical cores);
# timers are reset at step 300,000 so that only the last 200,000 steps
# are used for the performance measurement.
gmx mdrun -ntmpi 40 -ntomp 1 \
          -dlb yes -update auto -bonded auto \
          -nb auto -pme auto -pmefft auto \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
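
For completeness, an equivalent launch with an MPI-enabled build (gmx_mpi) under Slurm could look like the sketch below; the binary name, the use of srun and the Slurm options are assumptions and are not taken from the benchmark scripts.

srun --ntasks=40 --cpus-per-task=1 \
     gmx_mpi mdrun -ntomp 1 \
                   -dlb yes -update auto -bonded auto \
                   -nb auto -pme auto -pmefft auto \
                   -deffnm 6vxx_nvt -v \
                   -nsteps 500000 -resetstep 300000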

GPU benchmark

The GPU benchmark was run to find the best configuration for the following fractions of a node (a possible resource request for each case is sketched after the list):

  • 1 quarter of a node (1 GPU, 10 physical cores)
  • 1 half node (2 GPUs, 20 physical cores)
  • 1 node (4 GPUs, 40 physical cores)
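
As an illustration only, the requests below sketch how these three cases might be allocated on a Slurm cluster whose nodes have 4 GPUs and 40 physical cores; the scheduler options are assumptions and not taken from the benchmark scripts.

# Quarter node: 1 GPU, 10 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=10 --gres=gpu:1
# Half node: 2 GPUs, 20 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=20 --gres=gpu:2
# Full node: 4 GPUs, 40 physical cores
salloc --nodes=1 --ntasks=1 --cpus-per-task=40 --gres=gpu:4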

The idea was to offload as much of the computation as possible to the GPU, so the following choices were made:

  • Offload PME: the number of PME tasks has to be set to 1 (-npme 1), since PME on GPU does not support domain decomposition.
  • For the thread-MPI version, the newest developments available in the 2020 version are also offloaded to the GPU (please read the GROMACS documentation on these features before doing a production run):
    • Coordinate update and constraints (-update gpu)
    • Bonded interactions (-bonded gpu)

The thread-MPI version is limited to a single node, i.e. a maximum of 4 GPUs here. A large imbalance might therefore occur, but this was still the best choice for performance.

Here is an example of the mdrun command for the GPU thread-MPI version of Gromacs:

# Development features of Gromacs 2020 enabling direct GPU communication:
export GMX_GPU_PME_PP_COMMS=true       # direct PME-PP communication on the GPU
export GMX_FORCE_UPDATE_DEFAULT_GPU=1  # run the update (and constraints) on the GPU by default
export GMX_GPU_DD_COMMS=true           # GPU halo exchange for domain decomposition

# 5 thread-MPI ranks (1 dedicated to PME) with 4 OpenMP threads each on 1 GPU
gmx mdrun -ntmpi 5 -npme 1 -ntomp 4 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
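
For the full node (4 GPUs, 40 physical cores), the best configuration found below uses 4 thread-MPI ranks with 10 OpenMP threads each. The corresponding command, reconstructed from the results table with the same offload settings (a sketch, not the exact script used), would be:

gmx mdrun -ntmpi 4 -npme 1 -ntomp 10 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000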

Results

The performance is measured over the last 200,000 steps (timers are reset at step 300,000 with -resetstep).
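
The ns/day figure of a run can be read from the end of the mdrun log. Assuming the default log name 6vxx_nvt.log (based on the -deffnm value used above), for example:

# Final throughput in ns/day (and walltime per ns)
grep "Performance:" 6vxx_nvt.log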

Benchmark of best results (500,000 steps)

A few indicators, such as the load imbalance and the PME mesh/force (pme/F) ratio, are also reported.

The multithread column indicates whether hyperthreading was used during the computation.

nodes  ntasks  nthreads  multithread    ngpus  time (s)  perf (ns/day)  imbalance  pme/F
1      40      1         nomultithread  0      7082.424  4.88           0.6        0.88
1      5       4         multithread    1      2344.079  14.744         0.8        0.03
1      4       5         nomultithread  2      1478.937  23.368         2.2        0.02
1      4       10        nomultithread  4      851.949   40.566         0.2        0.03

Table of speedups relative to the CPU-only run (e.g. 40.566 / 4.88 ≈ 8.3):

Number of GPUs  perf (ns/day)  Speedup
0               4.88           1.0
1               14.744         3.0
2               23.368         4.8
4               40.566         8.3

First benchmark (50,000 steps)

A performance of -1.0 means that the run did not finish, most of the time because of a segmentation fault in the program.

Some indicators, such as the load imbalance and the pme/F ratio, are shown where they were available.

The multithread column indicates if hyperthreading was used during the computation.

nodes  ntasks  nthreads  multithread    ngpus  time (s)  perf (ns/day)  imbalance  pme/F
1      5       8         nomultithread  0      489.09    3.533          0.5
1      10      4         nomultithread  0      469.594   3.68           0.8
1      8       5         nomultithread  0      437.289   3.952          0.4
1      20      2         nomultithread  0      381.868   4.526          0.7        1.02
1      40      1         nomultithread  0      365.068   4.734          0.7        0.87
1      40      1         nomultithread  4                -1.0
1      40      2         multithread    4                -1.0
1      32      1         nomultithread  4                -1.0
1      32      2         multithread    4                -1.0
1      20      2         nomultithread  4                -1.0
1      20      4         multithread    4                -1.0
1      16      2         nomultithread  4                -1.0                      0.02
1      16      4         multithread    4                -1.0                      0.02
1      12      3         nomultithread  4                -1.0                      0.06
1      12      6         multithread    4                -1.0                      0.06
1      8       4         nomultithread  4                -1.0                      0.05
1      8       8         multithread    4                -1.0                      0.05
1      10      1         nomultithread  1                -1.0                      0.03
1      10      2         multithread    1                -1.0                      0.04
1      20      1         nomultithread  2                -1.0
1      20      2         multithread    2                -1.0
1      10      2         nomultithread  2                -1.0                      0.04
1      10      4         multithread    2                -1.0                      0.04
1      2       5         nomultithread  1      132.988   12.995
1      4       2         nomultithread  1      128.825   13.415         0.7        0.03
1      2       10        multithread    1      125.646   13.754
1      5       2         nomultithread  1      124.236   13.91          0.5        0.03
1      4       4         multithread    1      120.234   14.373         0.3        0.03
1      5       4         multithread    1      117.625   14.692         0.5        0.02
1      2       10        nomultithread  2      107.217   16.118
1      2       20        multithread    2      103.69    16.667
1      4       5         nomultithread  2      75.724    22.822         0.5        0.02
1      4       10        multithread    2      73.035    23.662         3.4        0.03
1      4       10        nomultithread  4      44.863    38.521         0.2        0.03