Bug #3102

-multidir issue with OpenMP and 2 replicas

Added by Markus Hermann 22 days ago. Updated 18 days ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info: 2018.6 and 2019.3
Affected version:
Difficulty: uncategorized

Description

We are noticing performance issues with the -multidir option in our in-house modification of GROMACS 2018.6. The issue has also been observed with GROMACS 2019.3.

We are running replica simulations based on the principle of maximum entropy (details: https://doi.org/10.1021/acs.jctc.9b00338). For this we use a modified GROMACS version based on 2018.6 and the -multidir option. We later reproduced the same behavior with an unmodified GROMACS 2019.3 and on 2 different machines (details below).

We are running 2 replica simulations (on 2 nodes) as follows:
mpiexec -n 2 -npernode 1 mdrun_mpi -multidir replica0 replica1 -deffnm waxscalc -g replica2b.log -resetstep 3000 -nsteps 8000 -v -maxh 72 -ntomp 12 >& replic2.out

What we are expecting:
We expect that on each node 1 thread spawns and uses 12 cpus.

What we are noticing:
The CPU load is at 170% (out of 1200%) and the ns/day is also low.

This behavior only occurs when using 2 replicas. With 3 or 4 replicas the CPU load is at roughly 1200%.

Tested configurations:
mpiexec -n 4 -npernode 1 mdrun_mpi -multidir replica0 replica1 replica2 replica3 -deffnm waxscalc -g replica4.log -resetstep 3000 -nsteps 8000 -v -maxh 72 -ntomp 12 >& replic4.out
(ns/day) (hour/ns)
Performance: 159.653 0.150

mpiexec -n 3 -npernode 1 mdrun_mpi -multidir replica0 replica1 replica2 -deffnm waxscalc -g replica3.log -resetstep 3000 -nsteps 8000 -v -maxh 72 -ntomp 12 >& replic3.out
(ns/day) (hour/ns)
Performance: 156.362 0.153

mpiexec -n 2 -npernode 1 mdrun_mpi -multidir replica0 replica1 -deffnm waxscalc -g replica2.log -resetstep 3000 -nsteps 8000 -v -maxh 72 -ntomp 12 >& replic2.out
(ns/day) (hour/ns)
Performance: 60.035 0.400

System1:
gcc: 9.1.0
Open MPI: mpiexec (OpenRTE) 4.0.1
GPU: GTX 1070 Ti
CPU: Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz
(more details in the attached files)

System2:
gcc: 4.8.5
Open MPI: mpiexec (OpenRTE) 2.0.2
GPU: GTX 1070 Ti
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
(more details in the attached files)

waxscalc.tpr (1.8 MB) - tpr file generated with GROMACS 2019.3 that was used for testing. Markus Hermann, 09/23/2019 04:57 PM
system1.out (31.5 KB) - Markus Hermann, 09/23/2019 05:00 PM
system2.log (267 KB) - Markus Hermann, 09/23/2019 05:04 PM

History

#1 Updated by Berk Hess 22 days ago

I don't understand what your issue is.
You write "1 thread spawns 12 cpus". But a process spawns threads, which run on CPUs.
With your command line you simply spawn (-n)*(-ntomp) threads in total.
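For the 2-replica command above that is 2 × 12 = 24 threads in total, i.e. 12 threads per node with -npernode 1.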

#2 Updated by Markus Hermann 21 days ago

Ah, sorry, the wording was wrong here; I mixed up process and thread.

So here we want -n processes with -ntomp threads each.

What we see (with top on a node) is that the CPU load lies at ~170% when using 2 replicas. With 3 or 4 replicas it is ~1200%.

#3 Updated by Berk Hess 21 days ago

That doesn't look like a GROMACS problem to me.

The log file says:
Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity

Have you looked in top at which CPUs are active? Maybe the thread affinity set outside GROMACS puts multiple threads on the same cores and leaves other cores empty.
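A quick way to check (a sketch, assuming the process is called mdrun_mpi as in the command lines above): press '1' inside top to get the per-core view, or inspect the placement directly on one node:

taskset -cp $(pgrep -f mdrun_mpi | head -n 1)
ps -L -o tid,psr,pcpu,comm -p $(pgrep -f mdrun_mpi | head -n 1)

If several threads (TID) report the same core (PSR) while other cores stay idle, the externally set affinity mask is the likely cause.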

#4 Updated by Mark Abraham 21 days ago

If so, use mdrun -pin on, or stop the external thing (probably mpiexec) from doing unsuitable things :-)
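For the 2-replica run above that could look like the following (a sketch, not a tested command; --bind-to none is the Open MPI option that stops the launcher from setting its own binding):

mpiexec -n 2 -npernode 1 --bind-to none mdrun_mpi -multidir replica0 replica1 -deffnm waxscalc -g replica2b.log -resetstep 3000 -nsteps 8000 -v -maxh 72 -ntomp 12 -pin on >& replic2.out

With no external affinity mask in place, mdrun -pin on can then apply its own thread pinning.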

#5 Updated by Berk Hess 20 days ago

  • Status changed from New to Feedback wanted

Is this issue resolved or identified as an issue outside GROMACS?

#6 Updated by Markus Hermann 19 days ago

As the line:
Non-default thread affinity set probably by the OpenMP library,
cannot be found in the 3-replica and 4-replica jobs, I assume that the issue may indeed lie within the OpenMP library.
I am just wondering why this occurs in all 2-replica jobs.
As I am no expert on this, I may have to dig deeper and for now assume the issue lies outside GROMACS.

#7 Updated by Berk Hess 19 days ago

But have you tried with -pin on? Then GROMACS should override the external affinities.

#8 Updated by Mark Abraham 19 days ago

Markus Hermann wrote:

As the line:
Non-default thread affinity set probably by the OpenMP library,
cannot be found in the 3-replica and 4-replica jobs, I assume that the issue may indeed lie within the OpenMP library.
I am just wondering why this occurs in all 2-replica jobs.
As I am no expert on this, I may have to dig deeper and for now assume the issue lies outside GROMACS.

Could be anything, but if your cluster is treating a socket as a node, then a 2-replica job fits within an actual physical node, so something else might be inconsistently configured for it, unlike a job with more than 2 replicas.

#9 Updated by Markus Hermann 18 days ago

@Mark: We had 1 process on each node, so this should not be the issue.
The strange thing is that this happens on 2 completely different clusters; the only thing they have in common is that they run Slurm.

I tried '-pin on' yesterday and the performance was bad.
I tried it again today and now the performance is much better. Maybe a hiccup in the system during the first benchmark was responsible for the bad performance.

So for me this can be closed now.

#10 Updated by Markus Hermann 18 days ago

P.S. @all:
Thank you for your time and your helpful comments. You are doing a great job with this free support and with developing such an excellent package for MD simulations.

#11 Updated by Berk Hess 18 days ago

  • Status changed from Feedback wanted to Closed
