slow MD with sd integrator and GPU / verlet
I'm seeing MD simulations run a lot slower with the sd integrator than with md: ca. 10 vs. 30 ns/day for my 47000-atom system. I found no indication in the documentation that this should be the case.
Log files are attached. Wall time seems to be accumulating in Update and Rest, adding up to >60% of total.
Using the group cutoff scheme, there seems to be a non-negligible slowdown with sd, also associated with extra wall time in Update and Rest, but this is modest compared to the impact with the verlet scheme.
System: Xeon E5-1620, 1x GTX 680
ns/day, CPU only:
sd / verlet: 6
sd / group: 10
md / verlet: 9.2
md / group: 11.4
ns/day, with GPU:
sd / verlet: 11
md / verlet: 29.8
fixed SD+BD integration slowing down with OpenMP threads
The SD and BD integrators would integrate on all OpenMP threads,
making the integration much slower instead of faster.
It is not clear if the results could be affected by this bug.
#1 Updated by Berk Hess over 4 years ago
- Status changed from New to Closed
There is no bug here.
(Un)fortunately we made the non-bondeds on the GPU so fast that with SD we spend more time in the integration and constraints. The absolute time differences between md and sd should be similar with group and verlet, although settle in verlet is slightly slower because no charge groups are used.
You can switch to the sd1 integrator if that's accurate enough for you. That's about as fast as md.
#2 Updated by Berk Hess over 4 years ago
PS: note that mdp_opt.html says under sd:
An accurate leap-frog stochastic dynamics integrator.
Four Gaussian random number are required
per integration step per degree of freedom. With constraints,
coordinates needs to be constrained twice per integration step.
Depending on the computational cost of the force calculation,
this can take a significant part of the simulation time.
It gives the sd1 suggestion as well.
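To put the quoted "four Gaussian random numbers per integration step per degree of freedom" into numbers for the 47000-atom system in this report, here is a rough sketch. It assumes 3 degrees of freedom per atom, which is an upper bound since constraints remove some; the function and constant names are made up for illustration.

```python
N_ATOMS = 47_000        # system size quoted in the report
DOF_PER_ATOM = 3        # assumption: ignores constraints, so this is an upper bound
GAUSSIANS_PER_DOF = 4   # from the sd documentation quoted above


def gaussians_per_step(n_atoms, dof_per_atom=DOF_PER_ATOM, per_dof=GAUSSIANS_PER_DOF):
    """Upper bound on Gaussian random numbers drawn per integration step."""
    return n_atoms * dof_per_atom * per_dof


print(gaussians_per_step(N_ATOMS))  # 564000
```

So the sd integrator draws on the order of half a million Gaussian random numbers per step for this system, in addition to constraining the coordinates twice.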
#3 Updated by Floris Buelens over 4 years ago
Thanks for the reply, but I'm not convinced that explains the numbers. Ignoring the GPU for now, here are some more timings using '-nb cpu' only:
Group cutoff scheme:
integrator | ns/day | Update wall time (s) | Rest wall time (s)
md         | 11.6   |  2.03                |  0.34
sd         | 10.9   |  3.95                |  3.42
sd1        | 11.7   |  2.25                |  0.33
Verlet cutoff scheme:
integrator | ns/day | Update wall time (s) | Rest wall time (s)
md         |  9.4   |  2.30                |  1.31
sd         |  6.2   | 27.53                | 22.12
sd1        |  7.9   | 19.12                |  1.33
So with the group scheme, Update takes ca. 2x longer when you switch from md to sd, while with the verlet scheme, Update takes ca. 12x longer when you switch from md to sd.
Again without GPU, switching from md to sd costs me 6% with the group scheme, and 34% with verlet.
sd1 is indeed about as fast as md with the group scheme, but with verlet the hit is still significant (16%). On GPU, my sd1 timing is 18.3 ns/day, against 11 ns/day for sd and 29.8 ns/day for md, so sd1 is still 38% slower than md.
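The percentages above can be reproduced from the ns/day figures with a few lines of Python. This is only a sketch of the arithmetic; the dictionary and function names are illustrative, and the numbers are copied from this comment.

```python
# Timings in ns/day, copied from the tables and text above.
cpu_ns_per_day = {
    "group": {"md": 11.6, "sd": 10.9, "sd1": 11.7},
    "verlet": {"md": 9.4, "sd": 6.2, "sd1": 7.9},
}
gpu_verlet_ns_per_day = {"md": 29.8, "sd": 11.0, "sd1": 18.3}


def slowdown_percent(reference, value):
    """Relative loss in throughput versus the reference, in percent."""
    return 100.0 * (1.0 - value / reference)


def slowdowns(ns_per_day):
    """Slowdown of every non-md integrator relative to md."""
    md = ns_per_day["md"]
    return {
        name: slowdown_percent(md, value)
        for name, value in ns_per_day.items()
        if name != "md"
    }


for scheme, timings in cpu_ns_per_day.items():
    for name, pct in slowdowns(timings).items():
        print(f"CPU {scheme:6s} {name:3s}: {pct:5.1f}% slower than md")
for name, pct in slowdowns(gpu_verlet_ns_per_day).items():
    print(f"GPU verlet {name:3s}: {pct:5.1f}% slower than md")
```

With these inputs, sd comes out at about 6% (group) and 34% (verlet) slower than md on the CPU, sd1 with verlet at about 16%, and sd1 on the GPU at roughly 38.6%. A small negative value for sd1 with the group scheme just means it was marginally faster than md in that run.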
#4 Updated by Berk Hess over 4 years ago
- Status changed from Closed to In Progress
You're right, I overlooked the rest time.
I see now that the SD (and BD) update is not OpenMP threaded.
A (my) comment in the code says we need to find a way to generate random seeds for different threads.
I'll think of one.
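For illustration only, here is one common way to give each thread its own random stream: derive a per-thread seed from a single base seed and let each thread fill only its own slice of atoms. This is a Python sketch under stated assumptions, not the GROMACS implementation. The names `thread_seed`, `atom_range` and `draw_noise` are hypothetical, a plain loop stands in for the OpenMP parallel region, and hashing the seed is a pragmatic choice rather than a guarantee of statistically independent streams.

```python
import hashlib
import random


def thread_seed(base_seed, thread_id):
    """Derive a distinct, reproducible 64-bit seed for one thread by hashing."""
    digest = hashlib.sha256(f"{base_seed}:{thread_id}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


def atom_range(n_atoms, n_threads, thread_id):
    """Contiguous slice of atoms owned by one thread.

    The slices of all threads together cover every atom exactly once.
    """
    start = n_atoms * thread_id // n_threads
    end = n_atoms * (thread_id + 1) // n_threads
    return range(start, end)


def draw_noise(n_atoms, n_threads, base_seed):
    """Draw one Gaussian number per atom, each slice from its own generator."""
    noise = [0.0] * n_atoms
    for thread_id in range(n_threads):  # stands in for an OpenMP parallel loop
        rng = random.Random(thread_seed(base_seed, thread_id))
        for i in atom_range(n_atoms, n_threads, thread_id):
            noise[i] = rng.gauss(0.0, 1.0)
    return noise


noise = draw_noise(n_atoms=47_000, n_threads=8, base_seed=1993)
print(len(noise))
```

Because every thread owns a disjoint atom range and a private generator, no locking is needed, and the result for a given base seed and thread count is reproducible. The result does change with the number of threads, which is a known drawback of per-thread seeding.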