Project

General

Profile

Bug #1341

Inconsistent energy output

Added by Justin Lemkul over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
4.6.3, release-4-6, master
Affected version:
Difficulty:
uncategorized
Close

Description

We have been doing single-point energy calculations in CHARMM and GROMACS to validate our latest CHARMM force field, with the hopes of contributing it to GROMACS very soon. In the final stages of testing, we noticed some disturbing inconsistencies in the results. All of our small molecule tests have gone well, but we recently constructed a system of protein, water, and ions to check what might be considered a very common system. The outcome was wildly different energy values, depending on which version of GROMACS we have used.

We calculated the single-point energy values in two ways, one using a zero-step EM procedure and the other using mdrun -rerun for zero steps. I will summarize our findings below, and I am attaching .tpr files and .log files for each.

The "correct" value of the energy (at least from the perspective of CHARMM, whose energy value we have thus far been able to reproduce for all simple systems) is -7.58785e+04 kJ/mol.

Version 4.5.5
Zero-step EM: -7.58789e+04 (correct value)
Rerun: -8.59875e+04

Version 4.6.3
Zero-step EM: -8.58497e+04
Rerun: -9.59530e+04

More disturbingly, the output of mdrun -rerun appears to be dependent upon the number of energygrps specified in the .mdp file. We had used multiple energygrps to monitor individual nonbonded interactions (for debugging purposes), but switching to one energygrp (System) leads to the following mdrun -rerun energy of -9.43066e+04.

Having gone through recent released versions, I also tried the same tests using the latest release-4-6 and master branches (pulled this morning). The results:

Release-4-6
Zero-step EM:-8.58497e+04
Rerun: -5.67696e+04
Rerun w/1 energygrp: -9.59530e+04

Note the massive difference in energy, based solely on the energygrps.

Master
Zero-step EM: -7.58785e+04 (correct)
Rerun: -8.59870e+04
Rerun w/1 energygrp: -8.59811e+04

So it seems the master branch calculates the correct energy, but only when running a zero-step EM and not with mdrun -rerun, and the disparity between single and multiple energygrps is reduced, but still troubling. The release-4-6 branch seems badly broken, especially in terms of the energygrps.

log_files.tar.gz (8.17 KB) Justin Lemkul, 09/23/2013 07:12 PM

tpr_files.tar.gz (1.38 MB) Justin Lemkul, 09/23/2013 07:12 PM

log_files_new.tar.gz (6.89 KB) Justin Lemkul, 09/23/2013 07:49 PM

tpr_files_new.tar.gz (1.11 MB) Justin Lemkul, 09/23/2013 07:49 PM

mix_gmx.ions.pdb (1.11 MB) Justin Lemkul, 09/24/2013 02:35 PM

input.tar.gz (329 KB) Justin Lemkul, 09/24/2013 03:32 PM

ff.tar.gz (293 KB) Justin Lemkul, 09/24/2013 03:42 PM

Associated revisions

Revision 5c551693 (diff)
Added by Erik Lindahl over 3 years ago

Fix bug in (long) neighborlist SIMD padding when adding to previous list

Gromacs-4.6 introduced SIMD padding in the neighborlists, which works
fine for normal simulations. However, when the neighborlist gets long
and we end up adding a second batch of particles we need to remove the
previous padding, which was not done until now. This will typically only
occur when the list per node is large, e.g. when using long cutoffs
(>2nm) with only a single core. Normal simulations should not have been
affected by it (which is also why we did not find it until now).

Fixes #1341.

Change-Id: Ie64ab6c0313a8dc0d3545a5e7d610f24adae4438

History

#1 Updated by David van der Spoel over 3 years ago

I assume that is Epot you're reporting?

Is the rerun done with xtc or trr file? xtc has round off (but should not be that bad).

Have you tried to break down the energy components?

Are the bondeds OK?

#2 Updated by Justin Lemkul over 3 years ago

David van der Spoel wrote:

I assume that is Epot you're reporting?

Yes

Is the rerun done with xtc or trr file? xtc has round off (but should not be that bad).

A single frame, in .pdb format. We have been doing the same for the other tests, with .pdb and .crd files of identical precision.

Have you tried to break down the energy components?

Are the bondeds OK?

Yes, I forgot to mention this. All bonded terms are correct, as are 1-4 terms. The terms that change are LJ(SR) and Coul(SR).

#3 Updated by David van der Spoel over 3 years ago

OK bondeds seem OK, although there is no real reference log file.

Could you please add one?

Have you checked the flop counting?

#4 Updated by David van der Spoel over 3 years ago

gmxcheck -s1 zero_455.tpr -s2 zero_463 gives MANY differences:

inputrec->ns_type (0 - 1)
inputrec->bContinuation (0 - 1)
inputrec->verletbuf_drift (0.000000e+00 - 5.000000e-03)
inputrec->nstcalclr (10 - 0)

Interestingly, running 4.6.3 with the 4.5.5 tpr gives the same result as 4.5.5.

#5 Updated by Justin Lemkul over 3 years ago

David van der Spoel wrote:

OK bondeds seem OK, although there is no real reference log file.

Could you please add one?

I guess that's because I don't exactly know which one to trust. The zero-step EM from 4.5.5 and master gives the correct result, so I would consider either of them the "reference" for this purpose. A colleague has reproduced the findings of 4.5.5 on different hardware.

Have you checked the flop counting?

I'm looking at it, but quite honestly I don't know what useful insight I'm supposed to draw from it.

#6 Updated by Justin Lemkul over 3 years ago

David van der Spoel wrote:

gmxcheck -s1 zero_455.tpr -s2 zero_463 gives MANY differences:

inputrec->ns_type (0 - 1)
inputrec->bContinuation (0 - 1)
inputrec->verletbuf_drift (0.000000e+00 - 5.000000e-03)
inputrec->nstcalclr (10 - 0)

Interestingly, running 4.6.3 with the 4.5.5 tpr gives the same result as 4.5.5.

Sorry for the inconsistencies; I suppose I was trying too many things. Attached are new tarballs comparing just 4.6.3 and 4.5.5 with .mdp files that should be consistent between the two versions (aside from version-specific changes that I cannot control).

I am also noticing that there can be significant differences between ns_type grid and simple.

#7 Updated by David van der Spoel over 3 years ago

Some more observations (now running with DP to rule out precision).

gmx455 gives identical results on 1 and 4 cores.
gmx463 gives almost the same energy as 455 with the 455 tpr file (but 200 kJ/mol off)

4.5.5:

          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.33688e+03    5.53098e+04    6.29565e+04   -2.18116e+05   -7.58875e+04

4.6.3:

          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.33688e+03    5.53098e+04    6.27420e+04   -2.18002e+05   -7.59881e+04

The number of pairs tested by the NS is the same, but the inner loop count is different in both versions.

Without acceleration 4.6.3 gives the same result as with acceleration.

As I would expect, the master branch gives almost the same energy as the 4.6 branch.
Could it nevertheless be neighborsearching?

#8 Updated by Mark Abraham over 3 years ago

  • Status changed from New to Accepted
  • Target version set to 4.6.4

What files were used as input to to -rerun, please?

With release-4-6 I tried mdrun -s ../tprs/min_455.tpr -deffnm r46-on-min-v4.5.5-nt-1 -nt 1 and got

          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.33688e+03    5.53099e+04    6.11263e+04   -1.79331e+05   -3.89324e+04

whereas with the same command and mdrun, but -nt 4 I got

          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.33688e+03    5.53098e+04    6.35186e+04   -2.15082e+05   -7.22916e+04

So there is at least one issue here to fix. I will run a git bisection to see where this particular behaviour emerged. Hopefully that will shed some light.

#9 Updated by Justin Lemkul over 3 years ago

Mark Abraham wrote:

What files were used as input to to -rerun, please?

The zero_*.tpr files are the inputs for -rerun. min_*.tpr are standalone EM processes. I have attached the coordinate file we're using here to make things easy.

mdrun -s zero_455.tpr -rerun mix_gmx.ions.pdb

With release-4-6 I tried mdrun -s ../tprs/min_455.tpr -deffnm r46-on-min-v4.5.5-nt-1 -nt 1 and got

[...]

whereas with the same command and mdrun, but -nt 4 I got

[...]

So there is at least one issue here to fix. I will run a git bisection to see where this particular behaviour emerged. Hopefully that will shed some light.

This is an interesting observation. I had only been running on one thread. Thanks for all the commentary thus far. Let me know if you need any more information, input files, etc from me.

#10 Updated by Justin Lemkul over 3 years ago

I have a new bit of information that will hopefully be helpful. The version of master that I compiled yesterday had no CPU acceleration; it produced the correct energy value using a zero-step EM. I re-compiled today with SSE4.1, and the same .tpr file (min_master.tpr) now produces the incorrect energy of -8.58497e+04, which is the same as using mdrun -rerun. The machine on which I first discovered the problem (Linux, running GROMACS 4.6.3) is a bit older and has SSE2, though it is worth noting that all results posted to this bug report thus far have been from my Mac, compiled with gcc 4.7.3 and FFTW 3.3.3. I can try to re-compile release-4-6 or 4.6.3 without any acceleration to see if that is the source of the issue.

#11 Updated by Mark Abraham over 3 years ago

366c49a438 introduces the difference I note in post 8. It added the new SSE2 group kernels and changed some neighbour searching. The kernels themselves should be fine, so the non-kernel changes there would likely be the culprit. Yet more examples of why we should not accept kitchen-sink patches!

Will look to see what the issue might be.

#12 Updated by Justin Lemkul over 3 years ago

Justin Lemkul wrote:

I have a new bit of information that will hopefully be helpful. The version of master that I compiled yesterday had no CPU acceleration; it produced the correct energy value using a zero-step EM. I re-compiled today with SSE4.1, and the same .tpr file (min_master.tpr) now produces the incorrect energy of -8.58497e+04, which is the same as using mdrun -rerun. The machine on which I first discovered the problem (Linux, running GROMACS 4.6.3) is a bit older and has SSE2, though it is worth noting that all results posted to this bug report thus far have been from my Mac, compiled with gcc 4.7.3 and FFTW 3.3.3. I can try to re-compile release-4-6 or 4.6.3 without any acceleration to see if that is the source of the issue.

Confirmed that current release-4-6 without any CPU acceleration produces the correct energy (-7.58785e+04) in zero-step EM. mdrun -rerun produces incorrect energy (-8.58497e+04) with this version.

#13 Updated by Mark Abraham over 3 years ago

Hmm, maybe post 10 and 12 mean we have to keep an open mind about the the kernels, too... or at least the padding.

#14 Updated by Mark Abraham over 3 years ago

Justin, can you please upload a tarball of all the inputs for grompp? (At least a 4.6-era grompp.) Erik plans to spend a few hours today squashing this, and being able to probe with different energy group settings might be something he needs to do.

#15 Updated by Justin Lemkul over 3 years ago

Mark Abraham wrote:

Justin, can you please upload a tarball of all the inputs for grompp? (At least a 4.6-era grompp.) Erik plans to spend a few hours today squashing this, and being able to probe with different energy group settings might be something he needs to do.

Sure thing. Here you go. These are the exact files I've been using for all the tests, just switching out commented lines and whatever settings as needed.

#16 Updated by Mark Abraham over 3 years ago

Thanks. It also needs charmm36a.ff - per contrib or local version?

#17 Updated by Justin Lemkul over 3 years ago

Mark Abraham wrote:

Thanks. It also needs charmm36a.ff - per contrib or local version?

Oh, duh. Local version, what we're preparing to release soon. Attached here.

#18 Updated by Mark Abraham over 3 years ago

I have also observed that 366c49a438 with GMX_CPU_ACCELERATION=None sees mdrun -nt 1 and -nt 4 produce the same EM result (-7.58815e+04 and -7.58847e+04, resp), and the same -rerun result (-8.59870e+04 and -8.59888e+04 resp). These are in agreement with the corresponding 4.5.5 numbers Justin reports above.

With SSE2 acceleration, -nt 1 and -nt 4 produce different result:
zero-step em:

-42151.2,  step   0:     -72774.4

rerun:
-56769.6,  step   1:     -85990.3

All that looks very much like a buffer-padding fail.

Justin, zero-step EM does constraints (i.e. including SETTLE) before computing the energy (for better or worse). So some numerical disagreement between EM and rerun is expected. However, the sign and magnitude of the difference between the 4.5.5 numbers for "zero-step EM from" and "rerun on" the same coordinates in the original post seems wrong. Are you suggesting that they should be comparable (i.e. that the EM tpr contains the -rerun coordinates), modulo that constraint stage?

#19 Updated by Justin Lemkul over 3 years ago

Mark Abraham wrote:

Justin, zero-step EM does constraints (i.e. including SETTLE) before computing the energy (for better or worse). So some numerical disagreement between EM and rerun is expected. However, the sign and magnitude of the difference between the 4.5.5 numbers for "zero-step EM from" and "rerun on" the same coordinates in the original post seems wrong. Are you suggesting that they should be comparable (i.e. that the EM tpr contains the -rerun coordinates), modulo that constraint stage?

The application of SETTLE is fine. We do the same in CHARMM before calculating the energy, so that's why I'm thinking that the zero-step EM is what I believe to be the "correct" value (and it matches CHARMM, so that helps). It's just a bit strange (and perhaps it's because I don't know all of the details of the routines here - md.c is scary) that with other small molecules (amino acids, nucleotides, etc) that we have been testing, the agreement between zero-step EM and mdrun -rerun has been nearly perfect. Perhaps it's just a matter of waters or something, given that the problematic case here is the first we have done with a full box of water.

All .tpr files contain the same coordinates, and the .tpr files were built using the .pdb file I provided, so in theory the .tpr and .pdb should match in terms of the initial (unconstrained) coordinates.

#20 Updated by Erik Lindahl over 3 years ago

Hi,

Just getting started here, so you guys still have a bit of a head start. However, let's clearly separate this into three (or more) different things:

1) A possible general difference between Gromacs-4.5 and Gromacs-4.6. I just tried creating my own input files for Gromacs-4.5, and then I noticed that I get potentials around -8.59E+04 when continuation=yes, and -7.59E+04 when continuation=no. The good news here is that I get exactly the same results when using Gromacs-4.6 with the Gromacs-4.5 TPR files. When I'm using grompp from Gromacs-4.6 to generate things I'm getting lots of differences in the number of Settles and LJ14 interactions - that's the next thing I need to trace down, to make sure there are no issues in the TPR file generation. Still, the first discrepancy Justin saw could be due to continuation being selected in one case but not the other.

2) Based on Mark's testing, there seems to be something strange happening when using fewer threads than physical processors. I too see this, and we'll make sure to fix today.

3) When it comes to rerun, I think those discrepancies are related to the behaviour in item (1) above, but I'm not 100% sure - yet.

#21 Updated by Justin Lemkul over 3 years ago

Thanks, Erik. The continuation setting could be one source of discrepancy and that may be a boneheaded move on my part if I've botched the settings there. Just to keep track, I would also add:

4) dependence of Epot on the number of energygrps

I probably should have filed that as a separate issue, but since things all seemed interrelated I didn't know how best to break things up.

#22 Updated by Erik Lindahl over 3 years ago

No problem. I'll start and go after item (2) in the list, since that might require me checking kernels. Will get back soon.

#23 Updated by Erik Lindahl over 3 years ago

  • Status changed from Accepted to Fix uploaded

OK, I think I solved this. The good news is that the kernels are fine, and likely most normal simulations. However, the SIMD padding of the neighborlist that we introduced in Gromacs-4.6 was always applied when we closed a list, and we did not remove this if the list for the same i-particle-list is added to with a second batch of j-particles. I've checked things a bit, and this does not seem to occur for any normal cutoff (<1.6nm) even when just using a single core, and when running in parallel that limit is pushed further up. Coarse-grained simulations are also likely to be fine, since it is the number of particles within the cutoff rather than cutoff itself that matters. The reason it occured here is that Justin's system is both dense and uses long cutoffs - but we should of course be correct for that case to.

Justin: Can you check if https://gerrit.gromacs.org/#/c/2623/ does indeed fix it for you too?

(There is still the difference between continuation=yes/no, but I don't think that's a bug)

#24 Updated by Justin Lemkul over 3 years ago

Thanks, Erik. The patch does fix the problem. The zero-step EM processes now generate equivalent energies across versions. There is still a very small difference based on the energygrps, but it's only about 6 kJ/mol so it's a vast improvement.

#25 Updated by Erik Lindahl over 3 years ago

When enabling/disabling energy groups my total energies match to within ~4e-5 in Gromacs-4.6. That is likely the numerical accuracy - remember that you are summing in the order of 100 million interactions since your long cutoff mena you are using all-vs-all interactions here, and since the potential is a sum of all those contributions the error will add up (forces in contrast will only be sums of ~14,000 terms). When enabling four different energy groups we execute the kernels and sum things in a slightly different order. Try double precision if you want to make sure the force fields produce the same result!

#26 Updated by Justin Lemkul over 3 years ago

Great, thanks Erik.

#27 Updated by Erik Lindahl over 3 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#28 Updated by Mark Abraham over 3 years ago

When I do EM with the new patch, the only way to reproduce the CHARMM energy is if I had set

continuation = no
(or equivalent with a code hack). That is, reproducing the CHARMM energy requires doing constraints before reporting the energy. That seems consistent with Justin's statement of what the CHARMM calculation is doing.

If I set

continuation = yes
in the EM .mdp, then the constraints are not enforced, and release-4-6 produces the same energy (within a few kJ/mol) with zero-step EM as it does with -rerun, despite -rerun using charge groups when EM did not. This is expected, so that is good. This was also true over all combinations of -nt 1, -nt 4, no SIMD, and AVX256.

So that resolves points 2-4. We rationalised Erik's observations in 1 with the change to LJ parameter treatment, so I think things are all resolved. Thanks all.

#29 Updated by Mark Abraham over 3 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF