Bug #2845
Critical box fluctuations when using GPUs
Description
Systems become extremely unstable (i.e. very large box fluctuations) after 80-100 ns until they eventually deform (i.e. the membrane
becomes abnormally enlarged, large water pores open across it, and the protein collapses). None of the following actions helps:
1 - Increasing/decreasing certain parameters within the .mdp file
(e.g. tau-p).
2 - Re-building and re-equilibrating the system.
3 - Using different GROMACS versions, namely 5.1.4, 2016 and 2018.
4 - Using different computers and different GPU models.
5 - Using a different system (other membrane proteins of the same family with
different ligands).
Sooner or later, the same thing happens in every protein-membrane system. Intriguingly, the systems never
crash (i.e. the simulations keep running despite the observed artifacts).
Thus, the GROMACS+CHARMM36+GPU combination inevitably drives protein-membrane systems to collapse.
The only workaround so far is running the system using only CPUs.
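For reference, the vdW settings used by CHARMM-GUI for CHARMM36, which select the LJ force-switch code path, look like this (illustrative values; the exact .mdp files are in the simfiles.zip attached below):

    cutoff-scheme = Verlet
    vdwtype       = Cut-off
    vdw-modifier  = Force-switch   ; engages the LJ force-switch kernels
    rvdw-switch   = 1.0            ; switching starts at 1.0 nm
    rvdw          = 1.2            ; LJ cut-off at 1.2 nm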
Associated revisions
Fix incorrect LJ repulsion force switching on GPUs
When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.
Cherry-picked from 2019
Fixes #2845
Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
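To make the sign issue concrete, below is a minimal, self-contained sketch (not GROMACS source; all names and values are illustrative). For a force term F(r) = alpha/r^(p+1), force switching adds A*(r - rswitch)^2 + B*(r - rswitch)^3 between rswitch and rcut, with A and B chosen so that the switched force and its derivative vanish at rcut. With the sign of A flipped for the repulsion term (p = 12), as in the buggy CUDA/OpenCL kernels, the switched force no longer reaches zero at the cut-off:

    #include <math.h>
    #include <stdio.h>

    /* Coefficients A and B of the switched force
     *   F_s(r) = alpha/r^(p+1) + A*(r - r1)^2 + B*(r - r1)^3,  r1 <= r <= rc,
     * chosen so that F_s(rc) = 0 and F_s'(rc) = 0. */
    static void switch_coeffs(double alpha, int p, double r1, double rc,
                              double *A, double *B)
    {
        double d = rc - r1;
        *A = -alpha * ((p + 4) * rc - (p + 1) * r1) / (pow(rc, p + 2) * d * d);
        *B =  alpha * ((p + 3) * rc - (p + 1) * r1) / (pow(rc, p + 2) * d * d * d);
    }

    /* The switched force itself. */
    static double fswitch(double alpha, int p, double r1,
                          double A, double B, double r)
    {
        double dr = r - r1;
        return alpha / pow(r, p + 1) + A * dr * dr + B * dr * dr * dr;
    }

    int main(void)
    {
        double r1 = 1.0, rc = 1.2; /* nm; typical CHARMM36 cut-offs */
        double c12 = 1.0e-6;       /* arbitrary repulsion coefficient */
        double alpha = 12.0 * c12; /* force prefactor of the r^-12 term */
        double A, B;

        switch_coeffs(alpha, 12, r1, rc, &A, &B);
        /* Correct kernels: the repulsion force decays smoothly to 0 at rc. */
        printf("correct  F(rc) = %g\n", fswitch(alpha, 12, r1, A, B, rc));
        /* Sign of A flipped (the bug): a spurious nonzero force at rc and
         * excess repulsion throughout the switching region. */
        printf("sign bug F(rc) = %g\n", fswitch(alpha, 12, r1, -A, B, rc));
        return 0;
    }

The error is smooth and relatively small, which matches the reported behaviour: runs do not crash, but the excess repulsion at long range gradually distorts the pressure coupling and the box.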
History
#1 Updated by Mark Abraham 24 days ago
Thanks for the report. I've started GPU and CPU runs to see what I can reproduce.
What is the source of your membrane parameters and protocol? Is that known to be stable, i.e. not to undergo phase transitions?
Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?
#4 Updated by Ramon Guixa-Gonzalez 24 days ago
- File simfiles.zip added
Mark Abraham wrote:
Thanks for the report. I've started GPU and CPU runs to see what I can reproduce.
What is the source of your membrane parameters and protocol? Is that known to be stable, i.e. not to undergo phase transitions?
Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?
Thanks for the response, Mark:
- I built the system using the CHARMM-GUI web tool, so the source of my membrane parameters is the latest stable CHARMM release. No phase transition is expected at all (pure POPC at 310 K).
- Please find attached the extra files you would need to build your own .tpr.
#5 Updated by Ramon Guixa-Gonzalez 24 days ago
Berk Hess wrote:
How long did you run the CPU run? Also more than 100 ns?
Membranes were not stable at constant surface tension in some version(s) of the CHARMM force field, but I forget which one(s).
Dear Berk,
- The CPU run was run for 500 ns.
- I have been using CHARMM with the same parameters for more than 6 years and never came across any problem like this. Since I use the CHARMM-GUI interface to generate my systems, the version I am using is always the latest stable release (i.e., in this case, C36m).
#6 Updated by Ramon Guixa-Gonzalez 24 days ago
Berk Hess wrote:
Issue #2007 affected this configuration, but that was fixed in version 5.1.4 and in a patch release of 2016, and is not present in the 2018 release. But maybe this was not fixed correctly.
Can you attach an md.log file of a 2018 GPU run that produced incorrect results?
Please find attached the log file of the system I uploaded.
#7 Updated by Szilárd Páll 23 days ago
Berk Hess wrote:
Issue #2007 affected this configuration, but that was fixed in version 5.1.4 and in a patch release of 2016, and is not present in the 2018 release. But maybe this was not fixed correctly.
Unless the issue that caused #2007 was not correctly identified, it seems to me that the fix in 5.1 was appropriate.
#8 Updated by Gerrit Code Review Bot 21 days ago
Gerrit received a related patchset '1' for Issue #2845.
Uploader: Berk Hess (hess@kth.se)
Change-Id: gromacs~release-2019~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9066
#11 Updated by Berk Hess 17 days ago
- Status changed from Fix uploaded to Resolved
Applied in changeset 4576f802ac81b07a49c7c4ad21d3bf15831979a7.
#12 Updated by Gerrit Code Review Bot 17 days ago
Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer (paul.bauer.q@gmail.com)
Change-Id: gromacs~release-2016~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9083
#13 Updated by Gerrit Code Review Bot 17 days ago
Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer (paul.bauer.q@gmail.com)
Change-Id: gromacs~release-2018~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9084
#14 Updated by Berk Hess 14 days ago
Applied in changeset de27733a387922033b116bf7024c595e5b1e99a8.
#15 Updated by Berk Hess 1 day ago
Applied in changeset c131d790b48da10f5fcb9e4aabeef0d730b76724.