Bug #2845

critical box fluctuations when using GPUs

Added by Ramon Guixa-Gonzalez 24 days ago. Updated 1 day ago.

Status: Resolved
Priority: High
Assignee: -
Category: mdrun
Target version:
Affected version - extra info: 2019, 2016
Affected version:
Difficulty: uncategorized

Description

Systems become extremely unstable (i.e. very large box fluctuations) after 80-100 ns until they eventually deform (i.e. the membrane becomes abnormally enlarged, large water pores form through it, and the protein collapses). None of the following actions helps:

1 - Increasing/decreasing certain parameters within the mdp file (e.g. tau-p).
2 - Re-building and re-equilibrating the system.
3 - Using different GROMACS versions, namely 5.4, gromacs2016 and gromacs2018.
4 - Using different computers and different GPU models.
5 - Using a different system (other membrane proteins of the same family with different ligands).

Sooner or later, the same thing happens over and over in every protein-membrane system. Intriguingly, the systems never crash (i.e. the simulations keep running despite the observed artifacts).
Thus, the gromacs+CHARMM36+GPUs combination inevitably drives protein-membrane systems to collapse.

The only workaround so far is running the system using only CPUs.
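For reference, this workaround does not require a rebuild: recent GROMACS versions let you keep the short-range nonbonded kernels on the CPU at run time. A sketch of the invocation (the `-deffnm md` file prefix is just an example):

```shell
# Keep the short-range nonbonded kernels on the CPU (GROMACS 2016 and later);
# this avoids the affected GPU kernels without changing anything else.
gmx mdrun -deffnm md -nb cpu
```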

bug.zip (34.1 MB), Ramon Guixa-Gonzalez, 01/28/2019 12:31 PM
simfiles.zip (1010 KB), Ramon Guixa-Gonzalez, 01/28/2019 10:44 PM
logConcat (2.72 MB), Ramon Guixa-Gonzalez, 01/28/2019 10:50 PM

Associated revisions

Revision 4576f802 (diff)
Added by Berk Hess 20 days ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
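The sign error is easiest to see in the switching function itself. Below is a minimal Python sketch, assuming the force-switch form and the A/B coefficient expressions given in the GROMACS reference manual for a 1/r^alpha potential (repulsion: alpha = 12, dispersion: alpha = 6); the function names are mine, and this is an illustration, not the kernel code. With the correct sign the switched force goes smoothly to zero at the cutoff; flipping the sign of A, as in the bug, leaves a large spurious repulsive force there:

```python
def switch_coeffs(alpha, r1, rc):
    # Second-order (A) and third-order (B) force-switch coefficients for a
    # 1/r^alpha potential, per the GROMACS reference manual.
    A = -alpha * ((alpha + 4) * rc - (alpha + 1) * r1) / (
        rc ** (alpha + 2) * (rc - r1) ** 2)
    B = alpha * ((alpha + 3) * rc - (alpha + 1) * r1) / (
        rc ** (alpha + 2) * (rc - r1) ** 3)
    return A, B

def switched_force(r, alpha, r1, rc, sign=1.0):
    # Magnitude of the switched force for r1 <= r <= rc; sign=-1.0 mimics
    # the reported bug (wrong sign on the A coefficient).
    A, B = switch_coeffs(alpha, r1, rc)
    return alpha / r ** (alpha + 1) + sign * A * (r - r1) ** 2 + B * (r - r1) ** 3

r1, rc = 1.0, 1.2  # e.g. rvdw-switch = 1.0 nm, rvdw = 1.2 nm
correct = switched_force(rc, 12, r1, rc)           # ~0 at the cutoff
buggy = switched_force(rc, 12, r1, rc, sign=-1.0)  # large nonzero force at rc
```

Such a spurious force acting on every pair near the cutoff would plausibly accumulate into exactly the kind of CPU/GPU pressure discrepancy discussed in this thread.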

Revision de27733a (diff)
Added by Berk Hess 17 days ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Cherry-picked from 2019

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748

Revision c131d790 (diff)
Added by Berk Hess 1 day ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Cherry-picked from 2019

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748

History

#1 Updated by Mark Abraham 24 days ago

Thanks for the report. I've started GPU and CPU runs to see what I reproduce.

What is the source of your membrane parameters and protocol? Is it known to be stable, i.e. not to undergo phase transitions?

Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?

#2 Updated by Berk Hess 24 days ago

How long was your CPU run? Also more than 100 ns?
Membranes are/were not stable at constant surface tension in some version of the CHARMM force field, but I forget which one(s) those were.

#3 Updated by Berk Hess 24 days ago

Issue #2007 affected this configuration, but it was fixed in version 5.1.4 and in a patch release of 2016, and is not present in the 2018 release. But maybe it was not fixed correctly.

Can you attach an md.log file of a 2018 GPU run that produced incorrect results?

#4 Updated by Ramon Guixa-Gonzalez 24 days ago

Mark Abraham wrote:

Thanks for the report. I've started GPU and CPU runs to see what I reproduce.

What is the source of your membrane parameters and protocol? Is it known to be stable, i.e. not to undergo phase transitions?

Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?

Thanks for the response, Mark:

- I built the system using the CHARMM-GUI web tool, so the source of my membrane parameters is the latest stable CHARMM release. No phase transition is expected at all (pure POPC at 310 K).

- Please find attached the extra files you need to build your own .tpr.

#5 Updated by Ramon Guixa-Gonzalez 24 days ago

Berk Hess wrote:

How long was your CPU run? Also more than 100 ns?
Membranes are/were not stable at constant surface tension in some version of the CHARMM force field, but I forget which one(s) those were.

Dear Berk,

- The CPU run was 500 ns long.
- I have been using CHARMM with the same parameters for more than 6 years and have never come across a problem like this. Since I use the CHARMM-GUI interface to generate my systems, the version I use is always the latest stable release (i.e. in this case, C36m).

#6 Updated by Ramon Guixa-Gonzalez 24 days ago

Berk Hess wrote:

Issue #2007 affected this configuration, but it was fixed in version 5.1.4 and in a patch release of 2016, and is not present in the 2018 release. But maybe it was not fixed correctly.

Can you attach an md.log file of a 2018 GPU run that produced incorrect results?

Please find attached the log file of the system I uploaded.

#7 Updated by Szilárd Páll 23 days ago

Berk Hess wrote:

Issue 2007 affected this configuration, but that was fixed in version 5.1.4, in a patch release of 2016 and is not present in the 2018 release. But maybe this was not fixed correctly.

Unless the issue that caused #2007 was misidentified, the fix in 5.1 seems appropriate to me.

#8 Updated by Gerrit Code Review Bot 21 days ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9066

#9 Updated by Berk Hess 21 days ago

  • Category set to mdrun
  • Status changed from New to Fix uploaded
  • Target version set to 2018.6
  • Affected version - extra info set to 2019, 2016

I am quite sure the fix I uploaded fixes the issue, but please test.

#10 Updated by Berk Hess 21 days ago

I did actually test myself that this fix significantly reduces the difference in pressure between CPU and GPU at step 0.

#11 Updated by Berk Hess 17 days ago

  • Status changed from Fix uploaded to Resolved

#12 Updated by Gerrit Code Review Bot 17 days ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer
Change-Id: gromacs~release-2016~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9083

#13 Updated by Gerrit Code Review Bot 17 days ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer
Change-Id: gromacs~release-2018~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9084

#14 Updated by Berk Hess 14 days ago

#15 Updated by Berk Hess 1 day ago
