Bug #2845

critical box fluctuations when using GPUs

Added by Ramon Guixa-Gonzalez about 2 months ago. Updated 7 days ago.

Status: Feedback wanted
Priority: High
Assignee: -
Category: mdrun
Target version: 2019.2
Affected version - extra info: 2019, 2016
Affected version:
Difficulty: uncategorized

Description

Systems become extremely unstable (i.e. very large box fluctuations) after 80-100 ns until they eventually deform (i.e. the membrane
becomes abnormally enlarged, large water pores go through it, and the protein collapses). None of the following actions helps:

1 - Increasing/decreasing certain parameters within the mdp file
(e.g. tau-p).
2 - Re-building and re-equilibrating the system.
3 - Using different GROMACS versions, namely 5.4, 2016 and
2018.
4 - Using different computers and different GPU models.
5 - Using a different system (other membrane proteins of the same family with
different ligands).

Sooner or later, the same thing happens over and over again in every protein-membrane system. Intriguingly, the systems never
crash (i.e. the simulations keep on running despite the observed artifacts).
Thus, the GROMACS+CHARMM36+GPU combination inevitably causes protein-membrane systems to collapse.

The only workaround so far is running the system using only CPUs.

bug.zip (34.1 MB) - Ramon Guixa-Gonzalez, 01/28/2019 12:31 PM
simfiles.zip (1010 KB) - Ramon Guixa-Gonzalez, 01/28/2019 10:44 PM
logConcat (2.72 MB) - Ramon Guixa-Gonzalez, 01/28/2019 10:50 PM
2016_6.zip (43.4 MB) - Ramon Guixa-Gonzalez, 02/22/2019 10:46 AM
boxzCpu.jpeg (863 KB) - Ramon Guixa-Gonzalez, 02/22/2019 03:43 PM
boxProtein-free.png (111 KB) - Ramon Guixa-Gonzalez, 02/22/2019 08:08 PM
2016.6_400ns.jpeg (497 KB) - Ramon Guixa-Gonzalez, 02/24/2019 12:37 PM
2016.6XYZ.jpeg (554 KB) - Ramon Guixa-Gonzalez, 02/24/2019 01:32 PM
run1Concat.log (2.26 MB) - Ramon Guixa-Gonzalez, 02/25/2019 09:53 AM

Related issues

Related to GROMACS - Bug #2867: abnormal box fluctuations on GPUs still there (New)

Associated revisions

Revision 4576f802 (diff)
Added by Berk Hess about 2 months ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748

Revision de27733a (diff)
Added by Berk Hess about 2 months ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Cherry-picked from 2019

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748

Revision c131d790 (diff)
Added by Berk Hess about 1 month ago

Fix incorrect LJ repulsion force switching on GPUs

When using a CUDA or OpenCL GPU, the coefficient for the second order
term for the LJ repulsion in the force (not energy) switching function,
called 'A' in the manual, had the wrong sign in both the force-only
kernels and the force+energy kernels.
Note that the dispersion force switching was correct.

Cherry picked from 2019

Fixes #2845

Change-Id: Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
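
For reference, a minimal standalone sketch (plain C, not the GROMACS kernel code) of how the sign of the second-order coefficient 'A' affects the switched LJ repulsion force. It assumes the usual force-switch form F_sw(r) = alpha/r^(alpha+1) + A*(r-r1)^2 + B*(r-r1)^3 for r1 <= r <= rc (reduced units, C12 = 1), with A and B fixed by F_sw(rc) = 0 and F_sw'(rc) = 0; the switching range r1 = 1.0 nm, rc = 1.2 nm is only an assumed CHARMM-style example, and A and B are solved for numerically rather than taken from the manual's closed-form expressions.

/* Sketch only: compare the correctly switched repulsion force with the
 * same expression where the sign of A is flipped, as described in the
 * commit message above.  Compile with: cc sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double alpha = 12.0;       /* repulsive C12 term: force ~ alpha/r^13 */
    const double r1 = 1.0, rc = 1.2; /* assumed rvdw-switch and rvdw (nm) */
    const double d  = rc - r1;
    const double f  = alpha / pow(rc, alpha + 1.0);                  /* F_LJ(rc)  */
    const double fp = -(alpha + 1.0) * alpha / pow(rc, alpha + 2.0); /* F_LJ'(rc) */

    /* Solve  A*d^2 + B*d^3 = -f  and  2*A*d + 3*B*d^2 = -fp  for A and B */
    const double det = 3.0 * d * d * d * d - 2.0 * d * d * d * d;    /* = d^4 */
    const double A = (-3.0 * f * d * d + fp * d * d * d) / det;
    const double B = (-fp * d * d + 2.0 * f * d) / det;

    printf("A = %g  B = %g\n", A, B);
    for (double r = r1; r <= rc + 1e-9; r += 0.05)
    {
        const double dr      = r - r1;
        const double fLJ     = alpha / pow(r, alpha + 1.0);
        const double correct = fLJ + A * dr * dr + B * dr * dr * dr;
        const double flipped = fLJ - A * dr * dr + B * dr * dr * dr; /* wrong sign of A */
        printf("r = %.2f   correct = %9.4f   flipped-A = %9.4f\n", r, correct, flipped);
    }
    return 0;
}

With the correct coefficients the repulsion force decays smoothly to zero at rc; with the sign of A flipped it stays too repulsive throughout the switching region and does not reach zero at the cut-off, which is plausibly consistent with the systematic pressure difference discussed in the history below.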

History

#1 Updated by Mark Abraham about 2 months ago

Thanks for the report. I've started GPU and CPU runs to see what I reproduce.

What is the source of your membrane parameters and protocol? Is that known to be stable, i.e. not to undergo phase transitions?

Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?

#2 Updated by Berk Hess about 2 months ago

How long did you run the CPU run? Also more than 100 ns?
Membranes are/were not stable at constant surface tension in some versions of the CHARMM force field, but I forget which one(s) that was/were.

#3 Updated by Berk Hess about 2 months ago

Issue 2007 affected this configuration, but that was fixed in version 5.1.4, in a patch release of 2016 and is not present in the 2018 release. But maybe this was not fixed correctly.

Can you attach a md.log file of a 2018 GPU run that produced incorrect results?

#4 Updated by Ramon Guixa-Gonzalez about 2 months ago

Mark Abraham wrote:

Thanks for the report. I've started GPU and CPU runs to see what I reproduce.

What is the source of your membrane parameters and protocol? Is that known to be stable, i.e. not to undergo phase transitions?

Can you please let us also have the full topology file and supporting files, so we can build our own .tpr if we need to?

Thanks for the response Mark:

- I built the system using the CHARMM-GUI web tool, so the source of my membrane parameters is the latest stable CHARMM release. No phase transition is expected at all (pure POPC at 310 K).

- Please find attached the extra files you would need to build your own .tpr.

#5 Updated by Ramon Guixa-Gonzalez about 2 months ago

Berk Hess wrote:

How long did you run the CPU run? Also more than 100 ns?
Membranes are/were not stable at constant surface tension in some versions of the CHARMM force field, but I forget which one(s) that was/were.

Dear Berk,

- The CPU run was run for 500 ns.
- I have been using CHARMM with the same parameters for more than 6 years and never came across any problem like this. Since I use the CHARMM-GUI interface to generate my systems, the version I am using is always the latest stable release (i.e., in this case, C36m).

#6 Updated by Ramon Guixa-Gonzalez about 2 months ago

Berk Hess wrote:

Issue 2007 affected this configuration, but that was fixed in version 5.1.4, in a patch release of 2016 and is not present in the 2018 release. But maybe this was not fixed correctly.

Can you attach a md.log file of a 2018 GPU run that produced incorrect results?

Please find attached here the log file of the system I uploaded

#7 Updated by Szilárd Páll about 2 months ago

Berk Hess wrote:

Issue 2007 affected this configuration, but that was fixed in version 5.1.4, in a patch release of 2016 and is not present in the 2018 release. But maybe this was not fixed correctly.

Unless the issue that caused #2007 was not correctly identified, the fix on 5.1 seems appropriate to me.

#8 Updated by Gerrit Code Review Bot about 2 months ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9066

#9 Updated by Berk Hess about 2 months ago

  • Category set to mdrun
  • Status changed from New to Fix uploaded
  • Target version set to 2018.6
  • Affected version - extra info set to 2019, 2016

I am quite sure the fix I uploaded fixes the issue, but please test.

#10 Updated by Berk Hess about 2 months ago

I did actually test myself that this fix significantly reduces the difference in pressure between CPU and GPU at step 0.
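
(For anyone who wants to repeat this kind of check: a rough sketch using standard gmx tools; the file names are placeholders, and it assumes the step-0 energies are written to the .edr.)

gmx mdrun -s topol.tpr -nsteps 0 -nb cpu -deffnm step0_cpu
gmx mdrun -s topol.tpr -nsteps 0 -nb gpu -deffnm step0_gpu
echo Pressure | gmx energy -f step0_cpu.edr -o pressure_cpu.xvg
echo Pressure | gmx energy -f step0_gpu.edr -o pressure_gpu.xvg

Comparing the Pressure entries in the two .xvg files (or the energy terms printed in the respective md.log files) then shows the CPU/GPU difference at step 0.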

#11 Updated by Berk Hess about 2 months ago

  • Status changed from Fix uploaded to Resolved

#12 Updated by Gerrit Code Review Bot about 2 months ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer ()
Change-Id: gromacs~release-2016~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9083

#13 Updated by Gerrit Code Review Bot about 2 months ago

Gerrit received a related patchset '1' for Issue #2845.
Uploader: Paul Bauer ()
Change-Id: gromacs~release-2018~Ib5c250a1f370d4c0a4fd652bf6efa03e70e18748
Gerrit URL: https://gerrit.gromacs.org/9084

#16 Updated by Ramon Guixa-Gonzalez about 1 month ago


Berk Hess wrote:

Applied in changeset c131d790b48da10f5fcb9e4aabeef0d730b76724.

Berk Hess wrote:

I am quite sure the fix I uploaded fixes the issue, but please test.

Dear Berk,

I tested the 2016.6 release (http://manual.gromacs.org/2016.6/ReleaseNotes/release-notes.html#fix-incorrect-lj-repulsion-force-switching-on-gpus) and I am afraid this is not fixed...
It took more than 300 ns, but the box fluctuations seem to be there again. Please check the files I have attached.

Am I doing something wrong?

Thanks for the help again

R

#17 Updated by Paul Bauer about 1 month ago

  • Status changed from Resolved to Closed

#18 Updated by Berk Hess about 1 month ago

  • Related to Bug #2867: abnormal box fluctuations on GPUs still there added

#19 Updated by Berk Hess about 1 month ago

Could you attach a box plot for a long cpu run, so we can see what you think is normal and abnormal?

#20 Updated by Ramon Guixa-Gonzalez about 1 month ago

Berk Hess wrote:

Could you attach a box plot for a long cpu run, so we can see what you think is normal and abnormal?

Sure, here you go. 1 microsecond of a CPU run


#21 Updated by Ramon Guixa-Gonzalez about 1 month ago

Ramon Guixa-Gonzalez wrote:

Berk Hess wrote:

Could you attach a box plot for a long cpu run, so we can see what you think is normal and abnormal?

Sure, here you go. 1 microsecond of a CPU run


By the way, one thing that is now puzzling me: if I build the same system but protein-free and simulate it under the same conditions, there are no box fluctuations whatsoever, even using GPUs. Picture attached. Any hint?

#22 Updated by Berk Hess 30 days ago

Have you continued the 2016.6 GPU run? It could be that the system with the protein is just at the limit of stability with the force field. So maybe the 2016.6 GPU results are just "unlucky", but the system remains at the same box size for most of the time?

#23 Updated by Ramon Guixa-Gonzalez 29 days ago

Berk Hess wrote:

Have you continued the 2016.6 GPU run? It could be that the system with the protein is just at the limit of stability with the force field. So maybe the 2016.6 GPU results are just "unlucky", but the system remains at the same box size for most of the time?

I continued the 2016.6 GPU run and, as you can see in the pictures, the system is basically getting squeezed along the z dimension, so back to the beginning.

#24 Updated by Berk Hess 29 days ago

Are you 100% sure that you are running 2016.6 with the fix?
I checked that the fix corrected the energy and the pressure, so the fix actually fixes a bug that occurred with your run setup. I cannot exclude that there is a second bug, but I have no clue where that could be. The only special code with force-switching on GPUs is the few lines where the fix is.

Could you attach the md.log file for the GPU 2016.6 run?

#25 Updated by Ramon Guixa-Gonzalez 28 days ago

Berk Hess wrote:

Are you 100% sure that you are running 2016.6 with the fix?
I checked that the fix corrected the energy and the pressure, so the fix actually fixes a bug that occurred with your run setup. I cannot exclude that there is a second bug, but I have no clue where that could be. The only special code with force-switching on GPUs is the few lines where the fix is.

Could you attach the md.log file for the GPU 2016.6 run?

Well, I am running the recently released 2016.6 version (http://manual.gromacs.org/documentation/2016.6/download.html). There you state this is fixed, right? (http://manual.gromacs.org/documentation/2016.6/ReleaseNotes/index.html).

Please find attached the md.log file for the GPU 2016.6 run.

#26 Updated by Ramon Guixa-Gonzalez 25 days ago

Ramon Guixa-Gonzalez wrote:


Well, I am running the recently released 2016.6 version (http://manual.gromacs.org/documentation/2016.6/download.html). There you state this is fixed, right? (http://manual.gromacs.org/documentation/2016.6/ReleaseNotes/index.html).

Please find attached the md.log file for the GPU 2016.6 run.

Hi there, have you figured out what the problem is here? I only have access to a GPU cluster, so since this issue came up I am basically stuck...

#27 Updated by Berk Hess 19 days ago

This is an issue with high priority, as the setting says, but I have little clue where to look. Normally I would test things myself, but since this only happens after such a long time, and not without the protein, that is impractical.

The only suggestion I have is to run with a plain cut-off instead. Using a plain cut-off of 1.11 nm gives the same potential and force at distances between 0 and 1 nm and only differs in the switching region. This should not significantly change the properties of the system. Whether this is stable or not should also give us hints about this issue.

#28 Updated by Berk Hess 19 days ago

PS I meant a plain cut-off with shifted potential, which is the default potential modifier with the Verlet cutoff scheme.

#29 Updated by Berk Hess 7 days ago

  • Status changed from Closed to Feedback wanted
  • Target version changed from 2018.6 to 2019.2

Do you already have results with a 1.11 nm cut-off without switch?

#30 Updated by Ramon Guixa-Gonzalez 7 days ago

Berk Hess wrote:

Do you already have results with a 1.11 nm cut-off without switch?

Sorry, I have not been able to launch this test yet.
Would you mind confirming the lines I need to have in my mdp in order to perform this run?

This is what I currently use:
cutoff-scheme = Verlet
coulombtype = pme
rcoulomb = 1.2

Should I just use this:
cutoff-scheme = Verlet
coulombtype = Cut-off
rcoulomb = 1.1

Would this be all?

Thanks

R

#31 Updated by Berk Hess 7 days ago

No, still use PME.
Change:
rcoulomb=1.11
rvdw=1.11
vdw-modifier=potential-shift (which is the default)

This should also run faster.

You seem to be running with the default PME fourier spacing. You could have used a coarser grid with rcoulomb=1.2. But I would leave it as is with the 1.11 nm cut-off.
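
For completeness, the vdW/electrostatics part of the mdp for this test might then look like the sketch below (assuming the existing force-switch lines, e.g. vdw-modifier = Force-switch and rvdw-switch, are removed; the Cut-off and Potential-shift lines only spell out the defaults explicitly):

cutoff-scheme   = Verlet
coulombtype     = PME
rcoulomb        = 1.11
vdwtype         = Cut-off
vdw-modifier    = Potential-shift   ; default with the Verlet scheme
rvdw            = 1.11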
