Bug #1245

Bug in REMD with particle decomposition

Added by Xavier Periole over 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: High
Category: mdrun
Affected version - extra info: 407, 455, 457 and 461 were tested
Difficulty: uncategorized

Description

Here is a report of an issue that looks like a bug. It occurs when REMD is used in combination with particle decomposition (pd) but not with domain decomposition (dd). In the present case the use of pd is necessary due to the definition of extra harmonic bonds between molecules separated by ~4.0 to 6.0 nm, so dd does not allow decomposition over multiple CPUs.

The system is described using the MARTINI CG model, but a similar issue was observed with an atomistic system.

Note that there seems to be an evolution of the problem when going from version 407 to 455 and newer. I was actually able to run some 500 ns of 6 replicas with 407, but with closely related systems and box dimensions (which might well be the problem, but it is not clear at this point).

The issue manifests itself through crashes, with messages ranging from LINCS warnings and pressure scaling by more than 1% to "Large VCM: ..." and problems settling water molecules.

The observations that I have noted are:
1- following the evolution of a set of variables, it appears that the LJ(SR) energy and the pressure are the first to show a huge jump at the step right after the exchange. The temperature and box size seem to follow a few steps later. The jump in LJ(SR) is from 10 to 50% of the total value (~400,000 kJ/mol)
2- turning pd off (and removing the extra long bond) and using dd resolves the problem
3- in gmx457, turning off pressure coupling (going to NVT) resolves the problem, but ONLY if the starting conformations are identical. When the starting conformations are different the system collapses at the first exchange.
4- in gmx407 a system with similar starting conformations has run for ~500 ns with regular exchanges. Inspection of the exchanges indicates the issue is present, but increasing the pressure relaxation time to 5 ps seems to reduce the effect and the system is able to handle the perturbation and relax within 5-10 ps. This does not happen in gmx455 and newer.
5- reducing the time step from 20 fs to 2 fs does not help
6- switching to Parrinello-Rahman does not help
7- switching to dd resolves the problem

I attach a set of 16 tpr files that were used with gmx461. Note that you do not need to run all 16 replicas!
I used a command like:
<path-to>/mdrun -pd -replex 500 -multi 16
The crash happens at the first exchange (500 steps * 20 fs = 10 ps). I guess this can be sped up.

Any help would be greatly appreciated, and I can of course participate as much as my coding abilities allow :)).

Tks for your time,
XAvier.

bug-remd.tar (8.81 MB) bug-remd.tar Xavier Periole, 05/09/2013 12:57 PM
remd-issue.pdf (3.57 MB) remd-issue.pdf Xavier Periole, 05/09/2013 10:03 PM
remd-issue-gmx461.pdf (1.34 MB) remd-issue-gmx461.pdf Xavier Periole, 05/10/2013 09:13 PM
remd-issue-gmx461-slow.pdf (2.31 MB) remd-issue-gmx461-slow.pdf Xavier Periole, 05/10/2013 09:13 PM

Related issues

Related to GROMACS - Bug #1362: REMD simulation in NPT ensemble have unreasonable box sizes, with single cpu core per replica (Closed, 10/25/2013)

History

#1 Updated by Xavier Periole over 6 years ago

I reduced the number of replicas to 9 to be able to upload the tars.

XAvier.

#2 Updated by Michael Shirts over 6 years ago

  • Description updated (diff)

This is only a stopgap, and doesn't address the underlying problem, but have you tried using the pull code to define your harmonic bond instead of having a large molecule? In that case, you may be able to use dd after all.

#3 Updated by Xavier Periole over 6 years ago

I am not sure what you mean by stopgap and why it is not addressing the underlying problem ...

I have not tried the pull code since I use a 6D umbrella sampling, of which only 3 dimensions are active in this particular case. They include one distance and two improper dihedral angles. While the distance could be transferred to the pull code, the angles could not, so the problem would remain. In that case the pd option is of great use and it would be great to fix it :))

#4 Updated by Michael Shirts over 6 years ago

Stopgap meaning it's an alternate way to handle the problem w/o having to use particle decomposition. Not helping the underlying problem because it doesn't fix REMD for particle decomposition.

If it's a dihedral, that probably isn't possible to do with the pull code (I'm guessing).

Berk, I don't know the PD setup, but please let me know if I can help address any issues with understanding some of the changes I made in the code to handle multiple swaps efficiently.

#5 Updated by Berk Hess over 6 years ago

Your observations seem consistent with the box size not being properly exchanged.
But I just looked through the code and I didn't spot any issues.
You could print box elements before and after exchange to check.
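A minimal, self-contained sketch of that kind of check is given below. It uses a mock state struct rather than GROMACS's t_state (the types and field names are placeholders), and simply dumps the diagonal box elements right before and right after an exchange, so a box that is not swapped together with the coordinates shows up immediately.

/* Conceptual sketch only: mock types stand in for GROMACS's t_state.
 * The idea is to print the diagonal box elements right before and
 * right after the replica exchange; if the coordinates change but the
 * box does not, the box is not being exchanged properly. */
#include <stdio.h>

typedef double matrix3[3][3];   /* placeholder for the 3x3 box matrix */

static void print_box(FILE *fp, const char *tag, long step, matrix3 box)
{
    fprintf(fp, "%s step %ld box diag: %8.4f %8.4f %8.4f\n",
            tag, step, box[0][0], box[1][1], box[2][2]);
}

int main(void)
{
    matrix3 box_a = {{5.0, 0, 0}, {0, 5.0, 0}, {0, 0, 5.0}};
    matrix3 box_b = {{6.2, 0, 0}, {0, 6.2, 0}, {0, 0, 6.2}};

    /* Pretend this replica receives its partner's coordinates at the
     * exchange step but keeps its own box: the post-exchange line
     * should show 6.2 but still shows 5.0, which is exactly the kind
     * of mismatch to look for in the real run. */
    print_box(stdout, "pre-exchange ", 500, box_a);
    /* ... coordinates swapped here, box (incorrectly) left alone ... */
    print_box(stdout, "post-exchange", 500, box_a);
    print_box(stdout, "partner box  ", 500, box_b);
    return 0;
}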

#6 Updated by Xavier Periole over 6 years ago

@Michael: I got it :)) I thought the starting "This" of your sentence referred to my note and not your following comment ...

@Berk:

you can see the step by step evolution of a few quantities at the following link:
http://md.chem.rug.nl/~periole/remd-issue.pdf or in the file I attached.

Note that the data was extracted from a run using gmx-4.0.7. I could check with gmx-457 or gmx-461 since things are actually worse in these versions.

But overall it seems that the first thing to blow up is the LJ energy; the box only seems to follow. The change in LJ energy is really huge, and with the CG model small changes in distances should not have such an effect ... but I am not sure.

#7 Updated by Floris Buelens over 6 years ago

Hi,

I replied on the mailing list but perhaps you missed it. Your symptoms sound a lot like a bug I reported back in 2010: http://redmine.gromacs.org/issues/433 - closed with only a short comment from Berk that it was fixed for 4.5. With my fix as detailed in the redmine report I've been happily doing NPT replica exchange with my custom code forked from 4.0.5 - I've never tested any later versions.

Just from a quick look at the 4.6 code I think this could be the same or a related issue - with domain decomposition, a call to dd_partition_system (md.c:2062) copies box information from 'state_global' to 'state'. At least in 4.0.5, this wasn't happening with only one CPU per replica, so the box information after an exchange was still that of the old set of coordinates.

You could try a copy and paste of the fix from my bug report (after the line 2067 close-brace, so you get if (bExchanged && DOMAINDECOMP) {...} else if (bExchanged) {...}) as a quick test.
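For readers without the old report open, here is a rough, self-contained sketch of the shape of that workaround. The structs are mock stand-ins rather than GROMACS's t_state, and therm_state is a placeholder for whatever per-temperature-group quantity the patch in issue 433 copies; the actual patch in that report is the authoritative version.

/* Shape of the workaround only: mock structs, not GROMACS's t_state.
 * therm_state is a placeholder for the per-temperature-group data
 * copied in the original issue 433 patch. */
#include <string.h>

typedef double matrix3[3][3];

typedef struct {
    matrix3 box;          /* placeholder for state->box             */
    int     ngtc;         /* number of temperature-coupling groups  */
    double *therm_state;  /* placeholder per-group thermostat data  */
} mock_state;

/* After an accepted exchange without domain decomposition, take the
 * box (and per-group data) from the freshly exchanged global state,
 * mirroring what dd_partition_system() does in the DD branch. */
static void copy_exchanged_state(mock_state *state, const mock_state *state_global)
{
    int i;

    memcpy(state->box, state_global->box, sizeof(matrix3));
    for (i = 0; i < state->ngtc; i++)
    {
        state->therm_state[i] = state_global->therm_state[i];
    }
}

/* Intended placement, following the structure described above:
 *
 *   if (bExchanged && DOMAINDECOMP(cr))
 *   {
 *       dd_partition_system(...);                    // already copies the box
 *   }
 *   else if (bExchanged)
 *   {
 *       copy_exchanged_state(state, state_global);   // sketch of the else-branch
 *   }
 */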

#8 Updated by Xavier Periole over 6 years ago

Hi Floris,

I had not seen your post indeed! It ended up in my spam box for some reason.

I am looking at it right now. It looks like it might well be the problem.

#9 Updated by Xavier Periole over 6 years ago

Things are going well ... using only one CPU per replica it ran and did multiple exchanges successfully :))

However when I introduce the section you mention in your earlier bug fix I get the following error while compiling, at the make step:

/home/p228779/gromacs-4.6.1/src/kernel/md.c: In function 'do_md':
/home/p228779/gromacs-4.6.1/src/kernel/md.c:2074:31: error: 'lt' undeclared (first use in this function)
/home/p228779/gromacs-4.6.1/src/kernel/md.c:2074:31: note: each undeclared identifier is reported only once for each function it appears in
/home/p228779/gromacs-4.6.1/src/kernel/md.c:2074:33: error: expected ')' before ';' token
make[2]: *** [src/kernel/CMakeFiles/mdrun.dir/md.c.o] Error 1
make[1]: *** [src/kernel/CMakeFiles/mdrun.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

Any idea what is going on?

Note I got the same problem on two different machines.

#10 Updated by Floris Buelens over 6 years ago

some formatting got lost there - the 'for' line should read

for(i=0; i<state->ngtc; i++)

If you're working with 4.0 code I'd suggest you test there first, since I only tested this on code branched from 4.0.5.

#11 Updated by Xavier Periole over 6 years ago

Ok, found it ... the for statement should be replaced by:

for ( i=0; i<state->ngtc; i++)

but the program stops after the first exchange without giving any error message ...

#12 Updated by Floris Buelens over 6 years ago

...and bear in mind this was a hacked solution that made my specific problem go away, not tested any further than that. If you see correct behaviour with one CPU per replica I'm pretty convinced we've narrowed down the issue and Berk or colleagues can come up with a more trustworthy solution.

#13 Updated by Xavier Periole over 6 years ago

Ok, so here we are. The latest news:

- the modification proposed by Floris did not remove the problem in either version 4.0.7 or 4.6.1.
- the problem persists from 407 to 461
- using systems with closely related box sizes avoids an immediate crash, but the exchange is not done properly
- the use of one cpu per replica eradicates the problem.
- looking at the code has not helped me much so far ...

I attach a file containing some observables around the crash (using gmx461). There are two versions: in the one with different box sizes the crash occurs at the first exchange; the second test (file with the "slow" extension) uses similar starting conformations, so I can actually follow the observables a bit longer.

Again this is a problem only in particle decomposition ...

Any help would be greatly appreciated,

Tks for your time,
XAvier.

#15 Updated by Mark Abraham over 6 years ago

  • Target version changed from 4.6.2 to 4.6.3

I would like to have time to investigate this, but do not have it now.

#16 Updated by Mark Abraham over 6 years ago

  • Target version changed from 4.6.3 to 4.6.x

#17 Updated by Szilárd Páll almost 6 years ago

As indicated above, this may be related to #1362. Could you try whether gerrit change 2727 fixes the issue?

#18 Updated by Rossen Apostolov over 5 years ago

  • Status changed from New to Feedback wanted
  • Target version changed from 4.6.x to 4.6.6

Xavier, could you try again with the head of the 4.6-release branch or https://gerrit.gromacs.org/#/c/2727. I tested your inputs and they ran fine without a crash.

#19 Updated by Rossen Apostolov over 5 years ago

  • Status changed from Feedback wanted to Closed
  • Assignee changed from Berk Hess to Mark Abraham

@Xavier: please reopen the issue if it still exists. Moreover, PD has been removed (see #1292).
