Bug #1245

Updated by Michael Shirts about 7 years ago

Here is a report of an issue that looks like a bug. It occurs when REMD is used in combination with particle decomposition (pd) but not with domain decomposition (dd). In the current case the use of pd is necessary due to the definition of extra harmonic bonds between molecules separated by ~4.0 to 6.0 nm, so dd does not allow decomposition over multiple CPUs.

The system is described using the MARTINI CG model, but a similar issue was observed with an atomistic system.

Note that the problem seems to have evolved between version 407 and versions 455 and newer. I was actually able to run ~500 ns of 6 replicas with 407, but with closely related systems and box dimensions (which might well be the problem, but this is not clear at this point).

The issue manifests itself as crashes, with messages ranging from LINCS warnings and pressure scaling of more than 1% to "Large VCM(group SOL): ..." and problems settling water molecules.

The observations that I have noted are:
1- following the evolution of a set of variables, it appears that LJ(SR) and the pressure are the first to show a huge jump, at the step right after the exchange. The temperature and box size seem to follow a few steps later. The jump in LJ(SR) is from 10 to 50% of the total value (~400,000 kJ/mol)
2- turning pd off (and removing the extra long bonds) and using dd resolves the problem
3- in gmx457, turning off pressure coupling (going to NVT) resolves the problem, but ONLY if the starting conformations are identical. When the starting conformations are different, the system collapses at the first exchange.
4- in gmx407 a system with similar starting conformations ran for ~500 ns with regular exchanges. Inspection of the exchanges indicates that the issue is present, but increasing the pressure relaxation time (to 5 ps) seems to reduce the effect, and the system is able to handle the issue and relax within 5-10 ps. This does not happen in gmx455 and newer.
5- reducing the time step from 20 fs to 2 fs does not improve the problem
6- switching to Parrinello-Rahman does not help
7- switching to dd resolves the problem
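
For anyone who wants to check observation 1 on their own runs, here is a minimal Python sketch of how the jumps could be located. It assumes the LJ(SR) (or pressure) time series has already been extracted per replica, for instance with g_energy, into two lists of steps and values. The function name, the 10% threshold and the sample numbers are hypothetical and only illustrate the idea; the values are synthetic, not taken from the attached runs.

```python
def flag_exchange_jumps(steps, values, replex, threshold=0.10):
    """Flag samples at or right after an exchange boundary whose value
    changed by more than `threshold` (relative) from the previous sample.

    steps     -- MD step number of each energy sample (increasing)
    values    -- energy term at each sample, e.g. LJ(SR) in kJ/mol
    replex    -- exchange interval in steps (the mdrun -replex value)
    threshold -- relative change treated as a jump (hypothetical default)
    """
    flagged = []
    for i in range(1, len(steps)):
        # A multiple of `replex` was reached between the previous sample
        # (exclusive) and the current one (inclusive).
        crossed = steps[i] // replex > steps[i - 1] // replex
        prev = values[i - 1]
        if crossed and prev != 0:
            rel = abs(values[i] - prev) / abs(prev)
            if rel > threshold:
                flagged.append((steps[i], rel))
    return flagged


# Synthetic example only: a jump of about 25% at the first exchange.
example_steps = [480, 490, 500, 510]
example_values = [400000.0, 401000.0, 300000.0, 299000.0]
jumps = flag_exchange_jumps(example_steps, example_values, replex=500)
for step, rel in jumps:
    print(f"step {step}: relative change {rel:.1%}")
```

Running the same check on the temperature or box size series should show whether they really lag LJ(SR) and the pressure by a few steps.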

I attach a set of 16 tpr files that were used with gmx461. Note that you do not need to run all 16 replicas!
I used a command like:
<path-to>/mdrun -pd -replex 500 -multi 16
The crash happens at the first exchange (500 steps * 0.02 ps = 10 ps). I guess this can be sped up.
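
The arithmetic behind that number, as a small Python sketch. The helper name is hypothetical, and the second call uses a made-up -replex value of 50 just to show how a shorter interval would move the first exchange earlier. Whether the crash still reproduces with a shorter interval is an assumption that has not been tested.

```python
def first_exchange_time_ps(replex_steps, dt_fs):
    """Simulated time (ps) at which the first exchange attempt happens.

    replex_steps -- exchange interval in steps (the mdrun -replex value)
    dt_fs        -- integration time step in femtoseconds
    """
    return replex_steps * dt_fs / 1000.0  # 1 ps = 1000 fs


# Settings from the report: -replex 500 with a 20 fs time step.
reported = first_exchange_time_ps(500, 20)
# Hypothetical shorter interval, not tested on the attached tpr files.
shorter = first_exchange_time_ps(50, 20)
print(reported, shorter)
```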

Any help would be greatly appreciated, and I can of course participate as much as my coding abilities allow :)).

Thanks for your time,