Possible bug with domain decomposition in rerun
Created an attachment (id=388)
.mdp file, git diff, gmxcheck output, mdrun .log files
Having seen that Bugzilla is in fact working (I'd thought it down for months like the rest of the website when I was unable to log in some time), I thought I'd revise and re-post my gmx-developers email. The below should supersede that post, though it agrees in all essentials with that post.
While doing some other things, I ran across some inexplicable behaviour that prompted me to test 2D vs 3D domain decomposition of reruns under otherwise identical conditions.Using
- either HEAD of branch release-4-0-patches, or the commit on that branch whose message states it is 4.0.5,
- on a system of a 24-residue CHARMM peptide in an 8.5nm cube of water,
- using 64 Intel Harpertown cores on an SGI Altix XE 320 or 64 BlueGene/L nodes
- with 16 PME-only nodes (rlist=rvdw=rcoulomb=1.3, ewald_rtol 1e-5, fourier_n[xyz]=40/48/48)
- to rerun over 97 frames
- in single precision
I found that optimize_ncells in src/mdlib/domdec_setup.c narrowly preferred an 8x3x2 DD grid over 8x6x1 and others (judged from the debug output of comm_cost_est).
If I modified the code to enforce the 2D-DD grid 8x6x1 (by setting nc[*] suitably after the recursive assign_factors returns; git diff attached) and compared with either the naturally-selected 8x3x2 or that grid as an enforced choice, I got inconsistent results. There were significantly different energies (but some frames matched various of the components), and forces that were mostly noticeably different between corresponding frames, and rarely in close agreement. The first frame always looked correct - identical energies and virial. Output from gmxcheck -e -e2 is attached. Note that LJ and Coul in both SR and 14 forms all differ for some frames, but there is no obvious pattern except that Coul.(recip) and bonded interactions never appear. There are some frighteningly large virial components.
4.0.5 gmxcheck doesn't seem to actually compare forces regardless of -tol, but I had my own tool to hand that does let me do this, and a textual diff of gmxdump output of the trajectory forces agrees that substantial differences exist. Units of quantities that follow are all GROMACS standard kJ/mol/nm, of course. "Component error" is the difference between the corresponding component of the forces from the corresponding frame from two trajectories. "Frame relative error" is the RMS difference over all component errors from a frame, divided by the RMS value of all components from that frame from one of the trajectories.
The maximum component error varied over the rerun trajectory frames from zero to several hundred, with frame relative error from 5e-5 (about as low as you might expect for single precision) to about 0.015. The first frame seemed to be a correct outlier with relative error of 4e-7 with maximum component error of 0.02. It seems implausible that some frames' relative error should be nearly as large as the first frame's maximum component error, and so I think a bug must exist.
To confirm, I tested further by enforcing 4x4x3 DD. Compared with 8x3x2, the relative error was always less than 3e-7 and maximum component error about 0.003. This looks like the quality of correspondence I think should exist at single precision when the only variable is the nature of the DD. Comparison of .edr files with gmxcheck showed only the expected minor differences in virial and pressure on some frames.
I've tried generic, CPU-optimized and solvent-optimized kernels with no effect on the general nature of the above observations. Two different architectures also showed these observations.
My hypothesis is that the coordinates of reruns under 2D DD are going awry when the DD state does not match the new frame. Unfortunately the DD code is pretty impenetrable for the uninitiated, so I won't explore further.
I've attached a sample git diff for my DD grid-enforcing procedure, my .mdp/.tpr/rerun .xtc (common to all runs), a sample gmxcheck -e -e2 run output, and the three abovementioned sets of .log/.edr/.trr files.
My mdrun commands were always
mpirun -np 64 ~/progs/bin/mdrun_mpi_405 -npme 16 -rerun 1oei -deffnm 1oei_04_40-40_48-48_48-48 -dlb yes -debug 2
where mdrun_mpi_405 was a version with the DD grid choice enforced or not, per the above, and as can be seen in the .log files attached.
#1 Updated by Mark Abraham about 10 years ago
After reviewing old bugs I've submitted, it occurred to me that this trajectory was generated with 3.3.1 as was the one in bug #271 here http://bugzilla.gromacs.org/show_bug.cgi?id=271. Berk's theory about the assumption of localized charge groups might explain what's happening here, since the first frame works, and all subsequent ones don't (spaced by 100ps).
#2 Updated by Berk Hess about 10 years ago
I can not easily reproduce this.
I get only difference in the last decimal of the energies
when rerunning with 2D DD a trajectory generated since cpu.
Could you attach the tpr file and a few trr frames that produce
Why did you hardcode the DD grid dimensions?
You could have simply uses the mdrun -dd option, right?
I had already fixed the gmxdump force output option for 4.0.6 and git master.
#3 Updated by Mark Abraham about 10 years ago
Created an attachment (id=389)
.tpr and partial .trr requested by Berk
Here's the first 10 frames of the 3.3.1 trajectory, converted with a 4.0.5 trjconv, and the 4.0.5 .tpr I was using to rerun. I tried to load them earlier, but with some other files with them, the .tar was above 50MB and I didn't have time to try again.
Tomorrow when I have some time I'll see if I can reproduce the behaviour with a trajectory produced from 4.0.5, but my suspicion is that I won't be able to.
I managed to miss the fact that mdrun -dd would have done my DD-constraints without the need to modify code - thanks for pointing that out. I'm glad the trjconv issue is fixed for the future.
#4 Updated by Berk Hess about 10 years ago
This cost me some time do debug.
The cause of this problem is similar to a bug that I fixed recently
(but that I could not find now in bugzilla).
You have a tpr file with a static box (no pressure coupling),
whereas your xtc file has a changing box.
If fixed this problem in most of the code recently,
except for one point in the domain decomposition.
If a now considering if I should fix this, or if I should simply
disallow rerunning with a tpr without pressure coupling and
an xtc file with changing.
#5 Updated by Berk Hess about 10 years ago
It was actually not in the domain decomposition.
The only thing that was correct was 3D domain decomposition,
even serial was incorrect.
I have fixed it in git master.
For 4.0 you'll have to provide a tpr file with pressure coupling
turned on to get correct results.