Bug #1848

Segfault with replica exchange at first successful exchange

Added by James Barnett about 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Mark Abraham
Category:
mdrun
Target version:
5.1.2
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

When I run a replica exchange simulation I get a segmentation fault right at the first successful exchange. I've been troubleshooting with 2 replicas. Their tpr files are attached.

cf. mailing list discussion:
https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2015-October/101724.html

I originally thought it didn't affect single-node replica exchange simulations, but it does.

test0.tpr (568 KB) test0.tpr James Barnett, 10/30/2015 08:22 PM
test1.tpr (568 KB) test1.tpr James Barnett, 10/30/2015 08:22 PM
gmx1.debug (6.67 MB) gmx1.debug James Barnett, 11/20/2015 10:55 PM
gmx0.debug (6.77 MB) gmx0.debug James Barnett, 11/20/2015 10:55 PM
md_1.log (49.8 KB) md_1.log James Barnett, 11/21/2015 04:48 PM
md_0.log (50.1 KB) md_0.log James Barnett, 11/21/2015 04:48 PM
mda_0.log (46 KB) mda_0.log James Barnett, 11/21/2015 08:43 PM
mda_1.log (46.1 KB) mda_1.log James Barnett, 11/21/2015 08:43 PM
mdb_1.log (23.2 KB) mdb_1.log James Barnett, 11/21/2015 08:43 PM
mdb_0.log (23.2 KB) mdb_0.log James Barnett, 11/21/2015 08:43 PM
test-5.0.6_0.tpr (568 KB) test-5.0.6_0.tpr James Barnett, 11/24/2015 02:33 PM
test-5.0.6_1.tpr (568 KB) test-5.0.6_1.tpr James Barnett, 11/24/2015 02:33 PM
gmx0b.debug (3.91 KB) gmx0b.debug James Barnett, 11/24/2015 07:12 PM
gmx1b.debug (3.93 KB) gmx1b.debug James Barnett, 11/24/2015 07:12 PM
test_0b.log (23.2 KB) test_0b.log James Barnett, 11/24/2015 07:26 PM
test_1b.log (24.2 KB) test_1b.log James Barnett, 11/24/2015 07:26 PM

Related issues

Related to GROMACS - Bug #1858: compute globals should not have logic about which integrator is in use (Closed)

Associated revisions

Revision 318ed3bb (diff)
Added by Mark Abraham about 4 years ago

Fix error in multi-sim communication

I lost the [0] in converting the old code, and there was not enough
testing coverage to find it.

Fixes #1848

Change-Id: Ifb9ffaf5a525537231b1f4e848bba3ef0873a077

History

#1 Updated by James Barnett about 4 years ago

A few more details:

I ran a longer equilibration in NPT as Mark suggested in the thread. Still the same behavior. NVT also gives the same behavior. Nothing strange or very large in the energy files.

#2 Updated by James Barnett about 4 years ago

This bug doesn't seem to be present in 5.0.6. I still had that version installed, and in a quick test I got a successful exchange with no problem or crash.

#3 Updated by Justin Lemkul about 4 years ago

  • Category set to mdrun

Useful information posted to gmx-users:

https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2015-November/102110.html

[67] 0x00000000007a10bd in add_binr (b=0x25f11c0, nr=9, r=0x0) at
/home/wmason/gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
[4-7,12-15,20-31,60-67] 94 rbuf[i] = r[i];

0x0000000000725758 in global_stat (fplog=0x3543750, gs=0x3639420, cr=0x3536fa0,
enerd=0x36398b0, fvir=0x0, svir=0x0, mu_tot=0x7fff6a2dfe6c, inputrec=0x35424f0,
ekind=0x3635990, constr=0x363b920, vcm=0x0, nsig=0, sig=0x0,
top_global=0x3541860, state_local=0x363a090, bSumEkinhOld=0, flags=146) at
/home/wmason/gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229

0x000000000073efcd in compute_globals (fplog=0x3543750, gstat=0x3639420,
cr=0x3536fa0, ir=0x35424f0, fr=0x3599df0, ekind=0x3635990, state=0x363a090,
state_global=0x3543270, mdatoms=0x35cc860, nrnb=0x3599a20, vcm=0x3623b40,
wcycle=0x3599340, enerd=0x36398b0, force_vir=0x0, shake_vir=0x0, total_vir=0x0,
pres=0x0, mu_tot=0x7fff6a2dfe6c, constr=0x363b920, gs=0x0, bInterSimGS=0,
box=0x363a0b0, top_global=0x3541860, bSumEkinhOld=0x7fff6a2dff10, flags=146) at
/home/wmason/gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342

0x00000000004c3dca in do_md (fplog=0x0, cr=0x24a7fb0, nfile=35,
fnm=0x7fffa95b48a8, oenv=0x24b46e0, bVerbose=0, bCompact=1, nstglobalcomm=20,
vsite=0x252d890, constr=0x25e9a80, stepout=100, ir=0x24b2430,
top_global=0x24b4760, fcd=0x24ea8b0, state_global=0x24b31c0, mdatoms=0x252d980,
nrnb=0x24fa9b0, wcycle=0x24fa1b0, ed=0x0, fr=0x24fad80, repl_ex_nst=500,
repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1,
imdport=8888, Flags=1055744, walltime_accounting=0x2584a20) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/md.cpp:969

0x00000000004d4a64 in mdrunner (hw_opt=0x7fff6a2e1d58, fplog=0x3543750,
cr=0x3536fa0, nfile=35, fnm=0x7fff6a2e15f8, oenv=0x35436d0, bVerbose=0,
bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff6a2e11bc, dd_node_order=1, rdd=0,
rconstr=0, dddlb_opt=0x1f3b10c "auto", dlb_scale=0.800000012, ddcsx=0x0,
ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x1f3b10c "auto", nstlist_cmdline=0,
nsteps_cmdline=-2, nstepout=100, resetstep=-1, nmultisim=40, repl_ex_nst=500,
repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1,
imdport=8888, Flags=1055744) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270

0x00000000004cb637 in gmx_mdrun (argc=15, argv=0x3531c20) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537

0x000000000050b26b in gmx::CommandLineModuleManager::runAsMainCMain (argc=15,
argv=0x7fffa95b5aa8, mainFunction=0x4c8d73 <gmx_mdrun(int, char**)>) at
/home/wmason/gromacs-5.1.1/src/gromacs/commandline/cmdlinemodulemanager.cpp:588

0x00000000004ba316 in main (argc=15, argv=0x7fff6a2e27f8) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun_main.cpp:43

The error is a classic segmentation fault, caused by accessing an array out of
bounds. It's a bug in the GROMACS 5.1.1 code. You will need to file a bug
report with GROMACS; they will need your job input to reproduce the error,
plus the backtrace info above, which I've summarized below in case you just
want to read the code yourself:

"add_rbin" in gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
called from "global_stat" in gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
called from "compute_globals" in
gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
called from "do_md" in /gromacs-5.1.1/src/programs/mdrun/md.cpp:969
called from "mdrunner" in gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
called from "gmx_mdrun" in gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537

The failing code takes a "bin" of the "gmx_global_stat" type, which holds an
array of doubles and the size of that array, and tries to copy into it data
from a "tensor" holding the force virial, during a step in which energy is
computed globally.

I don't have any idea what this means beyond what the low-level code shows
(I read the source code while debugging so I could try to explain what is
happening). There are multiple possible causes for this error, such as:
A. The memory allocation for the "bin" fails quietly--the array is not resized
(or not the right size), and then the error occurs at the next function which
tries to write data there.
B. The tensor is actually not the size "DIM*DIM" (3x3 if I'm reading correctly)
that the function expects. Accessing the source tensor array out-of-bounds also
generates this error.

C. The tensor is actually a NULL pointer. This is the most likely explanation,
which one can see from the line:
add_binr (b=0x25f11c0, nr=9, r=0x0)
^-- Either r=NULL or the debugger is not reporting the value correctly.

This would mean the program is calling "do_md" from "mdrunner" with bad
parameters and not checking its parameters for errors. The error actually
occurs at a higher level of the code, rather than the low level where the error
is reported.
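
To make cause C concrete, the failing pattern can be sketched as follows (a
simplified illustration with made-up names, not the actual rbin.c code):

/* Sketch of the failing pattern (illustrative only).  add_binr() copies
 * nr values from the source array r into the reduction bin's buffer;
 * if a caller hands it r = NULL (cause C above), the copy loop
 * dereferences a null pointer and segfaults. */
#include <stddef.h>

typedef double real;   /* GROMACS "real" is float or double */

static int add_binr_sketch(double *rbuf, int nr, const real r[])
{
    if (r == NULL)     /* a defensive check here would turn the crash
                        * into a detectable error instead */
    {
        return -1;
    }
    for (int i = 0; i < nr; i++)
    {
        rbuf[i] = r[i];    /* the line the backtrace points at (rbin.c:94) */
    }
    return nr;
}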

#4 Updated by Mark Abraham about 4 years ago

  • Status changed from New to Accepted
  • Assignee set to Mark Abraham

I do not observe the bug with James's .tpr inputs, but the stack trace from Krysztof's sysadmin is clear enough.

This is indeed a bug in mdrun. In 488464e7 I removed an unrelated feature whose implementation was inadvertently making this work. The fully correct fix is not obvious - the compute_globals function is called in dozens of different ways, and simplifying this was part of the reason for removing this feature. :-(

A hack fix that I think will work is to change line 228 of stat.cpp from
if (bPres || !bVV)
to
if ((bPres || !bVV) && fvir != NULL)

If people could try that and let me know how they go, that might help me decide on a fix. Thanks, and sorry!
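
Spelled out in context, the guarded branch would look roughly like this (a
sketch only, assuming the GROMACS declarations of t_bin, tensor, gmx_bool,
DIM and add_binr(); the variable names are placeholders and the real
global_stat() code differs):

/* Sketch of the proposed guard (not the actual stat.cpp source): only
 * register the force virial for reduction when the caller actually
 * supplied a tensor, so a NULL fvir can no longer reach add_binr(). */
static void register_force_virial(t_bin *rb, gmx_bool bPres, gmx_bool bVV,
                                  tensor fvir, int *ifv)
{
    if ((bPres || !bVV) && fvir != NULL)
    {
        *ifv = add_binr(rb, DIM*DIM, fvir[0]);  /* fvir[0]: 3x3 tensor as a flat real[9] */
    }
    else
    {
        *ifv = -1;   /* nothing registered; downstream code must handle this */
    }
}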

#5 Updated by Mark Abraham about 4 years ago

The good news is that as far as I understand right now, any run that didn't crash is fine.

#6 Updated by James Barnett about 4 years ago

Hey Mark, I made that change to 5.1.1, recompiled, and installed it as a test installation. Unfortunately it still crashes at the first successful exchange for me.

In our exchange on the mailing list I gave the final output of mdrun's log, but I don't think I ever gave a debug log. On the off chance that these are two different bugs, I've attached the two debug logs corresponding to a run using the two .tpr files I previously attached.

#7 Updated by Mark Abraham about 4 years ago

Thanks - unfortunately that doesn't tell me anything new, though I now suspect we have two different issues going on. Can you share the md.log files from a failing 5.1.1 (with or without my "fix"), please, James?

#8 Updated by James Barnett about 4 years ago

Here are the md.log files.

#9 Updated by Mark Abraham about 4 years ago

James Barnett wrote:

Here are the md.log files.

OK, that sheds light - James's issue is quite distinct.

First, your GPU setup is somehow broken, as reported in the md_0.log, so the run is falling back to the CPU only. (Arguably, mdrun should have aborted if -gpu_id was used but GPU setup failed, but that's a side issue.)

Second, the automated Verlet-scheme setup sizes its pair-list buffer based on the temperature and the lifetime of the neighbor list (nstlist). The temperature obviously differs between T-REMD replicas, so in principle one would want buffers of different sizes. mdrun tries a few candidate values for nstlist and checks whether an rlist satisfying the temperature, tolerance and nstlist would fit in the domain and be acceptably efficient. James was using 10 ranks per replica, and I can see in his log files that replica 0 decided that nstlist 20 would work, but replica 1 had to settle for the original nstlist 10.

I can't say offhand what might break when replicas have different nstlist values, but we definitely have not designed for it, so we should probably take steps to make the tuning aware of replica exchange. I'll try James's .tpr files at higher parallelism and see what I can find.

If James uses fewer ranks per replica, that probably leads to a uniform nstlist across the replicas, which will work. My test runs were probably of that type, which is probably why I didn't see a problem. With just 15K atoms per replica, I suspect that a single domain and a single GPU per replica will be faster than one domain per core.
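
For illustration, one way to make the tuning aware of replica exchange would
be to agree on a common nstlist across the simulations before the run proper
starts. A rough sketch using MPI directly (hypothetical helper, not GROMACS
code):

/* Rough sketch (not GROMACS code): have all replicas in a multi-sim run
 * settle on one nstlist by taking the most conservative candidate, i.e.
 * the smallest value any replica's tuning chose, across the master
 * ranks of all simulations. */
#include <mpi.h>

int agree_on_nstlist(int my_tuned_nstlist, MPI_Comm masters_comm)
{
    int common_nstlist = my_tuned_nstlist;

    /* A smaller nstlist only means more frequent list rebuilds, so the
     * minimum over all replicas is safe for every replica. */
    MPI_Allreduce(&my_tuned_nstlist, &common_nstlist, 1, MPI_INT,
                  MPI_MIN, masters_comm);
    return common_nstlist;
}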

#10 Updated by James Barnett about 4 years ago

The GPU issue is something I introduced when I built and installed 5.1.1 with a newer CUDA version on this system (7.0). Rebuilding with the version I was using before with 5.1 (6.5), the GPU detection issue goes away. I'm guessing either 7.0 is not installed properly on the system or I'm just not building GROMACS correctly with it.

Using fewer MPI ranks does indeed fix the issue, but nstlist is still different for the two replicas.

To eliminate the GPU problem as a factor, I did a reinstall and reran a short simulation. I've attached the md.log files for a run with more MPI ranks, which crashes (the mda_*.log files), and md.log files for a run with fewer MPI ranks, which does not crash (the mdb_*.log files). In both cases it looks like nstlist differs across the replicas.

#11 Updated by Mark Abraham about 4 years ago

OK. nstlist can be set unilaterally with gmx mdrun -nstlist. What happens when you set it to whichever of 10 or 20 seems to work? At high and low parallelism?

#12 Updated by James Barnett about 4 years ago

Setting nstlist with mdrun's option still results in the simulation with replicas using 10 ranks crashing, and the simulation with replicas using 1 rank running correctly. In both cases the md.log shows both replicas are using the correct nstlist I'm specifying on the command line.

I also decided to test a simulation using 2 ranks per replica. It crashes just like the simulation with 10 ranks per replica. The only REMD simulations that don't crash are the ones using 1 rank per replica.

#13 Updated by Mark Abraham about 4 years ago

Hmm, that's unexpected. I'll poke harder with the debugger on Monday.

#14 Updated by Mark Abraham about 4 years ago

  • Related to Bug #1858: compute globals should not have logic about which integrator is in use added

#15 Updated by Mark Abraham about 4 years ago

  • Status changed from Accepted to In Progress

Mark Abraham wrote:

I do not observe the bug with James's .tpr inputs, but the stack trace from Krysztof's sysadmin is clear enough.

This is indeed a bug in mdrun. In 488464e7 I removed an unrelated feature whose implementation was inadvertently making this work. The fully correct fix is not obvious - the compute_globals function is called in dozens of different ways, and simplifying this was part of the reason for removing this feature. :-(

A hack fix that I think will work is to change line 228 of stat.cpp from
if (bPres || !bVV)
to
if ((bPres || !bVV) && fvir != NULL)

If people could try that and let me know how they go, that might help me decide on a fix. Thanks, and sorry!

This is now fixed, per #1858 (and its patch is linked in Gerrit).

James's issue is definitely different. WIP.

#16 Updated by Mark Abraham about 4 years ago

@James, this looks like a bug that has been in mdrun for a while, but was hidden by the way we do over-allocation of buffers. Can you share equivalent .tpr files generated with 5.0.x, please? That would help us track where it started and be confident about the fix.
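
To illustrate the hiding effect: with over-allocated buffers, an out-of-bounds
access can land in memory the process still owns, so nothing crashes and the
bug goes unnoticed until the allocation is exact. A standalone toy example
(not GROMACS code):

/* Toy example of how over-allocation can mask an out-of-bounds access;
 * illustrative only, not GROMACS code. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int     n_used  = 9;            /* elements the code logically needs */
    int     n_alloc = 2 * n_used;   /* buffer grown with a safety factor */
    double *buf     = calloc(n_alloc, sizeof(*buf));

    /* Buggy read one past the logical end: with the over-allocated buffer
     * this quietly returns 0.0; with an exactly sized buffer it is
     * undefined behaviour and can segfault, as in this report. */
    printf("%g\n", buf[n_used]);

    free(buf);
    return 0;
}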

#17 Updated by James Barnett about 4 years ago

Attached are the .tpr files from a 5.0.6 run.

#18 Updated by Gerrit Code Review Bot about 4 years ago

Gerrit received a related patchset '2' for Issue #1848.
Uploader: Mark Abraham
Change-Id: Ifb9ffaf5a525537231b1f4e848bba3ef0873a077
Gerrit URL: https://gerrit.gromacs.org/5376

#19 Updated by Mark Abraham about 4 years ago

  • Status changed from In Progress to Fix uploaded

Thanks James. There is an issue with over-allocation not being robust, but it does not seem to be the cause of the problem here. My latest patch works for me on your issue - please test and let me know how you go!

#20 Updated by James Barnett about 4 years ago

Unfortunately it still crashes at the first successful exchange with the patch. However, the debug log does seem to be giving different/more output this time. I've attached the last 100 lines of the debug logs after patching.

#21 Updated by Mark Abraham about 4 years ago

Normal log files are much more useful, please.

#23 Updated by Mark Abraham about 4 years ago

Yes, I observed a crash also, but code with both bug fixes does multiple exchanges for me. Try https://gerrit.gromacs.org/#/c/5371/5, which I've now rebased to include both fixes.

#24 Updated by James Barnett about 4 years ago

Great! When I apply both patches the crashes no longer occur.

#25 Updated by Mark Abraham about 4 years ago

  • Status changed from Fix uploaded to Resolved

#26 Updated by Mark Abraham about 4 years ago

  • Status changed from Resolved to Closed
  • Target version set to 5.1.2
