Bug #432

GB All vs All Crashes Stochastically on Celeron / Atom Win32 Systems

Added by Kyle Beauchamp over 9 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Created an attachment (id=467)
AllAll TPR (Fails)

Implicit solvent (GB) can crash mdrun on certain processors running Windows. This one is a pain to reproduce: it only occurs on Win32 (XP SP3) systems with specific processors (Celeron, Atom). Even on affected systems it occurs intermittently, so it is hard to say definitively whether it has been fixed.

Basically, mdrun crashes during the first force evaluation in the main MD while loop. I manually traced the error to the function do_nonbonded. I then built an mdrun with debug symbols, and the WinDbg dump pointed to mdrun!genborn_allvsall_calc_chainrule+0x18f.

This will be hard to narrow down, as my build environment uses Visual Studio 2008 + Intel ICC + MKL. I tried to reproduce the bug on the same netbook using a 32-bit Linux build (gcc + Ubuntu 10.04), but could not trigger it. It really seems strongly linked to the platform + processor combination.

I've attached two nearly identical tpr files: one uses a 100 nm cutoff, while the other sets all cutoffs to 0. The cutoff tpr runs fine, while the AllVAll tpr crashes. Here's a crash dump from WinDbg:

(e10.ac0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=01d6b920 ebx=01d3cd50 ecx=01d11860 edx=01eb0eb8 esi=00000002 edi=0000023a
eip=00540d1f esp=018cd2b0 ebp=018cd3c0 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
*** WARNING: Unable to verify checksum for mdrun.exe
mdrun!genborn_allvsall_calc_chainrule+0x18f:
00540d1f 0f590cb3 mulps xmm1,xmmword ptr [ebx+esi*4]
ds:0023:01d3cd58=3fe447c23fec4bf43fe75e5f3fd779d0
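
One thing that can be checked directly from the dump is the address of the faulting operand. The sketch below (Python, purely illustrative; the helper names are made up for this note and are not part of GROMACS or WinDbg) recomputes [ebx+esi*4] from the register values above and tests it against the 16-byte alignment that the non-VEX mulps instruction requires for a memory operand. It is only an observation about the dump, not a confirmed diagnosis of the crash.

```python
# Register values copied from the WinDbg dump above.
EBX = 0x01D3CD50
ESI = 0x00000002


def effective_address(base: int, index: int, scale: int) -> int:
    """Address of an x86 [base + index*scale] operand, with 32-bit wraparound."""
    return (base + index * scale) & 0xFFFFFFFF


def is_aligned(address: int, boundary: int = 16) -> bool:
    """True if address is a multiple of boundary (16 bytes for an xmmword)."""
    return address % boundary == 0


# Operand of: mulps xmm1, xmmword ptr [ebx+esi*4]
addr = effective_address(EBX, ESI, 4)

print(f"operand address: {addr:#010x}")
print(f"16-byte aligned: {is_aligned(addr)}")
print(f" 8-byte aligned: {is_aligned(addr, 8)}")
```

The computed address matches the ds:0023:01d3cd58 shown in the dump. It is 8-byte aligned but not 16-byte aligned, which may be worth keeping in mind when looking at how the buffers used by genborn_allvsall_calc_chainrule are allocated.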

If you want to reproduce this bug, you may need to find a netbook that runs 32-bit Windows XP on an Atom N270 processor. I reproduced the bug on an Acer Aspire One (WinXP Home SP3, N270, 1 GB), but I've also seen it on an ASUS netbook. You'll also need to build using MKL (10.2.4.032) and ICC (11.0/066). I've also heard reports from users with Celeron systems, but I haven't been able to verify those firsthand.

This bug has been observed using source from git as recent as June 10, 2010. I suspect it has been present since April or earlier.

topol.allall.tpr (193 KB) AllAll TPR (Fails) Kyle Beauchamp, 06/11/2010 05:27 AM
topol.cutoff.tpr (193 KB) Cutoff TPR (Runs) Kyle Beauchamp, 06/11/2010 05:28 AM

History

#1 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=468)
Cutoff TPR (Runs)

#2 Updated by Kyle Beauchamp over 9 years ago

One last note: on Win32 systems with other processors (AMD, Core, P4, etc.), the same build (and the same AllVAll tpr file) ran stable dynamics for hundreds of ns.

#3 Updated by Kyle Beauchamp over 9 years ago

Another tidbit: I've encountered the bug both with gmx_acceleration=SSE and gmx_acceleration=None.

#4 Updated by David van der Spoel over 9 years ago

Could you please try the latest git? I have patched some uninitialized variables and a loop counter bug.

#5 Updated by Erik Lindahl over 9 years ago

Hi,

I've fixed a possible memory error in the single precision all-vs-all code. I don't have any system where I can test win32 all-vs-all, but this might have been fixed. I'll close the bug for now, but please reopen it if crashes still occur.

#6 Updated by Kyle Beauchamp over 9 years ago

Thanks,

I'll check on this later today, after I return home to grab my netbook.
