GB All vs All Crashes Stochastically on Celeron / Atom Win32 Systems
Created an attachment (id=467)
AllAll TPR (Fails)
Implicit solvent (GB) simulations can crash mdrun on certain processors running Windows. This one is a pain to reproduce: it only occurs on Win32 (XP SP3) systems with specific processors (Celeron, Atom). Even on affected systems the crash is random, so it is hard to say definitively whether it has been fixed.
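Because the crash is random, a single clean run says little about whether a fix worked. Below is a minimal sketch, not part of any GROMACS tooling, of how one might count crashes over repeated runs and estimate how many consecutive clean runs are needed before calling it fixed. The helper names, the 25% crash rate and the 95% confidence level are all assumptions for illustration; the estimate also assumes runs are independent with a constant crash probability.

```python
import math
import subprocess


def count_crashes(cmd, trials):
    """Run cmd `trials` times and count the runs that exit with a nonzero status.

    An access violation on Windows ends the process with a nonzero exit
    code, so it is counted here like any other failure.
    """
    crashes = 0
    for _ in range(trials):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            crashes += 1
    return crashes


def clean_runs_needed(crash_rate, confidence=0.95):
    """Consecutive clean runs needed to rule out the bug at `confidence`.

    If the bug were still present with per-run crash probability
    `crash_rate`, n clean runs in a row happen with probability
    (1 - crash_rate)**n. This returns the smallest n for which that
    probability drops to 1 - confidence or below.
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


# Assumed crash rate of 25% per run; replace with a measured value.
print(clean_runs_needed(0.25))
```

On the affected netbook, the crash rate could first be measured with something like count_crashes(["mdrun", "-s", "allvsall.tpr"], trials=20), where allvsall.tpr is a placeholder name for the failing tpr.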
Basically, mdrun crashes during the first force evaluation in the main MD while loop. I manually traced the error to the function do_nonbonded. I then built an mdrun with debug symbols, and the WinDbg dump pointed to mdrun!genborn_allvsall_calc_chainrule+0x18f.
This will be hard to narrow down, as my build environment uses Visual Studio 2008 + Intel ICC + MKL. I tried to reproduce the bug on the same netbook using a 32-bit Linux build (gcc on Ubuntu 10.04), but without success. It seems strongly linked to the combination of platform and processor.
I've attached two nearly identical tpr files: one uses a 100 nm cutoff, while the other sets all cutoffs to 0. The cutoff tpr runs fine, while the all-vs-all tpr dies. Here's a crash dump from WinDbg:
(e10.ac0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=01d6b920 ebx=01d3cd50 ecx=01d11860 edx=01eb0eb8 esi=00000002 edi=0000023a
eip=00540d1f esp=018cd2b0 ebp=018cd3c0 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
- WARNING: Unable to verify checksum for mdrun.exe
00540d1f 0f590cb3 mulps xmm1,xmmword ptr [ebx+esi*4]
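One reading of the dump above, offered as an observation rather than a confirmed diagnosis: the faulting instruction is a legacy SSE mulps with a 128-bit memory operand, and that form faults when the operand address is not 16-byte aligned. The sketch below computes the operand address [ebx+esi*4] from the register values in the dump and checks its alignment. The function names are mine, for illustration only.

```python
# Register values copied from the WinDbg dump above.
EBX = 0x01D3CD50
ESI = 0x00000002

# Legacy SSE instructions such as mulps require 16-byte alignment
# for a 128-bit memory operand.
SSE_ALIGNMENT = 16


def effective_address(base, index, scale):
    """Address of an x86 operand [base + index*scale], wrapped to 32 bits."""
    return (base + index * scale) & 0xFFFFFFFF


def misalignment(addr, alignment=SSE_ALIGNMENT):
    """Byte offset of addr past the previous aligned boundary (0 if aligned)."""
    return addr % alignment


addr = effective_address(EBX, ESI, 4)
print(f"operand address: 0x{addr:08x}")
print(f"bytes past a 16-byte boundary: {misalignment(addr)}")
```

With these register values the operand address is 0x01d3cd58, which is 8 bytes past a 16-byte boundary. That would be consistent with an aligned SSE access into an array that is not 16-byte aligned at that index, but I have not checked it against the source of genborn_allvsall_calc_chainrule, and it does not by itself explain why only some processors are affected.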
If you want to reproduce this bug, you may need to find a netbook that runs 32-bit Windows XP on an Atom N270 processor. I reproduced the bug on an Acer Aspire One (WinXP Home SP3, N270, 1 GB), but I've also seen it on an ASUS netbook. You'll also need to build using MKL (10.2.4.032) and ICC (11.0/066). I've also heard reports from users with Celeron systems, but I haven't been able to verify those firsthand.
This bug has been observed using source from git as recent as June 10, 2010. I suspect it has been present since April or earlier.