Bug #437

mdrun Crashes with GB+SSE under windows

Added by Kyle Beauchamp over 9 years ago. Updated about 9 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

This bug occurs under Windows XP (run using VMware Player). I'm currently working with the Git snapshot taken on June 12, 2010.

My topol.tpr file runs fine on linux machines. If I try to run the simulation on windows, however, it dies. If I examine the md.log of a failed run, the LJ energy terms appear huge (see below). If I build with gmx_acceleration=None (instead of SSE), the simulations run fine (with normal energies). If I comment out the GBSA lines in the MDP file, things also run fine.

My build was done with Visual studio 2008+MKL+ICC.

I'm currently looking into avoiding ICC to see if it has any effect.

I'm attaching a broken tpr file. Below are the energies from a bad simulation run. I'm also including the printout from a mini-dump--careful, however, as windows dumps don't always point to the cause of a crash.

BAD ENERGIES:

Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
   4.93700e+002   1.36756e+003   9.58289e+001   1.49635e+003  -8.42394e+003
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
   0.00000e+000   0.00000e+000   5.69317e+002   7.67128e+003   3.74253e+009
   Coulomb (SR)      Potential    Kinetic En.   Total Energy    Temperature
  -7.18576e+003   3.74253e+009   9.11819e+013   9.11857e+013   1.53058e+013
 Pressure (bar)
   0.00000e+000
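
A quick way to spot a blown-up term in a block like the one above is to parse it and flag anything implausibly large. The following is only a sketch in Python. It assumes that the header names in md.log sit in 15-character columns (so names with spaces such as "LJ (SR)" can be recovered) and that values are whitespace-separated. The function names and the 1e6 limit are made up for illustration and are not part of GROMACS.

```python
import math

COLUMN_WIDTH = 15  # assumed width of each header column in md.log


def format_row(cells):
    """Right-align cells in fixed-width columns (used to build the sample)."""
    return "".join(f"{cell:>{COLUMN_WIDTH}}" for cell in cells)


# Numbers copied from the BAD ENERGIES block of this report.
SAMPLE = "\n".join([
    format_row(["Bond", "Angle", "Proper Dih.", "Ryckaert-Bell.", "GB 1-2"]),
    format_row(["4.93700e+002", "1.36756e+003", "9.58289e+001",
                "1.49635e+003", "-8.42394e+003"]),
    format_row(["GB 1-3", "GB 1-4", "LJ-14", "Coulomb-14", "LJ (SR)"]),
    format_row(["0.00000e+000", "0.00000e+000", "5.69317e+002",
                "7.67128e+003", "3.74253e+009"]),
    format_row(["Coulomb (SR)", "Potential", "Kinetic En.",
                "Total Energy", "Temperature"]),
    format_row(["-7.18576e+003", "3.74253e+009", "9.11819e+013",
                "9.11857e+013", "1.53058e+013"]),
    format_row(["Pressure (bar)"]),
    format_row(["0.00000e+000"]),
])


def parse_energy_block(text):
    """Turn alternating header/value lines into a {term: value} dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    energies = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        names = [header[i:i + COLUMN_WIDTH].strip()
                 for i in range(0, len(header), COLUMN_WIDTH)]
        names = [name for name in names if name]
        numbers = [float(value) for value in values.split()]
        if len(names) != len(numbers):
            raise ValueError(f"column mismatch in line: {header!r}")
        energies.update(zip(names, numbers))
    return energies


def suspicious_terms(energies, limit=1.0e6):
    """Return terms that are non-finite or larger in magnitude than limit."""
    return {name: value for name, value in energies.items()
            if not math.isfinite(value) or abs(value) > limit}


print(sorted(suspicious_terms(parse_energy_block(SAMPLE))))
```

The "Energies (kJ/mol)" title line is left out of the sample on purpose, since the parser expects strictly alternating header and value lines.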

CRASH DUMP PRINTOUT:

(40c.a9c): Access violation - code c0000005 (!!! second chance !!!)
*** WARNING: Unable to verify checksum for c:\tmp\mdrun.exe
mdrun!nb_kernel400nf_ia32_sse+0x61d:

topol.tpr (193 KB): Broken TPR. Kyle Beauchamp, 06/15/2010 02:56 AM
villin.pdb (47.2 KB): Villin PDB file from ffamber web page (with fixed residue names LYP->LYS). Kyle Beauchamp, 08/04/2010 06:40 PM
md.mdp (683 Bytes): mdp file. Kyle Beauchamp, 08/04/2010 06:42 PM
topol.top (170 KB): top file. Kyle Beauchamp, 08/04/2010 06:54 PM
conf.gro (26.2 KB): conf.gro. Kyle Beauchamp, 08/04/2010 06:54 PM
amber03.tar (250 KB): Amber03 files. Kyle Beauchamp, 08/05/2010 11:02 PM
segv.tpr (211 KB): tpr file that crashes on mac as well. David van der Spoel, 08/06/2010 09:16 AM

History

#1 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=471)
Broken TPR

#2 Updated by Erik Lindahl over 9 years ago

Hi Kyle,

Ouch, sounds like a possible compiler bug. I will look into the code, but if possible these things would accelerate the debugging:

1) Does it crash with a 32-bit Linux build too, or work with a 64-bit Windows one?

2) Does it work with MSVC compiler?

3) Does it work if you disable optimization?

Cheers,

Erik

#3 Updated by Kyle Beauchamp over 9 years ago

Hi,

I'll check these three things tomorrow morning. ETA: 12 hours

Thanks

#4 Updated by Kyle Beauchamp over 9 years ago

1. The TPR runs fine in an Ubuntu 10.04 32-bit VM (via VMware).

#5 Updated by Kyle Beauchamp over 9 years ago

2. I haven't been able to compile with MSVC. I checked with Peter, and he recalled having difficulty doing SSE builds using MSVC. I'm going to move on to #3 for now.

Most of the build errors appear to be located in a single header and of a single type:

3>Y:\fah_windows_build\gromacs-june12\include\gmx_sse2_single.h(2028) : error C2719: 'crf': formal parameter with __declspec(align('16')) won't be aligned

I might try some workarounds suggested by Google, such as consting certain expressions.

#6 Updated by Kyle Beauchamp over 9 years ago

3. I disabled optimization (/Od), but the same crash results.

#7 Updated by Kyle Beauchamp over 9 years ago

Hi,

I've been able to reproduce a "somewhat similar" bug using a MinGW build (run under XP VMware).

The bug is similar in that I get
(820.81c): Access violation - code c0000005 (!!! second chance !!!)

However, the bug appears stochastically. There appears to be some "history" involved, in that once it appears it seems to stay. However, sometimes I can open a shell (cmd.exe) and run mdrun with no problem.

This new "bug" is not specific to the GB topol.tpr, but also occurs in a reaction-field run that I have. Actually, the crash can occur even when I run mdrun.exe with a misnamed tpr file. My windbg dump also points to a wscanf (msvcrt!wscanf+0x6c:)

IMHO, this suggests (at least) two possibilities:

1. My MinGW build is plagued by pointer / buffer problems that are unrelated to the ICC build bug.
2. There are general win32 buffer / pointer errors that appear in a platform-dependent manner, causing unpredictable crashes.

I might play with my Atom netbook, to see if I can come up with useful information.

#8 Updated by David van der Spoel over 9 years ago

Could you please try the latest git? I have patched some uninitialized variables and a loop counter bug.

#9 Updated by Kyle Beauchamp over 9 years ago

Thanks, I will rebuild from GIT.

PS: It looks like I will get around to rebuilding and testing in 1 or 2 days. I'll post again once I've tested things.

#10 Updated by Kyle Beauchamp over 9 years ago

Looks like the problems are still here (building with the July 14 GIT). Just to summarize:

If I build in linux+gcc, simulations run fine (either with or without GBSA).

If I build in windows (single, 32bit, SSE; MSVS+ICC+YASM), simulations explode (instantly) with GBSA enabled. This happens when running in windows XP and under wine. If I change my mdp file to disable GBSA, things run fine.

I don't think I can build with MSVC+SSE (without ICC), as that build environment is still broken (at least for 32 bit).

If someone has simple GBSA test systems, I could try to run those. That might help us reduce this problem to a minimal example. If it is something with my setup, it's likely something subtle, as my tpr files run great under linux (with July 14 GIT). To double check my system, I both re-minimized and tried with and without constraints.

As a further test, I created a steepest descent TPR and ran it on both linux and wine. The results are interesting: in linux, the system properly minimizes. Under wine, I see the same explosion as with my previous MD runs. However, when I examine the resulting confout.gro files, the problem seems to be just a single atom. A methyl proton appears to show an extremely strong attraction to the nearby carbonyl, but only under windows. The "blowing up" then seems to be this proton being yanked towards the carbonyl. However, I don't understand why this proton is being pulled--all of the energetic terms appear to be normal (~1E3), except for the LJ-SR. The LJ-SR is extremely negative (under windows only). (See below for md.log energies.)
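
For anyone who wants to repeat the "which atom moved" check, here is a rough Python sketch that compares two .gro files and reports the atom with the largest displacement. It relies on the fixed-column .gro layout (residue number, residue name, atom name, atom number, then x/y/z in 8-character fields, in nm). The coordinates below are invented purely for illustration; they are not taken from the attached confout.gro files, and the function names are hypothetical.

```python
import math


def make_gro(title, atoms):
    """Build a minimal .gro string from (resnr, resname, atom, nr, x, y, z)."""
    lines = [title, f"{len(atoms):5d}"]
    for resnr, resname, atom, nr, x, y, z in atoms:
        lines.append(f"{resnr:5d}{resname:<5}{atom:>5}{nr:5d}"
                     f"{x:8.3f}{y:8.3f}{z:8.3f}")
    lines.append("   5.00000   5.00000   5.00000")  # box line, not parsed here
    return "\n".join(lines)


def read_gro_atoms(text):
    """Return [(label, (x, y, z)), ...] from fixed-column .gro text."""
    lines = text.splitlines()
    natoms = int(lines[1])
    atoms = []
    for line in lines[2:2 + natoms]:
        label = f"{line[0:5].strip()}{line[5:10].strip()}:{line[10:15].strip()}"
        xyz = tuple(float(line[20 + 8 * k:28 + 8 * k]) for k in range(3))
        atoms.append((label, xyz))
    return atoms


def largest_displacement(gro_a, gro_b):
    """Return (label, distance in nm) of the atom that moved the most."""
    best = None
    for (label, a), (_, b) in zip(read_gro_atoms(gro_a), read_gro_atoms(gro_b)):
        dist = math.dist(a, b)
        if best is None or dist > best[1]:
            best = (label, dist)
    return best


# Hypothetical coordinates: a methyl proton (HB1) pulled towards a carbonyl O.
REFERENCE = make_gro("reference (made-up data)", [
    (1, "ALA", "CB", 1, 1.000, 1.000, 1.000),
    (1, "ALA", "HB1", 2, 1.100, 1.000, 1.000),
    (1, "ALA", "C", 3, 1.300, 1.100, 1.000),
    (1, "ALA", "O", 4, 1.400, 1.200, 1.000),
])
EXPLODED = make_gro("exploded (made-up data)", [
    (1, "ALA", "CB", 1, 1.000, 1.000, 1.000),
    (1, "ALA", "HB1", 2, 1.390, 1.190, 1.000),
    (1, "ALA", "C", 3, 1.300, 1.100, 1.000),
    (1, "ALA", "O", 4, 1.400, 1.200, 1.000),
])

print(largest_displacement(REFERENCE, EXPLODED))
```

With real files, the two strings would simply be the contents of the Linux and Windows confout.gro, read with open(...).read().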

I have two guesses for what is happening: either the LJ-SR term is somehow being improperly modulated by the GBSA scaling, or some register / variable in the LJ-SR calculation is being modified by the GBSA calculation. The second case might better explain why this bug is platform-dependent--this variable might get wiped clean under some compilers (assemblers), while other compilers (assemblers) might fail to clear its value.

Energies (kJ/mol) (WINE)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
   1.61149e+003   1.18637e+003   5.37274e+001   1.44675e+003  -7.55709e+003
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
   0.00000e+000   0.00000e+000   5.81767e+002   7.84009e+003  -8.13357e+010
   Coulomb (SR)      Potential Pressure (bar)
  -5.91848e+003  -8.13357e+010   0.00000e+000

Energies (kJ/mol) (LINUX)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
    8.01382e+01    2.77049e+02    1.14143e+01    1.27071e+03   -1.75343e+03
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
    0.00000e+00    0.00000e+00    4.77830e+02    7.81302e+03   -1.14196e+03
   Coulomb (SR)      Potential Pressure (bar)
   -1.10142e+04   -3.97942e+03    0.00000e+00

Thanks

#11 Updated by Kyle Beauchamp over 9 years ago

Hi,

This bug seems to affect cut-off runs only, not all-vs-all runs: if I change rlist, rvdw, rgbradii, and rcoulomb from 100 to 0, the bug disappears and I get normal energies. See below (energies are after convergence of steepest descent).

CUTOFF LINUX
Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
    8.01342e+01    2.77041e+02    1.14122e+01    1.27069e+03   -1.75341e+03
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
    0.00000e+00    0.00000e+00    4.77825e+02    7.81302e+03   -1.14197e+03
   Coulomb (SR)      Potential Pressure (bar)
   -1.10142e+04   -3.97950e+03    0.00000e+00

CUTOFF WINE
Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
   1.71848e+003   1.18831e+003   5.37274e+001   1.44677e+003  -7.55686e+003
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
   0.00000e+000   0.00000e+000   5.81758e+002   7.84010e+003  -8.62051e+010
   Coulomb (SR)      Potential Pressure (bar)
  -5.91831e+003  -8.62051e+010   0.00000e+000

ALLALL LINUX
Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
    8.02515e+01    2.77250e+02    1.14705e+01    1.27135e+03   -1.75273e+03
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
    0.00000e+00    0.00000e+00    4.77958e+02    7.81289e+03   -1.14149e+03
   Coulomb (SR)      Potential Pressure (bar)
   -1.10139e+04   -3.97695e+03    0.00000e+00

ALLALL WINE
Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
   7.98729e+001   2.76597e+002   1.12736e+001   1.26917e+003  -1.75504e+003
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
   0.00000e+000   0.00000e+000   4.77514e+002   7.81331e+003  -1.14304e+003
   Coulomb (SR)      Potential Pressure (bar)
  -1.10149e+004  -3.98528e+003   0.00000e+000
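
To make the Linux/Wine comparison less eyeball-dependent, the four energy blocks in this comment can be diffed term by term. This is a minimal Python sketch using the numbers posted above. The factor of 100 is an arbitrary threshold chosen for illustration, and the runs are separate minimizations, so small differences between platforms are expected and not flagged.

```python
TERMS = ("Bond", "Angle", "Proper Dih.", "Ryckaert-Bell.", "GB 1-2",
         "GB 1-3", "GB 1-4", "LJ-14", "Coulomb-14", "LJ (SR)",
         "Coulomb (SR)", "Potential")

# Energies (kJ/mol) copied from the blocks in this comment.
CUTOFF_LINUX = dict(zip(TERMS, (
    8.01342e+01, 2.77041e+02, 1.14122e+01, 1.27069e+03, -1.75341e+03,
    0.0, 0.0, 4.77825e+02, 7.81302e+03, -1.14197e+03,
    -1.10142e+04, -3.97950e+03)))
CUTOFF_WINE = dict(zip(TERMS, (
    1.71848e+03, 1.18831e+03, 5.37274e+01, 1.44677e+03, -7.55686e+03,
    0.0, 0.0, 5.81758e+02, 7.84010e+03, -8.62051e+10,
    -5.91831e+03, -8.62051e+10)))
ALLVSALL_LINUX = dict(zip(TERMS, (
    8.02515e+01, 2.77250e+02, 1.14705e+01, 1.27135e+03, -1.75273e+03,
    0.0, 0.0, 4.77958e+02, 7.81289e+03, -1.14149e+03,
    -1.10139e+04, -3.97695e+03)))
ALLVSALL_WINE = dict(zip(TERMS, (
    7.98729e+01, 2.76597e+02, 1.12736e+01, 1.26917e+03, -1.75504e+03,
    0.0, 0.0, 4.77514e+02, 7.81331e+03, -1.14304e+03,
    -1.10149e+04, -3.98528e+03)))


def diverging_terms(reference, candidate, factor=100.0):
    """Return {term: magnitude ratio} for terms off by more than factor."""
    flagged = {}
    for name, ref in reference.items():
        cand = candidate[name]
        if ref == 0.0:
            # A term that is zero in the reference must stay zero.
            if cand != 0.0:
                flagged[name] = float("inf")
            continue
        ratio = abs(cand / ref)
        if ratio > factor or ratio < 1.0 / factor:
            flagged[name] = ratio
    return flagged


print("cut-off:   ", sorted(diverging_terms(CUTOFF_LINUX, CUTOFF_WINE)))
print("all-vs-all:", sorted(diverging_terms(ALLVSALL_LINUX, ALLVSALL_WINE)))
```

With this threshold only LJ (SR), and through it Potential, stand out in the cut-off pair, while the all-vs-all pair shows nothing.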

#12 Updated by Kyle Beauchamp over 9 years ago

So I'm still having the same problems with the July 25 GIT.

To check that my system (tpr) wasn't configured incorrectly, I rebuilt my top file from scratch, using the villin.pdb taken from the ffamber page.

/opt/gromacs/july25/bin/pdb2gmx -f villin.pdb -renum
(select amber03, tip3p)

Then edit the topol.top file to add the line:
#include "./ffamber03_gb.itp"

Then grompp:

/opt/gromacs/july25/bin/grompp -f md.mdp -c conf.gro -p topol.top

The relevant lines of my mdp file are:
nstlist = 0
nstype = simple
pbc = no
rlist = 100

coulombtype = cut-off
epsilon_r = 1
rcoulomb = 100

vdwtype = cut-off
rvdw = 100

tc-grps = system
tau_t = 0.01099
ref_t = 300
gen_vel = yes
gen_temp = 300
constraints = hbonds
constraint_algorithm = shake

gb_epsilon_solvent = 78.3
nstgbradii=1
rgbradii=100
gb_algorithm = OBC
implicit_solvent = GBSA
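
Switching between the cut-off setup above and the all-vs-all setup from comment #11 (all four radii set to 0) is easy to script. The Python sketch below rewrites `key = value` lines in mdp text. It assumes the usual mdp conventions that `;` starts a comment and that `-` and `_` are interchangeable in option names. The helper name is made up, and any trailing comment on a replaced line is dropped.

```python
# A few of the mdp lines quoted above.
MDP_SNIPPET = """\
nstlist = 0
rlist = 100
rcoulomb = 100
rvdw = 100
rgbradii=100
gb_algorithm = OBC
"""

# Settings that gave normal energies in comment #11 (all-vs-all).
ALL_VS_ALL = {"rlist": 0, "rcoulomb": 0, "rvdw": 0, "rgbradii": 0}


def set_mdp_options(mdp_text, overrides):
    """Return mdp text with the given options replaced, or appended if absent.

    Keys in overrides are expected in underscore form (e.g. "gb_algorithm").
    """
    remaining = dict(overrides)
    out = []
    for line in mdp_text.splitlines():
        body = line.split(";", 1)[0]  # ignore anything after a comment marker
        if "=" in body:
            key = body.split("=", 1)[0].strip()
            normalized = key.replace("-", "_")
            if normalized in remaining:
                line = f"{key} = {remaining.pop(normalized)}"
        out.append(line)
    # Options that were not present in the input are appended at the end.
    out.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


print(set_mdp_options(MDP_SNIPPET, ALL_VS_ALL))
```

The output would still need to go through grompp to produce a new tpr.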

#13 Updated by David van der Spoel over 9 years ago

With the gromacs beta 4.5 this now gives a fatal error:

Fatal error:
Domain decomposition does not work with nstlist=0
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

Can you post more input files?

#14 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=503)
Villin PDB file from ffamber web page (with fixed residue names LYP->LYS)

#15 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=504)
mdp file

#16 Updated by Kyle Beauchamp over 9 years ago

I included the pdb, mdp, gro, and top file I'm using. I also changed my nstlist from 0 to 1 to avoid the error you mentioned.

To reproduce this bug, I just have to

/opt/gromacs/aug3/bin/pdb2gmx -f villin.pdb
/opt/gromacs/aug3/bin/grompp -f md.mdp

If I use a linux binary, things work fine:
/opt/gromacs/aug3/bin/mdrun -nt 1

If I use a windows binary, things die:
wine gromacs-build-aug1-notfah/src/kernel/Release/mdrun.exe -nt 1

The same problem shows up with integrator = SD or steep.

My build system is Visual Studio 2008 + ICC + YASM +MKL. I don't know whether this is ICC only, because I haven't been able to compile SSE with the Microsoft compiler.

Please let me know if there's anything else I can do to help.

#17 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=505)
top file

#18 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=506)
conf.gro

#19 Updated by David van der Spoel over 9 years ago

The ff03 force field files are missing, and it would be good as a matter of principle to run EM first. With amber99sb-ildn (experimental support in gromacs 4.5beta) it nevertheless runs fine on my mac too.

#20 Updated by Kyle Beauchamp over 9 years ago

Hi,

My amber force field files are correctly working.

Also, this same bug occurs during minimization, or when I reduce the timestep by a factor of 25. The problem is not "typical system instability."

#21 Updated by David van der Spoel over 9 years ago

Sorry if I wasn't clear enough. Your topology is referring to amber files that are not part of the standard distribution (and which I don't have). Maybe you can upload all of those in a zip file as well.

#22 Updated by Kyle Beauchamp over 9 years ago

Created an attachment (id=507)
Amber03 files

#23 Updated by Kyle Beauchamp over 9 years ago

Hi,

I attached the Amber03.ff files from the Aug. 3 GIT. These files should be included in the Gromacs distribution now, as far as I know. If you're having problems with them, it may be due to some hardcoded paths in my topol.top file. You could re-run pdb2gmx -f villin.pdb to solve that.

#24 Updated by David van der Spoel over 9 years ago

Created an attachment (id=508)
tpr file that crashes on mac as well

This tpr file, based on the original files but with slightly altered mdp parameters, crashes on my mac in the nb_kernel_allvsallgb routine due to the uninitialized variable mask.

#25 Updated by Erik Lindahl over 9 years ago

I think the last error in comment #24 was something entirely different - it had to do with an incorrect #endif that should have been an #else, which caused the C all-vs-all kernel to be called again after the SSE one returned. I've fixed that in commit dcbf47ce9a6928e9ea9fe781c8c5a72b68f1e63a, but I don't think it solves Kyle's problem.

#26 Updated by David van der Spoel over 9 years ago

Thanks, that does indeed fix the segv. You are right that Kyle explicitly stated that all vs all works, but not cut-off. I don't have a windows vm on my mac at the moment, and the cut-off tpr file runs happily. I just noted in the original post that the error occurs in mdrun!nb_kernel400nf_ia32_sse+0x61d:

That is in the noforce routines. Should these be called at all during an MD run?
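
For reading such symbols out of a crash dump, a small decoder can help. The Python sketch below splits a kernel symbol into its parts. The digit meanings (first digit Coulomb type, second digit VdW type, third digit water optimization) are my understanding of the GROMACS 4.x nb_kernelXYZ naming scheme and should be treated as an assumption; the "nf" suffix marking the no-force variant is what this comment refers to. The function and table names are made up for illustration.

```python
import re

# Assumed meaning of the three digits in nb_kernelXYZ (GROMACS 4.x naming).
COULOMB = {0: "none", 1: "plain Coulomb", 2: "reaction-field",
           3: "tabulated", 4: "generalized Born"}
VDW = {0: "none", 1: "Lennard-Jones", 2: "Buckingham", 3: "tabulated"}
WATER = {0: "none", 1: "SPC/TIP3P", 2: "SPC/TIP3P pair",
         3: "TIP4P", 4: "TIP4P pair"}

KERNEL_SYMBOL = re.compile(r"nb_kernel(\d)(\d)(\d)(nf)?_(\w+)")


def decode_kernel(symbol):
    """Decode a kernel symbol such as 'mdrun!nb_kernel400nf_ia32_sse+0x61d:'."""
    match = KERNEL_SYMBOL.search(symbol)
    if match is None:
        raise ValueError(f"not a nonbonded kernel symbol: {symbol!r}")
    coul, vdw, water, nf, variant = match.groups()
    return {
        "coulomb": COULOMB.get(int(coul), "unknown"),
        "vdw": VDW.get(int(vdw), "unknown"),
        "water": WATER.get(int(water), "unknown"),
        "no_force": nf is not None,  # energy-only variant, no forces computed
        "variant": variant,          # e.g. architecture and instruction set
    }


print(decode_kernel("mdrun!nb_kernel400nf_ia32_sse+0x61d:"))
```

Under these assumptions the symbol from the original post decodes as a generalized Born kernel without VdW, in its no-force variant.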

#27 Updated by Erik Lindahl over 9 years ago

I had missed that too. No, that routine shouldn't be called in the first place, which could indicate a (severe) memory or compiler error outside those routines that corrupts the function pointer list.

I have some Amber FF debugging that I'll focus on over the weekend, but I plan to get working on the Windows side next week, including both full Cmake assembly support and this.

#28 Updated by Kyle Beauchamp over 9 years ago

Please let me know if there's anything I can do.

For example, the crashes I've shown require a windows build, but the actual execution can be done in linux+wine, so I'm more than happy to help with builds.

Thanks

#29 Updated by Erik Lindahl over 9 years ago

Hi Kyle,

I haven't found anything specific for this yet (but I haven't looked for it either). However, since we've just fixed a ton of other issues with SSE / CMAKE / Windows, could you

1) Check if the bug is still there?
2) Check if it works with MSVC?

Cheers,

Erik

#30 Updated by Kyle Beauchamp over 9 years ago

Hi,

I'll rebuild with the current GIT (tonight or tomorrow).

I'll also try to build with MSVC.

I may also test 64 bit, as I need to use a 64-bit build for another project.

#31 Updated by Kyle Beauchamp over 9 years ago

It appears that the Aug. 12 GIT solves this issue. I checked both wine and my windows XP VM. MD seems to run ok, and energies appear normal:

Energies (kJ/mol)
           Bond          Angle    Proper Dih. Ryckaert-Bell.         GB 1-2
   4.59931e+002   1.30869e+003   6.41560e+001   1.39312e+003  -1.51784e+003
         GB 1-3         GB 1-4          LJ-14     Coulomb-14        LJ (SR)
   0.00000e+000   0.00000e+000   5.11574e+002   7.60279e+003  -8.58338e+002
   Coulomb (SR)      Potential    Kinetic En.   Total Energy    Temperature
  -1.07982e+004  -1.83412e+003   1.83608e+003   1.96240e+000   3.08204e+002
 Pressure (bar)
   0.00000e+000
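
As an extra sanity check on a "normal" block like this one, the bookkeeping can be verified: the listed terms should add up to Potential, and Potential plus Kinetic En. should give Total Energy, up to rounding of the printed digits. A small Python sketch with the numbers above (the function name and the tolerance are arbitrary choices for illustration):

```python
# Energy terms (kJ/mol) copied from the block above.
TERMS = {
    "Bond": 4.59931e+02,
    "Angle": 1.30869e+03,
    "Proper Dih.": 6.41560e+01,
    "Ryckaert-Bell.": 1.39312e+03,
    "GB 1-2": -1.51784e+03,
    "GB 1-3": 0.0,
    "GB 1-4": 0.0,
    "LJ-14": 5.11574e+02,
    "Coulomb-14": 7.60279e+03,
    "LJ (SR)": -8.58338e+02,
    "Coulomb (SR)": -1.07982e+04,
}
POTENTIAL = -1.83412e+03
KINETIC = 1.83608e+03
TOTAL = 1.96240e+00


def bookkeeping_errors(terms, potential, kinetic, total):
    """Return (|sum(terms) - potential|, |potential + kinetic - total|)."""
    potential_error = abs(sum(terms.values()) - potential)
    total_error = abs(potential + kinetic - total)
    return potential_error, total_error


potential_error, total_error = bookkeeping_errors(TERMS, POTENTIAL, KINETIC, TOTAL)
# Printed values carry six significant digits, so allow a small tolerance.
print(potential_error < 0.1, total_error < 0.1)
```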

I built using ICC+NASM. I'm not 100% sure that NASM is doing the assembly because it doesn't explicitly say NASM, but here is some output from my MSVS log:

3>genborn.c
4>Generating nb_kernel232_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel333_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel200_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel113_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel104_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel210_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)
4>Generating nb_kernel301_ia32_sse_intel_syntax.obj (Microsoft VC++ Environment)

PS I have no idea what the actual fix was. I do know that some change between Aug. 3 and Aug. 12 was it. I'm currently running mdrun to make sure things remain stable.

Thanks!

#32 Updated by Erik Lindahl about 9 years ago

Hi Kyle!

I'll close this for now since it appears to be fixed - just reopen it if you find the problem again.
