Bug #2418

Incorrect results with Nose-Hoover temperature coupling

Added by Marvin Bernhardt over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: High
Assignee: Berk Hess
Category: mdrun
Target version: 2018.1
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I get a segmentation fault when trying to run a simulation on our new workstation.
Observations:
  • It only appears when tcoupl = nose-hoover (see the .mdp sketch after this list).
  • It only appears when -ntmpi > 1 or is unset (i.e. when using both processors).
  • If I do not write the energy at every step, it fails instead with: Fatal error:
    3720 particles communicated to PME rank 4 are more than 2/3 times the cut-off
    out of the domain decomposition cell of their charge group in dimension x.
    This usually means that your system is not well equilibrated.
  • On at least one other machine with two processors this works fine.
  • It does not matter whether I use the GPU or not (-nb cpu).
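
For reference, here is a minimal sketch of the coupling-related part of an .mdp file that reproduces the problem for me; the values below are illustrative, not copied from the attached grompp.mdp:

; temperature coupling (illustrative sketch; see the attached grompp.mdp for the actual input)
tcoupl      = nose-hoover    ; the crash only appears with Nose-Hoover
tc-grps     = System         ; assumed single coupling group
tau_t       = 1.0            ; assumed coupling time constant (ps)
ref_t       = 300            ; assumed reference temperature (K)
nstenergy   = 1              ; writing energies every step gives the segfault; less often gives the PME fatal error

With such settings, plain gmx mdrun (thread-MPI rank count chosen automatically) or an explicit -ntmpi > 1 fails, while gmx mdrun -ntmpi 1 runs fine.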

Since this is machine-dependent, here is the hardware detection section from md.log:

Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  20] [   1  21] [   2  22] [   3  23] [   4  24] [   5  25] [   6  26] [   7  27] [   8  28] [   9  29]
      Socket  1: [  10  30] [  11  31] [  12  32] [  13  33] [  14  34] [  15  35] [  16  36] [  17  37] [  18  38] [  19  39]
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible

My colleague told me to compile GROMACS in debug mode, which I did. Here is the output and backtrace, even though I don't understand it:

GROMACS:      gmx mdrun, version 2018
Executable:   /cluster/local/software/gromacs-2018-debug/bin/gmx
Data prefix:  /cluster/local/software/gromacs-2018-debug
Working dir:  /home/mbernhardt/run/bug-mdrun-pme-rank
Command line:
  gmx mdrun

Back Off! I just backed up md.log to ./#md.log.1#
[New Thread 0x7fffe35a0700 (LWP 30015)]
[New Thread 0x7fffe2d9f700 (LWP 30016)]
[New Thread 0x7fffe21ff700 (LWP 30018)]
[Thread 0x7fffe21ff700 (LWP 30018) exited]
[Thread 0x7fffe2d9f700 (LWP 30016) exited]
[New Thread 0x7fffe2d9f700 (LWP 30019)]
[New Thread 0x7fffe21ff700 (LWP 30020)]
[Thread 0x7fffe21ff700 (LWP 30020) exited]
[Thread 0x7fffe2d9f700 (LWP 30019) exited]
Reading file topol.tpr, VERSION 2018 (single precision)
[New Thread 0x7fffe2d9f700 (LWP 30021)]
[New Thread 0x7fffe21ff700 (LWP 30022)]
[New Thread 0x7fffe19fe700 (LWP 30023)]
[New Thread 0x7fffe11fd700 (LWP 30024)]
[New Thread 0x7fffe09fc700 (LWP 30025)]
[New Thread 0x7fffcbfff700 (LWP 30026)]
[New Thread 0x7fffcb7fe700 (LWP 30027)]
Changing nstlist from 10 to 100, rlist from 1.2 to 1.304

No option -multi
No option -multi
No option -multi
Using 8 MPI threads
No option -multi
No option -multi
No option -multi
No option -multi
No option -multi
Using 5 OpenMP threads per tMPI thread

On host gpu0 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1
[New Thread 0x7fffcaffd700 (LWP 30029)]
[New Thread 0x7fffca7fc700 (LWP 30028)]
[New Thread 0x7fffc99ff700 (LWP 30033)]
[New Thread 0x7fffc91fe700 (LWP 30034)]
[New Thread 0x7fff94fff700 (LWP 30042)]
[New Thread 0x7fffaa1fc700 (LWP 30041)]
[New Thread 0x7fffa89f9700 (LWP 30040)]
[New Thread 0x7fffa99fb700 (LWP 30035)]
[New Thread 0x7fffab9ff700 (LWP 30037)]
[New Thread 0x7fffab1fe700 (LWP 30036)]
[New Thread 0x7fffaa9fd700 (LWP 30039)]
[New Thread 0x7fffa91fa700 (LWP 30038)]
[New Thread 0x7fff947fe700 (LWP 30043)]
[New Thread 0x7fff91ff9700 (LWP 30047)]
[New Thread 0x7fff93ffd700 (LWP 30046)]
[New Thread 0x7fff92ffb700 (LWP 30045)]
[New Thread 0x7fff937fc700 (LWP 30044)]
[New Thread 0x7fff927fa700 (LWP 30048)]
[New Thread 0x7fff917f8700 (LWP 30049)]
[New Thread 0x7fff90ff7700 (LWP 30050)]
[New Thread 0x7fff907f6700 (LWP 30051)]
[New Thread 0x7fff8d7f0700 (LWP 30056)]
[New Thread 0x7fff8fff5700 (LWP 30054)]
[New Thread 0x7fff8f7f4700 (LWP 30053)]
[New Thread 0x7fff8eff3700 (LWP 30052)]
[New Thread 0x7fff8e7f2700 (LWP 30055)]
[New Thread 0x7fff8cfef700 (LWP 30057)]
[New Thread 0x7fff8dff1700 (LWP 30058)]
[New Thread 0x7fff8c7ee700 (LWP 30059)]
[New Thread 0x7fff897e8700 (LWP 30065)]
[New Thread 0x7fff8bfed700 (LWP 30064)]
[New Thread 0x7fff8b7ec700 (LWP 30063)]
[New Thread 0x7fff8afeb700 (LWP 30061)]
[New Thread 0x7fff8a7ea700 (LWP 30060)]
[New Thread 0x7fff89fe9700 (LWP 30062)]
[New Thread 0x7fff88fe7700 (LWP 30066)]

Back Off! I just backed up traj_comp.xtc to ./#traj_comp.xtc.1#

Back Off! I just backed up ener.edr to ./#ener.edr.1#

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'PNiPAMWaterSalt in water'
10 steps,      0.0 ps.

Thread 31 "gmx" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff927fa700 (LWP 30048)]
0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182, 
    c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
    at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
113        Y                = vftab[ntab];
(gdb) backtrace
#0  0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182, 
    c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
    at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
#1  0x00007ffff476013b in do_pairs_general (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500, 
    f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40, 
    fr=0x7fffcc0a7590, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
    at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:507
#2  0x00007ffff476055c in do_pairs (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500, 
    f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40, 
    fr=0x7fffcc0a7590, bCalcEnergyAndVirial=768, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
    at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:698
#3  0x00007ffff47554b1 in (anonymous namespace)::calc_one_bond (thread=2, ftype=33, idef=0x7fffcc22e230, x=0x7fffcc341500, 
    f=0x7ffefc23e080, fshift=0x7ffefc000b40, fr=0x7fffcc0a7590, pbc=0x7fffe19fbf20, g=0x0, grpp=0x7fffcc0e8808, nrnb=0x7fffcc0a71b0, 
    lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40, fcd=0x7fffcc04f460, bCalcEnerVir=768, global_atom_index=0x7fffcc2f89f0)
    at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:389
#4  0x00007ffff4756a60 in calcBondedForces () at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:471
#5  0x00007ffff3a108ee in gomp_thread_start (xdata=<optimized out>) at /build/gcc/src/gcc/libgomp/team.c:120
#6  0x00007ffff35cc08c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007ffff3303e7f in clone () from /usr/lib/libc.so.6

grompp.mdp (558 Bytes) grompp.mdp Marvin Bernhardt, 02/21/2018 11:36 AM
topol.top (373 KB) topol.top Marvin Bernhardt, 02/21/2018 11:36 AM
urea.itp (2.2 KB) urea.itp Marvin Bernhardt, 02/21/2018 11:36 AM
conf.gro (1.73 MB) conf.gro Marvin Bernhardt, 02/21/2018 11:36 AM
md.log (18.3 KB) md.log Marvin Bernhardt, 02/21/2018 11:40 AM

Associated revisions

Revision ee8b06ea (diff)
Added by Berk Hess over 1 year ago

Fix md integrator with Nose-Hoover coupling

When applying NH T-coupling at an MD step and no PR P-coupling,
the md integrator could apply pressure scaling with an uninitialized
or outdated PR scaling matrix.

Fixes #2418

Change-Id: I835db72776e7782ac044807961bb899e4f8c6c7b

History

#1 Updated by Paul Bauer over 1 year ago

According to my bisecting, this first appears in efefd3a49c043df1a846753539cb0d4e205c8701.

Reproduced on my machine (Linux 4.4.0-109-generic x86_64, i7-4790K CPU, no GPU support enabled, compiled with GNU 5.4.1, ran the test input with only mdrun -deffnm).

#2 Updated by Berk Hess over 1 year ago

The bug only seems to occur with nsttcouple=1, which is likely the reason why we have not noticed this before release. This must be an issue with communicating the kinetic energy before/at step=0.

#3 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2418.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I835db72776e7782ac044807961bb899e4f8c6c7b
Gerrit URL: https://gerrit.gromacs.org/7614

#4 Updated by Berk Hess over 1 year ago

  • Subject changed from Segmentation fault with two processors to Incorrect results with Nose-Hoover temperature coupling
  • Category set to mdrun
  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess
  • Priority changed from Normal to High
  • Target version set to 2018.1

This bug is much more serious than I thought. With Nose-Hoover T-coupling and the md integrator, we can do pressure scaling with an uninitialized or outdated Parrinello-Rahman pressure scaling matrix.

#5 Updated by Mark Abraham over 1 year ago

Berk Hess wrote:

This bug is much more serious than I thought. With Nose-Hoover T-coupling and the md integrator, we can do pressure scaling with an uninitialized or outdated Parrinello-Rahman pressure scaling matrix.

Yes that sounds right.

#6 Updated by Berk Hess over 1 year ago

To be more precise: the bug does not occur when using NH+PR coupling with nsttcouple=nstpcouple, which is the most common setup and likely also the reason why our tests did not catch this.
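
In .mdp terms, a rough sketch of the distinction (the values are illustrative assumptions, not taken from the attached grompp.mdp):

; affected: NH T-coupling steps that are not also PR P-coupling steps
tcoupl      = nose-hoover
nsttcouple  = 1
pcoupl      = parrinello-rahman   ; or pcoupl = no
nstpcouple  = 10                  ; != nsttcouple

; not affected (the common setup): T- and P-coupling applied at the same interval
tcoupl      = nose-hoover
nsttcouple  = 10
pcoupl      = parrinello-rahman
nstpcouple  = 10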

#7 Updated by Berk Hess over 1 year ago

  • Status changed from Fix uploaded to Resolved

#8 Updated by Mark Abraham over 1 year ago

  • Status changed from Resolved to Closed

#9 Updated by Mark Abraham over 1 year ago

A quick glance at the commit Paul identified did not make clear to me why the bug emerged there. Oh well.
