Bug #1701

arrayref doesn't always work on BlueGene/Q xlc

Added by Mark Abraham over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Low
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

A recently merged commit is causing mdrun to checkpoint too often. The obvious suspect is my commit 4ed549c5e7d, but I can't see the problem yet.

Associated revisions

Revision b31a7c6e
Added by Mark Abraham over 4 years ago

Add unit tests for ArrayRef

This fixes mdrun signalling on BG/Q. The templated constructor for
ArrayRef does not work with xlc 12 on BG/Q with array fields of
structs unless the base type has size equal to char. Presumably this
is another example of the way that attributes of struct fields just
get thrown away by compilers.
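
As a rough illustration of the pattern described above, the following is a minimal sketch of a view class whose templated constructor deduces the length of a C array that is a field of a struct. It is not the GROMACS arrayref.h code: `SimpleArrayRef`, `SignalHolder`, and `structFieldViewsWork` are hypothetical names chosen for this example. On a conforming compiler the deduction works for any element type; the sketch only shows the construct that was reported to misbehave with xlc 12 on BG/Q and does not reproduce the failure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal view type, for illustration only.
template <typename T>
class SimpleArrayRef
{
public:
    // Templated constructor: the element count is deduced from the
    // type of the C array passed in.
    template <std::size_t count>
    SimpleArrayRef(T (&array)[count]) : begin_(array), end_(array + count) {}

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    T &operator[](std::size_t i) const { return begin_[i]; }

private:
    T *begin_;
    T *end_;
};

// Hypothetical struct with array fields of two different element sizes.
struct SignalHolder
{
    int  flags[3];
    char bytes[3];
};

// Builds views over both struct fields and checks what they refer to.
inline bool structFieldViewsWork()
{
    SignalHolder holder = { { 1, 2, 3 }, { 'a', 'b', 'c' } };

    SimpleArrayRef<int>  ints(holder.flags);  // element size larger than char
    SimpleArrayRef<char> chars(holder.bytes); // element size equal to char

    assert(ints.size() == 3);
    assert(chars.size() == 3);
    return ints[2] == 3 && chars[0] == 'a';
}
```

A unit test along these lines, run on each supported compiler, would presumably flag a toolchain that deduces the wrong extent for the non-char case.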

Changed name of local variable to sig, just in case "signal" clashes
with a preprocessor symbol somewhere...

Some changes to the arrayref.h implementation to make it easier to
write the tests as type-parameterized, by making the factory functions
also appear as members of the corresponding classes, with the same
name for both ArrayRef and ConstArrayRef. While there, remove some
documentation duplication.
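
The benefit of giving both classes a factory with the same member name can be sketched as follows. This is an assumption-laden illustration, not the actual arrayref.h interface: `MiniArrayRef`, `MiniConstArrayRef`, the `fromArray` member, and `factoryRoundTrips` are hypothetical names. The point is only that a single generic check can be instantiated for either class once the factory is reachable as `RefType::fromArray`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mutable view with a static factory member.
template <typename T>
class MiniArrayRef
{
public:
    typedef T value_type;

    static MiniArrayRef fromArray(T *begin, std::size_t size)
    {
        return MiniArrayRef(begin, begin + size);
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    T &operator[](std::size_t i) const { return begin_[i]; }

private:
    MiniArrayRef(T *begin, T *end) : begin_(begin), end_(end) {}
    T *begin_;
    T *end_;
};

// Hypothetical read-only view exposing the factory under the same name.
template <typename T>
class MiniConstArrayRef
{
public:
    typedef T value_type;

    static MiniConstArrayRef fromArray(const T *begin, std::size_t size)
    {
        return MiniConstArrayRef(begin, begin + size);
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    const T &operator[](std::size_t i) const { return begin_[i]; }

private:
    MiniConstArrayRef(const T *begin, const T *end) : begin_(begin), end_(end) {}
    const T *begin_;
    const T *end_;
};

// One generic check, written once against the shared member name and
// instantiated for whichever view type is under test.
template <typename RefType>
bool factoryRoundTrips()
{
    typename RefType::value_type data[3] = { 1, 2, 3 };
    RefType ref = RefType::fromArray(data, 3);
    return ref.size() == 3 && ref[0] == data[0] && ref[2] == data[2];
}
```

A type-parameterized test framework can then supply the list of view types, in the same way the template argument is supplied by hand here.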

Fixes #1701

Change-Id: I6894706b224dc5f3db7893503371107f1ff324d2

History

#1 Updated by Mark Abraham over 4 years ago

So far I can only reproduce this on BG/Q in a Release build, not in Debug or RelWithDebInfo builds. Running git revert on my suspect commit does fix the problem, so it is tempting to blame xlc...

#2 Updated by Szilárd Páll over 4 years ago

Haven't noticed anything like that. How much does the checkpointing frequency increase?

#3 Updated by Mark Abraham over 4 years ago

It is definitely the commit I mentioned. With

 mdrun_mpi -ddorder interleave -npme 0 -s ion-channel-vsites-lincs-4-2 -nsteps 15000 -resetstep 10000 -noconfout -dlb yes -notunepme

I get normal results on any build of the parent, or on a Debug or RelWithDebInfo build on the commit. (I haven't seen or heard of any such report on another kind of machine.) But with Release:

...
Started mdrun on rank 0 Wed Mar 18 05:42:04 2015
           Step           Time         Lambda
              0        0.00000        0.00000

   Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.  Improper Dih.
    6.12678e+04    7.82004e+04    1.51701e+04    1.96606e+03    2.45323e+03
          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
    4.77233e+04    3.42398e+05    7.42826e+04   -3.46885e+04   -2.51439e+06
   Coul. recip.      Potential    Kinetic En.   Total Energy  Conserved En.
    2.12644e+04   -1.90435e+06    3.31661e+05   -1.57269e+06   -1.57269e+06
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.09451e+02   -3.54583e+02    3.08390e+01    0.00000e+00

step 1: resetting all time and cycle counters

Restarted time on rank 0 Wed Mar 18 05:42:04 2015
DD  step 9  vol min/aver 1.000  load imb.: force 77.8%

Writing checkpoint, step 10 at Wed Mar 18 05:42:04 2015

step 11: resetting all time and cycle counters

Restarted time on rank 0 Wed Mar 18 05:42:05 2015
Writing checkpoint, step 20 at Wed Mar 18 05:42:05 2015

step 21: resetting all time and cycle counters
...

#4 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1701.
Uploader: Mark Abraham
Change-Id: I6894706b224dc5f3db7893503371107f1ff324d2
Gerrit URL: https://gerrit.gromacs.org/4498

#5 Updated by Mark Abraham over 4 years ago

  • Status changed from New to Fix uploaded

#6 Updated by Mark Abraham over 4 years ago

  • Priority changed from Normal to Low

There's no real problem with mdrun here, simply a buggy compiler on BlueGene/Q.

#7 Updated by Mark Abraham over 4 years ago

  • Subject changed from mdrun signalling not working properly (maybe?) to arrayref doesn't always work on BlueGene/Q xlc

#8 Updated by Erik Lindahl over 4 years ago

  • Status changed from Fix uploaded to Resolved

#9 Updated by Erik Lindahl over 4 years ago

  • Status changed from Resolved to Closed
