Bug #2642

mdrun with SIMD triggers floating point exceptions

Added by Berk Hess about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info: likely also 2018
Affected version:
Difficulty: uncategorized

Description

The SIMD padding in the padded vectors in state and f is not cleared, leading to floating point exceptions during integration.


Related issues

Related to GROMACS - Bug #2404: Enabling floating point exceptions makes some tests fail (Closed)
Related to GROMACS - Bug #2702: PME gather reduction race in OpenCL (and CUDA) (Accepted)
Related to GROMACS - Bug #2776: FP exception in pullAllReduce (Closed)

Associated revisions

Revision 3043cd77 (diff)
Added by Mark Abraham about 1 year ago

Implement PaddedVector

Implements a vector-like container that maintains padding so that
SIMD memory operations always use initialized memory. All use of
the padded form of the data takes place via an ArrayRef, so that this is
explicit at the point of creation of the view. The PaddedArrayRef view
also adds some safety through strong typing.

Increased the timeout for TSAN tests, which now take longer. Since
there are now two special case configurations for testing where we
permit a longer timeout, simplified the way we implement the increase
for the OpenCL case.

Fixes #2642

Change-Id: Ibcafe161689f4a02f7aa6fd0410edec47300264e

Revision 2778c81d (diff)
Added by Roland Schulz about 1 year ago

Fix MSAN errors

Refs #2642

Change-Id: I94a649f5bf0fdcc85a826882efa520776de7740a

Revision d22cb18c (diff)
Added by Mark Abraham about 1 year ago

Fix PaddedVector issues

The move construction with an allocator was producing an empty buffer
upon a change of pinning policy for scalar types. A change of allocator
means that move construction has to be implemented as a copy, which is
how we intended it to work.

Only PaddedVectors of copyable types can have non-zero size, because
we need to be able to copy elements for the padding region, so the
MoveOnly types in the HostVector tests have to be updated.

Fixes #2642

Change-Id: I646a57b0b0dd727647baa2544a1c5f7f1700661d

Revision 93109ddf (diff)
Added by Mark Abraham about 1 year ago

Introduce ArrayRefWithPadding

Modules need to access views of memory with and without padding. This
class can be constructed from a PaddedVector, hides the underlying
padded container (and allocator), and provides ArrayRef views both with
and without padding.

Also updated the wording and attribution of the ArrayRef docs.

Fixed an issue with the shell code using the same pointer for the trial
and minimum configurations.

Fixes #2642
Fixes #2705

Change-Id: I45197215342ae011298abc310b877c34f8fab88b
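
For context, here is a minimal sketch of the kind of view described in that commit; the names (SpanSketch, ArrayRefWithPaddingSketch) are illustrative stand-ins, not the actual GROMACS API:

#include <cstddef>

//! Stand-in for gmx::ArrayRef: a pointer plus a length.
template<typename T>
struct SpanSketch
{
    T*          ptr;
    std::size_t size;
};

//! A view constructed from a padded container that can hand out both the
//! unpadded and the padded range, without exposing the container or its
//! allocator to client code.
template<typename T>
class ArrayRefWithPaddingSketch
{
public:
    ArrayRefWithPaddingSketch(T* begin, std::size_t unpaddedSize, std::size_t paddedSize)
        : begin_(begin), unpaddedSize_(unpaddedSize), paddedSize_(paddedSize) {}

    SpanSketch<T> unpaddedArrayRef() const { return { begin_, unpaddedSize_ }; }
    SpanSketch<T> paddedArrayRef() const { return { begin_, paddedSize_ }; }

private:
    T*          begin_;
    std::size_t unpaddedSize_;
    std::size_t paddedSize_;
};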

History

#1 Updated by Berk Hess about 1 year ago

  • Related to Bug #2404: Enabling floating point exceptions makes some tests fail added

#2 Updated by Roland Schulz about 1 year ago

Do you plan to upload a patch or would you like help?

#3 Updated by Berk Hess about 1 year ago

I put in a resize function that clears the padding exactly for this issue, but Erik didn't like that. I don't understand why using the padding size for every resize is better.

Do we want to put the resize function into release-2018?

#4 Updated by Mark Abraham about 1 year ago

PaddedRVecVector is currently a std::vector, except that you have to call a different resize function, and when we do so the padding region can contain problematic values (whether the vector is getting larger or smaller).

We are proposing to replace it with a type that implements a (subset of the) std::vector interface, so that we can have the semantics of, e.g., clearing the padding region when required, while still being able to use resize(). I have much of that already in an old Gerrit change, so I will push something up later today. HostVector can then specialize this new type. We can decide later how we want to approach fixing release-2018.
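
For illustration, a minimal sketch of such a container, assuming a hypothetical SIMD width of 8 and value-initialized padding; this is not the actual GROMACS implementation:

#include <algorithm>
#include <cstddef>
#include <vector>

template<typename T>
class PaddedVectorSketch
{
public:
    //! Resize to newSize real elements; the padding region is always
    //! re-initialized to T{} so SIMD loads over it are well defined.
    void resize(std::size_t newSize)
    {
        size_ = newSize;
        data_.resize(computePaddedSize(newSize));
        std::fill(data_.begin() + static_cast<std::ptrdiff_t>(size_), data_.end(), T{});
    }

    std::size_t size() const { return size_; }              //!< unpadded size
    std::size_t paddedSize() const { return data_.size(); } //!< size including padding

    T*       data()       { return data_.data(); }
    const T* data() const { return data_.data(); }

private:
    //! Hypothetical padding rule: round up to a multiple of the SIMD width.
    static std::size_t computePaddedSize(std::size_t n)
    {
        constexpr std::size_t simdWidth = 8;
        return (n + simdWidth - 1) / simdWidth * simdWidth;
    }

    std::vector<T> data_;
    std::size_t    size_ = 0;
};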

#5 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related DRAFT patchset '1' for Issue #2642.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~Ibcafe161689f4a02f7aa6fd0410edec47300264e
Gerrit URL: https://gerrit.gromacs.org/8407

#6 Updated by Mark Abraham about 1 year ago

  • Assignee set to Mark Abraham
  • Target version changed from 2018.4 to 2019

https://gerrit.gromacs.org/c/8407/ proposes a fix. Part of the issue is probably that RVec has a default constructor that leaves elements uninitialized, and that is what std::vector construction, resize, and insert will use. So we need to manually manage the padding elements somehow. That could be done either in a container, or in the code using it...
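
A small sketch of that problem; RVecLike is a stand-in for gmx::RVec, not the real type:

#include <vector>

struct RVecLike
{
    RVecLike() {}                       // deliberately leaves x[] uninitialized, like RVec
    explicit RVecLike(float v) : x{ v, v, v } {}
    float x[3];
};

int main()
{
    std::vector<RVecLike> v;
    v.resize(8);                        // new elements hold garbage, which SIMD math
                                        // on padding elements can turn into FP exceptions

    // Managing the padding manually means supplying a value explicitly,
    // either here or by overwriting the padding region afterwards:
    v.resize(16, RVecLike(0.0F));       // new elements are well-defined zeros
}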

#7 Updated by Mark Abraham about 1 year ago

  • Status changed from Accepted to Resolved

#8 Updated by Mark Abraham about 1 year ago

  • Target version changed from 2019 to 2018.4

We should reconsider a (simpler) fix for 2018.4

#9 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Accepted

I think there's an issue remaining (other than the TPI one). http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/6598/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/ failed in complex/pull_constraint, and https://gerrit.gromacs.org/#/c/8505/ failed differently (in CG steepest descent for a non-GPU non-SIMD build). The issue appears to manifest in the update code.

#10 Updated by Berk Hess about 1 year ago

Is it in the update or in the constraints?
And only with the pull constraint test?

#11 Updated by Mark Abraham about 1 year ago

Mark Abraham wrote:

I was wrong

I think there's an issue remaining (other than the TPI one). http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/6598/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/ failed in complex/pull_constraint, and https://gerrit.gromacs.org/#/c/8505/ failed differently (in CG steepest descent for a non-GPU non-SIMD build). The issue appears to manifest in the update code.

It could be anywhere:

19:41:17 [ RUN      ] MinimizersWorkWithConstraints/EnergyMinimizationTest.WithinTolerances/4
19:41:17 
19:41:17 NOTE 1 [file /mnt/workspace/Matrix_PreSubmit_master/60f6e166/gromacs/src/programs/mdrun/tests/Testing/Temporary/MinimizersWorkWithConstraints_EnergyMinimizationTest_WithinTolerances_4_input.mdp, line 28]:
19:41:17   /mnt/workspace/Matrix_PreSubmit_master/60f6e166/gromacs/src/programs/mdrun/tests/Testing/Temporary/MinimizersWorkWithConstraints_EnergyMinimizationTest_WithinTolerances_4_input.mdp did not specify a value for the .mdp option "cutoff-scheme". Probably it
19:41:17   was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
19:41:17   scheme was introduced, but the group scheme was still the default. The
19:41:17   default is now the Verlet scheme, so you will observe different behaviour.
19:41:17 
19:41:17 
19:41:17 NOTE 2 [file /mnt/workspace/Matrix_PreSubmit_master/60f6e166/gromacs/src/programs/mdrun/tests/Testing/Temporary/MinimizersWorkWithConstraints_EnergyMinimizationTest_WithinTolerances_4_input.mdp]:
19:41:17   With Verlet lists the optimal nstlist is >= 10, with GPUs >= 20. Note
19:41:17   that with the Verlet scheme, nstlist has no effect on the accuracy of
19:41:17   your simulation.
19:41:17 
19:41:17 Generated 2211 of the 2211 non-bonded parameter combinations
19:41:17 Generating 1-4 interactions: fudge = 0.5
19:41:17 Generated 2211 of the 2211 1-4 parameter combinations
19:41:17 Excluding 3 bonded neighbours molecule type 'Alanine_dipeptide'
19:41:17 
19:41:17 NOTE 3 [file unknown]:
19:41:17   You are using constraints on all bonds, whereas the forcefield has been
19:41:17   parametrized only with constraints involving hydrogen atoms. We suggest
19:41:17   using constraints = h-bonds instead, this will also improve performance.
19:41:17 
19:41:17 Cleaning up constraints and constant bonded interactions with virtual sites
19:41:17 Removed     18           Angles with virtual sites,    21 left
19:41:17 Removed     10     Proper Dih.s with virtual sites,    44 left
19:41:17 Converted   15      Constraints with virtual sites to connections,     7 left
19:41:17 Removing all charge groups because cutoff-scheme=Verlet
19:41:17 Number of degrees of freedom in T-Coupling group System is 23.00
19:41:17 
19:41:17 NOTE 4 [file /mnt/workspace/Matrix_PreSubmit_master/60f6e166/gromacs/src/programs/mdrun/tests/Testing/Temporary/MinimizersWorkWithConstraints_EnergyMinimizationTest_WithinTolerances_4_input.mdp]:
19:41:17   You are using a plain Coulomb cut-off, which might produce artifacts.
19:41:17   You might want to consider using PME electrostatics.
19:41:17 
19:41:17 
19:41:17 
19:41:17 There were 4 notes
19:41:17 Compiled SIMD: None, but for this host/run AVX2_256 might be better (see log).
19:41:17 Reading file /mnt/workspace/Matrix_PreSubmit_master/60f6e166/gromacs/src/programs/mdrun/tests/Testing/Temporary/MinimizersWorkWithConstraints_EnergyMinimizationTest_WithinTolerances_4.tpr, VERSION 2019-dev-20181008-3e21c07 (single precision)
19:41:17 Non-default thread affinity set, disabling internal thread affinity
19:41:17 
19:41:17 Using 1 MPI thread
19:41:17 
19:41:17 WARNING: Using the slow plain C kernels. This should
19:41:17 not happen during routine usage on supported platforms.
19:41:17 
19:41:17 Steepest Descents:
19:41:17    Tolerance (Fmax)   =  1.00000e+01
19:41:17    Number of steps    =            4
19:41:17 
19:41:17       Start 40: GmxapiExternalInterfaceTests

#12 Updated by Mark Abraham about 1 year ago

Perhaps a vsites theme: http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/949/OPTIONS=clang-4%20simd=sse4.1%20openmp%20nranks=1%20gpuhw=nvidia%20cuda-8.0%20clang_cuda%20host=bs_nix1204,label=bs_nix1204/ failed in dd121 and nbnxn_vsite; the respective logs follow:

GROMACS:      gmx mdrun, version 2019-dev-20181008-3043cd7
Executable:   /home/jenkins/workspace/Matrix_PostSubmit_master/babf8417/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PostSubmit_master/babf8417/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PostSubmit_master/babf8417/regressiontests/complex/dd121
Command line:
  gmx mdrun -ntmpi 1 -ntomp 2 -notunepme

Compiled SIMD: SSE4.1, but for this host/run AVX2_256 might be better (see
log).
Reading file topol.tpr, VERSION 2019-dev-20181008-3043cd7 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 1 MPI thread
Using 2 OpenMP threads 

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

-------------------------------------------------------
Program:     gmx mdrun, version 2019-dev-20181008-3043cd7
Source file: src/gromacs/mdlib/vsite.cpp (line 761)
Function:    auto constructVsitesGlobal(const gmx_mtop_t &, gmx::ArrayRef<gmx::RVec>)::(anonymous class)::operator()() const

Assertion failed:
Condition: x.size() >= static_cast<gmx::index>(mtop.natoms)
x should contain the whole system

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

and

GROMACS:      gmx mdrun, version 2019-dev-20181008-3043cd7
Executable:   /home/jenkins/workspace/Matrix_PostSubmit_master/babf8417/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PostSubmit_master/babf8417/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PostSubmit_master/babf8417/regressiontests/complex/nbnxn_vsite
Command line:
  gmx mdrun -ntmpi 1 -ntomp 2 -notunepme

Compiled SIMD: SSE4.1, but for this host/run AVX2_256 might be better (see
log).
Reading file topol.tpr, VERSION 2019-dev-20181008-3043cd7 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 1 MPI thread
Using 2 OpenMP threads 

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

-------------------------------------------------------
Program:     gmx mdrun, version 2019-dev-20181008-3043cd7
Source file: src/gromacs/mdlib/vsite.cpp (line 761)
Function:    auto constructVsitesGlobal(const gmx_mtop_t &, gmx::ArrayRef<gmx::RVec>)::(anonymous class)::operator()() const

Assertion failed:
Condition: x.size() >= static_cast<gmx::index>(mtop.natoms)
x should contain the whole system

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

#13 Updated by Roland Schulz about 1 year ago

There are multiple MSAN errors, of which one looks related, and none look like false positives. I'll upload a patch.

#14 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '4' for Issue #2642.
Uploader: Berk Hess ()
Change-Id: gromacs~master~I94a649f5bf0fdcc85a826882efa520776de7740a
Gerrit URL: https://gerrit.gromacs.org/8511

#15 Updated by Mark Abraham about 1 year ago

  • Status changed from Accepted to Resolved

With Roland's MSAN fixes merged, we again think the issues are resolved. But we should see if we can add an MSAN configuration; I had been thinking we had one all along!

#16 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Accepted

#17 Updated by Mark Abraham about 1 year ago

I will look into the vsite issue

#18 Updated by Mark Abraham about 1 year ago

I can reproduce the issue; it looks like PaddedVector interacting incorrectly with HostAllocator.

#19 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2642.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~I646a57b0b0dd727647baa2544a1c5f7f1700661d
Gerrit URL: https://gerrit.gromacs.org/8517

#20 Updated by Mark Abraham about 1 year ago

  • Status changed from Accepted to Fix uploaded

#21 Updated by Mark Abraham about 1 year ago

  • Status changed from Fix uploaded to Resolved

#22 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to In Progress

#23 Updated by Mark Abraham about 1 year ago

  • Status changed from In Progress to Accepted

Fresh report of an issue:

http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/6746/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/testReport/junit/(root)/complex/pull_constraint/

GROMACS:      gmx mdrun, version 2019-dev-20181012-80fae6c
Executable:   /home/jenkins/workspace/Matrix_PreSubmit_master/cf01250f/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PreSubmit_master/cf01250f/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PreSubmit_master/cf01250f/regressiontests/complex/pull_constraint
Command line:
  gmx mdrun -ntmpi 2 -ntomp 2 -notunepme

Reading file topol.tpr, VERSION 2019-dev-20181012-80fae6c (single precision)
Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

On host bs-gpu01 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Water'
20 steps,      0.0 ps.
Floating point exception (core dumped)

--------------------------------

#24 Updated by Mark Abraham about 1 year ago

And another, same config

GROMACS:      gmx mdrun, version 2019-dev-20181012-dedf3de
Executable:   /home/jenkins/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/regressiontests/complex/pull_geometry_angle-axis
Command line:
  gmx mdrun -ntomp 2 -notunepme

The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.
Reading file topol.tpr, VERSION 2019-dev-20181012-dedf3de (single precision)
Changing nstlist from 10 to 100, rlist from 0.9 to 0.9

Using 2 MPI processes
Using 2 OpenMP threads per MPI process

On host bs-nix1310 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Alanine-dipeptide'
20 steps,      0.0 ps.
[bs-nix1310:25061] *** Process received signal ***
[bs-nix1310:25061] Signal: Floating point exception (8)
[bs-nix1310:25061] Signal code:  (7)
[bs-nix1310:25061] Failing at address: 0x7f6923475b50
[bs-nix1310:25060] *** Process received signal ***
[bs-nix1310:25060] Signal: Floating point exception (8)
[bs-nix1310:25060] Signal code:  (7)
[bs-nix1310:25060] Failing at address: 0x7fd4c6235b50
[bs-nix1310:25060] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fd4c5094330]
[bs-nix1310:25060] [ 1] /usr/lib/libmpi.so.1(ompi_op_base_sum_double+0x20) [0x7fd4c6235b50]
[bs-nix1310:25060] [ 2] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x4a8) [0x7fd4baf4ab48]
[bs-nix1310:25060] [ 3] /usr/lib/libmpi.so.1(PMPI_Allreduce+0x183) [0x7fd4c620b873]
[bs-nix1310:25060] [ 4] /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/../lib/libgromacs.so.4(_Z8gmx_sumdiPdPK9t_commrec+0x12f) [0x7fd4c71821cd]
[bs-nix1310:25060] [ 5] /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/../lib/libgromacs.so.4(+0x763670) [0x7fd4c6ea5670]
[bs-nix1310:25060] [ 6] /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/../lib/libgromacs.so.4(+0x767a7d) [0x7fd4c6ea9a7d]
[bs-nix1310:25060] [ 7] /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/../lib/libgromacs.so.4(_Z14pull_calc_comsPK9t_commrecP6pull_tPK9t_mdatomsP5t_pbcdPA3_KfPA3_f+0xd5e) [0x7fd4c6ea7bdb]
[bs-nix1310:25060] [ 8] /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/bin/../lib/libgromacs.so.4(_Z14pull_potentialP6pull_tPK9t_mdatomsP5t_pbcPK9t_commrecdfPA3_KfPN3gmx15ForceWithVirialEPf+0xed) [0x7fd4c6e8f372]

#25 Updated by Mark Abraham about 1 year ago

The issue in comment #12 is still recurring.

#26 Updated by Berk Hess about 1 year ago

I don't understand how that is possible. The global state x atom count has to match mtop.natoms.
Is there a particular test where this fails?

#27 Updated by Mark Abraham about 1 year ago

Berk Hess wrote:

I don't understand how that is possible. The global state x atom count has to match mtop.natoms.
Is there a particular test where this fails?

The two in comment #12, dd121 and nbnxn_vsite, do recur in this way.

#28 Updated by Mark Abraham about 1 year ago

  • Target version changed from 2018.4 to 2019

#29 Updated by Mark Abraham about 1 year ago

complex/orientation_restraints is also known to fail occasionally, but with checkforce errors, not an FPE.

#30 Updated by Berk Hess about 1 year ago

I found a bug: do_force gets passed state->x.paddedArrayRef() but uses x.size() for getting the number of atoms. This doesn't work because PaddedArrayRef (only) provides the padded size.
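
A sketch of that pitfall, with hypothetical names: a PaddedArrayRef-style view only knows the padded size, so deriving the atom count from it over-counts by the SIMD padding.

#include <cstddef>

struct PaddedViewSketch
{
    std::size_t paddedSize;                         // only the padded size is stored...
    std::size_t size() const { return paddedSize; } // ...so size() is the padded size
};

void doForceSketch(const PaddedViewSketch& x, std::size_t natoms)
{
    std::size_t wrongAtomCount = x.size(); // includes padding: the bug described above
    std::size_t rightAtomCount = natoms;   // pass the real atom count separately
    (void)wrongAtomCount;
    (void)rightAtomCount;
}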

#31 Updated by Berk Hess about 1 year ago

This might also explain the size issues, if PaddedArrayRef is used there along the way.

#32 Updated by Berk Hess about 1 year ago

Looks like this size issue (currently) doesn't affect anything in master. It should still be fixed, though.

Valgrind complains about uninitialized memory in complex.orientation_restraints. This is in the host PME force buffer. I don't know whether this is a false positive.

#33 Updated by Berk Hess about 1 year ago

Filed #2702 for the orientation restraint test issue.

#34 Updated by Berk Hess about 1 year ago

  • Related to Bug #2702: PME gather reduction race in OpenCL (and CUDA) added

#35 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2642.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~I45197215342ae011298abc310b877c34f8fab88b
Gerrit URL: https://gerrit.gromacs.org/8572

#36 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2642.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2019~I45197215342ae011298abc310b877c34f8fab88b
Gerrit URL: https://gerrit.gromacs.org/8595

#37 Updated by Mark Abraham about 1 year ago

  • Status changed from Accepted to Resolved

#38 Updated by Erik Lindahl about 1 year ago

Hi,

Sorry if I'm storming in and providing thoughts you have already considered, but in that case you can just ignore this :-)

However, I suspect many of the bugs we've had related to these padded arrays stem from a less-than-ideal separation of duties between the classes/modules - in particular, from PaddedRVec being too aware of how it might be used in order to save work elsewhere.

For instance, I understand that it's convenient that a vector's padding is automatically zeroed, or that there is a method that returns the number of elements including the padding for iterations. However... I also see how that will cause other problems because we now need to keep in mind that the default zero value might not always work (1/x), and that there are now two sizes...

Don't you think it will make for a cleaner and safer interface to strictly limit PaddedRVec to only be a vector where we are guaranteed there will not be any memory errors if we read/write elements up to the SIMD padding?

All other logic deciding exactly how to zero- or value-pad the extra data, and determine how many extra data items to iterate over, would then go in the specific place where we do it.

#39 Updated by Mark Abraham about 1 year ago

Erik Lindahl wrote:

Hi,

Sorry if I'm storming in and providing thoughts you have already considered, but in that case you can just ignore this :-)

However, I suspect many of the bugs we've had related to these padded arrays stem from a less-than-ideal separation of duties between the classes/modules - in particular, from PaddedRVec being too aware of how it might be used in order to save work elsewhere.

You do have a point about questionable separation of concerns.

For instance, I understand that it's convenient that a vector's padding is automatically zeroed, or that there is a method that returns the number of elements including the padding for iterations. However... I also see how that will cause other problems because we now need to keep in mind that the default zero value might not always work (1/x), and that there are now two sizes...

There are ways to template the padded vector class so that it knows a suitable default value and uses it for the padded region whenever you call resize(). Or (less nicely) one can require that the resize() method take an argument providing such a default value.
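
For example, a sketch of the first option using a hypothetical traits type (SafePaddingValue and InverseMass are illustrative, not an existing GROMACS facility):

//! Tells resize() which value is safe for the padding region.
template<typename T>
struct SafePaddingValue
{
    static T value() { return T{}; }   // zero is fine for most types
};

struct InverseMass { float m; };

//! Types for which zero is problematic (e.g. values later used as 1/x)
//! can opt into a different padding value.
template<>
struct SafePaddingValue<InverseMass>
{
    static InverseMass value() { return { 1.0F }; }  // avoids 1/0 in SIMD code
};

// resize() in the padded container would then do, e.g.:
//   data_.resize(paddedSize(n), SafePaddingValue<T>::value());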

Don't you think it will make for a cleaner and safer interface to strictly limit PaddedRVec to only be a vector where we are guaranteed there will not be any memory errors if we read/write elements up to the SIMD padding?

IIRC memory-bounds errors have not arisen since we added a PaddedVector type in the 2018 release. However, it did introduce the possibility of FP exceptions from (SIMD) operations on elements for which no suitable default value was enforced upon construction and/or resize operations. It keeps track of the unpadded size and the corresponding ArrayRef, because we have code that is still expressed in terms of that region.

All other logic deciding exactly how to zero- or value-pad the extra data, and determine how many extra data items to iterate over, would then go in the specific place where we do it.

Which value is the default is currently not a problem, so let's set that aside.

There's not really a "specific place where we do it", because we are mostly talking about the x and f buffers held in t_state, and various uses of those buffers are expressed in [0,homenr), or [0,natoms), or padded versions of these. A stable alternative, with the simpler form of PaddedVector that Erik proposes, would be to have t_state contain:

class t_state 
{
  private:
    PaddedVector<RVec> xMemory;
    PaddedVector<RVec> fMemory;
  public:
    //! Handles xMemory and fMemory and sets the below array refs, which is
    //! the convenience functionality currently in PaddedVector, and which
    //! would now be duplicated everywhere we use PaddedVector
    void resize(size_t natoms);

    ArrayRef<RVec>     xWithPadding;
    ArrayRef<RVec>     x;
    ArrayRef<RVec>     fWithPadding;
    ArrayRef<RVec>     f;
};

That doesn't stop any client using the wrong ArrayRef, of course, but it's strictly better than the old days (and can be better still if we make ArrayRef operator[] do bounds checking in debug mode).
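
For illustration, a minimal sketch of such a debug-mode bounds check (not the actual gmx::ArrayRef code):

#include <cassert>
#include <cstddef>

template<typename T>
class CheckedArrayRefSketch
{
public:
    CheckedArrayRefSketch(T* begin, std::size_t size) : begin_(begin), size_(size) {}

    //! Asserts in debug builds; compiles to a plain access when NDEBUG is defined.
    T& operator[](std::size_t i) const
    {
        assert(i < size_ && "ArrayRef index out of range");
        return begin_[i];
    }

    std::size_t size() const { return size_; }

private:
    T*          begin_;
    std::size_t size_;
};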

#40 Updated by Mark Abraham about 1 year ago

A useful simplification would be to have fewer places where we control the size of such vectors, e.g. by simplifying the code so that the DD module also handles the single-domain case.

#41 Updated by Mark Abraham about 1 year ago

In post-submit of release-2019 https://gerrit.gromacs.org/8608, after all known fixes are in, we still have instability in

complex.pull_geometry_angle in gcc-4.9 gpuhw=none cuda-10.0 openmp no-tng release-with-assert host=bs_mic
complex.pull_geometry_angle-axis in gcc-5 gpuhw=nvidia nranks=4 gpu_id=1 cuda-8.0 no-hwloc release-with-assert host=bs_nix1204

The former is a CUDA build that does not run on a GPU device, which could be consistent with use of uninitialized memory somewhere.

We made commit cc265c4d4fae58f12233f39c5b05daa937019c5d in the regressiontests repo to make these more stable. Was that ineffective because it was irrelevant, or because it was not a large enough change?

#42 Updated by Mark Abraham about 1 year ago

Also on gcc-8 openmp simd=avx2_256 gpuhw=amd opencl-1.2 clFFT-2.14 host=bs_gpu01 we had complex.pull_constraint fail.

#43 Updated by Berk Hess about 1 year ago

I don't see any error message in those failed tests.
It's not that the cut-off doesn't fit the decomposition, since a decomposition is chosen.

#44 Updated by Mark Abraham about 1 year ago

On bs-nix_amd-gpu for release-matrix build gcc-8 gpuhw=amd opencl-1.2 release we see

00:42:32.391 Abnormal return value for ' gmx mdrun -ntmpi 12      -notunepme >mdrun.out 2>&1' was 1
00:42:32.391 Retrying mdrun with better settings...
00:42:32.391 
00:42:32.391 Abnormal return value for ' gmx mdrun -ntmpi 6      -notunepme >mdrun.out 2>&1' was 1
00:42:32.391 Retrying mdrun with better settings...
00:42:32.391 FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints
00:42:32.391 Re-running orientation-restraints using CPU-based PME
00:42:32.391 Re-running pull_geometry_angle using CPU-based PME
00:42:32.391 Re-running pull_geometry_angle-axis using CPU-based PME
00:42:32.391 Re-running pull_geometry_dihedral using CPU-based PME
00:42:32.391 FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue
00:42:32.391 2 out of 55 complex tests FAILED

Incidentally, it's also rather surprising that those are the only runs that used PME on the GPU and are now re-running with the CPU (i.e. they're getting run with a single rank, unlike the other complex regression tests).

#45 Updated by Mark Abraham about 1 year ago

awh_multibias has also failed on the AMD GPU slave.

#46 Updated by Berk Hess about 1 year ago

Does someone know what the current situation is? Are all setups that are currently giving exceptions running PME on the GPU?

#47 Updated by Mark Abraham about 1 year ago

Berk Hess wrote:

Does someone know what the current situation is? Are all setups that are currently giving exceptions running PME on the GPU?

My suspicion is that the tests that still fail are those that happen to run PME with one rank, so they may be eligible to run PME on the GPU. But I have not attempted to verify that by looking at the setups, including the per-directory files in the regressiontests suites that restrict the rank count, etc.

#48 Updated by Berk Hess about 1 year ago

Is it even limited to PME with OpenCL?

#49 Updated by Mark Abraham about 1 year ago

Berk Hess wrote:

Is it even only PME with OpenCL?

I suspect not; see e.g. comment #41.

#51 Updated by Szilárd Páll about 1 year ago

  • Status changed from Resolved to Feedback wanted

Wasn't a quickie, but managed to get a backtrace of the last reported error. Looks like an FP exception in pullAllReduce, so it seems to be unrelated to this. I'll file a separate report and attach the relevant data, OK?

#52 Updated by Szilárd Páll about 1 year ago

Szilárd Páll wrote:

Wasn't a quickie, but managed to get a backtrace of the last reported error. Looks like an FP exception in pullAllReduce, so it seems to be unrelated to this. I'll file a separate report and attach the relevant data, OK?

See #2776

#53 Updated by Berk Hess about 1 year ago

  • Related to Bug #2776: FP exception in pullAllReduce added

#54 Updated by Berk Hess about 1 year ago

We have fixed issue #2776 for the FPE in the pull code, which likely caused all of the pull-test FPEs.
That leaves only the orientation restraints test FPEs. There is a bug in PME with OpenCL on NVIDIA; we disabled that combination. Since we don't seem to have FPEs in Jenkins anymore, all issues might be fixed. But some people think they saw FPEs in the orientation restraint test without PME on the GPU.

#55 Updated by Berk Hess about 1 year ago

Got another orientation restraint force mismatch on OpenCL:
http://jenkins.gromacs.org/job/Matrix_PreSubmit_2019/362/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/testReport/junit/(root)/complex/orientation_restraints/

I have not seen FP exceptions any more, though, so maybe this issue is fixed, but we need a separate issue for the OpenCL orientation restraint test failure instead.

#56 Updated by Berk Hess about 1 year ago

I see now that there is already an issue for the orientation restraint failure: #2737

#57 Updated by Paul Bauer about 1 year ago

  • Status changed from Feedback wanted to Closed

I think this can be closed now; the remaining instabilities all seem to be OpenCL-related and not floating point exceptions.

#58 Updated by Szilárd Páll about 1 year ago

Berk Hess wrote:

I see now that there is already an issue for the orientation restraint failure: #2737

This is indeed worrying. We need to get back to that PME reduction issue; more on #2702
