Project

General

Profile

Bug #2088

affinity setting code issues spurious warning notes

Added by Szilárd Páll over 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
simple
Close

Description

$  gmx mdrun -quiet -v -nsteps 0 -ntomp  -ntmpi 2 -nb cpu
[...]
Overriding nsteps with value passed on the command line: 0 steps, 0 ps

Using 2 MPI threads
Using 1 OpenMP thread per tMPI thread

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin thread to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

NOTE: Thread affinity setting failed. This can cause performance degradation.
      If you think your settings are correct, ask on the gmx-users list.

[...]

We noted that we won't be pinning by default, so the code shouldn't try to pin, but even if it does it shouldn't fail (no warnings with -pin on or all cores used). It seems that something broke between the 5.1 and '16 release.


Related issues

Has duplicate GROMACS - Bug #2319: Affinity setting fails when only starting a single threadClosed

Associated revisions

Revision 7bc4ec9b (diff)
Added by Berk Hess over 1 year ago

Tone down note about not pinning threads

The note mdrun prints to stderr and log file when not pinning
threads is now a short note instead of the old failure message,
since not pinning is often a choice, not a failure. With more than
one thread, a failure or longer explanatory note will anyhow have
been issued before.
Also removed a warning about OpenMP set affinity when the total
number of threads was chosen by mdrun.
Updated the thread affinity tests for the new message. Added tests
that are now possible with the recently introduced auto thread count
variable.

Fixes #2088 and #2319

Change-Id: I1411a1ce6e222d22da8d70bf7bab2c9bb7564507

History

#1 Updated by Szilárd Páll over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Szilárd Páll
  • Target version set to 2016.2
  • Difficulty simple added
  • Difficulty deleted (uncategorized)

Looks like the path that took an an early return got turned into a false result in gmx_set_thread_affinity().

#2 Updated by Teemu Murtola over 2 years ago

This was a (semi-)intentional change that was also explicitly mentioned in the commit message here: https://gerrit.gromacs.org/5820

I would rather try to formulate the messages differently if you are not happy with the current wording, rather than try to make the code much more complicated to have the ranks collectively agree on a message that should be produced out of several different possibilities.

#3 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2088.
Uploader: Szilárd Páll ()
Change-Id: gromacs~release-2016~Ia2acbff74c18f7fe93e96cda08ab72d47ccfe3b5
Gerrit URL: https://gerrit.gromacs.org/6362

#4 Updated by Szilárd Páll over 2 years ago

I see now that this can deadlock. The output however looks rather confusing. We're reporting about an error (with emphasis) that in fact is a perfectly normal, default behavior that's triggered pretty often. Personally, I'd rather have duplicate messages than emitting a spurious message about an error that's not an error.

Perhaps we should move the consistency checks that have nothing to do with the role of get_thread_affinity_layout() somewhere else (and use barriers?).

#5 Updated by Mark Abraham over 2 years ago

Without logic changes, to handle this case, we can make the code produce

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not set thread affinities.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

NOTE: Thread affinities were not set. This can cause performance degradation.
      If you think your settings are correct, ask on the gmx-users list.

That also seems reasonably sound for other cases. If set_affinity failed, it already produces stderr output (which perhaps we should reconsider to go to mdlog.warning).

But I suggest Szilard and I both review https://gerrit.gromacs.org/#/q/topic:thread-affinity-tests+status:open before we consider changing any logic. Then, one reasonable way to reach a stable solution for such issues is to actually have unit tests that trigger the various conditions and have the user-facing output in refdata and checkable. Then reviewers get to see which set of inputs produces which outputs and we can hope to cover everything in a way that a user might find intelligible and actionable.

#6 Updated by Mark Abraham over 2 years ago

  • Target version changed from 2016.2 to 2016.3

#7 Updated by Mark Abraham over 2 years ago

Mark Abraham wrote:

But I suggest Szilard and I both review https://gerrit.gromacs.org/#/q/topic:thread-affinity-tests+status:open before we consider changing any logic. Then, one reasonable way to reach a stable solution for such issues is to actually have unit tests that trigger the various conditions and have the user-facing output in refdata and checkable. Then reviewers get to see which set of inputs produces which outputs and we can hope to cover everything in a way that a user might find intelligible and actionable.

Those patches are now merged, so I propose that we shift the target for any fixes for this issue to 2017 release.

#8 Updated by Mark Abraham over 2 years ago

  • Target version changed from 2016.3 to 2018

#9 Updated by Mark Abraham over 1 year ago

  • Has duplicate Bug #2319: Affinity setting fails when only starting a single thread added

#10 Updated by Szilárd Páll over 1 year ago

Looks like we're lucky as Berk's recent change 9648979634bcca0014299f05ce46ecf83dc4b01d introduced a boolean that should allow adding a condition to avoid emitting this message when not necessary.

Will look into this tonight either for beta2 or if time is short for beta3 we should be able to fix.

#11 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2088.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I1411a1ce6e222d22da8d70bf7bab2c9bb7564507
Gerrit URL: https://gerrit.gromacs.org/7291

#12 Updated by Berk Hess over 1 year ago

  • Status changed from In Progress to Fix uploaded

This is easy to fix the 2018 release because there I added hw_opt->totNumThreadsIsAuto.

#13 Updated by Erik Lindahl over 1 year ago

  • Status changed from Fix uploaded to Resolved

#14 Updated by Erik Lindahl over 1 year ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF