Bug #2584

regressiontests/complex fails on i686

Added by Christoph Junghans over 1 year ago. Updated 3 months ago.

Status: Rejected
Priority: Normal
Category: testing
Target version:
Affected version - extra info: 2018.x
Affected version:
Difficulty: uncategorized

Description

With 2018.2 on OpenSuse, but can reproduce on Fedora 29 as well:

35/39 Test #35: regressiontests/complex ..........***Failed   93.44 sec
FAILED. Check checkpot.out (27 errors) file(s) in awh_multibias for awh_multibias
1 out of 51 complex tests FAILED
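
For anyone triaging similar logs, a minimal Python sketch of pulling the failing case names out of this summary output follows. The regular expressions are assumptions based only on the three lines quoted above (other log variants may be formatted differently), and `failed_cases` is a hypothetical helper, not part of GROMACS or its regression test scripts.

```python
import re

# Sample lines copied from the report above.
LOG = """\
35/39 Test #35: regressiontests/complex ..........***Failed   93.44 sec
FAILED. Check checkpot.out (27 errors) file(s) in awh_multibias for awh_multibias
1 out of 51 complex tests FAILED
"""

# Assumed line shape: "FAILED. Check <files> file(s) in <dir> for <case>"
FAILED_CASE = re.compile(
    r"^FAILED\. Check (?P<files>.+?) file\(s\) in (?P<dir>\S+) for (?P<case>\S+)$"
)
# Assumed shape of each file entry: "<name> (<N> errors)"
FILE_WITH_COUNT = re.compile(r"(?P<name>\S+) \((?P<errors>\d+) errors\)")


def failed_cases(log_text):
    """Return a list of (case name, {file name: error count}) tuples."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_CASE.match(line.strip())
        if match is None:
            continue
        files = {
            entry.group("name"): int(entry.group("errors"))
            for entry in FILE_WITH_COUNT.finditer(match.group("files"))
        }
        results.append((match.group("case"), files))
    return results


if __name__ == "__main__":
    for case, files in failed_cases(LOG):
        print(case, files)
```

Applied to a full build log, this would list only the individual cases that failed, which is useful when a single case out of the whole complex suite is the problem.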

_log.txt (14.2 MB) Christoph Junghans, 07/23/2018 02:36 PM
build.log.txt (36.2 MB) Christoph Junghans, 07/25/2018 05:12 PM

History

#1 Updated by Szilárd Páll over 1 year ago

I can't reproduce with gcc 8.1 (built from source) and the cmake configure line from your log. Is the failure reproducible?

#2 Updated by Christoph Junghans over 1 year ago

I am having a hard time reproducing it on Gentoo as well.

Easiest might be to create an account at https://build.opensuse.org/ and then branch https://build.opensuse.org/package/show/science/gromacs and debug it there.

#3 Updated by Szilárd Páll over 1 year ago

Sounds just a bit too tedious as I've no idea how build.opensuse.org works. However, we might have some OpenSUSE users, so perhaps we can check on their machines in the near future.

In the meantime, any idea what the difference is between a "vanilla" gcc 8.1 and the OpenSUSE gcc 8.1.1?

#4 Updated by Christoph Junghans over 1 year ago

It also fails on i686 on f28 in https://koji.fedoraproject.org/koji/taskinfo?taskID=28595483 (build log attached). You should be able to reproduce this more easily using a Fedora f28 docker container or fedora mock (see the command at the beginning of the log).

#5 Updated by Szilárd Páll over 1 year ago

  • Status changed from New to Accepted

Confirmed on Fedora 28 with gcc 8.0.1 and 8.1.1 (after update). The test that fails does both 2D bias and checkpointing (in comparison with the other 1D awh test), so initially it was not clear whether the former or the latter is the issue, but AWH frames from continuation step 0 seem correct, so it seems the issue is not in checkpoint/restart with AWH.

#6 Updated by Szilárd Páll over 1 year ago

PS: x86_64 is not affected, neither with vanilla gcc 8.1 nor on F28 x86_64, so it seems to be a 32-bit (or 32-bit + gcc 8) issue; if so, I suggest postponing if it's not found with moderate effort before 2018.3.

#7 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '2' for Issue #2584.
Uploader: Szilárd Páll
Change-Id: regressiontests~release-2018~I60a525c51e9d39c2b4c66cb55f07b5fee93a96a8
Gerrit URL: https://gerrit.gromacs.org/8149

#8 Updated by Szilárd Páll over 1 year ago

In the F28 i686 VM I could repro the issue also with gcc 7.3 built from source; I checked multiple optimization levels too with gcc 8.1, as well as with/without SIMD, so it looks like something is incorrect or consistently mis-compiled in the AWH code (at least with multibias).

Scanning the code I had no further ideas, so tips would be welcome.

#9 Updated by Paul Bauer over 1 year ago

Hello, any chance for a fix for 2018.3? Otherwise I'll bump it to the next patch release.

#10 Updated by Szilárd Páll over 1 year ago

  • Target version changed from 2018.3 to 2018.4

Paul Bauer wrote:

Hello, any chance for a fix for 2018.3? Otherwise I'll bump it to the next patch release.

Haven't found the problem yet, so let's bump it.

What I do know is that it looks like a memory corruption as the results become correct if I run in valgrind.

#11 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2584.
Uploader: Szilárd Páll
Change-Id: regressiontests~master~I60a525c51e9d39c2b4c66cb55f07b5fee93a96a8
Gerrit URL: https://gerrit.gromacs.org/8270

#12 Updated by Szilárd Páll over 1 year ago

Update: as a month has passed and I've neither found the issue nor have I had time to debug further, this risks not making it into 2018.4 either. Can anyone else have a look? Any idea if other memchecker tools can be made functional (IIRC clang tools don't work in 32-bit)?

#13 Updated by Berk Hess over 1 year ago

I tried gcc 7.3.1 in 32-bit on OpenSuse and the results look correct to me.

#14 Updated by Berk Hess over 1 year ago

Adding SIMD=none allows me to run with valgrind: no warnings and still correct results.

#15 Updated by Paul Bauer about 1 year ago

Hello Christoph, is this still an issue?

#16 Updated by Christoph Junghans about 1 year ago

I just did a rebuild on Fedora30, https://koji.fedoraproject.org/koji/taskinfo?taskID=30691828, let's see what happens on i686 for the complex test.

#17 Updated by Christoph Junghans about 1 year ago

It seems to be gone on Fedora 30, but let me do a build on F28 as well.

#19 Updated by Christoph Junghans about 1 year ago

Ok, still persists in 2018.3 on i686 under Fedora 28!

#20 Updated by Christoph Junghans about 1 year ago

And still reproducible on OpenSuse Tumbleweed.

#21 Updated by Paul Bauer about 1 year ago

@Berk, any more ideas here?

#22 Updated by Mark Abraham about 1 year ago

  • Status changed from Accepted to Closed

I think we've investigated enough. It could be a real issue that's only exposed on this platform, but unless we have a repro environment and case that we can debug, then we should close the issue and move on. Redmine searches will still find it if we want it later.

#23 Updated by Christoph Junghans about 1 year ago

Klaus Kaempf from OpenSUSE sent some details about why it fails to Szilard a while back.

Considering the fact that it was recently reproduced independently on two distros, OpenSuse and Fedora 28, I would really vote to not close this.

#24 Updated by Christoph Junghans about 1 year ago

  • Status changed from Closed to Accepted
  • Target version changed from 2018.4 to 2019

Still persists with 2019-rc1 on OpenSuse.

#25 Updated by Paul Bauer about 1 year ago

  • Target version changed from 2019 to 2019.1

Bumped to 2019.1, unlikely to be fixed before the release. Could you please share the extra information also on Redmine?

#26 Updated by Christoph Junghans about 1 year ago

  • Affected version changed from 2018.2 to 2019-rc1

There was an email discussion between Szilard and Klaus trying to nail it down, concluding with Szilard being able to reproduce the issue on Fedora 28.

On OpenSuse, I just did a test build of 2019-rc1 with regression tests enabled and the original issue popped right back up.

#27 Updated by Mark Abraham about 1 year ago

  • Related to Bug #2813: regressiontests/complex fails on Fedora30 with x86_64, i686 and other archs. added

#28 Updated by Szilárd Páll about 1 year ago

  • Related to deleted (Bug #2813: regressiontests/complex fails on Fedora30 with x86_64, i686 and other archs.)

#29 Updated by Roland Schulz 12 months ago

Any updates on this?

#30 Updated by Mark Abraham 12 months ago

  • Assignee set to Szilárd Páll

Assigning Szilárd so we get some information to inform our decisions.

#31 Updated by Szilárd Páll 12 months ago

  • Affected version - extra info set to 2018.x

No updates, I did not have time to look further and won't have time before the target date of 2019.1. Unless someone else can look into it, better to postpone it.

My previous findings should be a good lead for anyone to pick up the thread; i.e. 32-bit arch with gcc 7 or 8 repros the corruption, and valgrind makes it disappear (AFAIK ASAN did not work on 32-bit Linux for some reason).

#32 Updated by Paul Bauer 12 months ago

I continued looking at this, but I'm still not closer to solving this. From the output of checkpot it is clear that the energy values are not getting written to the energy file. This is not due to the number of blocks being different or the block data size not being the same. I still can't find where the initial values are getting scrambled. There is no error from valgrind, just a number of expected leaks.
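
To illustrate the distinction drawn here (structure matches, values do not), a minimal Python sketch of such a comparison is given below. It assumes each energy block is simply a list of floats. The `compare_blocks` helper, the tolerances, and the sample numbers are all hypothetical and do not reflect how the GROMACS checking tools are implemented.

```python
import math


def compare_blocks(reference, test, rel_tol=1e-5, abs_tol=1e-12):
    """Classify the first difference between two lists of energy blocks.

    Each block is assumed to be a list of floats (a hypothetical layout).
    """
    if len(reference) != len(test):
        return "block count differs"
    for i, (ref_block, test_block) in enumerate(zip(reference, test)):
        if len(ref_block) != len(test_block):
            return f"block {i}: size differs"
        for j, (a, b) in enumerate(zip(ref_block, test_block)):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return f"block {i}, value {j}: {a} != {b}"
    return "match"


if __name__ == "__main__":
    # Made-up numbers: same block count and sizes, one value differs.
    reference = [[1.0, 2.0], [3.0]]
    scrambled = [[1.0, 0.0], [3.0]]
    print(compare_blocks(reference, scrambled))
```

A result of the third kind (same block count, same sizes, different values) corresponds to the situation described in this comment.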

#33 Updated by Paul Bauer 12 months ago

I managed to run with an ASAN build, but there were no reports there either.

#34 Updated by Mark Abraham 12 months ago

  • Target version changed from 2019.1 to 2019.2

Postponed

#35 Updated by Mark Abraham 10 months ago

  • Status changed from Accepted to Rejected

@Christoph Can you consider removing any 32-bit platforms from any automated testing by distros, please? GROMACS does not intend to support them, and it's hard for us to understand whether any problem is a quirky platform, a genuine issue where 32-bit doesn't work with our code, or a genuine issue that affects 64-bit that we somehow don't notice on such platforms.

We don't have any leads or clues, and we can't tell whether this is particular to 32-bit platforms, so rejecting this for now.

Please reopen if there's a consistent report of a problem on a 64-bit platform.

#36 Updated by Szilárd Páll 3 months ago

Mark Abraham wrote:

@Christoph Can you consider removing any 32-bit platforms from any automated testing by distros, please? GROMACS does not intend to support them, and it's hard for us to understand whether any problem is a quirky platform, a genuine issue where 32-bit doesn't work with our code, or a genuine issue that affects 64-bit that we somehow don't notice on such platforms.

We don't have any leads or clues, and we can't tell whether this is particular to 32-bit platforms, so rejecting this for now.

Correction: we know and have reproduced the issue in a 32-bit VM with multiple compilers (although only with a single distro/libc version).

@Christoph: not sure what actions have been taken, but I would encourage disabling only the specific test that fails -- we do gain from more testing, even if that excludes some test cases.
