Bug #1559

Writing TNG files fails on Xeon Phi

Added by Roland Schulz over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: core library
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The unit tests:
  • mdrun-test CanWrite/Trajectories.ThatDifferInNstxout/*
  • legacy-tools-test NoFatalErrorWhenWritingFrom/TrjconvWithIndexGroupSubset.WithDifferentInputFormats/*

fail. And "gmx trjconv" fails with segfault if the output is TNG. This is without zlib. But running the tests with the same compiler (ICC14.0.2) on the CPU or without zlib seems fine.


Related issues

Related to GROMACS - Bug #1542: two unit tests fail on 32-bit ARM (Closed, 2014-06-30)
Related to GROMACS - Bug #1546: Legacy tools tests failed on armv7a with SIGBUS (Closed, 2014-07-02)
Related to GROMACS - Bug #1547: MDRUN tools tests failed on armv7a with SIGBUS (Closed, 2014-07-02)

Associated revisions

Revision 898166c3 (diff)
Added by Magnus Lundborg over 3 years ago

Fixed TNG memory alignment problem and reset pointers.

This is a temporary fix to the memory alignment problems on some
platforms. In the main TNG repository the whole I/O system is
rewritten to address this problem. This fix is to avoid the
problems without making too large changes.

There are also some pointers that were not reset after memory
was freed. This is also fixed in here.

This commit does not correspond to any commit in the TNG
repository.

Fixes #1542, #1546, #1547 and #1559.

Change-Id: I90a6406cccbc43fd57d4423c2b661019cf7763e8
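
The second part of the commit message (pointers not reset after their memory was freed) can be illustrated with a minimal sketch. This is not the code from the commit: the struct `frame_buffer` and the functions `frame_buffer_alloc` and `frame_buffer_release` are hypothetical names chosen for illustration. The idea is only that a release routine sets the pointer to NULL after calling free(), so that a later release or NULL check behaves predictably instead of touching freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container; not a type from the TNG library. */
struct frame_buffer {
    float  *values;   /* heap-allocated payload, or NULL when empty */
    size_t  n_values; /* number of floats in values */
};

/* Allocate room for n floats. Returns 0 on success, -1 on failure. */
static int frame_buffer_alloc(struct frame_buffer *fb, size_t n)
{
    assert(fb != NULL);
    fb->values = calloc(n, sizeof *fb->values);
    fb->n_values = fb->values ? n : 0;
    return fb->values ? 0 : -1;
}

/* Free the payload and reset the pointer so it cannot dangle. */
static void frame_buffer_release(struct frame_buffer *fb)
{
    assert(fb != NULL);
    free(fb->values);   /* free(NULL) is a no-op, so repeated calls are safe */
    fb->values = NULL;  /* reset: later code can test for NULL */
    fb->n_values = 0;
}
```

With the reset in place, calling the release function twice on the same object is harmless, whereas without it the second call would pass an already freed pointer to free().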

History

#1 Updated by Mark Abraham over 3 years ago

Roland Schulz wrote:

The unit tests:
  • mdrun-test CanWrite/Trajectories.ThatDifferInNstxout/*
  • legacy-tools-test NoFatalErrorWhenWritingFrom/TrjconvWithIndexGroupSubset.WithDifferentInputFormats/*

fail. And "gmx trjconv" fails with segfault if the output is TNG.

I don't think it makes sense to have I/O code run anywhere near a Phi. Presumably to reproduce one needs to do cmake .. -DCMAKE_TOOLCHAIN_FILE=Platform/XeonPhi, but what was the run-time setup?

This is without zlib. But running the tests with the same compiler (ICC14.0.2) on the CPU or without zlib seems fine.

One of those "without" should be "with?"

#2 Updated by Roland Schulz over 3 years ago

Mark Abraham wrote:

I don't think it makes sense to have I/O code run anywhere near a Phi. Presumably to reproduce one needs to do cmake .. -DCMAKE_TOOLCHAIN_FILE=Platform/XeonPhi, but what was the run-time setup?

In native mode the I/O has to happen on the Phi, and native mode is currently the only supported option (we are still working on offload). The next generation will also be self-hosted. Even if we decide that we don't care about TNG on Phi, the fact that this isn't working hints at a bug that will probably be present on some other architectures too. The Phi card has a standard Linux environment and I'm actually surprised that it isn't working. All other GROMACS I/O, unit tests and regression tests work without any problems (with no changes).

No special run-time setup is needed. To make it easy to run the unit tests, the source and build folders should be located in a directory that is mounted at the same location on the Phi (e.g. /data/gromacs or /home/.../gromacs on both host and MIC). The reason is that the refdata classes use CMAKE_SOURCE_DIR to find their XML files.

This is without zlib. But running the tests with the same compiler (ICC14.0.2) on the CPU or without zlib seems fine.

One of those "without" should be "with?"

No. I tested on a standard CPU that the unit tests pass when one disables zlib. I also tested that the same compiler version is fine on a standard CPU.

#3 Updated by Magnus Lundborg over 3 years ago

  • Assignee set to Magnus Lundborg

Hopefully this is related to the other TNG issues. I hope to have a fix ready quite soon.

#4 Updated by Magnus Lundborg over 3 years ago

  • Related to Bug #1542: two unit tests fail on 32-bit ARM added

#5 Updated by Magnus Lundborg over 3 years ago

  • Related to Bug #1546: Legacy tools tests failed on armv7a with SIGBUS added

#6 Updated by Magnus Lundborg over 3 years ago

  • Related to Bug #1547: MDRUN tools tests failed on armv7a with SIGBUS added

#7 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1559.
Uploader: Magnus Lundborg
Change-Id: I96e0704d3858264ca918603bf1d7e3b27b4db7ea
Gerrit URL: https://gerrit.gromacs.org/3799

#8 Updated by Roland Schulz over 3 years ago

Yes, this patch fixes it. It segfaulted in quantize_float because the compiler assumed that the float* passed to it is 4-byte aligned.
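
To illustrate the kind of problem described here, the following is a minimal sketch, not the actual TNG code or the patch. The helper `read_float_unaligned` is a hypothetical name. It assumes float data sits at an arbitrary byte offset inside a byte buffer. Casting such an address to float* and dereferencing it is only safe if the address is suitably aligned for float, so the sketch copies the bytes into a properly aligned local with memcpy instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of TNG): read one float that starts at
 * byte `offset` inside `buf`. The address buf + offset may not be
 * aligned for float, so the bytes are copied into an aligned local
 * rather than read through a cast such as *(const float *)(buf + offset).
 */
static float read_float_unaligned(const unsigned char *buf, size_t offset)
{
    float value;

    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

For example, if a float is stored one byte into a buffer (say, after a 1-byte header), read_float_unaligned(buf, 1) returns it without ever forming a misaligned float*. Compilers typically turn such a small fixed-size memcpy into whatever load sequence is safe on the target.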

#9 Updated by Magnus Lundborg over 3 years ago

  • Status changed from New to Fix uploaded

Thanks for the feedback.

#10 Updated by Roland Schulz over 3 years ago

Was the previous code valid C code, or is it OK for the C compiler to assume 4-byte alignment? If it was valid, it might be good to report a compiler bug.

#11 Updated by Teemu Murtola over 3 years ago

  • Category set to core library
  • Target version set to 5.1

#12 Updated by Roland Schulz over 3 years ago

  • Status changed from Fix uploaded to Closed
