Project

General

Profile

Bug #2326

double precision avx(2)_128_fma and sse4.1 broken

Added by Mark Abraham about 2 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
core library
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

With clang-5.0.0 in double and with AVX_128_FMA (e.g. gpu07 or gpu08)

[----------] 3 tests from ArrayRefTest/2, where TypeParam = gmx::ArrayRef<gmx::SimdDInt32>
[ RUN      ] ArrayRefTest/2.ConstructFromPointersWorks

-------------------------------------------------------
Program:     simd-test, version 2018-beta1-dev-20171204-a89205b
Source file: src/gromacs/simd/simd_memory.h (line 204)
Function:    gmx::internal::SimdArrayRef<gmx::SimdDInt32>::SimdArrayRef(gmx::internal::SimdArrayRef::pointer, gmx::internal::SimdArrayRef::pointer) [U = gmx::SimdDInt32]

Assertion failed:
Condition: reinterpret_cast<size_type>(end)%sizeof(value_type) == 0
Size of ArrayRef needs to be divisible by type size

Other tests also fail, e.g. ewald-test

[ RUN      ] SaneInput/PmeGatherTest.ReproducesOutputs/24

-------------------------------------------------------
Program:     ewald-test, version 2018-beta1-dev-20171204-a89205b
Source file: src/gromacs/ewald/pme-gather.cpp (line 222)
Function:    typename std::enable_if<Order == 4 || Order == 5, RVec>::type do_fspline::operator()(std::integral_constant<int, Order>) const [Order = 4]

Assertion failed:
Condition: gridNZ % 4 == 0
For aligned SIMD4 operations the grid size has to be padded up to a multiple
of 4

but i'm not yet sure if these are related

Associated revisions

Revision 19a1ac53 (diff)
Added by Berk Hess about 2 years ago

Fix pme gather in double with AVX_128

The 4NSIMD PME gather change did not change the conditional
for grid alignment. This is made consistent here.
Note that the 4NSIMD change lowered the performance of PME gather
on AVX_128_FMA and AVX2_128 in double precision. We should consider
using 256-bit AVX for double precision instead.

Refs #2326

Change-Id: I07bfb3ca8d334bce18ed0b6989405bbc02c25b7b

Revision 7569ef6d (diff)
Added by Roland Schulz about 2 years ago

Fix ArrayRef<SimdDInt32> for SSE/AVX128

Fixes #2326

Change-Id: I0af16deb658984fb9d9cbf0cfaf9926ff479f9af

Revision 80dd3f5b (diff)
Added by Mark Abraham almost 2 years ago

Update double-precision test configurations

These changes improve coverage of double precision, using more release
mode, particularly with latest gcc and icc, and using 128-bit SIMD,
which have been cases that were buggy recently. The other aspects of
the configurations that have been modified have been
non-critical. Where appropriate, brief rationales are recorded. This
resolves an old TODO item in the post-submit matrix.

Fixed a sign mismatch in initializing an OpenCL variable that didn't
need to be initialized.

Noted relevant new TODOs.

Refs #2300, #2325, #2326, #2334, #2335, #2336, #2337, #2338

Change-Id: I131fa1a6776d1e7809799c3f931a1fc8100fcdc9

History

#1 Updated by Aleksei Iupinov about 2 years ago

Alright, so I don't know much about SIMD, but I can dump what I understand from running ewald-test in this build.

GMX_SIMD_DOUBLE_WIDTH is 2 on AVX_128_FMA, which sets GMX_SIMD_HAVE_4NSIMD_UTIL_DOUBLE to 0 (as opposed to float version). That disables the order 4 unaligned flavor of PME gather. The other specific flavor that is getting picked and asserts
is order 4/5 aligned flavor. The grid however was not aligned in set_grid_alignment(), because PME_SIMD4_UNALIGNED is 1, because GMX_SIMD4_HAVE_REAL is 1. I guess the discrepancy lies within use of GMX_SIMD_HAVE_4NSIMD_UTIL_REAL and GMX_SIMD4_HAVE_REAL?

Also, GMX_SIMD4_HAVE_DOUBLE has a comment "Uses 256-bit avx for SIMD4-double".

#2 Updated by Paul Bauer about 2 years ago

So, I tried reproducing this and the ewald-test error also happens on my machine with clang-3.8 and SIMD forced to use AVX_128_FMA. Still checking for the first one.

#3 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2326.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I07bfb3ca8d334bce18ed0b6989405bbc02c25b7b
Gerrit URL: https://gerrit.gromacs.org/7282

#4 Updated by Berk Hess about 2 years ago

  • Subject changed from double precision avx_128_fma broken to double precision avx(2)_128_fma broken
  • Status changed from New to In Progress

I pushed up a fix for the PME grid alignment.
The SimdArrayRef change seems completely unrelated to me. Could it be because simdDInt32 contains 32 bit integers but uses 64 bit per element.

#5 Updated by Berk Hess about 2 years ago

  • Status changed from In Progress to Resolved

#6 Updated by Mark Abraham about 2 years ago

  • Status changed from Resolved to Accepted

ewald-test issue is resolved, but simd-test issue remains

#7 Updated by Mark Abraham about 2 years ago

  • Subject changed from double precision avx(2)_128_fma broken to double precision avx(2)_128_fma and sse4.1 broken

Also SSE4.1 breaks the same way

#8 Updated by Roland Schulz about 2 years ago

The problem is that simdWidth (for DInt32 with SSE2/AVX128 it's 2) and storageWidth=sizeof(SimdTrait<T>::type)/sizeof(T) (sizeof(__m128i)/sizeof(int32)=4) is different. simdMemory and the test needs to use the same everywhere for things to be correct. I'm not sure which one should be used though. Should a ArrayRef<DInt32> access an array in increments of 4 int32 and ignore two of them or in increment of 2 int32? The first case seems bad because it requires the user code to be aware of not only simdWidth but also storageWidth. The 2nd requires load/stores of the part of the vector. E.g. SSE we would need to change _mm_load_si128 to _mm_loadl_epi64.

#9 Updated by Mark Abraham about 2 years ago

Roland Schulz wrote:

The problem is that simdWidth (for DInt32 with SSE2/AVX128 it's 2) and storageWidth=sizeof(SimdTrait<T>::type)/sizeof(T) (sizeof(__m128i)/sizeof(int32)=4) is different. simdMemory and the test needs to use the same everywhere for things to be correct. I'm not sure which one should be used though. Should a ArrayRef<DInt32> access an array in increments of 4 int32 and ignore two of them or in increment of 2 int32? The first case seems bad because it requires the user code to be aware of not only simdWidth but also storageWidth. The 2nd requires load/stores of the part of the vector. E.g. SSE we would need to change _mm_load_si128 to _mm_loadl_epi64.

Agreed, I spent quite some time looking at this. The use cases for SimdInt32 are internal to table interpolation and exclusion masking, and in those uses the details of the widths are also internal because it depends on how int-to-float conversions have to work for the different SIMD implementations (see docs of SimdInt32 in simd.h). I think that there's no need to have SimdArrayRef<SimdInt32> because there's no need to expose such a type in an interface, so we can resolve this simply by supporting only real types with SimdArrayRef. If we ever need SimdInt in some kind of natural sense, then we should support SimdArrayRef with that - but SimdInt32 is not that type.

#10 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2326.
Uploader: Roland Schulz ()
Change-Id: gromacs~release-2018~I0af16deb658984fb9d9cbf0cfaf9926ff479f9af
Gerrit URL: https://gerrit.gromacs.org/7285

#11 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2326.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~I7ba06bc64f83ee61e3c52000d28852fabcba635e
Gerrit URL: https://gerrit.gromacs.org/7286

#12 Updated by Mark Abraham about 2 years ago

  • Target version changed from 2018 to 2018-beta2

#13 Updated by Roland Schulz about 2 years ago

I looked this up incorrectly. Looked into the float rather than the double implementation. The double implementation is already using _mm_loadl_epi64. Thus I uploaded the 2nd solution because it doesn't seem to have any downside.

#14 Updated by Mark Abraham about 2 years ago

Roland Schulz wrote:

I looked this up incorrectly. Looked into the float rather than the double implementation. The double implementation is already using _mm_loadl_epi64. Thus I uploaded the 2nd solution because it doesn't seem to have any downside.

OK. We can certainly have both Roland's fix, which as Roland notes on my patch could be useful also for Simd4Real, and mine that removes the specialization that is not yet known to be useful.

#15 Updated by Roland Schulz about 2 years ago

  • Status changed from Accepted to Resolved

#16 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2326.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~I131fa1a6776d1e7809799c3f931a1fc8100fcdc9
Gerrit URL: https://gerrit.gromacs.org/7303

#17 Updated by Mark Abraham about 2 years ago

  • Status changed from Resolved to Closed

#18 Updated by Erik Lindahl almost 2 years ago

Just a comment about the integer width, for the record:

This is unfortunately a property that varies widely between available SIMD architectures. Many of them will only support 32-bit integers when converting (even from double), and that means 64-bit integers were out.
The internal organization and storage width of the different SIMD register is deliberately NOT part of the API definition. QPX for instance uses SIMD registers that can hold either 4 floats, 4 doubles, or 4 integers.

All SIMD memory operations we provide operate on the number of integers specified by the respective SIMD width. Whether the internal SIMD register actually has more bits is not exposed to the user. For this to work, we have also said that users are not permitted to start doing memory references/dereferencing of the SIMD datatypes themselves to work around that - all memory access operations must be done with the load/store methods we provide.

There will of course be architectures where this might require less-than-optimal operations to load or extract integers (IIRC; we even comment about that somewhere and tell users to be highly restrictive), which is the price we pay for a general API. However, if we ever expose the storage width through any traits, I suspect that's a strong sign we are starting to violate the encapsulation of the SIMD module.

#19 Updated by Roland Schulz almost 2 years ago

The rest makes sense but what do you mean with "The [] storage width of the different SIMD register is deliberately NOT part of the API definition". Do you mean how it is stored in the register or in memory? I can't see how it is internally stored in the register would ever matter. And I don't see how we don't expose how it is stored in memory. We have load/store operations which take an address and operates on 32*simd-width bits. In case you are talking about memory, how is that not defining a storage width? Note that the test-case which failed was about memory storage width being different from register size (the later should never be used - which is why I considered the failure to expose an underlying problem which mattered even though it got only exposed for a type we don't actual want to use). If it isn't correct to assume that the memory storage size is 32*simd-width bits for the int type then the int type shouldn't be used for test-coverage either.

#20 Updated by Roland Schulz almost 2 years ago

PS: It is easy to misuse it because it is already exposed. Because sizeof(SimdReal) and sizeof(SimdInt32) does expose the SIMD register width in bytes (including unused elements) which should never be used. I don't think it is possible for us to prohibit the user to call sizeof on these types. Maybe we should add a comment to simd.md that this should never be done.

Also available in: Atom PDF