Bug #2327

AVX2_128 and AVX_128_FMA double precision PME gather regression

Added by Szilárd Páll about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

The SIMD4 change broke the 128-wide gather implementation (see #2326), and the fix currently provided requires two 256-bit loads, which will penalize performance on these platforms.

Potential solutions:
- fix the actual issue by re-introducing the old code that did not use the full SIMD width
- consider switching to 256-bit AVX2, at least for Zen (which is anyway very close in total performance, even with nonbondeds on the CPU)


Related issues

Related to GROMACS - Task #2328: assess AVX2 128 vs 256 on AMD Zen and change defaults (Closed)

Associated revisions

Revision 868b62c1 (diff)
Added by Roland Schulz about 2 years ago

Support Simd4N for SimdRealWidth<4

If the SIMD width is smaller than 4 but Simd4 is supported,
then use Simd4 for Simd4N.

Also move Traits into the internal namespace to signify that they
are not intended for use outside of the simd module.

Fixes #2327

Change-Id: I3d49c57cebc5d565df442d01e322c89312771699
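
A minimal, self-contained sketch of the dispatch this commit message describes. The stand-in types and the Simd4NTraits/simdRealWidth names are illustrative assumptions, not the actual GROMACS code:

    #include <type_traits>

    constexpr int simdRealWidth = 2;  // e.g. AVX_128_FMA in double precision

    struct SimdReal  {};  // stand-in for the full-width SIMD type
    struct Simd4Real {};  // stand-in for the dedicated 4-wide SIMD type

    namespace internal {
    // Kept in an internal namespace, mirroring the commit's move of
    // Traits out of the public surface of the simd module.
    template<int width>
    struct Simd4NTraits
    {
        // If the SIMD width is smaller than 4 but Simd4 is supported,
        // use Simd4 for Simd4N; otherwise use the native SIMD type.
        using type = std::conditional_t<(width < 4), Simd4Real, SimdReal>;
    };
    } // namespace internal

    using Simd4NReal = internal::Simd4NTraits<simdRealWidth>::type;

    static_assert(std::is_same_v<Simd4NReal, Simd4Real>,
                  "with width < 4, Simd4N falls back to Simd4");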

History

#1 Updated by Berk Hess about 2 years ago

  • Subject changed from AVX2_128 and AVX_128_FMA PME gather regression to AVX2_128 and AVX_128_FMA double precision PME gather regression

Are you confusing two issues here? In single precision there is no regression, only in double (updated the subject for this).
I assume you still only tested single precision. In double 256 might be faster.

I was looking at the code, but it's not trivial to switch kernels at runtime. We have to handle correctly, e.g., the situation where we might have mixed Intel and Threadripper nodes (not that we care about performance there, but we do care about correctness and avoiding crashes).

#2 Updated by Szilárd Páll about 2 years ago

Berk Hess wrote:

Are you confusing two issues here? In single precision there is no regression, only in double (updated the subject for this).
I assume you still only tested single precision. In double 256 might be faster.

No, I was not, just forgot to mention that it's in double.

#3 Updated by Szilárd Páll about 2 years ago

  • Related to Task #2328: assess AVX2 128 vs 256 on AMD Zen and change defaults added

#4 Updated by Berk Hess about 2 years ago

PME gather got 15% slower on Threadripper in double with AVX2_128 (measured by comparing with an AVX2_256 build). I used default flags, so -march might have affected these results.
AVX2_128 has slightly faster PME (tabulated) non-bonded kernels, so it is faster overall.

#5 Updated by Roland Schulz about 2 years ago

The underlying problem is that PME gather doesn't have a 2-wide SIMD implementation (which would benefit not just AVX_128 but also SSE). This would be easy with the Pack4 type suggested in the discussion on the original 4N patch. If a potential 2-wide implementation wouldn't be faster than the original 4-wide (Simd4) implementation, then this is also an issue of different SIMD widths being optimal in different parts of the code (which we probably have for AVX512 too).
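
For illustration only, a sketch of what such a Pack4 type could look like; Pack4 was only proposed, never merged, so the layout below is purely an assumption:

    // Hypothetical Pack4: a group of 4 values stored in however many
    // native-width registers it takes, so the same code scales from
    // 2-wide (SSE/AVX_128 double) up to wider SIMD, with no dedicated
    // 4-wide type needed.
    template<typename SimdT, int simdWidth>
    struct Pack4
    {
        static constexpr int nRegisters = (4 + simdWidth - 1) / simdWidth;
        SimdT reg[nRegisters];  // two registers when simdWidth == 2, one when >= 4
    };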

Obviously Pack4 is not something that can go into 2018. If the temporary solution for 2018 isn't to use AVX256, then we could either:
1) Have PME gather use Simd4NReal instead of SimdReal, where Simd4NReal is a typedef for SimdReal everywhere except on AVX_128, where it is a typedef for Simd4Real (sketched below). Also define the 3 load4N functions for AVX_128.
2) Add the original code back as another #ifdef that is only used for AVX_128.

Solution 2 is easier and thus probably better as a temporary solution. Neither of these temporary solutions should go into master, which should get a proper solution.
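
A minimal sketch of the typedef in solution 1, assuming the existing SimdReal/Simd4Real types and using GMX_SIMD_X86_AVX_128_FMA as the discriminator (illustrative, not the actual patch; the 3 load4N functions are omitted):

    #if GMX_SIMD_X86_AVX_128_FMA
    typedef Simd4Real Simd4NReal;  // 4-wide groups via the dedicated Simd4 type
    #else
    typedef SimdReal  Simd4NReal;  // 4-wide groups via the native SIMD width
    #endif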

#6 Updated by Erik Lindahl about 2 years ago

Simd4N was supposed to be a trivial addition that should simplify things, so I would be against solutions that lead to requirements for new datatypes - the SIMD layer is complex enough as-is. Similarly, having different architectures use old vs. new code implementations sounds like a bad solution, because then we will never move to having just one of them.

I see two reasonable solutions, both long- and short-term:

1) For most other parts of AVX-128 that need to do something with 4 doubles, we use 256-bit AVX. If this works without extra datatypes, it should be implemented.

2) Revert the change that introduced Simd4N.

#7 Updated by Roland Schulz about 2 years ago

Note that this issue is only about preserving the hack of using 256-bit AVX when the user chooses 128-bit. The most obvious thing is not to do that. Gather doesn't need to use 4 doubles.

I can implement solution 1 above. It doesn't use a new datatype and is actually a step towards the Pack4 changes proposed on Gerrit, which got positive feedback from Mark.

I don't understand why you are against replacing the Simd4 type with a Pack4 type. The Simd4 type doesn't scale and is far more complex for most implementations than it needs to be. Thus this would actually simplify the SIMD layer. But I guess we can leave that discussion for the future, when someone actually has time to implement it.

#8 Updated by Berk Hess about 2 years ago

  • Status changed from New to Accepted

It seems that on AMD Zen AVX2_256 is always faster than AVX2_128 when we have sufficient operands for the width anyway (which makes sense, since it's one instruction instead of two). So ideally we would want to use AVX2_256 for all code that has sufficient operands and AVX2_128 where we might not, like in the non-bonded kernels.
For release-2018 we should either accept the performance loss or bring back the old SIMD4 spread code.
For master the ideal solution in this case would be to support 4-packs, ideally using 256-bit instructions, but 128-bit would be a reasonable second choice.
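
To make the "one instruction instead of two" point concrete, here are plain AVX intrinsics (not GROMACS kernel code) for adding four doubles:

    #include <immintrin.h>

    // AVX2_256 path: one 256-bit add covers all four doubles.
    void add4_256(const double* a, const double* b, double* c)
    {
        _mm256_storeu_pd(c, _mm256_add_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b)));
    }

    // AVX2_128 path: the same work takes two 128-bit adds.
    void add4_128(const double* a, const double* b, double* c)
    {
        _mm_storeu_pd(c,     _mm_add_pd(_mm_loadu_pd(a),     _mm_loadu_pd(b)));
        _mm_storeu_pd(c + 2, _mm_add_pd(_mm_loadu_pd(a + 2), _mm_loadu_pd(b + 2)));
    }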

#9 Updated by Erik Lindahl about 2 years ago

@Roland: I'm not at all against replacing something, but I am against a proliferation of more and more types/functions, hoping for it to be cleaned up "later". So far there has never been a change that simply replaced something :-)

@Berk: At some point we have to stop doing micro-optimization, or we will eventually have people trying to tune every single routine in AVX-512 to use either 16-, 8-, 4- or 2-way SIMD. The width is a property of the implementation. We have extended this so that we both allow a specific SIMD4 type (for now), and also allow the single/double implementations to be independent so we can use wider SIMD in double. However, starting to use different SIMD widths in different routines is one step too far, IMHO. There is a cost to maintenance that we keep forgetting.

#10 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2327.
Uploader: Roland Schulz ()
Change-Id: gromacs~release-2018~I3d49c57cebc5d565df442d01e322c89312771699
Gerrit URL: https://gerrit.gromacs.org/7323

#11 Updated by Erik Lindahl about 2 years ago

NB: This will urgently require the attention of several of us, or we will have to consider reverting the SIMD4N change at the end of the week.

#12 Updated by Roland Schulz about 2 years ago

  • Status changed from Accepted to Resolved

#13 Updated by Erik Lindahl about 2 years ago

  • Status changed from Resolved to Closed
