Bug #1066

FMA and icc

Added by Mark Abraham over 6 years ago. Updated over 6 years ago.

Status: Rejected
Priority: Normal
Category: mdrun
Difficulty: uncategorized

Description

I noticed on amd1 with icc 12 that CMake notes that there is no FMA4 flag (which is gcc-specific) but allows configuration to proceed. Then make will fail because (at least) _mm_macc_ps and friends are undefined. Intel seems to call these _mm_fmadd_ps and friends, and probably uses the -fma flag to trigger their use.

Intel Haswell chips will have FMA3 in 2013, so IMO we should not proceed on the assumption that only AMD will have FMA and that we can ignore the use of icc.

1) Seems we should be aliasing the use of compiler-specific intrinsics to gmx_mm_macc_* etc. and hoping that Intel's future syntax will just work

2) Am I right that the only material difference between FMA3 and FMA4 is that the latter permits four-register operations, but that this would be immaterial because one would use the same FMA compiler intrinsic and the scheduling would be the compiler's problem? If so, doing 1) should more or less future-proof us from FMA3 vs FMA4 effects?

3) CMake detection needs some fixing according to the solution we adopt.
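
A rough sketch of what point 1) might look like in C. The macro name gmx_mm_macc_ps follows the naming suggested above but is an assumption here, as is the helper macc4. The __FMA4__ and __FMA__ tests rely on the predefined macros that GCC and Clang set for -mfma4 and -mfma; other compilers may need a different test. Without either flag the wrapper falls back to a separate multiply and add, which is not fused and can round differently.

```c
#include <x86intrin.h> /* GCC/Clang umbrella header: SSE, FMA3 and FMA4 intrinsics */

/* Hypothetical wrapper: computes a*b + c with whatever the compiler offers. */
#if defined(__FMA4__)
#  define gmx_mm_macc_ps(a, b, c) _mm_macc_ps((a), (b), (c))   /* AMD FMA4 */
#elif defined(__FMA__)
#  define gmx_mm_macc_ps(a, b, c) _mm_fmadd_ps((a), (b), (c))  /* FMA3 */
#else
#  define gmx_mm_macc_ps(a, b, c) _mm_add_ps(_mm_mul_ps((a), (b)), (c)) /* unfused fallback */
#endif

/* Illustrative helper: out[i] = x[i]*y[i] + z[i] for four packed floats. */
static void macc4(const float *x, const float *y, const float *z, float *out)
{
    __m128 r = gmx_mm_macc_ps(_mm_loadu_ps(x), _mm_loadu_ps(y), _mm_loadu_ps(z));
    _mm_storeu_ps(out, r);
}
```

Kernel code would then call only gmx_mm_macc_ps, so the FMA3 versus FMA4 choice stays in one place.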

History

#1 Updated by Erik Lindahl over 6 years ago

  • Status changed from New to Feedback wanted
  • Priority changed from Normal to Low

AVX_128_FMA is an acceleration set specifically optimized for AMD. For all modern Intel processors it is going to be faster to use 256-bit AVX, so I see no point in investing resources in making the AMD-specific optimizations run on Intel. Once we have Intel Haswell hardware it is going to be trivial (<24h) to write 256-bit AVX kernels that support FMA, so I think that's where we should concentrate our efforts.

I have already tried using icc to produce FMA3 code to run on our new AMD piledriver CPU, and FMA3 is significantly slower than FMA4 (which you get with gcc), so that's another reason it's irrelevant right now.

#2 Updated by Mark Abraham over 6 years ago

  • Status changed from Feedback wanted to In Progress
  • Priority changed from Low to Normal

OK, but the configuration is still broken right now - if building after icc+detection on amd produces compiler errors, then there's a problem to fix.

Is the solution to refuse to configure AVX_128_FMA with icc and force the user to use
  • gcc (preferred) or
  • SSE4.1 (if they have to have icc in some hypothetical universe)?
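
The question above is about CMake, but the same policy could also be expressed as a compile-time guard in C. This is only a sketch under assumptions: GMX_ACCEL_AVX_128_FMA is a hypothetical macro standing in for whatever the build system defines when this acceleration is selected, and compiled_with_icc is an illustrative helper. __INTEL_COMPILER is the macro predefined by the classic Intel compiler.

```c
/* GMX_ACCEL_AVX_128_FMA is a hypothetical macro; it stands in for whatever
 * the build system defines when AVX_128_FMA acceleration is selected. */
#if defined(__INTEL_COMPILER) && defined(GMX_ACCEL_AVX_128_FMA)
#  error "AVX_128_FMA is not supported with icc: use gcc, or select SSE4.1 acceleration"
#endif

/* Returns 1 when this file was compiled by the classic Intel compiler, else 0. */
static int compiled_with_icc(void)
{
#if defined(__INTEL_COMPILER)
    return 1;
#else
    return 0;
#endif
}
```

A guard like this fails early with a readable message instead of leaving the user with undefined-intrinsic errors later in the build.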

#3 Updated by Erik Lindahl over 6 years ago

As far as I know, nobody officially supports using icc on AMD, so the configuration isn't really more broken than if the user picks any other random compiler apart from gcc - and then we're going to have fairly long lists of compilers that we recommend against on each platform - the AMD Open64 compiler is the first example.

Is this really critical enough that we want to clutter our already-cluttered CMakeLists.txt with automated tests for all those unsupported cases? Once the icc compilation fails they will use gcc instead, and everything will be fine.

#4 Updated by Mark Abraham over 6 years ago

  • Assignee changed from Erik Lindahl to Mark Abraham

As I noted in the initial post, there's already a test for FMA4, which issues a warning when it can't find it. That test only runs on non-MSVC compilers. If we're already doing all of that, I see no harm in another clause for icc. Why test for something we don't worry about?

Anyway, you've answered my question, so this can go on my list of things to fix. :)

#5 Updated by Szilárd Páll over 6 years ago

Erik Lindahl wrote:

As far as I know, nobody officially supports using icc on AMD, so the configuration isn't really more broken than if the user picks any other random compiler apart from gcc - and then we're going to have fairly long lists of compilers that we recommend against on each platform - the AMD Open64 compiler is the first example.

Intel Compilers are installed on all Crays I've had my hands on, in fact all AMD machines, so I think it is a relevant case -- especially that we'll have to suggest Intel for AMD K10.

Still, I would not consider this a critical issue in comparison to the loads of stuff on our todo list, because people will get an obvious warning.

#6 Updated by Erik Lindahl over 6 years ago

I've fixed both FMA wrappers and a compiler warning in a new gerrit commit. However, remember that only piledriver hardware will be able to do FMA3 instructions. All present Crays and other clusters will segfault if you try to execute a binary compiled with icc (not to mention it's significantly slower).
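
Whether a given machine can execute FMA3 or FMA4 code can be checked at run time with CPUID. The sketch below assumes GCC or Clang (it uses their <cpuid.h> helper __get_cpuid), and the function names are made up for illustration. FMA3 is reported in leaf 1, ECX bit 12, and FMA4 in extended leaf 0x80000001, ECX bit 16.

```c
#include <cpuid.h> /* GCC/Clang helper around the x86 CPUID instruction */

/* Returns 1 if CPUID reports FMA3 (leaf 1, ECX bit 12), else 0. */
static int cpu_has_fma3(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        return 0; /* leaf not available */
    }
    return (int)((ecx >> 12) & 1u);
}

/* Returns 1 if CPUID reports FMA4 (leaf 0x80000001, ECX bit 16), else 0. */
static int cpu_has_fma4(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
    {
        return 0; /* extended leaf not available */
    }
    return (int)((ecx >> 16) & 1u);
}
```

A binary built with FMA3 instructions could call cpu_has_fma3() at startup and exit with a clear message rather than crashing on the first unsupported instruction. A complete check would also confirm that the OS saves the AVX register state, which is omitted here.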

#7 Updated by Erik Lindahl over 6 years ago

  • Status changed from In Progress to Closed

#8 Updated by Szilárd Páll over 6 years ago

Erik Lindahl wrote:

I've fixed both FMA wrappers and a compiler warning in a new gerrit commit. However, remember that only piledriver hardware will be able to do FMA3 instructions. All present Crays and other clusters will segfault if you try to execute a binary compiled with icc (not to mention it's significantly slower).

Do you mean that AVX_128_FMA is automatically chosen and compiles with icc, but binaries will segfault? If that's the case, I think there is still room for improvement, as the user will be left with a cryptic error. We could issue a compile-time warning when the auto-picked acceleration is AVX_128_FMA and the compiler is icc (especially if the underlying hardware is not Piledriver).

Btw, when I last tried a while ago, SSE4.1 did work with icc on Bulldozer, but the AMD compiler guide suggests -msse3 with icc.

#9 Updated by Szilárd Páll over 6 years ago

Szilárd Páll wrote:

Do you mean that AVX_128_FMA is automatically chosen and compiles with icc, but binaries will segfault? If that's the case, I think there is still room for improvement, as the user will be left with a cryptic error. We could issue a compile-time warning when the auto-picked acceleration is AVX_128_FMA and the compiler is icc (especially if the underlying hardware is not Piledriver).

Ignore my comment; as the gerrit change was not linked, I didn't realize that your change contains this warning.

One last question: is it worth using FMA with icc at all? Is it faster than SSE4.1/SSE2?

#10 Updated by Szilárd Páll over 6 years ago

What I'm still not sure about is how reliable the icc + FMA3 combination is on AMD. The Piledriver Opteron compiler reference guide does not mention FMA for icc at all. I wonder if this is an oversight or a deliberate omission. As mentioned above, I have a similar concern regarding SSE4.1 on Bulldozer.

#11 Updated by Erik Lindahl over 6 years ago

icc is fully compatible with FMA. Starting with version 13, icc can generate proper code for AVX2 targets, i.e. haswell and later Intel processors that support FMA3 instructions.

Starting with Piledriver, AMD explicitly added support for FMA3 in addition to FMA4.

#12 Updated by Szilárd Páll over 6 years ago

Erik Lindahl wrote:

icc is fully compatible with FMA. Starting with version 13, icc can generate proper code for AVX2 targets, i.e. haswell and later Intel processors that support FMA3 instructions.

Starting with Piledriver, AMD explicitly added support for FMA3 in addition to FMA4.

...and although Bulldozer supports SSE4.1, AMD still recommends -msse3 with icc, see here, in the same way that the previously linked document suggests -msse4.2 for icc with Piledriver.

Please do check the linked documents, which clearly show that for some reason AMD does not recommend what the gerrit change #1968 does - and that reason might very well not be that icc + FMA fails to work in practice or in most cases. Still, my question was not whether icc supports FMA or Haswell, but whether we can be sure that going against the official AMD recommendation is not a dangerous practice.

#13 Updated by Erik Lindahl over 6 years ago

  • Status changed from Closed to Feedback wanted

By definition it is impossible to prove that something cannot be dangerous. For all I know, icc could be dangerous on Sandy Bridge in some cases :-)

The original point of this issue was to make it possible to compile the AMD AVX_128 kernels - including FMA instructions - with the Intel compiler. There are plenty of reasons for this, including debugging, comparing different compilers, and not least making it easy to test the AVX_128_FMA kernels on future Intel hardware even though it might not be the most optimal solution. Gerrit #1968 achieves this, and it passes all regression tests. For all other compiler and OS combinations, that is our definition of "reliable".

Even though it might not be the default choice, I fail to see what we would gain by specifically not being able to use the Intel compiler to generate AVX_128 kernels.

Finding the most optimal combination of compiler settings with icc on different AMD platforms is a different issue that I won't even try to address. For Piledriver it is also relatively unimportant since icc will never be able to get close to gcc performance due to the lack of FMA4 instructions. If performance matters you should never use Piledriver+icc to start with for Gromacs.

#14 Updated by Szilárd Páll over 6 years ago

Erik Lindahl wrote:

By definition it is impossible to prove that something cannot be dangerous. For all I know, icc could be dangerous on Sandy Bridge in some cases :-)

True.

Hypothetically, if you found a way to get compiler X to generate code for CPU Y that seems to work fine even though the docs of vendor Y don't list X among the compatible compilers, would you enable this combination with a "just in case" reasoning? I know the example is rather abstract, but it reflects my point.

The original point of this issue was to make it possible to compile the AMD AVX_128 kernels - including FMA instructions - with the Intel compiler. There are plenty of reasons for this, including debugging, comparing different compilers, and not least making it easy to test the AVX_128_FMA kernels on future Intel hardware even though it might not be the most optimal solution. Gerrit #1968 achieves this, and it passes all regression tests. For all other compiler and OS combinations, that is our definition of "reliable".

...in addition to the fact that the documentation of the respective compilers states that the respective CPU-specific instruction set is supported on the CPUs/platforms in question.

I have a hard time finding a good reason to agree with allowing icc to compile code for AMD+FMA3 instead of max SSE4.2 which is suggested by the AMD docs.

I don't want to drag out this discussion further, but I don't see the point in going against the AMD documentation and enabling a feature which:
  • is not endorsed by the CPU vendor (for whatever reason) and I have yet to see examples of others reporting that this works and it is stable;
  • does not provide any performance benefit, and in fact it leads to worse performance than most if not all other possible compilers with actual support;
  • does not provide any additional functionality for AMD PD that otherwise would not be available with alternative compilers (MSVC can compile with GMX_CPU_ACCELERATION=AVX_128_FMA, right?).

Note that having early pre-CPU-release kernels that might just work on Haswell is OK and quite cool - although we have yet to see whether these AVX_128+FMA kernels will reach reasonable performance compared to AVX2+FMA (my guess is that they won't).

To conclude, I am still leaning towards down-voting patch set 1968 simply because it endorses the use of icc on AMD with FMA3, but will stay neutral and let others review/comment first.

#15 Updated by Erik Lindahl over 6 years ago

Hypothetically, if you found a way to get compiler X to generate code for CPU Y that seems to work fine even though the docs of vendor Y don't list X among the compatible compilers, would you enable this combination with a "just in case" reasoning? I know the example is rather abstract, but it reflects my point.

As you might see from my first comment in this thread I thought it was a close-to-useless case to start with.

Still, if the documentation for a compiler X explicitly says it supports generating code for the CHEWBACCA instruction set, and the documentation of CPU Y says it supports the CHEWBACCA instruction set, I don't find it outrageous to enable the combination - in particular not when somebody is asking for it.

However, I'm completely uninterested in this case myself, so if nobody else speaks up I'll kill #1968 in 24h, and then people will have to live with the error message Mark got a month ago.

#16 Updated by Erik Lindahl over 6 years ago

  • Status changed from Feedback wanted to Rejected

This is a highly un-important corner case, in particular since nobody who cares about performance will want to use icc on FMA4-enabled AMD processors for Gromacs. We've used far too much time for this discussion already, so the solution for now is simply that Gromacs-4.6 will not work with icc when selecting the AVX_128_FMA acceleration.

#17 Updated by Szilárd Páll over 6 years ago

Erik Lindahl wrote:

However, I'm completely uninterested in this case myself, so if nobody else speaks up I'll kill #1968 in 24h, and then people will have to live with the error message Mark got a month ago.

I was not suggesting to completely kill 1968. It is useful to throw it at Haswell as soon as it's out, and I'd be fine with it if we'd automatically fall back to SSE4.1 on Piledriver. Additionally, your message which warns the user is clear and concise, so if others approve the patch I won't block it!
