Feature #1612

generating SIMD code

Added by Mark Abraham about 5 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Difficulty: uncategorized

Description

Now that we have portable SIMD intrinsics, and while I was looking at the bonded module, I was motivated to try to use expression templates to do nice things like automatically generate a call to a SIMD intrinsic from statements like a = b + c * d. Roughly speaking, that involves using template meta-programming to defer the generation of the intrinsic to a function declared in a type that knows that there exist three operands and two operations, and that this can be a multiply-add. This would be good to have, so that we can have generic and SIMD code paths that look/are the same, which will make developing correct code easier.
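A rough sketch of the idea (hypothetical types; the AVX2 intrinsic is only for concreteness, real code would sit behind the portable SIMD layer):

#include <immintrin.h>

struct SimdReal { __m256 v; };

// The type of "c * d": no intrinsic is emitted yet; the operands are
// merely recorded so that a later operator sees the whole expression.
struct MulExpr { SimdReal lhs, rhs; };

static inline MulExpr operator*(SimdReal lhs, SimdReal rhs) { return { lhs, rhs }; }

// "b + (c * d)" receives a MulExpr operand, so it knows there are three
// operands and two operations, and can emit a single fused multiply-add.
static inline SimdReal operator+(SimdReal b, MulExpr m)
{
    return { _mm256_fmadd_ps(m.lhs.v, m.rhs.v, b.v) };
}

// With this, a = b + c * d compiles to one FMA intrinsic call.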

I kind-of got some basic 2-operand unit tests to pass on latest icc, gcc and clang using a CRTP approach based on various things I found around the web, but didn't really get a multiply-add working. The problem is exacerbated by the facts that
  • alignment of SIMD variables stored by value inside classes is not assured (and at least at low optimization levels, some compilers don't optimize away such temporaries, so you get segfaults, probably from mis-alignment; at full optimization there would be no temporaries...),
  • operator overloads are only legal for class types, and the SIMD types to which we preprocess are native types,
  • gcc and clang support some operators on native SIMD types (and you can't turn them off) which makes it hard to have template classes that keep and return references to the real values, because you want a * b to compile to BinaryExpression<LValue, RValue, Multiply>::operator gmx_simd_real_t const&() without the compiler deciding that (Expression<LValue>::operator gmx_simd_real_t const&()) * (Expression<RValue>::operator gmx_simd_real_t const&()) is the way to go (and this gets worse for multiply-add), and
  • template-metaprogramming error messages are raw compiler output (though there is some research work on run-time unit testing for such) and not much fun to understand - and this could end up in the face of general Gromacs developers in a way we probably do not want

I can share draft code if anybody thinks they want to try to produce a magic bullet here.
Two other approaches come to mind for avoiding writing code in the SIMD intrinsics:
  • use a (say) python generator script to take scalar code and generate the matching code implemented in our SIMD intrinsics
  • use libclang to do the same

In both cases, the generator output is source code, which we then compile and unit-test. In both cases, I think we'd keep the generated code in the repo, so that the generator infrastructure is not a hard requirement for compiling the development branch on some bleeding-edge machine.

Python would make it easy to write the generator, but since we won't have a full syntax parser, it will only be able to generate code if the input conforms to a restricted syntax subset, e.g. scalar code written like a = b + c * d. This would mean we'd need to have (and to name) a bunch of temporary variables, and teach the script about calls to math functions like sqrt(), exp() and sin().
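For example (hypothetical generator input and output; the generated calls reuse the existing gmx_simd_*_r names):

/* generator input: restricted scalar subset, temporaries named explicitly */
rsq  = dx*dx + dy*dy;
rinv = 1.0f/sqrtf(rsq);

/* generated output in our SIMD intrinsics layer */
rsq  = gmx_simd_madd_r(dx, dx, gmx_simd_mul_r(dy, dy));
rinv = gmx_simd_invsqrt_r(rsq);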

libclang would have the advantage that a = b + c * ((d*d) + (q*f)) can get parsed to a proper tree, so that a = gmx_simd_madd_r(c, gmx_simd_madd_r(d, d, gmx_simd_mul_r(q, f)), b) is the result, which lets the writer of the serial code express themselves more naturally. Getting an install of clang is no longer a problem, but the generator now has to be written in C++ and someone has to learn how to grok the AST produced by libclang. That knowledge could have other cool applications, though.

Having this extra layer of infrastructure would probably make the process of writing a kernel generator more straightforward, because it separates the concerns of generating SIMD from generating the different kernel code paths for architectures and model physics.

Are there alternatives I haven't thought of? Any enthusiasm for any particular approach?

I think Python would be the best choice for the short term.

template.cc (1.11 KB) - Roland Schulz, 10/23/2014 06:53 PM
test.cc (3.5 KB) - Basic implementation of expression templates to support fused-multiply-add - Roland Schulz, 07/12/2015 08:56 AM
test2.cc (3.47 KB) - Working FMA expression template (subtraction missing) - Roland Schulz, 07/15/2015 03:01 AM

Related issues

Related to GROMACS - Task #1017: C++ Vector/Matrix classes (New)

History

#1 Updated by Roland Schulz about 5 years ago

Mark Abraham wrote:

  • alignment of SIMD variables stored by value inside classes is not assured (and at least at low optimization levels, some compilers don't optimize away such temporaries, so you get segfaults, probably from mis-alignment; at full optimization there would be no temporaries...),

All compilers should have an attribute to fix that. GCC/Clang/ICC: __attribute__((aligned(#))), MSVC: __declspec(align(#)), XLC: __align(#). I can't find the Fujitsu C++ manual, so I couldn't look it up.
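E.g. something like this sketch (32-byte alignment assuming AVX; the XLC spelling is omitted):

#include <immintrin.h>

#if defined(_MSC_VER)
#    define GMX_SIMD_ALIGNED __declspec(align(32))
#else
#    define GMX_SIMD_ALIGNED __attribute__((aligned(32)))
#endif

struct Wrapped
{
    GMX_SIMD_ALIGNED __m256 v; /* re-assert the alignment of the member */
};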

  • operator overloads are only legal for class types, and the SIMD types to which we preprocess are native types,

The type should definitely be a class of our own containing the SIMD type.

  • gcc and clang support some operators on native SIMD types (and you can't turn them off) which makes it hard to have template classes that keep and return references to the real values, because you want a * b to compile to BinaryExpression<LValue, RValue, Multiply>::operator gmx_simd_real_t const&() without the compiler deciding that (Expression<LValue>::operator gmx_simd_real_t const&()) * (Expression<RValue>:: operator gmx_simd_real_t const&()) is the way to go (and this gets worse for multiply-add), and

Shouldn't be a problem with our own type.

  • template-metaprogramming error messages are raw compiler output (though there is some research work on run-time unit testing for such) and not much fun to understand - and this could end up in the face of general Gromacs developers in a way we probably do not want

Not sure which compiler you tried. Clang tends to be better with error messages. But even then, I agree this is a weakness.

I can share draft code if anybody thinks they want to try to produce a magic bullet here.

I think it shouldn't be hard. I can help if that's the way we want to go.

Two other approaches come to mind for avoiding writing code in the SIMD intrinsics:
  • use a (say) python generator script to take scalar code and generate the matching code implemented in our SIMD intrinsics
  • use libclang to do the same

I suggest we look at using ispc. It already implements your second approach. It has SIMD implemented for Intel (SSE, AVX, MIC) and some Arm (Neon), but can be extended (for Power, other Arm, K). It is open source. It supports source-to-source compilation using the generic targets (but can also do source-to-object), so it works if no clang compiler is available on the target arch. There is active development. I haven't tried it, but it looks good to me. I'm happy to research specific questions if that's an option we might choose.

libclang would have the advantage that a = b + c * ((d*d) + (q*f)) can get parsed to a proper tree, so that a = gmx_simd_madd_r(c, gmx_simd_madd_r(d, d, gmx_simd_mul_r(q, f)), b) is the result, which lets the writer of the serial code express themselves more naturally. Getting an install of clang is no longer a problem, but the generator now has to be written in C++ and someone has to learn how to grok the AST produced by libclang. That knowledge could have other cool applications, though.

I have done it before. There are Python wrappers for it, and I used those. It isn't that hard.

Are there alternatives I haven't thought of? Any enthusiasm for any particular approach?

I think we should either use the C++ template approach or ispc. I think it depends on the syntax we want. ispc's strength is its implicit parallel syntax (the code mostly looks serial). I assume the template syntax we would develop would be more explicit.

Of course there are other solutions (e.g. OpenMP simd).

I think Python would be the best choice for the short term.

Not sure we should invest time into a temporary solution.

#2 Updated by Roland Schulz about 5 years ago

What code do we want to use to analyze different approaches? The nbnxn kernels or something else like PME or bonded? For the nbnxn kernels we have the ref kernels but they don't use the jx_S/ix_S? temporary buffers and don't have an option to choose the extra unroll of i (as in 2xnn). If we want to see whether something like ispc can generate equally good SIMD from the ref kernel, we probably need to make a new version of the ref kernels which adds those. At least the temporary work buffers are an optimization one probably cannot expect from something like ispc.

#3 Updated by Roland Schulz about 5 years ago

What is the goal for the generic path? Should it be only a reference for which performance isn't a high priority, or should it be as fast as it is now? It would be trivial to use the current SIMD path also as the generic path: as long as the SIMD path supports simd-width=1, one can use the reference SIMD implementation (if one thinks the compiler is too dumb to optimize the loop over 1 lane away, one could fix that with an ifdef). It would even work with simd-width>1, but then one has the extra loop; the compiler should unroll it, but that might cause register spills. Of course, that wouldn't be optimal even for simd-width=1: the SIMD path contains e.g. blend instead of if, and copies data into aligned work buffers (e.g. forceatoms in angles_noener_simd). Converting if<->blend and doing the copy into a work buffer only optionally is something much more complicated to do.
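Roughly, the width-1 flavor would amount to this (sketch only, following the reference implementation's naming):

/* reference "SIMD" with one lane: the SIMD path doubles as the generic path */
#define GMX_SIMD_REAL_WIDTH 1
typedef float gmx_simd_real_t;

static inline gmx_simd_real_t gmx_simd_mul_r(gmx_simd_real_t a, gmx_simd_real_t b)
{
    return a*b; /* any loop over the single lane is trivially removable */
}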

#4 Updated by Szilárd Páll about 5 years ago

Roland Schulz wrote:

I think Python would be the best choice for the short term.

Not sure we should invest time into a temporary solution.

We can't assume that any of the alternatives will produce a code/framework that is reasonably: i) fast, ii) portable, and iii) easy to use. I'm particularly skeptical about template meta-programming, which hides the vectorized code produced, is fragile wrt compilers, and obscures compile-time errors quite a lot.

I think a gap-bridging solution is necessary. It is probably safer to proceed right now with the approach that has the fewest pitfalls and experimental aspects, to avoid the risk of running into issues with some of the above points and delaying new SIMD code until a stable working solution materializes.

This is especially important as there are several SIMD architectures planned for the next release and the future of the nbnxn module/Verlet scheme is not clear either.

What code do we want to use to analyze different approaches? The nbnxn kernels or something else like PME or bonded? For the nbnxn kernels we have the ref kernels but they don't use the jx_S/ix_S? [...]

My first thought is: the simpler the better - which points toward the bondeds, but I may be wrong about this.

Instead of the reference kernels, which are insanely slow, I think the ref SIMD kernels should be tailored for performance to provide the starting point for future work, as well as the fallback when GMX_SIMD=None, because they are much faster than the current ones (see #1417).

#5 Updated by Roland Schulz about 5 years ago

Szilárd Páll wrote:

I'm particularly skeptical about template meta-programming, which hides the vectorized code produced, is fragile wrt compilers, and obscures compile-time errors quite a lot.

I think it depends on what each of us means exactly by "template meta-programming". Since it is just a technique, not an approach, we have to describe the approach exactly. The first approach I would suggest is: only use template meta-programming as syntactic sugar. Thus all it would do is convert a*b into gmx_simd_mul, a+b into gmx_simd_add, and a*b+c into gmx_simd_madd (for add and mul it isn't meta - just simple operator overloading). It would not do any unrolling, converting of if into blend, or any other complex code transformation. In that case I don't think it would hide any vectorized code (it is just nicer syntax), it wouldn't be fragile wrt compilers (it would only use very standard C++ constructs), and the compiler errors would still be reasonably easy to understand.
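I.e. only thin wrappers like this (sketch; the wrapper type is hypothetical, the gmx_simd_*_r functions are the current module's):

struct SimdReal
{
    gmx_simd_real_t v;
};

static inline SimdReal operator*(SimdReal a, SimdReal b)
{
    return { gmx_simd_mul_r(a.v, b.v) }; /* a*b -> gmx_simd_mul_r */
}

static inline SimdReal operator+(SimdReal a, SimdReal b)
{
    return { gmx_simd_add_r(a.v, b.v) }; /* a+b -> gmx_simd_add_r */
}

/* a*b+c -> gmx_simd_madd_r additionally needs the small expression-template
   step sketched in the description; plain overloading stops at mul/add. */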

The other place where templates would be useful, would be to replace the current usage of the preprocessor. I suggested this before on gmx-dev.

In that thread Mark wrote:

The downside for template meta-programming for non-bonded kernels is that you still get a kernel with every possible code path in it, whereas with a generator script doing the metaprogramming, you can read whichever one suits your current purpose. Being able to see the meta-programming output can be useful when developing new kernels. I imagine compile time of ~100 kernels might be a little faster overall with the generator script, and debuggers might have a better time, too.

What do you mean by "read whichever one suits you"? I'm pretty sure we don't want to get into compiling kernels on-demand at runtime. Thus all kernels which we think the user might want at runtime need to be compiled at compile time. This would work the same way for templates. For templates, not all possible combinations are created, but only those which are instantiated. The compile time shouldn't be much different (the template generation should be small compared to optimized compilation - it might be noticeable for non-optimized/debug compilation). And I don't see why debuggers should have a problem with it. This is a very basic technique nowadays.

#6 Updated by Szilárd Páll about 5 years ago

Roland Schulz wrote:

Szilárd Páll wrote:

I'm particularly skeptical about template meta-programming, which hides the vectorized code produced, is fragile wrt compilers, and obscures compile-time errors quite a lot.

I think it depends on what each of us means exactly by "template meta-programming". Since it is just a technique, not an approach, we have to describe the approach exactly. The first approach I would suggest is: only use template meta-programming as syntactic sugar. Thus all it would do is convert a*b into gmx_simd_mul, a+b into gmx_simd_add, and a*b+c into gmx_simd_madd (for add and mul it isn't meta - just simple operator overloading). It would not do any unrolling, converting of if into blend, or any other complex code transformation. In that case I don't think it would hide any vectorized code (it is just nicer syntax), it wouldn't be fragile wrt compilers (it would only use very standard C++ constructs), and the compiler errors would still be reasonably easy to understand.

You're right, I just got a bit distracted by crazy stuff I've seen before. However, what you describe above (and I now realize that Mark also mentioned only simple arithmetic) is exactly what I thought of when discussing the lack of SIMD in some of the bonded module and showing Mark the rather simplistic overloaded operators (not even templated) that we use in CUDA (gmxlib/cuda_tools/vectype_ops.cuh).

The other place where templates would be useful, would be to replace the current usage of the preprocessor. I suggested this before on gmx-dev.

I missed that discussion; that is indeed a good point! I've actually seen the technique the SO link you posted presents before (in the CUDA SDK), but I have to admit that even though CUDA has always been compiled in C++ mode, I avoided templates to a great extent just because I preferred C-style coding. It's time to revise that decision, I guess. It should be relatively easy to convert the kernel function pointers step by step to templated code and verify that dead code elimination preserves the performance.

I'm pretty sure we don't want to get into compiling kernels on-demand at runtime.

Actually, we do & quite likely we will, although for now only possible with OpenCL and soon CUDA. JIT compiling should allow cool things, but that's another topic.

#7 Updated by Roland Schulz about 5 years ago

Szilárd Páll wrote:

You're right, I just got a bit distracted by crazy stuff I've seen before.

Too bad one cannot overload the ternary operator, then one could turn it into a blend ;-)

Actually, we do & quite likely we will, although for now only possible with OpenCL and soon CUDA. JIT compiling should allow cool things, but that's another topic.

It is also possible with C++ (libclang, libjit, CLI, cling, ceemple).

#8 Updated by Roland Schulz about 5 years ago

Currently the reference kernel always uses nbatXYZ. Do we have a version which also supports nbatX4/nbatX8? Just checking before I add it myself.

I tried the template technique to replace the preprocessor. For a small example the dead code elimination worked for gcc even without optimization enabled. It will be easy to keep the current mechanism that each kernel is compiled in its own compilation unit. Each source file can do an explicit template instantiation (template void f<4>();) and in the central source file which assigns the function pointer extern templates can be used (extern template void f<4>();).
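In outline (sketch; f stands for a kernel template, and the file split is illustrative):

/* kernel.h */
template<int flags> void f(); /* kernel declaration; definition elsewhere */

/* kernel_4.cpp - one flavor per translation unit, as now */
template<int flags> void f() { /* kernel body; flags folded as constants */ }
template void f<4>();          /* explicit instantiation */

/* dispatch.cpp - the central file that assigns the function pointer */
extern template void f<4>();   /* do not instantiate here; link against kernel_4.cpp */
void (*kernelPtr)() = f<4>;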

#9 Updated by Teemu Murtola about 5 years ago

Roland Schulz wrote:

Each source file can do an explicit template instantiation (template void f<4>();) and in the central source file which assigns the function pointer extern templates can be used (extern template void f<4>();).

Note that extern template is only in C++11 (although at least gcc has had support for a long time), but that may cause trouble with some unnamed compilers...

#10 Updated by Szilárd Páll about 5 years ago

Roland Schulz wrote:

Currently the reference kernel always uses nbatXYZ. Do we have a version which also supports nbatX4/nbatX8? Just checking before I add it myself.

When talking about the "reference kernel", do you mean the reference plain-C one that's called when GMX_SIMD=None, or the plain-C SIMD ones? AFAIK the latter supported all flavors of the nbnxn kernels (with minor tweaking), with nbatX4 too, but I'm not sure anymore.

I tried the template technique to replace the preprocessor. For a small example the dead code elimination worked for gcc even without optimization enabled. It will be easy to keep the current mechanism that each kernel is compiled in its own compilation unit. Each source file can do an explicit template instantiation (template void f<4>();) and in the central source file which assigns the function pointer extern templates can be used (extern template void f<4>();).

Sounds promising.

#11 Updated by Szilárd Páll about 5 years ago

Teemu Murtola wrote:

Roland Schulz wrote:

Each source file can do an explicit template instantiation (template void f<4>();) and in the central source file which assigns the function pointer extern templates can be used (extern template void f<4>();).

Note that extern template is only in C++11 (although at least gcc has had support for a long time), but that may cause trouble with some unnamed compilers...

That may actually be a showstopper; otherwise we'd have to include all kernels in the same compilation unit, which would cause compilation bottlenecks like it does now with the CUDA kernels, right?

#12 Updated by Teemu Murtola about 5 years ago

Szilárd Páll wrote:

Teemu Murtola wrote:

Roland Schulz wrote:

Each source file can do an explicit template instantiation (template void f<4>();) and in the central source file which assigns the function pointer extern templates can be used (extern template void f<4>();).

Note that extern template is only in C++11 (although at least gcc has had support for a long time), but that may cause trouble with some unnamed compilers...

That may actually be a showstopper, otherwise we'd have to include all kernels into the same compilation unit which will cause compilation bottlenecks like it does now with the CUDA kernels, right?

No, it doesn't need to be. You can just as well do something like

void kernel_name_like_it_is_now(/* params */)
{
    kernel_template<flags>(/* params */);
}

Or, you can make each file declare a function that returns a function pointer to the kernel, if you are worried that the compiler would not inline the function call above. In either case, code that needs to call into the kernel from outside the translation unit only sees a normal function and does not need to call the template directly.
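Concretely, something like this sketch (all names hypothetical):

struct KernelArgs; /* the usual kernel parameters, elided here */

template<int elecFlavor, int vdwFlavor>
void kernelTemplate(const KernelArgs &args); /* defined in a shared header */

enum { ElecRF = 0, VdwLJ = 0 }; /* hypothetical flavor tags */

/* This translation unit exports a plain function under the current kernel
   naming scheme; outside callers never see the template itself. */
void nbnxn_kernel_ElecRF_VdwLJ(const KernelArgs &args)
{
    kernelTemplate<ElecRF, VdwLJ>(args);
}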

#13 Updated by Szilárd Páll about 5 years ago

Thanks, good point, I forgot about that option.

I think we should try to evaluate this option asap and decide whether we want to go with all templates, or maybe a combination of templates and a generator - the kernels may become unreadable if we express all flavors through if/switch conditionals.

#14 Updated by Erik Lindahl about 5 years ago

One caveat is that we should test with the worst compilers, not the most well-behaved ones ;-)

It would be great if somebody could create a test program that we can test on a bunch of different compilers. It would be a good idea if the contents of the if/switch statement aren't completely trivial, so we can check that the compiler does the right thing from an optimization point of view.

#15 Updated by Roland Schulz about 5 years ago

Creating a test program should be easy to do. What do you suggest as non-trivial content?

#16 Updated by Roland Schulz about 5 years ago

One option would be to try it on PME. It would be easy to convert, is relatively simple, is real code, and has a numerical constant which is used both as a const expression (specifying the number of loop iterations) and in an if statement (PME_ORDER), and also has a boolean constant (PME_SIMD4_UNALIGNED). But it might be a bit harder to verify that the compiler actually optimized it.

#17 Updated by Erik Lindahl about 5 years ago

Not really sure about the best test, but if a switch statement has empty cases, those will obviously be optimized away no matter what the templates do. I was thinking that it would be much easier to check that we get a reasonable runtime for a simple algorithm (adding numbers, or whatever) rather than trying to disassemble the resulting objects to check what the compiler does, in particular if we want to check it on a handful of different architectures!

PS: The alignment issue Mark talked about in the original post is/was real. The __m128 and similar datatypes already have alignment specifiers, but apparently some versions of gcc and clang manage to disregard those alignment instructions when the variables are placed inside a class. This was the reason I had to rewrite the unit tests for the SIMD module to avoid placing SIMD variables inside classes.

#18 Updated by Roland Schulz about 5 years ago

How about something like:

#include <cstdio>

template<int n, bool b>
float f(float x)
{
    for (int i = 0; i < n; i++) x *= 2; // loop bound n is a compile-time constant
    if (n > 5) x += n;                  // dead when n <= 5; should be eliminated
    if (b) x /= 2;                      // dead when b is false; should be eliminated
    return x;
}

int main() {
    printf("%f\n", f<3, false>(10));
    return 0;
}

What specific compilers have a problem with aligned data in a class? It seems it is possible to work around these issues: Eigen supports GCC >= 4.1 and Clang >= 2.8, and it couldn't if that weren't solvable.

#19 Updated by Erik Lindahl about 5 years ago

Something like that; I'll try to create a double loop that takes a while to execute without overflowing, so we can check that it is slow with an if statement (at least in a separate translation unit), but that we get the same speed from the template as when we remove the conditional statement in a separate translation unit.

I guess you should be able to see from the gerrit changes/discussion about the SIMD patch where the unit tests had bus error problems - I don't remember the exact compilers anymore. Start with PGI - that one usually fails most things :-) Unfortunately that seems to be related to using the C++ frontend from the Edison Design Group, and that's what a lot of commercial vendors do to avoid developing their own (Fujitsu too).

Obviously, anything is possible to work around, but that will require people to spend time to find and implement those workarounds on various compilers. For Gromacs-6.0 Mark & I have talked about reserving the last 3-4 weeks before the release for portability patches or critical fixes, but not allowing any new code whatsoever. Unfortunately this is still a clear handicap with C++ compared to C - for now I have given up on getting Intel TBB to pass their unit tests on Fujitsu or PGI after a few weeks of work.

#20 Updated by Roland Schulz about 5 years ago

The attached test doesn't show a difference between f1 and f2. I think it is because the compiler is smart enough to move the if outside the loops. I'm not sure how to create a small example which shows the problem of an if statement (a normal if, without template or preprocessor) inside the inner loop.

If we use the PME as a test, we wouldn't need to look at the asm to verify that the compiler eliminated all the ifs. We could just look at the runtime of the PME code.

#21 Updated by Gerrit Code Review Bot about 5 years ago

Gerrit received a related patchset '1' for Issue #1612.
Uploader: Roland Schulz ()
Change-Id: Id366621fa3ad02ca182b8a4da48cae940059cf46
Gerrit URL: https://gerrit.gromacs.org/4183

#22 Updated by Roland Schulz about 5 years ago

The commit I uploaded changes the PME gather. I think this should allow anyone to test whether their compiler can optimize as well as before. I recommend compiling with GMX_CYCLE_SUBCOUNTERS to get just the gather/spread time.

#23 Updated by Roland Schulz about 5 years ago

Why is i manually (partially) unrolled in the 4xn/2xnn kernels? Do we know whether any compiler produces less efficient code if this isn't unrolled manually? Without the manual unroll it should be relatively easy to combine both kernels into one. We could even abstract the load/store/reduce (which would be different depending on whether we operate on full/half/... vectors) so that the code wouldn't become less readable. And it would make it easy to create other kernels (e.g. for Phi we want a 444 kernel: UNROLLI=UNROLLJ=4 with both fully unrolled, giving the 16-wide SIMD). But none of this is possible if we know that some (important) compiler requires us to do the manual unroll.

#24 Updated by Mark Abraham about 5 years ago

Roland Schulz wrote:

Why is i manually (partially) unrolled in the 4xn/2xnn kernels?

It permits the implementation of the HALF_LJ functionality: if 2-3 of the i atoms have LJ parameters that are zero, we can skip 2 of the LJ evaluations per i cluster. This caters to the frequently used water models that lack LJ on hydrogen atoms, of course. In principle, the 4xn kernels might benefit from separate inner loops for 0/1/2/3/4 i atoms with LJ, but the marginal benefit over the current scheme (supporting 0/2/4) isn't large.

Do we know whether any compiler produces less efficient code if this isn't unrolled manually? Without the manual unroll it should be relatively easy to combine both kernels into one. We could even abstract the load/store/reduce (which would be different depending on whether we operate on full/half/... vectors) so that the code wouldn't become less readable. And it would make it easy to create other kernels (e.g. for Phi we want a 444 kernel: UNROLLI=UNROLLJ=4 with both fully unrolled, giving the 16-wide SIMD). But none of this is possible if we know that some (important) compiler requires us to do the manual unroll.

It is certainly conceivable that some compiler would not attempt to unroll the i-cluster loop, because it fails some instruction-count heuristic, or the use of intrinsics suppresses it, or some alignment assertion is missing / not propagated. But I don't know whether this has been observed (or with what compiler, and which versions). If I'd been Berk writing the original explicit-SIMD implementation, I'd have been explicit about the unroll, too.

Whether the unroll is good depends on the number of registers (and/or the ability of the compiler/processor to re-order accordingly), so our current approach is brittle in that respect. When we have a kernel generator, we should probably add such manual-unroll dimensions to the mix.

#25 Updated by Roland Schulz over 4 years ago

  • Related to Task #1017: C++ Vector/Matrix classes added

#26 Updated by Roland Schulz over 4 years ago

GCC 4.7, ICC 15, and clang 3.5 (maybe also earlier versions - those are just the ones I tested) all support basic math operators on SIMD types, e.g.:

__m256 fmadd(__m256 a, __m256 b, __m256 c) {
    return a*b+c;
}

And they all correctly convert it to a fused multiply-add. Thus with most compilers we could just use those arithmetic operators without templates, and only have templates as a fallback (e.g. for MSVC).

We could also use namespaces to make calling SIMD functions less verbose. If we change gmx_simd_rsqrt_f to gmx::simd::rsqrt_f, one would only have to write simd::rsqrt_f or rsqrt_f.
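E.g. (sketch; the SimdFloat wrapper type is hypothetical here):

struct SimdFloat;

namespace gmx
{
namespace simd
{
SimdFloat rsqrt_f(SimdFloat x); /* was gmx_simd_rsqrt_f */
}
}

/* at the call site, choose the verbosity: */
using namespace gmx;       /* simd::rsqrt_f(x) */
using namespace gmx::simd; /* rsqrt_f(x)       */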

#27 Updated by Roland Schulz over 4 years ago

I couldn't find any existing SIMD library which does fused multiply-add for AVX2. It seems that at least Boost SIMD, Eigen, Vc, and Agner's Vector Class Library don't.

Doing it for simple cases is simple; see the attached file for a basic implementation. The problem I haven't been able to resolve is making it work for more than one FMA in a single expression. In that case, whether it works depends on the order of the operators:

Vec(a)*Vec(b)+Vec(c)*Vec(d)+Vec(e);   // doesn't work
Vec(a)*Vec(b)+(Vec(c)*Vec(d)+Vec(e)); // works

As it is now, it actually gives an error because the operator overloading is ambiguous. That should be relatively easy to resolve. But I don't have a good idea of how to convert the first expression into 2 FMAs.

#28 Updated by Roland Schulz over 4 years ago

I think I know a solution to make it work in all cases. But I probably won't be able to implement it until next week (just in case someone else is looking at this).

#29 Updated by Erik Lindahl over 4 years ago

I'm not working on it, but remember that the inner kernels are likely the most sensitive code, in the sense that we need to have them working for more than a dozen platforms and possibly proprietary compilers.

FMA is likely the least of our problems; we could certainly use custom FMA operations (and FMS, FNMA, FNMS) in the kernels - the question is whether all compilers/platforms support basic math operations on SIMD types.

#30 Updated by Roland Schulz over 4 years ago

I think we want a solution which we can use for all SIMD code. This includes the kernels but also other code (listed interactions, PME, ...).
This is only about having nicer syntax. Maybe not the most important issue, but I think it does help readability and thus would be nice to have. Of course I agree it has to work with all compilers and SIMD types and be just as optimized. What I propose is the following (see the sketch below):
- use the +, -, * operators with SIMD types for nicer syntax
- if the compiler supports these operators on SIMD types, just use the compiler support (at least GCC, Clang, ICC). In that case gmx_simd_float_t would stay as is, and the compiler has the freedom to optimize as it sees fit (GCC does more optimization with these than with intrinsics, which it treats the same as inline asm and doesn't change). If the compiler does a bad job at optimizing (e.g. not using FMA), one can just treat it as if it doesn't support the operators.
- if the compiler doesn't support these operators (at least MSVC), use templates instead. These templates then produce the same intrinsic code as now, but we can write the nicer operator syntax (and the same code works whether the operators come from compiler support or from the template). For that it is important that the templates correctly convert a*b+c into the FMA intrinsic. For such a compiler gmx_simd_float_t would be the template (a wrapper around the real SIMD type), which otherwise works exactly like now but supports these operators. The FMA support is possible with expression templates; the attached code already solves it partly, and I think I have a solution for any expression.
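In outline (sketch; the feature macro and wrapper are hypothetical, AVX shown only for concreteness):

#include <immintrin.h>

#if GMX_SIMD_HAVE_NATIVE_OPERATORS /* hypothetical feature macro */
/* gcc/clang/icc: the native vector type already supports +, -, * */
typedef __m256 gmx_simd_float_t;
#else
/* e.g. MSVC: wrap the native type and supply the operators ourselves */
struct gmx_simd_float_t
{
    __m256 v;
};
static inline gmx_simd_float_t operator*(gmx_simd_float_t a, gmx_simd_float_t b)
{
    return { _mm256_mul_ps(a.v, b.v) };
}
/* ...plus expression templates so that a*b+c becomes one FMA intrinsic */
#endif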

#31 Updated by Roland Schulz over 4 years ago

#32 Updated by Erik Lindahl about 4 years ago

I'll add a couple of things based on recent discussion both in gerrit and offline:

1) Obviously, it would be nice if we could have one set of kernels that works for both SIMD and vanilla C++ code. However, looking at the code, things like operator overloading and FMA are just the start. We would also need to handle conversions between types (including SIMD types where e.g. float and double might have different widths), and in particular conditional statements, which are represented with masks and selections in SIMD. A simple table lookup that works fine for a scalar should usually be analytical for 8-way SIMD (but the analytical form would be much more expensive for a scalar). Basically: there are a LOT of remaining complications on the way to a state where we could use the same code, and that would likely require us to rewrite the scalar code in a way that makes it LESS readable. Is that really something we want?

2) If we can't use the same code, this mostly ends up being a cosmetic change. Cosmetics can be nice (in particular if it improves the readability of complex code), but it would likely require us to wrap the SIMD types inside a simple struct or class to make them unique. That is not hard in terms of C++, but it would lead to a bit of extra code, with ~300 small functions whose arguments might need to be wrapped for each architecture. That might be possible to outsource among developers if there's a lot of interest in helping.

3) One potential problem with using wrappers is that many functions that need to return a number might have to create a temporary object, assign to the SIMD variable inside it, and then return it. Again, not complicated C++, but given that many of our SIMD architectures have been quite performance-sensitive even to changes such as using intermediate variables vs. putting all statements on a single line, it is quite possible this would hurt performance, at least for some compiler/hardware combinations. This will likely require a LOT of testing due to all the compiler, SIMD, and hardware combinations we target.

Sorry if that sounds negative, but my concern is largely that at the end of the day, if everything works, right now it seems like this will get us .... 0% better performance. At the same time, it's still not clear to me whether we'll actually be able to combine any SIMD & scalar routines. I simply don't currently see that typing "simdAdd(a,b)" instead of "a+b" accounts for such a huge fraction of our development time compared to the effort this will take to test and realize?

That doesn't mean that I'm against it, but for the time being I think my efforts are better spent on implementing the actual extended SIMD implementations so we get to the general nbnxn kernels by Christmas or so (which we need to get into the next release).

If other people are still interested, and they believe it is possible to get to common templated routines, I think the best option is to start testing things (both samples and actual modified Gromacs kernels) on a wider range of hardware and compilers. Presently I think we can arrange access to these:

1) All the x86 flavors. gcc-4.6 through gcc-5.2 are installed in most places, clang-3.5 through 3.7 and icc in a few places. tcbs14 has both the pathscale and PGI compilers installed in /opt.

2) We can arrange accounts on two Power8 machines at PDC, with gcc-4.9 and 5.1. Presently the IBM xlC compilers appear to have expired, but they are working on fixing that.

3) We have access to an Arm32+CUDA Jetson devkit as jetson.theophys.kth.se. Localadmin has an account; for other users we have to create one.

4) We (but not guests) can access Arm64 at Nvidia. However, overall I expect the behavior to be similar to Arm32.

5) There's a Xeon Phi card in mic1.theophys.kth.se, and icc is installed.

6) The Juelich account we have gives us access to a BlueGene/Q with gcc and IBM compilers, and there's also a Power7 frontend where we can test VMX/VSX with both gcc and xlc.

7) The K computer cannot compile all of Gromacs right now (C++11 support is scheduled for spring), but I'm working on getting access to a newer FX100 system in Japan that might have Fujitsu compilers that work with C++11.

#33 Updated by Erik Lindahl about 4 years ago

This might be a viable plan:

I'm reworking the SIMDv2 implementations currently in Gerrit so they are based on the C++ change. While doing so, I'm now fully separating SIMD4 so that all SIMD types are unique. Having full type safety is good in any case; I found two places in the kernels where we were using things in strange ways that are not bugs yet, but they could become bugs.

Once that is done, we can experiment with and/or change one occurrence of overloading at a time, for all architectures, and simultaneously change the places where it is used outside the SIMD module. This will also make the benchmarking and portability testing trivial, since we can check it on our actual code.

The alternative of doing it one architecture at a time would likely be more work, because all current routines would need to be kept in the mean time (since other architectures only provide those), and then we'd effectively be doubling large parts of the implementation temporarily.

#34 Updated by Roland Schulz about 4 years ago

How many types do we have which have the same SIMD type for at least one architecture but need different implementations?

#35 Updated by Erik Lindahl about 4 years ago

All of them. There are architectures (IBM QPX) where all SIMD types are implemented in a single type of double-precision register.

#36 Updated by Erik Lindahl about 4 years ago

I've basically finished moving the extended SIMD implementation for all the 128/256-bit SSE and AVX implementations to be based on the C++ SIMD patch. The good news is that, at least with Clang and GCC (back to 4.6), the performance seems to be fine even when I use wrapper structures. I've also examined the assembly output, and the compiler optimizes them away just fine. Hopefully I can push the changes in later tonight or tomorrow.

In principle we would first need to implement all other architectures before we can introduce operators and overloading in all archs at the same time, but another option could be to temporarily disable SIMD support for all other architectures, add that patch, and then add back the other SIMD architectures (Phi, AVX512, two Arm flavors, three IBM flavors) if we're willing to prioritize the review of those patches? (Short term we'd ruin performance on those platforms, so I suspect we don't want to be in that state for months.)

#37 Updated by Roland Schulz almost 4 years ago

Does anyone have an opinion on whether we should use

SimdFloat a = load(m);
store(m, a);

or

SimdFloat a = deref(m);
deref(m) = a;

?

And whether we should use

d=fma(a,b,c);

or
d=a*b+c;

?

The pros and cons are discussed at https://gerrit.gromacs.org/#/c/4524. Let me know if someone would like a summary of the pros/cons.

#38 Updated by Mark Abraham almost 4 years ago

I think that in the short term we should accept things that are straightforward to implement well, so prefer the load/store pair and the fma call. It would indeed be nice to be able to write SIMD code using operators, but it is a higher priority to get the SIMD module in shape for new Verlet kernels (potentially allowing new performance optimizations) and for killing the group scheme (removing lots of short-ranged code, and special-purpose code in DD, PME, setup, tuning...).

AFAICS, a future implementation of unary operator* to do dereferencing of memory, usable both as an l-value and an r-value, shouldn't be any harder whether the routines it calls underneath are the two operations load() and store(), or an overloaded deref().

#39 Updated by Roland Schulz almost 4 years ago

"deref()" is as easy to implement as "load()/store()" , and I have done most work for the fma already. I don't think it is possible to do unary operator*. The opperant is a float*. A derefence for a simd isn't distinquishable from a dereference for a std float. And unless we don't mind, that any expression involving *f (where f is a float*) gets first converted into a proxy, we can't do that.

#40 Updated by Erik Lindahl almost 4 years ago

I'm with Mark. We need to stop the feature creep.

The "trivial" decision that we wanted to be able to overload things with classes has just cost me two full days of debugging and now I need to redesign a bunch of code because clang-3.7 will not maintain alignment correctly of SIMD types within classes when the class object in turn is put into a structure.

#41 Updated by Mark Abraham almost 4 years ago

Roland Schulz wrote:

"deref()" is as easy to implement as "load()/store()" , and I have done most work for the fma already.

OK, if the choice is merely about what's more usable in the long term, I think it is quite similar for load/store() vs deref(), so we can use whatever's ready sooner. Mild preference for load()/store(). fma() will be easier to implement and test than all the attendant operator overloading and proxy stuff for d=a*b+c, so that can be something we do later.

I don't think it is possible to do unary operator*. The operand is a float*, and a dereference for a SIMD load isn't distinguishable from a dereference of a plain float. So unless we don't mind that any expression involving *f (where f is a float*) first gets converted into a proxy, we can't do that.

OK, scratch that idea, then. I was speculating that you were also anticipating that deref() could be sugared just as fma() can be sugared.

#42 Updated by Roland Schulz almost 4 years ago

load and deref are equally easy to implement (store by itself is easier, but deref can reuse the same proxy object for load/store), so it is mostly a matter of different names. I mainly suggested deref to avoid the store(m, a) syntax, where one has to remember the argument order.
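The proxy would look roughly like this (sketch; it assumes the SimdFloat type and the load()/store() functions from the other proposal already exist):

class SimdFloatRef
{
    public:
        explicit SimdFloatRef(float *m) : m_(m) {}

        /* deref(m) used as an r-value: turns into a load */
        operator SimdFloat() const { return load(m_); }

        /* deref(m) used as an l-value: turns into a store */
        SimdFloatRef &operator=(SimdFloat a)
        {
            store(m_, a);
            return *this;
        }

    private:
        float *m_;
};

static inline SimdFloatRef deref(float *m) { return SimdFloatRef(m); }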

#43 Updated by Mark Abraham almost 4 years ago

Roland Schulz wrote:

load and deref are equally easy to implement (store by itself is easier, but deref can reuse the same proxy object for load/store), so it is mostly a matter of different names. I mainly suggested deref to avoid the store(m, a) syntax, where one has to remember the argument order.

OK, that's a good reason for deref, go with that.

But I assume that m and a have types that can't interconvert, so the compiler would catch the misuse? The horrible use cases are things like double pow(double, double)... :-)

#44 Updated by Mark Abraham over 3 years ago

  • Status changed from New to Closed
  • Target version changed from 5.x to 2016

Generating SIMD code is no longer a priority, because the SIMD intrinsics library addresses all our known interests.
