Bug #923

Get rid of the raw SSE assembly kernels

Added by Rossen Apostolov about 7 years ago. Updated almost 7 years ago.



The present raw assembly kernels cause problems in Google Native Client (since we use every single register available) and, possibly more importantly, they have very low performance on new AMD hardware (Bulldozer), since they execute as SSE rather than AVX instructions and incur expensive switches between instruction sets.

This is straightforward to fix with the new kernel generator Erik has, and we will just translate instructions 1-to-1 to intrinsics instead.
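As a rough illustration of what a 1-to-1 translation from assembly to intrinsics can look like, here is a minimal sketch. It is not taken from the actual kernels: the function name `dist2` and its arguments are hypothetical, and the assembly mnemonics in the comments are only the instructions each intrinsic typically maps to when compiled for plain SSE. The same source can be encoded with VEX (AVX) instructions by changing compiler flags, which is the point of moving away from raw assembly.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical helper: squared lengths of n displacement vectors,
 * processed four at a time. n must be a multiple of 4 in this sketch. */
void dist2(const float *dx, const float *dy, const float *dz,
           float *r2, size_t n)
{
    size_t i;

    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
#ifdef __SSE__
        __m128 x = _mm_loadu_ps(dx + i);   /* movups */
        __m128 y = _mm_loadu_ps(dy + i);   /* movups */
        __m128 z = _mm_loadu_ps(dz + i);   /* movups */

        x = _mm_mul_ps(x, x);              /* mulps  */
        y = _mm_mul_ps(y, y);              /* mulps  */
        z = _mm_mul_ps(z, z);              /* mulps  */
        x = _mm_add_ps(x, y);              /* addps  */
        x = _mm_add_ps(x, z);              /* addps  */
        _mm_storeu_ps(r2 + i, x);          /* movups */
#else
        /* Scalar fallback so the sketch also builds on non-x86 targets. */
        size_t k;
        for (k = i; k < i + 4; k++) {
            r2[k] = dx[k] * dx[k] + dy[k] * dy[k] + dz[k] * dz[k];
        }
#endif
    }
}
```

With intrinsics the compiler handles register allocation and instruction encoding, so the code no longer depends on a fixed set of hand-picked registers.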

Associated revisions

Revision 5ba7125c (diff)
Added by Erik Lindahl almost 7 years ago

New CPU detection & AVX/SSE code, removed raw assembly files.

Removed all raw assembly files and deprecated Altivec support.
Removed support for NASM and other assemblers, and replaced
previous SSE detection code with a new module using CPUID instead.
Added detection for SSE2, SSE4.1, AVX 128-bit with FMA, and AVX 256-bit.
Added CMake detection of build platform based on CPUID, and output this
to the log file. The executables now compare the compile-time platform
and selected acceleration with the run-time platform and most suitable
acceleration and warn the user if they do not match. The compiler
detection code has also been reordered slightly to produce more readable
warnings when OpenMP is not available, and correctly disable pragma
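
The CPUID-based detection and the compile-time versus run-time comparison described above could be sketched roughly as follows. This is not the module from the commit: all type and function names (`accel_t`, `accel_from_bits`, `accel_detect`, `accel_matches`) are hypothetical, the code assumes the `<cpuid.h>` header shipped with GCC and Clang, and it omits the check that the operating system has enabled AVX state (OSXSAVE/XGETBV), which real code needs before using AVX. It also assumes that a CPU reporting both AVX and FMA4 (as Bulldozer does) should use the 128-bit FMA path.

```c
#include <assert.h>
#include <stdio.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Hypothetical acceleration levels, named after the commit message. */
typedef enum {
    ACCEL_NONE, ACCEL_SSE2, ACCEL_SSE4_1, ACCEL_AVX_128_FMA, ACCEL_AVX_256
} accel_t;

#define BIT_SSE2   (1u << 26)  /* CPUID leaf 1, EDX */
#define BIT_SSE4_1 (1u << 19)  /* CPUID leaf 1, ECX */
#define BIT_AVX    (1u << 28)  /* CPUID leaf 1, ECX */
#define BIT_FMA4   (1u << 16)  /* CPUID leaf 0x80000001, ECX (AMD) */

/* Pure decision logic, kept separate from the CPUID call so it can be tested. */
accel_t accel_from_bits(unsigned edx1, unsigned ecx1, unsigned ecx_ext)
{
    if (ecx1 & BIT_AVX) {
        /* Assumption: AVX plus FMA4 means the 128-bit FMA path is preferred. */
        return (ecx_ext & BIT_FMA4) ? ACCEL_AVX_128_FMA : ACCEL_AVX_256;
    }
    if (ecx1 & BIT_SSE4_1) return ACCEL_SSE4_1;
    if (edx1 & BIT_SSE2)   return ACCEL_SSE2;
    return ACCEL_NONE;
}

/* Query the running CPU. OS support for AVX state is NOT checked here. */
accel_t accel_detect(void)
{
#ifdef HAVE_CPUID_H
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    unsigned ecx1, edx1, ecx_ext = 0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return ACCEL_NONE;
    ecx1 = ecx;
    edx1 = edx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) ecx_ext = ecx;
    return accel_from_bits(edx1, ecx1, ecx_ext);
#else
    return ACCEL_NONE;
#endif
}

const char *accel_name(accel_t a)
{
    static const char *names[] = {
        "None", "SSE2", "SSE4.1", "AVX_128_FMA", "AVX_256"
    };
    assert(a >= ACCEL_NONE && a <= ACCEL_AVX_256);
    return names[a];
}

/* Returns 1 if the build matches the host, otherwise warns and returns 0. */
int accel_matches(accel_t compiled, accel_t runtime)
{
    if (compiled == runtime) return 1;
    fprintf(stderr, "Warning: compiled for %s, but this host is best served by %s\n",
            accel_name(compiled), accel_name(runtime));
    return 0;
}
```

A program would call `accel_matches(BUILD_ACCEL, accel_detect())` once at startup, where `BUILD_ACCEL` stands for whatever value the build system recorded at configure time (again a hypothetical name).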

Added intrinsics code and math functions for SSE2, SSE4.1, AVX128/256
both in single and double precision. All math functions and permutation
code have been tested & verified. Single precision math functions are
correct apart from the least significant bit, and double precision has
roughly twice the accuracy.
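
One way to check a claim like "correct apart from the least significant bit" is to measure the distance between a computed result and a reference value in units in the last place (ULPs). The sketch below is only an illustration of that idea and is not the test code used for this commit; the helper names are hypothetical, and it assumes IEEE 754 single precision with a 32-bit `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a float from its raw bit pattern (handy for constructing test values). */
float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Map a float to an integer that grows monotonically with the float value.
 * IEEE 754 floats are sign-magnitude, so negative values are mirrored. */
static int64_t float_to_ordered(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

/* Number of representable floats between a and b (0 if they are equal). */
int64_t ulp_distance(float a, float b)
{
    int64_t d;

    assert(a == a && b == b);  /* NaN has no meaningful ULP distance */
    d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* "Correct apart from the least significant bit" as a predicate. */
int within_one_ulp(float computed, float reference)
{
    return ulp_distance(computed, reference) <= 1;
}
```

In a real test the reference would typically come from a higher-precision evaluation rounded to single precision, and the predicate would be applied over a sweep of input values.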

This has forced me to temporarily disable the SSE & Fortran acceleration.
SSE will be added back soon based on new intrinsics-only kernels currently
in testing, and we will test if Fortran still makes sense then.

Finally, the patch includes a modification to gmx_rmsdist where
a regression issue was introduced recently by using sqrtf() for
the norm function. This caused the Intel compiler to produce slightly
different results at high optimization levels, which became evident here.

Closes #926 - Raw assembly code has been removed.
Refs #923 - Old kernels removed, new will be added shortly.
Fixes #914 - CMake now does architecture-specific optimization.
Fixes #912, #913
Fixes #857 - We detect rdtscp support with CPUID and use it if possible.
Fixes #750
Closes #537, #574 - Altivec is now deprecated.

Change-Id: Icfca5a940762f8d82ae67b59c65b2d2ac683256d


#1 Updated by Szilárd Páll about 7 years ago

Just want to note that, from what I've seen, intrinsics tend to blow the fuse with the more exotic compilers like Cray, Pathscale, PGI, Open64. These four are known to generate incorrect code for the Verlet PME kernels (some for both).

This is not a major issue and it obviously does not block this task, but it is a concern we need to keep in mind.

#2 Updated by Szilárd Páll about 7 years ago

PS: this is not a bug, it's rather a task, isn't it?

#3 Updated by Erik Lindahl about 7 years ago

Definitions-schmefinitions. Previously it was no big deal, but with Bulldozer it is suddenly a very severe performance regression, and we've had it on the 4.6 feature list for quite a while. Even on SSE platforms, icc 12.x is often able to do 10% better than my hand-tuned kernels, which gcc has never achieved. Hopefully it will stay compatible with more compilers, but in the worst case those compilers will still work well with the C kernels.

I just finished the utility routines and should be able to commit kernels over the weekend.
