Work around denorm handling performance penalty on AMD Vega
The concrete issue we ran into is that the ROCm stack's compiler generates code that handles denorms on AMD Vega and later, because doing so is assumed to carry a low penalty. This turns out not to be the case for the nonbonded kernels: flushing denormals to zero reduces wall-time by ~25-30%. Flushing is already the default on all other architectures, as well as in CUDA (when using the fast-math flag).
Besides avoiding this performance penalty, we also make denorm handling more uniform across platforms by explicitly using the
-cl-denorms-are-zero flag as a hint to the compiler that we prefer denormalized numbers flushed to zero.
Request flushing denorms to zero in OpenCL
This change adds -cl-denorms-are-zero by default to the flags used
for kernel compilation. This is done to:
- avoid a large performance penalty on AMD Vega with ROCm (which by
  default handles denorms on GFX9 or later);
- make the defaults uniform across CUDA and OpenCL.
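
A minimal sketch of how the flag could be appended to the kernel build
options on the host side. The helper name makeKernelBuildOptions and the
example base option are assumptions for illustration, not the code of this
change; only -cl-denorms-are-zero itself comes from the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not taken from this change): assembles the option
// string that a host program would later pass to clBuildProgram().
std::string makeKernelBuildOptions(const std::string& baseOptions, bool flushDenorms = true)
{
    // Standard OpenCL build option; it is a hint, so a compiler may ignore it.
    const std::string denormsFlag = "-cl-denorms-are-zero";

    std::string options = baseOptions;
    // Append the flag only once, so repeated calls do not duplicate it.
    if (flushDenorms && options.find(denormsFlag) == std::string::npos)
    {
        if (!options.empty())
        {
            options += ' ';
        }
        options += denormsFlag;
    }

    // Postcondition: when flushing is requested, the flag is present.
    assert(!flushDenorms || options.find(denormsFlag) != std::string::npos);
    return options;
}
```

The resulting string would be handed to clBuildProgram() as its options
argument, e.g. makeKernelBuildOptions("-cl-fast-relaxed-math").c_str().
Keeping the flag behind a boolean leaves room to turn it off if denorm
handling is ever needed for debugging.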