assess AVX2 128 vs 256 on AMD Zen and change defaults
The performance difference between 128-bit and 256-bit builds on AMD Zen needs to be re-assessed, and the defaults changed, in light of recent, more thorough testing.
The initial assumption that 256-bit AVX2 would never be worth it turns out to be incorrect because:
- Parts of the code (e.g. bondeds, search) run a lot faster with 256-bit AVX, and many other kernels run with about the same performance
- Nonbonded performance significantly improves with tabulated and/or 2xNN kernels
Given the former, GPU runs will often benefit from 256-bit AVX, e.g. octanol 55K with 1950X + Quadro P6000:
          1TPC    2TPC
AVX2_128  81.69   84.859
AVX2_256  83.995  86.083
Regarding CPU-only runs, some initial benchmarks indicate that tabulated Ewald kernels are faster, and that we may also want to switch to 2xNN with 256-bit SIMD:
                     water 48k        octanol 55k
                     1TPC    2TPC     1TPC    2TPC
AVX2_128 4xN  ewTab  44.223  49.841   34.797  38.45
AVX2_128 4xN         44.165  47.124   34.246  36.141
AVX2_256 2xNN        43.45   46.926   34.963  36.762
AVX2_256 2xNN ewTab  42.702  48.085   34.031  37.71
AVX2_256 4xN  ewTab  42.049  43.318   33.345  34.14
AVX2_256 4xN         42.961  43.148   33.641  33.811
Choose faster nbnxn SIMD kernels on AMD Zen
On AMD Zen, tabulated Ewald kernels are always faster than analytical ones,
and with AVX2_256, 2xNN kernels are faster than 4xN.
These faster choices are now made based on CpuInfo at run time.
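A minimal sketch of what such run-time selection could look like; the type and function names here (`CpuFeatures`, `chooseLayout`, `chooseEwald`) are hypothetical stand-ins for illustration, not the actual GROMACS CpuInfo API:

```cpp
#include <cassert>

// Illustrative enums for the two kernel choices discussed above.
enum class KernelLayout { Simd4xN, Simd2xNN };
enum class EwaldKernel  { Analytical, Tabulated };

// Hypothetical summary of what CPU detection would provide,
// e.g. via CPUID vendor/family checks.
struct CpuFeatures
{
    bool isAmdZen;   // true if the detected CPU is an AMD Zen part
    int  simdWidth;  // SIMD width of the build in bits (128 or 256)
};

// On Zen with 256-bit AVX2, 2xNN outperforms 4xN; elsewhere keep 4xN.
KernelLayout chooseLayout(const CpuFeatures& cpu)
{
    if (cpu.isAmdZen && cpu.simdWidth == 256)
    {
        return KernelLayout::Simd2xNN;
    }
    return KernelLayout::Simd4xN;
}

// On Zen, tabulated Ewald kernels beat analytical ones at any SIMD width.
EwaldKernel chooseEwald(const CpuFeatures& cpu)
{
    return cpu.isAmdZen ? EwaldKernel::Tabulated : EwaldKernel::Analytical;
}
```

The point of making the choice at run time rather than at configure time is that a single AVX2 binary then picks the faster kernels on Zen while leaving Intel defaults untouched.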
Remove SIMD warning for AMD Zen
After choosing nbnxn 2xNN kernels and switching to tabulated Ewald
nonbonded kernels, AVX2_256 is only a few percent slower than AVX2_128
on AMD Zen, and is faster with nonbondeds and PME on a GPU. So we
should not warn the user when AVX2_256 is used.
#6 Updated by Szilárd Páll over 2 years ago
Erik Lindahl wrote:
I don't expect this to be too system-dependent, but when making performance decisions we should include benchmarks for slightly more realistic systems with proteins or membranes.
Sure, I can do a broader set of tests, but note that these were intentionally picked: the octanol system contains about the same bonded work as a similarly sized protein system would, but eliminates the impact of load imbalance with DD (which I also ran but did not include here). The water box was used because it exercises the special case where the nonbonded kernels can skip the LJ compute.