Task #2328

Assess AVX2 128 vs 256 on AMD Zen and change defaults

Added by Szilárd Páll about 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

The performance difference between 128-bit and 256-bit AVX2 builds on AMD Zen needs to be reassessed, and the defaults changed, in light of recent, more thorough testing.

The initial assumption that 256-bit AVX2 would never be worth it turns out to be incorrect because:
- Parts of the code (e.g. bondeds, search) run a lot faster with 256-bit AVX, and many other kernels run with about the same performance
- Nonbonded performance significantly improves with tabulated and/or 2xNN kernels

Given the former, GPU runs will often benefit from 256-bit AVX, e.g. octanol 55K with a 1950X + Quadro P6000:

            1TPC      2TPC
AVX2_128    81.69     84.859
AVX2_256    83.995    86.083
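
To put the GPU numbers above in relative terms, here is a minimal sketch that computes the gain of AVX2_256 over AVX2_128 for each column. It assumes the figures are throughput (higher is better), which the surrounding text implies but the table does not state; the names `gpu_perf` and `relative_gain_percent` are made up for this illustration.

```python
# Figures copied from the GPU table above (octanol 55K, 1950X + Quadro P6000).
# Assumption: higher values mean better performance.
gpu_perf = {
    "AVX2_128": {"1TPC": 81.69, "2TPC": 84.859},
    "AVX2_256": {"1TPC": 83.995, "2TPC": 86.083},
}


def relative_gain_percent(new, old):
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Gain of the 256-bit build over the 128-bit build, per column.
gains = {
    tpc: relative_gain_percent(gpu_perf["AVX2_256"][tpc], gpu_perf["AVX2_128"][tpc])
    for tpc in ("1TPC", "2TPC")
}

for tpc, gain in gains.items():
    print(f"{tpc}: AVX2_256 vs AVX2_128: {gain:+.1f}%")
```

Under that assumption the 256-bit build comes out ahead in both columns, by roughly one to three percent.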

For CPU-only runs, initial benchmarks indicate that tabulated Ewald kernels are faster, and we might also want to switch to 2xNN with 256-bit SIMD:

                       water 48k          octanol 55k
                       1TPC     2TPC      1TPC     2TPC
AVX2_128 4xN ewTab     44.223   49.841    34.797   38.45
AVX2_128 4xN           44.165   47.124    34.246   36.141

AVX2_256 2xNN          43.45    46.926    34.963   36.762
AVX2_256 2xNN ewTab    42.702   48.085    34.031   37.71
AVX2_256 4xN ewTab     42.049   43.318    33.345   34.14
AVX2_256 4xN           42.961   43.148    33.641   33.811
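
The CPU-only table is easier to read once reduced to the best kernel setup per SIMD width. The sketch below does that for each benchmark column and reports the gap between the best AVX2_256 and the best AVX2_128 result. It again assumes higher is better; `cpu_perf`, `best_per_simd`, and `summary` are illustrative names, and the numbers are copied from the table above.

```python
columns = ("water 1TPC", "water 2TPC", "octanol 1TPC", "octanol 2TPC")

# (SIMD build, kernel setup) -> performance per column, copied from the table.
# Assumption: higher values mean better performance.
cpu_perf = {
    ("AVX2_128", "4xN ewTab"): (44.223, 49.841, 34.797, 38.45),
    ("AVX2_128", "4xN"): (44.165, 47.124, 34.246, 36.141),
    ("AVX2_256", "2xNN"): (43.45, 46.926, 34.963, 36.762),
    ("AVX2_256", "2xNN ewTab"): (42.702, 48.085, 34.031, 37.71),
    ("AVX2_256", "4xN ewTab"): (42.049, 43.318, 33.345, 34.14),
    ("AVX2_256", "4xN"): (42.961, 43.148, 33.641, 33.811),
}


def best_per_simd(column_index):
    """Return {simd: (kernel, value)} holding the fastest kernel per SIMD build."""
    best = {}
    for (simd, kernel), values in cpu_perf.items():
        value = values[column_index]
        if simd not in best or value > best[simd][1]:
            best[simd] = (kernel, value)
    return best


# column -> (best AVX2_128 kernel, best AVX2_256 kernel, gap in percent)
summary = {}
for index, column in enumerate(columns):
    best = best_per_simd(index)
    gap = (best["AVX2_256"][1] / best["AVX2_128"][1] - 1.0) * 100.0
    summary[column] = (best["AVX2_128"][0], best["AVX2_256"][0], gap)
    print(f"{column}: 128 -> {best['AVX2_128'][0]}, "
          f"256 -> {best['AVX2_256'][0]}, gap {gap:+.1f}%")
```

With these numbers the best AVX2_256 setup is always a 2xNN variant, and it stays within a few percent of the best AVX2_128 result in every column.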


Related issues

Related to GROMACS - Bug #2327: AVX2_128 and AVX_128_FMA double precision PME gather regression (Closed)

Associated revisions

Revision 1f860f69 (diff)
Added by Berk Hess about 2 years ago

Choose faster nbnxn SIMD kernels on AMD Zen

On AMD Zen tabulated Ewald kernels are always faster than analytical.
And with AVX2_256 2xNN kernels are faster than 4xN.
These faster choices are now made based on CpuInfo at run time.

Refs #2328

Change-Id: I146bc012910bc1f46ed14155651c3d2a7c1f91e5
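
The selection rule described in this commit message can be sketched as follows. This is not the GROMACS implementation (which is C++ and queries CpuInfo); it only illustrates the decision logic as stated above. `CpuDescription`, `choose_nbnxn_kernel`, and the enum names are hypothetical, and the non-Zen defaults are passed in by the caller because this page does not state them.

```python
from dataclasses import dataclass
from enum import Enum


class SimdWidth(Enum):
    AVX2_128 = 128
    AVX2_256 = 256


class KernelLayout(Enum):
    N4 = "4xN"
    NN2 = "2xNN"


class EwaldKind(Enum):
    ANALYTICAL = "analytical"
    TABULATED = "tabulated"


@dataclass(frozen=True)
class CpuDescription:
    """Hypothetical stand-in for the run-time CPU detection result."""
    is_amd_zen: bool


def choose_nbnxn_kernel(cpu, simd, *, default_layout, default_ewald):
    """Apply the Zen-specific overrides from the commit message.

    On AMD Zen, tabulated Ewald is always chosen; with AVX2_256 the
    2xNN layout is chosen as well. Otherwise the caller's defaults apply.
    """
    layout, ewald = default_layout, default_ewald
    if cpu.is_amd_zen:
        ewald = EwaldKind.TABULATED
        if simd is SimdWidth.AVX2_256:
            layout = KernelLayout.NN2
    return layout, ewald


# Usage: the defaults below are placeholders, not the real non-Zen defaults.
zen = CpuDescription(is_amd_zen=True)
layout, ewald = choose_nbnxn_kernel(
    zen,
    SimdWidth.AVX2_256,
    default_layout=KernelLayout.N4,
    default_ewald=EwaldKind.ANALYTICAL,
)
print(layout.value, ewald.value)
```

Keeping the overrides in one function makes it easy to revisit the rule when benchmarks change, which the later comments on this issue suggest will be necessary.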

Revision cacb3b41 (diff)
Added by Berk Hess almost 2 years ago

Remove SIMD warning for AMD Zen

After choosing nbnxn 2xNN kernels and changing to tabulated Ewald
nonbonded kernels, AVX2_256 is only a few percent slower than AVX2_128
on AMD Zen and is faster with nonbondeds and PME on a GPU. So we
should not warn the user when AVX2_256 is used.

Refs #2328

Change-Id: I67b66b0025c7e3c31943f3f02b80e97fb9764066

History

#1 Updated by Szilárd Páll about 2 years ago

  • Related to Bug #2327: AVX2_128 and AVX_128_FMA double precision PME gather regression added

#2 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related DRAFT patchset '2' for Issue #2328.
Uploader: Berk Hess
Change-Id: gromacs~release-2018~I146bc012910bc1f46ed14155651c3d2a7c1f91e5
Gerrit URL: https://gerrit.gromacs.org/7283

#3 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2328.
Uploader: Berk Hess
Change-Id: gromacs~release-2018~I67b66b0025c7e3c31943f3f02b80e97fb9764066
Gerrit URL: https://gerrit.gromacs.org/7289

#4 Updated by Berk Hess about 2 years ago

  • Tracker changed from Bug to Task
  • Status changed from New to Fix uploaded
  • Affected version deleted (2018-beta1)

#5 Updated by Erik Lindahl about 2 years ago

I don't expect this to be too system-dependent, but when making performance decisions we should include benchmarks for slightly more realistic systems with proteins or membranes.

#6 Updated by Szilárd Páll about 2 years ago

Erik Lindahl wrote:

I don't expect this to be too system-dependent, but when making performance decisions we should include benchmarks for slightly more realistic systems with proteins or membranes.

Sure, I can do a broader set of tests, but note that these systems were picked intentionally: the octanol system contains about the same amount of bonded work as a similarly sized protein system would, but eliminates the impact of load imbalance with DD (which I also ran, but did not include here). The water box was used because it exercises the special case where the nonbonded kernels can skip the LJ computation.

#7 Updated by Berk Hess almost 2 years ago

  • Status changed from Fix uploaded to Resolved

This seems to be resolved for the moment, but unfortunately we will need to keep checking this whenever the code changes or processors are updated.

#8 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Resolved to Closed
