Task #3061

support for Zen2

Added by Szilárd Páll 3 months ago. Updated 28 days ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Difficulty: uncategorized

Description

TODO:
  • cmake detection needs tweaks to default to (likely ideal) AVX2_256 on Zen2 (SIMD flag + Vendor alone won't be enough, use CPU Family?)
  • tweak runtime detection to default to AVX2_256 as the ideal setting on Zen2
  • perf benchmark
  • backport to 2019.4
dhfr_2019.3_1thread.log (22.3 KB) Kevin Boyd, 10/19/2019 09:40 PM
dhfr_2019.3_12thread.log (25.6 KB) Kevin Boyd, 10/19/2019 09:40 PM
dhfr_2019.4_1thread.log (22.3 KB) Kevin Boyd, 10/19/2019 09:40 PM
dhfr_2019.4_12thread.log (25.6 KB) Kevin Boyd, 10/19/2019 09:40 PM
dhfr_master_1thread.log (22.6 KB) Kevin Boyd, 10/19/2019 09:40 PM
dhfr_master_12thread.log (25.9 KB) Kevin Boyd, 10/19/2019 09:40 PM

Associated revisions

Revision 03c59bcb (diff)
Added by Erik Lindahl 3 months ago

Default to AVX2_256 SIMD for Zen2

From Zen2, we should no longer use the previous
hack with 128-bit AVX2 since the microarchitecture
can now execute two full-width AVX2 instructions
per cycle. Rather than specializing for Zen2, the
logic has been changed so we only apply the 128-bit
optimization for the chips where we know it helps
(Zen and Zen+, based on the model numbers), while
we default to full-width AVX2 for all other AMD
CPUs - which for now is only Zen2.

Fixes #3061.

Change-Id: I66017b200cd627bb9792f53ee39dd80d8e05965a
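
The selection rule described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from revision 03c59bcb: the names `Vendor`, `SimdChoice`, `isZenOrZenPlus` and `chooseAvx2Width` are hypothetical, the CPU is assumed to support AVX2, and the model numbers listed for Zen and Zen+ (1, 17, 8, 24 in family 23) are an assumption that should be checked against the actual change.

```cpp
// Hypothetical types for illustration only; not taken from the GROMACS sources.
enum class Vendor { Amd, Intel, Other };
enum class SimdChoice { Avx2_128, Avx2_256 };

// Assumption: Zen and Zen+ are AMD family 23 with one of these model numbers.
constexpr bool isZenOrZenPlus(Vendor vendor, unsigned family, unsigned model)
{
    return vendor == Vendor::Amd && family == 23u
           && (model == 1u || model == 17u || model == 8u || model == 24u);
}

// Apply the 128-bit optimization only where it is known to help (Zen, Zen+);
// every other AVX2-capable CPU, including Zen2, gets full-width AVX2.
constexpr SimdChoice chooseAvx2Width(Vendor vendor, unsigned family, unsigned model)
{
    return isZenOrZenPlus(vendor, family, model) ? SimdChoice::Avx2_128
                                                 : SimdChoice::Avx2_256;
}
```

Listing the chips that need the special case, rather than the chips that do not, means a future AMD model falls through to the full-width default without a further code change.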

History

#1 Updated by Szilárd Páll 3 months ago

Szilárd Páll wrote:

  • cmake detection needs tweaks to default to (likely ideal) AVX2_256 on Zen2 (SIMD flag + Vendor alone won't be enough, use CPU Family?)

There are three new instructions that should be able to distinguish between Zen/Zen+ and Zen2: CLWB, WBNOINVD, and RDPID. However, when running virtualized (e.g. in the cloud), some flags will not be passed through, so I am not sure this is the most robust solution. It would be better to rely on the CPU model/stepping, but I guess doing so in cmake is a pain.
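
For illustration, a flag-based check along these lines might look like the sketch below. It works on raw CPUID register values supplied by the caller, so it makes no claim about how the registers are obtained. The struct and function names are hypothetical, and the bit positions (CLWB in leaf 7 EBX bit 24, RDPID in leaf 7 ECX bit 22, WBNOINVD in leaf 0x80000008 EBX bit 9) are quoted from memory of the vendor CPUID documentation and should be verified before use.

```cpp
#include <cstdint>

// Raw register values from the two CPUID leaves that carry the three flags.
struct CpuidFeatureRegs
{
    std::uint32_t leaf7Ebx;        // CPUID leaf 0x7, sub-leaf 0, EBX
    std::uint32_t leaf7Ecx;        // CPUID leaf 0x7, sub-leaf 0, ECX
    std::uint32_t leaf80000008Ebx; // CPUID leaf 0x80000008, EBX
};

constexpr bool hasClwb(CpuidFeatureRegs r) { return ((r.leaf7Ebx >> 24) & 1u) != 0u; }
constexpr bool hasRdpid(CpuidFeatureRegs r) { return ((r.leaf7Ecx >> 22) & 1u) != 0u; }
constexpr bool hasWbnoinvd(CpuidFeatureRegs r) { return ((r.leaf80000008Ebx >> 9) & 1u) != 0u; }

// Hypothetical heuristic: only report "looks like Zen2" when all three flags
// are visible. A hypervisor that masks any of them makes this return false,
// which is the robustness concern raised above.
constexpr bool looksLikeZen2ByFlags(CpuidFeatureRegs r)
{
    return hasClwb(r) && hasRdpid(r) && hasWbnoinvd(r);
}
```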

#2 Updated by Szilárd Páll 3 months ago

Here's the /proc/cpuinfo of the new Zen2:
https://www.spec.org/cpu2017/results/res2019q3/cpu2017-20190723-16385.html
While the Zen1 looks like this:
https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171211-01539.html

Seems like model >= 49 may be the right check?
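
A model-based check of this kind could be sketched as below, decoding the family and model from the EAX value of CPUID leaf 1 in the standard way (extended family added when the base family is 0xF, extended model prepended when the base family is 0x6 or 0xF). The function names are hypothetical, restricting the test to family 23 is an added assumption, and the two signature values used in the usage note are illustrative examples rather than values taken from the linked reports.

```cpp
#include <cstdint>

// Extract a bit field from a register value.
constexpr unsigned field(std::uint32_t value, unsigned shift, std::uint32_t mask)
{
    return (value >> shift) & mask;
}

// Display family: base family, plus the extended family when the base is 0xF.
constexpr unsigned decodeFamily(std::uint32_t eax)
{
    return field(eax, 8, 0xFu) == 0xFu ? field(eax, 8, 0xFu) + field(eax, 20, 0xFFu)
                                       : field(eax, 8, 0xFu);
}

// Display model: extended model forms the high nibble for base family 0x6 or 0xF.
constexpr unsigned decodeModel(std::uint32_t eax)
{
    return (field(eax, 8, 0xFu) == 0x6u || field(eax, 8, 0xFu) == 0xFu)
                   ? ((field(eax, 16, 0xFu) << 4) | field(eax, 4, 0xFu))
                   : field(eax, 4, 0xFu);
}

// The check proposed above, restricted to family 23 (an assumption).
constexpr bool passesZen2ModelCheck(std::uint32_t eax)
{
    return decodeFamily(eax) == 23u && decodeModel(eax) >= 49u;
}
```

For example, a signature of 0x00830F10 decodes to family 23, model 49 and passes the check, while 0x00800F12 decodes to family 23, model 1 and does not.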

#3 Updated by Anonymous 3 months ago

  • Status changed from New to Resolved

#4 Updated by Szilárd Páll 3 months ago

  • Status changed from Resolved to Blocked, need info

We still need to have the kernel layout (4xn vs 2xnn) and the Ewald correction treatment re-assessed. The latter certainly needs to be revised; the former will likely require more thought (preliminary data shows that with 1 thread/core 2xnn is faster in most cases, while 4xn is always faster with 2 threads/core).

#5 Updated by Szilárd Páll 3 months ago

Szilárd Páll wrote:

We still need to have the kernel layout (4xn vs 2xnn) and the Ewald correction treatment re-assessed. The latter certainly needs to be revised; the former will likely require more thought (preliminary data shows that with 1 thread/core 2xnn is faster in most cases, while 4xn is always faster with 2 threads/core).

Correction: looks like 4xn should work across the board on Zen2.

#6 Updated by Berk Hess about 2 months ago

We should switch Zen2 (non-Zen1) to 4xN kernels, right?

#7 Updated by Mark Abraham about 2 months ago

  • Target version changed from 2020-beta1 to 2020-beta2

#8 Updated by Szilárd Páll about 2 months ago

Berk Hess wrote:

We should switch Zen2 (non-Zen1) to 4xN kernels, right?

Yes (it is 10-23% faster), but your last change has already done that:
https://redmine.gromacs.org/projects/gromacs/repository/revisions/d242f4b487ac3ce5e325af8f6ec2477bdd7d8da1/diff/src/gromacs/mdlib/forcerec.cpp

#9 Updated by Paul Bauer about 1 month ago

  • Status changed from Blocked, need info to Resolved

Has this not been merged already?

#10 Updated by Szilárd Páll about 1 month ago

Paul Bauer wrote:

has this not been merged already?

Awaiting feedback from testing on actual hardware.

Kevin, did you have a chance to check some runs on your machine?

#11 Updated by Kevin Boyd about 1 month ago

Kevin, did you have a chance to check some runs on your machine?

Sorry, slipped my mind. You want me to compare 2019.3 with 2019.4, right?

#12 Updated by Szilárd Páll about 1 month ago

Kevin Boyd wrote:

Kevin, did you have a chance to check some runs on your machine?

Sorry, slipped my mind. You want me to compare 2019.3 with 2019.4, right?

Yes, please. A short CPU-only run from each, with log output. If you have time and can throw in master, that could be useful -- just to confirm that things work as expected. Thanks!

#13 Updated by Kevin Boyd about 1 month ago

  • Status changed from Resolved to Closed

#14 Updated by Kevin Boyd about 1 month ago

  • Status changed from Closed to In Progress

#15 Updated by Kevin Boyd about 1 month ago

Szilárd Páll wrote:

Kevin Boyd wrote:

Kevin, did you have a chance to check some runs on your machine?

Sorry, slipped my mind. You want me to compare 2019.3 with 2019.4, right?

Yes, please. A short CPU-only run from each, with log output. If you have time and can throw in master, that could be useful -- just to confirm that things work as expected. Thanks!

2019.4 correctly defaulted to avx2_256.

The 2019.3 version chose the 4x4 kernel, while 2019.4 and master chose 4x8.

Let me know if you want me to run some longer simulations to actually check the performance.
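
As a rough guide to reading the 4x4 and 4x8 kernel names: as far as I understand the naming, the first number is the i-cluster size (4 atoms) and the second is the j-cluster size, which for the 4xN layout equals the SIMD width in reals and for the 2xNN layout equals half of it. The sketch below encodes that relationship; the enum and function names are hypothetical, and single precision (32-bit reals) is assumed by default.

```cpp
// Hypothetical names for the two kernel layouts discussed in this issue.
enum class KernelLayout { FourByN, TwoByNN };

// j-cluster size for a given layout and SIMD register width.
// Assumption: 4xN uses the full SIMD width in reals, 2xNN uses half of it.
constexpr int jClusterSize(KernelLayout layout, int simdBits, int realBits = 32)
{
    return layout == KernelLayout::FourByN ? simdBits / realBits
                                           : simdBits / realBits / 2;
}
```

Under these assumptions, a 4xN kernel gives a j-cluster of 4 with 128-bit SIMD and 8 with 256-bit SIMD, which matches the 4x4 and 4x8 kernels reported for 2019.3 and 2019.4.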

#16 Updated by Szilárd Páll 29 days ago

  • Status changed from In Progress to Resolved

Kevin Boyd wrote:

Szilárd Páll wrote:

Kevin Boyd wrote:

Kevin, did you have a chance to check some runs on your machine?

Sorry, slipped my mind. You want me to compare 2019.3 with 2019.4, right?

Yes, please. A short CPU-only run from each, with log output. If you have time and can throw in master, that could be useful -- just to confirm that things work as expected. Thanks!

2019.4 correctly defaulted to avx2_256.

The 2019.3 version chose the 4x4 kernel, while 2019.4 and master chose 4x8.

Looks good, that's exactly as expected. Thanks!

Let me know if you want me to run some longer simulations to actually check the performance.

That's OK, we should have some Zen2 CPUs around soon.

#17 Updated by Paul Bauer 28 days ago

  • Status changed from Resolved to Closed
