AVX-512 optimizations. #3218

Sopel97 · 2020-11-03T11:31:18Z

This branch has some things to try and benchmark. Since there are no AVX512 workers on fishtest this has to proceed somewhat differently but maybe we'll reach consensus.

Changes:

AVX512 code for smaller affine transforms that works over 2 columns at a time.
The above unrolled by 8 times instead of 4 times, with more parallel horizontal adds (m512_hadd256x16).
(behind #if 0, because slightly worse in limited tests. Requires further investigation) Unrolled AVX512 affine transform loop over output dimensions from 4 to 8 while doing more parallel horizontal adds (m512_haddx8).
AVX512 code for feature transformer Transform

I'm fairly certain that this code in this PR is better than master as tests by ipmanchess indicate about 1.5% speed improvement. The thing I'm not sure about is the last bullet point - the unrolling of the normal AVX512 loop - which turned out inconclusive. Now this was tested without a profile build as I can't make one. The results could be different with a profile build, and possibly differ by hardware slightly.

vondele · 2020-11-03T16:00:38Z

on aws t3.xlarge Platinum 8259CL CPU @ 2.50GHz

Result of  10 runs
==================
base (...kfish.master) =    1795390  +/- 3762
test (...ckfish.patch) =    1828041  +/- 2566
diff                   =     +32651  +/- 2849

speedup        = +0.0182
P(speedup > 0) =  1.0000

vondele · 2020-11-03T16:58:23Z

if I remove the '#if 0' (and fix a small compilation issue), the benefit is clearly not as big (0.3% instead 1.8%).

Sopel97 · 2020-11-03T18:20:27Z

That aligns with ipmanchess' results. Thanks

Sopel97 · 2020-11-03T21:50:35Z

I've rebased on the better gcc bug workaround, removed the 8x unrolled avx512 implementation for the large layer. So this now should be the best I can come up in this area. Syzygy has sent me something better https://github.com/syzygy1/Cfish/blob/e051767f292d1c65b78a3a7b8255372694519518/src/nnue-regular.c#L90-L151 but it requires permuting weights so I don't think I'll be porting that.

mstembera · 2020-11-04T11:32:16Z

For clarity it would be much nicer to commit the stack array align stuff separately from this.

mstembera · 2020-11-04T20:14:47Z

I am wondering why in master m512_haddx4 is implemented using the unpacklo unpackhi intrinsics but m256_haddx4 is implemented using the hadd intrinsics? It seems we should pick the same scheme for both as a base comparison.

Sopel97 · 2020-11-04T20:22:24Z

That's just because AVX512 doesn't have hadd. And hadd in AVX2 is basically unpackhi/lo + add in 3uops

vondele · 2020-11-04T22:42:08Z

@Sopel97 once you think this is ready, please submit a non-regression test on fishtest, to exclude this would be slower on avx2. I seem to locally observe a small ( ~0.5%) slowdown, but it could be in the noise.

Sopel97 · 2020-11-05T13:28:32Z

After some tinkering I've reverted the last change, it seems that gcc doesn't quite like it and clang generates almost identical code for both (https://godbolt.org/z/sE6xxq). Now the AVX2 code is not affected, if you think I should still run non-regression I can do that later.

vondele · 2020-11-05T18:41:54Z

since it is now only affecting avx512 there is need to run a non-regression test. I'll repeat the speed measurement before merging.

vondele · 2020-11-05T18:42:56Z

can you squash in a single commit with an appropriate commit message?

mstembera · 2020-11-05T21:10:21Z

src/nnue/nnue_feature_transformer.h

The indentations here are inconsistent.

I changed the indentation of like 171, don't really know whether it should be 2 or 4 spaces here because it differs from line to line in this file

Line 167 and 169 should be the same but aren't. I'm not sure about 2 or 4 spaces for the same reason as you.

…mer. For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish#3218 No functional change.

gvreuls · 2020-11-06T12:07:16Z

This branch has some things to try and benchmark. Since there are no AVX512 workers on fishtest this has to proceed somewhat differently but maybe we'll reach consensus.

There are AVX-512 capable workers on fishtest but the AVX-512 build on fishtest is disabled because it's "too slow" (which isn't true for my AVX-512 system). Will this influence the decision not to use AVX-512 builds on fishtest?

vondele · 2020-11-06T12:22:42Z

@gvreuls would be interesting if you could benchmark (single-threaded and multithreaded) this patch on your system relative to avx2. I'm not opposed to enabling avx512 on fishtest if we figure out in which case it is a win relative to avx2.

Sopel97 · 2020-11-06T12:51:13Z

@vondele from ipmanchess I got the following for the previous PR

and now it's further 2% on top.

gvreuls · 2020-11-06T12:58:05Z

@vondele I'm not at home right now, it will take a few hours until I can run the tests.

gvreuls · 2020-11-06T17:24:57Z

Here are my test results (the average of 5 bench runs). Please note that om my system AVX512 runs at the same clock speed as AVX2 (and I strongly suspect this is the case on ipmanchess' system too), I can rerun the tests with reduced AVX512 clock speed if that's deemed necessary.

Master AVX2 1 thread:
 Performance counter stats for 'system wide' (5 runs):

     7.414.924.848      cycles:u                                                      ( +-  0,83% )
    12.386.333.875      instructions:u            #    1,67  insn per cycle           ( +-  0,52% )

            1,7229 +- 0,0193 seconds time elapsed  ( +-  1,12% )

Sopel AVX512 1 thead:
 Performance counter stats for 'system wide' (5 runs):

     6.951.475.653      cycles:u                                                      ( +-  0,64% )
    11.342.953.826      instructions:u            #    1,63  insn per cycle           ( +-  0,38% )

            1,6052 +- 0,0161 seconds time elapsed  ( +-  1,00% )

Master AVX2 16 threads:
 Performance counter stats for 'system wide' (5 runs):

    51.001.185.587      cycles:u                                                      ( +-  1,16% )
    51.672.129.967      instructions:u            #    1,01  insn per cycle           ( +-  1,11% )

            1,0090 +- 0,0123 seconds time elapsed  ( +-  1,22% )

Sopel AVX512 16 threads:
 Performance counter stats for 'system wide' (5 runs):

    47.428.267.772      cycles:u                                                      ( +-  2,39% )
    46.959.286.438      instructions:u            #    0,99  insn per cycle           ( +-  2,44% )

            0,9363 +- 0,0252 seconds time elapsed  ( +-  2,69% )

vondele · 2020-11-06T18:17:43Z

I also have rerun the benchmark and see 1.8% speedup for the patch. avx512 vs avx2 is 9.5% faster.

I will merge this patch in the next round, and we should consider using avx512 on fishtest, but this can be dealt with separately.

vondele · 2020-11-08T14:21:30Z

@gvreuls fishtest now using avx512 :

gvreuls-16cores-5 | Linux 5.8.17-200.fc32.x86_64 ARCH=64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT

gvreuls · 2020-11-08T15:54:26Z

@vondele Yes I noticed from official-stockfish/fishtest#782 (comment) thanks! From what I've gathered so far pemo and Spptr also have a AVX512 systems.

For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish#3218 No functional change.

For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish/Stockfish#3218 No functional change.

Sopel97 force-pushed the avx512_optimize_2 branch from 1e60aa0 to 43ed5d1 Compare November 3, 2020 12:48

Sopel97 force-pushed the avx512_optimize_2 branch from 883bc6e to 6df9641 Compare November 3, 2020 21:49

Sopel97 force-pushed the avx512_optimize_2 branch 5 times, most recently from d434695 to 4424338 Compare November 4, 2020 10:43

Sopel97 force-pushed the avx512_optimize_2 branch 2 times, most recently from 2c7a481 to e353c99 Compare November 4, 2020 19:26

Sopel97 force-pushed the avx512_optimize_2 branch from e353c99 to 66c398f Compare November 4, 2020 20:55

Sopel97 force-pushed the avx512_optimize_2 branch 2 times, most recently from 9e78c19 to 78410a6 Compare November 5, 2020 13:26

Sopel97 force-pushed the avx512_optimize_2 branch from 78410a6 to ae32712 Compare November 5, 2020 18:55

Sopel97 changed the title ~~[WIP] AVX-512 optimizations.~~ AVX-512 optimizations. Nov 5, 2020

Sopel97 marked this pull request as ready for review November 5, 2020 18:55

mstembera reviewed Nov 5, 2020

View reviewed changes

Sopel97 force-pushed the avx512_optimize_2 branch from ae32712 to 4c582af Compare November 5, 2020 21:19

Sopel97 force-pushed the avx512_optimize_2 branch from 4c582af to fc79013 Compare November 5, 2020 21:29

vondele added the to be merged Will be merged shortly label Nov 7, 2020

vondele closed this in ba35c88 Nov 7, 2020

Sopel97 deleted the avx512_optimize_2 branch November 14, 2020 15:12

AVX-512 optimizations. #3218

AVX-512 optimizations. #3218

Uh oh!

Conversation

Sopel97 commented Nov 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vondele commented Nov 3, 2020

Uh oh!

vondele commented Nov 3, 2020

Uh oh!

Sopel97 commented Nov 3, 2020

Uh oh!

Sopel97 commented Nov 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mstembera commented Nov 4, 2020

Uh oh!

mstembera commented Nov 4, 2020

Uh oh!

Sopel97 commented Nov 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vondele commented Nov 4, 2020

Uh oh!

Sopel97 commented Nov 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vondele commented Nov 5, 2020

Uh oh!

vondele commented Nov 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mstembera Nov 5, 2020

Choose a reason for hiding this comment

Uh oh!

Sopel97 Nov 5, 2020

Choose a reason for hiding this comment

Uh oh!

mstembera Nov 5, 2020

Choose a reason for hiding this comment

Uh oh!

gvreuls commented Nov 6, 2020

Uh oh!

vondele commented Nov 6, 2020

Uh oh!

Sopel97 commented Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gvreuls commented Nov 6, 2020

Uh oh!

gvreuls commented Nov 6, 2020

Uh oh!

vondele commented Nov 6, 2020

Uh oh!

vondele commented Nov 8, 2020

Uh oh!

gvreuls commented Nov 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sopel97 commented Nov 3, 2020 •

edited

Loading

Sopel97 commented Nov 3, 2020 •

edited

Loading

Sopel97 commented Nov 4, 2020 •

edited

Loading

Sopel97 commented Nov 5, 2020 •

edited

Loading

vondele commented Nov 5, 2020 •

edited

Loading

Sopel97 commented Nov 6, 2020 •

edited

Loading