Skip to content

Conversation

@Sopel97
Copy link
Member

@Sopel97 Sopel97 commented Nov 3, 2020

This branch has some things to try and benchmark. Since there are no AVX512 workers on fishtest this has to proceed somewhat differently but maybe we'll reach consensus.

Changes:

  • AVX512 code for smaller affine transforms that works over 2 columns at a time.

  • The above unrolled by 8 times instead of 4 times, with more parallel horizontal adds (m512_hadd256x16).

  • (behind #if 0, because slightly worse in limited tests. Requires further investigation) Unrolled AVX512 affine transform loop over output dimensions from 4 to 8 while doing more parallel horizontal adds (m512_haddx8).

  • AVX512 code for feature transformer Transform

I'm fairly certain that this code in this PR is better than master as tests by ipmanchess indicate about 1.5% speed improvement. The thing I'm not sure about is the last bullet point - the unrolling of the normal AVX512 loop - which turned out inconclusive. Now this was tested without a profile build as I can't make one. The results could be different with a profile build, and possibly differ by hardware slightly.

@vondele
Copy link
Member

vondele commented Nov 3, 2020

on aws t3.xlarge Platinum 8259CL CPU @ 2.50GHz

Result of  10 runs
==================
base (...kfish.master) =    1795390  +/- 3762
test (...ckfish.patch) =    1828041  +/- 2566
diff                   =     +32651  +/- 2849

speedup        = +0.0182
P(speedup > 0) =  1.0000

@vondele
Copy link
Member

vondele commented Nov 3, 2020

if I remove the '#if 0' (and fix a small compilation issue), the benefit is clearly not as big (0.3% instead 1.8%).

@Sopel97
Copy link
Member Author

Sopel97 commented Nov 3, 2020

That aligns with ipmanchess' results. Thanks

@Sopel97
Copy link
Member Author

Sopel97 commented Nov 3, 2020

I've rebased on the better gcc bug workaround, removed the 8x unrolled avx512 implementation for the large layer. So this now should be the best I can come up in this area. Syzygy has sent me something better https://github.com/syzygy1/Cfish/blob/e051767f292d1c65b78a3a7b8255372694519518/src/nnue-regular.c#L90-L151 but it requires permuting weights so I don't think I'll be porting that.

@Sopel97 Sopel97 force-pushed the avx512_optimize_2 branch 5 times, most recently from d434695 to 4424338 Compare November 4, 2020 10:43
@mstembera
Copy link
Contributor

For clarity it would be much nicer to commit the stack array align stuff separately from this.

@Sopel97 Sopel97 force-pushed the avx512_optimize_2 branch 2 times, most recently from 2c7a481 to e353c99 Compare November 4, 2020 19:26
@mstembera
Copy link
Contributor

I am wondering why in master m512_haddx4 is implemented using the unpacklo unpackhi intrinsics but m256_haddx4 is implemented using the hadd intrinsics? It seems we should pick the same scheme for both as a base comparison.

@Sopel97
Copy link
Member Author

Sopel97 commented Nov 4, 2020

That's just because AVX512 doesn't have hadd. And hadd in AVX2 is basically unpackhi/lo + add in 3uops

@vondele
Copy link
Member

vondele commented Nov 4, 2020

@Sopel97 once you think this is ready, please submit a non-regression test on fishtest, to exclude this would be slower on avx2. I seem to locally observe a small ( ~0.5%) slowdown, but it could be in the noise.

@Sopel97 Sopel97 force-pushed the avx512_optimize_2 branch 2 times, most recently from 9e78c19 to 78410a6 Compare November 5, 2020 13:26
@Sopel97
Copy link
Member Author

Sopel97 commented Nov 5, 2020

After some tinkering I've reverted the last change, it seems that gcc doesn't quite like it and clang generates almost identical code for both (https://godbolt.org/z/sE6xxq). Now the AVX2 code is not affected, if you think I should still run non-regression I can do that later.

@vondele
Copy link
Member

vondele commented Nov 5, 2020

since it is now only affecting avx512 there is need to run a non-regression test. I'll repeat the speed measurement before merging.

@vondele
Copy link
Member

vondele commented Nov 5, 2020

can you squash in a single commit with an appropriate commit message?

@Sopel97 Sopel97 changed the title [WIP] AVX-512 optimizations. AVX-512 optimizations. Nov 5, 2020
@Sopel97 Sopel97 marked this pull request as ready for review November 5, 2020 18:55
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentations here are inconsistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the indentation of like 171, don't really know whether it should be 2 or 4 spaces here because it differs from line to line in this file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 167 and 169 should be the same but aren't. I'm not sure about 2 or 4 spaces for the same reason as you.

…mer.

For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes official-stockfish#3218

No functional change.
@gvreuls
Copy link
Contributor

gvreuls commented Nov 6, 2020

This branch has some things to try and benchmark. Since there are no AVX512 workers on fishtest this has to proceed somewhat differently but maybe we'll reach consensus.

There are AVX-512 capable workers on fishtest but the AVX-512 build on fishtest is disabled because it's "too slow" (which isn't true for my AVX-512 system). Will this influence the decision not to use AVX-512 builds on fishtest?

@vondele
Copy link
Member

vondele commented Nov 6, 2020

@gvreuls would be interesting if you could benchmark (single-threaded and multithreaded) this patch on your system relative to avx2. I'm not opposed to enabling avx512 on fishtest if we figure out in which case it is a win relative to avx2.

@Sopel97
Copy link
Member Author

Sopel97 commented Nov 6, 2020

@vondele from ipmanchess I got the following for the previous PR

and now it's further 2% on top.

@gvreuls
Copy link
Contributor

gvreuls commented Nov 6, 2020

@vondele I'm not at home right now, it will take a few hours until I can run the tests.

@gvreuls
Copy link
Contributor

gvreuls commented Nov 6, 2020

Here are my test results (the average of 5 bench runs). Please note that om my system AVX512 runs at the same clock speed as AVX2 (and I strongly suspect this is the case on ipmanchess' system too), I can rerun the tests with reduced AVX512 clock speed if that's deemed necessary.

Master AVX2 1 thread:
 Performance counter stats for 'system wide' (5 runs):

     7.414.924.848      cycles:u                                                      ( +-  0,83% )
    12.386.333.875      instructions:u            #    1,67  insn per cycle           ( +-  0,52% )

            1,7229 +- 0,0193 seconds time elapsed  ( +-  1,12% )
Sopel AVX512 1 thead:
 Performance counter stats for 'system wide' (5 runs):

     6.951.475.653      cycles:u                                                      ( +-  0,64% )
    11.342.953.826      instructions:u            #    1,63  insn per cycle           ( +-  0,38% )

            1,6052 +- 0,0161 seconds time elapsed  ( +-  1,00% )
Master AVX2 16 threads:
 Performance counter stats for 'system wide' (5 runs):

    51.001.185.587      cycles:u                                                      ( +-  1,16% )
    51.672.129.967      instructions:u            #    1,01  insn per cycle           ( +-  1,11% )

            1,0090 +- 0,0123 seconds time elapsed  ( +-  1,22% )
Sopel AVX512 16 threads:
 Performance counter stats for 'system wide' (5 runs):

    47.428.267.772      cycles:u                                                      ( +-  2,39% )
    46.959.286.438      instructions:u            #    0,99  insn per cycle           ( +-  2,44% )

            0,9363 +- 0,0252 seconds time elapsed  ( +-  2,69% )

@vondele
Copy link
Member

vondele commented Nov 6, 2020

I also have rerun the benchmark and see 1.8% speedup for the patch. avx512 vs avx2 is 9.5% faster.

I will merge this patch in the next round, and we should consider using avx512 on fishtest, but this can be dealt with separately.

@vondele vondele added the to be merged Will be merged shortly label Nov 7, 2020
@vondele vondele closed this in ba35c88 Nov 7, 2020
@vondele
Copy link
Member

vondele commented Nov 8, 2020

@gvreuls fishtest now using avx512 :

gvreuls-16cores-5 | Linux 5.8.17-200.fc32.x86_64 ARCH=64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT

@gvreuls
Copy link
Contributor

gvreuls commented Nov 8, 2020

@vondele Yes I noticed from official-stockfish/fishtest#782 (comment) thanks! From what I've gathered so far pemo and Spptr also have a AVX512 systems.

MichaelB7 pushed a commit to MichaelB7/Stockfish that referenced this pull request Nov 9, 2020
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes official-stockfish#3218

No functional change.
MichaelB7 added a commit to MichaelB7/Stockfish that referenced this pull request Nov 9, 2020
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes official-stockfish#3218

No functional change.
@Sopel97 Sopel97 deleted the avx512_optimize_2 branch November 14, 2020 15:12
QueensGambit pushed a commit to QueensGambit/Fairy-Stockfish that referenced this pull request Nov 20, 2020
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes official-stockfish/Stockfish#3218

No functional change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

to be merged Will be merged shortly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants