Conversation

@Sopel97 (Member) commented Oct 29, 2020

I pondered whether this should be a draft PR or an issue, but since I have code that can be compiled I decided a PR would be a better place for discussion. The code in this draft only supports targets >= ssse3, so I removed the others from the Makefile.

I've been tinkering lately with the manually vectorized affine transformations, and I strongly believe they can be easily improved.
My idea is to unroll (for the hidden layers) the outer loop in

for (IndexType i = 0; i < kOutputDimensions; ++i) {

by a factor of 4, accumulate into 4 separate sums, and then perform the horizontal additions on all of them together (a sketch follows the links below). This is faster than 4 separate horizontal additions (many lanes go unused when doing hadd on one register at a time), and it also allows replacing 4 32-bit loads and stores with one 128-bit load and store.
SSSE3: https://godbolt.org/z/Kz1h5v
AVX2: https://godbolt.org/z/nqaoYK
AVX-512: https://godbolt.org/z/xM4E4T
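
For concreteness, here is a minimal sketch (not the exact PR code) of the combined horizontal add, using the m128_haddx4/m256_haddx4 names from the patch; the accumulation loop around it is omitted:

```cpp
#include <immintrin.h>

// SSSE3: combine four 32-bit accumulator vectors into a single __m128i
// holding the four final sums, so one 128-bit store replaces four
// scalar stores.
static inline __m128i m128_haddx4(__m128i sum0, __m128i sum1,
                                  __m128i sum2, __m128i sum3,
                                  __m128i bias) {
    sum0 = _mm_hadd_epi32(sum0, sum1); // [s0a+s0b, s0c+s0d, s1a+s1b, s1c+s1d]
    sum2 = _mm_hadd_epi32(sum2, sum3);
    sum0 = _mm_hadd_epi32(sum0, sum2); // [sum0, sum1, sum2, sum3], complete
    return _mm_add_epi32(sum0, bias);
}

// AVX2: same idea, but hadd works within each 128-bit lane, so one
// final fold of the upper lane onto the lower lane is needed.
static inline __m128i m256_haddx4(__m256i sum0, __m256i sum1,
                                  __m256i sum2, __m256i sum3,
                                  __m128i bias) {
    sum0 = _mm256_hadd_epi32(sum0, sum1);
    sum2 = _mm256_hadd_epi32(sum2, sum3);
    sum0 = _mm256_hadd_epi32(sum0, sum2);
    const __m128i lo = _mm256_castsi256_si128(sum0);
    const __m128i hi = _mm256_extracti128_si256(sum0, 1);
    return _mm_add_epi32(_mm_add_epi32(lo, hi), bias);
}
```

Every lane does useful work at every stage, which is where the win over four independent single-register horizontal adds comes from.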

I have measured a 4% speed increase for the ssse3 build on my i7-920. I have also made similar adjustments to avx2, vnni256, avx512, and vnni512, but I can't test them for speed because my CPU doesn't support those instruction sets. I've verified the correctness of avx2, vnni256, avx512, and vnni512 using the Intel Software Development Emulator (thanks @syzygy1; btw, you may want to look at this).

I expect some discussion to take place here before this is tested on fishtest. Maybe some further improvements could be found. It would also be nice if someone benchmarked the >ssse3 versions.

Some random thoughts:

  • Maybe it would be better to do 8 parallel accumulators and a haddx8 in the AVX2 code; not sure about register pressure, though.
  • I tried to do 8 parallel accumulators for AVX512, but it lacks hadd and I couldn't figure out how to do m512_haddx8 in a good way with shuffles + adds. AVX512 is fairly extensive, so I expect some improvements and discoveries here.
  • For AVX512 it may be worth trying hidden-layer sizes larger than 32.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

Bench: 3870606

@vondele (Member) commented Oct 29, 2020

On avx2, I measure about 1% speedup.

$ ~/chess/pyshbench/pyshbench.py ./stockfish.master ./stockfish.patch 10
run       base       test     diff
  1    2105092    2126002   +20910
  2    2110419    2144096   +33677
  3    2118367    2114663    -3704
  4    2092706    2118924   +26218
  5    2121526    2142196   +20670
  6    2093430    2134822   +41392
  7    2108027    2116699    +8672
  8    2209317    2228042   +18725
  9    2120596    2147904   +27308
 10    2197075    2223127   +26052

Result of  10 runs
==================
base (...kfish.master) =    2127656  +/- 25473
test (...ckfish.patch) =    2149648  +/- 25822
diff                   =     +21992  +/- 7793

speedup        = +0.0103
P(speedup > 0) =  1.0000

@vondele (Member) commented Oct 29, 2020

As a minor nit, on avx2 some functions are unused:

nnue/../nnue/architectures/../layers/affine_transform.h:110:12: warning: variable ‘m256_hadd’ set but not used [-Wunused-but-set-variable]
  110 |       auto m256_hadd = [](__m256i sum, int bias) -> int {
      |            ^~~~~~~~~
nnue/../nnue/architectures/../layers/affine_transform.h:130:12: warning: variable ‘m128_hadd’ set but not used [-Wunused-but-set-variable]
  130 |       auto m128_hadd = [](__m128i sum, int bias) -> int {
      |            ^~~~~~~~~
nnue/../nnue/architectures/../layers/affine_transform.h:136:12: warning: variable ‘m128_haddx4’ set but not used [-Wunused-but-set-variable]
  136 |       auto m128_haddx4 = [](__m128i sum0, __m128i sum1, __m128i sum2, __m128i sum3, __m128i bias) -> __m128i {

@Sopel97 (Member, Author) commented Oct 29, 2020

I profiled the vnni512 version with the Intel Software Development Emulator. If the results are to be believed, they show that the most costly affine transform is almost twice as fast (at least instruction-count-wise) in the PR version.
Master: https://pastebin.com/UKLWjVMG in particular https://pastebin.com/UdyXvcaJ
PR: https://pastebin.com/adCYuC4X in particular https://pastebin.com/RgvsvtGV

@Sopel97 Sopel97 force-pushed the optimize_collected branch 4 times, most recently from 5919967 to 2563d30 Compare October 31, 2020 11:58
@Sopel97 (Member, Author) commented Oct 31, 2020

Test here: https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

But considering that the majority of the workers are AVX2, I would still appreciate testing on vnni256, avx512, and vnni512.

@Sopel97 (Member, Author) commented Oct 31, 2020

The test passed. Is LTC required for speedup patches? If not, I'll just fix the unused-variable warnings and prepare a PR.

@syzygy1 (Contributor) commented Oct 31, 2020

As far as I understand, LTC is not required for pure speedups.

@syzygy1 (Contributor) commented Oct 31, 2020

I think computing 8 inner-products in parallel should be possible on AVX2 without register spilling, but I haven't tried it yet.

How to do the horizontal adds most efficiently is an interesting question, especially on AVX512.
One possible approach is to first fold each 512-bit vector of partial sums into a 256-bit vector of partial sums by adding the upper half onto the lower half (a sketch follows). I don't know if that is faster than @Sopel97's implementation.
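
A minimal sketch of that folding step, assuming 32-bit partial sums (the function name fold_512_to_256 is made up here):

```cpp
#include <immintrin.h>

// Reduce a 512-bit vector of 32-bit partial sums to a 256-bit vector
// by adding its upper 256-bit half onto its lower half (AVX-512F).
static inline __m256i fold_512_to_256(__m512i sum) {
    const __m256i lo = _mm512_castsi512_si256(sum);       // lower 256 bits
    const __m256i hi = _mm512_extracti64x4_epi64(sum, 1); // upper 256 bits
    return _mm256_add_epi32(lo, hi);
}
```

After folding four accumulators this way, the AVX2 m256_haddx4 reduction can finish the job.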

It may now also make sense to clip the 32-bit sums immediately to 8-bit ints. This may be difficult to do in SF, though.

@vondele (Member) commented Oct 31, 2020

LTC is not required for a speedup patch.

This is a pretty big patch, mostly because it covers several architectures and cases. I would appreciate it if @Sopel97 spends some time/effort to make sure it is written optimally from a maintenance/generality point of view. My current thinking is that for the NNUE core machinery it makes sense to optimize the code, in a machine-specific way if needed. Yet we should keep in mind that the code should remain modifiable as new network architectures or hardware architectures appear.

I would appreciate it if somebody with intrinsics programming experience could review the patch (@syzygy1 or @mstembera?).

@Sopel97 (Member, Author) commented Oct 31, 2020

Sure, I'll document it and cover the cases that currently exit(1).

A Contributor commented in an inline review:

I would leave out the explicit load instruction (so just row0[j]), and same for aligned stores. It looks cleaner, and it might give the compiler more opportunity for optimisation.
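
A hypothetical illustration of the two styles (row0 is assumed here to be a pointer to aligned __m256i data):

```cpp
#include <immintrin.h>

// Explicit aligned-load intrinsic, presumably as in the current patch.
static inline __m256i load_explicit(const __m256i* row0, int j) {
    return _mm256_load_si256(&row0[j]);
}

// Plain indexing, as suggested above; the compiler emits the same
// aligned load but has more freedom to reorder or fold it away.
static inline __m256i load_implicit(const __m256i* row0, int j) {
    return row0[j];
}
```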

@Sopel97 (Member, Author) commented Oct 31, 2020

I tried an 8x unroll at godbolt. With clang it looks very clean; with gcc not so much, and it spills a few registers.
https://godbolt.org/z/hfdT9G
Might be worth it for AVX512 if we find a nice way to do 8x horizontal additions.

Just throwing this here for the future. I won't be changing this now.

@Sopel97 Sopel97 force-pushed the optimize_collected branch from 2563d30 to ff9cfb2 Compare October 31, 2020 18:49
@Sopel97 (Member, Author) commented Oct 31, 2020

I've made some changes to make the code cleaner, ensured that all possible cases are covered, and addressed the things syzygy pointed out. When it passes CI and gets @vondele's approval I will put the final version on fishtest and then make a new PR.

@Sopel97 Sopel97 force-pushed the optimize_collected branch 2 times, most recently from ad9a086 to 0c03958 Compare October 31, 2020 20:03
@Sopel97 (Member, Author) commented Oct 31, 2020

I gave avx512 compiles (g++ 10.2, without PGO) of the old and new versions to ipmanchess, and he kindly ran both.
https://cdn.discordapp.com/attachments/771338720028393472/772192457290285086/Stockfish_avx512_no-pgo_old-new_benches.PNG

@syzygy1 (Contributor) commented Oct 31, 2020

With Cfish, I found that the compiled code looks better if you replace the row0/1/2/3 pointers with a single pointer (looks like the difference between your clang and gcc output).

(I guess it's not so easy to look at the SF assembly code because SF relies heavily on LTO.)

@syzygy1 (Contributor) commented Oct 31, 2020

Nice AVX512 speedup :)

@Sopel97 (Member, Author) commented Oct 31, 2020

I'm tinkering on godbolt. GCC seems to have #pragma GCC unroll 4 (#pragma unroll 4 for clang), which helps: https://godbolt.org/z/41PsGa. Such enforced unrolling could perhaps be pushed further (you can even force it to unroll the whole thing, though that would likely trash the I$), but this is for another PR.
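
A minimal sketch of what that pragma usage could look like (the loop body is a stand-in, not the actual affine-transform code):

```cpp
#if defined(__clang__)
    #pragma unroll 4
#elif defined(__GNUC__)
    #pragma GCC unroll 4
#endif
for (IndexType i = 0; i < kOutputDimensions; ++i) {
    // ... accumulate the i-th output row; the pragma asks the
    // compiler to unroll this loop 4x regardless of its heuristics ...
}
```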

@syzygy1 (Contributor) commented Oct 31, 2020

At least in Cfish the unrolling happens automatically after profiling.

It would be nice if one could make gcc produce assembly output even with LTO. I don't think it is possible, but I could easily be wrong. But you can still run g++ evaluate_nnue.cpp -S (with all the right flags) yourself and leave out -flto. You may have to prevent the Makefile from removing the .gcda files.

@Sopel97 (Member, Author) commented Oct 31, 2020

I think I'll look into using XED next time for such stuff (it comes with SDE). It seems to be easy to find the assembly listings for the affine transforms with it. In the real Stockfish compilation it's unrolled better than at godbolt (2x) and uses rax for indexing with constant offsets.

@vondele (Member) commented Oct 31, 2020

On Linux you could do CXXFLAGS="-g" make -j ARCH=x86-64-avx2 profile-build && objdump -dlS ./stockfish to get the assembly, with source and line numbers intermixed.

I've done a few tests; it seems a little faster now on avx2 as well. Compiled a couple of crosses, etc. So it looks good for a final test if you feel it is final.

You can keep this PR, just force push a version with a full commit message including the testing results.

BTW, to test vnni etc., I have used cloud resources in the past; that's quite doable, but obviously not a requirement. It is nice you've been able to check correctness on the emulator (please check again for the final version).

@Sopel97 Sopel97 force-pushed the optimize_collected branch 2 times, most recently from e19c50b to 0f4ba31 Compare October 31, 2020 23:22
@Sopel97 Sopel97 marked this pull request as ready for review November 1, 2020 09:28
@Sopel97 Sopel97 force-pushed the optimize_collected branch from 0f4ba31 to fd55b76 Compare November 1, 2020 09:28
A non-functional speedup. Unroll the loops going over the output dimensions in the affine transform layers by a factor of 4 and perform 4 horizontal additions at a time. Instead of doing naive horizontal additions on each vector separately, use hadd and shuffling between vectors to reduce the number of instructions by using all lanes in all stages of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

Bench: 3517795
@Sopel97 Sopel97 force-pushed the optimize_collected branch from fd55b76 to 8e0663b Compare November 1, 2020 09:42
@Sopel97 (Member, Author) commented Nov 1, 2020

So a new net was merged, and appveyor failed because it compares the bench to master. I've rebased and changed the bench in the commit message to match the new master.

@Sopel97 Sopel97 changed the title from "[WIP] Optimization of the affine transformations." to "Optimization of the affine transformations." Nov 1, 2020
@vondele vondele added the to be merged Will be merged shortly label Nov 2, 2020
@vondele vondele closed this in 75e06a1 Nov 2, 2020
@Sopel97 Sopel97 deleted the optimize_collected branch November 14, 2020 15:12
QueensGambit pushed a commit to QueensGambit/Fairy-Stockfish that referenced this pull request Nov 20, 2020
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately use hadd and shuffling between vectors to reduce
the number of instructions by using all lanes for all stages
of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

closes official-stockfish/Stockfish#3203

No functional change