Conversation

@Sopel97 (Member) commented Oct 29, 2020

I pondered whether this should be a draft PR or an issue, but since I have code that can be compiled I decided a PR would be a better place for discussion. The code in this draft only supports targets >= ssse3, so I removed the others from the Makefile.

I've been tinkering lately with the manually vectorized affine transformations, and I strongly believe they can be easily improved.
My idea is to unroll (for the hidden layers) the outer loop in

for (IndexType i = 0; i < kOutputDimensions; ++i) {

by a factor of 4, accumulate into 4 separate sums, and then perform the horizontal additions on all of them together (a sketch follows the links below). This is faster than 4 separate horizontal additions (many lanes go unused when doing hadd on one register at a time), and it also allows replacing 4 32-bit loads and stores with one 128-bit load and store.
SSSE3: https://godbolt.org/z/Kz1h5v
AVX2: https://godbolt.org/z/nqaoYK
AVX-512: https://godbolt.org/z/xM4E4T
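
For concreteness, here is a minimal sketch (not the exact PR code) of the combined horizontal add, using the m128_haddx4/m256_haddx4 names from the patch; the accumulation loop around it is omitted:

```cpp
#include <immintrin.h>

// SSSE3: combine four 32-bit accumulator vectors into a single __m128i
// holding the four final sums, so one 128-bit store replaces four
// scalar stores.
static inline __m128i m128_haddx4(__m128i sum0, __m128i sum1,
                                  __m128i sum2, __m128i sum3,
                                  __m128i bias) {
    sum0 = _mm_hadd_epi32(sum0, sum1); // [s0a+s0b, s0c+s0d, s1a+s1b, s1c+s1d]
    sum2 = _mm_hadd_epi32(sum2, sum3);
    sum0 = _mm_hadd_epi32(sum0, sum2); // [sum0, sum1, sum2, sum3], complete
    return _mm_add_epi32(sum0, bias);
}

// AVX2: same idea, but hadd works within each 128-bit lane, so one
// final fold of the upper lane onto the lower lane is needed.
static inline __m128i m256_haddx4(__m256i sum0, __m256i sum1,
                                  __m256i sum2, __m256i sum3,
                                  __m128i bias) {
    sum0 = _mm256_hadd_epi32(sum0, sum1);
    sum2 = _mm256_hadd_epi32(sum2, sum3);
    sum0 = _mm256_hadd_epi32(sum0, sum2);
    const __m128i lo = _mm256_castsi256_si128(sum0);
    const __m128i hi = _mm256_extracti128_si256(sum0, 1);
    return _mm_add_epi32(_mm_add_epi32(lo, hi), bias);
}
```

Every lane does useful work at every stage, which is where the win over four independent single-register horizontal adds comes from.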

I have measured a 4% speed increase for the ssse3 build on my i7-920. I have also made similar adjustments to avx2, vnni256, avx512, and vnni512, but I can't test them for speed because my CPU doesn't support those instruction sets. I've verified the correctness of avx2, vnni256, avx512, and vnni512 using the Intel Software Development Emulator (thanks @syzygy1; btw, you may want to look at this).

I expect some discussion to take place here before this is tested on fishtest. Maybe some further improvements could be found. It would also be nice if someone benchmarked the >ssse3 versions.

Some random thoughts:

  • Maybe it would be better to do 8 parallel accumulators and a haddx8 in the AVX2 code; not sure about register pressure, though.
  • I tried to do 8 parallel accumulators for AVX512, but it lacks hadd and I couldn't figure out how to do m512_haddx8 in a good way with shuffles + adds. AVX512 is fairly extensive, so I expect some improvements and discoveries here.
  • For AVX512 it may be worth trying hidden-layer sizes larger than 32.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

Bench: 3870606

@vondele (Member) commented Oct 29, 2020

On avx2, I measure about 1% speedup.

$ ~/chess/pyshbench/pyshbench.py ./stockfish.master ./stockfish.patch 10
run       base       test     diff
  1    2105092    2126002   +20910
  2    2110419    2144096   +33677
  3    2118367    2114663    -3704
  4    2092706    2118924   +26218
  5    2121526    2142196   +20670
  6    2093430    2134822   +41392
  7    2108027    2116699    +8672
  8    2209317    2228042   +18725
  9    2120596    2147904   +27308
 10    2197075    2223127   +26052

Result of  10 runs
==================
base (...kfish.master) =    2127656  +/- 25473
test (...ckfish.patch) =    2149648  +/- 25822
diff                   =     +21992  +/- 7793

speedup        = +0.0103
P(speedup > 0) =  1.0000

@vondele (Member) commented Oct 29, 2020

As a minor nit, on avx2 some functions are unused:

nnue/../nnue/architectures/../layers/affine_transform.h:110:12: warning: variable ‘m256_hadd’ set but not used [-Wunused-but-set-variable]
  110 |       auto m256_hadd = [](__m256i sum, int bias) -> int {
      |            ^~~~~~~~~
nnue/../nnue/architectures/../layers/affine_transform.h:130:12: warning: variable ‘m128_hadd’ set but not used [-Wunused-but-set-variable]
  130 |       auto m128_hadd = [](__m128i sum, int bias) -> int {
      |            ^~~~~~~~~
nnue/../nnue/architectures/../layers/affine_transform.h:136:12: warning: variable ‘m128_haddx4’ set but not used [-Wunused-but-set-variable]
  136 |       auto m128_haddx4 = [](__m128i sum0, __m128i sum1, __m128i sum2, __m128i sum3, __m128i bias) -> __m128i {

@Sopel97 (Member, Author) commented Oct 29, 2020

I profiled the vnni512 version with the Intel Software Development Emulator. If the results are to be believed, they show that the most costly affine transform is almost twice as fast (at least instruction-count-wise) in the PR version.
Master: https://pastebin.com/UKLWjVMG in particular https://pastebin.com/UdyXvcaJ
PR: https://pastebin.com/adCYuC4X in particular https://pastebin.com/RgvsvtGV

@Sopel97 Sopel97 force-pushed the optimize_collected branch 4 times, most recently from 5919967 to 2563d30 Compare October 31, 2020 11:58
@Sopel97 (Member, Author) commented Oct 31, 2020

Test here: https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

But considering that the majority of the workers are AVX2, I would still appreciate testing on vnni256, avx512, and vnni512.

@Sopel97 (Member, Author) commented Oct 31, 2020

The test passed. Is LTC required for speedup patches? If not, I'll just fix the unused-variable warnings and prepare a PR.

@syzygy1 (Contributor) commented Oct 31, 2020

As far as I understand, LTC is not required for pure speedups.

@syzygy1 (Contributor) commented Oct 31, 2020

I think computing 8 inner-products in parallel should be possible on AVX2 without register spilling, but I haven't tried it yet.

How to do the horizontal adds most efficiently is an interesting question, especially on AVX512.
One possible approach is to first fold each 512-bit vector of partial sums into a 256-bit vector of partial sums by adding the upper half onto the lower half (a sketch follows). I don't know if that is faster than @Sopel97's implementation.
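
A minimal sketch of that folding step, assuming 32-bit partial sums (the function name fold_512_to_256 is made up here):

```cpp
#include <immintrin.h>

// Reduce a 512-bit vector of 32-bit partial sums to a 256-bit vector
// by adding its upper 256-bit half onto its lower half (AVX-512F).
static inline __m256i fold_512_to_256(__m512i sum) {
    const __m256i lo = _mm512_castsi512_si256(sum);       // lower 256 bits
    const __m256i hi = _mm512_extracti64x4_epi64(sum, 1); // upper 256 bits
    return _mm256_add_epi32(lo, hi);
}
```

After folding four accumulators this way, the AVX2 m256_haddx4 reduction can finish the job.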

It may now also make sense to clip the 32-bit sums immediately to 8-bit ints. This may be difficult to do in SF, though.

@vondele (Member) commented Oct 31, 2020

LTC is not required for a speedup patch.

This is a pretty big patch, mostly because it covers several architectures and cases. I would appreciate it if @Sopel97 spends some time/effort to make sure it is written optimally from a maintenance/generality point of view. My current thinking is that for the NNUE core machinery it makes sense to optimize the code, in a machine-specific way if needed. Yet we should keep in mind that the code should remain modifiable as new network architectures or hardware architectures appear.

I would appreciate it if somebody with intrinsics programming experience could review the patch (@syzygy1 or @mstembera?).

@Sopel97 (Member, Author) commented Oct 31, 2020

Sure, I'll document it and cover the cases that currently exit(1).

A Contributor commented in an inline review:

I would leave out the explicit load instruction (so just row0[j]), and same for aligned stores. It looks cleaner, and it might give the compiler more opportunity for optimisation.
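
A hypothetical illustration of the two styles (row0 is assumed here to be a pointer to aligned __m256i data):

```cpp
#include <immintrin.h>

// Explicit aligned-load intrinsic, presumably as in the current patch.
static inline __m256i load_explicit(const __m256i* row0, int j) {
    return _mm256_load_si256(&row0[j]);
}

// Plain indexing, as suggested above; the compiler emits the same
// aligned load but has more freedom to reorder or fold it away.
static inline __m256i load_implicit(const __m256i* row0, int j) {
    return row0[j];
}
```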

@Sopel97 (Member, Author) commented Oct 31, 2020

I tried an 8x unroll at godbolt. With clang it looks very clean; with gcc not so much, and it spills a few registers.
https://godbolt.org/z/hfdT9G
Might be worth it for AVX512 if we find a nice way to do 8x horizontal additions.

Just throwing this here for the future. I won't be changing this now.

@Sopel97 Sopel97 force-pushed the optimize_collected branch from 2563d30 to ff9cfb2 Compare October 31, 2020 18:49
@Sopel97 (Member, Author) commented Oct 31, 2020

I've made some changes to make the code cleaner, ensured that all possible cases are covered, and addressed the things syzygy pointed out. When it passes CI and gets @vondele's approval I will put the final version on fishtest and then make a new PR.

@Sopel97 Sopel97 force-pushed the optimize_collected branch 2 times, most recently from ad9a086 to 0c03958 Compare October 31, 2020 20:03
@Sopel97 (Member, Author) commented Oct 31, 2020

I gave avx512 compiles (g++ 10.2, without PGO) of the old and new versions to ipmanchess, and he kindly ran both.
https://cdn.discordapp.com/attachments/771338720028393472/772192457290285086/Stockfish_avx512_no-pgo_old-new_benches.PNG

@syzygy1 (Contributor) commented Oct 31, 2020

With Cfish, I found that the compiled code looks better if you replace the row0/1/2/3 pointers with a single pointer (looks like the difference between your clang and gcc output).

(I guess it's not so easy to look at the SF assembly code because SF relies heavily on LTO.)

@syzygy1 (Contributor) commented Oct 31, 2020

Nice AVX512 speedup :)

@Sopel97 (Member, Author) commented Oct 31, 2020

I'm tinkering on godbolt. GCC seems to have #pragma GCC unroll 4 (#pragma unroll 4 for clang), which helps: https://godbolt.org/z/41PsGa. Such enforced unrolling could perhaps be pushed further (you can even force it to unroll the whole thing, though that would likely trash the I$), but this is for another PR.
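
A minimal sketch of what that pragma usage could look like (the loop body is a stand-in, not the actual affine-transform code):

```cpp
#if defined(__clang__)
    #pragma unroll 4
#elif defined(__GNUC__)
    #pragma GCC unroll 4
#endif
for (IndexType i = 0; i < kOutputDimensions; ++i) {
    // ... accumulate the i-th output row; the pragma asks the
    // compiler to unroll this loop 4x regardless of its heuristics ...
}
```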

@syzygy1 (Contributor) commented Oct 31, 2020

At least in Cfish the unrolling happens automatically after profiling.

It would be nice if one could make gcc produce assembly output even with LTO. I don't think it is possible, but I could easily be wrong. But you can still run g++ evaluate_nnue.cpp -S (with all the right flags) yourself and leave out -flto. You may have to prevent the Makefile from removing the .gcda files.

@Sopel97 (Member, Author) commented Oct 31, 2020

I think I'll look into using XED next time for such stuff (it comes with SDE). It seems to be easy to find the assembly listings for the affine transforms with it. In the real Stockfish compilation it's unrolled better than at godbolt (2x) and uses rax for indexing with constant offsets.

@vondele (Member) commented Oct 31, 2020

On Linux you could do CXXFLAGS="-g" make -j ARCH=x86-64-avx2 profile-build && objdump -dlS ./stockfish to get the assembly, with source and line numbers intermixed.

I've done a few tests; it seems a little faster now on avx2 as well. Compiled a couple of crosses, etc. So it looks good for a final test if you feel it is final.

You can keep this PR, just force push a version with a full commit message including the testing results.

BTW, to test vnni etc., I have used cloud resources in the past; that's quite doable, but obviously not a requirement. It is nice you've been able to check correctness on the emulator (please check again for the final version).

@Sopel97 Sopel97 force-pushed the optimize_collected branch 2 times, most recently from e19c50b to 0f4ba31 Compare October 31, 2020 23:22
@Sopel97 Sopel97 marked this pull request as ready for review November 1, 2020 09:28
@Sopel97 Sopel97 force-pushed the optimize_collected branch from 0f4ba31 to fd55b76 Compare November 1, 2020 09:28
A non-functional speedup. Unroll the loops going over the output dimensions in the affine transform layers by a factor of 4 and perform 4 horizontal additions at a time. Instead of doing naive horizontal additions on each vector separately, use hadd and shuffling between vectors to reduce the number of instructions by using all lanes in all stages of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

Bench: 3517795
@Sopel97 Sopel97 force-pushed the optimize_collected branch from fd55b76 to 8e0663b Compare November 1, 2020 09:42
@Sopel97 (Member, Author) commented Nov 1, 2020

So a new net was merged, and appveyor failed because it compares the bench to master. I've rebased and changed the bench in the commit message to match the new master.

@Sopel97 Sopel97 changed the title from "[WIP] Optimization of the affine transformations." to "Optimization of the affine transformations." Nov 1, 2020
@vondele vondele added the to be merged Will be merged shortly label Nov 2, 2020
@vondele vondele closed this in 75e06a1 Nov 2, 2020
@Sopel97 Sopel97 deleted the optimize_collected branch November 14, 2020 15:12
QueensGambit pushed a commit to QueensGambit/Fairy-Stockfish that referenced this pull request Nov 20, 2020
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately use hadd and shuffling between vectors to reduce
the number of instructions by using all lanes for all stages
of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

closes official-stockfish/Stockfish#3203

No functional change