-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Optimization of the affine transformations. #3203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
On avx2, I measure about 1% speedup. |
|
as a minor nit, on avx2 some functions are unused: |
|
I profiled the vnni512 version with the Intel Software Developement Emulator. If the results are to be believed they show that the most costly affine transform is almost twice as fast (or at least instruction-wise) in the PR version. |
5919967 to
2563d30
Compare
|
test here https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3 but considering that majority of the workers are AVX2 I would still appreciate testing on vnni256, avx512, vnni512 |
|
The test passed. Is LTC required for speedup patches? If not I'll just fix the unused variables warnings and prepare a PR. |
|
As far as I understand, LTC is not required for pure speed ups. |
|
I think computing 8 inner-products in parallel should be possible on AVX2 without register spilling, but I haven't tried it yet. How to do the horizontal adding most efficiently is an interesting question, especially on AVX512. It may now also make sense to clip the 32-bit sums immediately to 8-bit ints. This may be difficult to do in SF, though. |
|
LTC is not required for a speedup patch. This is a pretty big patch, mostly because it covers several architectures and cases. I would appreciate if @Sopel97 spents some time/effort to make sure it is written optimally from a maintenance/generality point of view. My current thinking is that for the NNUE core machinery it makes sense to optimize, in a machine specific way if needed, the code. Yet, we should keep in mind that the code should remain modifiable as new network architectures or hardware architectures appear. I would appreciate if somebody with intrinsics programming experience can review the patch (@syzygy1 or @mstembera ?) |
|
Sure, I'll document it and cover the cases that are currently |
src/nnue/layers/affine_transform.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave out the explicit load instruction (so just row0[j]), and same for aligned stores. It looks cleaner, and it might give the compiler more opportunity for optimisation.
|
I tried 8x unroll at godbolt. With clang it looks very clean, with gcc not so much though and spills a few registers. Just throwing this here for the future. I won't be changing this now. |
2563d30 to
ff9cfb2
Compare
|
I've made some changes to make the code cleaner, esured to cover all possible cases, and addressed the stuff syzygy pointed out. When it passes CI and gets @vondele 's approval I will put the final version on fishtest and then make a new PR. |
ad9a086 to
0c03958
Compare
|
I gave avx512 compiles (g++ 10.2, without PGO) for old and new versions to ipmanchess and he kindly ran both. |
|
With Cfish, I found that the compiled code looks better if you replace the row0/1/2/3 pointers with a single pointer (looks like the difference between your clang and gcc output). (I guess it's not so easy to look at the SF assembly code because SF relies heavily on LTO.) |
|
Nice AVX512 speedup :) |
|
I'm tinkering on godbolt. GCC seems to have |
|
At least in Cfish the unrolling happens automatically after profiling. It would be nice if one could make gcc produce assembly output even with LTO. I don't think it is possible, but I could easily be wrong. But you can still execute the g++ evaluate_nnue.cpp -S (with all the right flags) yourself and leave out -flto. You may have to prevent the Makefile from removing the .gcda files. |
|
I think I'll look into using XED next time for such stuff (it comes with SDE). It seems to be east to find the assembly listings for the affine transforms. In the real stockfish compilation it's unrolled better than at godbolt, 2x and uses rax for indexing with constant offsets. |
|
on linux you could do I've done a few tests, seemingly a little faster now on avx2 as well. Compiled a couple of crosses etc. So looks good for a final test if you feel it is final. You can keep this PR, just force push a version with a full commit message including the testing results. BTW, to test vnni etc, I have used cloud resources in the past, that's quite doable, but obviously no requirement. It is nice you've been able to check correctness on the emulator (please check again for the final version). |
e19c50b to
0f4ba31
Compare
0f4ba31 to
fd55b76
Compare
A non-functional speedup. Unroll the loops going over the output dimensions in the affine transform layers by a factor of 4 and perform 4 horizontal additions at a time. Instead of doing naive horizontal additions on each vector separately use hadd and shuffling between vectors to reduce the number of instructions by using all lanes for all stages of the horizontal adds.
passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3
passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3
Bench: 3517795
fd55b76 to
8e0663b
Compare
|
So a new net was merged and appveyor failed because it compares the bench to master. I've rebased and changed the bench in the commit message to match new master. |
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately use hadd and shuffling between vectors to reduce
the number of instructions by using all lanes for all stages
of the horizontal adds.
passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3
passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3
closes official-stockfish/Stockfish#3203
No functional change
I pondered whether this should be a draft PR or an issue, but since I have code that can be compiled I decided a PR will be a better place for discussion. The code in this draft only supports target >= ssse3 so I removed others from the Makefile.
I've been tinkering lately with the manually vectorized affine transformations and I strongly believe that it can be easly improved.
My idea is to unroll (for the hidden layers) the outer loop in
Stockfish/src/nnue/layers/affine_transform.h
Line 112 in 0f6c08c
SSSE3: https://godbolt.org/z/Kz1h5v
AVX2: https://godbolt.org/z/nqaoYK
AVX-512: https://godbolt.org/z/xM4E4T
I have measured a 4% speed increase for ssse3 build on my i7-920. I have also made similar adjustments to avx2, vnni256, avx512, vnni512 but I can't test them for speed because my CPU doesn't support the instruction sets. I've verified the correctness for avx2, vnni256, avx512, vnni512 using Intel Software Emulator (thanks @syzygy1, btw, may want to look at this).
I expect some discussion to take place here before this is tested on fishtest. Maybe some further improvements could be found. It would also be nice if someone benchmarked the >ssse3 versions.
Some random thoughts:
passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3
passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3
Bench: 3870606