-
Notifications
You must be signed in to change notification settings - Fork 2.7k
AVX-512 optimizations. #3218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX-512 optimizations. #3218
Conversation
1e60aa0 to
43ed5d1
Compare
|
on aws t3.xlarge Platinum 8259CL CPU @ 2.50GHz |
|
if I remove the '#if 0' (and fix a small compilation issue), the benefit is clearly not as big (0.3% instead 1.8%). |
|
That aligns with ipmanchess' results. Thanks |
883bc6e to
6df9641
Compare
|
I've rebased on the better gcc bug workaround, removed the 8x unrolled avx512 implementation for the large layer. So this now should be the best I can come up in this area. Syzygy has sent me something better https://github.com/syzygy1/Cfish/blob/e051767f292d1c65b78a3a7b8255372694519518/src/nnue-regular.c#L90-L151 but it requires permuting weights so I don't think I'll be porting that. |
d434695 to
4424338
Compare
|
For clarity it would be much nicer to commit the stack array align stuff separately from this. |
2c7a481 to
e353c99
Compare
|
I am wondering why in master m512_haddx4 is implemented using the unpacklo unpackhi intrinsics but m256_haddx4 is implemented using the hadd intrinsics? It seems we should pick the same scheme for both as a base comparison. |
|
That's just because AVX512 doesn't have hadd. And hadd in AVX2 is basically unpackhi/lo + add in 3uops |
e353c99 to
66c398f
Compare
|
@Sopel97 once you think this is ready, please submit a non-regression test on fishtest, to exclude this would be slower on avx2. I seem to locally observe a small ( ~0.5%) slowdown, but it could be in the noise. |
9e78c19 to
78410a6
Compare
|
After some tinkering I've reverted the last change, it seems that gcc doesn't quite like it and clang generates almost identical code for both (https://godbolt.org/z/sE6xxq). Now the AVX2 code is not affected, if you think I should still run non-regression I can do that later. |
|
since it is now only affecting avx512 there is need to run a non-regression test. I'll repeat the speed measurement before merging. |
|
can you squash in a single commit with an appropriate commit message? |
78410a6 to
ae32712
Compare
src/nnue/nnue_feature_transformer.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentations here are inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the indentation of like 171, don't really know whether it should be 2 or 4 spaces here because it differs from line to line in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 167 and 169 should be the same but aren't. I'm not sure about 2 or 4 spaces for the same reason as you.
ae32712 to
4c582af
Compare
…mer. For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish#3218 No functional change.
4c582af to
fc79013
Compare
There are AVX-512 capable workers on fishtest but the AVX-512 build on fishtest is disabled because it's "too slow" (which isn't true for my AVX-512 system). Will this influence the decision not to use AVX-512 builds on fishtest? |
|
@gvreuls would be interesting if you could benchmark (single-threaded and multithreaded) this patch on your system relative to avx2. I'm not opposed to enabling avx512 on fishtest if we figure out in which case it is a win relative to avx2. |
|
@vondele from ipmanchess I got the following for the previous PR |
|
@vondele I'm not at home right now, it will take a few hours until I can run the tests. |
|
Here are my test results (the average of 5 bench runs). Please note that om my system AVX512 runs at the same clock speed as AVX2 (and I strongly suspect this is the case on ipmanchess' system too), I can rerun the tests with reduced AVX512 clock speed if that's deemed necessary. |
|
I also have rerun the benchmark and see 1.8% speedup for the patch. avx512 vs avx2 is 9.5% faster. I will merge this patch in the next round, and we should consider using avx512 on fishtest, but this can be dealt with separately. |
|
@gvreuls fishtest now using avx512 : gvreuls-16cores-5 | Linux 5.8.17-200.fc32.x86_64 ARCH=64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT |
|
@vondele Yes I noticed from official-stockfish/fishtest#782 (comment) thanks! From what I've gathered so far pemo and Spptr also have a AVX512 systems. |
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish#3218 No functional change.
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish#3218 No functional change.
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes official-stockfish/Stockfish#3218 No functional change.
This branch has some things to try and benchmark. Since there are no AVX512 workers on fishtest this has to proceed somewhat differently but maybe we'll reach consensus.
Changes:
AVX512 code for smaller affine transforms that works over 2 columns at a time.
The above unrolled by 8 times instead of 4 times, with more parallel horizontal adds (
m512_hadd256x16).(behind #if 0, because slightly worse in limited tests. Requires further investigation) Unrolled AVX512 affine transform loop over output dimensions from 4 to 8 while doing more parallel horizontal adds (
m512_haddx8).AVX512 code for feature transformer
TransformI'm fairly certain that this code in this PR is better than master as tests by ipmanchess indicate about 1.5% speed improvement. The thing I'm not sure about is the last bullet point - the unrolling of the normal AVX512 loop - which turned out inconclusive. Now this was tested without a profile build as I can't make one. The results could be different with a profile build, and possibly differ by hardware slightly.