Remove precomputed SquareBB #4343
Bit-shifting is a single instruction, and should be faster than an array lookup on supported architectures. Besides (ever so slightly) speeding up the conversion of a square into a bitboard, we may see minor general performance improvements due to preserving more of the CPU's existing cache. Bench: 4106793
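For readers who have not seen the two variants side by side, here is a minimal sketch of the idea. The names `SquareTable`, `init_square_table`, `square_bb_lookup` and `square_bb_shift` are made up for illustration and are not taken from the actual patch; squares are assumed to be numbered 0 to 63.

```cpp
#include <array>
#include <cstdint>

using Bitboard = std::uint64_t;

// Hypothetical stand-in for the engine's square type: 0 .. 63.
using Square = int;

// Table-based variant (sketch): filled once at startup, then indexed.
// Every call is a memory load that competes for cache space.
inline std::array<Bitboard, 64> SquareTable{};  // hypothetical name

inline void init_square_table() {
    for (Square s = 0; s < 64; ++s)
        SquareTable[s] = Bitboard{1} << s;
}

inline Bitboard square_bb_lookup(Square s) { return SquareTable[s]; }

// Shift-based variant (sketch): compute the single-bit mask directly,
// with no memory access and no startup initialization.
constexpr Bitboard square_bb_shift(Square s) { return Bitboard{1} << s; }
```

Both functions return the same mask for every valid square, so the change is purely about how the value is produced.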
I tried this before, and for some reason it failed. I wonder whether we changed something in the compiler, whether the machines running the tests changed, or whether it failed at LTC... 🤔
I seem to remember being surprised about this, too. |
Yes, it probably depends a bit on the mix of machines running on fishtest. Locally this is a gain for me:
On what type of machine would we expect a 64-bit bitshift to be slower than a load? If this array were constant at compile-time I might be more inclined to agree that the effect might be variable, because the whole thing could be inlined to a literal value for some cases. But since that's not the case… |
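To illustrate the compile-time case mentioned here, a hypothetical sketch (C++17, not code from the repository) of a table that is a genuine constant: with a constant index the compiler can fold the lookup to a literal, which a table filled at startup does not allow. `make_square_table` and `ConstSquareTable` are illustrative names.

```cpp
#include <array>
#include <cstdint>

using Bitboard = std::uint64_t;

// Build the table during compilation instead of at program startup.
constexpr std::array<Bitboard, 64> make_square_table() {
    std::array<Bitboard, 64> t{};
    for (int s = 0; s < 64; ++s)
        t[s] = Bitboard{1} << s;
    return t;
}

inline constexpr std::array<Bitboard, 64> ConstSquareTable = make_square_table();

// With a constant index the value is known at compile time,
// so no load has to happen at run time.
static_assert(ConstSquareTable[0] == 1);
static_assert(ConstSquareTable[27] == 0x8000000ULL);
```

With a run-time index the lookup is still a load, so this only helps in the "some cases" where the square is known during compilation.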
As per https://uops.info/table.html, variable-length shifts on Intel CPUs are slower than fixed-length ones (though it's a bit more complicated with the BMI shift variants). On AMD CPUs they are the same.
There is a case where the shift can be slower: if there are no free registers available. For the shift, you need two input registers, one for the left operand (i.e. the 1) and one for the shift amount. If no register is free, you need to save a register to the stack ("spill") and later load it from the stack again ("reload"). So in the worst case, you end up with an additional spill and reload compared to just a load (which only needs one register on x86-64). That said, I still expect the code to be faster with your changes.
Slower than a fixed-length shift sure, but that’s not the alternative here? The original involves a load from a memory offset, which may or may not be in cache (and may or may not end up pushing something else useful out of cache). That seems like it would almost certainly be slower in virtually every case, I’d expect? Maybe on 32-bit platforms where a shift on a 64-bit operand requires multiple instructions? |
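A rough sketch of why the 32-bit case is different: without native 64-bit registers, `1 << s` has to be assembled from two 32-bit halves. The code below only illustrates that extra work; `Pair32`, `square_bb_32` and `join` are hypothetical names, and real compilers may emit a different (for example branchless) instruction sequence.

```cpp
#include <cstdint>

// A 64-bit value represented as two 32-bit words, as on a 32-bit target.
struct Pair32 {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Emulate (1 << s) for s in 0..63 using only 32-bit shifts:
// the set bit lands in the low word for s < 32, otherwise in the high word.
constexpr Pair32 square_bb_32(unsigned s) {
    return s < 32 ? Pair32{std::uint32_t{1} << s, 0u}
                  : Pair32{0u, std::uint32_t{1} << (s - 32)};
}

// Helper used only to check the result against a native 64-bit shift.
constexpr std::uint64_t join(Pair32 p) {
    return (std::uint64_t{p.hi} << 32) | p.lo;
}
```

The comparison and the two separate shifts stand in for the multiple instructions such a platform would need, whereas the table lookup there is two plain loads.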
I'm just pointing out that the results will differ, depending on whether it's tested on an AMD or Intel CPU. Overall I also think it's better to remove this lookup table. |
Aha, a register spill is definitely a case where it could be slower. |
https://tests.stockfishchess.org/tests/view/63c5cfe618c20f4929c5fe46