-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Remove popcnt16 array and use standard C++ bitset instead #2476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Remove the 65KB "popcnt16" array.
Replace popcnt16 with bitset<64>().count()
|
at this point, we could just remove the popcount intrinsics and use std::bitset<64>(bar).count() everywhere, and cleanup the Makefile in this respect |
I think bitset<64> is slower than intrinsics if they're available(overhead dependent on C++ header version). This just removes the array. I've tried it before with |
|
As soon as the right architecture is specified (e.g. -msse4 or -mpopcnt) the instruction is used. |
|
Could makefile then include -march=native -mtune=native instead of -msse3 ? (I don't want to break makefile or automated/Travis instrumentation tests, so i won't change it within this PR) |
|
I guess the problem with -march=native is that the binaries can't that easily be distributed. It makes sense for those who compile from sources on the machine they want to use the binary, but otherwise having minimum requirement explicit is probably better. |
|
A bit of background: |
|
we might just run on 32-bit machines... why would we try to be optimal on such hardware? |
It removes 65KB array and its init in bitboard.cpp,(which can't be isolated due |
|
@mstembera how does bitset<64>().count() compares in benchmarks on your hardware vs the popcnt16 array? |
|
Added a test to check bitset<64> again, here: |
|
@vondele I agree but have never been successful at convincing Marco of that. |
|
Literally the only performance hit because of popcnt16 is in bitboard init and it can be easily fixed with this(not allowed due "conditional compilation") - the array itself is relatively harmless. |
|
Hmm, i think this is equivalent and technically not "conditional compilation" https://github.com/Chess13234/Stockfish/tree/Isolate_popcnt16_init |
|
I already wrote something like this on fishcooking; in C++ you can largely avoid conditional compilation by specialization of templates, constructs like these (unchecked code): template <bool> struct PopCnt {};
template <> struct PopCnt<true> {
int operator () (Bitboard b) const { return __builtin_popcountll(b); }
};
template <> struct PopCnt<false> {
PopCnt() {
for (unsigned i = 0; i < (1 << 16); ++i)
PopCnt16[i] = std::bitset<16>(i).count();
}
int operator () (Bitboard b) const {
union { Bitboard bb; uint16_t u[4]; } v = { b };
return PopCnt16[v.u[0]] + PopCnt16[v.u[1]] + PopCnt16[v.u[2]] + PopCnt16[v.u[3]];
}
uint8_t PopCnt16[1 << 16];
};
// Only creates the array if there is no hardware popcount support:
PopCnt<HasPopCnt> popcount; // HasPopCnt is defined in types.h
// Usage:
auto v = popcount(bb); |
|
@gvreuls Was it rejected? |
|
@Chess13234 I never tried it in a PR, feel free to do so. |
added a test here for non-regression(fixed): |
|
Perf test(of http://tests.stockfishchess.org/tests/view/5e14e1dd61fe5f83a67dd8c3 ) indicates it uses less instructions in bench, but time use is higher? sudo perf stat -r 5 -a -B -e cycles:u,instructions:u /tmp/test/Tmaster bench > /dev/null sudo perf stat -r 5 -a -B -e cycles:u,instructions:u /tmp/test/Tpopcount bench > /dev/null Performance counter stats for 'system wide' (5 runs): |
|
@Chess13234 This should produce the same machine instructions as the old version, the real test would be to compile it without POPCNT support and check if it produces the same bench. This could fall victim to the static initialization order fiasco (so the array wouldn't get initialized before main() starts and the bench would change or the binary would crash), in that case it's better to remove the constructor and do the initialization in Bitboards:init() like before, guarded with an if(HasPopCnt). |
|
@gvreuls yep, there is error with tbprobe, trying to fix it. |
|
@Chess13234 just give them both an init() method and cal that in Bitboards::init() template <bool> struct PopCnt {};
template <> struct PopCnt<true> {
void init() {}
int operator () (Bitboard b) const { return __builtin_popcountll(b); }
};
template <> struct PopCnt<false> {
void init() {
for (unsigned i = 0; i < (1 << 16); ++i)
PopCnt16[i] = std::bitset<16>(i).count();
}
int operator () (Bitboard b) const {
union { Bitboard bb; uint16_t u[4]; } v = { b };
return PopCnt16[v.u[0]] + PopCnt16[v.u[1]] + PopCnt16[v.u[2]] + PopCnt16[v.u[3]];
}
uint8_t PopCnt16[1 << 16];
};In Bitboards::init() you just call popcount.init(); |
|
I've fixed by placing it in types.h(using extern), it compiles and "stockfish eval" run ok. |
|
Closed in favor of much cleaner idea: #2481 |
This patch removes the ancient array for speed counting popcount for some compilers without popcount intrinsincs: such compilers are not supported anymore due bitboard.h dependence on lsb/msb intrinsics from MSVC/GCC/Intel C+++ compilers:
Line 369 of bitboard.h
#else // Compiler is neither GCC nor MSVC compatible
#error "Compiler not supported."
#endif
#620 (Original change was to support 32-bit machines)
No functional change.