Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented Jan 4, 2020

This patch removes the ancient array for speed counting popcount for some compilers without popcount intrinsincs: such compilers are not supported anymore due bitboard.h dependence on lsb/msb intrinsics from MSVC/GCC/Intel C+++ compilers:
Line 369 of bitboard.h
#else // Compiler is neither GCC nor MSVC compatible

#error "Compiler not supported."

#endif
#620 (Original change was to support 32-bit machines)
No functional change.

Chess13234 added 2 commits January 4, 2020 22:13
Remove the 65KB "popcnt16" array.
Replace popcnt16 with bitset<64>().count()
@vondele
Copy link
Member

vondele commented Jan 4, 2020

at this point, we could just remove the popcount intrinsics and use std::bitset<64>(bar).count() everywhere, and cleanup the Makefile in this respect

@ghost
Copy link
Author

ghost commented Jan 4, 2020

at this point, we could just remove the popcount intrinsics and use std::bitset<64>(bar).count() everywhere, and cleanup the Makefile in this respect

I think bitset<64> is slower than intrinsics if they're available(overhead dependent on C++ header version). This just removes the array.

I've tried it before with
http://tests.stockfishchess.org/tests/view/5de0e75d0294ec4750cba965 (-0.94ELO after 75K games)

@vondele
Copy link
Member

vondele commented Jan 4, 2020

https://godbolt.org/z/Tin5wz

As soon as the right architecture is specified (e.g. -msse4 or -mpopcnt) the instruction is used.

@ghost
Copy link
Author

ghost commented Jan 4, 2020

Could makefile then include -march=native -mtune=native instead of -msse3 ? (I don't want to break makefile or automated/Travis instrumentation tests, so i won't change it within this PR)

@vondele
Copy link
Member

vondele commented Jan 4, 2020

I guess the problem with -march=native is that the binaries can't that easily be distributed. It makes sense for those who compile from sources on the machine they want to use the binary, but otherwise having minimum requirement explicit is probably better.

@mstembera
Copy link
Contributor

A bit of background:
#2006 (comment)
IMO we can remove PopCnt16 the day we remove 32-bit support.

@vondele
Copy link
Member

vondele commented Jan 4, 2020

we might just run on 32-bit machines... why would we try to be optimal on such hardware?

@ghost
Copy link
Author

ghost commented Jan 4, 2020

A bit of background:
#2006 (comment)
IMO we can remove PopCnt16 the day we remove 32-bit support.

It removes 65KB array and its init in bitboard.cpp,(which can't be isolated due
maintainers refusing #ifdefs around it(its considered "conditional compilation" and therefore very bad )
#2385

@ghost
Copy link
Author

ghost commented Jan 4, 2020

@mstembera how does bitset<64>().count() compares in benchmarks on your hardware vs the popcnt16 array?

@ghost
Copy link
Author

ghost commented Jan 4, 2020

Added a test to check bitset<64> again, here:
http://tests.stockfishchess.org/tests/view/5e111870fc94a2bc2383c1f1
-0.59ELO vs intrinsic.People have old compilers i think.
(on my machine GCC 8 bitset header is of equal speed to intrinsic).

@mstembera
Copy link
Contributor

mstembera commented Jan 5, 2020

@vondele I agree but have never been successful at convincing Marco of that.
SquareBB[] didn't get removed either only because of 32-bit.
#1960 (comment)

@ghost
Copy link
Author

ghost commented Jan 5, 2020

Literally the only performance hit because of popcnt16 is in bitboard init and it can be easily fixed with this(not allowed due "conditional compilation") - the array itself is relatively harmless.
For even the newest CPUs with BMI2,POPCNT and even AVX2 Stockfish still computes 65536 popcounts at each load, assuming it might be useful for something.

bitboard.cpp 
#ifndef USE_POPCNT //line 72
  for (unsigned i = 0; i < (1 << 16); ++i)
      PopCnt16[i] = std::bitset<16>(i).count();
#endif

@ghost
Copy link
Author

ghost commented Jan 5, 2020

Hmm, i think this is equivalent and technically not "conditional compilation" https://github.com/Chess13234/Stockfish/tree/Isolate_popcnt16_init

if(!HasPopCnt){
  for (unsigned i = 0; i < (1 << 16); ++i)
      PopCnt16[i] = std::bitset<16>(i).count();
}

@ghost ghost mentioned this pull request Jan 5, 2020
@gvreuls
Copy link
Contributor

gvreuls commented Jan 7, 2020

I already wrote something like this on fishcooking; in C++ you can largely avoid conditional compilation by specialization of templates, constructs like these (unchecked code):

template <bool> struct PopCnt {};

template <> struct PopCnt<true> {
    int operator () (Bitboard b) const { return __builtin_popcountll(b); }
};

template <> struct PopCnt<false> {
    PopCnt() {
        for (unsigned i = 0; i < (1 << 16); ++i)
            PopCnt16[i] = std::bitset<16>(i).count();
    }

    int operator () (Bitboard b) const {
        union { Bitboard bb; uint16_t u[4]; } v = { b };
        return PopCnt16[v.u[0]] + PopCnt16[v.u[1]] + PopCnt16[v.u[2]] + PopCnt16[v.u[3]];
    }

    uint8_t PopCnt16[1 << 16];
};

// Only creates the array if there is no hardware popcount support:
PopCnt<HasPopCnt> popcount;    // HasPopCnt is defined in types.h

// Usage:
auto v = popcount(bb);

@ghost
Copy link
Author

ghost commented Jan 7, 2020

@gvreuls Was it rejected?

@gvreuls
Copy link
Contributor

gvreuls commented Jan 7, 2020

@Chess13234 I never tried it in a PR, feel free to do so.

@ghost
Copy link
Author

ghost commented Jan 7, 2020

@Chess13234 I never tried it in a PR, feel free to do so.

added a test here for non-regression(fixed):
http://tests.stockfishchess.org/tests/view/5e14eb1d61fe5f83a67dd8c8

@ghost
Copy link
Author

ghost commented Jan 7, 2020

Perf test(of http://tests.stockfishchess.org/tests/view/5e14e1dd61fe5f83a67dd8c3 ) indicates it uses less instructions in bench, but time use is higher?

sudo perf stat -r 5 -a -B -e cycles:u,instructions:u /tmp/test/Tmaster bench > /dev/null
Performance counter stats for 'system wide' (5 runs):

 9,150,617,884      cycles:u                                                      ( +-  1.25% )
15,022,118,108      instructions:u            #    1.64  insn per cycle           ( +-  0.66% )

       2.92557 +- 0.00530 seconds time elapsed  ( +-  0.18% )

sudo perf stat -r 5 -a -B -e cycles:u,instructions:u /tmp/test/Tpopcount bench > /dev/null

Performance counter stats for 'system wide' (5 runs):
8,996,061,352 cycles:u ( +- 0.32% )
14,867,271,530 instructions:u # 1.65 insn per cycle ( +- 0.18% )

       2.92741 +- 0.00257 seconds time elapsed  ( +-  0.09% )

@gvreuls
Copy link
Contributor

gvreuls commented Jan 7, 2020

@Chess13234 This should produce the same machine instructions as the old version, the real test would be to compile it without POPCNT support and check if it produces the same bench. This could fall victim to the static initialization order fiasco (so the array wouldn't get initialized before main() starts and the bench would change or the binary would crash), in that case it's better to remove the constructor and do the initialization in Bitboards:init() like before, guarded with an if(HasPopCnt).

@ghost
Copy link
Author

ghost commented Jan 7, 2020

@gvreuls yep, there is error with tbprobe, trying to fix it.

@gvreuls
Copy link
Contributor

gvreuls commented Jan 7, 2020

@Chess13234 just give them both an init() method and cal that in Bitboards::init()

template <bool> struct PopCnt {};

template <> struct PopCnt<true> {
    void init() {}
    int operator () (Bitboard b) const { return __builtin_popcountll(b); }
};

template <> struct PopCnt<false> {
    void init() {
        for (unsigned i = 0; i < (1 << 16); ++i)
            PopCnt16[i] = std::bitset<16>(i).count();
    }

    int operator () (Bitboard b) const {
        union { Bitboard bb; uint16_t u[4]; } v = { b };
        return PopCnt16[v.u[0]] + PopCnt16[v.u[1]] + PopCnt16[v.u[2]] + PopCnt16[v.u[3]];
    }

    uint8_t PopCnt16[1 << 16];
};

In Bitboards::init() you just call popcount.init();

@ghost
Copy link
Author

ghost commented Jan 7, 2020

I've fixed by placing it in types.h(using extern), it compiles and "stockfish eval" run ok.

@ghost ghost mentioned this pull request Jan 7, 2020
@ghost
Copy link
Author

ghost commented Jan 7, 2020

@gvreuls it becomes too convoluted very fast to create this code if its has its own init called from other places. The simple version doesn't pass debug checks:
#2480

@ghost ghost closed this Jan 7, 2020
@ghost ghost reopened this Jan 7, 2020
@ghost
Copy link
Author

ghost commented Jan 7, 2020

Closed in favor of much cleaner idea: #2481

@ghost ghost closed this Jan 7, 2020
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants