Skip to content

Conversation

@Sopel97
Copy link
Collaborator

@Sopel97 Sopel97 commented Sep 9, 2020

This PR introduces a new string option for gensfen named sfen_format. Valid values are bin (default) and binpack. This value determines the format to use for serializing the generated sfens. Output file extension is now chosen and added based on the chosen format, output_file_name should contain just the sfen. @noobpwnftw asked about this feature, so now there's at least a branch that implements it.

For learn the reader type is chosen on single file granularity based on its extension. .bin files are read directly, .binpack files are read entry by entry and converted to bin format before further processing.

nnue_data_binpack_format.h is a header-only library that exposes some functions related to plain, bin, and binpack formats. In particular it allows creating readers and writers for the binpack format - this is utilized in gensfen and learn. It internally defines a type with the same layout as PackedSfenValue because it is meant to be used without dependencies. This can be unified in the future, along with simplification of other aspects in the library.
Everything in the library is behind binpack namespace. Most of the chess primitives used in it are behind binpack::chess namespace. I wouldn't dare trying to port it to stockfish's representation.

It also exposes currently unused conversion functions:
convertPlainToBin, convertBinToPlain, convertBinpackToBin, convertBinToBinpack, convertBinpackToPlain, convertPlainToBinpack that take paths to input and output files and perform the conversion. This covers all possible conversion between plain, bin, and binpack. In the future we can create an interface to allow users to access these through CLI.

…ack". It determines the output format of the sfens. Binpack is a highly compressed formats for consecutive sfens. Extension is now determined by the used format, output_file_name should contain just the stem.
…tect file extension and choose the correct reader (bin or binpack)
@nodchip
Copy link
Owner

nodchip commented Sep 9, 2020

Thank you for the pull request. I guess that one usage of this format is to distribute training data between net developers. In that case, we could need to support shuffle at some timing.

Could you please take a look at CI? https://travis-ci.org/github/nodchip/Stockfish/builds/725750282

@Sopel97
Copy link
Collaborator Author

Sopel97 commented Sep 9, 2020

Yes, I'm slowly trying to get CI working. May have to come back to it tomorrow as it's already late for me.

I was talking with noob on discord and he says that shuffling over large distances is not needed to ensure enough randomness (because two games differ on average the same regardless of how far they are). Only shuffling within a batch that's loaded in memory is enough. Persisting the shuffled output is also not needed. Even if some people would still want to shuffle I think .bin for that is the best we can do.

@nodchip
Copy link
Owner

nodchip commented Sep 9, 2020

If my memory is correct, someone in Discrod said that we can gain more slos with learn shuffle. Are there experiment results for it?

@Sopel97
Copy link
Collaborator Author

Sopel97 commented Sep 9, 2020

It was only a test between completely no shuffle and full shuffle. I don't think anyone did tests on partial shuffles over small blocks.

@vondele
Copy link
Collaborator

vondele commented Sep 10, 2020

The following compiles on linux, but I didn't check functionality:

diff --git a/src/extra/nnue_data_binpack_format.h b/src/extra/nnue_data_binpack_format.h
index c86a55c2a..e6cd7ad20 100644
--- a/src/extra/nnue_data_binpack_format.h
+++ b/src/extra/nnue_data_binpack_format.h
@@ -41,8 +41,9 @@ THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 #include <array>
 #include <immintrin.h>
 
-#ifdef linux
+#ifdef __linux__
 #include <x86intrin.h>
+#include <climits>
 #else
 #include <intrin.h>
 #endif
@@ -2700,7 +2701,7 @@ namespace chess
         return Bitboard::square(sq0) | sq1;
     }
 
-    [[nodiscard]] constexpr Bitboard operator""_bb(std::uint64_t bits)
+    [[nodiscard]] constexpr Bitboard operator""_bb(unsigned long long bits)
     {
         return Bitboard::fromBits(bits);
     }
@@ -7295,4 +7296,4 @@ namespace binpack
             std::cout << "Processed " << (cur - base) << " bytes and " << numProcessedPositions << " positions.\n";
         }
     }
-}
\ No newline at end of file
+}

@vondele
Copy link
Collaborator

vondele commented Sep 10, 2020

some comments:

  • I think __linux__ might be the wrong define for this (this might be specific to gcc). Can one, instead of #include <intrin.h> be using the same approach as nnue_common.h.
  • unsigned long long bits there is only a small list of parameters that are allowed in this case, std::uint64_tis not part of this. https://en.cppreference.com/w/cpp/language/user_literal

@Sopel97
Copy link
Collaborator Author

Sopel97 commented Sep 10, 2020

Okay, checks passed now

@vondele
Copy link
Collaborator

vondele commented Sep 10, 2020

nice... btw, I see you make this optional. If it works well, I would consider making this the default & only option. While there is already some data generated, there will be even more generated in the future, and the size of these files is an obstacle to sharing them and having fewer options might be a path to making progress quicker.

@nodchip nodchip merged commit 59402d4 into nodchip:master Sep 10, 2020
@nodchip
Copy link
Owner

nodchip commented Sep 10, 2020

Thank you for sending and fixing the pull request. I merged it.

@Sopel97 Sopel97 deleted the binpack branch November 14, 2020 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants