Conversation

@Sopel97
Member

@Sopel97 Sopel97 commented May 18, 2021

https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d
TC: 10s+0.1s, 1 thread
ELO: 21.74 +-3.4 (95%) LOS: 100.0%
Total: 10000 W: 1559 L: 934 D: 7507
Ptnml(0-2): 38, 701, 2972, 1176, 113

https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b
TC: 60s+0.6s, 1 thread
ELO: 5.85 +-1.7 (95%) LOS: 100.0%
Total: 20000 W: 1381 L: 1044 D: 17575
Ptnml(0-2): 27, 885, 7864, 1172, 52 

https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806
TC: 20s+0.2s, 8 threads
LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 34272 W: 1610 L: 1452 D: 31210
Ptnml(0-2): 30, 1285, 14350, 1439, 32

https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72
TC: 60s+0.6s, 8 threads
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 45544 W: 1262 L: 1214 D: 43068
Ptnml(0-2): 12, 1129, 20442, 1177, 12

This network was trained by @vondele using this trainer (the trainer master branch will be updated in the near future), using a combination of data: d8, d9, d10, fishtest_d9.

This network also contains a few architectural changes with respect to the current master:

  • Size changed from 256x2-32-32-1 to 512x2-16-32-1
    • ~15-20% slower
    • ~2x larger
    • adds a special path for 16-valued ClippedReLU
    • fixes the affine transform code for 16 inputs/outputs by using InputDimensions instead of PaddedInputDimensions
      • this is safe now because the inputs are processed in groups of 4 in the current affine transform code
  • The feature set changed from HalfKP to HalfKAv2
    • Includes information about the kings, like HalfKA
    • Packs king features better, resulting in an 8% size reduction compared to HalfKA
  • The board is flipped for the black perspective, instead of rotated as in the current master
  • PSQT values for each feature
    • the feature transformer now outputs a part that is forwarded directly to the output, which allows learning piece values more directly than the previous network architecture. The effect is visible in highly imbalanced positions, where the current master network outputs evaluations skewed towards zero.
    • 8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4
    • initialized to classical material values at the start of the training
  • 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4
    • only one subnetwork is evaluated for any given position, so there is no or only marginal speed loss

Additionally, we observed a lot of high weights in most nets from the pytorch trainer, which cause the slow affine transform path to significantly reduce the performance of the net, so that path was removed for the duration of testing. A revert can be attempted for up to 3% speedup, depending on how many large weights there are.

Diagram of the new architecture:
HalfKAv2-45056-512x2P8x2 -16-32-1 x8

@vondele
Member

vondele commented May 18, 2021

appveyor fails because the appveyor script picks up the wrong reference number from the commit message. It just needs the Bench on a different line.

@locutus2
Member

@Sopel97 @vondele
Thanks for inventing and training this new architecture, and especially for the nice diagram and explanation!


featureTransformer->transform(pos, transformedFeatures);
const auto output = network->propagate(transformedFeatures, buffer);
const std::size_t bucket = (popcount(pos.pieces()) - 1) / 4;
Contributor

Nit: pos.count<ALL_PIECES>() is probably faster on most machines than popcount(pos.pieces()).

@snicolet
Member

snicolet commented May 18, 2021

I'm fine with merging this pull request to change the net architecture. Congrats :-)

And if I find that I have a simpler/alternative scaling scheme for STC or LTC, we can just resume the normal testing procedure on fishtest.

Concerning the merge, I think it would perhaps be best to keep two commits for the pull request:

  1. first commit with all of Tomasz's changes, up to "Use adds instead of add", adding support for the new architecture, and using the net nn-8a08400ed089.nnue directly
  2. second commit with all the tweaks by Joost to get the Elo, namely the changes in search.cpp, the tuned scaling formula in evaluate.cpp and the removal of random eval.

@Fanael
Contributor

Fanael commented May 18, 2021

Great job!

@snicolet
Member

The easiest way to create these two commits is probably to use the diff .. trick to get the global diff from master to the PR version:
https://github.com/vondele/Stockfish/compare/61e1c66b7c..3dbba284ab

Then, in that diff, for the first commit use all changes except those in evaluate.cpp and search.cpp.
For the second commit, use all the remaining changes in evaluate.cpp and search.cpp.

@vondele
Member

vondele commented May 18, 2021

@snicolet I would prefer to have a single commit, as only the combination of these things has been tested on fishtest. However, the changes you refer to are 8b8eb2b and 9f03668 and so can be found easily via this PR.

@snicolet
Member

OK
bye :-)

vondele pushed a commit to vondele/Stockfish that referenced this pull request May 18, 2021
Introduces a new NNUE network architecture and associated network parameters,
as obtained by a new pytorch trainer.

The network is already very strong at short TC, without regression at longer TC,
and has potential for further improvements.

https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d
TC: 10s+0.1s, 1 thread
ELO: 21.74 +-3.4 (95%) LOS: 100.0%
Total: 10000 W: 1559 L: 934 D: 7507
Ptnml(0-2): 38, 701, 2972, 1176, 113

https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b
TC: 60s+0.6s, 1 thread
ELO: 5.85 +-1.7 (95%) LOS: 100.0%
Total: 20000 W: 1381 L: 1044 D: 17575
Ptnml(0-2): 27, 885, 7864, 1172, 52

https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806
TC: 20s+0.2s, 8 threads
LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 34272 W: 1610 L: 1452 D: 31210
Ptnml(0-2): 30, 1285, 14350, 1439, 32

https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72
TC: 60s+0.6s, 8 threads
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 45544 W: 1262 L: 1214 D: 43068
Ptnml(0-2): 12, 1129, 20442, 1177, 12

The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott),
specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56.
The data used comprise 64 billion positions (193GB total), generated and scored with the current master net:
d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing
d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing
fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing

This network also contains a few architectural changes with respect to the current master:

    Size changed from 256x2-32-32-1 to 512x2-16-32-1
        ~15-20% slower
        ~2x larger
        adds a special path for 16-valued ClippedReLU
        fixes the affine transform code for 16 inputs/outputs by using InputDimensions instead of PaddedInputDimensions
            this is safe now because the inputs are processed in groups of 4 in the current affine transform code
    The feature set changed from HalfKP to HalfKAv2
        Includes information about the kings, like HalfKA
        Packs king features better, resulting in an 8% size reduction compared to HalfKA
    The board is flipped for the black perspective, instead of rotated as in the current master
    PSQT values for each feature
        the feature transformer now outputs a part that is forwarded directly to the output, which allows learning piece values more directly than the previous network architecture. The effect is visible in highly imbalanced positions, where the current master network outputs evaluations skewed towards zero.
        8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4
        initialized to classical material values at the start of the training
    8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4
        only one subnetwork is evaluated for any given position, so there is no or only marginal speed loss

A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png

closes official-stockfish#3474

Bench: 3806488
@vondele vondele closed this in e8d64af May 18, 2021
@snicolet snicolet added the to be merged Will be merged shortly label Jun 2, 2021