Commit a169c78

Sopel97 authored and vondele committed
Improve performance on NUMA systems
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:

```
system - gathers NUMA node information from the system (lscpu or windows api), binds each thread to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
  // ':'-separated numa nodes
  // ','-separated cpu indices
  // supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
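The custom `NumaPolicy` syntax described above could be parsed along the following lines. This is a minimal sketch, not the code from `numa.h`: the function name `parse_numa_policy` and the error handling are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not Stockfish's NumaConfig::from_string).
// Turns e.g. "0-15,32-47:16-31,48-63" into one set of CPU indices per NUMA node.
std::vector<std::set<std::size_t>> parse_numa_policy(const std::string& s) {
    std::vector<std::set<std::size_t>> nodes;
    std::istringstream                 nodeStream(s);
    std::string                        nodeStr;

    while (std::getline(nodeStream, nodeStr, ':'))  // ':' separates NUMA nodes
    {
        std::set<std::size_t> cpus;
        std::istringstream    cpuStream(nodeStr);
        std::string           item;

        while (std::getline(cpuStream, item, ','))  // ',' separates indices or ranges
        {
            if (item.empty())
                throw std::invalid_argument("empty cpu entry");

            // "first-last" denotes an inclusive range; a bare number is a range of one.
            const auto        dash  = item.find('-');
            const std::size_t first = std::stoul(item.substr(0, dash));
            const std::size_t last =
              dash == std::string::npos ? first : std::stoul(item.substr(dash + 1));

            if (last < first)
                throw std::invalid_argument("descending cpu range");

            for (std::size_t c = first; c <= last; ++c)
                cpus.insert(c);
        }
        nodes.push_back(std::move(cpus));
    }
    return nodes;
}
```

With the example string from the option description, this yields two nodes of 32 CPU indices each.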
Special care is taken that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.

On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.

MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on linux.

Linux is supported, **without** libnuma requirement. `lscpu` is expected.
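The splitting of NUMA nodes at processor-group boundaries mentioned above can be illustrated with a small sketch. It is not the code from this patch: the helper name `split_by_processor_group` is hypothetical, and it assumes that a CPU's group is simply `cpu / groupSize` with 64 logical processors per group, whereas real Windows systems may lay out groups differently.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

using CpuSet = std::set<std::size_t>;

// Hypothetical helper: splits each node so that no resulting node spans more
// than one processor group. ASSUMPTION: group index == cpu / groupSize.
std::vector<CpuSet> split_by_processor_group(const std::vector<CpuSet>& nodes,
                                             std::size_t                groupSize = 64) {
    std::vector<CpuSet> result;

    for (const CpuSet& node : nodes)
    {
        // Bucket the node's CPUs by group; std::map keeps groups in ascending order.
        std::map<std::size_t, CpuSet> byGroup;
        for (std::size_t cpu : node)
            byGroup[cpu / groupSize].insert(cpu);

        // Every non-empty bucket becomes a node of its own.
        for (auto& entry : byGroup)
            result.push_back(std::move(entry.second));
    }
    return result;
}
```

A node covering CPUs 0-95 would be split into 0-63 and 64-95 under this assumption, while a node that already fits inside one group is left unchanged.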
-----------------

Passed 60+1 @ 256t 16000MB hash:
https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC:
https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes #5253
closes #5285

No functional change
1 parent b0287dc commit a169c78

19 files changed: +1418 -289 lines changed

.github/ci/libcxx17.imp

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@
   { include: [ "<__fwd/sstream.h>", private, "<iosfwd>", public ] },
   { include: [ "<__fwd/streambuf.h>", private, "<iosfwd>", public ] },
   { include: [ "<__fwd/string_view.h>", private, "<string_view>", public ] },
+  { include: [ "<__system_error/errc.h>", private, "<system_error>", public ] },

   # Mappings for includes between public headers
   { include: [ "<ios>", public, "<iostream>", public ] },

src/Makefile

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ HEADERS = benchmark.h bitboard.h evaluate.h misc.h movegen.h movepick.h \
 		nnue/layers/sqr_clipped_relu.h nnue/nnue_accumulator.h nnue/nnue_architecture.h \
 		nnue/nnue_common.h nnue/nnue_feature_transformer.h position.h \
 		search.h syzygy/tbprobe.h thread.h thread_win32_osx.h timeman.h \
-		tt.h tune.h types.h uci.h ucioption.h perft.h nnue/network.h engine.h score.h
+		tt.h tune.h types.h uci.h ucioption.h perft.h nnue/network.h engine.h score.h numa.h

 OBJS = $(notdir $(SRCS:.cpp=.o))


src/engine.cpp

Lines changed: 70 additions & 18 deletions
@@ -18,15 +18,15 @@

 #include "engine.h"

+#include <cassert>
 #include <deque>
+#include <iosfwd>
 #include <memory>
 #include <ostream>
+#include <sstream>
 #include <string_view>
 #include <utility>
 #include <vector>
-#include <sstream>
-#include <iosfwd>
-#include <cassert>

 #include "evaluate.h"
 #include "misc.h"
@@ -48,10 +48,14 @@ constexpr auto StartFEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -

 Engine::Engine(std::string path) :
     binaryDirectory(CommandLine::get_binary_directory(path)),
+    numaContext(NumaConfig::from_system()),
     states(new std::deque<StateInfo>(1)),
-    networks(NN::Networks(
-      NN::NetworkBig({EvalFileDefaultNameBig, "None", ""}, NN::EmbeddedNNUEType::BIG),
-      NN::NetworkSmall({EvalFileDefaultNameSmall, "None", ""}, NN::EmbeddedNNUEType::SMALL))) {
+    threads(),
+    networks(
+      numaContext,
+      NN::Networks(
+        NN::NetworkBig({EvalFileDefaultNameBig, "None", ""}, NN::EmbeddedNNUEType::BIG),
+        NN::NetworkSmall({EvalFileDefaultNameSmall, "None", ""}, NN::EmbeddedNNUEType::SMALL))) {
     pos.set(StartFEN, false, &states->back());
     capSq = SQ_NONE;
 }
@@ -74,7 +78,7 @@ void Engine::stop() { threads.stop = true; }
 void Engine::search_clear() {
     wait_for_search_finished();

-    tt.clear(options["Threads"]);
+    tt.clear(threads);
     threads.clear();

     // @TODO wont work with multiple instances
@@ -124,40 +128,71 @@ void Engine::set_position(const std::string& fen, const std::vector<std::string>

 // modifiers

-void Engine::resize_threads() { threads.set({options, threads, tt, networks}, updateContext); }
+void Engine::set_numa_config_from_option(const std::string& o) {
+    if (o == "auto" || o == "system")
+    {
+        numaContext.set_numa_config(NumaConfig::from_system());
+    }
+    else if (o == "none")
+    {
+        numaContext.set_numa_config(NumaConfig{});
+    }
+    else
+    {
+        numaContext.set_numa_config(NumaConfig::from_string(o));
+    }
+
+    // Force reallocation of threads in case affinities need to change.
+    resize_threads();
+}
+
+void Engine::resize_threads() {
+    threads.wait_for_search_finished();
+    threads.set(numaContext.get_numa_config(), {options, threads, tt, networks}, updateContext);
+
+    // Reallocate the hash with the new threadpool size
+    set_tt_size(options["Hash"]);
+}

 void Engine::set_tt_size(size_t mb) {
     wait_for_search_finished();
-    tt.resize(mb, options["Threads"]);
+    tt.resize(mb, threads);
 }

 void Engine::set_ponderhit(bool b) { threads.main_manager()->ponder = b; }

 // network related

 void Engine::verify_networks() const {
-    networks.big.verify(options["EvalFile"]);
-    networks.small.verify(options["EvalFileSmall"]);
+    networks->big.verify(options["EvalFile"]);
+    networks->small.verify(options["EvalFileSmall"]);
 }

 void Engine::load_networks() {
-    load_big_network(options["EvalFile"]);
-    load_small_network(options["EvalFileSmall"]);
+    networks.modify_and_replicate([this](NN::Networks& networks_) {
+        networks_.big.load(binaryDirectory, options["EvalFile"]);
+        networks_.small.load(binaryDirectory, options["EvalFileSmall"]);
+    });
+    threads.clear();
 }

 void Engine::load_big_network(const std::string& file) {
-    networks.big.load(binaryDirectory, file);
+    networks.modify_and_replicate(
+      [this, &file](NN::Networks& networks_) { networks_.big.load(binaryDirectory, file); });
     threads.clear();
 }

 void Engine::load_small_network(const std::string& file) {
-    networks.small.load(binaryDirectory, file);
+    networks.modify_and_replicate(
+      [this, &file](NN::Networks& networks_) { networks_.small.load(binaryDirectory, file); });
     threads.clear();
 }

 void Engine::save_network(const std::pair<std::optional<std::string>, std::string> files[2]) {
-    networks.big.save(files[0].first);
-    networks.small.save(files[1].first);
+    networks.modify_and_replicate([&files](NN::Networks& networks_) {
+        networks_.big.save(files[0].first);
+        networks_.small.save(files[1].first);
+    });
 }

 // utility functions
@@ -169,7 +204,7 @@ void Engine::trace_eval() const {

     verify_networks();

-    sync_cout << "\n" << Eval::trace(p, networks) << sync_endl;
+    sync_cout << "\n" << Eval::trace(p, *networks) << sync_endl;
 }

 OptionsMap& Engine::get_options() { return options; }
@@ -184,4 +219,21 @@ std::string Engine::visualize() const {
     return ss.str();
 }

+std::vector<std::pair<size_t, size_t>> Engine::get_bound_thread_count_by_numa_node() const {
+    auto counts = threads.get_bound_thread_count_by_numa_node();
+    const NumaConfig& cfg = numaContext.get_numa_config();
+    std::vector<std::pair<size_t, size_t>> ratios;
+    NumaIndex n = 0;
+    for (; n < counts.size(); ++n)
+        ratios.emplace_back(counts[n], cfg.num_cpus_in_numa_node(n));
+    if (!counts.empty())
+        for (; n < cfg.num_numa_nodes(); ++n)
+            ratios.emplace_back(0, cfg.num_cpus_in_numa_node(n));
+    return ratios;
+}
+
+std::string Engine::get_numa_config_as_string() const {
+    return numaContext.get_numa_config().to_string();
+}
+
 }
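
`Engine::get_bound_thread_count_by_numa_node()` above reports one `(bound threads, CPUs in node)` pair per NUMA node. The sketch below shows how such pairs could arise when threads are handed out by fill percentage, as the commit message describes. It is an illustration under assumptions, not the patch's distribution code: the function name `distribute_threads` and the tie-breaking rule (lowest node index wins) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: assigns numThreads threads to nodes so that fill
// percentages stay as even as possible. Returns (threads, cpus) per node.
// ASSUMPTION: every node has at least one CPU.
std::vector<std::pair<std::size_t, std::size_t>>
distribute_threads(const std::vector<std::size_t>& cpusPerNode, std::size_t numThreads) {
    std::vector<std::pair<std::size_t, std::size_t>> ratios;
    for (std::size_t cpus : cpusPerNode)
    {
        assert(cpus > 0);
        ratios.emplace_back(0, cpus);
    }
    if (ratios.empty())
        return ratios;

    for (std::size_t t = 0; t < numThreads; ++t)
    {
        // Pick the node whose fill (threads + 1) / cpus would be lowest after
        // receiving this thread. Fractions are compared by cross-multiplication
        // to stay in integer arithmetic; ties go to the lower node index.
        std::size_t best = 0;
        for (std::size_t n = 1; n < ratios.size(); ++n)
        {
            const std::size_t lhs = (ratios[n].first + 1) * ratios[best].second;
            const std::size_t rhs = (ratios[best].first + 1) * ratios[n].second;
            if (lhs < rhs)
                best = n;
        }
        ++ratios[best].first;
    }
    return ratios;
}
```

For a 16-CPU node and an 8-CPU node, six threads end up as 4 and 2 under this rule, so both nodes are 25% full.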

src/engine.h

Lines changed: 22 additions & 9 deletions
@@ -35,6 +35,7 @@
 #include "thread.h"
 #include "tt.h"
 #include "ucioption.h"
+#include "numa.h"

 namespace Stockfish {

@@ -47,6 +48,13 @@ class Engine {
     using InfoIter = Search::InfoIteration;

     Engine(std::string path = "");
+
+    // Can't be movable due to components holding backreferences to fields
+    Engine(const Engine&) = delete;
+    Engine(Engine&&) = delete;
+    Engine& operator=(const Engine&) = delete;
+    Engine& operator=(Engine&&) = delete;
+
     ~Engine() { wait_for_search_finished(); }

     std::uint64_t perft(const std::string& fen, Depth depth, bool isChess960);
@@ -63,6 +71,7 @@ class Engine {

     // modifiers

+    void set_numa_config_from_option(const std::string& o);
     void resize_threads();
     void set_tt_size(size_t mb);
     void set_ponderhit(bool);
@@ -83,23 +92,27 @@ class Engine {

     // utility functions

-    void        trace_eval() const;
-    OptionsMap& get_options();
-    std::string fen() const;
-    void        flip();
-    std::string visualize() const;
+    void                                   trace_eval() const;
+    OptionsMap&                            get_options();
+    std::string                            fen() const;
+    void                                   flip();
+    std::string                            visualize() const;
+    std::vector<std::pair<size_t, size_t>> get_bound_thread_count_by_numa_node() const;
+    std::string                            get_numa_config_as_string() const;

    private:
     const std::string binaryDirectory;

+    NumaReplicationContext numaContext;
+
     Position     pos;
     StateListPtr states;
     Square       capSq;

-    OptionsMap           options;
-    ThreadPool           threads;
-    TranspositionTable   tt;
-    Eval::NNUE::Networks networks;
+    OptionsMap                           options;
+    ThreadPool                           threads;
+    TranspositionTable                   tt;
+    NumaReplicated<Eval::NNUE::Networks> networks;

     Search::SearchManager::UpdateContext updateContext;
 };
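
The header change above wraps the networks in `NumaReplicated<...>`, and `engine.cpp` then reads them through `->` / `*` and mutates them through `modify_and_replicate`. The following is a simplified sketch of that access pattern only. The class `Replicated`, its constructor taking a node count, and `on_node` are hypothetical; the real class in `numa.h` is tied to `NumaReplicationContext` and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the replication idea: one copy of T
// per node, with all copies rebuilt from the primary after every modification.
template<typename T>
class Replicated {
   public:
    Replicated(std::size_t numNodes, T source) {
        assert(numNodes > 0);
        replicas.push_back(std::make_unique<T>(std::move(source)));
        replicate(numNodes);
    }

    // Default access reads the primary copy, like `networks->big` in the diff.
    const T* operator->() const { return replicas[0].get(); }
    const T& operator*() const { return *replicas[0]; }

    // Per-node access; a bound thread would read the copy of its own node.
    const T&    on_node(std::size_t n) const { return *replicas[n]; }
    std::size_t num_replicas() const { return replicas.size(); }

    // Mutate the primary copy, then rebuild every other copy from it.
    template<typename F>
    void modify_and_replicate(F&& f) {
        const std::size_t numNodes = replicas.size();
        f(*replicas[0]);
        replicate(numNodes);
    }

   private:
    void replicate(std::size_t numNodes) {
        replicas.resize(1);  // keep only the primary copy
        for (std::size_t n = 1; n < numNodes; ++n)
            replicas.push_back(std::make_unique<T>(*replicas[0]));
    }

    std::vector<std::unique_ptr<T>> replicas;
};
```

Funnelling every mutation through a single function keeps the copies from drifting apart, which would explain why even `save_network` goes through `modify_and_replicate` in the diff.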
