Skip to content

Commit 613dc66

Browse files
committed
Careful SMP locking - Fix very occasional hangs
Louis Zulli reported that Stockfish suffers from very occasional hangs with his 20 cores machine. Careful SMP debugging revealed that this was caused by "a ghost split point slave", where thread was marked as a split point slave, but wasn't actually working on it. The only logical explanation for this was double booking, where due to SMP race, the same thread is booked for two different split points simultaneously. Due to very intermittent nature of the problem, we can't say exactly how this happens. The current handling of Thread specific variables is risky though. Volatile variables are in some cases changed without spinlock being hold. In this case standard doesn't give us any kind of guarantees about how the updated values are propagated to other threads. We resolve the situation by enforcing very strict locking rules: - Values for key thread variables (splitPointsSize, activeSplitPoint, searching) can only be changed when the thread specific spinlock is held. - Structural changes for splitPoints[] are only allowed when the thread specific spinlock is held. - Thread booking decisions (per split point) can only be done when the thread specific spinlock is held. With these changes hangs didn't occur anymore during 2 days torture testing on Zulli's machine. We probably have a slight performance penalty in SMP mode due to more locking. STC (7 threads): ELO: -1.00 +-2.2 (95%) LOS: 18.4% Total: 30000 W: 4538 L: 4624 D: 20838 However stability is worth more than 1-2 ELO points in this case. No functional change Resolves #422
1 parent 3e2591d commit 613dc66

File tree

2 files changed

+8
-2
lines changed

2 files changed

+8
-2
lines changed

src/search.cpp

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1622,6 +1622,7 @@ void Thread::idle_loop() {
16221622
else
16231623
assert(false);
16241624

1625+
spinlock.acquire();
16251626
assert(searching);
16261627

16271628
searching = false;
@@ -1633,6 +1634,7 @@ void Thread::idle_loop() {
16331634
// After releasing the lock we can't access any SplitPoint related data
16341635
// in a safe way because it could have been released under our feet by
16351636
// the sp master.
1637+
spinlock.release();
16361638
sp->spinlock.release();
16371639

16381640
// Try to late join to another split point if none of its slaves has

src/thread.cpp

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -145,6 +145,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
145145
SplitPoint& sp = splitPoints[splitPointsSize];
146146

147147
sp.spinlock.acquire(); // No contention here until we don't increment splitPointsSize
148+
spinlock.acquire();
148149

149150
sp.master = this;
150151
sp.parentSplitPoint = activeSplitPoint;
@@ -167,6 +168,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
167168
++splitPointsSize;
168169
activeSplitPoint = &sp;
169170
activePosition = nullptr;
171+
spinlock.release();
170172

171173
// Try to allocate available threads
172174
Thread* slave;
@@ -194,6 +196,9 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
194196

195197
Thread::idle_loop(); // Force a call to base class idle_loop()
196198

199+
sp.spinlock.acquire();
200+
spinlock.acquire();
201+
197202
// In the helpful master concept, a master can help only a sub-tree of its
198203
// split point and because everything is finished here, it's not possible
199204
// for the master to be booked.
@@ -205,15 +210,14 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
205210
// We have returned from the idle loop, which means that all threads are
206211
// finished. Note that decreasing splitPointsSize must be done under lock
207212
// protection to avoid a race with Thread::can_join().
208-
sp.spinlock.acquire();
209-
210213
--splitPointsSize;
211214
activeSplitPoint = sp.parentSplitPoint;
212215
activePosition = &pos;
213216
pos.set_nodes_searched(pos.nodes_searched() + sp.nodes);
214217
*bestMove = sp.bestMove;
215218
*bestValue = sp.bestValue;
216219

220+
spinlock.release();
217221
sp.spinlock.release();
218222
}
219223

0 commit comments

Comments
 (0)