Description
Describe the issue
We observe lower-than-expected nodes per second (nps) on multi-socket and/or NUMA systems (e.g. local, TCEC, potentially CCC). The likely cause is increased memory bandwidth contention due to the larger networks and accumulator caches.
Expected behavior
nps/performance that follows more closely the nps seen on single-CPU / single-NUMA-domain chips. Potential speedups could be 2x or larger.
Steps to reproduce
Reproducing/testing requires access to a multi-socket system, and either looking at historical data or running a synthetic benchmark that compares the speed of multiple instances of SF with fewer threads (each bound to a NUMA domain) against one instance of SF with correspondingly more threads. An example is the following set of historical data (total nodes searched per run):
sha date pinned default
dcb02337844d71e56df57b9a8ba17646f953711c 2024-05-15T16:27:03+02:00 8273351050 3266542305
49ef4c935a5cb0e4d94096e6354caa06b36b3e3c 2024-04-24T18:38:20+02:00 8548621496 3491153172
0716b845fdef8a20102b07eaec074b8da8162523 2024-04-02T08:49:48+02:00 8059206362 4248207804
bd579ab5d1a931a09a62f2ed33b5149ada7bc65f 2024-03-07T19:53:48+01:00 9424434014 5654516665
e67cc979fd2c0e66dfc2b2f2daa0117458cfc462 2024-02-24T18:15:04+01:00 9470247485 5864264415
8e75548f2a10969c1c9211056999efbcebe63f9a 2024-02-17T17:11:46+01:00 9411936135 6000562457
6deb88728fb141e853243c2873ad0cda4dd19320 2024-01-08T18:34:36+01:00 9346121150 5796929238
f12035c88c58a5fd568d26cde9868f73a8d7b839 2023-12-30T11:08:03+01:00 9454744857 6531509540
afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9 2023-09-29T22:30:27+02:00 9235658137 7375676116
70ba9de85cddc5460b1ec53e0a99bee271e26ece 2023-09-22T19:26:16+02:00 9284309149 7583913771
3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f 2023-09-11T22:37:39+02:00 10860286083 10026471309
e699fee513ce26b3794ac43d08826c89106e10ea 2023-07-06T23:03:58+02:00 10399561315 9485800488
915532181f11812c80ef0b57bc018de4ea2155ec 2023-07-01T13:34:30+02:00 10172761439 8869020884
Here, default is a single instance of SF running 256 threads without pinning, while the pinned setup uses 8 instances of SF, each pinned to a suitable NUMA domain. A year ago the performance difference was only about 10%, whereas now the pinned setup searches about 2.5x as many nodes as the default one.
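To make the trend in the table easier to read, the pinned/default ratio can be computed per commit. This is a small sketch, not part of the original measurement script; the function name `pinned_ratio` is made up for illustration, and it assumes the four whitespace-separated columns shown above (sha, date, pinned, default).

```shell
# Print the pinned/default node ratio for each data row read from stdin.
# Rows whose 3rd and 4th columns are not plain integers (e.g. the header)
# are skipped.
pinned_ratio() {
    awk 'NF == 4 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 > 0 {
        # abbreviated sha, date without the time, ratio with two decimals
        printf "%s %s %.2f\n", substr($1, 1, 8), substr($2, 1, 10), $3 / $4
    }'
}

# Usage: feed table rows on stdin (newest and oldest row shown here).
pinned_ratio << 'EOF'
sha date pinned default
dcb02337844d71e56df57b9a8ba17646f953711c 2024-05-15T16:27:03+02:00 8273351050 3266542305
915532181f11812c80ef0b57bc018de4ea2155ec 2023-07-01T13:34:30+02:00 10172761439 8869020884
EOF
```

A ratio near 1 means pinning makes little difference; the larger the ratio, the more nps the unpinned single instance is losing.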
The historical data above was generated with:
for version in dcb02337844d71e56df57b9a8ba17646f953711c 49ef4c935a5cb0e4d94096e6354caa06b36b3e3c 0716b845fdef8a20102b07eaec074b8da8162523 bd579ab5d1a931a09a62f2ed33b5149ada7bc65f e67cc979fd2c0e66dfc2b2f2daa0117458cfc462 8e75548f2a10969c1c9211056999efbcebe63f9a 6deb88728fb141e853243c2873ad0cda4dd19320 f12035c88c58a5fd568d26cde9868f73a8d7b839 afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9 70ba9de85cddc5460b1ec53e0a99bee271e26ece 3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f e699fee513ce26b3794ac43d08826c89106e10ea 915532181f11812c80ef0b57bc018de4ea2155ec ef94f77f8c827a2395f1c40f53311a3b1f20bc5b a49b3ba7ed5d9be9151c8ceb5eed40efe3387c75 932f5a2d657c846c282adcf2051faef7ca17ae15 373359b44d0947cce2628a9a8c9b432a458615a8 c1fff71650e2f8bf5a2d63bdc043161cdfe8e460 41f50b2c83a0ba36a2b9c507c1783e57c9b13485 68e1e9b3811e16cad014b590d7443b9063b3eb52 758f9c9350abee36a5865ec701560db8ea62004d e6e324eb28fd49c1fc44b3b65784f85a773ec61c 7262fd5d14810b7b495b5038e348a448fda1bcc3 773dff020968f7a6f590cfd53e8fd89f12e15e36 3597f1942ec6f2cfbd50b905683739b0900ff5dd c306d838697011da0a960758dde3f7ede6849060 c3483fa9a7d7c0ffa9fcc32b467ca844cfb63790
do
git checkout $version >& checkout.log.$version
make -j ARCH=x86-64-avx2 profile-build >& build.log.$version
mv stockfish stockfish.$version
for split in 1 2 4 8 16 32
do
threads=$((256/split))
hash=$((128000/split))
cat << EOF > inp_$split
setoption name Threads value $threads
setoption name Hash value $hash
go movetime 100000
ucinewgame
quit
EOF
done
# no binding
split=1
for instance in `seq 1 $split`
do
cat inp_$split | ./stockfish.$version > out_nobind_${split}_${instance} &
done
wait
total_nodes_nobind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_nobind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_nobind=$((total_nodes_nobind + nodes))
done
# binding
split=8
threads=$((256/split))
for instance in `seq 1 $split`
do
# this cpu list must match the numa domains ... depends on the system
tasksetlow=$(((instance-1)*threads/2))
tasksethigh=$(((instance-1)*threads/2 + threads/2 - 1))
cat inp_$split | taskset --cpu-list $tasksetlow-$tasksethigh,$((128+tasksetlow))-$((128+tasksethigh)) ./stockfish.$version > out_bind_${split}_${instance} &
done
wait
total_nodes_bind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_bind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_bind=$((total_nodes_bind + nodes))
done
epoch=`git show --pretty=fuller --date=iso-strict $version | grep 'CommitDate' | awk '{print $NF}'`
printf "%64s %32s %12d %12d\n" $version $epoch $total_nodes_bind $total_nodes_nobind
done
On the same system, the following performance is observed depending on the splitting/pinning strategy:
split bind no bind
1 3387763787 3412555326
2 6249669886 4568189260
4 8345110549 3516812166
8 8283342858 3523839484
16 8013377138 2302587649
32 7888404092 3094224313
On a different system with 4 sockets the observation is:
split bind no bind
1 9204147529 9082536561
2 15654018181 10057157190
4 20958931771 8864636565
8 20290433821 4744173824
16 19448457275 3913568825
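The benchmark script hard-codes the CPU ranges passed to `taskset`, and its comment notes that they must match the NUMA domains of the machine. On Linux, the kernel exposes each node's CPUs under `/sys/devices/system/node/nodeN/cpulist`, so the ranges could be read instead of computed. The following is a sketch under that assumption; the function names are hypothetical, and `pinned_commands` only prints the command lines rather than launching anything.

```shell
# Print "nodeN <cpulist>" for every NUMA node found under the sysfs root.
# The root is a parameter (default: the real sysfs path) so the function
# can be exercised against a copy or a mock directory.
numa_cpulists() {
    sysroot=${1:-/sys/devices/system/node}
    for node in "$sysroot"/node[0-9]*; do
        # skips the unexpanded glob as well when no node directories exist
        [ -r "$node/cpulist" ] || continue
        printf '%s %s\n' "${node##*/}" "$(cat "$node/cpulist")"
    done
}

# Dry run: print one taskset command line per node instead of executing it.
pinned_commands() {
    numa_cpulists "$1" | while read -r node cpus; do
        printf 'taskset --cpu-list %s ./stockfish  # %s\n' "$cpus" "$node"
    done
}

pinned_commands
```

Whether the kernel's node boundaries are the best pinning granularity depends on the system (BIOS settings such as nodes-per-socket change what is reported), so the output should be checked against the intended split.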
Anything else?
Tentatively, a solution could be to introduce thread affinity and replicate the network weights across NUMA domains. A potential interface could be along these lines:
#specify cpu masks for two numa domains
setoption name affinityMasks value 0xFF00,0x00FF
#roundrobin allocate threads to these domains, with a net allocated for each numa domain
setoption name Threads value 256
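To illustrate what round-robin allocation over the masks would mean, the sketch below maps thread indices to masks. It only models the proposal: `affinityMasks` is the tentative option name from above, not an existing Stockfish option, and `assign_round_robin` is a hypothetical helper.

```shell
# Map thread indices to affinity masks in round-robin order.
# $1 = comma-separated masks (as in the proposed affinityMasks value)
# $2 = number of threads
assign_round_robin() {
    masks=$1
    threads=$2
    echo "$masks" | awk -F, -v n="$threads" '{
        # thread t goes to domain t mod (number of masks)
        for (t = 0; t < n; t++)
            printf "thread %d -> mask %s (domain %d)\n", t, $(t % NF + 1), t % NF
    }'
}

assign_round_robin 0xFF00,0x00FF 4
```

With two masks, even-numbered threads land on the first domain and odd-numbered threads on the second; under the proposal, each domain would also hold its own copy of the network weights.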
Some more discussion is here: https://discord.com/channels/435943710472011776/813919248455827515/1240709279049191525
There is also a rebased version of code that goes in that direction (but needs further work):
https://github.com/official-stockfish/Stockfish/compare/master...Disservin:Stockfish:numareplicatedweights?expand=1
Operating system
All
Stockfish version
master