
Suboptimal speed on multi-socket / numa systems #5253

@vondele

Description


Describe the issue

We observe slower-than-expected nodes per second (nps) on multi-socket and/or NUMA systems (e.g. local, TCEC, potentially CCC). The likely reason is increased memory bandwidth contention due to the larger networks and accumulator caches.
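One way to sanity-check the bandwidth/cache-contention hypothesis is to watch generic perf cache counters while a long search is running; a minimal sketch (event names and availability vary by CPU and kernel):

# system-wide last-level-cache loads and misses over 10 seconds while SF is searching;
# a miss rate that grows with the thread count is consistent with memory contention
perf stat -e LLC-loads,LLC-load-misses -a sleep 10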

Expected behavior

nps / performance that more closely follows the nps of single-CPU / single-NUMA-domain systems. Potential speedups could be 2x and larger.

Steps to reproduce

Reproducing/testing needs access to a multi-socket system, and either looking at historical data or running a synthetic benchmark in which the speed of multiple instances of SF with fewer threads (each bound to a NUMA domain) is compared to one instance of SF with correspondingly more threads. An example is the following set of historic data:

                                                             sha                             date       pinned      default
                        dcb02337844d71e56df57b9a8ba17646f953711c        2024-05-15T16:27:03+02:00   8273351050   3266542305
                        49ef4c935a5cb0e4d94096e6354caa06b36b3e3c        2024-04-24T18:38:20+02:00   8548621496   3491153172
                        0716b845fdef8a20102b07eaec074b8da8162523        2024-04-02T08:49:48+02:00   8059206362   4248207804
                        bd579ab5d1a931a09a62f2ed33b5149ada7bc65f        2024-03-07T19:53:48+01:00   9424434014   5654516665
                        e67cc979fd2c0e66dfc2b2f2daa0117458cfc462        2024-02-24T18:15:04+01:00   9470247485   5864264415
                        8e75548f2a10969c1c9211056999efbcebe63f9a        2024-02-17T17:11:46+01:00   9411936135   6000562457
                        6deb88728fb141e853243c2873ad0cda4dd19320        2024-01-08T18:34:36+01:00   9346121150   5796929238
                        f12035c88c58a5fd568d26cde9868f73a8d7b839        2023-12-30T11:08:03+01:00   9454744857   6531509540
                        afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9        2023-09-29T22:30:27+02:00   9235658137   7375676116
                        70ba9de85cddc5460b1ec53e0a99bee271e26ece        2023-09-22T19:26:16+02:00   9284309149   7583913771
                        3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f        2023-09-11T22:37:39+02:00  10860286083  10026471309
                        e699fee513ce26b3794ac43d08826c89106e10ea        2023-07-06T23:03:58+02:00  10399561315   9485800488
                        915532181f11812c80ef0b57bc018de4ea2155ec        2023-07-01T13:34:30+02:00  10172761439   8869020884

Here, default is a single instance of SF with 256 threads and no pinning, while the pinned setup uses 8 instances of SF, each pinned to a suitable NUMA domain. A year ago the performance difference was just about 10%, whereas now the difference is about 200%.
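As a side note, the per-domain pinning can also be done with numactl instead of taskset; a minimal sketch of the pinned setup, assuming numactl is installed, 8 NUMA nodes, and the inp_8 input file produced by the script below:

# one SF instance per NUMA node, with both cpus and memory bound to that node
for node in $(seq 0 7)
do
numactl --cpunodebind=$node --membind=$node ./stockfish < inp_8 > out_bind_8_$node &
done
wait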

Data generated with:

for version in  dcb02337844d71e56df57b9a8ba17646f953711c 49ef4c935a5cb0e4d94096e6354caa06b36b3e3c 0716b845fdef8a20102b07eaec074b8da8162523 bd579ab5d1a931a09a62f2ed33b5149ada7bc65f e67cc979fd2c0e66dfc2b2f2daa0117458cfc462 8e75548f2a10969c1c9211056999efbcebe63f9a 6deb88728fb141e853243c2873ad0cda4dd19320 f12035c88c58a5fd568d26cde9868f73a8d7b839 afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9 70ba9de85cddc5460b1ec53e0a99bee271e26ece 3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f e699fee513ce26b3794ac43d08826c89106e10ea 915532181f11812c80ef0b57bc018de4ea2155ec ef94f77f8c827a2395f1c40f53311a3b1f20bc5b a49b3ba7ed5d9be9151c8ceb5eed40efe3387c75 932f5a2d657c846c282adcf2051faef7ca17ae15 373359b44d0947cce2628a9a8c9b432a458615a8 c1fff71650e2f8bf5a2d63bdc043161cdfe8e460 41f50b2c83a0ba36a2b9c507c1783e57c9b13485 68e1e9b3811e16cad014b590d7443b9063b3eb52 758f9c9350abee36a5865ec701560db8ea62004d e6e324eb28fd49c1fc44b3b65784f85a773ec61c 7262fd5d14810b7b495b5038e348a448fda1bcc3 773dff020968f7a6f590cfd53e8fd89f12e15e36 3597f1942ec6f2cfbd50b905683739b0900ff5dd c306d838697011da0a960758dde3f7ede6849060 c3483fa9a7d7c0ffa9fcc32b467ca844cfb63790
do

git checkout $version >& checkout.log.$version
make -j ARCH=x86-64-avx2 profile-build  >& build.log.$version
mv stockfish stockfish.$version

for split in 1 2 4 8 16 32
do

threads=$((256/split))
hash=$((128000/split))

cat << EOF > inp_$split
setoption name Threads value $threads
setoption name Hash value $hash
go movetime 100000
ucinewgame
quit
EOF

done

# no binding
split=1
for instance in `seq 1 $split`
do
cat inp_$split | ./stockfish.$version > out_nobind_${split}_${instance} &
done
wait

total_nodes_nobind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_nobind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_nobind=$((total_nodes_nobind + nodes))
done


# binding
split=8
threads=$((256/split))
for instance in `seq 1 $split`
do
# this cpu list must match the numa domains ... depends on the system
tasksetlow=$(((instance-1)*threads/2))
tasksethigh=$(((instance-1)*threads/2 + threads/2 - 1))
cat inp_$split | taskset --cpu-list $tasksetlow-$tasksethigh,$((128+tasksetlow))-$((128+tasksethigh)) ./stockfish.$version > out_bind_${split}_${instance} &
done
wait

total_nodes_bind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_bind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_bind=$((total_nodes_bind + nodes))
done

# commit date of this revision
epoch=`git show --pretty=fuller --date=iso-strict $version | grep 'CommitDate' | awk '{print $NF}'`

printf "%64s %32s %12d %12d\n" $version $epoch $total_nodes_bind $total_nodes_nobind

done

On the same system, the following performance is observed depending on the splitting/pinning strategy:

 split         bind      no bind
     1   3387763787   3412555326
     2   6249669886   4568189260
     4   8345110549   3516812166
     8   8283342858   3523839484
    16   8013377138   2302587649
    32   7888404092   3094224313

On a different system with 4 sockets the observation is:

 split         bind      no bind
     1   9204147529   9082536561
     2  15654018181  10057157190
     4  20958931771   8864636565
     8  20290433821   4744173824
    16  19448457275   3913568825

Anything else?

Tentatively, a solution could be to introduce thread affinity and replicate the network weights across NUMA domains. A potential interface could be along the lines of:

# specify cpu masks for two numa domains
setoption name affinityMasks value 0xFF00,0x00FF
# round-robin allocate threads to these domains, with a net allocated for each numa domain
setoption name Threads value 256

Some more discussion is here: https://discord.com/channels/435943710472011776/813919248455827515/1240709279049191525
There is also a rebased version of code that goes in that direction (but needs further work):
https://github.com/official-stockfish/Stockfish/compare/master...Disservin:Stockfish:numareplicatedweights?expand=1
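As an implementation detail, Linux already exposes the per-node cpu masks in sysfs, so neither the user nor the engine would have to compute them by hand; a small sketch of how the values for the (not yet existing) affinityMasks option could be obtained:

# print, for each NUMA node, the hex cpu mask (cpumap) and the human-readable cpu list
for n in /sys/devices/system/node/node[0-9]*
do
printf "%s mask=%s cpus=%s\n" "$(basename $n)" "$(cat $n/cpumap)" "$(cat $n/cpulist)"
done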

Operating system

All

Stockfish version

master
