Skip to content

Conversation

@noobpwnftw
Copy link
Contributor

@noobpwnftw noobpwnftw commented Nov 12, 2021

Old code only worked because there was probably a limit on how many cores/threads can reside within one NUMA node, and the OS creates extra NUMA nodes when necessary, however the actual mechanism of core binding is done by "Processor Groups"(https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups). With a newer OS, one NUMA node can have many such "Processor Groups" and we should just consistently use the number of groups to bind the threads instead of deriving the topology from the number of NUMA nodes.

No functional change.

@noobpwnftw
Copy link
Contributor Author

I can confirm locally that this is required to spread threads on all cores on Windows 11 with a 3990X CPU. It has only 1 NUMA node with 2 groups of 64 threads each.

Other OS/CPU combinations are not yet tested, hopefully nothing breaks.

@vondele
Copy link
Member

vondele commented Nov 13, 2021

thanks, maybe somebody with an older version of windows and a 64+ threads machine could give this a test as well.

@noobpwnftw
Copy link
Contributor Author

Behavior change of the API is documented here:
https://docs.microsoft.com/en-us/windows/win32/procthread/numa-support
Read "Behavior starting with Windows 10 Build 20348".

@vondele
Copy link
Member

vondele commented Nov 13, 2021

Creating "fake" nodes to accommodate a 1:1 mapping between groups and nodes 
has resulted in confusing behaviors where unexpected numbers of NUMA nodes 
are reported and so, starting with Windows 10 Build 20348, the OS has changed [...]

who would have guessed

@vondele vondele added the to be merged Will be merged shortly label Nov 15, 2021
@vondele
Copy link
Member

vondele commented Nov 15, 2021

There hasn't been any negative feedback, so I'll proceed with merging. We'll see if it causes problems with older windows OS on large core hardware.

@vondele vondele closed this in 9048ac0 Nov 15, 2021
@g3g6
Copy link

g3g6 commented Nov 20, 2021

I think that this code doesnt work right way. I have system 2xXeon E5 2697 v2 (24 cores, 48 treads, Windows 10 21H2, x64 ) and since this version, when i set number of cores to 24, then Stockfish use 1CPU to 100% and second CPU is idle. Previous versions works right way, 12 cores running for both CPU's.

@vondele
Copy link
Member

vondele commented Nov 20, 2021

Thanks for the feedback... I think we'll need to understand if this windows version is from before of after Windows 10 Build 20348 , i.e. before the change in Windows API concerning NUMA or not (see link above).

@noobpwnftw
Copy link
Contributor Author

I think that this code doesnt work right way. I have system 2xXeon E5 2697 v2 (24 cores, 48 treads, Windows 10 21H2, x64 ) and since this version, when i set number of cores to 24, then Stockfish use 1CPU to 100% and second CPU is idle. Previous versions works right way, 12 cores running for both CPU's.

As per 14.1 release(before the patch):

Stockfish/src/misc.cpp

Lines 551 to 555 in 7262fd5

// Run as many threads as possible on the same node until core limit is
// reached, then move on filling the next node.
for (int n = 0; n < nodes; n++)
for (int i = 0; i < cores / nodes; i++)
groups.push_back(n);

If in your system, "12 cores running for both CPU's" is the wrong behavior and now "use 1CPU to 100% and second CPU is idle" works as intended.

@Ipmanchess
Copy link

@g3g6 Can you run this bench with Stockfish 14.1 and also with last version to see the difference : http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php Thanks!

@g3g6
Copy link

g3g6 commented Nov 20, 2021

Thanks for the feedback... I think we'll need to understand if this windows version is from before of after Windows 10 Build 20348 , i.e. before the change in Windows API concerning NUMA or not (see link above).

Win 10 x64, 21H2, build 19044.1348

@g3g6
Copy link

g3g6 commented Nov 20, 2021

http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php

Benchmark
14.1

Total time (ms) : 180287
Nodes searched : 3299797749
Nodes/second : 18303026

the last dev

Total time (ms) : 281867
Nodes searched : 3195021430
Nodes/second : 11335209

@noobpwnftw
Copy link
Contributor Author

#3798 will restore old NUMA assignment behavior.

@noobpwnftw noobpwnftw deleted the fix_numa branch November 21, 2021 13:07
vondele pushed a commit to vondele/Stockfish that referenced this pull request Nov 22, 2021
revert official-stockfish@9048ac0 due to core spread problem and fix new OS compatibility with another method.

This code assumes that if one NUMA node has more than one processor groups, they are created equal(having equal amount of cores assigned to each of the groups), and also the total number of available cores contained in such groups are equal to the number of available cores within one NUMA node because of how best_node function works.

closes official-stockfish#3798
fixes official-stockfish#3787

No functional change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

to be merged Will be merged shortly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants