Fix processor group binding under Windows. #3787

noobpwnftw · 2021-11-12T22:39:28Z

Old code only worked because there was probably a limit on how many cores/threads can reside within one NUMA node, and the OS creates extra NUMA nodes when necessary, however the actual mechanism of core binding is done by "Processor Groups"(https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups). With a newer OS, one NUMA node can have many such "Processor Groups" and we should just consistently use the number of groups to bind the threads instead of deriving the topology from the number of NUMA nodes.

No functional change.

noobpwnftw · 2021-11-13T06:45:32Z

I can confirm locally that this is required to spread threads on all cores on Windows 11 with a 3990X CPU. It has only 1 NUMA node with 2 groups of 64 threads each.

Other OS/CPU combinations are not yet tested, hopefully nothing breaks.

vondele · 2021-11-13T08:31:27Z

thanks, maybe somebody with an older version of windows and a 64+ threads machine could give this a test as well.

noobpwnftw · 2021-11-13T20:58:18Z

Behavior change of the API is documented here:
https://docs.microsoft.com/en-us/windows/win32/procthread/numa-support
Read "Behavior starting with Windows 10 Build 20348".

vondele · 2021-11-13T21:32:14Z

Creating "fake" nodes to accommodate a 1:1 mapping between groups and nodes 
has resulted in confusing behaviors where unexpected numbers of NUMA nodes 
are reported and so, starting with Windows 10 Build 20348, the OS has changed [...]

who would have guessed

vondele · 2021-11-15T17:18:01Z

There hasn't been any negative feedback, so I'll proceed with merging. We'll see if it causes problems with older windows OS on large core hardware.

g3g6 · 2021-11-20T08:09:54Z

I think that this code doesnt work right way. I have system 2xXeon E5 2697 v2 (24 cores, 48 treads, Windows 10 21H2, x64 ) and since this version, when i set number of cores to 24, then Stockfish use 1CPU to 100% and second CPU is idle. Previous versions works right way, 12 cores running for both CPU's.

vondele · 2021-11-20T08:16:37Z

Thanks for the feedback... I think we'll need to understand if this windows version is from before of after Windows 10 Build 20348 , i.e. before the change in Windows API concerning NUMA or not (see link above).

noobpwnftw · 2021-11-20T08:18:37Z

I think that this code doesnt work right way. I have system 2xXeon E5 2697 v2 (24 cores, 48 treads, Windows 10 21H2, x64 ) and since this version, when i set number of cores to 24, then Stockfish use 1CPU to 100% and second CPU is idle. Previous versions works right way, 12 cores running for both CPU's.

As per 14.1 release(before the patch):

Stockfish/src/misc.cpp

Lines 551 to 555 in 7262fd5

    
           // Run as many threads as possible on the same node until core limit is 
        
           // reached, then move on filling the next node. 
        
           for (int n = 0; n < nodes; n++) 
        
               for (int i = 0; i < cores / nodes; i++) 
        
                   groups.push_back(n);

If in your system, "12 cores running for both CPU's" is the wrong behavior and now "use 1CPU to 100% and second CPU is idle" works as intended.

Ipmanchess · 2021-11-20T08:21:49Z

@g3g6 Can you run this bench with Stockfish 14.1 and also with last version to see the difference : http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php Thanks!

g3g6 · 2021-11-20T08:28:03Z

Thanks for the feedback... I think we'll need to understand if this windows version is from before of after Windows 10 Build 20348 , i.e. before the change in Windows API concerning NUMA or not (see link above).

Win 10 x64, 21H2, build 19044.1348

g3g6 · 2021-11-20T08:43:45Z

http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php

Benchmark
14.1

Total time (ms) : 180287
Nodes searched : 3299797749
Nodes/second : 18303026

the last dev

Total time (ms) : 281867
Nodes searched : 3195021430
Nodes/second : 11335209

noobpwnftw · 2021-11-20T10:02:54Z

#3798 will restore old NUMA assignment behavior.

revert official-stockfish@9048ac0 due to core spread problem and fix new OS compatibility with another method. This code assumes that if one NUMA node has more than one processor groups, they are created equal(having equal amount of cores assigned to each of the groups), and also the total number of available cores contained in such groups are equal to the number of available cores within one NUMA node because of how best_node function works. closes official-stockfish#3798 fixes official-stockfish#3787 No functional change.

Fix processor group binding under Windows.

204a4cc

No functional change.

vondele added the to be merged Will be merged shortly label Nov 15, 2021

vondele closed this in 9048ac0 Nov 15, 2021

noobpwnftw deleted the fix_numa branch November 21, 2021 13:07

Fix processor group binding under Windows. #3787

Fix processor group binding under Windows. #3787

Uh oh!

Conversation

noobpwnftw commented Nov 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

noobpwnftw commented Nov 13, 2021

Uh oh!

vondele commented Nov 13, 2021

Uh oh!

noobpwnftw commented Nov 13, 2021

Uh oh!

vondele commented Nov 13, 2021

Uh oh!

vondele commented Nov 15, 2021

Uh oh!

g3g6 commented Nov 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vondele commented Nov 20, 2021

Uh oh!

noobpwnftw commented Nov 20, 2021

Uh oh!

Ipmanchess commented Nov 20, 2021

Uh oh!

g3g6 commented Nov 20, 2021

Uh oh!

g3g6 commented Nov 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark 14.1

the last dev

Uh oh!

noobpwnftw commented Nov 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

noobpwnftw commented Nov 12, 2021 •

edited

Loading

g3g6 commented Nov 20, 2021 •

edited

Loading

g3g6 commented Nov 20, 2021 •

edited

Loading

Benchmark
14.1