Advise the kernel to use huge pages (Linux) #2463
Force-pushed from f715511 to d7bf423
For reference, about a 35% difference when toggling between the two settings. The system has hugepages by default, and master matches the MADV_HUGEPAGE performance (dual-socket Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz). So yes, on some systems this might be a significant gain.
Force-pushed from d7bf423 to 6f73b61
I guess one can expect /sys/kernel/mm/transparent_hugepage/enabled to be "always" on most new systems, but /sys/kernel/mm/transparent_hugepage/defrag is not necessarily so. This can make performance non-deterministic: you get huge pages only if there's not too much fragmentation in the page pool, and that is not necessarily true once the system has some uptime. About systems that don't have enabled=always / defrag=always:
This is on a Linux box, average of 3 runs; the box has 128 GB of RAM, with 20 GB dedicated to large pages at startup. Looks like a winner to me. Thank you @skiminki for submitting this patch.
For reference, this passes non-regression testing on fishtest (which is no surprise): no Elo gain there, probably because most machines there have huge pages enabled by default.
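As an aside, the two sysfs settings discussed above can be inspected programmatically. A minimal sketch (not part of the patch); the kernel marks the active mode in brackets:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (const char* path : { "/sys/kernel/mm/transparent_hugepage/enabled",
                              "/sys/kernel/mm/transparent_hugepage/defrag" })
    {
        std::ifstream file(path);
        std::string line;
        std::getline(file, line);                  // e.g. "always [madvise] never"
        std::cout << path << ": " << line << '\n';
    }
}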
Review comment on src/tt.cpp (outdated):
Would madvise(mem, allocSize, MADV_HUGEPAGE | MADV_RANDOM); be even better?
A reasonable consideration. However:
- The advice parameter is not actually a bitfield, so you can't combine advice values with bitwise OR.
- Since the TT hash is supposed to be always resident after the initial memset() that clears it, MADV_RANDOM doesn't actually do anything here to my knowledge. MADV_RANDOM/MADV_SEQUENTIAL are supposed to control prefetching (i.e., read-ahead) of swapped-out pages or pages in memory-mapped files. MADV_RANDOM might theoretically make sense for the Syzygy files to control resident memory bloat, but that needs a separate study.
The one advice I would have liked to use here is MADV_WILLNEED, to make pages resident without page faulting on the initial memset(). This should speed up hash init even more. Unfortunately, it doesn't seem to do anything for anonymous mappings (i.e., plain memory not backed by files).
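To illustrate the first point with a sketch (not from the patch): in the Linux asm-generic headers, MADV_RANDOM == 1 and MADV_HUGEPAGE == 14, so OR-ing them yields 15, which happens to be MADV_NOHUGEPAGE, the opposite of the intent. Each piece of advice needs its own call:

#include <sys/mman.h>
#include <cstddef>

static void advise_tt_memory(void* mem, std::size_t allocSize) {
    madvise(mem, allocSize, MADV_HUGEPAGE); // ask for transparent huge pages
    madvise(mem, allocSize, MADV_RANDOM);   // separate call: advice values are not flag bits
}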
I'm not a Windows user, but should we be considering large pages for Windows as well?
I'm not a Windows dev, either, but I would guess that a large-page alloc has an even bigger positive impact on Windows than on Linux. I'm not aware that Windows has best-effort large pages at all like Linux has (i.e., transparent huge pages), but I might be wrong, and it's worth a test at least, I suppose. The Windows instructions are here: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support . A bit more complicated than on Linux, but not too much. I also think that enabling large pages for OSX would make sense, too, if that can be done.
macOS does not support large/huge pages as far as I know. I believe Brainfish may have already added large pages to its SF fork. Should be relatively easy if someone could test.
I am on macOS, so I am obviously not an expert on these Windows/Linux-specific subjects like NUMA, huge pages, etc. (and I admit that platform-specific speed-ups are not my cup of tea; in the long run I trust algorithmic improvements more for Stockfish). However, I understand that performance is an important psychological issue for some users with top machines, so I will approve this pull request. Can you please write a helper function
Align the TT allocation by 2M to make it huge page friendly and advise the
kernel to use huge pages.
Benchmarks on my i7-8700K (6C/12T) box: (3 runs per bench per config)
                          vanilla (nps)                  hugepages (nps)         avg
==================================================================================
bench             |  3012490  3024364  3036331 |  3071052  3067544  3071052 | +1.5%
bench 16 12 20    | 19237932 19050166 19085315 | 19266346 19207025 19548758 | +1.1%
bench 16384 12 20 | 18182313 18371581 18336838 | 19381275 19738012 19620225 | +7.0%
On my box, huge pages have a significant perf impact when using a big
hash size. They also speed up TT initialization big time:
                                  vanilla (s)   huge pages (s)   speed-up
=======================================================================
time stockfish bench 16384 1 1 |     5.37            1.48          3.6x
In practice, huge pages with auto-defrag may always be enabled in the
system, in which case this patch essentially does nothing. But this
depends on the values in /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag, which are
distro-specific. Not all distros use always/always here, as these
settings are not automatic wins for all applications.
But in any case, it won't hurt to align the TT by the huge page size
and explicitly request huge pages, to ensure the use of huge pages when
they are available.
No functional change
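In outline, the approach the commit message describes could be sketched as below (a sketch assuming C++17's std::aligned_alloc; the actual patch differs in details such as error handling and fallbacks):

#include <cstdlib>
#include <sys/mman.h>

constexpr std::size_t HugePageSize = 2 * 1024 * 1024; // 2M huge pages on x86-64

void* tt_alloc(std::size_t bytes) {
    // Round up to whole huge pages: C11/C++17 aligned_alloc requires the
    // size to be a multiple of the alignment.
    std::size_t allocSize = (bytes + HugePageSize - 1) & ~(HugePageSize - 1);
    void* mem = std::aligned_alloc(HugePageSize, allocSize);
    if (mem)
        madvise(mem, allocSize, MADV_HUGEPAGE); // best-effort hint only
    return mem;
}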
Force-pushed from 6f73b61 to 79ba510
Err, that is a +54 line patch for a trivial change... Sorry, but the patch is getting a little bit too invasive for my taste; this is not what I had in mind. Unfortunately, you have force-pushed changes to the pull request branch, so I cannot access your first version anymore :-(
What about the version in the following branch? Does it give the same speed-up as your first version on your system (+7.0% for a 16-gigabyte hash table)? Obviously I haven't tested it myself, as I said :-)
@snicolet your version looks clean to me (some header and Makefile changes are still needed), but I would name the function large_page_alloc instead of alligned_malloc, since aligned_alloc is already a standard C++ function, and we're doing more than just aligning.
OK, so from the suggested name of the function I thought the aligned-allocation logic was to be contained in a misc.cpp function. What I did in my patch was just a simple refactoring of the already existing cache-line alignment logic, plus documentation (that's a third of the patch). The existing logic is not quite obvious unless you know the pattern. @snicolet's smaller patch (as well as my original) will now do double alignment and always allocate an extra 63 bytes, even when aligned_alloc() is available. The only downside is that an access slightly beyond the end of the TT allocation will never be caught by valgrind or a debug malloc() (e.g., off-by-one cases), but that's not a practical concern, I suppose. I can test the patch and add the header/Makefile changes later. But I would also suggest using a different name: aligned_alloc/alligned_alloc would suggest that the function can be used for general allocation of aligned memory, which it doesn't do. And there's also the C11 function name clash.
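For reference, the "pattern" in question is roughly the following over-allocate-and-round idiom (a sketch, not the exact master code):

#include <cstdint>
#include <cstdlib>

void* mem;   // raw pointer, kept so it can later be passed to free()
void* table; // aligned pointer actually used for the TT

void alloc_cache_line_aligned(std::size_t size, std::size_t alignment) {
    // Over-allocate by alignment-1 bytes, then round the pointer up to
    // the next boundary (alignment must be a power of two).
    mem = std::malloc(size + alignment - 1);
    table = reinterpret_cast<void*>(
        (reinterpret_cast<std::uintptr_t>(mem) + alignment - 1)
        & ~static_cast<std::uintptr_t>(alignment - 1));
}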
Oh, I thought one could still access the earlier versions by commit ids? At least they're visible on this page and the links work... I'm not that familiar with GitHub :/
Change Makefile, add header, rename function:
Looks OK and works on my box. Added some minor notes which you may or may not want to address.
@skiminki Thanks for the test and the review! My current plan is to commit huge page support for Linux before SF11, with full credit to you as the author of the patch of course, and I will use the explanations that you have written for this PR. The only reason I delay the commit is that someone suggested we should add large page support for Windows too if it is easy to do, and it is probably important for "marketing" (to avoid bad comments on Talkchess, for instance) to have this in SF11 too.
By the way, are there any risks with this approach? a) What happens if a user or a chess GUI launches Stockfish on a system with fragmented memory and asks for 16 gigabytes of hash table? Is the aligned_alloc() version more prone to fail, which would be a regression for Stockfish usability? b) I read on https://linux.die.net/man/3/aligned_alloc that the size should be a multiple of the alignment; I don't enforce this restriction at all in my version... c) Wouldn't it be generally safer to fall back on malloc() if aligned_alloc() fails?
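A sketch of how point (b) could be handled (hypothetical helper, not from the PR): round the request up to a multiple of the alignment before calling aligned_alloc():

#include <cstdlib>

void* aligned_alloc_rounded(std::size_t alignment, std::size_t size) {
    // C11 requires size to be a multiple of alignment; glibc is more
    // forgiving, but other C libraries may not be.
    std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
}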
@snicolet your questions, and the fact that the behavior might depend quite strongly on the environment, suggest to me that this is the kind of patch to commit early in the release cycle, to give it some testing on dev, not shortly before a release.
There should not be. But see below (item b).

(a) On Linux/glibc, aligned_alloc(alignment, size) internally allocates extra virtual space for aligned allocations and adjusts the returned pointer appropriately, just like TT::resize() does. On Linux, these extra allocated pages do not consume actual memory, since they are never accessed. Also, physical page pool fragmentation is not an issue here, since the alignment is on the virtual address, not the physical one.

The madvise() call is slightly trickier, since huge pages map large virtual pages to large physical pages. If there is physical page pool fragmentation, then depending on the system settings, Linux will try to defragment the physical page pool to make room for huge pages. But this is best-effort only: if defragmentation can't be done (locked pages), then whatever huge pages are available are used, and the rest falls back to small pages. The madvise() call with the huge pages advice is just a performance hint anyway, so it won't guarantee huge pages.

There is also a way to guarantee huge pages, but it is more intrusive: it requires the user to set up the huge page pools. I have patches for this in my master as well (since that's what I'm running for my analysis), but they're probably too intrusive here: https://github.com/skiminki/Stockfish/commits/master

On 32-bit systems, virtual address space fragmentation might become an issue. This basically means that the maximum available hash size would drop by the amount of alignment required.

(b) You're right. The C11 standard actually requires that size = n * alignment. I have missed that in my version, too. On Linux glibc the requirement is relaxed, but we should honor it anyway; I don't know what the case is with Android bionic, for instance. Oh, well. https://en.cppreference.com/w/c/memory/aligned_alloc

(c) That's perfectly safe, as far as I can tell. Good point, so why not do that?

About large pages on Windows: Fizbo at least has them, and the code is here: https://github.com/MoonstoneLight/Fizbo-Chess/blob/master/hash.cpp#L50 . So we'd need to use VirtualAlloc/VirtualFree and request a privilege token to do so. This was actually a hidden reason for my patch formulation here, so that large pages for Windows would be easy to add in a later patch. I would expect that Windows users always get a perf boost from requesting large pages, because Windows never does transparent best-effort large pages, AFAIK. But I'm not a Windows dev and I haven't tested this.

It would also be nice for OSX users to get large pages. I tried to google this a bit, but I only found some unanswered Stack Overflow questions.

One more point: TCEC used to have a rule that engines can't request large pages, for fairness, but this is no longer the case. I suspect the rule existed because large pages are first-come-first-served on Windows. But TCEC is now running on Linux, and the servers use aggressive transparent large page settings anyway (i.e., this patch wouldn't be needed there). https://wiki.chessdom.org/TCEC_Season_17_Further_information
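The "guaranteed huge pages" route mentioned above could look roughly like this (a sketch, not part of this PR; it assumes the admin has reserved pages in the hugetlb pool, e.g. via /proc/sys/vm/nr_hugepages):

#include <sys/mman.h>
#include <cstddef>

void* hugetlb_alloc(std::size_t size) {
    // The kernel rounds the length up to whole huge pages; the call fails
    // outright if the reserved pool cannot satisfy the request.
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return mem == MAP_FAILED ? nullptr : mem;
}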
Version with fallback on malloc() in case the aligned_alloc() call fails:
Verified locally. Still works for me. |
I don't understand the concern about [...]; because we're aligning on [...]. About @snicolet's concern: we do enforce the restriction that the requested size is a multiple of the alignment, because we allocate in units of megabytes and a megabyte is a multiple of 64 bytes.
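For example, a 1 MB request is 1024 * 1024 = 1048576 bytes, and 1048576 / 64 = 16384 exactly, so any whole-megabyte allocation is automatically a multiple of the 64-byte cache-line alignment.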
@gvreuls, for example aligned_alloc expects a size that is a multiple of the alignment:
I'm confused which branch we're talking about.
I'm trying to very politely make the point that that code isn't up to par, because it aligns memory twice (which is almost always a sign you're in trouble). You don't need to align huge pages; the system will hand you an aligned huge page when it has one. You should use:

void* large_page_alloc(size_t alignment, size_t size) {

    auto addr = aligned_alloc(alignment, size);

#ifdef USE_MADVISE_HUGEPAGE
    // Best-effort hint; the kernel may still hand back small pages.
    if (addr)
        madvise(addr, size, MADV_HUGEPAGE);
#endif

    return addr;
}

void TranspositionTable::resize(size_t mbSize) {

    // Make sure no search is reading the table while we reallocate it.
    Threads.main()->wait_for_search_finished();

    clusterCount = mbSize * 1024 * 1024 / sizeof(Cluster);

    // Pointers from aligned_alloc() can be released with plain free().
    free(table);
    table = (Cluster*)large_page_alloc(CacheLineSize, clusterCount * sizeof(Cluster));

    if (!table)
    {
        std::cerr << "Failed to allocate " << mbSize
                  << "MB for transposition table." << std::endl;
        exit(EXIT_FAILURE);
    }

    clear();
}
@gvreuls
@snicolet I will open a new PR if you think it's necessary, but I have no clue what the Windows
Yes, it would be interesting, if only for Linux, as it is so short there. Note that the master code for (small) alignment is quite old (see git blame on tt.cpp) and was written before the switch to C++11 (so aligned_alloc() was not an option at the time), but it is a part of the code with non-obvious traps; see for instance this commit by Marco:

For Windows, all the implementations I have seen in other open-source chess programs are ~100-200 lines long, which is obviously too intrusive.
My preference would have been to go with 79ba510, which avoids the double-alignment issues altogether and allows a clean(?) Windows-specific implementation later. But it is a bit more intrusive than snicolet's patch, I suppose.
I rewrote my original PR with code as simple as possible; this version introduces a new UCI option "Use Large Memory Pages" to control the usage of huge memory pages on Linux. It correctly aligns and sizes memory depending on the new UCI option, and should be easily extendable to use Windows large pages too. large_pages
@gvreuls I am not in favor of introducing a new UCI option, sorry. We currently have too many such options; we should seek to reduce their number rather than increase it. Here we have a case where we should be able to do the right thing transparently for the user.
Just FYI, the new TCEC machine runs Linux, so some form of this would be nice.
@mstembera The new TCEC Linux machine uses huge pages by default, so this patch isn't needed for TCEC, but it's very useful for Linux machines with normal (less wasteful) memory management. See: TCEC Season 17 Further information
@skiminki I'd like to work towards making this patch ready for merging, and there are a few more things I'd like to understand.
I'd suggest rounding the allocSize down to a multiple of the alignment instead of rounding it up, because aligning the TT already allocates alignment - 1 extra bytes.
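A sketch of the suggested rounding (hypothetical helper; assumes a power-of-two alignment):

#include <cstddef>

std::size_t round_down(std::size_t allocSize, std::size_t alignment) {
    // Drop the partial tail instead of adding up to alignment-1 extra bytes
    // on top of the alignment slack that is already over-allocated.
    return allocSize & ~(alignment - 1);
}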
Playing with this a little more: building a mingw cross-compile fails with this patch, but that's fixed using [...]. That's actually something to add to our CI.
Merged, thanks all for the discussion and contributions.
Thanks! I think the committed one is the best version of this patch.
Just linking this here now that this PR has been merged, in case we decide to go for Windows as well.