
Conversation

@skiminki
Contributor

Align the TT allocation by 2M to make it huge page friendly and advise the
kernel to use huge pages.

Benchmarks on my i7-8700K (6C/12T) box: (3 runs per bench per config)

                    vanilla (nps)               hugepages (nps)              avg
==================================================================================
bench             | 3012490  3024364  3036331   3071052  3067544  3071052    +1.5%
bench 16 12 20    | 19237932 19050166 19085315  19266346 19207025 19548758   +1.1%
bench 16384 12 20 | 18182313 18371581 18336838  19381275 19738012 19620225   +7.0%

On my box, huge pages have a significant perf impact when using a big
hash size. They also speed up TT initialization big time:

                                  vanilla (s)  huge pages (s)  speed-up
=======================================================================
time stockfish bench 16384 1 1  | 5.37         1.48            3.6x

In practice, huge pages with auto-defrag may always be enabled in the
system, in which case this patch essentially does nothing. But this
depends on the values in /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag, which are
distro-specific. Not all distros use always/always here, as these
settings are not automatic wins for all applications.

But in any case, it won't hurt to align the TT by the huge page size
and explicitly request huge pages, to ensure the use of huge pages when
they are available.

NO FUNCTIONAL CHANGE.
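
In code, the core of the approach is only a few lines. A minimal illustrative sketch (not the actual patch; HugePageSize and tt_alloc_huge are made-up names), assuming Linux/glibc with aligned_alloc() available:

#include <cstdlib>
#include <sys/mman.h>

constexpr size_t HugePageSize = 2 * 1024 * 1024;  // 2M, the common x86-64 huge page size

void* tt_alloc_huge(size_t bytes) {
    // Round the request up to a 2M multiple so the whole allocation is
    // eligible for huge pages (aligned_alloc also requires this).
    size_t allocSize = (bytes + HugePageSize - 1) & ~(HugePageSize - 1);

    void* mem = aligned_alloc(HugePageSize, allocSize);
    if (mem)
        madvise(mem, allocSize, MADV_HUGEPAGE);  // best-effort hint to the kernel

    return mem;
}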

@skiminki skiminki force-pushed the tt-transparent-huge-pages branch 2 times, most recently from f715511 to d7bf423 Compare December 25, 2019 14:06
@vondele
Member

vondele commented Dec 25, 2019

For reference: about a 35% difference when toggling between:

+  madvise(mem, allocSize, MADV_NOHUGEPAGE);

and

+  madvise(mem, allocSize, MADV_HUGEPAGE);

The system has hugepages by default, and master matches the MADV_HUGEPAGE performance (dual-socket Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz).

So yes, on some systems this might be a significant gain.

-bash-4.2$ for i in `seq 1 10`; do ./stockfish.nohugepage bench 10000 70 20 default depth 2>&1 | grep 'Nodes/second'; done
Nodes/second    : 67328195
Nodes/second    : 65982070
Nodes/second    : 65514851
Nodes/second    : 65072194
Nodes/second    : 65568597
Nodes/second    : 65570284
Nodes/second    : 64491622
Nodes/second    : 66089321
Nodes/second    : 65600501
Nodes/second    : 65635178
-bash-4.2$ for i in `seq 1 10`; do ./stockfish.hugepage bench 10000 70 20 default depth 2>&1 | grep 'Nodes/second'; done
Nodes/second    : 92966381
Nodes/second    : 88320163
Nodes/second    : 88866438
Nodes/second    : 88627935
Nodes/second    : 89368951
Nodes/second    : 88280357
Nodes/second    : 88586165
Nodes/second    : 88285122
Nodes/second    : 89297419
Nodes/second    : 88944702

@skiminki skiminki force-pushed the tt-transparent-huge-pages branch from d7bf423 to 6f73b61 Compare December 25, 2019 21:20
@skiminki
Contributor Author

The system has hugepages by default, and master matches the MADV_HUGEPAGE performance

I guess one can expect that /sys/kernel/mm/transparent_hugepage/enabled is "always" on most new systems, but /sys/kernel/mm/transparent_hugepage/defrag is not necessarily so. This can make the performance non-deterministic: you get huge pages if there's not too much fragmentation in the page pool. This is not necessarily true if the system has some uptime.
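
For reference, a tiny standalone checker (illustration only, not part of the patch) that prints those two sysfs values; the kernel brackets the active setting, e.g. "always [madvise] never":

#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (const char* node : { "/sys/kernel/mm/transparent_hugepage/enabled",
                              "/sys/kernel/mm/transparent_hugepage/defrag" })
    {
        std::ifstream f(node);
        std::string line;
        std::getline(f, line);                    // e.g. "always [madvise] never"
        std::cout << node << ": " << line << '\n';
    }
}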

About systems that don't have enabled=always / defrag=always:

  • It has been reported to me that debian/sid uses enabled=always / defrag=madvise. Not sure if this is true for all installs, but at least it's so on that one box. I also have some Android machines that use this combo.
  • I have a couple years old debian/jessie box that does not have huge pages enabled by default for some reason (enabled=madvise / defrag=always).

@MichaelB7
Contributor

MichaelB7 commented Dec 26, 2019

This is on a Linux box, average of 3 runs; this box has 128GB of RAM, with 20GB dedicated to large pages at startup.

            vanilla (nps)      hugepages (nps)   avg
============================================================
bench        889934186           103618945        +11.6%

Looks like a winner to me. Thank you @skiminki for submitting this patch.

vondele added a commit to vondele/Stockfish that referenced this pull request Dec 27, 2019
@vondele
Member

vondele commented Dec 27, 2019

For reference, it passes non-regression on fishtest (which is no surprise):
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 53271 W: 11598 L: 11581 D: 30092
Ptnml(0-2): 661, 5666, 14000, 5594, 704
http://tests.stockfishchess.org/tests/view/5e05b355c13ac2425c4a9d64

No Elo gain there, probably because most machines there have huge pages enabled by default.

src/tt.cpp Outdated
Contributor


Would

madvise(mem, allocSize, MADV_HUGEPAGE | MADV_RANDOM);

be even better?

Contributor Author


A reasonable consideration. However:

  • The advice parameter is not actually a bitfield, so you can't use bitwise OR.
  • Since the TT hash is always supposed to be resident after the initial memset() that clears it, MADV_RANDOM doesn't actually do anything, to my knowledge. MADV_RANDOM/MADV_SEQUENTIAL are supposed to control prefetching (i.e., file read-ahead) of swapped-out pages or pages in memory-mapped files. MADV_RANDOM might theoretically make sense for the Syzygy files, to control resident-memory bloat, but that would need a separate study.

The one advice that I would have liked to use here is MADV_WILLNEED to make pages resident without page faulting on the initial memset(). This should speed up hash init even more. Unfortunately, it doesn't seem to do anything for anonymous mappings (i.e., plain memory not mapped to files).
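
To make the bitfield point concrete, here is what combining the two advices would have to look like (illustrative sketch; two separate calls, since the advice argument is a plain value, not flags):

#include <cstddef>
#include <sys/mman.h>

void advise(void* mem, size_t allocSize) {
    // Each advice needs its own madvise() call; OR-ing the constants
    // together would produce some other, unrelated advice value.
    madvise(mem, allocSize, MADV_HUGEPAGE);
    madvise(mem, allocSize, MADV_RANDOM);  // likely a no-op for this mapping, as noted above
}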

@MichaelB7
Contributor

I'm not a windows user, but should we be considering large pages for Windows as well?

@skiminki
Contributor Author

I'm not a windows user, but should we be considering large pages for Windows as well?

I'm not a Windows dev either, but I would guess that large-page allocation has a bigger positive impact on Windows than on Linux. I'm not aware that Windows has best-effort large pages like Linux does (i.e., transparent huge pages), but I might be wrong, and it's worth a test at least, I suppose.

The Windows instructions are here: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support . A bit more complicated than on Linux, but not too complicated.

I think that enabling large pages for OSX would make sense too, if that can be done.

@MichaelB7
Contributor

macOS does not support large/huge pages as far as I know. I believe Brainfish may have already added large pages to its SF fork. It should be relatively easy if someone could test.

@snicolet
Member

snicolet commented Jan 7, 2020

I am on MacOS, so I am obviously not an expert on these Windows/Linux-specific subjects like NUMA, huge pages, etc. (and I admit that platform-specific speed-ups are not my cup of tea; in the long run I trust algorithmic improvements more for Stockfish).

However, I understand that performance is an important psychological issue for some users with top machines, so I will approve this pull request.

Can you please write a helper function aligned_malloc() in misc.cpp and isolate all the conditional compilation with USE_MADVISE_HUGEPAGE inside this function?

@skiminki skiminki force-pushed the tt-transparent-huge-pages branch from 6f73b61 to 79ba510 Compare January 8, 2020 00:02
@snicolet
Member

snicolet commented Jan 8, 2020

Err, that is a +54-line patch for a trivial change...

Sorry, but the patch is getting a little too invasive for my taste; this is not what I had in mind. Unfortunately, you have force-pushed changes to the pull request branch, so I cannot access your first version anymore :-(

@snicolet
Member

snicolet commented Jan 8, 2020

What about the version in the following branch?
https://github.com/snicolet/Stockfish/commits/large_pages

Does it give the same speed-up as your first version on your system (+7.0% for a 16-gigabyte hash table)? Obviously I haven't tested it myself, as I said :-)

@vondele
Member

vondele commented Jan 8, 2020

@snicolet your version looks clean to me (some header and Makefile changes are still needed), but I would name the function large_page_alloc instead of aligned_malloc, since aligned_alloc is already a standard C++ function, and we're doing more than just aligning.

@skiminki
Contributor Author

skiminki commented Jan 8, 2020

Ok, so from the suggested name of the function I thought the aligned-allocation logic was to be contained in a misc.cpp function. What I did in my patch was just a simple refactoring of the already-existing cache-line alignment logic, plus added documentation (that's a third of the patch). The existing logic is not quite obvious unless you know the pattern.
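
For readers who don't know it, the pattern in question is roughly this (illustrative sketch, not the actual tt.cpp code; alignment must be a power of two for the mask to work):

#include <cstdint>
#include <cstdlib>

void* align_within(void* raw, size_t alignment) {
    // Round the raw address up to the next multiple of alignment.
    uintptr_t p = reinterpret_cast<uintptr_t>(raw);
    return reinterpret_cast<void*>((p + alignment - 1) & ~uintptr_t(alignment - 1));
}

// Usage sketch:
//   mem   = malloc(size + alignment - 1);        // raw pointer, kept for free()
//   table = (Cluster*)align_within(mem, CacheLineSize);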

@snicolet 's smaller patch (as well as my original) will now do double alignment and always allocate the extra 63 bytes, even when we have aligned_alloc() available. The only downside is that an access slightly beyond the end of the TT allocation will never be caught by valgrind or a debug malloc() (e.g., off-by-one cases), but that's not a practical concern, I suppose. I can test the patch and add the header / Makefile changes later.

But I would also suggest a different name. aligned_alloc/aligned_malloc would suggest that the function can be used for general allocation of aligned memory, which it doesn't do. And there's also the C11 function-name clash.

@skiminki
Contributor Author

skiminki commented Jan 8, 2020

Unfortunately, you have force-pushed changes to the pull request branch

Oh, I thought one could still access the earlier versions by commit ids? At least they're visible on this page and the links work... I'm not that familiar with GitHub :/

@snicolet
Member

snicolet commented Jan 8, 2020

Change Makefile, add header, rename function:
https://github.com/snicolet/Stockfish/commits/large_pages

@skiminki
Contributor Author

skiminki commented Jan 8, 2020

Looks ok and works on my box. Added some minor notes which you may or may not want to fix.

@snicolet
Member

snicolet commented Jan 9, 2020

@skiminki Thanks for the test and the review! My current plan is to commit huge page support for Linux before SF11, with full credit to you as the author of the patch of course, and I will use the explanations that you have written for this PR.

The only reason why I am delaying the commit is that someone suggested we should add large page support for Windows too, if it is easy to do so; it is probably important for "marketing" (to avoid bad comments on Talkchess, for instance) to have this in SF11 too.

@snicolet
Member

snicolet commented Jan 9, 2020

By the way, are there any risks in using aligned_alloc(alignment, size) instead of malloc(size)?

a) what happens if a user or a chess GUI launches Stockfish on a system with fragmented memory and asks for 16 Gigabytes of hash table? Is the aligned_alloc() version more prone to fail, which would be a regression for Stockfish usability?

b) I read on https://linux.die.net/man/3/aligned_alloc that

the function aligned_alloc() is the same as memalign(), except for the added restriction
that size should be a multiple of alignment.

I don't enforce this restriction at all in my version...

c) Wouldn't it be generally safer to fall back on malloc() if aligned_alloc() fails?

@vondele
Member

vondele commented Jan 9, 2020

@snicolet your questions, and the fact that the behavior might depend quite strongly on the environment, would suggest to me that this is the kind of patch to commit early in the release cycle, to give it some testing on dev, not shortly before a release.

@skiminki
Contributor Author

skiminki commented Jan 9, 2020

By the way, are there any risks in using aligned_alloc(alignment, size) instead of malloc(size)

There should not be. But see below (item b).

(a) On Linux/glibc, aligned_alloc(alignment, size) internally allocates extra virtual space for aligned allocations and adjusts the returned pointer appropriately, just like TT::resize() does. On Linux, these extra allocated pages do not consume actual memory, since they're never accessed. Also, physical page pool fragmentation is not an issue here, since the alignment is on the virtual address, not the physical one.

The madvise() call is slightly trickier, since large pages map large virtual --> large physical pages. If there's physical page pool fragmentation, then depending on the system settings, Linux will try to defragment the physical page pool to make room for large pages. But this is best-effort only. If defragment can't be done (locked pages), then whatever large pages are available are used, and the rest will fall back to small pages. The madvise() call with huge pages advice is just a performance hint anyways, so it won't guarantee huge pages.

There is also a way to guarantee huge pages, but that's more intrusive: it requires the user to set up the large page pools. I have patches for this in my master too (since that's what I'm running for my analysis), but they're probably too intrusive here. https://github.com/skiminki/Stockfish/commits/master

On 32-bit systems, virtual address space fragmentation might become an issue. This basically means that max available hash size would drop by the amount of alignment required.

(b) You're right. The C11 standard actually requires that size = n * alignment. I have missed that in my version, too. On Linux glibc that requirement is relaxed, but we should honor it anyway. I don't know what the case is with Android bionic, for instance. Oh, well. https://en.cppreference.com/w/c/memory/aligned_alloc

(c) That's perfectly safe, as far as I can tell. Good point, so why not do that?

About large pages on Windows: Fizbo at least has it, and the code is here: https://github.com/MoonstoneLight/Fizbo-Chess/blob/master/hash.cpp#L50 . So we'd need to use VirtualAlloc/VirtualFree and request a privilege token in order to do so. This was actually a hidden reason for my patch formulation here, so that large pages for Windows would be easy to add in a later patch.
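
A heavily hedged, untested sketch of that Windows route (the function name is made up; enable SeLockMemoryPrivilege on the process token, then request MEM_LARGE_PAGES from VirtualAlloc):

#ifdef _WIN32
#include <windows.h>

void* large_page_alloc_win(size_t size) {
    // Enable the "lock pages in memory" privilege; this must also be
    // granted to the user in the local security policy.
    HANDLE token;
    if (OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token))
    {
        TOKEN_PRIVILEGES tp{};
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
        LookupPrivilegeValue(nullptr, SE_LOCK_MEMORY_NAME, &tp.Privileges[0].Luid);
        AdjustTokenPrivileges(token, FALSE, &tp, 0, nullptr, nullptr);
        CloseHandle(token);
    }

    // MEM_LARGE_PAGES requires the size to be a multiple of the
    // large-page minimum (0 means large pages are unsupported).
    size_t large = GetLargePageMinimum();
    if (!large)
        return nullptr;
    size = (size + large - 1) & ~(large - 1);

    return VirtualAlloc(nullptr, size,
                        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                        PAGE_READWRITE);
}
#endif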

I would expect that Windows users always get a perf boost from requesting large pages, because Windows doesn't do transparent best-effort large pages ever AFAIK. But I'm not a Windows dev and I haven't tested this.

It would also be nice for OSX users to get large pages. However, I tried to google this a bit and only found some unanswered Stack Overflow questions.

One more point. TCEC used to have a rule that engines can't request large pages, for fairness, but this is no longer the case. I suspect that the rule existed because large pages are first-come, first-served on Windows. But TCEC is now running on Linux, and the servers are using aggressive transparent large page settings anyway (i.e., this patch wouldn't be needed there). https://wiki.chessdom.org/TCEC_Season_17_Further_information

@snicolet
Member

Version with fall back on malloc() in case the aligned_alloc() call fails:
https://github.com/snicolet/Stockfish/commits/large_pages

@skiminki
Contributor Author

Version with fall back on malloc() in case the aligned_alloc() call fails:
https://github.com/snicolet/Stockfish/commits/large_pages

Verified locally. Still works for me.

@gvreuls
Contributor

gvreuls commented Jan 14, 2020

I don't understand the concern about aligned_alloc() failing, because in master we force exactly the same behavior that aligned_alloc() exhibits: we allocate (alignment - 1) bytes more than the requested size and use the extra space to align the pointer we return.

Because we're aligning on CacheLineSize, which is 64 bytes, aligned_alloc() will allocate 63 bytes more than an unaligned malloc() would, so the concern about it failing over 63 extra bytes when we're allocating megabytes or gigabytes of memory seems unwarranted IMO; and if it does fail, we should just exit with an error message, as we did before when we still used malloc().

About @snicolet 's concern: we do enforce the restriction that the requested size should be a multiple of the alignment, because we allocate in units of megabytes and a megabyte is a multiple of 64 bytes.

@vondele
Member

vondele commented Jan 14, 2020

@gvreuls, for example aligned_alloc expects a size that is a multiple of the alignment:

Passing a size which is not an integral multiple of alignment or an alignment which is not valid or not supported by the implementation causes the function to fail and return a null pointer (C11, as published, specified undefined behavior in this case; this was corrected by DR 460).

https://en.cppreference.com/w/c/memory/aligned_alloc

@gvreuls
Contributor

gvreuls commented Jan 14, 2020

@vondele I already addressed that in my previous post: we're aligning on 64 bytes and allocating memory in units of megabytes, which is a multiple of 64 bytes:

clusterCount = mbSize * 1024 * 1024 / sizeof(Cluster);

@vondele
Member

vondele commented Jan 14, 2020

I'm confused about which branch we're talking about.
https://github.com/snicolet/Stockfish/commits/large_pages
still allocates with a different size, if I read the code correctly.

@gvreuls
Contributor

gvreuls commented Jan 14, 2020

I'm trying to very politely make the point that that code isn't up to par, because it aligns memory twice (which is almost always a sign you're in trouble). You don't need to align huge pages; the system will hand you an aligned huge page when it has one, and you should use aligned_alloc() instead of aligning memory manually. This is what a Linux version should look like:

#include <cstdlib>      // aligned_alloc(), free(), exit()
#include <iostream>     // std::cerr
#include <sys/mman.h>   // madvise(), MADV_HUGEPAGE

void* large_page_alloc(size_t alignment, size_t size) {
  auto addr = aligned_alloc(alignment, size);
#ifdef USE_MADVISE_HUGEPAGE
  if (addr)
      madvise(addr, size, MADV_HUGEPAGE);
#endif
  return addr;
}

void TranspositionTable::resize(size_t mbSize) {
  Threads.main()->wait_for_search_finished();
  clusterCount = mbSize * 1024 * 1024 / sizeof(Cluster);

  free(table);
  table = (Cluster*)large_page_alloc(CacheLineSize, clusterCount * sizeof(Cluster));
  if (!table)
  {
      std::cerr << "Failed to allocate " << mbSize
                << "MB for transposition table." << std::endl;
      exit(EXIT_FAILURE);
  }
  clear();
}

(Using aligned_alloc() also means we can drop the mem pointer from TranspositionTable.)

@snicolet
Member

@gvreuls
Thanks for the heads up, can you make another pull request?
It seems the discussion would be easier to follow from a clear starting point :-)

@gvreuls
Contributor

gvreuls commented Jan 14, 2020

@snicolet I will open a new PR if you think it's necessary, but I have no clue what the Windows large_page_alloc() would look like; on Linux it's pretty straightforward :-)

@snicolet
Member

snicolet commented Jan 14, 2020

Yes, it would be interesting, if only for Linux as it is so short there.

Note that the master code for (small) alignment is quite old (see git blame on tt.cpp) and was written before the switch to C++11 (so aligned_alloc() was not an option at the time), but it is a part of the code with non-obvious traps; see for instance this commit by Marco:
595fc34

For Windows, all the implementations I have seen in other open-source chess programs are ~100-200 lines long, which is obviously too intrusive.

@gvreuls gvreuls mentioned this pull request Jan 14, 2020
@gvreuls
Contributor

gvreuls commented Jan 14, 2020

@snicolet I have opened #2491

@skiminki
Contributor Author

I'm trying to very politely make the point that that code isn't up to par, because it aligns memory twice

My preference would have been to go with 79ba510, which avoids the double-alignment issues altogether and allows a clean(?) Windows-specific implementation later. But it is a bit more intrusive than snicolet's patch, I suppose.

@gvreuls
Contributor

gvreuls commented Jan 21, 2020

I rewrote my original PR with code as simple as possible. This one introduces a new UCI option, "Use Large Memory Pages", to control usage of huge memory pages on Linux. It correctly aligns and sizes memory depending on the new UCI option, and should be easily extendable to use Windows large pages too. large_pages

@snicolet
Member

@gvreuls I am not in favor of introducing a new UCI option, sorry. We currently have too many such options; we should seek to reduce their number rather than increase it.

Here we have a case where we should be able to do the right thing transparently to the user.

@mstembera
Contributor

Just FYI, the new TCEC machine runs Linux so some form of this would be nice.

@gvreuls
Contributor

gvreuls commented Jan 24, 2020

@mstembera The new TCEC Linux machine uses huge pages by default, this patch isn't needed for TCEC, but it's very useful for Linux machines with normal (less wasteful) memory management. See : TCEC Season 17 Further information

@vondele
Member

vondele commented Jan 25, 2020

@skiminki I'd like to work towards making this patch ready for merging. I'd like to understand a few more things.

  • Can't we just round up the allocSize so it is a multiple of the alignment (see the sketch after this list)? The user can specify an odd number of MB for the TT, and we don't want that to cause aligned_alloc to fail.
  • Can we just use __linux__ without the Makefile changes and avoid defining USE_MADVISE_HUGEPAGE?
  • Since std::aligned_alloc is C++17, and would be the one we get via <cstdlib>, the aligned_alloc we use is actually from <stdlib.h> (probably included indirectly via <cstdlib>). Wouldn't it be better to include the <stdlib.h> file explicitly?
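
For the first point, the rounding itself is a one-liner. A minimal sketch (round_up is an illustrative name, not code from the PR):

#include <cstddef>

size_t round_up(size_t size, size_t alignment) {
    // Smallest multiple of alignment that is >= size.
    return (size + alignment - 1) / alignment * alignment;
}

// e.g.: mem = aligned_alloc(alignment, round_up(allocSize, alignment));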

@gvreuls
Contributor

gvreuls commented Jan 25, 2020

I'd suggest rounding down the allocSize to a multiple of the alignment instead of rounding it up because aligning the TT already allocates alignment - 1 bytes extra.
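
As a sketch, rounding down instead is just truncating division (again an illustrative helper, not PR code):

#include <cstddef>

size_t round_down(size_t size, size_t alignment) {
    // Largest multiple of alignment that is <= size.
    return size / alignment * alignment;
}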

@vondele
Member

vondele commented Jan 27, 2020

Playing with this PR a little more: building a mingw cross-compile fails with this patch, but that's fixed by using __linux__ directly.

That's actually something to add to our CI.

@vondele vondele closed this in 39437f4 Jan 27, 2020
@vondele
Member

vondele commented Jan 27, 2020

merged, thanks all for the discussion and contributions.

@skiminki
Contributor Author

Thanks! I think the committed one is the best version of this patch.

@mstembera
Contributor

Just linking this here now that this PR has been merged and in case we decide to go for Windows as well.
https://groups.google.com/d/msg/fishcooking/maEOv8ypkMU/z20YLZm2CgAJ

@skiminki skiminki deleted the tt-transparent-huge-pages branch June 13, 2020 12:28