-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Add large page support for NNUE weights #3081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'd like to get CI checks first before submitting any fishtest runs. On my box, I measured a 1-2% speedup. |
a81886d to
754aec8
Compare
754aec8 to
fe97333
Compare
|
CI for Windows passed, so I guess we're ok in that front. CI has already passed for the patches without the Windows patch. If the patches look ok, I'd assume that STC does not need to get rerun? Should we still run LTC? If so, I'll include all 3 patches in that run to get regression coverage also over the TT mem alloc simplification. |
|
I don't think another LTC run is needed for this non-functional speedup. I've verified a speedup also locally. I need to check this runs fine under wine and cross compiles to android, but otherwise I think this can be merged in the next round. Can you squash all commits to one? |
|
Sure, I'll squash a bit later today. |
Use TT memory functions to allocate memory for the NNUE weights. This should provide a small speed-up on systems where large pages are not automatically used, including Windows and some Linux distributions. Further, since we now have a wrapper for std::aligned_alloc(), we can simplify the TT memory management a bit: - We no longer need to store separate pointers to the hash table and its underlying memory allocation. - We also get to merge the Linux-specific and default implementations of aligned_ttmem_alloc(). Finally, we'll enable the VirtualAlloc code path with large page support also for Win32. STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 14896 W: 1854 L: 1686 D: 11356 Ptnml(0-2): 65, 1224, 4742, 1312, 105 No functional change.
e5f433e to
575a1d0
Compare
|
|
|
With the completed patch, it seems strange to use the type One possibility would be to use the following renaming:
The comments before |
| template <typename T> | ||
| void Initialize(TtmemPtr<T>& pointer) { | ||
|
|
||
| static_assert(alignof(T) <= 4096, "Ensure aligned_ttmem_alloc gives big enough alignment"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment of the assert, did you mean the reverse comparison alignof(T) >= 4096 ?
And by the way, what happens on systems without large page supports? Will the assert fail on such systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the text could have been worded better. aligned_ttmem_alloc() allocates memory that is aligned by 4096 or more--it's in the contract and all the variants implement it so (Linux gives 2MB, Windows gives 4kB/2MB depending on big page support, others give 4kB). Here we simply do a build-time check that type T has an alignment requirement of at most 4096.
The static_assert would fail in the following case, for instance:
struct alignas(8192) BigStruct
{
uint8_t some_data_that_needs_8192_alignment[1048576];
};
TtmemPtr<BigStruct> bigData;
Initialize(bigData); // static assert triggered here, since BigStruct type needs more alignment than aligned_ttmem_alloc() guarantees.This static_assert is not hugely important. But since Initialize(TtmemPtr<T> &) is a template where T could be theoretically any type, it's generally advisable to have such a check in place. But I suppose we can also consider removing it if it's confusing.
And by the way, what happens on systems without large page supports? Will the assert fail on such systems?
No. The assert depends only on the alignment requirement of type T.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the explanations. So I propose to reword the comment of the assert like this:
static_assert(alignof(T) <= 4096, "aligned_large_page_alloc() may fail for such a big alignment requirement of T");
I suppose I could do a search'n'replace if this change is deemed necessary. |
yes, please do. |
|
I'll will make the changes while merging. |
|
Thanks! Your edits look good. |
|
Do you measure any speed-up if we use also the LargePagePtr type for the network? Line 57 of evaluate_nnue.cpp: I can't test on my Macintosh, as Mac OS does not have large pages. |
|
The network is probably not so much data (but more than 1 small page). I wonder how far we are from using a large page allocator for the Thread classes, with the History tables, they might actually benefit more. |
There might be some potential here. However, there are also some pitfalls with over-aggressive use of large pages. These mainly are excessive memory use (allocs rounded up to 2MB multiples) and possibly some NUMA issues, since page migration would occur in 2MB chunks. There are couple of considerations:
|
|
The Thread class is actually pretty large 8462080 bytes, most of it will be accessed by random access (history tables). All of this is allocated deallocated in |
Ok, then it seems like a candidate for LP alloc. However, that size is just beyond the previous 2MB boundary (by 73472 bytes), so ideally we'd either shrink or enlarge it a bit to fit better in 2MB alloc multiples. |
|
Shrinking in a non-functional way would be difficult, but it contains e.g. the pawnsTable, which at least in NNUE times should be less important, and could presumably be reduced in size (halved). But even wasting 2MB per thread doesn't seem such a terrible thing. |
|
Possibly. However, would it make sense to bundle the thread stack with the thread object allocation? I see that the stack is 8MB, which would also make it a candidate for LP alloc--and if 7.9MB is actually good enough for the thread we'd save 1.9MB memory per thread. I said earlier that stacks may not be the prime candidates for LP, but I also don't see why it would hurt. |
|
If that's easy to implement, yes, the thread stack could be 7.5Mb or similar (there is a large safety margin on the thread stack space). There is another thing I have been wondering about, the Thread object is allocated, and presumably first touched, by the uciThread. It would possibly be better to have those pages first touched by the thread that is going to use them, so maybe gain something in a NUMA setup as well. Again not sure how the implementation would look like. |
Setting the thread stack should be a matter of calling pthread_attr_setstack() instead of pthread_attr_setstacksize(). I think(?) this covers all platforms that support large pages. On the plus side, explicit stack allocation would allow us to measure the stack highwater mark on exit.
On paper this looks easy. Just pass the allocation to the newly created native thread and let it construct the thread object into the allocation. However, implementing this cleanly might require a bit of refactoring of the Thread/NativeThread relation. Perhaps not that much, though. However, I think this would be most useful for our Windows users, since Windows doesn't seem to do physical memory migration on NUMA? (Or at least there was this bonkers +40% speedup for TT hash on Windows with malloc()-->VirtualAlloc() for some AMD Opteron box.) But shouldn't definitely hurt for Linux, either. Maybe this could be worth an experiment. |
|
If can give it a try, would be nice. diff --git a/src/thread.cpp b/src/thread.cpp
index 2fbf745d0..c233a8ef8 100644
--- a/src/thread.cpp
+++ b/src/thread.cpp
@@ -130,14 +130,25 @@ void ThreadPool::set(size_t requested) {
main()->wait_for_search_finished();
while (size() > 0)
- delete back(), pop_back();
+ {
+ if (size() == 1)
+ static_cast<MainThread*>(back())->~MainThread();
+ else
+ back()->~Thread();
+ aligned_large_pages_free(back());
+ pop_back();
+ }
}
if (requested > 0) { // create new thread(s)
- push_back(new MainThread(0));
+ void* mem = aligned_large_pages_alloc(sizeof(MainThread));
+ push_back(new(mem) MainThread(0));
while (size() < requested)
- push_back(new Thread(size()));
+ {
+ mem = aligned_large_pages_alloc(sizeof(Thread));
+ push_back(new(mem) Thread(size()));
+ }
clear();
// Reallocate the hash with the new threadpool size
|
|
I can take a look at some point. TBH, I wouldn’t expect big speedups—or it can also be that the speedup is present only with specific hash sizes. |
… management Use TT memory functions to allocate memory for the NNUE weights. This should provide a small speed-up on systems where large pages are not automatically used, including Windows and some Linux distributions. Further, since we now have a wrapper for std::aligned_alloc(), we can simplify the TT memory management a bit: - We no longer need to store separate pointers to the hash table and its underlying memory allocation. - We also get to merge the Linux-specific and default implementations of aligned_ttmem_alloc(). Finally, we'll enable the VirtualAlloc code path with large page support also for Win32. STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 14896 W: 1854 L: 1686 D: 11356 Ptnml(0-2): 65, 1224, 4742, 1312, 105 closes official-stockfish#3081 No functional change.
Use TT memory functions to allocate memory for the NNUE weights. This should provide a small speed-up on systems where large pages are not automatically used, including Windows and some Linux distributions. Further, since we now have a wrapper for std::aligned_alloc(), we can simplify the TT memory management a bit: - We no longer need to store separate pointers to the hash table and its underlying memory allocation. - We also get to merge the Linux-specific and default implementations of aligned_ttmem_alloc(). Finally, we'll enable the VirtualAlloc code path with large page support also for Win32. STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 14896 W: 1854 L: 1686 D: 11356 Ptnml(0-2): 65, 1224, 4742, 1312, 105 closes official-stockfish#3081 No functional change.
Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used.
Further, since we now have wrapper for std::aligned_alloc() for NNUE, we
can simplify the TT memory management a bit, since we no longer need to
store separate pointers to the table and the underlying allocation.
No functional change.
STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105