Conversation

@skiminki (Contributor) commented Aug 30, 2020

Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used.

Further, since we now have a wrapper for std::aligned_alloc() for NNUE,
we can simplify the TT memory management a bit: we no longer need to
store separate pointers to the table and the underlying allocation.

No functional change.

STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105

@skiminki skiminki changed the title Add large page support for NNUE weights [WiP] Add large page support for NNUE weights Aug 30, 2020
@skiminki (Contributor, Author)

I'd like to get CI checks first before submitting any fishtest runs. On my box, I measured a 1-2% speedup.

@skiminki (Contributor, Author)

CI for Windows passed, so I guess we're OK on that front. CI had already passed for the patches without the Windows patch.

If the patches look OK, I'd assume the STC does not need to be rerun? Should we still run an LTC? If so, I'll include all 3 patches in that run to get regression coverage over the TT mem alloc simplification as well.

@skiminki skiminki marked this pull request as ready for review September 20, 2020 12:51
@skiminki skiminki changed the title [WiP] Add large page support for NNUE weights Add large page support for NNUE weights Sep 20, 2020
@vondele (Member) commented Sep 20, 2020

I don't think another LTC run is needed for this non-functional speedup. I've also verified a speedup locally.

I need to check that this runs fine under Wine and cross-compiles to Android, but otherwise I think this can be merged in the next round.

Can you squash all commits into one?

@skiminki (Contributor, Author)

Sure, I'll squash a bit later today.

Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used, including Windows and some Linux distributions.

Further, since we now have a wrapper for std::aligned_alloc(), we can
simplify the TT memory management a bit:

- We no longer need to store separate pointers to the hash table and
  its underlying memory allocation.
- We also get to merge the Linux-specific and default implementations
  of aligned_ttmem_alloc().

Finally, we'll enable the VirtualAlloc code path with large page
support also for Win32.

STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105

No functional change.
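For context, a minimal sketch of what such a merged allocation wrapper can look like (assumptions: C++17, a 2 MB large-page size, error handling omitted; not the exact misc.cpp code):

    #include <cstddef>
    #include <cstdlib>
    #if defined(__linux__)
    #include <sys/mman.h>
    #endif

    void* aligned_ttmem_alloc(std::size_t allocSize) {
        constexpr std::size_t alignment = 2 * 1024 * 1024;  // assumed large-page size
        // Round up, since std::aligned_alloc() requires the size to be a
        // multiple of the alignment.
        std::size_t size = (allocSize + alignment - 1) & ~(alignment - 1);
        void* mem = std::aligned_alloc(alignment, size);
    #if defined(__linux__) && defined(MADV_HUGEPAGE)
        madvise(mem, size, MADV_HUGEPAGE);  // hint the kernel to back this with 2 MB pages
    #endif
        return mem;
    }
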
@skiminki (Contributor, Author)

git diff e5f433e 575a1d0 is empty.

@snicolet (Member) commented Sep 20, 2020

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation, since the net is not at all related to the transposition table.

One possibility would be to use the following renaming:

aligned_ttmem_alloc ---> aligned_large_page_alloc
aligned_ttmem_free  ---> aligned_large_page_free
TtmemPtr            ---> LargePagePtr

The comments before aligned_large_page_alloc() in misc.cpp would be sufficient to indicate that this function [...] will return suitably aligned memory, if possible using large pages.

template <typename T>
void Initialize(TtmemPtr<T>& pointer) {

    static_assert(alignof(T) <= 4096, "Ensure aligned_ttmem_alloc gives big enough alignment");
@snicolet (Member)

I don't understand the comment on the assert; did you mean the reverse comparison, alignof(T) >= 4096?

And by the way, what happens on systems without large page support? Will the assert fail on such systems?

@skiminki (Contributor, Author)

I suppose the text could have been worded better. aligned_ttmem_alloc() allocates memory that is aligned to 4096 bytes or more; this is part of the contract, and all the variants implement it (Linux gives 2 MB, Windows gives 4 kB or 2 MB depending on large page support, others give 4 kB). Here we simply do a build-time check that type T has an alignment requirement of at most 4096.

The static_assert would fail in the following case, for instance:

  struct alignas(8192) BigStruct
  {
      uint8_t some_data_that_needs_8192_alignment[1048576];
  };

  TtmemPtr<BigStruct> bigData;
  Initialize(bigData); // static_assert triggers here: BigStruct needs more alignment than aligned_ttmem_alloc() guarantees

This static_assert is not hugely important. But since Initialize(TtmemPtr<T>&) is a template where T could theoretically be any type, it's generally advisable to have such a check in place. But I suppose we can also consider removing it if it's confusing.

And by the way, what happens on systems without large page support? Will the assert fail on such systems?

No. The assert depends only on the alignment requirement of type T.

@snicolet (Member) commented Sep 21, 2020

I see, thanks for the explanations. So I propose to reword the assert's comment like this:

    static_assert(alignof(T) <= 4096, "aligned_large_page_alloc() may fail for such a big alignment requirement of T");

@skiminki (Contributor, Author)

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation...

I suppose I could do a search'n'replace if this change is deemed necessary.

@vondele (Member) commented Sep 20, 2020

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation...

I suppose I could do a search'n'replace if this change is deemed necessary.

Yes, please do.

@vondele vondele added the to be merged Will be merged shortly label Sep 21, 2020
@vondele (Member) commented Sep 21, 2020

I'll make the changes while merging.

@vondele vondele closed this in 485d517 Sep 21, 2020
@skiminki (Contributor, Author)

Thanks! Your edits look good.

@snicolet (Member)

Do you measure any speed-up if we also use the LargePagePtr type for the network? Line 57 of evaluate_nnue.cpp:

   // Evaluation function
   LargePagePtr<Network> network;

I can't test on my Macintosh, as macOS does not have large pages.

@vondele (Member) commented Sep 21, 2020

The network is probably not that much data (though more than one small page). I wonder how far we are from using a large page allocator for the Thread classes; with their history tables, they might actually benefit more.

@skiminki (Contributor, Author)

I wonder how far we are from using a large page allocator for the Thread classes, with the History tables, they might actually benefit more.

There might be some potential here. However, there are also some pitfalls with over-aggressive use of large pages, mainly excessive memory use (allocations are rounded up to 2 MB multiples) and possibly some NUMA issues, since page migration would occur in 2 MB chunks.

There are a couple of considerations:

- Sequential access, such as stack-like structures, probably won't benefit that much from large pages. Large pages are more useful for random access, striped/tiled access (as in matrix calculations), and so on, to relieve pressure on the TLB. However...
- The number of TLB entries is usually on the order of a thousand or so in modern CPUs. Keeping the working set (not including the TT hash) within the TLB coverage is definitely a win. On my CPU, this translates to working set limits of ~6 MB (4 kB pages) and ~3 GB (2 MB pages). (Core i7-8700K, 1536 entries; use cpuid on Linux to query.)
- Per-thread memory allocation could perhaps use large pages with suballocation for small buffers, to prevent excessive padding.
- For estimates of the perf improvements: run some benchmarks with current SF-dev with the THP "enabled/defrag" knobs in "always/always" vs "madvise/madvise" modes (see the sketch after this list). If the diff is not much, then I'd guess it doesn't make much sense to start overhauling memory layouts for large pages. My guess is that we've already exhausted most of the LP potential, but who knows. These things become more complicated with NUMA, and only experiments will tell.
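
To verify the active THP modes for such a benchmark, something like this works (a minimal sketch; Linux-only, reading the standard THP sysfs knobs; the active mode is printed in square brackets):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        for (const char* path : { "/sys/kernel/mm/transparent_hugepage/enabled",
                                  "/sys/kernel/mm/transparent_hugepage/defrag" }) {
            std::ifstream f(path);
            std::string line;
            std::getline(f, line);  // active mode is shown in [brackets]
            std::cout << path << ": " << line << '\n';
        }
    }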

@vondele (Member) commented Sep 21, 2020

The Thread class is actually pretty large (8462080 bytes), and most of it will be accessed randomly (the history tables). All of it is allocated and deallocated in ThreadPool::set().

@skiminki (Contributor, Author) commented Sep 21, 2020

The Thread class is actually pretty large 8462080 bytes, most of it will be accessed by random access (history tables).

OK, then it seems like a candidate for LP alloc. However, that size is just beyond the previous 2 MB multiple (4 × 2097152 = 8388608 bytes) by 73472 bytes, so ideally we'd either shrink or enlarge it a bit to fit better into 2 MB alloc multiples.

@vondele (Member) commented Sep 21, 2020

Shrinking it in a non-functional way would be difficult, but it contains e.g. the pawnsTable, which should be less important in NNUE times and could presumably be reduced in size (halved). But even wasting 2 MB per thread doesn't seem like such a terrible thing.

@skiminki (Contributor, Author)

Possibly. However, would it make sense to bundle the thread stack with the thread object allocation? I see that the stack is 8 MB, which would also make it a candidate for LP alloc; and if 7.9 MB is actually good enough for the thread, we'd save 1.9 MB of memory per thread.

I said earlier that stacks may not be prime candidates for LP, but I also don't see why it would hurt.

@vondele (Member) commented Sep 21, 2020

If that's easy to implement, yes; the thread stack could be 7.5 MB or similar (there is a large safety margin on the thread stack space). There is another thing I have been wondering about: the Thread object is allocated, and presumably first touched, by the uciThread. It would possibly be better to have those pages first touched by the thread that is going to use them, to maybe gain something in a NUMA setup as well. Again, I'm not sure what the implementation would look like.

@skiminki (Contributor, Author) commented Sep 21, 2020

If that's easy to implement, yes, the thread stack could be 7.5Mb or similar

Setting the thread stack should be a matter of calling pthread_attr_setstack() instead of pthread_attr_setstacksize(). I think(?) this covers all platforms that support large pages.

On the plus side, explicit stack allocation would allow us to measure the stack high-water mark on exit.
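
For illustration, a minimal sketch of spawning a thread on such a stack (POSIX-only; aligned_large_pages_alloc() is the renamed wrapper discussed above, while spawn_with_custom_stack() and thread_entry() are hypothetical names; error handling omitted):

    #include <cstddef>
    #include <pthread.h>

    void* aligned_large_pages_alloc(std::size_t size);  // wrapper from this PR
    void* thread_entry(void*);                          // hypothetical thread function

    pthread_t spawn_with_custom_stack(void* arg) {
        // A 2 MB multiple, so the stack can be fully backed by large pages.
        constexpr std::size_t stackSize = 8 * 1024 * 1024;
        void* stackMem = aligned_large_pages_alloc(stackSize);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        // Replaces pthread_attr_setstacksize(): we supply the memory ourselves.
        pthread_attr_setstack(&attr, stackMem, stackSize);

        pthread_t tid;
        pthread_create(&tid, &attr, thread_entry, arg);
        pthread_attr_destroy(&attr);
        return tid;
    }

The caller owns the stack memory and may free it only after the thread has been joined.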

There is another thing I have been wondering about, the Thread object is allocated, and presumably first touched, by the uciThread.

On paper this looks easy: just pass the allocation to the newly created native thread and let it construct the Thread object into the allocation. However, implementing this cleanly might require a bit of refactoring of the Thread/NativeThread relation; perhaps not that much, though. I think this would be most useful for our Windows users, since Windows doesn't seem to do physical memory migration on NUMA? (Or at least there was that bonkers +40% speedup for the TT hash on Windows with the malloc() --> VirtualAlloc() change on some AMD Opteron box.) But it definitely shouldn't hurt on Linux, either.

Maybe this could be worth an experiment.
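
A rough sketch of that idea (hypothetical helper names and a simplified Thread stand-in; the point is that allocation and construction happen on the owning thread, so the first touch lands on the right NUMA node):

    #include <cstddef>
    #include <future>
    #include <new>
    #include <thread>

    struct Thread { /* history tables, etc. */ };       // simplified stand-in

    void* aligned_large_pages_alloc(std::size_t size);  // wrapper from this PR

    std::thread start_search_thread(std::promise<Thread*>& result) {
        return std::thread([&result] {
            // First touch happens here, on the thread that will use the memory.
            void* mem = aligned_large_pages_alloc(sizeof(Thread));
            result.set_value(new (mem) Thread);
            // ... enter the thread's idle/search loop ...
        });
    }

The caller would wait on the corresponding std::future before using the pointer, and keep the promise alive until then.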

@vondele (Member) commented Sep 21, 2020

If you can give it a try, that would be nice.
I had a look at this solution, but there is no speedup on my Linux box:

diff --git a/src/thread.cpp b/src/thread.cpp
index 2fbf745d0..c233a8ef8 100644
--- a/src/thread.cpp
+++ b/src/thread.cpp
@@ -130,14 +130,25 @@ void ThreadPool::set(size_t requested) {
       main()->wait_for_search_finished();
 
       while (size() > 0)
-          delete back(), pop_back();
+      {
+          if (size() == 1)
+              static_cast<MainThread*>(back())->~MainThread();
+          else
+              back()->~Thread();
+          aligned_large_pages_free(back());
+          pop_back();
+      }
   }
 
   if (requested > 0) { // create new thread(s)
-      push_back(new MainThread(0));
+      void* mem = aligned_large_pages_alloc(sizeof(MainThread));
+      push_back(new(mem) MainThread(0));
 
       while (size() < requested)
-          push_back(new Thread(size()));
+      {
+          mem = aligned_large_pages_alloc(sizeof(Thread));
+          push_back(new(mem) Thread(size()));
+      }
       clear();
 
       // Reallocate the hash with the new threadpool size

@skiminki (Contributor, Author)

I can take a look at some point. TBH, I wouldn't expect big speedups; it could also be that the speedup is present only with specific hash sizes.

Dantist added a commit to Dantist/Stockfish that referenced this pull request Dec 22, 2020
BM123499 pushed a commit to BM123499/Stockfish that referenced this pull request Feb 22, 2021