Conversation

@skiminki (Contributor) commented Aug 30, 2020

Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used.

Further, since we now have a wrapper for std::aligned_alloc() for NNUE,
we can simplify the TT memory management a bit: we no longer need to
store separate pointers to the table and the underlying allocation.

No functional change.

STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105

@skiminki skiminki changed the title Add large page support for NNUE weights [WiP] Add large page support for NNUE weights Aug 30, 2020
@skiminki (Contributor, Author)

I'd like to get CI checks first before submitting any fishtest runs. On my box, I measured a 1-2% speedup.

@skiminki (Contributor, Author)

CI for Windows passed, so I guess we're OK on that front. CI had already passed for the patches without the Windows patch.

If the patches look OK, I'd assume the STC does not need to be rerun? Should we still run an LTC? If so, I'll include all 3 patches in that run to get regression coverage over the TT mem alloc simplification as well.

@skiminki skiminki marked this pull request as ready for review September 20, 2020 12:51
@skiminki skiminki changed the title [WiP] Add large page support for NNUE weights Add large page support for NNUE weights Sep 20, 2020
@vondele (Member) commented Sep 20, 2020

I don't think another LTC run is needed for this non-functional speedup. I've also verified a speedup locally.

I need to check that this runs fine under Wine and cross-compiles to Android, but otherwise I think this can be merged in the next round.

Can you squash all commits into one?

@skiminki (Contributor, Author)

Sure, I'll squash a bit later today.

Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used, including Windows and some Linux distributions.

Further, since we now have a wrapper for std::aligned_alloc(), we can
simplify the TT memory management a bit:

- We no longer need to store separate pointers to the hash table and
  its underlying memory allocation.
- We also get to merge the Linux-specific and default implementations
  of aligned_ttmem_alloc().

Finally, we'll enable the VirtualAlloc code path with large page
support also for Win32.

STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105

No functional change.
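For context, a minimal sketch of what such a merged allocation wrapper can look like (assumptions: C++17, a 2 MB large-page size, error handling omitted; not the exact misc.cpp code):

    #include <cstddef>
    #include <cstdlib>
    #if defined(__linux__)
    #include <sys/mman.h>
    #endif

    void* aligned_ttmem_alloc(std::size_t allocSize) {
        constexpr std::size_t alignment = 2 * 1024 * 1024;  // assumed large-page size
        // Round up, since std::aligned_alloc() requires the size to be a
        // multiple of the alignment.
        std::size_t size = (allocSize + alignment - 1) & ~(alignment - 1);
        void* mem = std::aligned_alloc(alignment, size);
    #if defined(__linux__) && defined(MADV_HUGEPAGE)
        madvise(mem, size, MADV_HUGEPAGE);  // hint the kernel to back this with 2 MB pages
    #endif
        return mem;
    }
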
@skiminki (Contributor, Author)

git diff e5f433e 575a1d0 is empty.

@snicolet (Member) commented Sep 20, 2020

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation, since the net is not at all related to the transposition table.

One possibility would be to use the following renaming:

aligned_ttmem_alloc ---> aligned_large_page_alloc
aligned_ttmem_free  ---> aligned_large_page_free
TtmemPtr            ---> LargePagePtr

The comments before aligned_large_page_alloc() in misc.cpp would be sufficient to indicate that this function [...] will return suitably aligned memory, if possible using large pages.

template <typename T>
void Initialize(TtmemPtr<T>& pointer) {

    static_assert(alignof(T) <= 4096, "Ensure aligned_ttmem_alloc gives big enough alignment");
@snicolet (Member)

I don't understand the comment on the assert; did you mean the reverse comparison, alignof(T) >= 4096?

And by the way, what happens on systems without large page support? Will the assert fail on such systems?

@skiminki (Contributor, Author)

I suppose the text could have been worded better. aligned_ttmem_alloc() allocates memory that is aligned to 4096 bytes or more; this is part of the contract, and all the variants implement it (Linux gives 2 MB, Windows gives 4 kB or 2 MB depending on large page support, others give 4 kB). Here we simply do a build-time check that type T has an alignment requirement of at most 4096.

The static_assert would fail in the following case, for instance:

  struct alignas(8192) BigStruct
  {
      uint8_t some_data_that_needs_8192_alignment[1048576];
  };

  TtmemPtr<BigStruct> bigData;
  Initialize(bigData); // static_assert triggers here: BigStruct needs more alignment than aligned_ttmem_alloc() guarantees

This static_assert is not hugely important. But since Initialize(TtmemPtr<T>&) is a template where T could theoretically be any type, it's generally advisable to have such a check in place. But I suppose we can also consider removing it if it's confusing.

And by the way, what happens on systems without large page support? Will the assert fail on such systems?

No. The assert depends only on the alignment requirement of type T.

@snicolet (Member) commented Sep 21, 2020

I see, thanks for the explanations. So I propose to reword the assert's comment like this:

    static_assert(alignof(T) <= 4096, "aligned_large_page_alloc() may fail for such a big alignment requirement of T");

@skiminki (Contributor, Author)

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation...

I suppose I could do a search'n'replace if this change is deemed necessary.

@vondele (Member) commented Sep 20, 2020

With the completed patch, it seems strange to use the type TtmemPtr for the network allocation...

I suppose I could do a search'n'replace if this change is deemed necessary.

Yes, please do.

@vondele vondele added the to be merged Will be merged shortly label Sep 21, 2020
@vondele (Member) commented Sep 21, 2020

I'll make the changes while merging.

@vondele vondele closed this in 485d517 Sep 21, 2020
@skiminki (Contributor, Author)

Thanks! Your edits look good.

@snicolet (Member)

Do you measure any speed-up if we also use the LargePagePtr type for the network? Line 57 of evaluate_nnue.cpp:

   // Evaluation function
   LargePagePtr<Network> network;

I can't test on my Macintosh, as macOS does not have large pages.

@vondele (Member) commented Sep 21, 2020

The network is probably not that much data (though more than one small page). I wonder how far we are from using a large page allocator for the Thread classes; with their history tables, they might actually benefit more.

@skiminki (Contributor, Author)

I wonder how far we are from using a large page allocator for the Thread classes, with the History tables, they might actually benefit more.

There might be some potential here. However, there are also some pitfalls with over-aggressive use of large pages, mainly excessive memory use (allocations are rounded up to 2 MB multiples) and possibly some NUMA issues, since page migration would occur in 2 MB chunks.

There are a couple of considerations:

- Sequential access, such as stack-like structures, probably won't benefit that much from large pages. Large pages are more useful for random access, striped/tiled access (as in matrix calculations), and so on, to relieve pressure on the TLB. However...
- The number of TLB entries is usually on the order of a thousand or so in modern CPUs. Keeping the working set (not including the TT hash) within the TLB coverage is definitely a win. On my CPU, this translates to working set limits of ~6 MB (4 kB pages) and ~3 GB (2 MB pages). (Core i7-8700K, 1536 entries; use cpuid on Linux to query.)
- Per-thread memory allocation could perhaps use large pages with suballocation for small buffers, to prevent excessive padding.
- For estimates of the perf improvements: run some benchmarks with current SF-dev with the THP "enabled/defrag" knobs in "always/always" vs "madvise/madvise" modes (see the sketch after this list). If the diff is not much, then I'd guess it doesn't make much sense to start overhauling memory layouts for large pages. My guess is that we've already exhausted most of the LP potential, but who knows. These things become more complicated with NUMA, and only experiments will tell.
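
To verify the active THP modes for such a benchmark, something like this works (a minimal sketch; Linux-only, reading the standard THP sysfs knobs; the active mode is printed in square brackets):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        for (const char* path : { "/sys/kernel/mm/transparent_hugepage/enabled",
                                  "/sys/kernel/mm/transparent_hugepage/defrag" }) {
            std::ifstream f(path);
            std::string line;
            std::getline(f, line);  // active mode is shown in [brackets]
            std::cout << path << ": " << line << '\n';
        }
    }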

@vondele (Member) commented Sep 21, 2020

The Thread class is actually pretty large (8462080 bytes), and most of it will be accessed randomly (the history tables). All of it is allocated and deallocated in ThreadPool::set().

@skiminki (Contributor, Author) commented Sep 21, 2020

The Thread class is actually pretty large 8462080 bytes, most of it will be accessed by random access (history tables).

OK, then it seems like a candidate for LP alloc. However, that size is just beyond the previous 2 MB multiple (4 × 2097152 = 8388608 bytes) by 73472 bytes, so ideally we'd either shrink or enlarge it a bit to fit better into 2 MB alloc multiples.

@vondele (Member) commented Sep 21, 2020

Shrinking it in a non-functional way would be difficult, but it contains e.g. the pawnsTable, which should be less important in NNUE times and could presumably be reduced in size (halved). But even wasting 2 MB per thread doesn't seem like such a terrible thing.

@skiminki (Contributor, Author)

Possibly. However, would it make sense to bundle the thread stack with the thread object allocation? I see that the stack is 8 MB, which would also make it a candidate for LP alloc; and if 7.9 MB is actually good enough for the thread, we'd save 1.9 MB of memory per thread.

I said earlier that stacks may not be prime candidates for LP, but I also don't see why it would hurt.

@vondele (Member) commented Sep 21, 2020

If that's easy to implement, yes; the thread stack could be 7.5 MB or similar (there is a large safety margin on the thread stack space). There is another thing I have been wondering about: the Thread object is allocated, and presumably first touched, by the uciThread. It would possibly be better to have those pages first touched by the thread that is going to use them, to maybe gain something in a NUMA setup as well. Again, I'm not sure what the implementation would look like.

@skiminki (Contributor, Author) commented Sep 21, 2020

If that's easy to implement, yes, the thread stack could be 7.5Mb or similar

Setting the thread stack should be a matter of calling pthread_attr_setstack() instead of pthread_attr_setstacksize(). I think(?) this covers all platforms that support large pages.

On the plus side, explicit stack allocation would allow us to measure the stack high-water mark on exit.
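
For illustration, a minimal sketch of spawning a thread on such a stack (POSIX-only; aligned_large_pages_alloc() is the renamed wrapper discussed above, while spawn_with_custom_stack() and thread_entry() are hypothetical names; error handling omitted):

    #include <cstddef>
    #include <pthread.h>

    void* aligned_large_pages_alloc(std::size_t size);  // wrapper from this PR
    void* thread_entry(void*);                          // hypothetical thread function

    pthread_t spawn_with_custom_stack(void* arg) {
        // A 2 MB multiple, so the stack can be fully backed by large pages.
        constexpr std::size_t stackSize = 8 * 1024 * 1024;
        void* stackMem = aligned_large_pages_alloc(stackSize);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        // Replaces pthread_attr_setstacksize(): we supply the memory ourselves.
        pthread_attr_setstack(&attr, stackMem, stackSize);

        pthread_t tid;
        pthread_create(&tid, &attr, thread_entry, arg);
        pthread_attr_destroy(&attr);
        return tid;
    }

The caller owns the stack memory and may free it only after the thread has been joined.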

There is another thing I have been wondering about, the Thread object is allocated, and presumably first touched, by the uciThread.

On paper this looks easy: just pass the allocation to the newly created native thread and let it construct the Thread object into the allocation. However, implementing this cleanly might require a bit of refactoring of the Thread/NativeThread relation; perhaps not that much, though. I think this would be most useful for our Windows users, since Windows doesn't seem to do physical memory migration on NUMA? (Or at least there was that bonkers +40% speedup for the TT hash on Windows with the malloc() --> VirtualAlloc() change on some AMD Opteron box.) But it definitely shouldn't hurt on Linux, either.

Maybe this could be worth an experiment.
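
A rough sketch of that idea (hypothetical helper names and a simplified Thread stand-in; the point is that allocation and construction happen on the owning thread, so the first touch lands on the right NUMA node):

    #include <cstddef>
    #include <future>
    #include <new>
    #include <thread>

    struct Thread { /* history tables, etc. */ };       // simplified stand-in

    void* aligned_large_pages_alloc(std::size_t size);  // wrapper from this PR

    std::thread start_search_thread(std::promise<Thread*>& result) {
        return std::thread([&result] {
            // First touch happens here, on the thread that will use the memory.
            void* mem = aligned_large_pages_alloc(sizeof(Thread));
            result.set_value(new (mem) Thread);
            // ... enter the thread's idle/search loop ...
        });
    }

The caller would wait on the corresponding std::future before using the pointer, and keep the promise alive until then.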

@vondele (Member) commented Sep 21, 2020

If you can give it a try, that would be nice.
I had a look at this solution, but there is no speedup on my Linux box:

diff --git a/src/thread.cpp b/src/thread.cpp
index 2fbf745d0..c233a8ef8 100644
--- a/src/thread.cpp
+++ b/src/thread.cpp
@@ -130,14 +130,25 @@ void ThreadPool::set(size_t requested) {
       main()->wait_for_search_finished();
 
       while (size() > 0)
-          delete back(), pop_back();
+      {
+          if (size() == 1)
+              static_cast<MainThread*>(back())->~MainThread();
+          else
+              back()->~Thread();
+          aligned_large_pages_free(back());
+          pop_back();
+      }
   }
 
   if (requested > 0) { // create new thread(s)
-      push_back(new MainThread(0));
+      void* mem = aligned_large_pages_alloc(sizeof(MainThread));
+      push_back(new(mem) MainThread(0));
 
       while (size() < requested)
-          push_back(new Thread(size()));
+      {
+          mem = aligned_large_pages_alloc(sizeof(Thread));
+          push_back(new(mem) Thread(size()));
+      }
       clear();
 
       // Reallocate the hash with the new threadpool size

@skiminki (Contributor, Author)

I can take a look at some point. TBH, I wouldn't expect big speedups; it could also be that the speedup is present only with specific hash sizes.

Dantist added a commit to Dantist/Stockfish that referenced this pull request Dec 22, 2020
BM123499 pushed a commit to BM123499/Stockfish that referenced this pull request Feb 22, 2021