Conversation
🚀 I still have on my TODO to run some serious benchmarks on these patches; I'll work to get those up and running to see how they perform under different contexts on some larger machines. Given they're focused on index building, my thought process is to test:
Force-pushed from 822af36 to 6e6e1c7
Rebased this.
Force-pushed from 6e6e1c7 to e40e687
I was reminded of this by https://www.postgresql.org/message-id/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com. Rebased.
Pushed a version of bulk hashing to the bulk-hash branch, but seeing very little difference on Linux x86-64 and Mac arm64. Let me know if you're still seeing a difference (or if I messed something up with the code).
Force-pushed from e40e687 to e0f4a19
Thanks! I'm still seeing a modest ~4-5% difference from the 'bulk-hash' patch. It feels smaller than what I saw before, but it's still measurable and repeatable. (I got a new laptop since I wrote this, so I cannot compare on the same hardware anymore.) Rebased this again. The first commit is essentially the same as the bulk-hash branch.
Thanks to CPU cache effects (I think), it's faster to fetch all the hashes first. This gives a 5-10% speedup in HNSW build of a 100-dimension dataset on my laptop.
Introduce a helper function to calculate the distances of all candidates in an array. It doesn't change much on its own, but paves the way for further optimizations in the next commit.
In my testing, this gives a further 10% speedup in HNSW index build.
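For illustration, the helper could look roughly like this (a minimal standalone sketch; `Candidate`, `L2SquaredDistance`, and `CalculateDistances` are stand-ins, not pgvector's actual types or functions):

```c
/* Stand-in for pgvector's candidate representation (illustrative only). */
typedef struct Candidate
{
	const float *vec;		/* the candidate's vector */
	float		distance;	/* filled in by the helper */
} Candidate;

/* Plain scalar squared-L2 distance. */
static float
L2SquaredDistance(int dim, const float *a, const float *b)
{
	float		sum = 0.0f;

	for (int i = 0; i < dim; i++)
	{
		float		d = a[i] - b[i];

		sum += d * d;
	}
	return sum;
}

/*
 * Compute the distances of all candidates in one call. By itself this is
 * just a loop, but it gives a later commit a single entry point to swap
 * for a multi-point distance variant.
 */
static void
CalculateDistances(const float *q, int dim, Candidate *candidates, int n)
{
	for (int i = 0; i < n; i++)
		candidates[i].distance = L2SquaredDistance(dim, q, candidates[i].vec);
}
```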
Force-pushed from e0f4a19 to 55142c4
@jkatz If you have a chance to test the bulk-hash branch vs its previous commit, I'd be curious to see what you find.
@ankane By previous commit do you mean what's currently on …?
Cool, I meant the commit it was branched from (…).
@ankane Understood. I'm working on this test now. |
Here are two more micro-optimizations of HnswSearchLayer for HNSW index builds.
1st commit: Add a bulk variant of AddToVisited
The idea is to move code around so that we collect all the 'hash' values into an array in a tight loop, before performing the hash table lookups. This codepath causes a lot of CPU cache misses, as the elements are scattered around memory, and performing all the fetches upfront allows the CPU to schedule fetching those cachelines sooner. That's my theory of why this works, anyway :-).
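To make the shape of the change concrete, here's a hedged sketch; the hash-table API (`visitedhash_insert_hash`, `hash_pointer`) is an assumption modeled on a simplehash-style table, not pgvector's exact internals:

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed declarations; stand-ins for the actual visited-table API. */
typedef struct HnswElementData HnswElementData;
typedef struct visitedhash_hash visitedhash_hash;

extern uint64_t hash_pointer(HnswElementData *element);
extern void *visitedhash_insert_hash(visitedhash_hash *tb,
									 HnswElementData *key,
									 uint64_t hash, bool *found);

#define BULK_MAX 100			/* assumed cap on the candidate batch */

/* Bulk variant of AddToVisited: hash everything first, then probe. */
static void
AddToVisitedBulk(visitedhash_hash *v, HnswElementData **elements,
				 int n, bool *found)
{
	uint64_t	hashes[BULK_MAX];

	/*
	 * Pass 1: a tight loop that only touches each element to hash it.
	 * The elements live at scattered addresses, so issuing these loads
	 * back to back lets the CPU overlap the cache misses instead of
	 * serializing each one behind a hash-table probe.
	 */
	for (int i = 0; i < n; i++)
		hashes[i] = hash_pointer(elements[i]);

	/* Pass 2: the hash-table lookups, reusing the precomputed hashes. */
	for (int i = 0; i < n; i++)
		(void) visitedhash_insert_hash(v, elements[i], hashes[i], &found[i]);
}
```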
This gives a 5-10% speedup on my laptop, on an HNSW index build of a subset of the 'angular-100' dataset (the same test I used in my previous PRs). I'd love to hear how this performs on other systems, as it could be very dependent on CPU details.
2nd & 3rd commits: Calculate 4 neighbor distances at a time in HNSW search
This is just a proof of concept at this stage, but shows promising results. The idea is to have a variant of the distance function that calculates the distance from one point 'q' to 4 other points in one call. That gives the vectorized loop in the distance function more work to do in each iteration. If you think this is worthwhile, I can spend more time polishing this, adding these array variants as proper AM support functions, etc.
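As a sketch of the idea (names are illustrative; the real version would go through the distance function's AM dispatch), four independent accumulators give the inner loop four times as much parallel work per iteration, and each element of 'q' is loaded once instead of four times:

```c
/* Squared L2 distance from q to four points at once (illustrative sketch). */
static void
L2SquaredDistance4(int dim, const float *q, const float *v[4], float out[4])
{
	float		acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

	for (int i = 0; i < dim; i++)
	{
		float		qi = q[i];
		float		d0 = qi - v[0][i];
		float		d1 = qi - v[1][i];
		float		d2 = qi - v[2][i];
		float		d3 = qi - v[3][i];

		/* Four independent accumulators keep the vector units busy. */
		acc0 += d0 * d0;
		acc1 += d1 * d1;
		acc2 += d2 * d2;
		acc3 += d3 * d3;
	}
	out[0] = acc0;
	out[1] = acc1;
	out[2] = acc2;
	out[3] = acc3;
}
```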
The 4-at-a-time variant gives another 10% speedup on the same test on my laptop. It could possibly be optimized further, by providing variants with different array sizes, or a variable-length version, letting the function itself vectorize in the optimal way. With some refactoring, I think we could also use this in CheckElementCloser(). This might also work well together with PR #311, but I haven't tested that.
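For example, a caller could walk the candidate array in chunks of four and fall back to the scalar function for the remainder (again just a sketch, reusing the illustrative `Candidate`, `L2SquaredDistance`, and `L2SquaredDistance4` from the snippets above):

```c
static void
CalculateDistances4(const float *q, int dim, Candidate *candidates, int n)
{
	int			i = 0;

	/* Main loop: four candidates per distance-function call. */
	for (; i + 4 <= n; i += 4)
	{
		const float *vecs[4] = {
			candidates[i].vec, candidates[i + 1].vec,
			candidates[i + 2].vec, candidates[i + 3].vec
		};
		float		out[4];

		L2SquaredDistance4(dim, q, vecs, out);
		for (int j = 0; j < 4; j++)
			candidates[i + j].distance = out[j];
	}

	/* Tail: fall back to the scalar distance function. */
	for (; i < n; i++)
		candidates[i].distance = L2SquaredDistance(dim, q, candidates[i].vec);
}
```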