
Optimize visiting neighbors #386

Open

hlinnaka wants to merge 3 commits into pgvector:master from hlinnaka:optimize-visit-neighbors

Conversation

@hlinnaka (Contributor)

Here are two more micro-optimizations of HnswSearchLayer for HNSW index builds.

1st commit: Add a bulk variant of AddToVisited

The idea is to move code around so that we collect all the 'hash' values to an array in a tight loop, before performing the hash table lookups. This codepath causes a lot of CPU cache misses, as the elements are scattered around memory, and performing all the fetches upfront allows the CPU to schedule fetching those cachelines sooner. That's my theory of why this works, anyway :-).

This gives a 5%-10% speedup on my laptop, on HNSW index build of a subset of the 'angular-100' dataset (same test I used on my previous PRs). I'd love to hear how this performs on other systems, as this could be very dependent on CPU details.

2nd & 3rd commits: Calculate 4 neighbor distances at a time in HNSW search

This is just a proof of concept at this stage, but shows promising results. The idea is to have a variant of the distance function that calculates the distance from one point 'q' to 4 other points in one call. That gives the vectorized loop in the distance function more work to do in each iteration. If you think this is worthwhile, I can spend more time polishing this, adding these array-variants as proper AM support functions, etc.

This gives another 10% speedup on the same test on my laptop. It could possibly be optimized further, by providing variants with different array sizes, or a variable-length version and let the function itself vectorize it in the optimal way. With some refactoring, I think we could also use this in CheckElementCloser(). This might also work well together with PR #311, but I haven't tested that.

@jkatz (Contributor)

jkatz commented Dec 22, 2023

🚀 I still have on my TODO to run some serious benchmarks on these patches; I'll work to get those up and running to see how it performs under different contexts on some larger machines. Given they're focused on index building, my thought process is to test:

  1. "Regression" tests using the ANN Benchmark suite to capture both performance/recall -- need to do two sets: with and without parallel build
  2. Increasing concurrent building benchmarks (discussed here)
  3. Testing with the parallel build benchmark on different large data sets (10MM + 100MM) with a focus on timing

@ankane (Member)

ankane commented Jan 8, 2024

Hey @hlinnaka, thanks for more PRs! I'm currently working on in-memory parallel index builds, but will dig into this and #388 once that's finished (as it may affect the impact on build time).

@hlinnaka force-pushed the optimize-visit-neighbors branch from 822af36 to 6e6e1c7 on January 29, 2024 at 09:14
@hlinnaka (Contributor, Author)
Rebased this.

@hlinnaka force-pushed the optimize-visit-neighbors branch from 6e6e1c7 to e40e687 on June 11, 2024 at 08:42
@ankane (Member)

ankane commented Sep 19, 2024

Hey @hlinnaka, I incorporated some of this in 8dde14a as part of work to reduce memory usage for HNSW index scans.

(still planning to visit more of it, but want to get further along in that/filtering work before introducing more complexity)

@ankane (Member)

ankane commented Sep 21, 2024

Pushed a version of bulk hashing to the bulk-hash branch, but seeing very little difference on Linux x86-64 and Mac arm64. Let me know if you're still seeing a difference (or if I messed something up with the code).

@hlinnaka force-pushed the optimize-visit-neighbors branch from e40e687 to e0f4a19 on October 7, 2024 at 14:15
@hlinnaka (Contributor, Author)

hlinnaka commented Oct 7, 2024

> Pushed a version of bulk hashing to the bulk-hash branch, but seeing very little difference on Linux x86-64 and Mac arm64. Let me know if you're still seeing a difference (or if I messed something up with the code).

Thanks! I'm still seeing a modest ~4-5% difference from the 'bulk-hash' patch. It feels smaller than what I saw before, but it's still measurable and repeatable.

(I got a new laptop since I wrote this, so I cannot compare on the same hardware anymore)

Rebased this again. The first commit is essentially the same as the bulk-hash branch.

Thanks to CPU cache effects (I think), it's faster to fetch all the hashes first. This gives a 5%-10% speedup in an HNSW build of a 100-dimension dataset on my laptop.

Introduce a helper function to calculate the distances of all candidates in an array. It doesn't change much on its own, but paves the way for further optimizations in the next commit.

In my testing, this gives a further 10% speedup in HNSW index build.
@hlinnaka force-pushed the optimize-visit-neighbors branch from e0f4a19 to 55142c4 on October 7, 2024 at 14:21
@ankane (Member)

ankane commented Oct 8, 2024

@jkatz If you have a chance to test the bulk-hash branch vs its previous commit, I'd be curious to see what you find.

@jkatz (Contributor)

jkatz commented Oct 8, 2024

@ankane By previous commit, do you mean what's currently on master? Happy to test. (Finishing up one comparing performance between PG16 and PG17.)

@ankane (Member)

ankane commented Oct 8, 2024

Cool, I meant the commit it was branched from (d5e8fc9), but master should work if you merge the latest changes into the branch (just want to make sure the single commit is the only difference).

@jkatz (Contributor)

jkatz commented Oct 11, 2024

@ankane Understood. I'm working on this test now.
