Skip to content

Replace Lists with arrays in SelectNeighbors()#388

Closed
hlinnaka wants to merge 1 commit intopgvector:masterfrom
hlinnaka:replace-list-with-array
Closed

Replace Lists with arrays in SelectNeighbors()#388
hlinnaka wants to merge 1 commit intopgvector:masterfrom
hlinnaka:replace-list-with-array

Conversation

@hlinnaka
Copy link
Contributor

For speed. A List has a fair amount of overhead from the extra palloc per element, as well as pointer indirection when walking the lists. This also allows us to use a faster inlined sort function.

I was expecting an up to 5%-10% speedup from this, based on profiling with perf. The sort was taking about 5% of CPU time, and lappend() and palloc() are 1%-2% each, so I figured that would be the maximum this could yield. But I'm actually seeing much more gain, up to 20%. So this is surprisingly effective - to the point that I wonder if I screwed up something. But it passes tests, and I haven't been able to spot a bug.

@jkatz
Copy link
Contributor

jkatz commented Dec 23, 2023

Cool! I am going to kick off a benchmark on this vs. HEAD

@jkatz
Copy link
Contributor

jkatz commented Dec 27, 2023

@hlinnaka Just a quick update -- I ran ANN Benchmarks on a fairly large machine comparing this to HEAD; I'm working on running some general indexing tests. Eyeballing the ANN Benchmark results, I saw a marginal gain on #388 that was more inline with your initial expectation. I'll continue running the tests and published a more detailed report.

@jkatz
Copy link
Contributor

jkatz commented Jan 4, 2024

@hlinnaka What kind of tests were you running and what were the different parameters (vector size, build parameters, search parameters)? I'm not seeing the same performance uplift as you, but rather in the no-more-than 5% range.

For speed. A List has a fair amount of overhead from the extra palloc
per element, as well as pointer indirection when walking the
lists. This also allows us to use a faster inlined sort function.
@hlinnaka hlinnaka force-pushed the replace-list-with-array branch from beb1c57 to ef870d2 Compare January 5, 2024 14:29
@hlinnaka
Copy link
Contributor Author

hlinnaka commented Jan 5, 2024

@hlinnaka What kind of tests were you running and what were the different parameters (vector size, build parameters, search parameters)? I'm not seeing the same performance uplift as you, but rather in the no-more-than 5% range.

I just rebased this, and repeated the test I did earlier. This is what I did to initialize the test:

CREATE TABLE items_small (id int, embedding vector(100));
ALTER TABLE items_small ALTER COLUMN embedding SET STORAGE PLAIN;
-- `items.data` is the `glove-100-angular` dataset.
\COPY items_small FROM PROGRAM 'head -n50000 /home/heikki/git-sandbox/ann-benchmarks/items.data';

create index on items_small using hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=100);

On 'master':

postgres=# \timing on
Timing is on.
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27857.182 ms (00:27.857)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27303.268 ms (00:27.303)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27510.155 ms (00:27.510)

With this PR:

postgres=# \timing on
Timing is on.
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 20230.559 ms (00:20.231)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 19757.048 ms (00:19.757)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 19648.914 ms (00:19.649)

@jkatz
Copy link
Contributor

jkatz commented Jan 5, 2024

@hlinnaka Ack. I had tried this test against the previous version and only saw a marginal 1.33% gain. I kept as close to the defaults as possible.

Setup

  • Instance: r7gd.16xlarge
  • Storage: NVMe (to isolate any network blips)
  • PostgreSQL 16.1
  • Vector size: 1536-dim
  • Rows: 10,000,000
  • Storage: EXTENDED
  • Distance: Cosine (<=>)
  • Test: Concurrent inserts - 50
  • Method:
INSERT INTO table SELECT generate_random_normalized_vector(VECTOR_DIM)
  • HNSW build parameters
    • m = 16
    • ef_construction = 64

(~0.5ms to generate vector)

Results

master@9a782d29f

scaling factor: 1
query mode: simple
number of clients: 50
number of threads: 50
maximum number of tries: 1
number of transactions per client: 200000
number of transactions actually processed: 10000000/10000000
number of failed transactions: 0 (0.000%)
latency average = 57.984 ms
latency stddev = 20.294 ms
initial connection time = 22.598 ms
tps = 861.786527 (without initial connection time)

PR@beb1c57073 (prior to rebase)

number of clients: 50                                                                                                                                                                                                                                                         
number of threads: 50                                                                                                                                                                                                                                                         
maximum number of tries: 1                                                                                                                                                                                                                                                    
number of transactions per client: 200000                                                                                                                                                                                                                                     
number of transactions actually processed: 10000000/10000000                                                                                                                                                                                                                  
number of failed transactions: 0 (0.000%)                                                                                                                                                                                                                                     
latency average = 57.225 ms                                                                                                                                                                                                                                                   
latency stddev = 19.949 ms                                                                                                                                                                                                                                                    
initial connection time = 22.822 ms                                                                                                                                                                                                                                           
tps = 873.285046 (without initial connection time) 

I'll rerun the test with 128-dim and 1536-dim vectors after updating both branches. I'll perform a similar load to yours first to see if I'm seeing the same results.

@ankane
Copy link
Member

ankane commented Jan 25, 2024

Hey @hlinnaka, just spent some time on this. I think it causes a bug in closer caching (since e points to a copy of the neighbor now), which is where the performance gain comes from (as far as I can tell). I also tried a version that makes c an array (to avoid creating a list in HnswUpdateConnection), but didn't see any performance difference.

@hlinnaka
Copy link
Contributor Author

Hey @hlinnaka, just spent some time on this. I think it causes a bug in closer caching (since e points to a copy of the neighbor now), which is where the performance gain comes from (as far as I can tell). I also tried a version that makes c an array (to avoid creating a list in HnswUpdateConnection), but didn't see any performance difference.

A-ha, thanks! I'll look into this again.

The interface of SelectNeighbors is super confusing. It modifies the inputs, namely the 'closerSet' flag on the neighbor array and the 'closer' flags on the original HnswCandidate structs. But it it doesn't actually modify the neighbors array itself, to put it in the right order. It requires tight co-operation from the caller.

I'll think of ways to clear that up. Hopefully to avoid mistakes like this again.

@ankane ankane closed this Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants