Replace Lists with arrays in SelectNeighbors() by hlinnaka · Pull Request #388 · pgvector/pgvector

hlinnaka · 2023-12-22T21:47:46Z

For speed. A List has a fair amount of overhead from the extra palloc per element, as well as pointer indirection when walking the lists. This also allows us to use a faster inlined sort function.

I was expecting an up to 5%-10% speedup from this, based on profiling with perf. The sort was taking about 5% of CPU time, and lappend() and palloc() are 1%-2% each, so I figured that would be the maximum this could yield. But I'm actually seeing much more gain, up to 20%. So this is surprisingly effective - to the point that I wonder if I screwed up something. But it passes tests, and I haven't been able to spot a bug.

jkatz · 2023-12-23T02:45:24Z

Cool! I am going to kick off a benchmark on this vs. HEAD

jkatz · 2023-12-27T15:41:43Z

@hlinnaka Just a quick update -- I ran ANN Benchmarks on a fairly large machine comparing this to HEAD; I'm working on running some general indexing tests. Eyeballing the ANN Benchmark results, I saw a marginal gain on #388 that was more inline with your initial expectation. I'll continue running the tests and published a more detailed report.

jkatz · 2024-01-04T21:54:08Z

@hlinnaka What kind of tests were you running and what were the different parameters (vector size, build parameters, search parameters)? I'm not seeing the same performance uplift as you, but rather in the no-more-than 5% range.

For speed. A List has a fair amount of overhead from the extra palloc per element, as well as pointer indirection when walking the lists. This also allows us to use a faster inlined sort function.

hlinnaka · 2024-01-05T14:29:15Z

@hlinnaka What kind of tests were you running and what were the different parameters (vector size, build parameters, search parameters)? I'm not seeing the same performance uplift as you, but rather in the no-more-than 5% range.

I just rebased this, and repeated the test I did earlier. This is what I did to initialize the test:

CREATE TABLE items_small (id int, embedding vector(100));
ALTER TABLE items_small ALTER COLUMN embedding SET STORAGE PLAIN;
-- `items.data` is the `glove-100-angular` dataset.
\COPY items_small FROM PROGRAM 'head -n50000 /home/heikki/git-sandbox/ann-benchmarks/items.data';

create index on items_small using hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=100);

On 'master':

postgres=# \timing on
Timing is on.
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27857.182 ms (00:27.857)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27303.268 ms (00:27.303)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 27510.155 ms (00:27.510)

With this PR:

postgres=# \timing on
Timing is on.
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 20230.559 ms (00:20.231)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 19757.048 ms (00:19.757)
postgres=# reindex index items_small_embedding_idx ;
REINDEX
Time: 19648.914 ms (00:19.649)

jkatz · 2024-01-05T18:28:12Z

@hlinnaka Ack. I had tried this test against the previous version and only saw a marginal 1.33% gain. I kept as close to the defaults as possible.

Setup

Instance: r7gd.16xlarge
Storage: NVMe (to isolate any network blips)
PostgreSQL 16.1
Vector size: 1536-dim
Rows: 10,000,000
Storage: EXTENDED
Distance: Cosine (<=>)
Test: Concurrent inserts - 50
Method:

INSERT INTO table SELECT generate_random_normalized_vector(VECTOR_DIM)

HNSW build parameters
- m = 16
- ef_construction = 64

(~0.5ms to generate vector)

Results

master@9a782d29f

scaling factor: 1
query mode: simple
number of clients: 50
number of threads: 50
maximum number of tries: 1
number of transactions per client: 200000
number of transactions actually processed: 10000000/10000000
number of failed transactions: 0 (0.000%)
latency average = 57.984 ms
latency stddev = 20.294 ms
initial connection time = 22.598 ms
tps = 861.786527 (without initial connection time)

PR@beb1c57073 (prior to rebase)

number of clients: 50                                                                                                                                                                                                                                                         
number of threads: 50                                                                                                                                                                                                                                                         
maximum number of tries: 1                                                                                                                                                                                                                                                    
number of transactions per client: 200000                                                                                                                                                                                                                                     
number of transactions actually processed: 10000000/10000000                                                                                                                                                                                                                  
number of failed transactions: 0 (0.000%)                                                                                                                                                                                                                                     
latency average = 57.225 ms                                                                                                                                                                                                                                                   
latency stddev = 19.949 ms                                                                                                                                                                                                                                                    
initial connection time = 22.822 ms                                                                                                                                                                                                                                           
tps = 873.285046 (without initial connection time)

I'll rerun the test with 128-dim and 1536-dim vectors after updating both branches. I'll perform a similar load to yours first to see if I'm seeing the same results.

ankane · 2024-01-25T02:22:38Z

Hey @hlinnaka, just spent some time on this. I think it causes a bug in closer caching (since e points to a copy of the neighbor now), which is where the performance gain comes from (as far as I can tell). I also tried a version that makes c an array (to avoid creating a list in HnswUpdateConnection), but didn't see any performance difference.

hlinnaka · 2024-01-26T19:26:30Z

Hey @hlinnaka, just spent some time on this. I think it causes a bug in closer caching (since e points to a copy of the neighbor now), which is where the performance gain comes from (as far as I can tell). I also tried a version that makes c an array (to avoid creating a list in HnswUpdateConnection), but didn't see any performance difference.

A-ha, thanks! I'll look into this again.

The interface of SelectNeighbors is super confusing. It modifies the inputs, namely the 'closerSet' flag on the neighbor array and the 'closer' flags on the original HnswCandidate structs. But it it doesn't actually modify the neighbors array itself, to put it in the right order. It requires tight co-operation from the caller.

I'll think of ways to clear that up. Hopefully to avoid mistakes like this again.

Replace Lists with arrays in SelectNeighbors()

ef870d2

For speed. A List has a fair amount of overhead from the extra palloc per element, as well as pointer indirection when walking the lists. This also allows us to use a faster inlined sort function.

hlinnaka force-pushed the replace-list-with-array branch from beb1c57 to ef870d2 Compare January 5, 2024 14:29

ankane mentioned this pull request Jan 8, 2024

Optimize visiting neighbors #386

Open

ankane closed this Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace Lists with arrays in SelectNeighbors()#388

Replace Lists with arrays in SelectNeighbors()#388
hlinnaka wants to merge 1 commit intopgvector:masterfrom
hlinnaka:replace-list-with-array

hlinnaka commented Dec 22, 2023

Uh oh!

jkatz commented Dec 23, 2023

Uh oh!

jkatz commented Dec 27, 2023

Uh oh!

jkatz commented Jan 4, 2024

Uh oh!

hlinnaka commented Jan 5, 2024

Uh oh!

jkatz commented Jan 5, 2024 •

edited

Loading

Uh oh!

ankane commented Jan 25, 2024 •

edited

Loading

Uh oh!

hlinnaka commented Jan 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Conversation

hlinnaka commented Dec 22, 2023

Uh oh!

jkatz commented Dec 23, 2023

Uh oh!

jkatz commented Dec 27, 2023

Uh oh!

jkatz commented Jan 4, 2024

Uh oh!

hlinnaka commented Jan 5, 2024

Uh oh!

jkatz commented Jan 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Setup

Results

Uh oh!

ankane commented Jan 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hlinnaka commented Jan 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

jkatz commented Jan 5, 2024 •

edited

Loading

ankane commented Jan 25, 2024 •

edited

Loading