"This behavior is similar to the fastupdate parameter of the GIN index."
Not quite. It is correct that the fastupdate mechanism uses a list of pending changes, but:
- They are not applied asynchronously, except by autovacuum. A GIN update that hits gin_pending_list_limit, or a call to gin_clean_pending_list(), flushes the pending list synchronously.
- The behavior is controllable via the gin_pending_list_limit and fastupdate parameters, which let you trade insert latency against lookup speed.
https://www.postgresql.org/docs/current/gin-implementation.html
While this might sound like nitpicking, GIN behavior is documented and user controllable, which is important if you run PostgreSQL in production.
But what concerns me more is this:
"But there is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to the query's MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently."
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
Does your mechanism properly consult the visibility map when updating the index asynchronously? Otherwise this would introduce dirty reads, which PostgreSQL does not support: https://www.postgresql.org/docs/current/transaction-iso.html
If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY, which is well established since v8.2: https://www.postgresql.org/docs/16/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY, and integrates well with the rest of PostgreSQL's machinery.
One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised. That's why I'm asking all those questions.
best regards
Ernst-Georg
from pgvecto.rs.
Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)
Is this mutability guaranteed to never cross transaction boundaries, i.e. can any individual transaction rely on the state of the INDEX it saw when it started? Otherwise, one transaction could change the content of the INDEX, virtually pulling the carpet out from under tuple versions still visible to another running transaction, e.g. via DELETE.
But anyway, judging from the documentation and the code, I should check whether idx_status is NORMAL and idx_indexing is true in pg_vector_index_stat? (The code points to cur.execute("SELECT idx_indexing FROM pg_vector_index_info;") but pg_vector_index_info does not exist. I assume this is pg_vector_index_stat now.)
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
NORMAL | t
(1 row)
And...
postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12289.23..12291.11 rows=100 width=8) (actual time=567.031..570.355 rows=100 loops=1)
InitPlan 1 (returns $1)
-> Gather (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.154..19.708 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on items items_1 (cost=0.00..11289.12 rows=1 width=18) (actual time=11.191..17.660 rows=0 loops=3)
Filter: (id = 1)
Rows Removed by Filter: 325000
-> Index Scan using items_embedding_idx on items (cost=0.00..18398.70 rows=975000 width=8) (actual time=567.029..570.282 rows=100 loops=1)
Order By: (embedding <=> $1)
Planning Time: 0.078 ms
Execution Time: 570.502 ms
(12 rows)
No change to my previous result.
best regards
Ernst-Georg
from pgvecto.rs.
pgvecto.rs constructs the index asynchronously. When you insert new rows into the table, they are first placed in an append-only file. A background thread periodically merges the newly inserted rows into the existing index. When a user performs a search before the merge has run, the append-only file is scanned as well to ensure accuracy and consistency.
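To make the described scheme concrete, here is a conceptual Python sketch of the append-only pending area (illustrative only, not the actual Rust implementation of pgvecto.rs): inserts land in a pending log, searches scan both the index and the log, and a background merge drains the log into the index.

```python
import threading

class AsyncIndex:
    """Conceptual model: an index plus an append-only pending area."""

    def __init__(self):
        self._indexed = []   # rows already merged into the "real" index
        self._pending = []   # append-only area for fresh inserts
        self._lock = threading.Lock()

    def insert(self, row):
        # Non-blocking from the index's point of view: just append.
        with self._lock:
            self._pending.append(row)

    def search(self, predicate):
        # Queries see indexed rows AND the pending area, so results
        # stay accurate even before the background merge has run.
        with self._lock:
            return [r for r in self._indexed + self._pending if predicate(r)]

    def merge(self):
        # Periodically invoked by a background thread: drain the
        # pending area into the index proper.
        with self._lock:
            self._indexed.extend(self._pending)
            self._pending.clear()
```

In the real extension the merge is driven by background worker threads; here merge() would have to be called from a timer or worker loop.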
You need to check the view pg_vector_index_stat to monitor the progress of indexing.
https://docs.pgvecto.rs/usage/monitoring.html
from pgvecto.rs.
Here's the code we used in benchmark to check the build status: https://github.com/silver-ymz/vector-db-benchmark/blob/30164378f2c8bb06d30d89de3bc96c836da1b2fd/engine/clients/pgvector/upload.py#L91-L95
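A readiness check along those lines can be sketched as follows (a minimal sketch: the fetch_status callable is a hypothetical stand-in for a real psycopg cursor wrapping `SELECT idx_status, idx_indexing FROM pg_vector_index_stat`, with view and column names taken from the session shown above):

```python
import time

def wait_for_index(fetch_status, timeout=3600.0, interval=5.0):
    """Poll until the index reports idx_indexing = false.

    fetch_status: a callable returning (idx_status, idx_indexing),
    e.g. wrapping the SELECT on pg_vector_index_stat shown above.
    Returns True once indexing is complete, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, indexing = fetch_status()
        if status == "NORMAL" and not indexing:
            return True  # background merge finished; index fully built
        time.sleep(interval)
    return False
```

With psycopg, fetch_status could simply execute the SELECT and return cur.fetchone().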
from pgvecto.rs.
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
- NORMAL | t
+ NORMAL | f
(1 row)
It should be idx_indexing = false, indicating that the indexing is complete.
from pgvecto.rs.
Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)
I appreciate you bringing up this discussion. It is a tradeoff between availability and indexing time. In the case of pgvector, the indexing time is quite long (6 hours), which can result in blocking insertions and new queries. To mitigate this, the decision was made to build the index asynchronously, reducing the overall unavailable time.
The pgvecto.rs index is built asynchronously by background threads, allowing non-blocking inserts while remaining ready to serve new queries.
from pgvecto.rs.
This behavior is similar to the fastupdate parameter of the GIN index.
from pgvecto.rs.
Does your mechanism properly consult the visibility map when updating the index asynchronously?
The visibility map is consulted at query time; it is not handled by the index itself.
If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY
pgvecto.rs supports it.
One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised.
I agree with you.
from pgvecto.rs.
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
NORMAL | f
(1 row)
postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12289.23..12291.11 rows=100 width=8) (actual time=14.484..16.928 rows=100 loops=1)
InitPlan 1 (returns $1)
-> Gather (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.132..13.139 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on items items_1 (cost=0.00..11289.12 rows=1 width=18) (actual time=7.382..11.660 rows=0 loops=3)
Filter: (id = 1)
Rows Removed by Filter: 325000
-> Index Scan using items_embedding_idx on items (cost=0.00..18398.70 rows=975000 width=8) (actual time=14.482..16.881 rows=100 loops=1)
Order By: (embedding <=> $1)
Planning Time: 0.074 ms
Execution Time: 16.987 ms
(12 rows)
This looks better now. ;-)
from pgvecto.rs.