Giter VIP home page Giter VIP logo

Comments (9)

ergo70 avatar ergo70 commented on June 2, 2024 3

"This behavior is similar to the fast_update parameters of the GIN index."

Not quite. It is correct that the fastupdate mechanism uses a list of pending changes:

  1. They are not applied asynchronously, except by auto VACUUM. A GIN update hitting gin_pending_list_limit, or triggered by gin_clean_pending_list() will run synchronously
  2. It is configurable via pending_list_limit and fastupdate for controlling latency

https://www.postgresql.org/docs/current/gin-implementation.html

While this might sound like nitpicking, GIN behavior is documented and user controllable, which is important if you run PostgreSQL in production.

But what concerns me more is this:

"But there is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be โ€œvisibleโ€ to the query's MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently."

https://www.postgresql.org/docs/current/indexes-index-only-scans.html

Does your mechanism properly consult the visibility map when updating the index asynchronously? Otherwise this would introduce dirty reads, which PostgreSQL does not support: https://www.postgresql.org/docs/current/transaction-iso.html

If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY, which is well established since v8.2: https://www.postgresql.org/docs/16/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY, and integrates well with the rest of PostgreSQL's machinery.

One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised. That's why I'm asking all those questions.

best regards

Ernst-Georg

from pgvecto.rs.

ergo70 avatar ergo70 commented on June 2, 2024 1

Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)

Does this mutability guarentee to never cross transaction boundaries, i.e. can any individual transaction rely on the current state of the INDEX it saw when it was started? Otherwise, one transaction could change the content of the INDEX, virtually pulling the carpet out from under the feet of tuple versions still visible in another running transaction, e.g. by DELETE.

But anyway, judging from the documentation and the code, I should check if idx_status is NORMAL, and idx_indexing is true in pg_vector_index_stat? (The code points to cur.execute("SELECT idx_indexing FROM pg_vector_index_info;") but pg_vector_index_info does not exist. I assume this is pg_vector_index_stat now).

postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
 idx_status | idx_indexing
------------+--------------
 NORMAL     | t
(1 row)

And...

postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=12289.23..12291.11 rows=100 width=8) (actual time=567.031..570.355 rows=100 loops=1)
   InitPlan 1 (returns $1)
     ->  Gather  (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.154..19.708 rows=1 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           ->  Parallel Seq Scan on items items_1  (cost=0.00..11289.12 rows=1 width=18) (actual time=11.191..17.660 rows=0 loops=3)
                 Filter: (id = 1)
                 Rows Removed by Filter: 325000
   ->  Index Scan using items_embedding_idx on items  (cost=0.00..18398.70 rows=975000 width=8) (actual time=567.029..570.282 rows=100 loops=1)
         Order By: (embedding <=> $1)
 Planning Time: 0.078 ms
 Execution Time: 570.502 ms
(12 rows)

No change to my previous result.

best regards

Ernst-Georg

from pgvecto.rs.

usamoi avatar usamoi commented on June 2, 2024

pgvecto.rs constructs the index asynchronously. When you insert new rows into the table, they will first be placed in an append-only file. The background thread will periodically merge the newly inserted row to the existing index. When a user performs any search prior to the merge process, it scans the append-only file to ensure accuracy and consistency.

You need to check the view pg_vector_index_stat to monitor the progress of indexing.

https://docs.pgvecto.rs/usage/monitoring.html

from pgvecto.rs.

VoVAllen avatar VoVAllen commented on June 2, 2024

Here's the code we used in benchmark to check the build status: https://github.com/silver-ymz/vector-db-benchmark/blob/30164378f2c8bb06d30d89de3bc96c836da1b2fd/engine/clients/pgvector/upload.py#L91-L95

from pgvecto.rs.

gaocegege avatar gaocegege commented on June 2, 2024
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
 idx_status | idx_indexing
------------+--------------
- NORMAL     | t
+ NORMAL     | f
(1 row)

It should be idx_indexing=false, indicating that the indexing is complete.

from pgvecto.rs.

gaocegege avatar gaocegege commented on June 2, 2024

Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)

I appreciate you bringing up this discussion. It is a tradeoff between availability and indexing time. In the case of pgvector, the indexing time is quite long (6 hours), which can result in blocking insertions and new queries. To mitigate this, the decision was made to build the index asynchronously, reducing the overall unavailable time.

The pgvecto.rs index is built asynchronously by background threads, allowing non-blocking inserts and always ready for new queries.

from pgvecto.rs.

VoVAllen avatar VoVAllen commented on June 2, 2024

This behavior is similar to the fast_update parameters of the GIN index.

from pgvecto.rs.

usamoi avatar usamoi commented on June 2, 2024

Does your mechanism properly consult the visibility map when updating the index asynchronously?

The visibility map is consulted in queries and not handled by indexes.

If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY

pgvecto.rs supports it.

One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised.

I agree with you.

from pgvecto.rs.

ergo70 avatar ergo70 commented on June 2, 2024
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
 idx_status | idx_indexing
------------+--------------
 NORMAL     | f
(1 row)

postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
                                                                  QUERY PLAN

------------------------------------------------------------------------------------------------------------
----------------------------------
 Limit  (cost=12289.23..12291.11 rows=100 width=8) (actual time=14.484..16.928 rows=100 loops=1)
   InitPlan 1 (returns $1)
     ->  Gather  (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.132..13.139 rows=1 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           ->  Parallel Seq Scan on items items_1  (cost=0.00..11289.12 rows=1 width=18) (actual time=7.382.
.11.660 rows=0 loops=3)
                 Filter: (id = 1)
                 Rows Removed by Filter: 325000
   ->  Index Scan using items_embedding_idx on items  (cost=0.00..18398.70 rows=975000 width=8) (actual time
=14.482..16.881 rows=100 loops=1)
         Order By: (embedding <=> $1)
 Planning Time: 0.074 ms
 Execution Time: 16.987 ms
(12 rows)

This looks better now. ;-)

from pgvecto.rs.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.