"This behavior is similar to the fastupdate parameter of the GIN index."
Not quite. It is correct that the fastupdate mechanism uses a list of pending changes, but:
- They are not applied asynchronously, except by autovacuum. A GIN update that hits gin_pending_list_limit, or a call to gin_clean_pending_list(), flushes the pending list synchronously.
- The behavior is controllable via the gin_pending_list_limit and fastupdate parameters, which let you trade insert latency against lookup speed.
https://www.postgresql.org/docs/current/gin-implementation.html
While this might sound like nitpicking, GIN behavior is documented and user controllable, which is important if you run PostgreSQL in production.
But what concerns me more is this:
"But there is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to the query's MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently."
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
Does your mechanism properly consult the visibility map when updating the index asynchronously? Otherwise this would introduce dirty reads, which PostgreSQL does not support: https://www.postgresql.org/docs/current/transaction-iso.html
If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY, which is well established since v8.2: https://www.postgresql.org/docs/16/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY, and integrates well with the rest of PostgreSQL's machinery.
One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised. That's why I'm asking all those questions.
best regards
Ernst-Georg
from pgvecto.rs.
Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)
Is this mutability guaranteed to never cross transaction boundaries, i.e. can any individual transaction rely on the state of the INDEX it saw when it started? Otherwise, one transaction could change the content of the INDEX, virtually pulling the carpet out from under tuple versions still visible to another running transaction, e.g. via DELETE.
But anyway, judging from the documentation and the code, I should check whether idx_status is NORMAL and idx_indexing is true in pg_vector_index_stat? (The code points to cur.execute("SELECT idx_indexing FROM pg_vector_index_info;") but pg_vector_index_info does not exist. I assume this is pg_vector_index_stat now.)
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
NORMAL | t
(1 row)
And...
postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12289.23..12291.11 rows=100 width=8) (actual time=567.031..570.355 rows=100 loops=1)
InitPlan 1 (returns $1)
-> Gather (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.154..19.708 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on items items_1 (cost=0.00..11289.12 rows=1 width=18) (actual time=11.191..17.660 rows=0 loops=3)
Filter: (id = 1)
Rows Removed by Filter: 325000
-> Index Scan using items_embedding_idx on items (cost=0.00..18398.70 rows=975000 width=8) (actual time=567.029..570.282 rows=100 loops=1)
Order By: (embedding <=> $1)
Planning Time: 0.078 ms
Execution Time: 570.502 ms
(12 rows)
No change to my previous result.
best regards
Ernst-Georg
from pgvecto.rs.
pgvecto.rs constructs the index asynchronously. When you insert new rows into the table, they are first placed in an append-only file. A background thread periodically merges the newly inserted rows into the existing index. When a user performs a search before the merge has run, the append-only file is scanned as well to ensure accuracy and consistency.
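To make the described scheme concrete, here is a conceptual Python sketch of the append-only pending area (illustrative only, not the actual Rust implementation of pgvecto.rs): inserts land in a pending log, searches scan both the index and the log, and a background merge drains the log into the index.

```python
import threading

class AsyncIndex:
    """Conceptual model: an index plus an append-only pending area."""

    def __init__(self):
        self._indexed = []   # rows already merged into the "real" index
        self._pending = []   # append-only area for fresh inserts
        self._lock = threading.Lock()

    def insert(self, row):
        # Non-blocking from the index's point of view: just append.
        with self._lock:
            self._pending.append(row)

    def search(self, predicate):
        # Queries see indexed rows AND the pending area, so results
        # stay accurate even before the background merge has run.
        with self._lock:
            return [r for r in self._indexed + self._pending if predicate(r)]

    def merge(self):
        # Periodically invoked by a background thread: drain the
        # pending area into the index proper.
        with self._lock:
            self._indexed.extend(self._pending)
            self._pending.clear()
```

In the real extension the merge is driven by background worker threads; here merge() would have to be called from a timer or worker loop.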
You need to check the view pg_vector_index_stat to monitor the progress of indexing.
https://docs.pgvecto.rs/usage/monitoring.html
from pgvecto.rs.
Here's the code we used in benchmark to check the build status: https://github.com/silver-ymz/vector-db-benchmark/blob/30164378f2c8bb06d30d89de3bc96c836da1b2fd/engine/clients/pgvector/upload.py#L91-L95
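A readiness check along those lines can be sketched as follows (a minimal sketch: the fetch_status callable is a hypothetical stand-in for a real psycopg cursor wrapping `SELECT idx_status, idx_indexing FROM pg_vector_index_stat`, with view and column names taken from the session shown above):

```python
import time

def wait_for_index(fetch_status, timeout=3600.0, interval=5.0):
    """Poll until the index reports idx_indexing = false.

    fetch_status: a callable returning (idx_status, idx_indexing),
    e.g. wrapping the SELECT on pg_vector_index_stat shown above.
    Returns True once indexing is complete, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, indexing = fetch_status()
        if status == "NORMAL" and not indexing:
            return True  # background merge finished; index fully built
        time.sleep(interval)
    return False
```

With psycopg, fetch_status could simply execute the SELECT and return cur.fetchone().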
from pgvecto.rs.
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
- NORMAL | t
+ NORMAL | f
(1 row)
It should be idx_indexing = false, indicating that the indexing is complete.
from pgvecto.rs.
Thank you for the info. IMHO, this is pretty unusual behavior for PostgreSQL, having an INDEX that still can mutate in the background after CREATE INDEX has reported completion. I have never seen PostgreSQL or any extension behave like this, up to now. But obviously I have not seen them all. ;-)
I appreciate you bringing up this discussion. It is a tradeoff between availability and indexing time. In the case of pgvector, the indexing time is quite long (6 hours), which can result in blocking insertions and new queries. To mitigate this, the decision was made to build the index asynchronously, reducing the overall unavailable time.
The pgvecto.rs index is built asynchronously by background threads, allowing non-blocking inserts while remaining ready to serve new queries.
from pgvecto.rs.
This behavior is similar to the fastupdate parameter of the GIN index.
from pgvecto.rs.
Does your mechanism properly consult the visibility map when updating the index asynchronously?
The visibility map is consulted at query time; it is not handled by the index itself.
If the indexing time is quite long, PostgreSQL has CREATE INDEX CONCURRENTLY
pgvecto.rs supports it.
One key promise and strength of PostgreSQL is guaranteed data consistency and IMHO this must not be compromised.
I agree with you.
from pgvecto.rs.
postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
idx_status | idx_indexing
------------+--------------
NORMAL | f
(1 row)
postgres=# explain analyze SELECT id, embedding <=> (select embedding from items where id=1) AS _score FROM items ORDER BY _score LIMIT 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12289.23..12291.11 rows=100 width=8) (actual time=14.484..16.928 rows=100 loops=1)
InitPlan 1 (returns $1)
-> Gather (cost=1000.00..12289.23 rows=1 width=18) (actual time=0.132..13.139 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on items items_1 (cost=0.00..11289.12 rows=1 width=18) (actual time=7.382..11.660 rows=0 loops=3)
Filter: (id = 1)
Rows Removed by Filter: 325000
-> Index Scan using items_embedding_idx on items (cost=0.00..18398.70 rows=975000 width=8) (actual time=14.482..16.881 rows=100 loops=1)
Order By: (embedding <=> $1)
Planning Time: 0.074 ms
Execution Time: 16.987 ms
(12 rows)
This looks better now. ;-)
from pgvecto.rs.