Giter VIP home page Giter VIP logo

pg_embedding's Introduction

pg_embedding


IMPORTANT NOTICE:

As of Sept 29, 2023, Neon is no longer committing to pg_embedding.

Support will remain in place for existing users of the extension, but we strongly encourage migrating to pgvector.

For migration instructions, see Migrate from pg_embedding to pgvector, in the Neon documentation.


The pg_embedding extension enables the using the Hierarchical Navigable Small World (HNSW) algorithm for vector similarity search in PostgreSQL.

This extension is based on ivf-hnsw implementation of HNSW the code for the current state-of-the-art billion-scale nearest neighbor search system[1].

Using the pg_embedding extension

This section describes how to use the pg_embedding extension with a simple example demonstrating the required statements, syntax, and options.

For information about migrating from pgvector to pg_embedding, see Migrate from pgvector to pg_embedding, in the Neon documentation.

Usage summary

The statements in this usage summary are described in further detail in the following sections.

CREATE EXTENSION embedding;
CREATE TABLE documents(id integer PRIMARY KEY, embedding real[]);
INSERT INTO documents(id, embedding) VALUES (1, '{0,1,2}'), (2, '{1,2,3}'),  (3, '{1,1,1}');
SELECT id FROM documents ORDER BY embedding <-> ARRAY[3,3,3] LIMIT 1;

Enable the extension

To enable the pg_embedding extension, run the following CREATE EXTENSION statement:

CREATE EXTENSION embedding;

Create a table for your vector data

To store your vector data, create a table similar to the following:

CREATE TABLE documents(id INTEGER, embedding REAL[]);

This statement generates a table named documents with an embedding column for storing vector data. Your table and vector column names may differ.

Insert data

To insert vector data, use an INSERT statement similar to the following:

INSERT INTO documents(id, embedding) VALUES (1, '{0,1,2}'), (2, '{1,2,3}'),  (3, '{1,1,1}');

Query

The pg_embedding extension supports Euclidean (L2), Cosine, and Manhattan distance metrics.

Euclidean (L2) distance:

SELECT id FROM documents ORDER BY embedding <-> array[3,3,3] LIMIT 1;

Cosine distance:

SELECT id FROM documents ORDER BY embedding <=> array[3,3,3] LIMIT 1;

Manhattan distance:

SELECT id FROM documents ORDER BY embedding <~> array[3,3,3] LIMIT 1;

where:

  • SELECT id FROM documents selects the id field from all records in the documents table.
  • ORDER BY sorts the selected records in ascending order based on the calculated distances. In other words, records with values closer to the [1.1, 2.2, 3.3] query vector according to the distance metric will be returned first.
  • <->, <=>, and <~> operators define the distance metric, which calculates the distance between the query vector and each row of the dataset.
  • LIMIT 1 limits the result set to one record after sorting.

In summary, the query retrieves the ID of the record from the documents table whose value is closest to the [3,3,3] query vector according to the specified distance metric.

Create an HNSW index

To optimize search behavior, you can add an HNSW index. To create the HNSW index on your vector column, use a CREATE INDEX statement as shown in the following examples. The pg_embedding extension supports indexes for use with Euclidean, Cosine, and Manhattan distance metrics.

Euclidean (L2) distance index:

CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=3, efconstruction=5, efsearch=5);
SET enable_seqscan = off;
SELECT id FROM documents ORDER BY embedding <-> array[3,3,3] LIMIT 1;

Cosine distance index:

CREATE INDEX ON documents USING hnsw(embedding ann_cos_ops) WITH (dims=3, m=3, efconstruction=5, efsearch=5);
SET enable_seqscan = off;
SELECT id FROM documents ORDER BY embedding <=> array[3,3,3] LIMIT 1;

Manhattan distance index:

CREATE INDEX ON documents USING hnsw(embedding ann_manhattan_ops) WITH (dims=3, m=3, efconstruction=5, efsearch=5);
SET enable_seqscan = off;
SELECT id FROM documents ORDER BY embedding <~> array[3,3,3] LIMIT 1;

Tuning the HNSW algorithm

The following options allow you to tune the HNSW algorithm when creating an index:

  • dims: Defines the number of dimensions in your vector data. This is a required parameter.
  • m: Defines the maximum number of links or "edges" created for each node during graph construction. A higher value increases accuracy (recall) but also increases the size of the index in memory and index construction time.
  • efconstruction: Influences the trade-off between index quality and construction speed. A high efconstruction value creates a higher quality graph, enabling more accurate search results, but a higher value also means that index construction takes longer.
  • efsearch: Influences the trade-off between query accuracy (recall) and speed. A higher efsearch value increases accuracy at the cost of speed. This value should be equal to or larger than k, which is the number of nearest neighbors you want your search to return (defined by the LIMIT clause in your SELECT query).

In summary, to prioritize search speed over accuracy, use lower values for m and efsearch. Conversely, to prioritize accuracy over search speed, use a higher value for m and efsearch. A higher efconstruction value enables more accurate search results at the cost of index build time, which is also affected by the size of your dataset.

How HNSW search works

HNSW is a graph-based approach to indexing multi-dimensional data. It constructs a multi-layered graph, where each layer is a subset of the previous one. During a search, the algorithm navigates through the graph from the top layer to the bottom to quickly find the nearest neighbor. An HNSW graph is known for its superior performance in terms of speed and accuracy.

The search process begins at the topmost layer of the HNSW graph. From the starting node, the algorithm navigates to the nearest neighbor in the same layer. The algorithm repeats this step until it can no longer find neighbors more similar to the query vector.

Using the found node as an entry point, the algorithm moves down to the next layer in the graph and repeats the process of navigating to the nearest neighbor. The process of navigating to the nearest neighbor and moving down a layer is repeated until the algorithm reaches the bottom layer.

In the bottom layer, the algorithm continues navigating to the nearest neighbor until it can't find any nodes that are more similar to the query vector. The current node is then returned as the most similar node to the query vector.

The key idea behind HNSW is that by starting the search at the top layer and moving down through each layer, the algorithm can quickly navigate to the area of the graph that contains the node that is most similar to the query vector. This makes the search process much faster than if it had to search through every node in the graph.

References

  • [1] Dmitry Baranchuk, Artem Babenko, Yury Malkov; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 202-216 link

pg_embedding's People

Contributors

bayandin avatar danieltprice avatar hlinnaka avatar jeltef avatar knizhnik avatar save-buffer avatar vadim2404 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

pg_embedding's Issues

Installation issue on M2 MacBook

Steps to reproduce

make

Actual result

(make: arm64-apple-darwin20.0.0-clang: No such file or directory)

Environment

M2 MacBook
XCode 14.2

Problem

It's possible, that as a relatively inexperienced Mac user, I just miss something basic and obvious in build system settings. I would appreciate any advice.

Efficient HNSW + filtering

Is there any plan to implement efficient HNSW based ANN with filtering? Imagine you have a table:

CREATE TABLE document (
  id UUID PRIMARY KEY,
  text TEXT NOT NULL,
  embedding REAL[] NOT NULL,
  category TEXT NOT NULL
)

And then you want to do a query like:

SELECT id, text
FROM document
WHERE category = $1
ORDER BY embedding <-> $2
LIMIT 10;

To do this efficiently is not straightforward and will likely need some work, see e.g.: https://qdrant.tech/articles/filtrable-hnsw/
(In this scenario, you could probably achieve similar effect by partitioning the table -- but that's not possible for more complex filters)

ERROR: unrecognized parameter "maxelements"

Steps to reproduce

  1. CREATE EXTENSION IF NOT EXISTS embedding;
  2. CREATE TABLE test_data(id integer PRIMARY KEY, embedding real[]);
  3. CREATE INDEX ON test_data USING hnsw (embedding) WITH (maxelements = 1000000, dims=1536, m=32);

Expected result

CREATE INDEX
Index created

Actual result

ERROR: unrecognized parameter "maxelements"

Environment

Ubuntu 22.04
Postgres 16

Logs, links

Is 0.1.0 released?

I packaged pg_embedding for nixpkgs, but I don't know if they'll merge it because 0.1.0 is not tagged here, and PostgreSQL extensions might later fail to upgrade if we take an unreleased commit. If 0.1.0 is effectively released, could you please tag it? Thanks very much.

Why another library?

Why did you all make a new library instead of improving the existing with this new indexing method/search algorithm/whatever?

HNSW Index creation fails on Greenplum (PostgreSQL 12.12)

Hello, I'm trying to install pg_embedding extension on Greenplum 7 (PostgreSQL 12.12 MPP), but getting the following errors when running: make PG_CONFIG=$GPHOME/bin/pg_config install (fyi: pgvector works fine with Greenplum)

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -Werror=implicit-fallthrough=3 -Wno-format-truncation -Wno-stringop-truncation -O3 -fargument-noalias-global -fno-omit-frame-pointer -g  -Werror=uninitialized -Werror=implicit-function-declaration -O3 -fPIC -I. -I./ -I/usr/local/greenplum-db-7.0.0-beta.3/include/postgresql/server -I/usr/local/greenplum-db-7.0.0-beta.3/include/postgresql/internal   -D_GNU_SOURCE -I/usr/include/libxml2   -c -o embedding.o embedding.c
embedding.c: In function ‘hnsw_options’:
embedding.c:401:19: error: implicit declaration of function ‘build_reloptions’; did you mean ‘view_reloptions’? [-Werror=implicit-function-declaration]
  return (bytea *) build_reloptions(reloptions, validate,
                   ^~~~~~~~~~~~~~~~
                   view_reloptions
embedding.c:401:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  return (bytea *) build_reloptions(reloptions, validate,
         ^
embedding.c: In function ‘hnsw_handler’:
embedding.c:513:13: error: ‘IndexAmRoutine’ {aka ‘struct IndexAmRoutine’} has no member named ‘amoptsprocnum’; did you mean ‘amoptions’?
  amroutine->amoptsprocnum = 0;
             ^~~~~~~~~~~~~
             amoptions
embedding.c:527:11: error: ‘IndexAmRoutine’ {aka ‘struct IndexAmRoutine’} has no member named ‘amusemaintenanceworkmem’
  amroutine->amusemaintenanceworkmem = false; /* not used during VACUUM */
           ^~
embedding.c:528:11: error: ‘IndexAmRoutine’ {aka ‘struct IndexAmRoutine’} has no member named ‘amparallelvacuumoptions’
  amroutine->amparallelvacuumoptions = VACUUM_OPTION_PARALLEL_BULKDEL;
           ^~
embedding.c:528:39: error: ‘VACUUM_OPTION_PARALLEL_BULKDEL’ undeclared (first use in this function); did you mean ‘CURSOR_OPT_PARALLEL_OK’?
  amroutine->amparallelvacuumoptions = VACUUM_OPTION_PARALLEL_BULKDEL;
                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                       CURSOR_OPT_PARALLEL_OK
embedding.c:528:39: note: each undeclared identifier is reported only once for each function it appears in
embedding.c:534:22: warning: assignment to ‘aminsert_function’ {aka ‘_Bool (*)(struct RelationData *, long unsigned int *, _Bool *, struct ItemPointerData *, struct RelationData *, enum IndexUniqueCheck,  struct IndexInfo *)’} from incompatible pointer type ‘_Bool (*)(struct RelationData *, Datum *, _Bool *, ItemPointerData *, struct RelationData *, IndexUniqueCheck,  _Bool,  IndexInfo *)’ {aka ‘_Bool (*)(struct RelationData *, long unsigned int *, _Bool *, struct ItemPointerData *, struct RelationData *, enum IndexUniqueCheck,  _Bool,  struct IndexInfo *)’} [-Wincompatible-pointer-types]
  amroutine->aminsert = hnsw_insert;
                      ^
embedding.c:543:11: error: ‘IndexAmRoutine’ {aka ‘struct IndexAmRoutine’} has no member named ‘amadjustmembers’
  amroutine->amadjustmembers = NULL;
           ^~
embedding.c: In function ‘l2_distance’:
embedding.c:583: warning: ignoring #pragma clang loop [-Wunknown-pragmas]
  #pragma clang loop vectorize(enable)```

Possibility to return score/distance

Hi! Thank you for this great extension. We are wondering if there is a way to return the score (distance) of each result item similar to how pgvector is doing?

:bug: Installation Issue

Steps to reproduce

make
suod make install

Expected result

Trying to install pg_embedding

Actual result

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -Ofast -fPIC -I. -I./ -I/usr/include/postgresql/15/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o distfunc.o distfunc.c
distfunc.c: In function ‘l2_dist_impl_avx2’:
distfunc.c:20:5: error: ISO C90 forbids array ‘partial_result’ whose size cannot be evaluated [-Werror=vla]
   20 |     float partial_result[elts_per_vector];
      |     ^~~~~
distfunc.c:30:9: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   30 |         __m256 vecX2 = _mm256_loadu_ps(x + i + elts_per_vector);
      |         ^~~~~~
distfunc.c:37:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   37 |     float res1 = partial_result[0] + partial_result[4];
      |     ^~~~~
distfunc.c:45:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   45 |     size_t tail_size = n % elts_per_loop;
      |     ^~~~~~
cc1: some warnings being treated as errors

Environment

Debian 11

It would also be nice to include installation notes in the README.md

Missing features

Lists of features that would be useful

Correctness

  • VACUUM support

Performance

  • Deduplicate vectors (like GIN & btree)
  • Implement IVF-flat, with clusters stored in btrees, ordered by distance from centroid.
    This allows for prefetching & improved prioritization of candidate tuples)
  • Store the index on disk
    The index shouldn't need time to rebuild every time the system restarts, and shouldn't consume memory if there is no-one to use the index.
    I hear that this is WIP.

Features

Segfault on empty table

Steps to reproduce

postgres=# create extension embedding ;
CREATE EXTENSION
postgres=# create table embeddings(id integer primary key, payload real[]);
CREATE TABLE
postgres=# create index on embeddings using hnsw(payload) with (maxelements=1000000, dims=100, m=32);
CREATE INDEX
postgres=# select id from embeddings order by payload <-> (select array_agg(g) from generate_series(1, 100) g) limit 100;
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

backtrace:

#0  HierarchicalNSW::searchBaseLayer (this=0x7f3a3a446000, point=0x563d28a99da0, ef=64) at hnswalg.cpp:45
#1  0x00007f3a96a559bc in HierarchicalNSW::searchKnn (this=0x0, query=0x0, k=0) at hnswalg.cpp:207
#2  0x00007f3a96a55e04 in hnsw_search (hnsw=0x0, point=0x0, efSearch=0, n_results=0x7fffebe385a0, results=0xc8924200) at hnswalg.cpp:329
#3  0x00007f3a96a52424 in hnsw_gettuple (scan=0x563d28a98900, dir=ForwardScanDirection) at embedding.c:316
#4  0x0000563d270b935f in index_getnext_tid (scan=0x563d28a98900, direction=ForwardScanDirection) at indexam.c:541
#5  0x0000563d270b974d in index_getnext_slot (scan=0x563d28a98900, direction=ForwardScanDirection, slot=0x563d28a5c548) at indexam.c:633
#6  0x0000563d273834e4 in IndexNextWithReorder (node=0x563d28a5c218) at nodeIndexscan.c:262
#7  0x0000563d2735383b in ExecScanFetch (node=0x563d28a5c218, accessMtd=0x563d27383270 <IndexNextWithReorder>, recheckMtd=0x563d27383760 <IndexRecheck>) at execScan.c:132
#8  0x0000563d27353479 in ExecScan (node=0x563d28a5c218, accessMtd=0x563d27383270 <IndexNextWithReorder>, recheckMtd=0x563d27383760 <IndexRecheck>) at execScan.c:198
#9  0x0000563d273814f1 in ExecIndexScan (pstate=0x563d28a5c218) at nodeIndexscan.c:531
#10 0x0000563d2734edf2 in ExecProcNodeFirst (node=0x563d28a5c218) at execProcnode.c:464
#11 0x0000563d27384ef2 in ExecProcNode (node=0x563d28a5c218) at ../../../src/include/executor/executor.h:273
#12 0x0000563d273842f9 in ExecLimit (pstate=0x563d28a5bf78) at nodeLimit.c:96
#13 0x0000563d2734edf2 in ExecProcNodeFirst (node=0x563d28a5bf78) at execProcnode.c:464
#14 0x0000563d27346d02 in ExecProcNode (node=0x563d28a5bf78) at ../../../src/include/executor/executor.h:273
#15 0x0000563d27342091 in ExecutePlan (estate=0x563d28a83828, planstate=0x563d28a5bf78, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, 
    direction=ForwardScanDirection, dest=0x563d28a51cd0, execute_once=true) at execMain.c:1670
#16 0x0000563d27341f2b in standard_ExecutorRun (queryDesc=0x563d28a353c8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365
#17 0x0000563d27341ce2 in ExecutorRun (queryDesc=0x563d28a353c8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309
#18 0x0000563d27621d44 in PortalRunSelect (portal=0x563d289e34b8, forward=true, count=0, dest=0x563d28a51cd0) at pquery.c:924
#19 0x0000563d276217b8 in PortalRun (portal=0x563d289e34b8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x563d28a51cd0, altdest=0x563d28a51cd0, qc=0x7fffebe38c08)
    at pquery.c:768
#20 0x0000563d2761cf2a in exec_simple_query (query_string=0x563d28969d68 "select id from embeddings order by payload <-> (select array_agg(g) from generate_series(1, 100) g) limit 100;")
    at postgres.c:1274
#21 0x0000563d2761c1cb in PostgresMain (dbname=0x563d289a1418 "postgres", username=0x563d28965058 "heikki") at postgres.c:4632
#22 0x0000563d2752ec06 in BackendRun (port=0x563d28998f50) at postmaster.c:4452
#23 0x0000563d2752c62c in BackendStartup (port=0x563d28998f50) at postmaster.c:4180
#24 0x0000563d27529f8e in ServerLoop () at postmaster.c:1779
#25 0x0000563d27528ddd in PostmasterMain (argc=3, argv=0x563d28962fc0) at postmaster.c:1463
#26 0x0000563d273d885d in main (argc=3, argv=0x563d28962fc0) at main.c:198

Did some sleuthing with debugger: In HierarchicalNSW::searchBaseLayer, we have this:

    std::vector<uint32_t> visited;                                                                                                                                       │
    visited.resize((cur_element_count + 31) >> 5);                   

In this case, cur_element_count is 0, so the size of visited array is 0. The crash happens here, where we try to assign to the empty vector:

    visited[enterpoint_node >> 5] = 1 << (enterpoint_node & 31);                                                                                                             │

duplicated results after TRUNCATE

Steps to reproduce

postgres=# create table embeddings(id integer primary key, payload real[]);
CREATE TABLE
postgres=# create index on embeddings using hnsw(payload) with (maxelements=1000000, dims=100, m=32);
CREATE INDEX
postgres=# insert into embeddings select a, (select array_agg(g) from generate_series(1, 100) g) from generate_series(1, 20) a;
INSERT 0 20
postgres=# truncate embeddings;
TRUNCATE TABLE
postgres=# insert into embeddings select a, (select array_agg(g) from generate_series(1, 100) g) from generate_series(1, 20) a;
INSERT 0 20
postgres=# select ctid, id from embeddings order by payload <-> (select array_agg(g) from generate_series(1, 100) g) limit 100;
  ctid  | id 
--------+----
 (0,1)  |  1
 (0,1)  |  1
 (1,1)  | 18
 (1,1)  | 18
 (0,2)  |  2
 (0,2)  |  2
 (1,2)  | 19
 (1,2)  | 19
 (0,3)  |  3
 (0,3)  |  3
 (1,3)  | 20
 (1,3)  | 20
 (0,4)  |  4
 (0,4)  |  4
 (0,5)  |  5
 (0,5)  |  5
 (0,6)  |  6
 (0,6)  |  6
 (0,7)  |  7
 (0,7)  |  7
 (0,8)  |  8
 (0,8)  |  8
 (0,9)  |  9
 (0,9)  |  9
 (0,10) | 10
 (0,10) | 10
 (0,11) | 11
 (0,11) | 11
 (0,12) | 12
 (0,12) | 12
 (0,13) | 13
 (0,13) | 13
 (0,14) | 14
 (0,14) | 14
 (0,15) | 15
 (0,15) | 15
 (0,16) | 16
 (0,16) | 16
 (0,17) | 17
 (0,17) | 17
(40 rows)

how to create HNSW indexes in parallel?

How to create HNSW indexes in parallel or insert data in parallel after creating index?
I saw a project called hnswlib, which supports parallel insertion of data, but it is not used for databases.
And I noticed that hnswlib's code is a bit similar to yours.
So do you have any ideas?
Many thanks!

Amount of RAM on macos

	if (sysctlbyname("hw.memsize", NULL, &total, NULL, 0) < 0)

Oh, really? You expect memory size to be placed to oldlenp?

Manhattan distance

I was wondering if the Manhattan distance could be added to this extension?

Evaluate Profile Guided Optimization (PGO)

Hi!

Recently I tested a lot of software with PGO and measured the performance improvements from PGO - the results are here. Since my results show interesting improvements on many databases (including PostgreSQL) I think it would be a good idea to measure PGO effects on pg_embedding as well. If the results will show an improvement - would be great to see a note in the documentation about PGO. Hopefully, it can help with improving pg_embedding performance (at least CPU usage).

Support for other types

Is there any interest in support for other data types in addition to real[]?

I have an interest in indexing bit varying columns to support compressed representations for large (100 million - 1 billion) scale search. A billion 768d vectors is ~3TB, a billion 768 bit bitstrings is only ~90 GB (plus overhead)

I have a branch off of 0.2.0 that calculates simple matching distance pop_count(x^y) / len(x) on bit varying columns. currently re-writing off a more recent commit (a lot has changed since 0.3.6)

Q: is this likely ever to get packaged?

I'm using this great extension with my open source project.

Scripting it for installing on third party installations is not so bad but it does raise the bar slightly for adoption. It would be great if it eventually became a standard package. Is there any prospect of that?

Thanks for all the hard work!

Failure when creating Euclidean/Cosine hnsw indexes

I'm getting an error when I try to create Euclidean/Cosine hnsw indexes. The ann_manhattan_ops index seems to work fine.

Steps to reproduce

These fail:

CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=384, m=3, efconstruction=5, efsearch=5);
CREATE INDEX ON documents USING hnsw(embedding ann_cos_ops) WITH (dims=384, m=3, efconstruction=5, efsearch=5);

This works:

CREATE INDEX ON documents USING hnsw(embedding ann_manhattan_ops) WITH (dims=384, m=3, efconstruction=5, efsearch=5);

Expected result

Index created

Actual result

[XX000] ERROR: Function is not supported by HNSW inodex

Environment

Postgres 15 Docker

Logs, links

Happy to provide this with guidance on what you need.

Better Guidance on Determining m/efconstruction/efsearch values.

It would be VERY helpful if there were some rules of thumb for how to right-size these parameters. For example, pgvector gives the following guidance Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows. Could similar guidance be created for pg_embedding?

There is some guidance about "large values" and what they do, but I don't know what a higher values mean. Should it be 10x higher or just 1 higher?

Apache 2 license?

Looks like there is an update_license branch and a PR related to this topic (though the PR is outdated and isn't directed at main branch), but would like to confirm that this project will be Apache 2 licensed?

If not, we'll remove it from our postgres extensions directory https://pgt.dev/extensions/pg_embedding

[BUG] after build index ;insert error

Environment

postgresql15.2

Logs, links

test_db=# \d test_vectors_embedding_hnsw2
                                Table "public.test_vectors_embedding_hnsw2"
 Column |      Type      | Collation | Nullable |                         Default                          
--------+----------------+-----------+----------+----------------------------------------------------------
 id     | bigint         |           | not null | nextval('test_vectors_embedding_hnsw2_id_seq'::regclass)
 emb    | real[]         |           |          | 
 biz    | character(255) |           |          | 
 pic    | character(255) |           |          | 
 url    | character(700) |           |          | 
 cls    | character(2)   |           |          | 
 loc    | character(128) |           |          | 
Indexes:
    "test_vectors_embedding_hnsw2_pkey" PRIMARY KEY, btree (id)
    "test_vectors_embedding_hnsw2_emb_idx" disk_hnsw (emb) WITH (dims='256', m='32')

test_db=# Drop index test_vectors_embedding_hnsw2_emb_idx;
DROP INDEX
test_db=# INSERT INTO test_vectors_embedding_hnsw2 (emb, pic, url) VALUES ('{-1.3577173, -0.1931086, -2.054078, -1.0216528, -0.2908677, 0.0417148, -1.275564, 0.3681692, 0.6960625, -0.4160977, -1.022937, -0.3357292, -1.7334518, 0.6310615, -0.9466465, 0.2149753, -0.0170873, -0.2950798, 1.0806773, -1.2019562, -0.4723569, 0.280157, -0.1214862, -0.5946828, -0.8663544, -0.2719779, 0.1970003, -1.000358, -0.7949438, -0.8798426, 0.9840588, -0.819663, -0.6153127, -0.0383352, 0.0031938, 0.2136006, -0.8936025, 1.8357444, -1.4972517, 0.5109758, -0.9384633, -0.7397373, 0.1170649, -0.8619239, -0.4392602, -0.4289132, 0.1355756, 0.1290145, -0.9775255, -2.0477536, -0.3968557, -0.8216435, 0.8082756, -0.4985549, -0.1131888, 1.6151655, 1.0111067, 0.5304176, -1.2303437, -0.5870577, 0.1953384, 0.4804874, 0.5157396, -0.5184563, -1.1904018, -0.6855107, 1.1018984, 0.9297467, -0.0623092, -0.151415, -0.0986392, -0.7027535, -1.9203912, -0.2652288, -0.3080295, -1.0096159, -0.8962705, 0.7063244, -1.352454, -0.2865452, -0.8807335, 0.9797034, -0.0904295, 1.334077, -0.7468812, 2.2190342, 1.162147, 0.3164219, 0.1395898, -1.0999525, -1.1906483, -1.0920788, 0.0463939, -0.4177737, -0.1881599, -0.4334898, -0.2400387, 0.9126538, 0.8113637, 0.4602213, -0.3553045, -0.9312688, 0.491594, 1.7958134, -0.5205359, -1.7871464, -0.8688284, -0.289492, -0.2719598, 0.8960803, -0.7156042, -0.7482584, 1.4487717, -1.6332766, -0.0701321, -0.2844273, -0.3483454, 0.6967442, -0.082129, 0.3499782, 0.231226, -0.9935235, 0.5810836, -0.7631179, 0.5115957, -0.4710389, 2.7917955, 0.8310192, -1.0967236, 1.3042188, -0.0621228, 0.9727247, 1.5298697, -0.4735105, 0.8201959, -1.6014637, -0.4435724, 0.1150684, -2.3777487, -0.092687, 1.64431, -1.3853956, -2.3693242, 0.2795227, 1.4520471, 0.7070943, -1.1565708, -1.8917516, 0.178189, 0.0070163, 0.3306862, -0.5251174, -0.666474, 0.7975529, -0.4793687, 0.4736622, 0.1003931, -0.2413153, 0.4790334, 0.4233853, -0.3626219, -1.7925045, -0.0311574, -0.771827, 0.5282623, -0.590151, -0.4058989, 1.0258397, 0.4007719, 0.331128, -0.4066766, -0.8737853, 0.3456081, -0.6774357, -0.5742541, 1.3907764, -1.8045238, 1.0758289, -2.2130907, 0.9411963, -1.0718011, -2.4546704, 0.2509904, -0.5624139, 0.4104578, -0.7894603, 0.3795136, 1.8772784, -1.3694749, 0.9232821, 0.6553418, -0.1282401, 0.5815499, 0.4233072, 0.3157484, -0.135366, -0.020872, -0.793581, -0.9765639, -0.0791832, 0.3738998, 1.5921856, -1.0544736, 0.6868084, 1.0082403, 0.1005407, 1.0620728, -0.2379073, 0.0730326, -0.1270735, -1.7749763, -0.2000193, 0.7045384, -0.7347514, -0.7330724, 1.645151, -0.0203024, 0.5865587, -0.7982014, 0.3127671, 0.2199927, -0.7628009, 0.2664881, -0.263574, -1.166938, 0.3191151, -0.2087096, -1.3463137, 1.0447439, 0.8546895, -1.0444397, 1.1071848, 0.1644423, 0.923902, 0.3043146, 0.0062285, 1.0722132, -0.0621383, -0.2353123, 1.0340056, 0.5977604, 0.532452, 0.2615763, 0.2707317, 0.9625043, 0.750101, 0.7676484, 1.4126079, -1.1933744, -1.0857062, 0.7211734, -1.0081924, -0.051911, -0.1272659, -0.160972, 0.7146422}','--', '--')
test_db-# ;
INSERT 0 1
test_db=# CREATE INDEX ON test_vectors_embedding_hnsw2 USING disk_hnsw (emb ann_l2_ops) WITH (dims=256, m=32);
CREATE INDEX
test_db=# \d test_vectors_embedding_hnsw2
                                Table "public.test_vectors_embedding_hnsw2"
 Column |      Type      | Collation | Nullable |                         Default                          
--------+----------------+-----------+----------+----------------------------------------------------------
 id     | bigint         |           | not null | nextval('test_vectors_embedding_hnsw2_id_seq'::regclass)
 emb    | real[]         |           |          | 
 biz    | character(255) |           |          | 
 pic    | character(255) |           |          | 
 url    | character(700) |           |          | 
 cls    | character(2)   |           |          | 
 loc    | character(128) |           |          | 
Indexes:
    "test_vectors_embedding_hnsw2_pkey" PRIMARY KEY, btree (id)
    "test_vectors_embedding_hnsw2_emb_idx" disk_hnsw (emb) WITH (dims='256', m='32')

test_db=# INSERT INTO test_vectors_embedding_hnsw2 (emb, pic, url) VALUES ('{-1.3577173, -0.1931086, -2.054078, -1.0216528, -0.2908677, 0.0417148, -1.275564, 0.3681692, 0.6960625, -0.4160977, -1.022937, -0.3357292, -1.7334518, 0.6310615, -0.9466465, 0.2149753, -0.0170873, -0.2950798, 1.0806773, -1.2019562, -0.4723569, 0.280157, -0.1214862, -0.5946828, -0.8663544, -0.2719779, 0.1970003, -1.000358, -0.7949438, -0.8798426, 0.9840588, -0.819663, -0.6153127, -0.0383352, 0.0031938, 0.2136006, -0.8936025, 1.8357444, -1.4972517, 0.5109758, -0.9384633, -0.7397373, 0.1170649, -0.8619239, -0.4392602, -0.4289132, 0.1355756, 0.1290145, -0.9775255, -2.0477536, -0.3968557, -0.8216435, 0.8082756, -0.4985549, -0.1131888, 1.6151655, 1.0111067, 0.5304176, -1.2303437, -0.5870577, 0.1953384, 0.4804874, 0.5157396, -0.5184563, -1.1904018, -0.6855107, 1.1018984, 0.9297467, -0.0623092, -0.151415, -0.0986392, -0.7027535, -1.9203912, -0.2652288, -0.3080295, -1.0096159, -0.8962705, 0.7063244, -1.352454, -0.2865452, -0.8807335, 0.9797034, -0.0904295, 1.334077, -0.7468812, 2.2190342, 1.162147, 0.3164219, 0.1395898, -1.0999525, -1.1906483, -1.0920788, 0.0463939, -0.4177737, -0.1881599, -0.4334898, -0.2400387, 0.9126538, 0.8113637, 0.4602213, -0.3553045, -0.9312688, 0.491594, 1.7958134, -0.5205359, -1.7871464, -0.8688284, -0.289492, -0.2719598, 0.8960803, -0.7156042, -0.7482584, 1.4487717, -1.6332766, -0.0701321, -0.2844273, -0.3483454, 0.6967442, -0.082129, 0.3499782, 0.231226, -0.9935235, 0.5810836, -0.7631179, 0.5115957, -0.4710389, 2.7917955, 0.8310192, -1.0967236, 1.3042188, -0.0621228, 0.9727247, 1.5298697, -0.4735105, 0.8201959, -1.6014637, -0.4435724, 0.1150684, -2.3777487, -0.092687, 1.64431, -1.3853956, -2.3693242, 0.2795227, 1.4520471, 0.7070943, -1.1565708, -1.8917516, 0.178189, 0.0070163, 0.3306862, -0.5251174, -0.666474, 0.7975529, -0.4793687, 0.4736622, 0.1003931, -0.2413153, 0.4790334, 0.4233853, -0.3626219, -1.7925045, -0.0311574, -0.771827, 0.5282623, -0.590151, -0.4058989, 1.0258397, 0.4007719, 0.331128, -0.4066766, -0.8737853, 0.3456081, -0.6774357, -0.5742541, 1.3907764, -1.8045238, 1.0758289, -2.2130907, 0.9411963, -1.0718011, -2.4546704, 0.2509904, -0.5624139, 0.4104578, -0.7894603, 0.3795136, 1.8772784, -1.3694749, 0.9232821, 0.6553418, -0.1282401, 0.5815499, 0.4233072, 0.3157484, -0.135366, -0.020872, -0.793581, -0.9765639, -0.0791832, 0.3738998, 1.5921856, -1.0544736, 0.6868084, 1.0082403, 0.1005407, 1.0620728, -0.2379073, 0.0730326, -0.1270735, -1.7749763, -0.2000193, 0.7045384, -0.7347514, -0.7330724, 1.645151, -0.0203024, 0.5865587, -0.7982014, 0.3127671, 0.2199927, -0.7628009, 0.2664881, -0.263574, -1.166938, 0.3191151, -0.2087096, -1.3463137, 1.0447439, 0.8546895, -1.0444397, 1.1071848, 0.1644423, 0.923902, 0.3043146, 0.0062285, 1.0722132, -0.0621383, -0.2353123, 1.0340056, 0.5977604, 0.532452, 0.2615763, 0.2707317, 0.9625043, 0.750101, 0.7676484, 1.4126079, -1.1933744, -1.0857062, 0.7211734, -1.0081924, -0.051911, -0.1272659, -0.160972, 0.7146422}','--', '--');
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> 

the postgresql.log

2023-08-02 16:54:01.122 CST 2286052     LOG:  database system is ready to accept connections
2023-08-02 16:57:56.487 CST 2286052     LOG:  server process (PID 2307560) exited with exit code 245
2023-08-02 16:57:56.487 CST 2286052     LOG:  terminating any other active server processes
2023-08-02 16:57:56.500 CST 2286052     LOG:  all server processes terminated; reinitializing
2023-08-02 16:57:57.072 CST 2308083     LOG:  database system was interrupted; last known up at 2023-08-02 16:54:01 CST
2023-08-02 16:57:57.249 CST 2308083     LOG:  database system was not properly shut down; automatic recovery in progress
2023-08-02 16:57:57.251 CST 2308083     LOG:  redo starts at 558/62D71628
2023-08-02 16:57:57.340 CST 2308083     LOG:  invalid record length at 558/62D80F00: wanted 24, got 0
2023-08-02 16:57:57.340 CST 2308083     LOG:  redo done at 558/62D80ED8 system usage: CPU: user: 0.04 s, system: 0.04 s, elapsed: 0.08 s
2023-08-02 16:57:57.344 CST 2308084     LOG:  checkpoint starting: end-of-recovery immediate wait
2023-08-02 16:57:57.501 CST 2308084     LOG:  checkpoint complete: wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.147 s, sync=0.004 s, total=0.159 s; sync files=20, longest=0.003 s, average=0.001 s; distance=62 kB, estimate=62 kB
2023-08-02 16:57:57.512 CST 2286052     LOG:  database system is ready to accept connections

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.