netrasys / pgann

291.0 291.0 15.0 22 KB

Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database.

License: MIT License

Python 100.00%

ann approximate-nearest-neighbor-search nearest-neighbor-search nearest-neighbors postgres vectors

pgann's People

Stargazers

Watchers

Forkers

krzynio laeeth sethips amirstudy roszcz netankit raymon-ai sailfish009 hadryan greatwallisme wyfunique s4n0i aniketmaithani jwcnewton suryatmodulus

pgann's Issues

Add dockerfile with cube max dimension tuning

Hello.

I found this wonderful Dockerfile, in which the cube max dimension limit is raised to 2048.
https://hub.docker.com/r/expert/postgresql-large-cube
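
For context, the stock cube extension is compiled with a limit of 100 dimensions (CUBE_MAX_DIM in cubedata.h), so longer embeddings are rejected unless the extension is rebuilt with a higher limit. A minimal sketch of a pre-insert check follows; the helper name `fits_in_cube` is hypothetical, and the 2048 figure is taken from the issue text above rather than verified against the image.

```python
# Default limit compiled into the stock cube extension (CUBE_MAX_DIM in cubedata.h).
STOCK_CUBE_MAX_DIM = 100

# Limit reported in the issue for the custom image (assumption, not verified here).
LARGE_CUBE_MAX_DIM = 2048


def fits_in_cube(vector, max_dim=STOCK_CUBE_MAX_DIM):
    """Return True if the vector has at most max_dim dimensions."""
    return len(vector) <= max_dim
```

With these values, a 90-dimensional embedding passes the stock check, while a 512-dimensional one only passes when `max_dim=LARGE_CUBE_MAX_DIM` is given.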

Idea for performance improvement

Hi,

could your approach be improved by using dblink? Example: http://www.programmersought.com/article/76671080348/

Best
Wilhelm

Worse performance with GIST?

Hey! This is really great, and I want to thank you for looking into using PostgreSQL cube data types and GiST indexes for nearest neighbor queries. I did want to ask you, though: did you ever run into any issues where the GiST index actually made performance worse? I ask because I seem to be experiencing exactly that. It's all documented in this StackOverflow question and also in a related GitHub repo. I'm still researching the problem, but my working theory right now is that without the index, Postgres does a parallel sequential scan on the table, whereas with the index it only does a sequential (non-parallel) scan of the index. If that's true, then I need to figure out how to coax it into doing a parallel index scan, if that is possible at all. Will update you as I progress!
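
One way to test a theory like this would be to run the query under EXPLAIN and look at which scan nodes the planner picked. The sketch below is only an illustration: `explain_query` and `scan_nodes` are hypothetical helpers, and the node labels are the ones PostgreSQL prints in its text-format plans. Executing the statement and fetching the plan lines is left to whatever driver is in use.

```python
# Scan node labels as they appear in PostgreSQL's text-format EXPLAIN output.
# Longer labels come first so that "Parallel Seq Scan" is not reported as "Seq Scan".
SCAN_NODE_TYPES = (
    "Parallel Index Only Scan",
    "Parallel Index Scan",
    "Parallel Seq Scan",
    "Bitmap Index Scan",
    "Bitmap Heap Scan",
    "Index Only Scan",
    "Index Scan",
    "Seq Scan",
)


def explain_query(sql):
    """Wrap a query so PostgreSQL runs it and reports the actual plan and buffer usage."""
    return "EXPLAIN (ANALYZE, BUFFERS) " + sql


def scan_nodes(plan_lines):
    """Return the scan node type found on each plan line that contains one."""
    found = []
    for line in plan_lines:
        for node in SCAN_NODE_TYPES:
            if node in line:
                found.append(node)
                break  # one node label per plan line
    return found
```

If the plan with the index lists only "Index Scan" while the plan without it lists "Parallel Seq Scan", that would be consistent with the theory above, though it would not by itself explain the timing difference.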

Performance

Nice to find an ANN that is not RAM-based! Thanks!!

I just tried this with 50 000 000 entries of length 90. The table, including the embeddings, is 38 GB. I ran Postgres in a container and searched for a single random embedding with:

sql = "select id,embeddings from images order by embeddings <-> cube({0}) asc limit 25".format((emb_string))

This search took 230 s. This machine has a fast CPU and fast memory. Am I doing something wrong, or is this reasonable?

My concern is this: the table is about 38 GB in total, which means it more or less fits in RAM. Would it be better to use Faiss? If I double the size of the database, will the search take twice as long?
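
As a side note, the query in this post interpolates the embedding directly into the SQL string. A minimal sketch of a parameterized variant follows, assuming a DB-API driver such as psycopg2 that adapts Python lists to PostgreSQL arrays, and reusing the `images` and `embeddings` names from the post. `build_knn_query` is a hypothetical helper, not part of pgann; it relies on the cube extension's `cube(float8[])` constructor.

```python
def build_knn_query(embedding, limit=25):
    """Return (sql, params) for a nearest-neighbor query ordered by cube distance.

    Values are passed as parameters instead of being formatted into the SQL text.
    """
    sql = (
        "SELECT id, embeddings FROM images "
        "ORDER BY embeddings <-> cube(%s::float8[]) ASC "
        "LIMIT %s"
    )
    # Coerce to plain floats so the driver sends a numeric array.
    params = ([float(x) for x in embedding], int(limit))
    return sql, params
```

The returned pair could then be passed to a cursor as `cursor.execute(sql, params)`. This changes how the values reach the server, not how long the search takes.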

Can we get a blog post

Hi,
Really happy to have stumbled on this and to get an indication that this idea can work at scale. This is awesome, and we'd love to know more about your experience running this in prod. Any chance of a blog post?
Thanks

Approximate?

Why is this an "approximate" nearest neighbor search? The documentation at https://www.postgresql.org/docs/13/cube.html says nothing about distances or search being approximate, and I don't see anything in https://github.com/postgres/postgres/blob/472e518a44eacd9caac7d618f1b6451672ca4481/contrib/cube/cube.c to indicate it's anything other than a typical kd-tree search.
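
One way to check this empirically would be to compare the ids returned by the database against an exact brute-force search over the same vectors. The sketch below assumes Euclidean distance, which is what the cube `<->` operator computes; `exact_knn` and `recall` are hypothetical helpers, and the in-memory `rows` mapping stands in for data fetched from the table.

```python
import math


def exact_knn(query, rows, k):
    """Return the ids of the k rows closest to query by Euclidean distance.

    rows maps id -> vector; ties are broken by id to keep the result deterministic.
    """
    ranked = sorted(rows, key=lambda rid: (math.dist(query, rows[rid]), rid))
    return ranked[:k]


def recall(found_ids, exact_ids):
    """Fraction of the exact neighbors that also appear in found_ids."""
    if not exact_ids:
        return 1.0
    return len(set(found_ids) & set(exact_ids)) / len(exact_ids)
```

A recall of 1.0 across many random queries would suggest that the results are exact, as the question implies; anything lower would point to approximation somewhere in the pipeline.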
