jbellis / jvector

JVector: the most advanced embedded vector search engine

License: Apache License 2.0

Java 96.58% Python 0.50% Shell 0.23% C 2.69%

ann java knn machine-learning search-engine similarity-search vector-search

jvector's Issues

why does Bench OOM after several data files?

It sure looks to me like we retain no references to the DataSet after we're done with it, but on a 16GB heap I reliably enter GC hell by the 4th dataset. (In my current order, the 4th is Sift, which is a smaller set of data than Glove-200, which comes 3rd. So it's not just increasing data size.)

Get PQ working with grid search

Mostly this means, merge the PQ branch of hnswrecall

mvn release task that checks for license headers?

Not sure if this is a big ask or a small one. If it's a lot of work we can skip it.

GraphIndex.View should be AutoCloseable

GraphIndex.getView returns a View interface that isn't AutoCloseable, but the OnDiskGraphIndex.OnDiskView implementation of View is AutoCloseable. This means that users of the GraphIndex.getView method can't do:

try (var view = graph.getView())
{
  ...
}

This is easy to work around but annoying.
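
One possible shape for the fix, as a minimal sketch: the view interface extends AutoCloseable and narrows close() so it throws no checked exception. SketchGraph, SketchView and OnDiskSketchGraph are hypothetical stand-ins invented for illustration, not the actual JVector types.

```java
// Hypothetical stand-ins for GraphIndex and GraphIndex.View; not the real JVector API.
interface SketchGraph {
    SketchView getView();

    // Extending AutoCloseable lets callers use try-with-resources on any view.
    interface SketchView extends AutoCloseable {
        int size();

        // Narrowed signature: no checked exception, so callers need no catch block.
        @Override
        default void close() {
            // In-memory views hold no resources, so the default is a no-op.
        }
    }
}

final class OnDiskSketchGraph implements SketchGraph {
    int openViews = 0;

    @Override
    public SketchView getView() {
        openViews++;
        return new SketchView() {
            @Override
            public int size() {
                return 0;
            }

            @Override
            public void close() {
                openViews--; // stands in for releasing a file handle or mapping
            }
        };
    }
}
```

With this shape, the try-with-resources form above compiles against the interface type, and on-heap implementations inherit the no-op close().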

Optimize selectDiverse + copyDiverse sequence in CNS

Even more important now that enforceMaxConnLimit relies on this too.

Modularize the maven build

This is some maven refactoring that will involve:

Move the current pom to be a 'parent' pom
Creation of a core directory which will be the library classes and related unit tests
Creation of an example directory which will contain the current Bench and Sift classes
Update the readme to reflect the above

Both subdirs will need child poms.

find out how much we can compress openai embeddings vectors

right now Bench hardcodes dimensions at

        var pqDims = ds.baseVectors.get(0).length / 2;

because with the datasets from ann-benchmarks (nytimes-256, glove-200) this gives virtually no recall loss at overquery=2.

however, OpenAI's embedding vectors are ridiculously overparameterized, so we should be able to get more aggressive there

length / 4 ? length / 8? let's find out
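
To make the trade-off concrete, here is a small sketch of the size arithmetic. It assumes one 8-bit code per PQ subspace and float32 source vectors; the class and method names are hypothetical, and the 1536-dimension figure in the usage note is an assumption about the embedding model, not something Bench measures.

```java
final class PqCompression {
    // Bytes needed for one uncompressed float32 vector.
    static int rawBytes(int dimensions) {
        return dimensions * Float.BYTES;
    }

    // Bytes for one PQ-encoded vector, assuming one 8-bit code per subspace.
    static int pqBytes(int subspaces) {
        return subspaces;
    }

    // Compression ratio when the number of subspaces is dimensions / divisor.
    static double ratio(int dimensions, int divisor) {
        int subspaces = dimensions / divisor;
        return (double) rawBytes(dimensions) / pqBytes(subspaces);
    }
}
```

Under these assumptions a 1536-dimension vector shrinks 8x at length / 2, 16x at length / 4 and 32x at length / 8. What recall each setting preserves is exactly the open question.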

Try out removing the COHHG CompletionTracker

I think it's not necessary for Vamana single-level graphs

JVector recall regression vs HNSW

HNSW (concurrent5 branch + hnswrecall)

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
HNSW   M=16 ef=100: top 100/1 recall 0.7170, build 37.37s, query 11.68s. 209069030 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
HNSW   M=16 ef=100: top 100/1 recall 0.6954, build 81.76s, query 8.32s. 241306820 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
HNSW   M=16 ef=100: top 100/1 recall 0.6329, build 142.23s, query 12.33s. 251840840 nodes visited

JVector

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index   M=16 ef=100: top 100/1 recall 0.6393, build 45.96s, query 10.84s. 165252200 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index   M=16 ef=100: top 100/1 recall 0.5328, build 133.32s, query 9.93s. 229974700 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index   M=16 ef=100: top 100/1 recall 0.3820, build 210.34s, query 11.57s. 183856980 nodes visited

update SiftTest to perform both in-memory and from-disk tests

Recall should be ~0.975 for in-memory and ~0.96 for from-disk (with PQ)

Compare new Simd code with hnswrecall + lucene

I remember seeing 40% improvement enabling simd, we're seeing less now. Did Lucene's hand-unrolled simd add value?

ensure builds are jdk-11 compatible

thinking of OSS C* here

figure out why sometimes Bench build takes 10x longer

mvn exec does not build before executing

so if you don't "clean install" manually first you will either get ClassNotFoundException (moderately confusing and bad) or it will silently run the last version that you built (very very bad).

Investigate issues loading sources in IDE

When debugging code, it is helpful to see a dependency's source code. AFAICT, the sources are not yet published. Here is the current warning I see in IntelliJ:

Put the Disk in DiskANN

It may be useful to look at https://github.com/datastax/cassandra/tree/diskann, but this only added PQ to the existing HNSW structure. What we want to do here is implement full diskann, which means saving vectors as part of the graph structure, and fetching them as we perform BFS to minimize "seeks" (mmap paging-in) compared to reading them separately at the end of the search to re-rank in exact order.

Will also need to add an isApproximate() flag (or the opposite, isExact()) so that search knows when it needs to do the re-rank.

I think the right place for this is NeighborSimilarity.

Add example invocations into readme

Need a quick description of running via maven exec plugin that @jkni put in. Also will add a stanza for SiftSmall

Add test for GraphCache

JDK 20 build stuff

Add native simd vector support

This will require moving to MemorySegments and FloatBuffers.

Then C code using jextract

add back caching of the closest nodes to the entry point

Add tests

Add the rest of the methods in SimdOps to VectorUtil + VectorUtilSupport so it runs w/o Panama

Comparison of DiskANN vs Lucene HNSW

Need to use the actual Lucene disk classes for this. Have sample code deep in old vsearch history, will dig it up. Using my concurrent5 Lucene branch is kind of cheating, but I think it's okay vs spending 32x as much time waiting for serial Lucene to build the graph; the actual graphs produced are nearly equivalent.

Goal is to build Deep100M index on machine with enough memory (64GB should be enough) and then serve it from disk on either a smaller VM or a container. I could be wrong but I don't think we can keep it from using all the ram as disk cache otherwise.

Find out if NBHM is better or worse than ConcurrentHashMap in OnHeapGraphIndex

See commit history for where NBHM got added.

Will probably need to test on low dimension vectors (i.e. random data, not the grid search) so that vector comparison time doesn't swamp the Map in the measurement.

Add info line when vector api is correctly enabled

Currently we have a warning when it's NOT enabled but nothing when it IS enabled (so, running w/ enabled is indistinguishable from running on an unsupported JDK which is also silent)

In Lucene it looks like

INFO: Java vector incubator API enabled; uses preferredBitSize=256
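
A minimal sketch of how such a check could look, assuming that "enabled" means the jdk.incubator.vector module was resolved into the boot layer (for example via --add-modules). VectorApiStatus and its message strings are hypothetical; only ModuleLayer and java.util.logging are standard JDK APIs.

```java
import java.util.logging.Logger;

final class VectorApiStatus {
    private static final Logger LOG = Logger.getLogger(VectorApiStatus.class.getName());

    // True when the incubator module is part of the boot layer of this JVM.
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    // Message text is illustrative; it is not the wording JVector or Lucene uses.
    static String statusMessage(boolean enabled) {
        return enabled
                ? "Java vector incubator API enabled"
                : "Java vector incubator API not enabled";
    }

    // Log at INFO when enabled and at WARNING otherwise, so both cases are visible.
    static void logStatus() {
        boolean enabled = vectorModulePresent();
        if (enabled) {
            LOG.info(statusMessage(true));
        } else {
            LOG.warning(statusMessage(false));
        }
    }
}
```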

Make PQ take a RAVV instead of List<float[]>

This will make it easier to run PQ against a subset of vectors without reading them all into memory.

Experiment with 4-bit PQ

Interested to see how much accuracy we lose if we do 4-bit PQ (two PQ entries per byte) instead of 8-bit.
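
For reference, "two PQ entries per byte" could be laid out as in the sketch below. The nibble order (even index in the low nibble) and the FourBitCodes name are assumptions made for illustration; nothing here reflects an existing JVector encoding.

```java
final class FourBitCodes {
    // Pack codes in [0, 15] two per byte: even index in the low nibble, odd index in the high nibble.
    static byte[] pack(int[] codes) {
        byte[] packed = new byte[(codes.length + 1) / 2];
        for (int i = 0; i < codes.length; i++) {
            if (codes[i] < 0 || codes[i] > 15) {
                throw new IllegalArgumentException("code out of 4-bit range: " + codes[i]);
            }
            int shift = (i & 1) == 0 ? 0 : 4;
            packed[i / 2] |= (byte) (codes[i] << shift);
        }
        return packed;
    }

    // Read back the code stored at the given logical index.
    static int get(byte[] packed, int index) {
        int b = packed[index / 2] & 0xFF;
        return (index & 1) == 0 ? (b & 0x0F) : (b >>> 4);
    }
}
```

Each subspace then has only 16 centroids instead of 256, which is where the accuracy loss would come from.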

Document how to embed JVector

Add annotations indicating when a method returns a pooled float[]

Something like @PooledFloat and @unpooled/unshared/unique/distinctFloat. (Naming things is hard.) Go through the RAVV implementations and PQ methods and annotate them appropriately.

Automate releases as GitHub workflow

We should automate the release process once we're comfortable with the results. I'd prefer a workflow that runs when an appropriate tag is pushed, but I'm open to other options.

Add decodedSquareDistance fast-path to pq

Fixing this:
// TODO implement other similarity functions efficiently

(I'm not sure it's possible to do the same for cosine.)
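
The usual reason a fast path exists for squared Euclidean distance is that it decomposes into a sum over subspaces, so the distance from the query to every centroid can be precomputed once per query and each encoded vector then costs one table lookup per subspace. A sketch under that assumption follows; PqSquareDistance and the codebooks[m][k] layout are hypothetical and not taken from the JVector PQ class.

```java
final class PqSquareDistance {
    // codebooks[m][k] is centroid k of subspace m; all centroids of a subspace share one length.
    // Returns table[m][k] = squared distance from the query's m-th subvector to centroid k.
    static float[][] buildTable(float[] query, float[][][] codebooks) {
        float[][] table = new float[codebooks.length][];
        int offset = 0;
        for (int m = 0; m < codebooks.length; m++) {
            int subDim = codebooks[m][0].length;
            table[m] = new float[codebooks[m].length];
            for (int k = 0; k < codebooks[m].length; k++) {
                float sum = 0f;
                for (int d = 0; d < subDim; d++) {
                    float diff = query[offset + d] - codebooks[m][k][d];
                    sum += diff * diff;
                }
                table[m][k] = sum;
            }
            offset += subDim;
        }
        return table;
    }

    // Approximate squared distance to an encoded vector: one lookup per subspace.
    static float distance(float[][] table, byte[] encoded) {
        float sum = 0f;
        for (int m = 0; m < encoded.length; m++) {
            sum += table[m][encoded[m] & 0xFF];
        }
        return sum;
    }
}
```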

Add GH workflow for tests

add PQ unit tests

mvnw exec:exec@bench fails

$ ./mvnw exec:exec@bench
11:14:40,831 [INFO] Scanning for projects...
11:14:40,876 [INFO]
11:14:40,876 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
11:14:40,877 [INFO] Building jvector 1.0-SNAPSHOT
11:14:40,877 [INFO] --------------------------------[ jar ]---------------------------------
11:14:40,907 [INFO]
11:14:40,907 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
WARNING: Using incubator modules: jdk.incubator.vector
Heap space available is 34359738368
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to open file '/home/jonathan/Projects/jvector/hdf5/nytimes-256-angular.hdf5' . Is it a HDF5 file?
        at io.jhdf.HdfFile.<init>(HdfFile.java:228)
        at com.github.jbellis.jvector.example.Bench.load(Bench.java:149)
        at com.github.jbellis.jvector.example.Bench.gridSearch(Bench.java:245)
        at com.github.jbellis.jvector.example.Bench.main(Bench.java:232)
Caused by: java.nio.file.NoSuchFileException: hdf5/nytimes-256-angular.hdf5
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:308)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:367)
        at io.jhdf.HdfFile.<init>(HdfFile.java:188)
        ... 3 more
11:14:41,055 [ERROR] Command execution failed.

I note that mvnw clean install works and tests pass.

mvn:exec doesn't use JAVA_HOME

$ ./mvnw exec:exec@bench
10:58:11,915 [INFO] Scanning for projects...
10:58:12,000 [INFO]
10:58:12,000 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
10:58:12,002 [INFO] Building jvector 1.0-SNAPSHOT
10:58:12,002 [INFO] --------------------------------[ jar ]---------------------------------
10:58:12,069 [INFO]
10:58:12,070 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
Error occurred during initialization of boot layer
java.lang.module.FindException: Module jdk.incubator.vector not found
10:58:12,220 [ERROR] Command execution failed.

$ ./mvnw -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/jonathan/.m2/wrapper/dists/apache-maven-3.6.3-bin/1iopthnavndlasol9gbrbg6bf2/apache-maven-3.6.3
Java version: 20.0.1, vendor: Oracle Corporation, runtime: /home/jonathan/.jdks/openjdk-20.0.1

$ java -version
openjdk version "11.0.19" 2023-04-18

$ env |grep JAVA
JAVA_HOME=/home/jonathan/.jdks/openjdk-20.0.1/

I believe the problem is that pom.xml executes java instead of using JAVA_HOME. (As the output above shows, mvnw itself is using JAVA_HOME correctly.)

Change package to io.github.jbellis

This better aligns with the artifact groupId.

Rename CachingGraphIndex.BFS_DISTANCE?

This is a cosmetic thing, but always good to avoid confusion.

CachingGraphIndex.BFS_DISTANCE is a parameter used to populate GraphCache. Currently, we are actually performing DFS with a stack (the recursive call stack), rather than BFS with a queue.

Immediate possible "fixes":

Simplest thing is to rename BFS_DISTANCE to DFS_DISTANCE (or more generically just DISTANCE), and keep the current DFS implementation
Slightly more work is to keep the naming and re-implement our graph traversal using BFS. Our queue would hold (remaining_steps, node) pairs

Also, one other minor thing. The set of nodes that are distance = 0 from the entry node should correspond to precisely the entry node; this check should probably be a strict < 0 check.
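
The second option could look roughly like the sketch below: a queue of (remaining_steps, node) pairs, where nodes are marked visited on first discovery. The adjacency map and the BfsNeighborhood name are hypothetical stand-ins for whatever the graph view exposes.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BfsNeighborhood {
    // neighbors maps a node id to its outgoing edges (hypothetical adjacency representation).
    // Returns every node reachable from entry in at most `distance` hops.
    static Set<Integer> within(Map<Integer, List<Integer>> neighbors, int entry, int distance) {
        Set<Integer> visited = new HashSet<>();
        ArrayDeque<int[]> queue = new ArrayDeque<>(); // each element is {remainingSteps, node}
        visited.add(entry);
        queue.add(new int[] {distance, entry});
        while (!queue.isEmpty()) {
            int[] item = queue.poll();
            int remaining = item[0];
            if (remaining <= 0) {
                continue; // no steps left: keep the node but do not expand it
            }
            for (int next : neighbors.getOrDefault(item[1], List.of())) {
                if (visited.add(next)) {
                    queue.add(new int[] {remaining - 1, next});
                }
            }
        }
        return visited;
    }
}
```

In this sketch a distance of 0 yields exactly the entry node. Because BFS reaches each node by a shortest path first, the result does not depend on the order in which neighbors are listed.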

add back entrypoint updating once graph is built

Investigate kmeans stopping points

Currently we stop after max iterations or when < 1% of points change centroids.

Changed centroids is technically a bad measurement since the centroids themselves move. But maybe it's close enough?

Alternatively, the "correct" way is to check the residual distance from vectors to centroid.

We should find out

(0) How good does kmeans have to be to achieve the recall that we're looking for? Can we do fewer iterations and be effectively unchanged?
(1) What % of vectors are still changing centroids at that point?
(2) Is residual distance materially better in practice, given that it's more expensive to compute?
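
To compare the two criteria side by side, both can be computed per iteration as in this sketch. The 1% change threshold comes from the description above; the relative-tolerance form of the residual test and all names are assumptions made for illustration.

```java
final class KMeansStopping {
    // Fraction of points whose assigned centroid index differs between two iterations.
    static double changedFraction(int[] previous, int[] current) {
        int changed = 0;
        for (int i = 0; i < current.length; i++) {
            if (previous[i] != current[i]) {
                changed++;
            }
        }
        return (double) changed / current.length;
    }

    // Sum of squared distances from each point to its assigned centroid.
    static double residual(float[][] points, float[][] centroids, int[] assignment) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            float[] c = centroids[assignment[i]];
            for (int d = 0; d < c.length; d++) {
                double diff = points[i][d] - c[d];
                total += diff * diff;
            }
        }
        return total;
    }

    // Current rule: stop when fewer than `threshold` of the points changed centroids.
    static boolean stopByChanges(int[] previous, int[] current, double threshold) {
        return changedFraction(previous, current) < threshold;
    }

    // Alternative rule (assumed form): stop when the residual improved by at most relTol.
    static boolean stopByResidual(double previousResidual, double currentResidual, double relTol) {
        return previousResidual - currentResidual <= relTol * previousResidual;
    }
}
```

Logging both values each iteration would answer questions (1) and (2) directly, at the cost of one extra pass over the points for the residual.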

Packaging must produce a releasable javadoc jar

We do not currently produce a javadoc jar. A quick attempt to add one surfaced many issues building the javadocs. Resolve these and get the jar produced by packaging.

Add Java21 support

Merge OnDiskHnswGraphTest from C* into JVector

FullyConnected and RandomlyConnected classes may be useful. VectorCache and CachedLevel will not, we don't have levels anymore.

(RandomlyConnected is only used in C* VectorCacheTest but we could improve OnDiskHnswGraph by using that too, maybe?)

General documentation and API index

First off, great work!

It'd be very helpful if there were general documentation which helped map the theory and concepts to the class hierarchy or the main facades.

That may be augmented w/ potentially more examples and definitely an API index to browse.

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 32.82s,
PQ encode 14.51s,
Index   M=8 ef=60 PQ=false: top 100/1 recall 0.5968, build 8.41s, query 3.61s. -1 nodes visited
Index   M=8 ef=60 PQ=true: top 100/1 recall 0.5897, build 8.41s, query 9.32s. -1 nodes visited

After:

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 20.01s,
PQ encode 1.65s,
Index   M=8 ef=60 PQ=false: top 100/1 recall 0.5927, build 8.31s, query 3.65s. -1 nodes visited
Index   M=8 ef=60 PQ=true: top 100/1 recall 0.0005, build 8.31s, query 10.49s. -1 nodes visited

add Searcher Builder option withoutExactOrdering, see how much it improves Bench times

if withoutExactOrdering is specified, don't bother loading the vector from disk or re-sorting at the end

Add additional project README with basic instructions

We can expand on it as we go.
