Giter VIP home page Giter VIP logo

jvector's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

jvector's Issues

why does Bench OOM after several data files?

It sure looks to me like we retain no references to the DataSet after we're done with it, but on a 16GB heap I reliably enter GC hell by the 4th dataset. (In my current order, the 4th is Sift, which is a smaller set of data than Glove-200, which comes 3rd. So it's not just increasing data size.)

GraphIndex.View should be AutoCloseable

GraphIndex.getView returns a View interface that isn't AutoCloseable but the OnDiskGraphIndex.OnDiskView implementation of View is Autocloseable. This means that users of the GraphIndex.getView method can't do:

try (var view = graph.getView())
{
  ...
}

This is easy to workaround but annoying.

Modularize the maven build

This is some maven refactoring that will involve:

  • Move the current pom to be a 'parent' pom
  • Creation of a core directory which will be the library classes and related unit tests
  • Creation of a example directory with will be the current Bench and Sift classes
  • Update the readme to reflect the above

Both subdirs will need child poms.

find out how much we can compress openai embeddings vectors

right now Bench hardcodes dimensions at

        var pqDims = ds.baseVectors.get(0).length / 2;

because with the datasets from ann-benchmarks (nytimes-256, glove-200) this gives virtually no recall loss at overquery=2.

however, OpenAI's embeddings vectors are ridiculously overparameterized, we should be able to get more aggressive there

length / 4 ? length / 8? let's find out

JVector recall regression vs HNSW

HNSW (concurrent5 branch + hnswrecall)

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
HNSW   M=16 ef=100: top 100/1 recall 0.7170, build 37.37s, query 11.68s. 209069030 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
HNSW   M=16 ef=100: top 100/1 recall 0.6954, build 81.76s, query 8.32s. 241306820 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
HNSW   M=16 ef=100: top 100/1 recall 0.6329, build 142.23s, query 12.33s. 251840840 nodes visited

JVector

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index   M=16 ef=100: top 100/1 recall 0.6393, build 45.96s, query 10.84s. 165252200 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index   M=16 ef=100: top 100/1 recall 0.5328, build 133.32s, query 9.93s. 229974700 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index   M=16 ef=100: top 100/1 recall 0.3820, build 210.34s, query 11.57s. 183856980 nodes visited

mvn exec does not build before executing

so if you don't "clean install" manually first you will either get ClassNotFoundException (moderately confusing and bad) or it will silently run the last version that you built (very very bad).

Put the Disk in DiskANN

It may be useful to look at https://github.com/datastax/cassandra/tree/diskann, but this only added PQ to the existing HNSW structure. What we want to do here is implement full diskann, which means saving vectors as part of the graph structure, and fetching them as we perform BFS to minimize "seeks" (mmap paging-in) compared to reading them separately at the end of the search to re-rank in exact order.

Will also need to add a isApproximate() flag (or the opposite, isExact()) so that search knows when it needs to do the re-rank.

I think the right place for this is NeighborSimilarity.

Comparison of DiskANN vs Lucene HNSW

Need to use the actual Lucene disk classes for this. Have sample code deep in old vsearch history, will dig it up. Using my concurrent5 Lucene branch is kind of cheating but I think it's okay vs spending 32x as much time waiting for serial Lucene to build the graph, the actual graphs produced are nearly equivalent.

Goal is to build Deep100M index on machine with enough memory (64GB should be enough) and then serve it from disk on either a smaller VM or a container. I could be wrong but I don't think we can keep it from using all the ram as disk cache otherwise.

Add info line when vector api is correctly enabled

Currently we have a warning when it's NOT enabled but nothing when it IS enabled (so, running w/ enabled is indistinguishable from running on an unsupported JDK which is also silent)

In Lucene it looks like

INFO: Java vector incubator API enabled; uses preferredBitSize=256

Experiment with 4-bit PQ

Interested to see how much accuracy we lose if we do 4-bit PQ (two PQ entries per byte) intead of 8-.

Automate releases as GitHub workflow

We should automate the release process once we're comfortable with the results. I'd prefer a workflow that runs when an appropriate tag is pushed, but I'm open to other options.

mvnw exec:exec@bench fails

$ ./mvnw exec:exec@bench
11:14:40,831 [INFO] Scanning for projects...
11:14:40,876 [INFO]
11:14:40,876 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
11:14:40,877 [INFO] Building jvector 1.0-SNAPSHOT
11:14:40,877 [INFO] --------------------------------[ jar ]---------------------------------
11:14:40,907 [INFO]
11:14:40,907 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
WARNING: Using incubator modules: jdk.incubator.vector
Heap space available is 34359738368
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to open file '/home/jonathan/Projects/jvector/hdf5/nytimes-256-angular.hdf5' . Is it a HDF5 file?
        at io.jhdf.HdfFile.<init>(HdfFile.java:228)
        at com.github.jbellis.jvector.example.Bench.load(Bench.java:149)
        at com.github.jbellis.jvector.example.Bench.gridSearch(Bench.java:245)
        at com.github.jbellis.jvector.example.Bench.main(Bench.java:232)
Caused by: java.nio.file.NoSuchFileException: hdf5/nytimes-256-angular.hdf5
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:308)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:367)
        at io.jhdf.HdfFile.<init>(HdfFile.java:188)
        ... 3 more
11:14:41,055 [ERROR] Command execution failed.

I note that mvnw clean install works and tests pass.

mvn:exec doesn't use JAVA_HOME

$ ./mvnw exec:exec@bench
10:58:11,915 [INFO] Scanning for projects...
10:58:12,000 [INFO]
10:58:12,000 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
10:58:12,002 [INFO] Building jvector 1.0-SNAPSHOT
10:58:12,002 [INFO] --------------------------------[ jar ]---------------------------------
10:58:12,069 [INFO]
10:58:12,070 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
Error occurred during initialization of boot layer
java.lang.module.FindException: Module jdk.incubator.vector not found
10:58:12,220 [ERROR] Command execution failed.

$ ./mvnw -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/jonathan/.m2/wrapper/dists/apache-maven-3.6.3-bin/1iopthnavndlasol9gbrbg6bf2/apache-maven-3.6.3
Java version: 20.0.1, vendor: Oracle Corporation, runtime: /home/jonathan/.jdks/openjdk-20.0.1

$ java -version
openjdk version "11.0.19" 2023-04-18

$ env |grep JAVA
JAVA_HOME=/home/jonathan/.jdks/openjdk-20.0.1/

I believe the problem is that pom.xml executes java instead of using JAVA_HOME. (As the output above shows, mvnw itself is using JAVA_HOME correctly.)

Rename CachingGraphIndex.BFS_DISTANCE?

This is a cosmetic thing, but always good to avoid confusion.

CachingGraphIndex.BFS_DISTANCE is a parameter used to populate GraphCache. Currently, we are actually performing DFS with a stack (the recursive call stack), rather than BFS with a queue.

Immediate possible "fixes":

  • Simplest thing is rename BFS_DISTANCE to DFS_DISTANCE (or more generically just DISTANCE) , and keep the current DFS implementation
  • Slightly more work is to keep the naming and re-implement our graph traversal using BFS. Our queue would hold (remaining_steps, node) pairs

Also, one other minor thing. The set of nodes that are distance = 0 from the entry node should correspond to precisely the entry node; this check should probably be a strict < 0 check.

Investigate kmeans stopping points

Currently we stop after max iterations or when < 1% of points change centroids.

Changed centroids is technically a bad measurement since the centroids themselves move. But maybe it's close enough?

Alternatively, the "correct" way is to check the residual distance from vectors to centroid.

We should find out

(0) How good does kmeans have to be to achieve the recall that we're looking for? Can we do fewer iterations and be effectively unchanged?
(1) What % of vectors are still changing centroids at that point?
(2) Is residual distance materially better in practice, given that it's more expensive to compute?

Merge OnDiskHnswGraphTest from C* into JVector

FullyConnected and RandomlyConnected classes may be useful. VectorCache and CachedLevel will not, we don't have levels anymore.

(RandomlyConnected is only used in C* VectorCacheTest but we could improve OnDiskHnswGraph by using that too, maybe?)

General documentation and API index

First off, great work!


It'd be very helpful if there were general documentation which helped map the theory and concepts to the class hierarchy or the main facades.

That may be augmented w/ potentially more examples and definitely an API index to browse.

Performance improvements broke PQ

before 2b03f5f

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 32.82s,
PQ encode 14.51s,
Index   M=8 ef=60 PQ=false: top 100/1 recall 0.5968, build 8.41s, query 3.61s. -1 nodes visited
Index   M=8 ef=60 PQ=true: top 100/1 recall 0.5897, build 8.41s, query 9.32s. -1 nodes visited

After:

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 20.01s,
PQ encode 1.65s,
Index   M=8 ef=60 PQ=false: top 100/1 recall 0.5927, build 8.31s, query 3.65s. -1 nodes visited
Index   M=8 ef=60 PQ=true: top 100/1 recall 0.0005, build 8.31s, query 10.49s. -1 nodes visited

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.