Giter VIP home page Giter VIP logo

ssg's Introduction

SSG : Satellite System Graph For Approximate Nearest Neighbor Search

SSG is a graph-based approximate nearest neighbor search (ANNS) algorithm. It provides a flexible and efficient solution for the metric-free large-scale ANNS on dense real vectors. It implements the algorithm of our paper: Satellite System Graph: Towards the Efficiency Up-Boundary of Graph-Based Approximate Nearest Neighbor Search

Benchmark datasets

Data set Download dimension nb base vectors nb query vectors original website
SIFT1M original website 128 1,000,000 10,000 original website
GIST1M original website 128 1,000,000 1,000 original website
Crawl crawl.tar.gz (1.7GB) 300 1,989,995 10,000 original website
GloVe-100 glove-100.tar.gz (424MB) 100 1,183,514 10,000 original website
Deep100M deep100m.tar.gz* (34GB) 96 100,000,000 10,000 original website
  • For Deep100M we will provide the download link upon request

ANNS performance

The performance was tested without parallelism. Among all the graph-based algorithms, NSG and SSG has the smallest index size.

Performance

Compared Algorithms:

Tree-based

Quantization-based

Graph-based

Please see our NSG paper for the performance of other graph-based algorithms - FANNG.

How to use

Compile

  • Prerequisite : openmp, cmake, boost
  • Compile:
    1. Go to the root directory of faiss, it's under the directory of extern_libraries aside of ours.
    2. Execute the following commands:
$ cd /path/to/project
$ mkdir -p build && cd build
$ cmake .. && make -j

Building SSG Index

The main interfaces and classes have its respective test codes under directory tests/.

Please follow the instructions below to build the SSG index.

a) Build a kNN graph

Firstly, we need to prepare a kNN graph.

We suggest you use our efanna_graph to build this kNN graph. But you can also use any alternatives you like, such as KGraph or faiss.

b) Convert the kNN graph to an SSG

For example:

$ cd /path/to/project/build/tests/
$ ./test_ssg_index data_path knn_graph_path L R Angle ssg_path
  • data_path is the path of the origin data.
  • knn_graph_path is the path of the pre-built kNN graph.
  • L controls the quality of the NSG, the larger the better, L > R.
  • R controls the index size of the graph, the best R is related to the intrinsic dimension of the dataset.
  • Angle controls the angle between two edges.
  • ssg_path is the path where the result SSG stored.

Approximate Nearest Neighbor Search using SSG

For example:

$ cd /path/to/project/build/tests/
$ ./test_ssg_optimized_search data_path query_path ssg_path search_L search_K result_path [random_seed]
  • data_path is the path of the origin data.
  • query_path is the path of the query data.
  • ssg_path is the path of the pre-built SSG.
  • search_L controls the quality of the search results, the larger the better but slower (must larger than search_K).
  • search_K controls the number of neighbors we want to find.
  • result_path is the path of the result neighbors.
  • random_seed (optional) is the random seed.

NOTE: For now, we only provide interface for search for only one query at a time, and test the performance with single thread.

NOTE: Data alignment is essential for the correctness of our procedure, because we use SIMD instructions for acceleration of numerical computing such as AVX and SSE2. You should use it to ensure your data elements (feature) is aligned with 8 or 16 int or float. For example, if your features are of dimension 70, then it should be extend to dimension 72. And the last 2 dimension should be filled with 0 to ensure the correctness of the distance computing. And this is what data_align() does.

NOTE: Only data-type int32 and float32 are supported for now.

Python API

Install

$ cd /path/to/project/
$ python setup.py install

Usage

NOTE: currently Python API only supports search method.

import numpy as np
import pyssg

data = np.fromfile("/path/to/sift_base.fvecs", dtype=np.float32)
dim = data[0].view(np.int32)
data = data.reshape(-1, dim + 1)
data = np.ascontiguousarray(data[:, 1:])
ndata, dim = data.shape

pyssg.set_seed(1234)
index = pyssg.IndexSSG(dim, ndata)
index.load("/path/to/ssg", data)

k, l = 100, 300
query = np.random.randn(dim).astype(np.float32)
knn = index.search(query, k, l)
print(knn)

Please refer to tests/test_python_query.py for a real-world example.

Parameters used in Our Paper

SSG Building

We use the following parameters to get the SSG index in Fig. 6 of our paper.

We use efanna_graph to build the kNN graph

Step 1. Build kNN Graph

Dataset K L iter S R
SIFT1M 200 200 12 10 100
GIST1M 400 400 12 15 100
Crawl 400 420 12 15 100
GloVe-100 400 420 12 20 200
  • Commands:
$ efanna_graph/tests/test_nndescent sift.fvecs sift_200nn.knng 200 200 12 10 100
$ efanna_graph/tests/test_nndescent gist.fvecs gist_400nn.knng 400 400 12 15 100
$ efanna_graph/tests/test_nndescent crawl.fvecs crawl_400nn.knng 400 420 12 15 100
$ efanna_graph/tests/test_nndescent glove-100.fvecs glove-100_400nn.knng 400 420 12 15 200

Step 2. Convert kNN Graph to SSG

  • Parameters:
Dataset L R Angle
SIFT1M 100 50 60
GIST1M 500 70 60
Crawl 500 40 60
GloVe-100 500 50 60
  • Commands:
$ ./test_ssg_index sift.fvecs sift_200nn.knng 100 50 60 sift.ssg
$ ./test_ssg_index gist.fvecs gist_400nn.knng 500 70 60 gist.ssg
$ ./test_ssg_index crawl.fvecs crawl_400nn.knng 500 40 60 crawl.ssg
$ ./test_ssg_index glove-100.fvecs glove-100_400nn.knng 500 50 60 glove-100.ssg

SSG Search

  • search_L: range from search_K to 2000
  • random_seed: 161803398

Pre-built kNN Graph and NSG Index

Here we provide our pre-built kNN graphs and SSG index files used in our papar's experiments.

Dataset kNN Graph SSG Index
SIFT1M sift_200nn.knng sift.ssg
GIST1M gist_400nn.knng gist.ssg
Crawl crawl_400nn.knng crawl.ssg
GloVe-100 glove_400nn.knng glove.ssg

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.