Giter VIP home page Giter VIP logo

annbench's Introduction

annbench: a lightweight benchmark for approximate nearest neighbor search

annbench is a simple benchmark for approximate nearest neighbor search algorithms in Python. This repository design is strongly influenced by a great project, ann-benchmarks, that provides comprehensive and thorough benchmarks for various algorithms. In contrast, we aim to deliver more lightweight and straightforward scripts with the following features.

  • Support Euclidean distance only
  • Support Recall@1 only
  • Support libraries installable via pip/conda only
  • Search with a single thread
  • Sweep by a single query parameter
  • AWS Deep Learning AMI, c5.4xlarge (Aug 28, 2021)

Getting started

git clone https://github.com/matsui528/annbench.git
cd annbench
pip install -r requirements.txt
# conda install faiss-cpu -y -c pytorch  # If you'd like to try faiss, run this on anaconda
# conda install faiss-gpu -y -c pytorch  # or, if you have GPUs, install faiss-gpu

python download.py dataset=siftsmall  # Downloaded on ./dataset

python run.py dataset=siftsmall algo=annoy  # Indices are on ./interim. Results are on ./output

python plot.py   # Plots are on ./result_img

Run all algorithms on all dataset

# Downloading deep1m takes some hours
python download.py --multirun dataset=siftsmall,sift1m,deep1m

# Will take some hours
python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,ivfpq4bit,scann,pq,pq4bit
# Or, if you have GPUs, 
# python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,linear_gpu,ivfpq_gpu,ivfpq4bit,scann,pq,pq4bit

python plot.py

How it works

Config

  • All config files are on ./conf. You can edit the config files to change parameters.

Download

  • Download a target dataset by python download.py dataset=DATASET. Several datasets can be downloaded at once by python download.py --multirun dataset=DATASET1,DATASET2,DATASET3. See hydra for more detailed APIs for multirun.
  • The data will be placed on ./dataset.

Run

  • Evaluate a target algorithm (ALGO) with a target dataset (DATASET) by python run.py dataset=DATASET algo=ALGO. You can run multiple algorithms on multiple datasets by python run.py --multirun dataset=DATASET1,DATASET2 algo=ALGO1,ALGO2.
  • Indices (data structures for search) are stored on ./interim. They are reused for each run with different query parameters.
  • The search result will be on ./output.
  • By default, we run each algorithm num_trial=10 times and return the average runtime. You can change this by: python run.py num_trial=5

Plot

  • You can visualize the search result by python plot.py. This script checks ./output and generate figures for each dataset on ./result_img.
  • In order not to print query parameters, you can set the flag false: python plot.py with_query_param=false.
  • To change the size of the image: python plot.py width=15 height=10

Log

  • When running run.py or plot.py, the output files will be on ./log as well. For example with python run.py algo=annoy dataset=siftsmall, the result file will be saved on (1) ./output/siftsmall/annoy/result.yaml and (2) ./log/2020-03-11/22-30-59/0/result.yaml.

Supported datasets

dataset dim #base #query #train misc
siftsmall 128 10,000 100 25,000 A toy dataset for hello-world
sift1m 128 1,000,000 10,000 100,000
deep1m 96 1,000,000 10,000 100,000 The first 1M vectors of Deep1B. Hepler scripts

Supported Algorithms

Add a new algorithm

  • Write a wrapper class on ./annbench/algo. The class must inherit BaseANN class. See annoy.py for examples.
  • Update ./annbench/algo/proxy.py
  • Add the name of the library on requirements.txt.
  • Add a config file on ./conf/algo.
  • Make sure the algorithm runs on a single thread

Add a new dataset

Advanced

Evaluation criteria

Index/query parameters

  • We define a simple guideline to set parameters. An algorithm has to have several index parameters and a single query parameter. For one set of index parameters, one index (data structure) is built. For this index, we run the search by sweeping the query parameter.
  • For example with ivfpq, let us consider the following index parameters:
    param_index={"M": 8, "nlist": 100}
    With these parameters, one index (let us denote ivfpq(M=8, nlist=100)) is created. This index is stored in the disk as M8_nlist100.bin, where the way of naming is defined in the function stringify_index_param. Here, a query parameter is defined as:
    param_query={"nprobe": [1, 2, 4, 8, 16]}
    In the search step, the index is read from the disk onto the memory first. Then we run the search five times, with for nprobe in [1, 2, 4, 8, 16]. This creates five results (five pairs of (recall, runtime)). By connecting these results, one polyline is drawn on the final plot.
  • Note that you must sort the values of the above query parameter. If you forget to sort (e.g., [1, 4, 2, 8, 16]), the final graph would become weird.

Specialization

  • Index/query parameters for each algorithm is defined in ./conf/algo/. These parameters are used for all datasets by default. If you'd like to specialize parameters for a specific dataset, you can define the specialized version in ./conf/specialized_param/.
  • For example, the default parameters for ivfpq is defined here, where nlist=100. You can set nlist=1000 for the sift1m dataset by adding a config file here

Dynamic configuration from the command line

  • In addition to editing the config files, you can override values from the commandline thanks to hydra.
  • Examples:
    • Writing index structures on SOMEWHEE/LARGE_HDD/:
      python run.py interim=SOMEWHERE/LARGE_HDD/interim
    • Run the ivfpq algorithm with different query parameters:
      python run.py algo=ivfpq dataset=siftsmall param_query.nprobe=[1,5,25]

Reference

Contribute

  • Feel free to open a pull request

Author

annbench's People

Contributors

matsui528 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.