microsoft / sptag

A distributed approximate nearest neighbor search (ANN) library that provides high-quality vector index build, search, and distributed online serving toolkits for large-scale vector search scenarios.

License: MIT License

CMake 1.52% C++ 90.90% Dockerfile 0.05% C 0.04% SWIG 0.84% Cuda 2.41% Python 4.24%

space-partition-tree neighborhood-graph vector-search fresh-update distributed-serving approximate-nearest-neighbor-search

sptag's Introduction

SPTAG: A library for fast approximate nearest neighbor search

SPTAG

SPTAG (Space Partition Tree And Graph) is a library for large-scale approximate nearest neighbor search over vectors, released by Microsoft Research (MSR) and Microsoft Bing.

What's NEW

Result Iterator with Relaxed Monotonicity Signal Support
New Research Paper SPFresh: Incremental In-Place Update for Billion-Scale Vector Search - published in SOSP 2023
New Research Paper VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity - published in OSDI 2023

Introduction

This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distance or cosine distance. The vectors returned for a query vector are those with the smallest L2 or cosine distance to the query vector.
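As a point of reference, the two distances can be computed by brute force with NumPy. This is a minimal sketch for illustration only, not SPTAG code: the function names `l2_distance`, `cosine_distance` and `brute_force_knn` are hypothetical, and cosine distance is taken here as 1 minus cosine similarity, which is one common convention and may differ from SPTAG's internal definition.

```python
import numpy as np

def l2_distance(data, query):
    """Euclidean (L2) distance from each row of data to query."""
    return np.linalg.norm(data - query, axis=1)

def cosine_distance(data, query):
    """1 - cosine similarity between each row of data and query (assumed convention)."""
    sims = data @ query / (np.linalg.norm(data, axis=1) * np.linalg.norm(query))
    return 1.0 - sims

def brute_force_knn(data, query, k, metric=l2_distance):
    """Return the indices of the k rows with the smallest distance to query."""
    dists = metric(data, query)
    return np.argsort(dists, kind="stable")[:k]

data = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], dtype=np.float32)
query = np.array([1.0, 0.1], dtype=np.float32)
nearest_l2 = brute_force_knn(data, query, k=2)
nearest_cos = brute_force_knn(data, query, k=2, metric=cosine_distance)
```

Note that the two metrics can disagree: under L2 the rows (1, 0) and (1, 1) are closest to the query, while under cosine distance the rows (1, 0) and (2, 0) are, because they point in the same direction.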

SPTAG provides two methods: kd-tree with relative neighborhood graph (SPTAG-KDT) and balanced k-means tree with relative neighborhood graph (SPTAG-BKT). SPTAG-KDT has the advantage in index building cost, while SPTAG-BKT has the advantage in search accuracy on very high-dimensional data.

How it works

SPTAG is inspired by the NGS approach [WangL12]. It contains two basic modules: an index builder and a searcher. The relative neighborhood graph (RNG) is built on the k-nearest neighborhood graph [WangWZTGL12, WangWJLZZH14] to boost connectivity. Balanced k-means trees are used in place of kd-trees to avoid the inaccurate distance bound estimation of kd-trees for very high-dimensional vectors. A search begins in the space partition trees to find several seeds, which then start the search in the RNG. The searches in the trees and the graph are conducted iteratively.
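The seed-then-graph idea can be illustrated with a toy best-first search over a neighbor graph. This is only a sketch under simplifying assumptions: the graph is a plain dict of adjacency lists, the seeds are passed in directly rather than found by a tree, and `max_check` is a hypothetical budget on distance evaluations. It is not SPTAG's actual search routine.

```python
import heapq
import numpy as np

def graph_search(vectors, graph, query, seeds, k, max_check=64):
    """Best-first search: expand the closest unexpanded node until the budget is spent."""
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = set(seeds)
    frontier = [(dist(s), s) for s in seeds]   # min-heap of candidates to expand
    heapq.heapify(frontier)
    results = list(frontier)                   # every node whose distance was computed
    checked = len(seeds)
    while frontier and checked < max_check:
        _, node = heapq.heappop(frontier)
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = dist(nb)
            checked += 1
            heapq.heappush(frontier, (d, nb))
            results.append((d, nb))
    return [i for _, i in sorted(results)[:k]]

# Toy data: points on a line, each linked to its immediate neighbors.
vectors = np.arange(10, dtype=np.float32).reshape(-1, 1)
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
found = graph_search(vectors, graph, np.array([6.2], dtype=np.float32), seeds=[0], k=2)
```

Starting from a poor seed (node 0), the search walks along the graph edges toward the query and ends with nodes 6 and 7 as the two nearest neighbors. Better seeds shorten that walk, which is the role the trees play in the description above.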

Highlights

Fresh update: Support online vector deletion and insertion
Distributed serving: Search over multiple machines

Build

Requirements

swig >= 4.0.2
cmake >= 3.12.0
boost >= 1.67.0

Fast clone

set GIT_LFS_SKIP_SMUDGE=1
git clone --recurse-submodules https://github.com/microsoft/SPTAG

OR

git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
git config --global filter.lfs.process "git-lfs filter-process --skip"

Install

For Linux:

mkdir build
cd build && cmake .. && make

It will generate a Release folder in the code directory which contains all the build targets.

For Windows:

mkdir build
cd build && cmake -A x64 ..

It will generate a SPTAGLib.sln in the build directory. Compiling the ALL_BUILD project in Visual Studio (2019 or later) will generate a Release directory which contains all the build targets.

For detailed instructions on installing Windows binaries, please see here

Using Docker:

docker build -t sptag .

This builds a Docker image with the binaries in /app/Release/.

Verify

Run the SPTAGTest (or Test.exe) in the Release folder to verify all the tests have passed.

Usage

Detailed usage can be found in Get started. There is also an end-to-end tutorial for building a vector search online service using the Python wrapper in Python Tutorial. Detailed parameter tuning can be found in Parameters.

References

Please cite SPTAG in your publications if it helps your research:

@inproceedings{xu2023spfresh,
  title={SPFresh: Incremental In-Place Update for Billion-Scale Vector Search},
  author={Xu, Yuming and Liang, Hengyu and Li, Jin and Xu, Shuotao and Chen, Qi and Zhang, Qianxi and Li, Cheng and Yang, Ziyue and Yang, Fan and Yang, Yuqing and others},
  booktitle={Proceedings of the 29th Symposium on Operating Systems Principles},
  pages={545--561},
  year={2023}
}

@inproceedings{zhang2023vbase,
  title={$\{$VBASE$\}$: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity},
  author={Zhang, Qianxi and Xu, Shuotao and Chen, Qi and Sui, Guoxin and Xie, Jiadong and Cai, Zhizhen and Chen, Yaoqi and He, Yinxuan and Yang, Yuqing and Yang, Fan and others},
  booktitle={17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)},
  year={2023}
}

@inproceedings{ChenW21,
  author = {Qi Chen and 
            Bing Zhao and 
            Haidong Wang and 
            Mingqin Li and 
            Chuanjie Liu and 
            Zengzhong Li and 
            Mao Yang and 
            Jingdong Wang},
  title = {SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search},
  booktitle = {35th Conference on Neural Information Processing Systems (NeurIPS 2021)},
  year = {2021}
}

@manual{ChenW18,
  author    = {Qi Chen and
               Haidong Wang and
               Mingqin Li and 
               Gang Ren and
               Scarlett Li and
               Jeffery Zhu and
               Jason Li and
               Chuanjie Liu and
               Lintao Zhang and
               Jingdong Wang},
  title     = {SPTAG: A library for fast approximate nearest neighbor search},
  url       = {https://github.com/Microsoft/SPTAG},
  year      = {2018}
}

@inproceedings{WangL12,
  author    = {Jingdong Wang and
               Shipeng Li},
  title     = {Query-driven iterated neighborhood graph search for large scale indexing},
  booktitle = {ACM Multimedia 2012},
  pages     = {179--188},
  year      = {2012}
}

@inproceedings{WangWZTGL12,
  author    = {Jing Wang and
               Jingdong Wang and
               Gang Zeng and
               Zhuowen Tu and
               Rui Gan and
               Shipeng Li},
  title     = {Scalable k-NN graph construction for visual descriptors},
  booktitle = {CVPR 2012},
  pages     = {1106--1113},
  year      = {2012}
}

@article{WangWJLZZH14,
  author    = {Jingdong Wang and
               Naiyan Wang and
               You Jia and
               Jian Li and
               Gang Zeng and
               Hongbin Zha and
               Xian{-}Sheng Hua},
  title     = {Trinary-Projection Trees for Approximate Nearest Neighbor Search},
  journal   = {{IEEE} Trans. Pattern Anal. Mach. Intell.},
  volume    = {36},
  number    = {2},
  pages     = {388--403},
  year      = {2014}
}

Contribute

This project welcomes contributions and suggestions from all the users.

We use GitHub issues for tracking suggestions and bugs.

License

The entire codebase is under MIT license


sptag's Issues

Error when make.

Thanks for this great project, I like it so much!
When I run cd build && cmake .. && make, I get an error at 86%.

As the error message says, this is caused by SPTAG.py being missing. I also found that SPTAG.py is listed in .gitignore.
I don't know how to solve this. Do you have any ideas?
Thank you so much.

Could you explain the algorithm of inserting and its approximate time complexity?

In particular, will performance be compromised if the added points are imbalanced?

Hash table of visited nodes for NGQueue

Describe the bug
Following #65. The size of the hash table of visited nodes for NGQueue is set to 16394 ("two" hash tables of size 8192). However, when searching a large data set with "MaxCheck" set to a relatively large number (e.g. 16394), the hash table fills up during the search. After that, every node that is not in the hash table (whether visited for the first time or not) results in a -1 return value from the _CheckAndSet function in WorkSpace.h. Since the CheckAndSet function only checks whether the return value from _CheckAndSet is equal to 0, all nodes with a -1 return value are added to the NGQueue. This causes repeated searches and inaccurate (i.e. duplicated) results, not only during the actual search but also during the refine stage of the build process (which uses search results to update the RNG).

To Reproduce
Steps to reproduce the behavior:

Build with a large data set. (Theoretically, any data set larger than 16394 should be able to reproduce this with extreme parameters, but a large data set, e.g. 1M, makes it easier to reproduce.)
The behavior is already reproduced during the refine stage of the build process.
Search with "MaxCheck" set to a number larger than 16394.
The search result may contain duplicates.

Expected behavior
Either:
The hash table is large enough to store all the visited nodes.
Or:
The -1 return value from the _CheckAndSet function, or the situation in which the hash table becomes full, is handled properly.

Desktop (please complete the following information):

OS: Windows 10
Version : Current master version

Additional context
I have tried setting m_poolSize to a larger number (e.g. 65535). This does solve the problem (with "MaxCheck" set to 16394), but the problem still occurs with a much larger "MaxCheck". Because of memory usage and search efficiency, m_poolSize should not simply be set to a very large number.
Maybe set m_poolSize according to "MaxCheckForRefineGraph" during the build stage, and according to "MaxCheck" during the search stage.
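The failure mode described in this report can be sketched with a toy fixed-capacity visited table. This is an illustrative model only: the class name `VisitedTable`, its linear probing, the overflow set, and the exact meaning of the return codes (1 for newly recorded, 0 for already present, -1 for table full) are assumptions chosen to match the behavior described here, not a reproduction of the code in WorkSpace.h.

```python
class VisitedTable:
    """Fixed-capacity open-addressing table with linear probing (toy model)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.overflow = set()  # hypothetical fallback used once the table is full

    def _check_and_set(self, node):
        """Assumed codes: 1 = newly recorded, 0 = already present, -1 = table full."""
        n = len(self.slots)
        for step in range(n):
            idx = (node + step) % n
            if self.slots[idx] is None:
                self.slots[idx] = node
                return 1
            if self.slots[idx] == node:
                return 0
        return -1

    def visited_naive(self, node):
        # Only tests for 0, so a full table (-1) looks like "not visited".
        return self._check_and_set(node) == 0

    def visited_safe(self, node):
        # Handles -1 explicitly by spilling into an unbounded set.
        code = self._check_and_set(node)
        if code == -1:
            if node in self.overflow:
                return True
            self.overflow.add(node)
            return False
        return code == 0

def count_enqueued(is_visited, nodes):
    """Count how many nodes would be pushed to the queue (reported as unvisited)."""
    return sum(1 for node in nodes if not is_visited(node))

stream = [0, 1, 2, 3, 4, 5, 4, 5, 4, 5]   # nodes 4 and 5 repeat after the table fills
naive = count_enqueued(VisitedTable(4).visited_naive, stream)
safe = count_enqueued(VisitedTable(4).visited_safe, stream)
```

With a capacity of 4, the naive check enqueues all 10 entries of the stream, including every repeat of nodes 4 and 5, while the variant that handles -1 enqueues only the 6 distinct nodes. An unbounded fallback trades memory for correctness, which is the same trade-off raised above for m_poolSize.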

ImportError: dynamic module does not define module export function (PyInit__SPTAG)

Describe the bug
I installed SPTAG on Ubuntu 18.04 and am trying to run the sample code here.
Also added this snippet based on a different issue:

import sys
sys.path.append('./Release')
import SPTAG
import numpy as np

n = 100
k = 3
r = 3
: 
:

Additional context
I get this error message:

  File "test_ann.py", line 3, in <module>
    import SPTAG
  File "./Release/SPTAG.py", line 15, in <module>
    import _SPTAG
ImportError: dynamic module does not define module export function (PyInit__SPTAG)

what am I doing wrong?

Unable to use the C# wrapper

I was trying to use the C# wrapper and faced some issues. Here are the things I tried:

Built the CSHARPSPTAG.dll compiling the CSHARPSPTAG project in VS17. But I was unable to add reference to this dll directly to a C# project. It said

A reference to 'path\to\CSHARPSPTAG.dll' could not be added. Please make sure that the file is accessible, and that it is a valid assembly or COM component.

Following this and this, I created a new C# DLL project with the swig generated *.cs files to make a .NET dll that I can add reference to. I then added SPTAGLib.dll and CSHARPSPTAG.dll in the \bin\Release\ dir of my project. This time, I was able to get it compiled, but when I ran it, it threw an exception

System.TypeInitializationException was unhandled
Message: An unhandled exception of type 'System.TypeInitializationException' occurred in ConsoleTest.exe
Additional information: The type initializer for 'CSHARPSPTAGPINVOKE' threw an exception.

I also tried adding all the TBB related dlls into the bin of my C# project but none of it seems to work.

Can any of you share how you managed to make it work?

Typo in README :)

In this sentence:
"SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relatrive neighborhood graph (SPTAG-BKT)," "relatrive" should be "relative."

Thanks for open-sourcing!

ImportError: dynamic module does not define init function (init_SPTAG)

When I run the official test python code, I get an error:
Traceback (most recent call last):
  File "testSPTAG.py", line 4, in <module>
    import SPTAG
  File "SPTAG/Release/SPTAG.py", line 17, in <module>
    _SPTAG = swig_import_helper()
  File "SPTAG/Release/SPTAG.py", line 16, in swig_import_helper
    return importlib.import_module('_SPTAG')
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: dynamic module does not define init function (init_SPTAG)
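Both this error and the `PyInit__SPTAG` one reported earlier name the entry point that the running interpreter looked for in the compiled extension. CPython 2 looks for `init<name>` and CPython 3 looks for `PyInit_<name>`, so either message usually suggests that the `_SPTAG` binary was built against a different major Python version than the one importing it. The sketch below only prints what the current interpreter expects; the helper name `expected_init_symbol` is hypothetical.

```python
import sys

def expected_init_symbol(module_name, major=sys.version_info[0]):
    """Name of the init function CPython looks up when importing an extension module."""
    prefix = "PyInit_" if major >= 3 else "init"
    return prefix + module_name

print("Interpreter:", sys.version.split()[0])
print("Expects symbol:", expected_init_symbol("_SPTAG"))
```

If the printed symbol differs from the one the build produced, rebuilding the wrapper against the interpreter that will import it is the likely remedy.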

How does SPTAG solve the false-negative issue?

A well-documented drawback of LSH, in general, is the 'false-negative' issue, whereby you do not know truly whether all nearest neighbors were actually considered at query time.

Approaches to solving this in other libraries have made good headway - how does SPTAG handle this specific issue?

Improve exception safety with smart pointers

Would you like to wrap any pointer data members with the class template “std::unique_ptr”?

Pip installable python wrapper / KNN graph construction

To benchmark against other python ann search methods, would be nice to have some pip installable package that would make it easier to set-up and run these search methods.

Excellent package. Thanks!

About the performance in my test.

Hey, thanks for this great project.
I want to use this algorithm for face feature search. I tried another algorithm called hnswlib, which is also fast but static: it cannot delete.
So I compared three methods: SPTAG.AnnIndex("BKT"), hnswlib, and brute-force search (np.argmax).
The following table shows each method's elapsed time. Each run queries the 3 most similar features for 1000 features among 26458 features.

algorithm	SPTAG.AnnIndex("BKT")	hnswlib	brute-force search
cost time (ms)	3837.684	177.665	3636.569

As the table shows, SPTAG.AnnIndex("BKT") is not faster than brute-force search; there is no speedup.
This is my SPTAG.AnnIndex("BKT") test code:

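from datetime import datetime
import SPTAG

@calc_time  # the reporter's own timing decorator; its definition is not shown
def testSearch(index, q, k):
    j = SPTAG.AnnIndex.Load(index)   # load once, outside the timed loop
    _t0 = datetime.now()
    for t in range(q.shape[0]):
        result = j.Search(q[t].tobytes(), k)
    _t1 = datetime.now()
    print("Search time is {} ms".format(1000*(_t1-_t0).total_seconds()))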

I don't know if this is normal, or my test code is wrong?
Thank you.
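When comparing against brute force, it helps to time every method through the same harness and to keep index loading outside the timed region. The sketch below is one way to do that with NumPy only. `time_queries` and `brute_force_topk` are hypothetical helpers, inner product is assumed as the similarity (in line with the `np.argmax` baseline mentioned in this report), the data set and query counts mirror the numbers above, and the 128 dimensions are an arbitrary choice.

```python
import time
import numpy as np

def brute_force_topk(data, query, k):
    """Indices of the k rows with the largest inner product with query, best first."""
    scores = data @ query
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k in linear time
    return top[np.argsort(-scores[top])]

def time_queries(search, queries):
    """Run search(q) for every query and return (results, elapsed milliseconds)."""
    start = time.perf_counter()
    results = [search(q) for q in queries]
    return results, 1000.0 * (time.perf_counter() - start)

rng = np.random.default_rng(0)
data = rng.random((26458, 128), dtype=np.float32)
queries = rng.random((1000, 128), dtype=np.float32)
results, elapsed_ms = time_queries(lambda q: brute_force_topk(data, q, 3), queries)
print("brute force: {:.1f} ms for {} queries".format(elapsed_ms, len(queries)))
```

Any other method can be passed to `time_queries` as a callable taking one query, so the per-query Python overhead is the same for every method being compared.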

It seems that the current version of SPTAG only supports np.float32.

When I use np.float64, I changed the SPTAG.AnnIndex() datatype parameter from 'Float' (used in the test demo) to 'Double', 'Float64', 'float64', and 'double', but all failed.
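One possible workaround, if the reduced precision is acceptable, is to convert the array to `np.float32` before handing it to the index and keep the 'Float' value type used in the demo. The sketch below shows only the NumPy conversion and the resulting buffer size; it makes no SPTAG calls, and which other value types the library accepts is not addressed here.

```python
import numpy as np

x64 = np.random.default_rng(0).random((4, 8))        # float64 by default
x32 = np.ascontiguousarray(x64, dtype=np.float32)    # cast and ensure C-contiguous layout

max_abs_error = float(np.max(np.abs(x64 - x32)))     # precision lost by the cast
payload = x32.tobytes()                              # 4 bytes per component
```

For values in [0, 1) the rounding error of the cast stays below 1e-7, which is usually negligible next to the approximation error of the search itself.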

Index Algorithm type, required?

-a, --algo Index Algorithm type, required.
such as?

reserved identifier violation

I would like to point out that an identifier like “_SPTAG_CORE_COMMONDEFS_H_” does eventually not fit to the expected naming convention of the C++ language standard.
Would you like to adjust your selection for unique names?

DLL load failed when import SPTAG

I have built the project successfully, but when I import SPTAG, I get this error:

import sys
sys.path.append(r"D:\library\github_repositories\SPTAG-2\build\Release")
import SPTAG
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\library\github_repositories\SPTAG-2\build\Release\SPTAG.py", line 15, in <module>
    import _SPTAG
ImportError: DLL load failed: The specified module could not be found.

The files in folder build/Release are:

_SPTAG.exp
_SPTAG.lib
_SPTAG.pyd
_SPTAGClient.exp
_SPTAGClient.lib
_SPTAGClient.pyd
aggregator.exe
client.exe
indexbuilder.exe
search.exe
server.exe
SPTAG.py
SPTAGClient.py
SPTAGLib.dll
SPTAGLib.lib
test.exe

my environments are :

vs2015
python3.6
windows7

Could anyone tell me the solution for this error?

Parallel problem during BKT building

Inside the KmeansAssign function, during the BKT building stage, the code seems to try to parallelize the job among the threads (if NumberOfThreads is set to more than 1).

However, since omp_get_num_threads() only returns 1 outside the "omp parallel for" loop, the function never actually parallelizes the job.
I'm not sure whether this is done on purpose (since the BKT build doesn't take much time, there may be no need to parallelize) or is just a small bug.
I've also tried using omp_get_max_threads() instead of omp_get_num_threads() to enable the parallelism. However, there is some kind of bug that simply ends the build process without any error message.

How does SPTAG handle sparse vectors?

Is your feature request related to a problem? Please describe.
I want to use SPTAG to do kNN queries on vectors from a TF-IDF vectorizer, usually sparse vectors. Can I use sparse vectors in SPTAG or rather do you have an idea of how it might perform with sparsity?

Kubernetes documentation

Most ANN libraries do not have a clear method for deployment. Docker and Kubernetes are widely used.

I would appreciate it if you added some documentation on how to deploy SPTAG onto a Kubernetes cluster.

Test search with metadata - can someone explain what metadata is stored/associated?

I am still trying to figure out SPTAG, but unclear on what the metadata methods provide.
for example in the Test() method in sample:

the metadata is generated thus:

m = ''
for i in range(n):
    m += str(i) + '\n'
m = m.encode()

Does this mean each line is associated with the corresponding row of the vectors?
Can someone provide a simple example of storing metadata with the vectors?
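Going by the sample, the metadata appears to be a single byte string in which each newline-terminated record corresponds to the vector at the same row position. The sketch below illustrates that layout with plain Python only. The helper names are hypothetical, and the one-record-per-row interpretation is an assumption drawn from the sample rather than from documentation.

```python
def pack_metadata(records):
    """Join one text record per vector into a newline-delimited byte string."""
    for rec in records:
        if "\n" in rec:
            raise ValueError("records must not contain newlines")
    return "".join(rec + "\n" for rec in records).encode("utf-8")

def unpack_metadata(blob):
    """Split the byte string back into one record per row."""
    return blob.decode("utf-8").split("\n")[:-1]

labels = ["cat.jpg", "dog.jpg", "bird.jpg"]   # row i of the vector array pairs with labels[i]
blob = pack_metadata(labels)
```

Under that assumption, any text that contains no newline (file names, IDs, short snippets) could serve as a record.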

Windows = "Could not find TBB!"

I installed the latest version of TBB through anaconda (https://anaconda.org/intel/tbb), but for some reason, every time I try to build SPTAG, it keeps stopping with "CMake Error at CMakeLists.txt:109 (message): Could not find TBB!"

Does anyone know a work-around for this? In the case that this is due to a faulty TBB installation, does anyone have some insight on how to properly install it on their windows machine (Visual Studio 2019)? All the available TBB installation documentation that I found are very outdated.

OverflowError in method 'AnnIndex_AddWithMetaData' when running test script

Describe the bug
OverflowError: in method 'AnnIndex_AddWithMetaData', argument 4 of type 'int'
when running the Python test script from "Get Started"

To Reproduce
Create a new Python file in the Release directory, copy&paste the Python test script from "Get Started".

Additional context
My environment:
Ubuntu 18.04, Kernel 4.15.0 x86_64
Tried swig 3.0 & 4.0
cmake 3.14.4
gcc 7.4.0
python 3.6

Add benchmarks

It would be great to get some ANN performance benchmarks for SPTAG for single node and distributed. The repo doesn't say how well it scales.

My team wanted to use FAISS a few months ago but there was a problem with the License so we are now using HNSWlib even though it isn't quite as performant. We would love to know if this is a viable alternative to FAISS

https://github.com/erikbern/ann-benchmarks is a great repo for ANN benchmarks and it would be great if you could help the author add SPTAG benchmarks for single node and multiple node setups. The author may have some difficulties setting everything up.

If you can't help out in the above repo. It would still be useful to show your own performance benchmarks in the README.

Need better documentation to run in client/server mode

Describe the bug
I haven't been able to figure out how to run SPTAG in server/client mode on Ubuntu 18.04. I want to host SPTAG on a cloud server and run a Python client to perform kNN lookups.

For example, the documentation states:

./Server [options]
Options: 
  -m, --mode <value>              Service mode, interactive or socket.
  -c, --config <value>            Configure file of the index
:

But I can't seem to find the Server binary. I see a lowercase server binary in the Release folder, but it doesn't accept these arguments and gives an error:

Command 'server' not found, but can be installed with:
apt install rsplib-legacy-wrapper

Can someone post a simpler guide to using SPTAG in client/server mode for us noobs?
Thanks for such a wonderful tool!

Planned GPU Support?

Hi just wondering if there are plans to speed up computation via GPUs?

Upgrade .sln to Visual Studio 2019 and compatibility

Describe the bug
I'm not an expert in C++ compilation, but it seems very hard to get the solution built.

My questions are :

Did someone test the build for VS 2019 ? is it working ?
How can I build the solution for C# wrapper by using VS 2019 instead of cmake
Seems that boost 1.70 is not working ? am I wrong ?

Desktop (please complete the following information):

OS: windows 10
Visual Studio version : VS 2019 16.1.6

Thanks in advance for your feedback

Does SPTAGClient only have a Search method? Not SearchWithMetaData?

Can anyone help me? Thanks

Frustrated with figuring out SPTAG

I have been trying to figure out how to use SPTAG for my ANN searches, to no avail. I am not getting any responses from the contributors, so I am trying my luck with other users. Could someone provide a simple walkthrough of how to use this tool? This is what I am stuck on:

If I want to create a 300-dimension store, do I need to run a "build" first? And is this done only once?

n = 300
k = 5
r = 3
def testBuild(algo, distmethod, x, out):
   i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
   i.SetBuildParam("NumberOfThreads", '4')
   i.SetBuildParam("DistCalcMethod", distmethod)
   ret = i.Build(x, x.shape[0])
   i.Save(out)

For incremental adds, do I just use this method without overwriting previous ones:

def testAdd(index, x, out, algo, distmethod):
   if index is not None:
       i = SPTAG.AnnIndex.Load(index)
   else:
       i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
   i.SetBuildParam("NumberOfThreads", '4')
   i.SetBuildParam("DistCalcMethod", distmethod)
   if i.Add(x, x.shape[0]):
       i.Save(out)

What does testAddWithMetaData do? Does it return, say, an identifier with each result? Is there a particular type of metadata that is supported, i.e. can I save text snippets with it?

Could someone from Microsoft please take a moment to provide a simple guide to using this powerful tool?

Why server test results are slower than computers？

I tested it with a GIST data set. The size of the query set is 1000×960, and each query returns 100 neighbors. Building the index took 20562.084 s, querying took 35.082 s, the recall is 0.92669, and adding 100 vectors took 22.281 s. These times feel very long to me; am I using it correctly? Its index construction and query times are longer than NSG's.

Problem with result = j.Search(q[t], k)

I used SPTAG in face searching. There're more than a million faces in my database.

When I change the value of k from 28000 to 30000, the search time rises from 0.4 s to several minutes.
I get many duplicate indices from j.Search() when I set k to a value bigger than 10000.

Looking forward to your reply.

My environment:

Python 3.7
Ubuntu 18.04

SaveIndex failed (Segmentation Fault) after Delete.

Describe the bug
When I test SPTAG in Python with your Python test code, I find that after I delete some data from the index, SaveIndex always gets a segmentation fault.

This is Delete and Save code:

def testDelete(index, x, out):
   i = SPTAG.AnnIndex.Load(index)
   ret = i.Delete(x.tobytes(), x.shape[0])
   print (ret)
   print("Del finished...................")
   i.Save(out)

After saving tree.bin finishes, a segmentation fault occurs.
Thanks for your help.

The SPTAG module caused Python to stop working.

Fault description: I compiled the SPTAG source code with 64-bit Python 3.6 and VS2017, and the test.exe tests passed.
Under Windows 7, Python 3.6 can import the SPTAG module. I have successfully indexed 1 million vectors, but when my data volume exceeds 1 million, an error occurs. The specific messages are:
PYTHON has stopped working,
Problem Event Name: BEX64
Application name: python.exe
Application version: 3.6.150.1013
Fault module name: ucrtbase.DLL
Exception code: C0000417
I searched the Internet for a solution. Some people said that the cause is the VC runtime library, but the error persists even though I installed the VC2015 and VC2017 runtime libraries.

Also, I noticed that when the program fails, the message "Start to build BKTree" does not appear after "Setting DistCalcMethod"; instead, the following appears directly:
"Save Data To xxxxx\vectors.bin"
Save Data(0,1) Finish!
Save BKT to xxxxx\tree.bin
An error window pops up here, showing "Python has stopped working".

When the file "vectors.bin" reaches 2G in size, a fatal error occurs!

Below is the code that causes the error:

import SPTAG
import numpy as np
n = 1024*1024 #this size will encounter a fatal error
k = 3
r = 3
Dimension=512 #the size of vectors.bin will be 1024×1024×512×4=2G  n×Dimension×4
def testBuild(algo, distmethod, x, out):
   i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
   i.SetBuildParam("NumberOfThreads", '4')
   i.SetBuildParam("DistCalcMethod", distmethod)
   ret = i.Build(x.tobytes(), x.shape[0])
   i.Save(out)
def Test(algo, distmethod):
   x = np.ones((n, Dimension), dtype=np.float32) * np.reshape(np.arange(n, dtype=np.float32), (n, 1))
   q = np.ones((r, Dimension), dtype=np.float32) * np.reshape(np.arange(r, dtype=np.float32), (r, 1)) * 2
   print ("Build.............................")
   testBuild(algo, distmethod, x, 'testindices')
   testSearch('testindices', q, k)

if __name__ == '__main__':
   Test('BKT', 'L2')

And here is the code that can run successfully:

import SPTAG
import numpy as np
n = 1024*1024 -1 #Note the difference here, 1 less than the parameters of the above code.
k = 3
r = 3
Dimension=512 #the size of vectors.bin will be 1024×1024×512×4=2G  n×Dimension×4
def testBuild(algo, distmethod, x, out):
   i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
   i.SetBuildParam("NumberOfThreads", '4')
   i.SetBuildParam("DistCalcMethod", distmethod)
   ret = i.Build(x.tobytes(), x.shape[0])
   i.Save(out)
def Test(algo, distmethod):
   x = np.ones((n, Dimension), dtype=np.float32) * np.reshape(np.arange(n, dtype=np.float32), (n, 1))
   q = np.ones((r, Dimension), dtype=np.float32) * np.reshape(np.arange(r, dtype=np.float32), (r, 1)) * 2
   print ("Build.............................")
   testBuild(algo, distmethod, x, 'testindices')
   testSearch('testindices', q, k)

if __name__ == '__main__':
   Test('BKT', 'L2')

How can I solve this error?
Please Help Me！
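The one-vector difference between the failing and the working script lines up with a signed 32-bit limit: 1024 × 1024 vectors of 512 float32 values occupy exactly 2^31 bytes, one more than the largest signed 32-bit integer. The check below is a sketch of that arithmetic only. That a 32-bit size or offset is actually involved in the crash is a guess, not something confirmed here.

```python
INT32_MAX = 2**31 - 1
BYTES_PER_FLOAT32 = 4

def vector_bytes(n, dimension):
    """Raw size in bytes of n float32 vectors with the given dimension."""
    return n * dimension * BYTES_PER_FLOAT32

failing = vector_bytes(1024 * 1024, 512)        # 2**31 bytes
working = vector_bytes(1024 * 1024 - 1, 512)    # 2048 bytes less

print(failing, failing > INT32_MAX)
print(working, working > INT32_MAX)
```

If the guess holds, the threshold would depend on n × Dimension rather than on n alone, so a smaller dimension would allow more vectors before the failure appears.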

Encoding issues in output of search

Describe the bug
Bad encoding in output of search

To Reproduce
Steps to reproduce the behavior:

Build an index with algorithm BKT using the cli
Build a search using a cli
Open output file of search
Contains bad encoding part way through the file

Expected behavior
Consistent encoding throughout the results

Output
Melanoma:[email protected]_Style_Transfer_for_Videos.pdf|1.005@^_O�:�^B�>�|1807.00273v1.Photorealistic_Style_Transfer_for_Videos.pdf��?�<:i��v3~~Y�$��<�n)8^\��=�J^R<J^?�<~~@hw��~~K;LV^A�~~FO~U;�^?

Desktop (please complete the following information):

OS:Ubuntu 18.04

Will you release a Mac version in the future?


delete failed

When we use delete function, the model reports OverflowError. The data type is already float32.

Need more documentation for index parameters

I suppose there are about 19 parameters of BKT index in the code, which may influence the performance/accuracy tradeoff seriously. Can someone post a simpler guide to demonstrate how to tune these parameters?

The shape of my data is (269, 768). The build process failed, with no errors returned.

I used the code in demo.

The main function is :

The error message is:

Error copying file "SPTAG/Wrappers/inc/SPTAGClient.py" to "SPTAG/Release/".

Description
I get an error when compiling. The error seems related to copying files from the Wrappers/inc folder to the Release folder. When I compile, I get the following error.

Error copying file "/home/junior/programming/builds/SPTAG/Wrappers/inc/SPTAGClient.py" to "/home/junior/programming/builds/SPTAG/Release/".
make[2]: *** [Wrappers/CMakeFiles/_SPTAGClient.dir/build.make:246: ../Release/_SPTAGClient.so] Error 1
make[2]: *** Deleting file '../Release/_SPTAGClient.so'
make[1]: *** [CMakeFiles/Makefile2:386: Wrappers/CMakeFiles/_SPTAGClient.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

To Reproduce
Steps to reproduce the behavior:

git clone
cd SPTAG
mkdir build
cd build && cmake .. && make

Desktop (please complete the following information):

OS: Fedora 30
Python 2.7

How to set Cosine similarity as distance method with the Python Wrapper?

From GettingStarted.md I figured that we can use L2 distance by:
SPTAG.AnnIndex.SetBuildParam("DistCalcMethod", "L2")
What is the value for cosine similarity distance?
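Independently of how the wrapper names its cosine option, ranking by cosine similarity can be reproduced with L2 distance by normalizing every vector to unit length first, since for unit vectors ||a - b||^2 = 2 - 2·cos(a, b). The sketch below checks that identity with NumPy. It is a general workaround and does not use or assume any SPTAG parameter value; `normalize_rows` is a hypothetical helper.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit L2 norm (rows are assumed to be non-zero)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
data = normalize_rows(rng.random((100, 16)))
query = normalize_rows(rng.random((1, 16)))[0]

cos_sim = data @ query
sq_l2 = np.sum((data - query) ** 2, axis=1)

same_identity = bool(np.allclose(sq_l2, 2.0 - 2.0 * cos_sim))
same_top1 = int(np.argmax(cos_sim)) == int(np.argmin(sq_l2))
```

Because squared L2 distance is a decreasing function of cosine similarity on unit vectors, the nearest neighbor under one measure is also the nearest under the other.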

Automatically tuning hyper-parameters in SPTAG on NNI

SPTAG has around 10 hyperparameters. We try to use automl algorithms to find best-performing hyper-parameters for SPTAG. Neural Network Intelligence toolkit (NNI) is a perfect tool for this task. It is a toolkit to help users design and tune machine learning models (e.g., hyperparameters), neural network architectures, or complex system’s parameters, in an efficient and automatic way. NNI has many tuning algorithms built in, and can be executed in a distributed way on a local machine, a remote server, or a large scale training platform such as OpenPAI or Kubernetes.

How to set the parameter "RefineIterations" to improve accuracy?

In the Python demo code, I added a line to the testBuild function:
i.SetBuildParam("RefineIterations","3")
But it seems to have no effect: while the index is building, RefineIterations is still 2.

Segmentation fault (core dumped)

Describe the bug

Testing the SPTAG wrapper on Python 2.7, Ubuntu 18.04.
With x as FLOAT data, I get the error: Segmentation fault (core dumped)

import SPTAG
import numpy as np

n = 100
k = 3
r = 3

def testBuild(algo, distmethod, x, out):
    i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
    i.SetBuildParam("NumberOfThreads", '4')
    i.SetBuildParam("DistCalcMethod", distmethod)
    ret = i.Build(x.tobytes(), x.shape[0])
    i.Save(out)

def testSearch(index, q, k):
    j = SPTAG.AnnIndex.Load(index)
    for t in range(q.shape[0]):
        result = j.Search(q[t].tobytes(), k)
        print(result[0])  # ids
        print(result[1])  # distances

def Test(algo, distmethod):
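    x = np.ones((n, 10), dtype=np.float32) * np.reshape(np.random.uniform(low=0.5, high=13.3, size=(n,)), (n,1))
    # query vectors used by testSearch below
    q = np.ones((r, 10), dtype=np.float32) * np.reshape(np.arange(r, dtype=np.float32), (r, 1)) * 2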

    print('x.shape: ', str(x.shape))
    print(x[1,])

    print("Build.............................")
    testBuild(algo, distmethod, x, 'testindices')
    testSearch('testindices', q, k)

if __name__ == '__main__':
    Test('BKT', 'L2')
    # Test('KDT', 'L2')

Log console:

('x.shape: ', '(100, 10)')
[ 0.60697413 0.60697413 0.60697413 0.60697413 0.60697413 0.60697413
0.60697413 0.60697413 0.60697413 0.60697413]
Build.............................
Setting NumberOfThreads with value 4
Setting DistCalcMethod with value L2
Save Data To testindices/vectors.bin
Save Data (0, 1) Finish!
Save BKT to testindices/tree.bin
Segmentation fault (core dumped)

When the number ofvectors too big, index build will fail to complete!

When there are too many vectors, such as 5 million (n = 1024×1024×5), the index build fails to complete.
The program stops at this point, printing:
"Save Data To xxxxx\vectors.bin"
I also noticed that the file size of vectors.bin reaches 300G. That is not normal; the correct file size should be 10G.
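
The 10G figure follows from n × Dimension × 4 bytes for float32 data. The short check below reproduces that arithmetic. It also notes that the byte count is larger than a 32-bit signed integer can hold; whether that has anything to do with the oversized file is an assumption worth testing, not something the report establishes.

```python
# Expected size of the raw float32 vector data from the report.
n = 1024 * 1024 * 5        # number of vectors
dimension = 512            # floats per vector
bytes_per_float32 = 4

expected_bytes = n * dimension * bytes_per_float32
expected_gib = expected_bytes / float(1024 ** 3)

# Plain fact about the number; its relevance to the failure is unverified.
exceeds_int32 = expected_bytes > 2 ** 31 - 1

print("expected bytes:", expected_bytes)
print("expected GiB:", expected_gib)
print("larger than a 32-bit signed int:", exceeds_int32)
```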

The code that caused the error is as follows:

import SPTAG
import numpy as np
n = 1024*1024*5 # this size causes the file size of vectors.bin to reach 300G!
k = 3
r = 3
Dimension=512 # expected size of vectors.bin is n×Dimension×4 bytes: 1024×1024×512×4 = 2G per 1024×1024 vectors, 10G for this n
def testBuild(algo, distmethod, x, out):
   i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
   i.SetBuildParam("NumberOfThreads", '4')
   i.SetBuildParam("DistCalcMethod", distmethod)
   ret = i.Build(x.tobytes(), x.shape[0])
   i.Save(out)
def Test(algo, distmethod):
   x = np.ones((n, Dimension), dtype=np.float32) * np.reshape(np.arange(n, dtype=np.float32), (n, 1))
   q = np.ones((r, Dimension), dtype=np.float32) * np.reshape(np.arange(r, dtype=np.float32), (r, 1)) * 2
   print ("Build.............................")
   testBuild(algo, distmethod, x, 'testindices')

if __name__ == '__main__':
   Test('BKT', 'L2')

How can I solve this error?
Please help me!

Boost 1.70.0 causes a build error. Fixed by installing Boost 1.67.0

Describe the bug
When I build SPTAG with Boost 1.70.0 on an Ubuntu machine, the build fails with the error message below.

/usr/local/include/boost/asio/detail/io_object_impl.hpp:88:53: error: ‘class boost::asio::execution_context’ has no member named ‘get_executor’

Removing 1.70.0 and installing Boost 1.67.0 solved the build issue.

Here is the detailed error message.

/usr/local/include/boost/asio/io_context_strand.hpp:89:19: note:   no known conversion for argument 1 from ‘boost::asio::execution_context’ to ‘const boost::asio::io_context::strand&’

In file included from /usr/local/include/boost/asio/basic_socket.hpp:21:0,
                 from /usr/local/include/boost/asio/basic_socket_acceptor.hpp:19,
                 from /usr/local/include/boost/asio/ip/tcp.hpp:19,
                 from /home/USERNAME/project/SPTAG/AnnService/inc/Socket/Connection.h:13,
                 from /home/USERNAME/project/SPTAG/AnnService/src/Socket/Connection.cpp:4:

/usr/local/include/boost/asio/detail/io_object_impl.hpp: In instantiation of ‘boost::asio::detail::io_object_impl<IoObjectService, Executor>::io_object_impl(ExecutionContext&, typename std::enable_if<std::is_convertible<ExecutionContext&, boost::asio::execution_context&>::value>::type*) [with ExecutionContext = boost::asio::execution_context; IoObjectService = boost::asio::detail::deadline_timer_service<boost::asio::time_traits<boost::posix_time::ptime> >; Executor = boost::asio::executor; typename std::enable_if<std::is_convertible<ExecutionContext&, boost::asio::execution_context&>::value>::type = void]’:

/usr/local/include/boost/asio/basic_deadline_timer.hpp:174:20:   required from ‘boost::asio::basic_deadline_timer<Time, TimeTraits, Executor>::basic_deadline_timer(ExecutionContext&, typename std::enable_if<std::is_convertible<ExecutionContext&, boost::asio::execution_context&>::value>::type*) [with ExecutionContext = boost::asio::execution_context; Time = boost::posix_time::ptime; TimeTraits = boost::asio::time_traits<boost::posix_time::ptime>; Executor = boost::asio::executor; typename std::enable_if<std::is_convertible<ExecutionContext&, boost::asio::execution_context&>::value>::type = void]’

/home/USERNAME/project/SPTAG/AnnService/src/Socket/Connection.cpp:29:31:   required from here

/usr/local/include/boost/asio/detail/io_object_impl.hpp:88:53: error: ‘class boost::asio::execution_context’ has no member named ‘get_executor’
         is_same<ExecutionContext, io_context>::value)
                                                     ^

AnnService/CMakeFiles/server.dir/build.make:348: recipe for target 'AnnService/CMakeFiles/server.dir/src/Socket/Connection.cpp.o' failed
make[2]: *** [AnnService/CMakeFiles/server.dir/src/Socket/Connection.cpp.o] Error 1
CMakeFiles/Makefile2:176: recipe for target 'AnnService/CMakeFiles/server.dir/all' failed
make[1]: *** [AnnService/CMakeFiles/server.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

To Reproduce
Steps to reproduce the behavior:
Install Boost 1.70.0 and follow the install instructions
at https://github.com/microsoft/SPTAG

mkdir build
cd build && cmake .. && make

Test OS version information:
Microsoft Azure default Ubuntu 14 distribution
OS: Linux dw-ubuntu14 4.15.0-1045-azure #49-Ubuntu SMP Mon May 13 16:30:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Boost install steps I followed
https://stackoverflow.com/questions/12578499/how-to-install-boost-on-ubuntu/24086375#24086375

Removing 1.70.0 and installing Boost 1.67.0 solved the build issue.
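
To see which Boost version the compiler picks up before building, one option is to read the `BOOST_VERSION` macro from `boost/version.hpp`. The sketch below assumes the headers live under `/usr/local/include`, as in the log paths above; the function names are made up for this example. It only reports the version and flags the one release this report found failing.

```python
import re


def decode_boost_version(boost_version):
    """Split BOOST_VERSION (e.g. 107000) into (major, minor, patch)."""
    major = boost_version // 100000
    minor = boost_version // 100 % 1000
    patch = boost_version % 100
    return major, minor, patch


def read_boost_version(header_text):
    """Find '#define BOOST_VERSION <number>' in the text of version.hpp."""
    match = re.search(r"^\s*#\s*define\s+BOOST_VERSION\s+(\d+)",
                      header_text, re.MULTILINE)
    if match is None:
        return None
    return decode_boost_version(int(match.group(1)))


if __name__ == "__main__":
    # Assumed header location, taken from the include paths in the error log.
    path = "/usr/local/include/boost/version.hpp"
    try:
        with open(path) as header:
            version = read_boost_version(header.read())
    except IOError:
        version = None
    print("Boost version:", version)
    if version == (1, 70, 0):
        print("This is the version reported as failing in this issue.")
```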

Distributed Serving

Thanks for open-sourcing this. However, the README says it supports distributed serving, and the documentation does not make clear how to do it. Is it just a matter of loading the same index onto multiple machines?

Thanks

Can you give me demos of QueryFile and TruthFile?

I would like example QueryFile and TruthFile files.
