Comments (22)
Thanks for your feature request.
So far Milvus doesn't allow users to specify data partition logic, but we are actually planning this.
A possible solution is:
Extend the Insert api: insert(table_name, vector_list, vector_id, partition_hint). If the user provides a partition_hint, the vectors will be stored in a partition folder.
Add a new api: delete_vectors_by_partition(table_name, partition_hint). The user calls this api to delete the vectors of a certain partition.
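The proposed calls could behave like this toy in-memory model (the names are taken from the proposal above; this is a hypothetical sketch, not the actual Milvus client API):

```python
from collections import defaultdict

class PartitionedTable:
    """Toy model of the proposed partition_hint semantics."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # partition_hint -> {id: vector}

    def insert(self, vector_list, vector_ids, partition_hint="_default"):
        # Vectors sharing a partition_hint land in the same "folder".
        for vid, vec in zip(vector_ids, vector_list):
            self._partitions[partition_hint][vid] = vec

    def delete_vectors_by_partition(self, partition_hint):
        # Drop every vector stored under the partition in one call.
        return len(self._partitions.pop(partition_hint, {}))

table = PartitionedTable()
table.insert([[0.1, 0.2], [0.3, 0.4]], [1, 2], partition_hint="2019-10-01")
table.insert([[0.5, 0.6]], [3], partition_hint="2019-10-02")
print(table.delete_vectors_by_partition("2019-10-01"))  # 2
```

The key design point is that the partition key, not the vector id, drives deletion, so dropping a whole date of data never requires enumerating ids.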
The final solution is not yet decided; please feel free to tell us if you have any suggestions.
Thanks!
from milvus.
Thanks for your solution.
If the user specifies data partition logic such as partitioning by date, will the recall rate or search speed decrease? Is there an upper limit on the number of partitions?
Looking forward to your reply, thanks!
Theoretically, partition logic won't affect the recall rate, but it could affect search performance. For instance, assume we have 10000 vectors: if we put them into 1000 partitions, each partition contains only 10 vectors (too few to build an index on), so the search becomes a brute-force search; but if we put them all into one partition, we can build an index for it and get the best search performance.
In my opinion, Milvus shouldn't limit the partition number. Users have to take responsibility for choosing a reasonable partition number.
OK. If we put all vectors into 100 partitions by date, with one million vectors per partition, roughly how much would search performance decrease compared with a single partition?
The performance is the same, since one million vectors will be split into small data files (each file is about 1GB by default).
The partition number can affect search performance only when the number of vectors per partition is too small.
Thanks!
Regarding "one million vectors will be split into small data files": is the number of these small data files determined by the parameter "nlist"?
For Milvus 0.3.x: it is defined by index_building_threshold in the server_config.yaml.
For Milvus 0.4.x and 0.5.x: it is defined by the create_table api. Python example:
create_table({'table_name': TABLE_NAME, 'dimension': TABLE_DIMENSION, 'index_file_size': 1024, 'metric_type': MetricType.L2})
The unit of 'index_file_size' is MB. The default value is 1024 MB.
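The file count implied by 'index_file_size' can be estimated with a quick back-of-envelope calculation (assuming float32 components and ignoring per-file metadata overhead):

```python
import math

def vectors_per_file(index_file_size_mb, dim, bytes_per_component=4):
    # Raw float32 payload only; real files also carry some metadata.
    return (index_file_size_mb * 1024 * 1024) // (dim * bytes_per_component)

per_file = vectors_per_file(1024, 512)    # default 1024 MB, 512-d vectors
print(per_file)                            # 524288, i.e. ~500k per file
print(math.ceil(10_000_000 / per_file))    # 20 files for 10M such vectors
```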
OK, got it.
For Milvus 0.3.1, what's the effect of the parameter "nlist" in the config file "server_config"?
The 'nlist' parameter means that the vectors within a file are split into that many clusters when the index is built. Assume one file contains 10000 vectors and 'nlist' is set to 200; when the user performs 'build_index', the 10000 vectors will be split into 200 clusters (not equally), and each cluster has an index.
Is the recall rate sensitive to this parameter 'nlist'?
There is another parameter, 'nprobe', related to 'nlist'. 'nprobe' is a search parameter: it means how many clusters will be picked to find the topk result, and 'nprobe' must always be less than or equal to 'nlist'. The two parameters both affect search performance and recall rate.
Assume a file contains 10000 vectors.
If you set 'nlist'=1, 'nprobe'=1, all vectors are in a single cluster and the search engine searches every vector in this cluster, so the recall rate must be 100%, but the search performance is poor since all 10000 vectors are computed.
If you set 'nlist'=100, 'nprobe'=1, the 10000 vectors are split into 100 clusters; the search engine first finds the closest cluster, then finds topk in that cluster. The recall rate may be less than 90%, but the search performance is good.
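The recall/speed trade-off described above can be reproduced with a toy IVF index in NumPy (a sketch of the general scheme behind nlist/nprobe, not the Milvus engine itself; a real index trains the centroids with k-means rather than sampling them):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, nlist, topk = 8, 2000, 50, 10
data = rng.standard_normal((n, dim)).astype(np.float32)

# Coarse quantizer: pick nlist points as centroids and assign every
# vector to its closest centroid.
centroids = data[rng.choice(n, nlist, replace=False)]
assign = ((data[:, None, :] - centroids) ** 2).sum(-1).argmin(1)

def ivf_search(query, nprobe):
    # Rank clusters by centroid distance, then scan only nprobe of them.
    probed = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.where(np.isin(assign, probed))[0]
    dist = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dist)[:topk]]

query = rng.standard_normal(dim).astype(np.float32)
exact = np.argsort(((data - query) ** 2).sum(-1))[:topk]  # brute force
for nprobe in (1, 10, nlist):
    recall = len(set(ivf_search(query, nprobe)) & set(exact)) / topk
    print(f"nprobe={nprobe:3d}  recall@{topk}={recall:.2f}")
# With nprobe == nlist every cluster is scanned, so recall is 1.00,
# at brute-force cost; smaller nprobe scans fewer vectors but may
# miss true neighbors that fall in unprobed clusters.
```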
OK. If the parameters are set to 'nlist'=100, 'nprobe'=1 versus 'nlist'=100, 'nprobe'=10, roughly how much will query efficiency differ?
It is hard to say. Query performance is affected by many factors, including data swapping, index parameters, search parameters, hardware capability, and so on.
If the above factors are the same, will the query time increase linearly as nprobe increases?
A query has several phases: collecting/preparing index files, loading data from disk to CPU, index comparison, finding topk within the nprobe clusters, reducing to the final result, serializing the result and sending it to the client, etc.
The nprobe parameter only affects one of these phases. Although that phase's time cost depends linearly on nprobe, the whole query time does not grow linearly.
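A toy cost model makes this concrete (the timings are made up purely for illustration, not measured Milvus numbers):

```python
# Illustrative, made-up timings -- not measured Milvus numbers.
FIXED_MS = 40.0        # file prep, data load, reduce, serialize: nprobe-independent
PER_CLUSTER_MS = 0.5   # scanning one probed cluster

def query_time_ms(nprobe):
    # Only the cluster-scan phase grows with nprobe.
    return FIXED_MS + PER_CLUSTER_MS * nprobe

for nprobe in (1, 10, 100):
    print(nprobe, query_time_ms(nprobe))   # 40.5, 45.0, 90.0
# nprobe grows 100x but total time grows only ~2.2x in this model,
# because the fixed phases dominate.
```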
Thanks for your detailed analysis.
Will Milvus recall rate and performance change significantly between skewed and evenly distributed data sets?
I don't think it would significantly affect recall rate or performance, but I would say that an evenly distributed data set is the better practice.
OK. In our previous usage, on a data set containing ten million vectors with nlist set to the default of 16384: when nprobe is 1 and we ask for top 1000, a cluster actually contains about 50 vectors, of which 20 are recalled, a recall rate of 40%; when nprobe is increased to 100, the recall rate rises to 90%. In this case, should we increase nlist and decrease nprobe?
The 'index_file_size' default value is 1024MB. Assume the 10M vectors are 512-dimensional; then each file contains about 500000 vectors. With 'nlist' set to 16384, each cluster contains about 30 vectors. With 'nprobe' set to 1 and topk set to 1000, the single probed cluster may contain only 35 vectors, so the result will look like this:
id = 12340 distance = 0.0
id = 34743 distance = 71.00025939941406
..... 35 valid items
id = 63112 distance = 92.93685913085938
id = 98257 distance = 93.01753997802734
id = -1 distance = 3.4028234663852886e+38
id = -1 distance = 3.4028234663852886e+38
......
id = -1 distance = 3.4028234663852886e+38
..... 965 invalid items
It only returns 35 valid items to the client, so the recall rate is very poor.
To increase the recall rate, you need to increase 'nprobe'. The larger the 'nprobe', the higher the recall rate. If 'nprobe' equals 'nlist', the recall rate is 100%.
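The arithmetic in this answer can be checked directly (assuming float32 components, 512-dimensional vectors, and the default 1024 MB 'index_file_size'):

```python
import math

DIM, TOPK, NLIST = 512, 1000, 16384
per_file = (1024 * 1024 * 1024) // (DIM * 4)  # float32 vectors per 1 GB file
per_cluster = per_file / NLIST                 # average vectors per cluster

print(per_file)      # 524288 -> roughly 500k vectors per file
print(per_cluster)   # 32.0   -> ~30 vectors in each cluster on average
# nprobe=1 scans a single cluster, so only a few dozen of the 1000
# requested results can be valid; the remaining slots come back as id = -1.
print(math.ceil(TOPK / per_cluster))  # 32 -> need nprobe of at least ~32
                                      # just to gather 1000 candidates
```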
Thanks for your detailed analysis.
Looking forward to the new api for deleting vectors by generation date.
Best wishes!
#77 'Support Table partition' is already implemented in 0.6.0. Please wait for the 0.6.0 release.