Comments (6)
I answered on SO.
from vespa.
Thanks @bratseth
Could you advise how we can explore whether I can still create buckets by user-defined criteria (as in the key-value pairs defined in http://docs.vespa.ai/documentation/documents.html) while using indexed mode, and how to target those buckets from search?
The data scale I am targeting would give such a bucket ~500 million records (if it were a single bucket). I want to split that by a custom criterion (e.g. hourly time range) and still get all the benefits of indexed mode (e.g. stemming, linguistics, and normalization, which are not available in streaming mode).
On another note, is Vespa planning to open an IRC channel or group where devs can bounce such thoughts off contributors? It would help the community a lot.
Thank You!
from vespa.
Yes, you definitely shouldn't use streaming for this, but I also don't think you need a custom bucketing scheme. Data is not physically stored separately per bucket, since that would make updates slow; buckets are just used as a data management unit. So the only thing you can hope to achieve by custom bucketing is a skewed data distribution, which just reduces efficiency.
Just write all the documents straightforwardly, with a timestamp field marked as an attribute, and if that field is a strong criterion in queries, also mark it as fast-search.
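A minimal sketch of what such a field could look like in a search definition. The field name `timestamp` and the choice of a `long` holding epoch seconds are assumptions for illustration, not something stated in this thread:

```
# hypothetical field; name and type are assumptions
field timestamp type long {
    indexing: summary | attribute
    attribute: fast-search
}
```

Leaving out the `attribute: fast-search` line keeps the field as a plain attribute, which may be enough when the timestamp is only a secondary filter.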
Regarding IRC/group, someone was going to set up a Slack channel; I'll find out what happened with that. Would that work for you, or do you think there's a better option?
from vespa.
Thanks! Slack would be awesome but is a little uncontrolled (and I can't tell how the cost works out)!
I guess any mailing list would do for a start (e.g. Apache has a lot of them), or even a Google group would serve well. SO will soon have enough relevant content to draw questions away from GitHub issues (I am mostly opening them here on GitHub to catalog initial questions that would otherwise take considerable time to figure out from the vast documentation and source code, and of course to feed my greed for getting onboard faster!).
Your confirmation that we can structure data openly and still operate at scale solves a lot! I can see how this is very different from the Elasticsearch approach of physical indices; here elasticity is more about abstract data segments, with several processes keeping watch on the underlying system.
some side notes --
I've also compiled Vespa on Debian (Ubuntu), and apart from a few changes, mostly in configuration, it went well (I used gcc-6.3.0, and it spews segfaults on parallel builds around sets & maps; a single-threaded build works). I'll need a nod, however, on whether there are any unknowns ahead if I want to submit a PR for a cleaner approach to this (or reasons not to do that).
E.g. I had to add -Wno-format-security to address LOG(debug, <const char *>) being called without a format specifier (this happens in a lot of places), which Debian's default hardening doesn't allow. The same goes for -Wno-unused-result, where CentOS is quite lax.
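To make the -Wformat-security point concrete, here is a small sketch of the pattern. `format_log` is a hypothetical printf-style helper standing in for a logging macro; it is not Vespa's actual LOG implementation, and the buffer size is arbitrary:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical printf-style logger (an assumption, not Vespa's LOG macro).
// The format attribute lets GCC/Clang check the format string at call sites.
std::string format_log(const char *fmt, ...) __attribute__((format(printf, 1, 2)));

std::string format_log(const char *fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);
}

// Safe pattern: the message is passed as data behind a literal "%s",
// so a '%' inside the message is never interpreted as a conversion.
std::string log_message_safely(const char *msg) {
    return format_log("%s", msg);
}

// The pattern that -Wformat-security flags would be:
//     format_log(msg);   // format is not a string literal, no format arguments
```

Calling `log_message_safely("100% done")` returns the message unchanged, whereas passing the same string directly as the format would treat `% d` as a conversion specifier.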
thanks for help!
from vespa.
1 - Segfaults on parallel builds -> That should not happen. Our internal builds run with -j 51 on our 48-core build machines with no issues. The Travis build, I think, runs with 4 cores. As far as I know there should be no issues with parallel builds. If you encounter any, that is a bug and needs to be rectified ASAP. Do you have any logs or anything else that can help us narrow it down?
2 - We do not actively turn on -Wformat-security. I will give that a try and see what pops out. It sounds like something we should turn on.
3 - Same for -Wunused-result. That one might be a little harder to satisfy. I am not sure if these are OS dependent. If so, we need to build on Ubuntu too. I know we are doing RHEL 6, RHEL 7, CentOS 7, and Fedora 25/26.
If you find anything that should be improved, feel free to submit a pull request or give us a hint. As long as your CLA is done we are open to any help.
from vespa.
Parallel builds on Ubuntu 14.04 with gcc 6.3.0. Gist with a few logs:
https://gist.github.com/shwetanks/6bc1230df2dd5e25e4698073eafae292#file-vespa_parallel_ubuntu-log
Additionally:
Building CXX object vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/threadpooldst.cpp.o
In file included from vespamalloc/src/vespamalloc/malloc/mallocdst16.cpp:32:0:
vespamalloc/src/vespamalloc/malloc/overload.h: In function ‘void* valloc(size_t)’:
vespamalloc/src/vespamalloc/malloc/overload.h:123:7: error: declaration of ‘void* valloc(size_t)’ has a different exception specifier
void *valloc(size_t size)
^~~~~~
vespamalloc/src/vespamalloc/malloc/overload.h:122:7: note: from previous declaration ‘void* valloc(size_t) throw ()’
void *valloc(size_t size) __attribute__((visibility ("default")));
^~~~~~
make[2]: *** [vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/mallocdst16.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
Just to add, I did compile on CentOS 7 as well and found none of the above.
from vespa.