Proton: Custom bucketing & Query about vespa HOT 6 CLOSED

vespa-engine commented on April 28, 2024

Proton: Custom bucketing & Query

from vespa.

Comments (6)

bratseth commented on April 28, 2024

I answered on SO.

shwetanks commented on April 28, 2024

Thanks @bratseth
Could you advise how to explore whether I can still create buckets by user-defined criteria (as in the key-value pairs defined in http://docs.vespa.ai/documentation/documents.html) while using indexed mode, and how to target those buckets from search?

The data scale I am targeting would put ~500 million records in such a bucket (if it were a single bucket), which I want to split by a custom criterion (e.g. hourly time range) while keeping all the benefits of indexed mode (e.g. stemming and linguistics/normalization, which are not available in streaming mode).
On another note, is Vespa planning to open an IRC channel or group where devs can bounce such thoughts off contributors? It would help the community a lot.

Thank You!

bratseth commented on April 28, 2024

Yes, you definitely shouldn't use streaming for this, but I also don't think you need a custom bucketing scheme. Data is not physically stored separately per bucket (that would make updates slow); buckets are just used as a data management unit. So the only thing you can hope to achieve with custom bucketing is a skewed data distribution, which just reduces efficiency.

Just write all the documents straightforwardly, with a timestamp field marked as an attribute, and if that field is a strong criterion in queries, also mark it as fast-search.

Regarding IRC/group, someone was going to set up a Slack channel; I'll find out what happened with that. Would that work for you, or do you think there's a better option?

shwetanks commented on April 28, 2024

Thanks! Slack would be awesome but is a little uncontrolled (and I can't tell how the cost works out)!
I guess any mailing list would do for a start (e.g. Apache has a lot of them), or even a Google group would serve well. SO will soon have enough relevant content to reduce the number of GitHub issues (I am mostly opening them here on GitHub to catalog initial questions that would otherwise take considerable time to figure out from the vast documentation and source code, and of course to feed my greed for getting on board faster!).

Your confirmation that we can structure data openly and still operate at scale solves a lot! I can see how this is very different from the Elasticsearch approach of physical indices: here elasticity is more a matter of abstract data segments and several processes keeping watch on the underlying system.

some side notes --

I've also compiled Vespa on Debian (Ubuntu) and, apart from a few changes, mostly in configuration, it went well (I used gcc-6.3.0, and it spews segfaults on parallel builds around sets & maps; the single-threaded build works). I'll need a nod, however, on whether there are any unknowns ahead if I want to submit a PR for a cleaner approach to this (or reasons not to do that).
E.g. I had to add -Wno-format-security to get past LOG(debug, <const char *>) calls that have no format specifier (these occur in a lot of places), which Debian's default hardening doesn't allow. The same goes for -Wno-unused-result, about which CentOS is quite lax.
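The -Wformat-security complaint can be illustrated with a minimal sketch. Here `format_log` and `log_resume` are hypothetical stand-ins, not Vespa's actual LOG macro, and the `format` attribute assumes GCC or Clang. The idea is simply to pass a literal format string and hand the message over as an argument:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical printf-style logger standing in for a LOG(debug, ...) macro.
// The format attribute (GCC/Clang) lets the compiler check call sites.
__attribute__((format(printf, 1, 2)))
std::string format_log(const char *fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);
}

// Flagged by -Wformat-security: the message itself is used as the format,
// so any '%' inside it would be interpreted as a conversion specifier.
//   format_log(("Reading resume data for " + fileReference).c_str());

// Accepted: a literal format string, with the message passed as an argument.
std::string log_resume(const std::string &fileReference) {
    return format_log("Reading resume data for %s", fileReference.c_str());
}
```

With this shape a file reference that happens to contain a `%` is logged verbatim instead of being parsed as a conversion.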

thanks for help!

baldersheim commented on April 28, 2024

1 - Segfaults on parallel builds: that should not happen. Our internal builds run with -j 51 on our 48-core build machines with no issues, and I think the Travis build runs with 4 cores. As far as I know there should be no issues with parallel builds; if you encounter any, that is a bug and needs to be rectified ASAP. Do you have any logs or anything else that can help us narrow it down?

2 - We do not actively turn on -Wformat-security. I will give that a try and see what pops out. It sounds like something we should turn on.

3 - Same for -Wunused-result. That one might be a little harder to satisfy. I am not sure whether these are OS dependent; if so, we need to build on Ubuntu too. I know we build on RHEL 6, RHEL 7, CentOS 7 and Fedora 25/26.

If you find anything that should be improved, feel free to submit a pull request or give us a hint. As long as your CLA is done, we are open to any help.
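As a rough illustration of what satisfying -Wunused-result tends to involve, the sketch below checks the return value of `write()` rather than discarding it. The `write_token` helper is hypothetical and only loosely modelled on the idea of a wakeup pipe; it is not Vespa's code, and it assumes a POSIX system:

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical helper: write one wakeup byte to a pipe and report success.
// Using the result of write() is what keeps -Wunused-result quiet.
bool write_token(int fd) {
    const char token = 'T';
    ssize_t res;
    do {
        res = ::write(fd, &token, 1);
    } while (res < 0 && errno == EINTR);  // retry if interrupted by a signal
    return res == 1;
}
```

Whether a failed write should be logged, retried or ignored is a design decision for each call site, which is probably why this warning is harder to satisfy across a large code base.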

shwetanks commented on April 28, 2024

Parallel builds on Ubuntu 14.04 with gcc 6.3.0: gist with a few logs:
https://gist.github.com/shwetanks/6bc1230df2dd5e25e4698073eafae292#file-vespa_parallel_ubuntu-log
e.g.

vespa/filedistribution/src/vespa/filedistribution/distributor/filedownloader.cpp

Line 288 in f76406b

LOG(debug, ("Reading resume data for " + fileReference).c_str());
e.g.

vespa/vespalib/src/vespa/vespalib/net/selector.cpp

Line 45 in f76406b

WakeupPipe::write_token()

additionally

Building CXX object vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/threadpooldst.cpp.o

In file included from vespamalloc/src/vespamalloc/malloc/mallocdst16.cpp:32:0:
vespamalloc/src/vespamalloc/malloc/overload.h: In function ‘void* valloc(size_t)’:
vespamalloc/src/vespamalloc/malloc/overload.h:123:7: error: declaration of ‘void* valloc(size_t)’ has a different exception specifier
void *valloc(size_t size)
        ^~~~~~
vespamalloc/src/vespamalloc/malloc/overload.h:122:7: note: from previous declaration ‘void* valloc(size_t) throw ()’
void *valloc(size_t size) __attribute__((visibility ("default")));
       ^~~~~~
make[2]: *** [vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/mallocdst16.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....

Just to add: I also compiled on CentOS 7 and hit none of the above.
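The "different exception specifier" error in the log above is the kind GCC reports when a function is first declared non-throwing and then redeclared or defined without that specification. A minimal sketch with a hypothetical `demo_valloc` (not the real `valloc` override, and not necessarily how Vespa resolved it) shows a declaration and definition that agree:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator entry point, declared non-throwing in the way a
// libc header might declare valloc().
void *demo_valloc(std::size_t size) noexcept;

// The definition repeats the same exception specification. Omitting
// noexcept here would make the two declarations disagree.
void *demo_valloc(std::size_t size) noexcept {
    return std::malloc(size);
}
```

Since the previous declaration in the log carries `throw ()`, one plausible direction is to make the overriding definition carry a matching specification on platforms whose headers declare it that way; that is an assumption, not something confirmed in this thread.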
