
Comments (6)

bratseth commented on April 28, 2024

I answered on SO.

from vespa.

shwetanks commented on April 28, 2024

Thanks @bratseth
Could you advise how I can still create buckets by user-defined criteria (as in the key-value pairs described in http://docs.vespa.ai/documentation/documents.html) while using indexed mode, and how to target those buckets from search?

At the data scale I am targeting, a single bucket would hold ~500 million records, so I want to split it by a custom criterion (e.g. an hourly time range) and still get all the benefits of indexed mode (e.g. stemming and linguistics/normalization, which are not available in streaming mode).
On another note, is Vespa planning to open an IRC channel or group where devs can bounce such thoughts off contributors? It would help the community a lot.

Thank You!


bratseth commented on April 28, 2024

Yes, you definitely shouldn't use streaming for this, but I also don't think you need a custom bucketing scheme. Data is not physically stored separately per bucket, since that would make updates slow; buckets are just used as a data management unit. So the only thing you can hope to achieve by custom bucketing is a skewed data distribution, which just reduces efficiency.

Just write all the documents straightforwardly, with a timestamp field marked as an attribute, and if that field is a strong criterion in queries, also mark it as fast-search.

Regarding IRC/group, someone was going to set up a Slack channel; I'll find out what happened with that. Would that work for you, or do you think there's a better option?


shwetanks commented on April 28, 2024

Thanks! Slack would be awesome but is a little uncontrolled (and I can't tell how the cost works out)!
I guess any mailing list would do for a start (e.g. Apache has a lot of them), or even a Google group would serve well. SO will soon have enough relevant content to reduce the number of GitHub issues (I am mostly opening them here on GitHub to catalog initial questions that would otherwise take considerable time to figure out from the vast documentation and source code, and of course to feed my greed for getting onboard faster!).

Your confirmation that we can structure data openly and still operate at scale solves a lot! I can see how this is very different from the Elasticsearch approach of physical indices: here elasticity is more about abstract data segments, with several processes keeping watch on the underlying system.

some side notes --

I've also compiled Vespa on Debian (Ubuntu) and, apart from a few changes, mostly in configuration, it went well (I used gcc-6.3.0; the parallel build segfaults around sets & maps, while a single-threaded build works). I would, however, like a nod on whether there are any unknowns ahead if I want to submit a PR for a cleaner approach to this (or reasons not to do that).
E.g. I had to add -Wno-format-security to get past LOG(debug, <const char *>) calls made without a format specifier (this occurs in a lot of places), which Debian's default hardening doesn't allow. The same goes for -Wno-unused-result, where CentOS is quite lax.
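
To illustrate the format-security point, here is a minimal sketch of the usual fix: pass a literal format string and hand the dynamic text over as an argument. The names log_line and log_resume are hypothetical stand-ins for a printf-style logger; this is not Vespa's LOG macro, and it assumes GCC or Clang for the format attribute.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical printf-style logger (not Vespa's LOG macro). The attribute
// lets GCC/Clang check the format string against the arguments.
std::string log_line(const char* fmt, ...) __attribute__((format(printf, 1, 2)));

std::string log_line(const char* fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);
}

// Flagged by -Wformat-security: log_line(message.c_str());
// because any '%' inside message would be parsed as a conversion.
// Safer: keep the format a literal and pass the dynamic part as data.
std::string log_resume(const std::string& fileReference) {
    return log_line("Reading resume data for %s", fileReference.c_str());
}
```

With this shape, a file reference that happens to contain a '%' sequence is copied through verbatim instead of being interpreted as a conversion.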

thanks for help!


baldersheim commented on April 28, 2024

1 - segfaults on parallel builds -> That should not happen. Our internal builds run with -j 51 on our 48 core build machines, with no issues. I think the Travis build runs with 4 cores. As far as I know there should be no issues with parallel builds. If you encounter any, that is a bug and needs to be rectified ASAP. Do you have any logs or anything else that can help us narrow it down?

2 - We do not actively turn on -Wformat-security. I will give that a try and see what pops out. It sounds like something we should turn on.

3 - Same for -Wunused-result. That one might be a little harder to satisfy. I am not sure if these are OS dependent. If so, we need to build on Ubuntu too. I know we build on RHEL 6, RHEL 7, CentOS 7 and Fedora 25/26.

If you find anything that should be improved, feel free to submit a pull request or give us a hint. As long as your CLA is done, we are open to any help.


shwetanks commented on April 28, 2024
  1. parallel builds on ubuntu 14.04 gcc 6.3.0
    gist with few logs
    https://gist.github.com/shwetanks/6bc1230df2dd5e25e4698073eafae292#file-vespa_parallel_ubuntu-log

  2. e.g.

    LOG(debug, ("Reading resume data for " + fileReference).c_str());

  3. e.g.

    WakeupPipe::write_token()
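
For the unused-result case, a minimal sketch of what checking the return value of write() could look like. This write_token is a hypothetical free function on a raw file descriptor, not the actual WakeupPipe code, and the retry-on-EINTR policy is an assumption made for illustration.

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical helper (not Vespa's WakeupPipe): writes a single token byte
// to fd and reports whether it was written, retrying if a signal interrupts
// the call. Using the result of write() is what satisfies -Wunused-result.
bool write_token(int fd) {
    const char token = 'T';
    ssize_t written;
    do {
        written = ::write(fd, &token, 1);
    } while (written < 0 && errno == EINTR);
    return written == 1;
}
```

A caller can then decide whether a failed wakeup should be logged, retried, or ignored on purpose, rather than having the result dropped silently.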


additionally

Building CXX object vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/threadpooldst.cpp.o

In file included from vespamalloc/src/vespamalloc/malloc/mallocdst16.cpp:32:0:
vespamalloc/src/vespamalloc/malloc/overload.h: In function ‘void* valloc(size_t)’:
vespamalloc/src/vespamalloc/malloc/overload.h:123:7: error: declaration of ‘void* valloc(size_t)’ has a different exception specifier
void *valloc(size_t size)
        ^~~~~~
vespamalloc/src/vespamalloc/malloc/overload.h:122:7: note: from previous declaration ‘void* valloc(size_t) throw ()’
void *valloc(size_t size) __attribute__((visibility ("default")));
       ^~~~~~
make[2]: *** [vespamalloc/src/vespamalloc/malloc/CMakeFiles/vespamalloc_mallocdst16.dir/mallocdst16.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
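
The compiler error above says the two declarations of valloc disagree on their exception specifier. As a minimal sketch of the general rule, using a hypothetical function page_alloc rather than the real overload.h code: once a function has been declared non-throwing, every later declaration and the definition must repeat a matching specifier.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator standing in for a function that an earlier header
// already declared as non-throwing (throw() in older code, noexcept today).
void* page_alloc(std::size_t size) noexcept;

// The definition repeats the same exception specifier. Dropping noexcept
// here would trigger a "different exception specifier" error.
void* page_alloc(std::size_t size) noexcept {
    return std::malloc(size);
}
```

Whether the right fix in Vespa is to add the specifier or to guard the declaration per platform is left open here; the sketch only shows the language rule behind the message.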

Just to add, I also compiled on CentOS 7 and saw none of the above.

