
trinity's People

Contributors

markpapadakis


trinity's Issues

Phrases Optimizer

  • [(xbox one) OR "xbox one"] => [(xbox one)]
  • [(xbox one) AND "xbox one"] => ["xbox one"]

This should be simple because we collapse sequences into runs.
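The two rules above can be sketched over a toy node type (purely illustrative; Trinity's actual AST node types differ): a "run" is a loosely-matched ordered sequence, a "phrase" the same sequence matched exactly, so phrase implies run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for the real query nodes.
enum class Kind { Run, Phrase };
struct Node {
    Kind kind;
    std::vector<std::string> terms;
};

// OR:  the phrase implies the run, so the run alone suffices.
// AND: the run is implied by the phrase, so the phrase alone suffices.
Node collapse(const Node &a, const Node &b, const bool isOr) {
    assert(a.terms == b.terms); // caller verified identical term sequences
    if (isOr)
        return a.kind == Kind::Run ? a : b; // keep the weaker constraint
    else
        return a.kind == Kind::Phrase ? a : b; // keep the stronger constraint
}
```

Because sequences are already collapsed into runs, identifying the matching (run, phrase) pairs should indeed be straightforward.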

Exec: matchphrase_impl alt.

In compile()'s third pass (or wherever else is appropriate), we should consider matchphrase_impl exec nodes: if the phrase size is in, say, [3, 16], replace the node with an alternative that first checks that seek() returns true for all decoders, and only then materialises hits for all of them, as opposed to the current implementation, which seek()s and materialize()s in the same loop. This way, for a long phrase where, say, the last word doesn't match the document, we wouldn't have materialised all the preceding phrase words.

Most phrases are only 2-3 terms long though, so this is not worth pursuing for now, although it should be trivial to implement.

We would just need to remember to account for that in codeblocks where we check for matchphrase_impl.

Query Rewrites: cache rewrite callback result

Depending on the query and the implementation of the callback, we may end up invoking the callback multiple times for the same (runCtx, tokens). We should cache the alternatives generated by the callback, keyed by (runCtx, tokens). This may save a lot of processing effort, especially if the callback is doing a lot of work.
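A minimal memoisation sketch, assuming the cache lives for the duration of one rewrite and keys on the runCtx's identity plus the token sequence (`RewriteCache` and its members are hypothetical names, not Trinity API):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Alternatives = std::vector<std::string>; // stand-in for generated alternatives

struct RewriteCache {
    // key: (runCtx identity, unambiguously joined token sequence)
    std::map<std::pair<const void *, std::string>, Alternatives> cached;
    size_t invocations = 0; // how many times the callback actually ran

    template <typename CB>
    const Alternatives &get(const void *runCtx,
                            const std::vector<std::string> &tokens, CB &&callback) {
        std::string key;
        for (const auto &t : tokens) {
            key += t;
            key += '\0'; // separator so ["ab","c"] != ["a","bc"]
        }
        auto [it, inserted] = cached.try_emplace({runCtx, key});
        if (inserted) {
            ++invocations;
            it->second = callback(tokens); // expensive work happens once per key
        }
        return it->second;
    }
};
```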

Query Rewrites: budget semantics issue

For a simple Trinity::query_rewrite() with the budget set to 250, and a lambda that simply returns, as the alternative for a sequence of tokens, the concatenation of those tokens (i.e. [mac, book] => macbook), the rewritten query for [com plete season steel book edition ps4 game square enix playstation] is wrong: stopping as soon as the budget is depleted apparently breaks the process.

We need to either figure out why, or do away with budgets during generation, and then prune the final query after we have constructed it, based on a budget constraint.

For now, budget is not respected until this is fixed.


Private: See Q.cpp

Do you have plans to support real time indexing?

It's not that difficult to support such a feature: providing two in-memory segments is enough.
When one in-memory segment is full, flush it to disk while the other in-memory segment takes over data ingestion. A lock-less design is required to support higher concurrency, which is not that complicated using std::atomic semantics.
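The double-buffer scheme described above might be sketched as follows. This is a single-writer simplification with hypothetical names (`MemSegment`, `RTIndex`); a genuinely lock-less multi-writer variant would publish the active-segment pointer through std::atomic operations so readers and the flusher never block ingestion.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy in-memory segment: just a buffer of documents with a fullness check.
struct MemSegment {
    std::vector<std::string> docs;
    bool full(const size_t limit) const { return docs.size() >= limit; }
};

struct RTIndex {
    std::shared_ptr<MemSegment> active{std::make_shared<MemSegment>()};
    std::vector<std::shared_ptr<MemSegment>> flushed; // stand-in for on-disk segments
    size_t limit;

    explicit RTIndex(const size_t l) : limit(l) {}

    void ingest(std::string doc) {
        if (active->full(limit)) {
            // Hand the full segment off for (background) flushing to disk,
            // and keep ingesting into a fresh segment without pausing.
            flushed.push_back(active);
            active = std::make_shared<MemSegment>();
        }
        active->docs.push_back(std::move(doc));
    }
};
```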

Error when building Trinity

Hi, I'm trying to build Trinity following the wiki. When I run make, I get this error:

        make -C Switch/ext_snappy/
        make[1]: Entering directory '/trinity-test/Trinity/Switch/ext_snappy'
        ar rc libsnappy.a snappy.o snappy-sinksource.o snappy-stubs-internal.o
        make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext_snappy'
        make -C Switch/ext/FastPFor/
        make[1]: Entering directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
        CMake Error: The source directory "/db/storage/Development/Projects/Trinity/Switch/ext/FastPFor" does not exist.
        Specify --help for usage, or press the help button on the CMake GUI.
        Makefile:830: recipe for target 'cmake_check_build_system' failed
        make[1]: *** [cmake_check_build_system] Error 1
        make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
        Makefile:63: recipe for target 'switch' failed
        make: *** [switch] Error 2

Benchmarks ?

People love benchmarks, especially against Lucene.

Lucene Codec: seek() improvements

Trinity::Codecs::Lucene::Decoder::seek() currently consults the term skip list only when it reaches the end of the current block -- we should probably change it to consult the skip list immediately, saving the iteration cost and potentially providing a great performance boost.

Replication of index data via using rocksdb as backend

Hello,
I would like to replicate my data (one master, multiple slaves). From what I understand, Trinity, like Lucene, creates segments (which contain raw data plus other related index information).
Can I store these segments in RocksDB and use something like https://github.com/pinterest/rocksplicator
for replication?
Please share your views on this approach, or a better suggestion if you have one.
We will be creating different indexes for different sources (each stored as a separate database in RocksDB), and for some indexes we don't require ranking at all. Is there a setting we can use to make processing faster (by turning ranking off)?

Optimizer: consider distinct ALL matchallterms_impl

Query rewrites may end up generating a query that, when compiled, has multiple matchallterms_impl exec nodes, like the following exec-nodes tree representation:

(1 AND ((((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)) AND 2) OR (ALL OF[29,27] AND ((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) 
OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)))))

where e.g. ALL OF[24,19,26] appears in multiple OR sub-expressions.
We could identify those, and then replace the matchallterms_impl exec nodes with a suitable alternative whose ptr points to the list of tokens and whether they have been matched for the current document, i.e.

struct
{
    docid_t        lastConsideredDID; // last document this node was evaluated for
    bool           res;               // cached result for lastConsideredDID
    uint8_t        termsCnt;
    exec_term_id_t terms[0];          // flexible array of termsCnt term IDs
};

and we would just check if lastConsideredDID == rctx.curDocID and, if so, return res; otherwise compute res, store it, and return it. This could really save a lot of checks.


Try the query [com plete sea son steel book ed ition ps4 ga me sq uare enix play station first ] with the following simple query rewrite lambda:

        Trinity::rewrite_query(q, 250, 3,
                               [](const auto ctx, const auto tokens, const auto cnt, auto &a, auto *const out) {
                                       if (cnt == 1)
                                       {
                                               const auto term = tokens[0];

                                               if (term.size() > 3 && term.p[1] == '-' && isalpha(term.p[0]) && isalpha(term.p[2]))
                                               {
                                                       // t-shirt => ("t shirt" OR tshirt)
                                                       auto p = (char *)a.Alloc(sizeof(char_t) * (term.size() + term.size() + 32)), o = p;

                                                       *o++ = '(';
                                                       *o++ = '"';
                                                       *o++ = term.p[0];
                                                       *o++ = ' ';
                                                       memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                                                       o += term.size() - 2;
                                                       *o++ = '"';
                                                       memcpy(o, _S(" OR "));
                                                       o += 4;
                                                       *o++ = term.p[0];
                                                       memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                                                       o += term.size() - 2;
                                                       *o++ = ')';

                                                       out->push_back({{p, uint32_t(o - p)}, 1});
                                               }
                                       }
                                       else if (cnt == 2)
                                       {
                                               // concatenation

                                               if (!isdigit(tokens[0].p[0]))
                                               {
                                                       const uint32_t len = tokens[0].size() + tokens[1].size();
                                                       auto p = (Trinity::char_t *)a.Alloc(len * sizeof(Trinity::char_t));

                                                       memcpy(p, tokens[0].data(), tokens[0].size() * sizeof(Trinity::char_t));
                                                       memcpy(p + tokens[0].size(), tokens[1].data(), tokens[1].size() * sizeof(Trinity::char_t));

                                                       out->push_back({{p, len}, 1});
                                               }
                                       }

                               });

Faster checks against masked documents

masked_documents_registry::make() should build a bloom filter from all updated_documents and also track the [min, max) document IDs across all of them.
This is so that masked_documents_registry::test() can check in O(1) whether a document is definitely not in any of the tracked scanners. This could save potentially dozens of iterations and checks against them.

updated_documents should include a fixed-size bloom filter, created by Trinity::pack_updates(). We already track the [lowest, highest] for them.
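A minimal sketch of the combined range + bloom pre-check (the type `MaskedDocsFilter`, the hash functions, and the filter size are illustrative choices, not Trinity's actual layout):

```cpp
#include <cstdint>

struct MaskedDocsFilter {
    static constexpr uint32_t BITS = 1u << 16; // fixed-size filter: 64 Kbits
    uint64_t bits[BITS / 64]{};
    uint32_t lo = UINT32_MAX, hi = 0; // [lo, hi) across all masked doc IDs

    // Two cheap hashes (Knuth multiplicative + a golden-ratio xor variant).
    static uint32_t h1(const uint32_t d) { return d * 2654435761u % BITS; }
    static uint32_t h2(const uint32_t d) { return (d ^ 0x9e3779b9u) * 40503u % BITS; }

    void add(const uint32_t did) {
        lo = did < lo ? did : lo;
        hi = did + 1 > hi ? did + 1 : hi;
        bits[h1(did) / 64] |= 1ull << (h1(did) % 64);
        bits[h2(did) / 64] |= 1ull << (h2(did) % 64);
    }

    // true  => did is definitely NOT masked (skip the scanners entirely)
    // false => maybe masked; fall through to the full per-scanner check
    bool definitely_not_masked(const uint32_t did) const {
        if (did < lo || did >= hi)
            return true; // outside the tracked range
        return !((bits[h1(did) / 64] >> (h1(did) % 64)) & 1) ||
               !((bits[h2(did) / 64] >> (h2(did) % 64)) & 1);
    }
};
```

Note the one-sided guarantee: a "maybe" answer still requires the existing checks, but every "definitely not" answer replaces them with two bit probes.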

Search by category

You link to this article in your docs: https://tech.ebayinc.com/research/making-e-commerce-search-faster/ . I know most of that was about optimization, but as a complete newcomer to this space, I wonder if you have an example of how this category type searching can be done using Trinity? i.e. if I want to mark a document as belonging to a specific category and then search only within that category, is it possible?
