
trinity's People

Contributors

markpapadakis


trinity's Issues

Phrases Optimizer

  • [(xbox one) OR "xbox one"] => [(xbox one)]
  • [(xbox one) AND "xbox one"] => ["xbox one"]

This should be simple because we collapse sequences into runs.
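The two rules above can be sketched over a toy node type (purely illustrative; Trinity's actual AST node types differ): a "run" is a loosely-matched ordered sequence, a "phrase" the same sequence matched exactly, so phrase implies run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for the real query nodes.
enum class Kind { Run, Phrase };
struct Node {
    Kind kind;
    std::vector<std::string> terms;
};

// OR:  the phrase implies the run, so the run alone suffices.
// AND: the run is implied by the phrase, so the phrase alone suffices.
Node collapse(const Node &a, const Node &b, const bool isOr) {
    assert(a.terms == b.terms); // caller verified identical term sequences
    if (isOr)
        return a.kind == Kind::Run ? a : b; // keep the weaker constraint
    else
        return a.kind == Kind::Phrase ? a : b; // keep the stronger constraint
}
```

Because sequences are already collapsed into runs, identifying the matching (run, phrase) pairs should indeed be straightforward.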

Exec: matchphrase_impl alt.

In compile()'s third pass (or wherever else is appropriate), we should consider matchphrase_impl exec nodes: if the phrase size is in, say, [3, 16], replace the node with an alternative that first checks that seek() returns true for all decoders, and only then materialises hits for all of them, as opposed to the current implementation, which seek()s and materialize()s in the same loop. This way, for a long phrase where, say, the last word doesn't match the document, we wouldn't have materialised all the preceding phrase words.

Most phrases are only 2-3 terms long though, so this is not worth pursuing for now, although it should be trivial to implement.

We would just need to remember to account for that in codeblocks where we check for matchphrase_impl.

Query Rewrites: cache rewrite callback result

Depending on the query and the implementation of the callback, we may end up invoking the callback multiple times for the same (runCtx, tokens). We should cache the alternatives generated by the callback, keyed by (runCtx, tokens). This may save a lot of processing effort, especially if the callback is doing a lot of work.
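A minimal memoisation sketch, assuming the cache lives for the duration of one rewrite and keys on the runCtx's identity plus the token sequence (`RewriteCache` and its members are hypothetical names, not Trinity API):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Alternatives = std::vector<std::string>; // stand-in for generated alternatives

struct RewriteCache {
    // key: (runCtx identity, unambiguously joined token sequence)
    std::map<std::pair<const void *, std::string>, Alternatives> cached;
    size_t invocations = 0; // how many times the callback actually ran

    template <typename CB>
    const Alternatives &get(const void *runCtx,
                            const std::vector<std::string> &tokens, CB &&callback) {
        std::string key;
        for (const auto &t : tokens) {
            key += t;
            key += '\0'; // separator so ["ab","c"] != ["a","bc"]
        }
        auto [it, inserted] = cached.try_emplace({runCtx, key});
        if (inserted) {
            ++invocations;
            it->second = callback(tokens); // expensive work happens once per key
        }
        return it->second;
    }
};
```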

Query Rewrites: budget semantics issue

For a simple Trinity::query_rewrite() with the budget set to 250, and a lambda that simply returns, as the alternative for a sequence of tokens, the concatenation of those tokens (i.e. [mac, book] => macbook), the rewritten query for [com plete season steel book edition ps4 game square enix playstation] is wrong: stopping as soon as the budget is depleted apparently breaks the process.

We need to either figure out why, or do away with budgets during generation, and then prune the final query after we have constructed it, based on a budget constraint.

For now, budget is not respected until this is fixed.


Private: See Q.cpp

Do you have plans to support real time indexing?

It's not that difficult to support such a feature: providing two in-memory segments is enough.
When one in-memory segment is full, flush it to disk while the other in-memory segment takes over data ingestion. A lock-less design is required to support higher concurrency, which is not that complicated using std::atomic semantics.
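The double-buffer scheme described above might be sketched as follows. This is a single-writer simplification with hypothetical names (`MemSegment`, `RTIndex`); a genuinely lock-less multi-writer variant would publish the active-segment pointer through std::atomic operations so readers and the flusher never block ingestion.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy in-memory segment: just a buffer of documents with a fullness check.
struct MemSegment {
    std::vector<std::string> docs;
    bool full(const size_t limit) const { return docs.size() >= limit; }
};

struct RTIndex {
    std::shared_ptr<MemSegment> active{std::make_shared<MemSegment>()};
    std::vector<std::shared_ptr<MemSegment>> flushed; // stand-in for on-disk segments
    size_t limit;

    explicit RTIndex(const size_t l) : limit(l) {}

    void ingest(std::string doc) {
        if (active->full(limit)) {
            // Hand the full segment off for (background) flushing to disk,
            // and keep ingesting into a fresh segment without pausing.
            flushed.push_back(active);
            active = std::make_shared<MemSegment>();
        }
        active->docs.push_back(std::move(doc));
    }
};
```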

Error when building Trinity

Hi, I'm trying to build Trinity following the wiki. When I run make, I get this error:

        make -C Switch/ext_snappy/
        make[1]: Entering directory '/trinity-test/Trinity/Switch/ext_snappy'
        ar rc libsnappy.a snappy.o snappy-sinksource.o snappy-stubs-internal.o
        make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext_snappy'
        make -C Switch/ext/FastPFor/
        make[1]: Entering directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
        CMake Error: The source directory "/db/storage/Development/Projects/Trinity/Switch/ext/FastPFor" does not exist.
        Specify --help for usage, or press the help button on the CMake GUI.
        Makefile:830: recipe for target 'cmake_check_build_system' failed
        make[1]: *** [cmake_check_build_system] Error 1
        make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
        Makefile:63: recipe for target 'switch' failed
        make: *** [switch] Error 2

Benchmarks ?

People love benchmarks, especially against Lucene.

Lucene Codec: seek() improvements

Trinity::Codecs::Lucene::Decoder::seek() currently consults the term skip list only when it reaches the end of the current block -- we should probably change it to consult the skip list immediately, saving the iteration cost and potentially providing a great performance boost.

Replication of index data via using rocksdb as backend

Hello,
I would like to replicate my data (one master, multiple slaves). From what I understand, Trinity, like Lucene, creates segments (which contain raw data plus other related index information).
Can I store these segments in RocksDB and use something like https://github.com/pinterest/rocksplicator
for replication?
Please share your views on this approach, or a better suggestion if you have one.
We will be creating different indexes for different sources (each stored as a separate database in RocksDB), and for some indexes we don't require ranking at all. Is there a setting we can use to make processing faster (by turning ranking off)?

Optimizer: consider distinct ALL matchallterms_impl

Query rewrites may end up generating a query that, when compiled, has multiple matchallterms_impl exec nodes, like the following exec-nodes tree representation:

(1 AND ((((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)) AND 2) OR (ALL OF[29,27] AND ((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) 
OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)))))

where e.g. ALL OF[24,19,26] appears in multiple OR sub-expressions.
We could identify those, and then replace the matchallterms_impl exec nodes with a suitable alternative whose ptr points to the list of tokens and whether they have been matched for the current document, i.e.

struct
{
    docid_t        lastConsideredDID; // last document this node was evaluated for
    bool           res;               // cached result for lastConsideredDID
    uint8_t        termsCnt;
    exec_term_id_t terms[0];          // flexible array of termsCnt term IDs
};

and we would just check if lastConsideredDID == rctx.curDocID and, if so, return res; otherwise compute res, store it, and return it. This could really save a lot of checks.


Try the query [com plete sea son steel book ed ition ps4 ga me sq uare enix play station first ] with the following simple query rewrite lambda:

        Trinity::rewrite_query(q, 250, 3,
                               [](const auto ctx, const auto tokens, const auto cnt, auto &a, auto *const out) {
                                       if (cnt == 1)
                                       {
                                               const auto term = tokens[0];

                                               if (term.size() > 3 && term.p[1] == '-' && isalpha(term.p[0]) && isalpha(term.p[2]))
                                               {
                                                       // t-shirt => ("t shirt" OR tshirt)
                                                       auto p = (char *)a.Alloc(sizeof(char_t) * (term.size() + term.size() + 32)), o = p;

                                                       *o++ = '(';
                                                       *o++ = '"';
                                                       *o++ = term.p[0];
                                                       *o++ = ' ';
                                                       memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                                                       o += term.size() - 2;
                                                       *o++ = '"';
                                                       memcpy(o, _S(" OR "));
                                                       o += 4;
                                                       *o++ = term.p[0];
                                                       memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                                                       o += term.size() - 2;
                                                       *o++ = ')';

                                                       out->push_back({{p, uint32_t(o - p)}, 1});
                                               }
                                       }
                                       else if (cnt == 2)
                                       {
                                               // concatenation

                                               if (!isdigit(tokens[0].p[0]))
                                               {
                                                       const uint32_t len = tokens[0].size() + tokens[1].size();
                                                       auto p = (Trinity::char_t *)a.Alloc(len * sizeof(Trinity::char_t));

                                                       memcpy(p, tokens[0].data(), tokens[0].size() * sizeof(Trinity::char_t));
                                                       memcpy(p + tokens[0].size(), tokens[1].data(), tokens[1].size() * sizeof(Trinity::char_t));

                                                       out->push_back({{p, len}, 1});
                                               }
                                       }

                               });

Faster checks against masked documents

masked_documents_registry::make() should build a bloom filter from all updated_documents and also track the [min, max) document IDs across all of them.
This is so that masked_documents_registry::test() can check in O(1) whether a document is definitely not in any of the tracked scanners. This could save potentially dozens of iterations and checks against them.

updated_documents should include a fixed-size bloom filter, created by Trinity::pack_updates(). We already track the [lowest, highest] for them.
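A minimal sketch of the combined range + bloom pre-check (the type `MaskedDocsFilter`, the hash functions, and the filter size are illustrative choices, not Trinity's actual layout):

```cpp
#include <cstdint>

struct MaskedDocsFilter {
    static constexpr uint32_t BITS = 1u << 16; // fixed-size filter: 64 Kbits
    uint64_t bits[BITS / 64]{};
    uint32_t lo = UINT32_MAX, hi = 0; // [lo, hi) across all masked doc IDs

    // Two cheap hashes (Knuth multiplicative + a golden-ratio xor variant).
    static uint32_t h1(const uint32_t d) { return d * 2654435761u % BITS; }
    static uint32_t h2(const uint32_t d) { return (d ^ 0x9e3779b9u) * 40503u % BITS; }

    void add(const uint32_t did) {
        lo = did < lo ? did : lo;
        hi = did + 1 > hi ? did + 1 : hi;
        bits[h1(did) / 64] |= 1ull << (h1(did) % 64);
        bits[h2(did) / 64] |= 1ull << (h2(did) % 64);
    }

    // true  => did is definitely NOT masked (skip the scanners entirely)
    // false => maybe masked; fall through to the full per-scanner check
    bool definitely_not_masked(const uint32_t did) const {
        if (did < lo || did >= hi)
            return true; // outside the tracked range
        return !((bits[h1(did) / 64] >> (h1(did) % 64)) & 1) ||
               !((bits[h2(did) / 64] >> (h2(did) % 64)) & 1);
    }
};
```

Note the one-sided guarantee: a "maybe" answer still requires the existing checks, but every "definitely not" answer replaces them with two bit probes.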

Search by category

You link to this article in your docs: https://tech.ebayinc.com/research/making-e-commerce-search-faster/ . I know most of that was about optimization, but as a complete newcomer to this space, I wonder if you have an example of how this category type searching can be done using Trinity? i.e. if I want to mark a document as belonging to a specific category and then search only within that category, is it possible?
