phaistos-networks / trinity
Trinity IR Infrastructure
License: Apache License 2.0
Search for alt in exec.cpp
The original passes were not appropriate and have been disabled, but we should still figure out something appropriate to replace them with.
For example:
1 AND ANY OF[1,4,10]
and
ALL OF[3,2,1] AND <ANY OF[5,4,1]
We should probably extend this class to support flags that can in turn control open() and access semantics of hits.data, just as we should do (via an aux. function or a wrapper) for the actual index data. See here for the rationale.
"Trinity Search" seems to be saturated:
https://www.google.com/search?q=trinity+search&ie=utf-8&oe=utf-8
I saw Mark's comment in a Hacker News post and had to google "mark papadakis" to find this repo.
This should be simple because we collapse sequences into runs.
In compile()'s third pass (or elsewhere, where appropriate), we need to consider matchphrase_impl
exec nodes and, if the phrase size is in, say, [3, 16], replace them with an alternative that first checks that all decoders' seek() calls return true, and only then materializes hits for all of them, as opposed to what we do now, where we seek() and materialize() in the same loop. This way, if we have a long phrase and, say, the last word of the phrase doesn't match the document, we won't have materialized all the previous phrase words.
Most phrases are 2-3 terms long though, so for now this is not something worth pursuing, although it should be trivial to implement.
We would just need to remember to account for that in code blocks where we check for matchphrase_impl.
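The two-phase idea above can be sketched as follows. Note that Decoder, seek(), and phrase_matches() here are simplified stand-ins for illustration, not Trinity's actual types or signatures:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoder stand-in: seek() is cheap, materialization is expensive.
struct Decoder {
    std::vector<uint32_t> docs;     // sorted document IDs containing the term

    bool seek(uint32_t did) const { // does this term appear in document `did`?
        for (const auto d : docs)
            if (d == did)
                return true;
        return false;
    }
};

// Two-phase phrase check: first verify that every term's decoder seek()s to
// the document, and only then materialize hits. With a long phrase whose last
// term is absent from the document, we skip materialization of all the
// earlier terms entirely.
bool phrase_matches(const std::vector<Decoder> &terms, uint32_t did, size_t &materialized) {
    for (const auto &t : terms)
        if (!t.seek(did))
            return false;           // phase 1: cheap existence checks only

    for (const auto &t : terms) {
        (void)t;
        ++materialized;             // phase 2: stand-in for materializing hits
    }
    return true;
}
```

The counter only makes the saving visible: if the last term fails seek(), nothing is materialized, whereas the current single-loop scheme would have materialized every preceding term first.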
Depending on the query and the implementation of the callback, we may end up invoking the callback multiple times for the same (runCtx, tokens). We should cache the alternatives generated by the callback, keyed by (runCtx, tokens). This may save a lot of processing effort, especially if the callback is doing a lot of work, and otherwise improve performance.
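A minimal sketch of that memoization, assuming a callback that maps a token sequence to a list of alternatives (RewriteCache, Tokens, and Alternatives are illustrative names, not Trinity's API):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;
using Alternatives = std::vector<std::string>;

// Cache rewrite-callback results keyed by (runCtx, tokens), so the callback
// runs at most once per distinct key regardless of how often the rewrite
// logic revisits the same token run.
struct RewriteCache {
    std::map<std::pair<uintptr_t, Tokens>, Alternatives> cache;
    size_t misses = 0;

    Alternatives get(uintptr_t runCtx, const Tokens &tokens,
                     const std::function<Alternatives(const Tokens &)> &cb) {
        const auto key = std::make_pair(runCtx, tokens);
        const auto it = cache.find(key);

        if (it != cache.end())
            return it->second; // cached: skip the (potentially expensive) callback

        ++misses;
        return cache.emplace(key, cb(tokens)).first->second;
    }
};
```

An unordered_map with a custom hash over (runCtx, tokens) would avoid the ordered-map comparisons, at the cost of writing that hash.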
For a simple Trinity::query_rewrite() with budget set to 250, and a lambda that, for a sequence of tokens, simply returns their concatenation as an alternative (i.e. [mac, book] => macbook), the rewritten query for [com plete season steel book edition ps4 game square enix playstation]
is wrong: stopping when the budget is depleted apparently breaks the process.
We need to either figure out why, or do away with budgets during generation and instead prune the final query, after it has been constructed, based on the budget constraint.
For now, the budget is not respected until this is fixed.
Private: See Q.cpp
It's not that difficult to support such a feature; providing two in-memory segments is enough.
When one in-memory segment is full, flush it to disk while the other in-memory segment supports data ingestion at the same time. Higher concurrency requires a lock-less design, which is not that complicated using std::atomic semantics.
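A minimal sketch of the double-buffering idea with std::atomic. This only shows the swap itself; a real design would also need to coordinate concurrent readers and the actual flush-to-disk, and SegmentPair is an invented name, not a Trinity type:

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Two in-memory segments; an atomic index selects the one accepting
// ingestion. When the active segment fills up, swap atomically: the full
// segment can then be flushed to disk while the other keeps taking writes.
struct SegmentPair {
    std::vector<std::string> segments[2];
    std::atomic<unsigned> active{0};

    void ingest(std::string doc) {
        segments[active.load(std::memory_order_acquire)].push_back(std::move(doc));
    }

    // Atomically redirect ingestion to the other segment and return the
    // index of the now-frozen segment, which the caller should flush.
    unsigned swap_for_flush() {
        const unsigned full = active.load(std::memory_order_acquire);
        active.store(1 - full, std::memory_order_release);
        return full;
    }
};
```

In practice the frozen segment must not be reused until its flush completes, and readers searching the frozen segment need some reclamation scheme (epochs, hazard pointers, or shared_ptr), which this sketch omits.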
Hi, I'm trying to build Trinity following the wiki. When I run make, I get this error:
make -C Switch/ext_snappy/
make[1]: Entering directory '/trinity-test/Trinity/Switch/ext_snappy'
ar rc libsnappy.a snappy.o snappy-sinksource.o snappy-stubs-internal.o
make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext_snappy'
make -C Switch/ext/FastPFor/
make[1]: Entering directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
CMake Error: The source directory "/db/storage/Development/Projects/Trinity/Switch/ext/FastPFor" does not exist.
Specify --help for usage, or press the help button on the CMake GUI.
Makefile:830: recipe for target 'cmake_check_build_system' failed
make[1]: *** [cmake_check_build_system] Error 1
make[1]: Leaving directory '/vnazzaro/trinity-test/Trinity/Switch/ext/FastPFor'
Makefile:63: recipe for target 'switch' failed
make: *** [switch] Error 2
People love benchmarks, especially against Lucene.
Trinity::Codecs::Lucene::Decoder::seek() currently consults the term skip list only when it reaches the end of the current block -- we should probably change that to consider seeking via the skip list immediately, thus saving the iteration cost and potentially providing a great performance boost.
Hello,
I would like to replicate my data (one master, multiple slaves). Like Lucene, from what I understand, Trinity also creates segments (which contain raw data plus other related index information).
Can I store these segments in RocksDB and use something like https://github.com/pinterest/rocksplicator
for replication?
Please tell me your views on this approach, or suggest a better one if you have it.
We will be creating different indexes for different sources (each stored as a separate database in RocksDB), and for some indexes we don't require ranking at all. Is there a setting we can use to make processing faster (by turning ranking off)?
Query rewrites may end up generating a query that, when compiled, has multiple matchallterms_impl
exec nodes, like this exec nodes tree representation:
(1 AND ((((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)) AND 2) OR (ALL OF[29,27] AND ((3 AND ((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) 
OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))))) OR (((ALL OF[24,19,26] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16]))) OR (((((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])) AND ALL OF[19,4]) OR (ALL OF[24,19,22] AND ((((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND 20) OR (((ALL OF[15,17] AND (ALL OF[9,8] OR ALL OF[10,12,9])) OR (ALL OF[13,15,18] AND (ALL OF[9,8] OR ALL OF[10,12,9]))) AND ALL OF[21,16])))) AND 26)) AND 25)))))
where e.g ALL OF[24,19,26] appears in multiple OR sub-expressions.
If we could identify those, we could then replace matchallterms_impl
exec nodes with a suitable alternative, where its ptr
points to the list of tokens and whether they have been matched for this document, i.e
struct
{
    docid_t lastConsideredDID;
    bool res;
    uint8_t termsCnt;
    exec_term_id_t terms[0];
};
and we would just check whether lastConsideredDID == rctx.curDocID and, if so, return res; otherwise compute res, store it, and return it. This could really save a lot of checks.
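The memoization logic can be sketched as follows; CachedAllTerms and eval() are illustrative names, and the result computation is abstracted into a callable rather than Trinity's actual matchallterms evaluation:

```cpp
#include <cassert>
#include <cstdint>

using docid_t = uint32_t;

// Per-document memoization for a shared ALL OF sub-expression: once the
// result has been computed for the current document, every subsequent
// occurrence of the node in the exec tree costs a single comparison.
struct CachedAllTerms {
    docid_t lastConsideredDID = UINT32_MAX; // sentinel: no document cached yet
    bool res = false;

    template <typename F>
    bool eval(const docid_t curDocID, F &&computeRes) {
        if (lastConsideredDID == curDocID)
            return res;                     // already computed for this document

        lastConsideredDID = curDocID;
        res = computeRes();                 // compute once, then serve from cache
        return res;
    }
};
```

The sentinel assumes UINT32_MAX is never a valid document ID; a separate "valid" flag would be safer if it can be.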
Try the query [com plete sea son steel book ed ition ps4 ga me sq uare enix play station first ]
with the following simple query rewrite lambda:
Trinity::rewrite_query(q, 250, 3,
    [](const auto ctx, const auto tokens, const auto cnt, auto &a, auto *const out) {
        if (cnt == 1)
        {
            const auto term = tokens[0];

            if (term.size() > 3 && term.p[1] == '-' && isalpha(term.p[0]) && isalpha(term.p[2]))
            {
                // t-shirt => ("t shirt" OR tshirt)
                auto p = (char *)a.Alloc(sizeof(char_t) * (term.size() + term.size() + 32)), o = p;

                *o++ = '(';
                *o++ = '"';
                *o++ = term.p[0];
                *o++ = ' ';
                memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                o += term.size() - 2;
                *o++ = '"';
                memcpy(o, _S(" OR "));
                o += 4;
                *o++ = term.p[0];
                memcpy(o, term.p + 2, (term.size() - 2) * sizeof(char_t));
                o += term.size() - 2;
                *o++ = ')';
                out->push_back({{p, uint32_t(o - p)}, 1});
            }
        }
        else if (cnt == 2)
        {
            // concatenation
            if (!isdigit(tokens[0].p[0]))
            {
                const uint32_t len = tokens[0].size() + tokens[1].size();
                auto p = (Trinity::char_t *)a.Alloc(len * sizeof(Trinity::char_t));

                memcpy(p, tokens[0].data(), tokens[0].size() * sizeof(Trinity::char_t));
                memcpy(p + tokens[0].size(), tokens[1].data(), tokens[1].size() * sizeof(Trinity::char_t));
                out->push_back({{p, len}, 1});
            }
        }
    });
masked_documents_registry::make() should build a bloom filter from all updated_documents, and also track the [min, max) document IDs across all of them.
This is so that masked_documents_registry::test() can check in O(1) whether a document is definitely not in any of the tracked scanners. This could save potentially dozens of iterations and checks against them.
updated_documents should include a fixed-size bloom filter, created by Trinity::pack_updates(). We already track the [lowest, highest] for them.
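A sketch of that fast path, using a single-hash bitset as a stand-in for the real bloom filter (MaskedDocsPrefilter and its members are invented names for illustration):

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

using docid_t = uint32_t;

// Fixed-size bloom-style bitset plus the [min, max) ID range across all
// updated_documents. maybe_masked() rejects most documents in O(1) before
// any of the per-update scanners need to be consulted.
struct MaskedDocsPrefilter {
    std::bitset<4096> bloom;
    docid_t minDID = UINT32_MAX;
    docid_t maxDID = 0; // exclusive upper bound, i.e. [minDID, maxDID)

    void add(const docid_t did) {
        bloom.set(did % bloom.size()); // one cheap hash; real filters use several
        if (did < minDID)
            minDID = did;
        if (did + 1 > maxDID)
            maxDID = did + 1;
    }

    // false => definitely not masked; true => must still check the scanners
    bool maybe_masked(const docid_t did) const {
        if (did < minDID || did >= maxDID)
            return false;              // outside the range of any update
        return bloom.test(did % bloom.size());
    }
};
```

Like any bloom filter, a true result is only "maybe", so test() would still fall through to the scanners on a hit; only the false result is the shortcut.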
Hello,
Say my data consists of text that also has some attributes (like Info1, Info2, etc.).
I should be able to search both on any word present and on any of the attributes.
Can you suggest how this can be implemented?
You link to this article in your docs: https://tech.ebayinc.com/research/making-e-commerce-search-faster/ . I know most of that was about optimization, but as a complete newcomer to this space, I wonder whether you have an example of how this kind of category searching can be done using Trinity. I.e., if I want to mark a document as belonging to a specific category and then search only within that category, is it possible?