Comments (8)
If nobody else is working on this, I think I'd like to take it!
from lucene.
@msfroh - I was looking into this as well and had some thoughts about how to do it.
We could replace ParallelTaxonomyArrays
with a new interface that offers three operations for each of the arrays:
interface ChunkedParallelTaxonomyArrays {
/* Record new entry. */
public void appendParent(int parent);
/* Retrieve this ordinal's parent. From the user's perspective, this is like an array look-up. */
public int getParent(int ord);
/* There are some places where we need to know how many parents exist in total. */
public int sizeParents();
// Same for children and siblings
...
}
To implement this inteface, we could use an IntBlockPool
. We would allocate new int buffers in the block pool as needed and preserve the block pool across DirectoryTaxonomyReader
refreshes.
There are definitely some disadvantages with the block pool idea:
- We're preserving a mutable data-structure across taxonomy refreshes. There is precedent though, with the caches in
DirectoryTaxonomyReader
. - We would be slightly overallocating by having the last buffer in the pool not be completely used, but I think this is a good trade-off to take for the increased efficiency and simplicity.
What do you think, did you have something else in mind?
from lucene.
What do you think, did you have something else in mind?
Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!" Sounds like you've got it in hand, though!
from lucene.
I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:
- Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader.
- Augment the
IntBlockPool
with conveience methods that support this new use-case and implement the new taxo array interface using theIntBlockPool
.
1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.
from lucene.
I took a look and I think we might be able to do it a little easier:
public abstract class ParallelTaxonomyArrays {
public class ChunkedArray {
private final int chunkSize;
private final int[][] chunks;
public ChunkedArray(int chunkSize, int[][] chunks) {
this.chunkSize = chunkSize;
this.chunks = chunks;
}
public int get(int i) {
int chunkNum = i / chunkSize;
return chunks[chunkNum][i - (chunkNum * chunkSize)];
}
public int length() {
return chunkSize * chunks.length;
}
}
/** Sole constructor. */
public ParallelTaxonomyArrays() {}
public abstract ChunkedArray parents();
public abstract ChunkedArray children();
public abstract ChunkedArray siblings();
}
Then within TaxonomyIndexArrays
, we can focus on building int[][]
instances that get wrapped in ChunkedArray
. I feel like IntBlockPool
might be overkill?
from lucene.
I ended up running with that idea (sort of) and implemented this: #12995
The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the size of chunk that I set).
from lucene.
Thanks @msfroh! The PR looks neat and you might be right that, while IntBlockPool
basically maintains an int[][]
like our ChunkedArray
, it is a bit inconveient to work with.
I left more detailed comments on the PR, but the high-level question is if we've actually reduced the memory footprint during taxonomy refreshes. It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.
from lucene.
It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.
What you've missed is that I'm a big dum-dum 😁
Thanks for catching that! I refactored some code into a shared method (between the "reuse old arrays" case and the "start fresh with a TaxonomyReader" case) and foolishly applied the "start fresh" logic every time. I've fixed it in a subsequent commit (allocating chunks only starting from the index of the last chunk of the old array).
I also incorporated several of the other changes that you suggested. Thanks a lot!
from lucene.
Related Issues (20)
- Doc out of order issue from Lucene 8.10.1 HOT 7
- IndexWriter loses track of parent field when index is empty HOT 4
- How to run tests with the Panama Vector implementation HOT 20
- Null exception occured when click on Luke desktop browser button
- Make intra tasks in IndexingChain.flush parallel execute. HOT 3
- Significant drop in recall for int8 scalar quantization using maximum_inner_product HOT 17
- TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader failure due to DWPT race condition
- NRT failure due to FieldInfo & File mismatch HOT 2
- Merges sometimes do lots of work even after being aborted
- Enhance DisjunctionMaxQuery explanation to include details in case there was no match
- Reproducible failure org.apache.lucene.search.TestBlockMaxConjunction
- Reproducible failure TestOrdinalMap.testRamBytesUsed
- DataOutput.writeGroupVInts throws IntegerOverflow exception during merging HOT 5
- Improve LongRange execution for ranges that have min value as Long.MIN_VALUE and max value as Long.MAX_VALUE or min as 0, and max as 2^64-1 HOT 7
- Reproducible failure TestIndexWriterExceptions2.testBasics HOT 1
- Reproducible failure TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults HOT 3
- Inconsistency Vector Search Cosine Similarity HOT 1
- Support for criteria based DWPT selection inside DocumentWriter HOT 4
- Add method to `Intervals#noIntervals(String reason)` to `Intervals` class HOT 1
- qweight.matches(LeafReaderContext ctx, int doc) can be prohibitively slow for large TermInSet queries HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lucene.