I've seen apps that have a good number of fields – hundreds. The O(log(N)) of TreeMap

All Fields.terms(fld) impls should be O(1) not O(log(N)) [LUCENE-8041] about stargazers-migration-test HOT 17 CLOSED

mikemccand commented on June 12, 2024

All Fields.terms(fld) impls should be O(1) not O(log(N)) [LUCENE-8041]

from stargazers-migration-test.

Comments (17)

mikemccand commented on June 12, 2024

It doesn't need to be all all fields.terms impls. It is enough to optimize the default codec.

TreeMap is a good simple default, all the various alternative terms dicts can continue to use it.
But the default codec should optimize for the access behavior that matters: accessing a field randomly.

I don't think we should remove field iteration/Fields unless we remove the ability to change term vectors "per-doc". It is currently needed (e.g. by CheckIndex) to know what fields were truly indexed for a specific document with vectors, since that may disagree with FieldInfos. If we fixed that, then it would truly be unnecessary and FieldInfos would be all we need.

[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I submitted a simple patch that replaces TreeMap by HashMap in BlockTreeTermsReader and PerFieldPostingsFormat#FieldsReader. Can someone please update the title of the issue? It should be O(1) instead of O(N). Thanks!

[Legacy Jira: Nhat Nguyen (@dnhatn) on Jun 11 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

This has the downside that it sorts all fields on every call to iterator(). My concern is mainly that it will introduce performance problems down the line, ones that are difficult to find/debug because of java's syntactic sugar around iterator(). Especially if someone is using MultiFields (slow-wrapper crap), they will be doing a bunch of sorts on each segment, then merging those, and all hidden behind a single call to iterator().

I still feel the best would be to remove this map entirely: then you can be sure there aren't traps. The only thing blocking this is the fact that term-vector options are configurable per-doc, which doesnt make sense anyway.

[Legacy Jira: Robert Muir (@rmuir) on Jun 11 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I don't think we should remove field iteration/Fields unless we remove the ability to change term vectors "per-doc". It is currently needed (e.g. by CheckIndex) to know what fields were truly indexed for a specific document with vectors, since that may disagree with FieldInfos. If we fixed that, then it would truly be unnecessary and FieldInfos would be all we need.

That sounds like the cart leading the horse (allowing how CheckIndex works today prevent us from remaking how we want Lucene to be tomorrow). Can't we just relax what CheckIndex checks here – like have it check but report a warning if only some docs have TVs and others not which is generally not normal? I think that's what you're getting at but I'm not sure. I've only looked at CheckIndex in passing.

The only thing blocking this is the fact that term-vector options are configurable per-doc, which doesnt make sense anyway.

+1 I agree; if this feature is a casualty of the refactor then I'm fine with it going away. I haven't looked close enough to see how much these things are linked (i.e. can we really not have it both ways).

[Legacy Jira: David Smiley (@dsmiley) on Jun 11 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

That sounds like the cart leading the horse (allowing how CheckIndex works today prevent us from remaking how we want Lucene to be tomorrow). Can't we just relax what CheckIndex checks here – like have it check but report a warning if only some docs have TVs and others not which is generally not normal? I think that's what you're getting at but I'm not sure. I've only looked at CheckIndex in passing.

That's absolutely not the case at all. The user is allowed to do this, hence checkindex must validate it. Please don't make checkindex the bad guy here, its not. The problem is related to indexwriter allowing users to do this.

[Legacy Jira: Robert Muir (@rmuir) on Jun 12 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

+1 to make term vectors consistent across the index; it has always been strange that Lucene allows this. Maybe open a separate issue for that?

[Legacy Jira: Michael McCandless (@mikemccand) on Jun 12 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

+1 to make term vectors consistent across the index; it has always been strange that Lucene allows this. Maybe open a separate issue for that?

This has the downside that it sorts all fields on every call to iterator(). My concern is mainly that it will introduce performance problems down the line, ones that are difficult to find/debug because of java's syntactic sugar around iterator(). Especially if someone is using MultiFields (slow-wrapper crap), they will be doing a bunch of sorts on each segment, then merging those, and all hidden behind a single call to iterator().

I do wonder if an intermediate step would be to have a map + a Iterable<String> so we don't need to do this sorting over and over again?

[Legacy Jira: Simon Willnauer (@s1monw) on Jun 12 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I do wonder if an intermediate step would be to have a map + a Iterable so we don't need to do this sorting over and over again?

Absolutely; there's no need to re-sort. I'm working with [~[email protected]] who came up with this tidy solution. In BlockTreeTermsReader, the "fields" field becomes a HashMap. Then, in the constructor after it's populated, there's a few lines to build up the list:

ArrayList<String> fieldsNames = new ArrayList<String>(fields.keySet());
Collections.sort(fieldsNames);
fieldsNamesIterable = Collections.unmodifiableCollection(fieldsNames);

fieldsNamesIterable is a new declared field.
Very similar logic goes in PerFieldsPostingsFormat.

I think it's possible to avoid the sort() in BlockTreeTermsReader if you know you're reading data written pre-sorted.

[Legacy Jira: David Smiley (@dsmiley) on Jun 12 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

+1 to make term vectors consistent across the index; it has always been strange that Lucene allows this. Maybe open a separate issue for that?

This issue specifically asks why there is an iterator at all in the description, thats why i explained it.

But i also am concerned about this issue because i don't think its a real bottleneck for anyone. I don't want us doing anything risky that could potentially hurt ordinary users for some esoteric abuse case with a million fields: it would be better to just stay with treemap.

It is fine to sort a list in the constructor, or use a linkedhashmap. This won't hurt ordinary users, it will just cost more ram for abuse cases, so I am fine. I really don't want to see sneaky optimizations trying to avoid sorts or any of that, it does not belong here, this needs to be simple, clear, and safe. Instead any serious effort should go into trying to remove the problematic api (term vectors stuff), then it can even simpler since we won't need two data structures.

[Legacy Jira: Robert Muir (@rmuir) on Jun 12 2018]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I created a patch that use LinkedHashMap to maintain field name to Fields mapping.

LUCENE-8041-LinkedHashMap.patch

[Legacy Jira: Huy Le on May 16 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I have a maybe better suggestion instead of:

Build a TreeMap like before and name it sortedField()
Then rebuild it as LinkedHashMap

The problem here is that we have to build a possibly huge map, just to copy it to a LinkedHashMap, and hotspot can't easily optimize away the creation on heap. The result is fine (as said before): The LinkedHashMap needs a bit more heap space than a plain HashMap, but should be of almost identical size to a TreeMap (which uses a lot of small inner object instances).

My suggestion would be: You just iterate over the fields, do some transformations and then finally add them to a TreeMap. Rewrite that to use Java Streams - it may not work for all cases (especially if the iteration has I/O involved and the order is important), but e.g. for direct fields it may work (possibly the IOException needs to be wrapped):

     public DirectFields(SegmentReadState state, Fields fields, int minSkipCount, int lowFreqCutoff) throws IOException {
       this.fields = StreamSupport.stream(fields.spliterator(), false)
        .sorted()   // <== that's the trick
        .collect(Collectors.toMap(Function.identity(), field -> new DirectField(state, field, fields.terms(field), minSkipCount, lowFreqCutoff), (u,v) -> throw new IllegalArgumentException("Duplicate field name"), LinkedHashMap::new));
     }

This is totally untested, just as an idea! The merge function there is just stupid, but it's never called as there should be no duplicate keys. This actually is a bit more strict, if Field is wrongly implemented and returns the same field name multiple times in its iterator.

[Legacy Jira: Uwe Schindler (@uschindler) on May 16 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I don't think that we can avoid paying the additional O(N) memory cost by using Stream#sorted trick because under the hood, the stream implementation also creates an Array, sort it then copy it into the LinkedHashMap in the terminal stage (see SortedOps.SizedRefSortingSink). Apart from DirectFields, it is hard to implement your approach on other cases. The patch I provided has advantage that it is simple, clear and safe.

As we are talking about number of fields, it is unlikely that there is a index with million fields perhaps ten of thousands is more realistic.

[Legacy Jira: Huy Le on May 17 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

@dsmiley is there anyway we can move forward with this ticket ?

[Legacy Jira: Huy Le on Oct 16 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

My recommendation is not to use a TreeMap at all. Use a plain HashMap. In the constructor, save away a Iterable<String> using a method such as the following (taken from a fork of Lucene at work):

    private static Iterable<String> sortedFieldNames(Collection<String> unsortedFields) {
      List<String> fieldsNames = new ArrayList<>(unsortedFields);
      Collections.sort(fieldsNames);
      return Collections.unmodifiableCollection(fieldsNames);
    }

You could just do this for PerFieldPostingsFormat's reader and Lucene50PostingsReader as these are the common ones, or extend this idea to others if you wish. UniformSplit is already doing this approach. Maybe just do those 2 up front for code review. You could put these few lines of code into the affected files, or consider adding this as a protected method on Fields.

One day we can get rid of iterator() and that day it'll be less change if we use a HashMap for the fields, which is what we'll want then.

[Legacy Jira: David Smiley (@dsmiley) on Oct 17 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I'm sorry I created and worked on a kind of duplicate Jira issue LUCENE-9045 (now linked to this one as a child). I just heard about this one now.

The mentioned Jira issue fixed the problem for BlockTree and PerFieldPostingsFormat only.

I read in the thread that we should work on making term vectors consistent across the index. Should I create another Jira issue specific to that (and close this one as dupicate)? Or should I keep this one and maybe rename it?

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Dec 02 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I created LUCENE-9078 "Term vectors options should not be configurable per-doc".

@dsmiley should we close this one?

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Dec 03 2019]

from stargazers-migration-test.

mikemccand commented on June 12, 2024

I'm closing as won't-fix. It's debatable how to close it or what resolution to fix. It can always be re-opened if we wish.

[Legacy Jira: David Smiley (@dsmiley) on Dec 03 2019]

from stargazers-migration-test.

All Fields.terms(fld) impls should be O(1) not O(log(N)) [LUCENE-8041] about stargazers-migration-test HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent