
Comments (4)

sopel39 commented on May 11, 2024

It depends on the column and the type of partitions. For daily rolling partitions, NDVs will likely overlap.

Yup. Discussion from original issue: let's say you partition by date and you have two data columns:

  • userId
  • eventId

Then for userId you should take the max (users come from the same ID pool), even though the userId NDV could be approximately the same as the partition row count.
For eventId you should sum, because each eventId is unique.
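To make the max-versus-sum distinction concrete, here is a minimal sketch. The partition names, the sample values, and the helper names (`combine_max`, `combine_sum`) are made up for illustration and are not taken from the Trino code base.

```python
def combine_max(ndvs):
    """Assumes partitions draw values from one shared pool (heavy overlap)."""
    return max(ndvs, default=0)


def combine_sum(ndvs, total_rows=None):
    """Assumes partitions hold disjoint values; capped at the row count if known."""
    total = sum(ndvs)
    return total if total_rows is None else min(total, total_rows)


# Hypothetical date partitions, invented for this example.
partitions = {
    "day1": {"userId": ["u1", "u2", "u3"], "eventId": ["e1", "e2", "e3"]},
    "day2": {"userId": ["u3", "u1", "u2"], "eventId": ["e4", "e5", "e6"]},
}


def per_partition_ndv(column):
    # Exact NDV of one column inside each partition.
    return [len(set(p[column])) for p in partitions.values()]


def true_ndv(column):
    # Exact NDV over the whole table, used only to check the estimates.
    return len(set().union(*(p[column] for p in partitions.values())))


user_estimate = combine_max(per_partition_ndv("userId"))
event_estimate = combine_sum(per_partition_ndv("eventId"), total_rows=6)
```

In this toy data the max matches the table-level NDV for userId and the sum matches it for eventId, while swapping the two strategies would overestimate userId and underestimate eventId.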

from trino.

arhimondr commented on May 11, 2024

Should we extrapolate NDVs instead?

Currently we choose the maximum NDV. If we decided to sum NDVs, we would need to extrapolate.

It seems that partitions might often be different chunks of data so that NDVs don't overlap.

It depends on the column and the type of partitions. For daily rolling partitions, NDVs will likely overlap.

Alternatively, we could store the HLL state per column as an auxiliary partition property and calculate the extrapolation based on merged HLLs.

I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.
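As a rough illustration of why merged HLL states sidestep the max-versus-sum question, the sketch below implements a small HyperLogLog with a register-wise merge. It is a simplified stand-in, not the HLL implementation Trino uses; the register count, the hash choice, and all function names are assumptions made for this example.

```python
import hashlib
import math

P = 12          # index bits (assumed for this sketch)
M = 1 << P      # number of registers


def new_sketch():
    return bytearray(M)


def add(sketch, value):
    # 64-bit hash of the value; SHA-256 is used only for convenience.
    h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
    idx = h >> (64 - P)                       # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)


def merge(a, b):
    # Merging two sketches is a register-wise max, so overlap is handled for free.
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range (linear counting) correction
    return raw


def sketch_of(values):
    s = new_sketch()
    for v in values:
        add(s, v)
    return s


# Two hypothetical partitions: identical user pool, disjoint event IDs.
users_day1 = sketch_of(f"u{i}" for i in range(5000))
users_day2 = sketch_of(f"u{i}" for i in range(5000))
events_day1 = sketch_of(f"e{i}" for i in range(5000))
events_day2 = sketch_of(f"e{i}" for i in range(5000, 10000))

user_ndv = estimate(merge(users_day1, users_day2))
event_ndv = estimate(merge(events_day1, events_day2))
```

With per-partition sketches available, the merged estimate should land near 5,000 for the overlapping user IDs and near 10,000 for the disjoint event IDs without choosing a strategy per column. How such a state would be serialized into a partition property is left open here.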


arhimondr commented on May 11, 2024

@findepi I'm not sure we should call it a "bug". Taking the MAX NDV was a deliberate decision. Let's change the label to "enhancement".


sopel39 commented on May 11, 2024

I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.

We could store the HLL state in table/partition properties.

