Comments (4)
It depends on the column and on the type of partitions. For daily rolling partitions, NDVs will likely overlap.
Yup. Discussion from the original issue: let's say you partition by date and you have two data columns:
1) userId
2) eventId
Then for 1) you should rather take the max (users come from the same id pool), even though the userId NDV could be approximately the same as the partition row count.
For 2) you should sum, because each eventId is unique.
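To make the two cases concrete, here is a small sketch with made-up data (the partition names, row values, and helper functions are all hypothetical, not anything from Trino). It compares the max and sum of per-partition NDVs against the true table-level NDV for each column:

```python
# Hypothetical daily partitions; each row is (user_id, event_id).
partitions = {
    "2024-01-01": [(1, 100), (2, 101), (3, 102)],
    "2024-01-02": [(1, 103), (2, 104), (4, 105)],
    "2024-01-03": [(2, 106), (3, 107), (4, 108)],
}


def per_partition_ndv(column_index):
    """Exact NDV of one column, computed separately for each partition."""
    return [len({row[column_index] for row in rows}) for rows in partitions.values()]


def true_ndv(column_index):
    """Exact NDV of one column across all partitions."""
    return len({row[column_index] for rows in partitions.values() for row in rows})


user_ndvs = per_partition_ndv(0)
event_ndvs = per_partition_ndv(1)

# userId: values overlap across days, so max is close and sum overshoots.
print("userId  max:", max(user_ndvs), "sum:", sum(user_ndvs), "true:", true_ndv(0))
# eventId: values never repeat, so sum is exact and max undershoots.
print("eventId max:", max(event_ndvs), "sum:", sum(event_ndvs), "true:", true_ndv(1))
```

With this toy data neither rule is right for both columns, which is the point of the example: the better choice depends on how much the partitions overlap.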
from trino.
Should we add or extrapolate NDVs instead?
Currently we choose the maximum NDV. If we decided to sum NDVs, we would need to extrapolate.
It seems that partitions might often be different chunks of data so that NDVs don't overlap.
It depends on the column and on the type of partitions. For daily rolling partitions, NDVs will likely overlap.
Alternatively, we could store the HLL state per column as an auxiliary partition property and calculate the extrapolation based on merged HLLs.
I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.
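A rough sketch of what merging per-partition HLL states could look like. Everything here is an assumption for illustration: the `HyperLogLog` class is a minimal textbook version (not Trino's or Hive's implementation or serialization format), and the `hll.userId` property name and the base64 encoding are made up.

```python
import base64
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog, for illustration only."""

    def __init__(self, p=12, registers=None):
        self.p = p            # number of index bits
        self.m = 1 << p       # number of registers
        self.registers = bytearray(registers) if registers is not None else bytearray(self.m)

    def add(self, value):
        # 64-bit hash taken from SHA-1; any well-mixed hash would do.
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the first 1 bit in the remaining (64 - p) bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # The sketch of a union is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return HyperLogLog(self.p, merged)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

    def to_property(self):
        # Hypothetical encoding of the state as a string-valued property.
        return base64.b64encode(bytes(self.registers)).decode("ascii")

    @classmethod
    def from_property(cls, text, p=12):
        return cls(p, base64.b64decode(text))


def sketch(values):
    hll = HyperLogLog()
    for v in values:
        hll.add(v)
    return hll


# Hypothetical partition properties: one serialized sketch per partition.
# The two partitions share userIds 500..999.
partition_properties = {
    "ds=2024-01-01": {"hll.userId": sketch(range(0, 1000)).to_property()},
    "ds=2024-01-02": {"hll.userId": sketch(range(500, 1500)).to_property()},
}

merged = None
for props in partition_properties.values():
    part = HyperLogLog.from_property(props["hll.userId"])
    merged = part if merged is None else merged.merge(part)

table_ndv_estimate = merged.estimate()
print(round(table_ndv_estimate))
```

The useful property is that the merged sketch is identical to a sketch built over all rows at once, so overlap between partitions is handled without having to choose between max and sum.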
from trino.
@findepi I'm not sure if we should call it a "bug". Taking the MAX NDV was a deliberate decision. Let's change the label to "enhancement".
from trino.
I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.
We could store the HLL state in table/partition properties.
from trino.