Comments (7)

zhongqishang commented on September 22, 2024

> I still think it would be useful for minor compaction to have a threshold based on the pos delete file count (why should eq deletes be any different?)

Minor compaction mainly does two things:

  1. Merge data files smaller than 16 MB
  2. Convert eq delete files to position delete files

So when a 128 MB data file is associated with an eq delete file, the eq delete file will be converted into a position delete file by minor compaction.
If position deletes were rewritten during minor compaction, the data file would be read and written very frequently, which is also very expensive.

In short, position delete files are rewritten, but in the major stage. To reduce the cost of reading and writing, a threshold is applied (self-optimizing.major.trigger.duplicate-ratio); see the sketch below.
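
A minimal sketch of that gate, assuming the trigger compares a data file's associated pos delete record count against its total record count; the method and parameter names are illustrative and this is not Amoro's actual code:

```java
// Hedged sketch of the major-stage trigger described above; names are
// illustrative, not Amoro's real API.
static boolean shouldRewritePosDeletes(long posDeleteRecords,
                                       long dataFileRecords,
                                       double duplicateRatio) {
  // self-optimizing.major.trigger.duplicate-ratio: rewrite a data file and its
  // position deletes only when enough of its rows are masked by pos deletes.
  return posDeleteRecords > dataFileRecords * duplicateRatio;
}
```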


zhongqishang commented on September 22, 2024

@deniskuzZ Thank you for raising the issue.

It is expected that posDeleteFileCount is not included in the isMinorNecessary method.

(Assume the target size is 128 MB and the min target-size ratio is 0.75.)

Minor compaction mainly includes the following two tasks:

  • Merge data files smaller than 16 MB
  • Convert eq delete files associated with data files larger than 16 MB

Major compaction mainly includes the following two tasks:

  • Merge data files between 16 and 96 MB
  • Merge data files larger than 96 MB with their pos delete files to generate new data files (a classification sketch follows this list).
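
A minimal sketch of this bucketing in Java, assuming a 128 MB target size, a 16 MB fragment threshold (1/8 of target), and a 96 MB minimum target size (0.75 ratio); the enum and method are illustrative, not Amoro's actual code:

```java
// Illustrative bucketing of data files by size, per the thresholds above.
enum Bucket { MINOR_MERGE, MAJOR_MERGE, MAJOR_POS_REWRITE, LEAVE_AS_IS }

static Bucket classify(long fileSizeBytes, boolean hasPosDeletes) {
  final long MB = 1024L * 1024L;
  long fragmentThreshold = 16 * MB; // target size 128 MB * fragment ratio 1/8
  long minTargetSize = 96 * MB;     // target size 128 MB * min target ratio 0.75
  if (fileSizeBytes < fragmentThreshold) {
    return Bucket.MINOR_MERGE;       // small files are merged by minor compaction
  } else if (fileSizeBytes < minTargetSize) {
    return Bucket.MAJOR_MERGE;       // 16~96 MB files are merged by major compaction
  } else if (hasPosDeletes) {
    return Bucket.MAJOR_POS_REWRITE; // large files + pos deletes, gated by duplicate-ratio
  }
  return Bucket.LEAVE_AS_IS;
}
```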


deniskuzZ commented on September 22, 2024

@zhongqishang, thanks for checking!
Positional deletes are very expensive for read operations, so many customers with read-heavy use cases migrate to COW mode. Without proper caching on the executor side, the situation is even worse. In Spark Iceberg there is RewritePositionDeleteFilesSparkAction, created specifically for this; a minimal invocation is shown below.
It's common for transactional workloads to generate many small delete files that cause performance penalties unless compacted.
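
For reference, that action is reached through Iceberg's standard SparkActions entry point; this sketch assumes an active SparkSession, and loadIcebergTable is a hypothetical helper for obtaining the table handle:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// Compact the table's position delete files; rewritePositionDeletes is the
// public entry point to RewritePositionDeleteFilesSparkAction.
SparkSession spark = SparkSession.active();
Table table = loadIcebergTable(); // hypothetical helper, not an Iceberg API
SparkActions.get(spark)
    .rewritePositionDeletes(table)
    .execute();
```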


zhongqishang commented on September 22, 2024

@deniskuzZ Yes, I agree with that point.

Amoro provides the ability to continuously merge data files and delete files, including position delete files.

In Amoro, rewriting position delete files is gated by trigger rules. Data files and position delete files will be rewritten under the following two conditions:

  • The number of data files smaller than 16 MB plus the eq delete file count exceeds self-optimizing.minor.trigger.file-count [1]
  • The position delete record count associated with a data file larger than 16 MB is greater than 10% of the data file's record count (self-optimizing.major.trigger.duplicate-ratio) [2]

The trigger rules exist to reduce the write amplification caused by continuous data writing. Perhaps periodically rewriting all files (self-optimizing.full.rewrite-all-files) might help you [3]; an illustrative property setup follows the references below.

[1] : https://github.com/apache/amoro/blob/master/amoro-ams/amoro-ams-server/src/main/java/org/apache/amoro/server/optimizing/plan/CommonPartitionEvaluator.java#L321
[2] : https://github.com/apache/amoro/blob/master/amoro-ams/amoro-ams-server/src/main/java/org/apache/amoro/server/optimizing/plan/CommonPartitionEvaluator.java#L198-L214
[3] : https://amoro.apache.org/docs/0.6.1/configurations/#data-cleaning-configurations
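
If more eager optimizing is desired, the relevant properties can be set on the table; a sketch using Iceberg's UpdateProperties API, with property keys taken from the Amoro docs linked above and purely illustrative values:

```java
// Illustrative tuning; Amoro reads these from the table properties.
// The values are examples, not recommendations.
table.updateProperties()
    .set("self-optimizing.minor.trigger.file-count", "12")    // rule 1 threshold
    .set("self-optimizing.full.trigger.interval", "86400000") // full optimizing once a day (ms)
    .set("self-optimizing.full.rewrite-all-files", "true")    // full optimizing rewrites all files
    .commit();
```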


deniskuzZ commented on September 22, 2024

thanks @zhongqishang!
I still think it would be useful for minor compaction to have a threshold based on the pos delete file count (why should eq deletes be any different?).
self-optimizing.full.rewrite-all-files helps; however, it's more expensive.
I guess I'll close the issue then, if you see no benefit in this.


ayushtkn commented on September 22, 2024

(No skin in the game opinion)
@zhongqishang We have been exploring Amoro to work with Hive 4.0, and what we see when reading Iceberg tables via Hive is that reading positional delete files is quite costly and is one of the major performance bottlenecks when executing queries via Hive.

Do you think it would be worth having this, but behind a config that is off by default, so no existing users are impacted, while anyone with such a use case can opt in by explicitly turning the config on?

WDYT?


zhongqishang commented on September 22, 2024

@ayushtkn
Great to hear that you have been exploring Amoro to work with Hive 4.0.

If I understand correctly, after optimizer compaction is executed, you expect to keep only data files (no pos delete files), right?

I think this is a reasonable scenario, and it also achieves the best read performance.
But if the Iceberg table is written by a Flink streaming writer, you would need to keep rewriting the data files, which causes heavy IO consumption.

BTW, merely adding a trigger based on the pos delete file count would not cause data files and eq delete files to be rewritten.
In the current implementation, only self-optimizing.major.trigger.duplicate-ratio can generate a major task (which rewrites pos delete files).

If possible, could you try setting the self-optimizing.major.trigger.duplicate-ratio property of the table to 0? For example:
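
A minimal sketch via Iceberg's standard table-property API (an ALTER TABLE ... SET TBLPROPERTIES statement would work equally well):

```java
// Setting the ratio to 0 makes any data file with associated pos deletes
// eligible for a major rewrite, so pos deletes are merged away aggressively.
table.updateProperties()
    .set("self-optimizing.major.trigger.duplicate-ratio", "0")
    .commit();
```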
