Comments (7)
> still think that it would be useful for minor compaction to have a threshold based on pos delete files count (why eq deletes should be any different?).

`minor` mainly does two things:
- Merge files smaller than 16m
- Convert eq delete files to position delete files

So when a 128m data file is associated with an eq delete file, the eq delete file will be converted into a position delete through `minor`. If position delete files were also rewritten in `minor`, the file would be read and written very frequently, which is very expensive.

In short, position delete files are rewritten in the `major` stage. To reduce the cost of reading and writing, a threshold is added (`self-optimizing.major.trigger.duplicate-ratio`).
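The duplicate-ratio idea above can be sketched as follows. This is a hypothetical illustration, not the actual Amoro code: the class and method names are mine, and the 10% threshold is only the documented default for `self-optimizing.major.trigger.duplicate-ratio`.

```java
// Hypothetical sketch of the duplicate-ratio trigger: rewrite a data file's
// position deletes only once the deleted-record ratio justifies the IO cost.
public class DuplicateRatioCheck {
    // Assumed default threshold; the real value is table-configurable in Amoro.
    static final double DUPLICATE_RATIO = 0.1;

    /** True when the position-delete records associated with a data file
     *  exceed the configured fraction of the file's record count. */
    static boolean shouldRewritePosDeletes(long dataRecordCount, long posDeleteRecordCount) {
        return dataRecordCount > 0
                && (double) posDeleteRecordCount / dataRecordCount > DUPLICATE_RATIO;
    }

    public static void main(String[] args) {
        // 5% of records deleted: below the 10% threshold, no rewrite yet.
        System.out.println(shouldRewritePosDeletes(1_000_000, 50_000));  // false
        // 15% of records deleted: threshold exceeded, rewrite is triggered.
        System.out.println(shouldRewritePosDeletes(1_000_000, 150_000)); // true
    }
}
```

The point of the threshold is amortization: each rewrite rereads the whole data file, so firing it on every small batch of deletes would multiply write amplification.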
from amoro.
@deniskuzZ Thank you for proposing the issue.
Not including `posDeleteFileCount` in the `isMinorNecessary` method is expected.
(Assume the target size is 128m and the min target ratio is 0.75.)

`Minor` mainly includes the following two:
- Merge data files smaller than 16m
- Convert eq delete files associated with data files larger than 16m

`Major` mainly includes the following two:
- Merge data files in the 16~96m range
- Merge data files larger than 96m with their pos delete files to generate new data files.
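Under the sizes assumed above (128m target, 0.75 min target ratio, so 96m = 128m × 0.75), the split between the two stages can be sketched like this. The class and method names are mine, for illustration only:

```java
// Illustrative sketch (assumed thresholds from the discussion): which
// optimizing stage would pick up a data file of a given size.
public class StageForFile {
    static final long MB = 1024L * 1024L;
    static final long FRAGMENT_LIMIT = 16 * MB;              // "smaller than 16m"
    static final long MIN_TARGET = (long) (128 * MB * 0.75); // 96m

    /** Returns the stage that would merge a data file of the given size. */
    static String stageFor(long fileSizeBytes) {
        if (fileSizeBytes < FRAGMENT_LIMIT) {
            return "minor";  // fragment files are merged by minor
        } else if (fileSizeBytes < MIN_TARGET) {
            return "major";  // 16m..96m files are merged by major
        } else {
            return "none";   // near target size already; only the pos-delete
                             // duplicate-ratio rewrite would touch it
        }
    }

    public static void main(String[] args) {
        System.out.println(stageFor(8 * MB));   // minor
        System.out.println(stageFor(64 * MB));  // major
        System.out.println(stageFor(120 * MB)); // none
    }
}
```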
@zhongqishang, thanks for checking!
Positional deletes are very expensive for read operations, so many customers with read-heavy use cases migrate to COW mode. Without proper caching on the executor side, the situation is even worse. In Spark Iceberg there is `RewritePositionDeleteFilesSparkAction`, created specifically for that.
It's common for transactional workflows to generate multiple small delete files that cause perf penalties unless compacted.
@deniskuzZ Yes, I agree with the point.
Amoro provides the ability to continuously merge data files and delete files, including position delete files.
In Amoro, rewriting position delete files needs trigger rules. Data files and position delete files will be rewritten under the following two conditions:
- The number of data files smaller than 16m plus the eq delete file count exceeds `self-optimizing.minor.trigger.file-count` [1]
- The position delete record count associated with a data file larger than 16m exceeds 10% of the data file's record count (`self-optimizing.major.trigger.duplicate-ratio`) [2]

The trigger rules reduce the rewrite amplification caused by continuous data writing. Maybe periodically rewriting all files (`self-optimizing.full.rewrite-all-files`) might help you [3].

[1]: https://github.com/apache/amoro/blob/master/amoro-ams/amoro-ams-server/src/main/java/org/apache/amoro/server/optimizing/plan/CommonPartitionEvaluator.java#L321
[2]: https://github.com/apache/amoro/blob/master/amoro-ams/amoro-ams-server/src/main/java/org/apache/amoro/server/optimizing/plan/CommonPartitionEvaluator.java#L198-L214
[3]: https://amoro.apache.org/docs/0.6.1/configurations/#data-cleaning-configurations
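Trigger condition [1] can be sketched as a simple count check. This is my own illustration under assumed names, not the code in `CommonPartitionEvaluator`; the threshold value is a hypothetical default for `self-optimizing.minor.trigger.file-count`:

```java
// Hypothetical sketch of the minor trigger: fire once the number of fragment
// data files plus eq-delete files passes the configured file count.
public class MinorTrigger {
    // Assumed default for self-optimizing.minor.trigger.file-count.
    static final int FILE_COUNT_THRESHOLD = 12;

    static boolean isMinorNecessary(int fragmentFileCount, int eqDeleteFileCount) {
        return fragmentFileCount + eqDeleteFileCount > FILE_COUNT_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isMinorNecessary(5, 3));  // false: 8 <= 12
        System.out.println(isMinorNecessary(10, 4)); // true: 14 > 12
    }
}
```

Note that pos delete files do not appear in this count, which is exactly the gap the issue raises.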
thanks @zhongqishang!
still think that it would be useful for minor compaction to have a threshold based on pos delete files count (why eq deletes should be any different?).
`self-optimizing.full.rewrite-all-files` helps; however, it's more expensive.
I guess I'll close the issue then if you see no benefit in this.
(No skin in the game opinion)
@zhongqishang We have been exploring Amoro to work with Hive 4.0, and what we see when reading Iceberg tables via Hive is that reading positional delete files is quite costly and is one of the major performance bottlenecks when executing queries via Hive.
Do you think it would be worth having this, but behind a config turned off by default, so that no existing users are impacted, while anyone with such a use case can opt in by explicitly turning the config on?
WDYT?
@ayushtkn
Great to hear that you have been exploring Amoro to work with Hive 4.0.
If I understand correctly, after optimizer compaction is executed, you expect to keep only the data files (no pos delete files), right?
I think this is a reasonable scenario, and it also achieves the best read performance.
But if the Iceberg table has a Flink streaming writer, you would need to keep rewriting data files, which causes large IO consumption.
BTW, adding only a count trigger for pos delete files will not rewrite data files and eq delete files.
In the current implementation, only `self-optimizing.major.trigger.duplicate-ratio` can generate a Major task (rewriting pos delete files).
If possible, could you try setting the `self-optimizing.major.trigger.duplicate-ratio` property of the table to 0?
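Assuming a Spark SQL session with the catalog configured, setting the table property would look roughly like this (the table name is a placeholder, and the exact syntax should be checked against the Amoro configuration docs):

```sql
-- Hypothetical example: force pos-delete rewrites on every major pass
-- by setting the duplicate-ratio threshold to 0.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'self-optimizing.major.trigger.duplicate-ratio' = '0'
);
```

With a ratio of 0, any data file carrying position deletes qualifies for a major rewrite, at the cost of more frequent IO.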