
Comments (7)

Moelf commented on May 21, 2024

There is some information from TBranch we may be able to use; I could try something later. We may need to move away from @memoize and DIY more.
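For illustration, a minimal sketch of what such a DIY cache could look like using LRUCache.jl; the (branch path, seek offset) key layout and the `_readbasketseek_raw` helper are assumptions for this example, not UnROOT.jl's actual internals.

```julia
# Sketch only: an explicit, byte-bounded LRU cache in place of `@memoize`.
# The (branch path, seek offset) key and `_readbasketseek_raw` are assumed names.
using LRUCache

# Cache decompressed basket payloads, bounded by total bytes rather than entry count.
const BASKET_CACHE = LRU{Tuple{String,Int},Vector{UInt8}}(
    maxsize = 100 * 1024^2,  # ~100 MiB budget (tunable)
    by = sizeof,             # charge each entry by its byte size
)

# Stand-in for a low-level basket reader that returns decompressed bytes.
_readbasketseek_raw(io, branchpath, seek) = error("placeholder for illustration")

# Look up a basket by (branch path, seek offset), reading it on a cache miss.
function cached_basket(io, branchpath::String, seek::Int)
    get!(BASKET_CACHE, (branchpath, seek)) do
        _readbasketseek_raw(io, branchpath, seek)
    end
end
```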


Moelf commented on May 21, 2024

Refine the cache size calculation in the LRU cache, as it might not be accurate for nested objects.

A complicated sizing algorithm will slow this down by a lot (tested). We could simply have no cache by default (other than the mandatory latest-basket cache) and let users turn caching on if they're doing interactive work with files comparable in size to RAM?

I know how to achieve specific objectives, but I don't know what we want, because as it is now it's unlikely to overrun users' memory, even though the size accounting is technically not 100% accurate.

Example objectives: run over these files in a single loop, keeping peak memory below XX; run over these files in parallel, keeping peak memory below XX.
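As a sketch of how the opt-in behaviour and the nested-object sizing could be exposed (the `enable_cache!`/`disable_cache!` names are hypothetical, not UnROOT.jl's API):

```julia
# Sketch only (not UnROOT.jl's API): nested-object-aware sizing via
# Base.summarysize, plus an opt-in switch so the default stays "latest basket only".
using LRUCache

# Base.summarysize walks nested objects (vectors of vectors, structs, ...),
# so it is a more accurate `by` than sizeof when cached values aren't flat byte buffers.
make_basket_cache(maxbytes::Int) =
    LRU{Tuple{String,Int},Any}(maxsize = maxbytes, by = Base.summarysize)

const GLOBAL_BASKET_CACHE = Ref{Union{Nothing,LRU}}(nothing)  # caching off by default

"Opt in to cross-basket caching, capping total cached bytes at `maxbytes`."
enable_cache!(maxbytes::Int = 2 * 1024^3) =
    (GLOBAL_BASKET_CACHE[] = make_basket_cache(maxbytes); nothing)

"Turn caching back off; only the mandatory latest-basket cache remains."
disable_cache!() = (GLOBAL_BASKET_CACHE[] = nothing)
```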


tamasgal commented on May 21, 2024

Yes, I fully agree with your points.

I think that some utility functions, such as reporting the current estimated cache size and clearing the cache, are definitely something we need.

Regarding the default settings, I think that caching should be part of them.
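For reference, a minimal sketch of such utilities on top of an LRUCache.jl cache; the function names here are hypothetical, not an existing UnROOT.jl API.

```julia
# Sketch only: utility helpers for an LRUCache.jl-based basket cache.
using LRUCache

"Estimated bytes currently held by `cache` (walks nested objects)."
cachesize(cache::LRU) = Base.summarysize(cache)

"Drop every cached entry, handing the memory back to the GC."
clearcache!(cache::LRU) = empty!(cache)
```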


grasph commented on May 21, 2024

Hello @Moelf and @tamasgal, I confirm that reducing the LRU cache size has solved the memory issue I had.

Using a custom cache instead of @memoize sounds like a good idea. Why do you need to cache more than the latest-read basket?


Moelf commented on May 21, 2024

Two factors:

  1. When a user is working in an interactive environment, they get a much better experience, especially if the file lives on an HDD or even on /eos, where the read speed is rather unreliable.
  2. On top of 1., it doesn't slow things down much even when it's "useless" (i.e. the total data read in each loop is bigger than the cache size).

Basically, I aimed for something that automatically makes "quick check" type work smooth while posing no danger in full-scale usage, even if, looking at memory usage, it may "appear" to be wasteful[1].

[1]: still less wasteful than submitting O(100) jobs and downloading that many .root files to each worker machine (different files every time, so a high cache-miss rate) many times, right?


grasph commented on May 21, 2024

For the interactive environment, do you mean the case of an event loop that will be repeated several times? In the dataframe case, the basket cache will be useless, as the data will already be stored in the dataframe/typedtable. Am I correct?

Beware that implementing a cache can be counterproductive. On Linux, the kernel uses free memory for its I/O buffers, so the memory used by the cache is taken away from what is available for the kernel's I/O buffering. On the other hand, data retrieved through I/O needs to be decompressed, while your own cache can hold already-decompressed data.

Final remark: for the interactive use case, you might prefer to cache the first N events instead of the last-read ones. This will give good responsiveness when testing code on a limited number of events, even after having made a run over the full sample.
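A minimal sketch of that "keep the first N events" idea, assuming basket values are flat byte buffers and using illustrative names (this is not existing UnROOT.jl code):

```julia
# Sketch only: a cache that pins the earliest baskets read (up to a byte budget)
# and never evicts them; later baskets are simply not retained here.
struct FirstNCache{K,V}
    pinned::Dict{K,V}
    maxbytes::Int
end
FirstNCache{K,V}(maxbytes::Int) where {K,V} = FirstNCache{K,V}(Dict{K,V}(), maxbytes)

# Bytes currently pinned; `sizeof` is fine here because values are flat byte buffers.
current_bytes(c::FirstNCache) = sum(sizeof, values(c.pinned); init = 0)

"Return the value for `key`, computing it with `f()`; pin it while the budget allows."
function get_pinned!(f, c::FirstNCache{K,V}, key::K) where {K,V}
    haskey(c.pinned, key) && return c.pinned[key]
    val = f()
    if current_bytes(c) + sizeof(val) <= c.maxbytes
        c.pinned[key] = val   # still within the "first N" budget: keep it for good
    end
    return val
end
```

Keyed by (branch path, basket index) with decompressed basket bytes as values, this keeps the start of the file hot for repeated short test runs while leaving a full-sample pass unaffected.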


Moelf commented on May 21, 2024

as the data will already be stored in the dataframe/typedtable.

No, both the DataFrame and our internal TypedTable are lazy; they only know each column's type and eltype. So this cache will benefit everything, because everything goes through getindex(::LazyBranch, ...).

while your own cache can hold already-decompressed data.

We are caching uncompressed data; that's what readbasketseek returns.

Final remark: for the interactive use case, you might prefer to cache the first N events instead of the last-read ones.

This is very true and we were talking about it; I will update this when I look at the cache very soon.
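To illustrate the point about getindex(::LazyBranch, ...), a simplified sketch of how an event-index lookup can route through a basket cache; the struct, field, and function names are invented for this example and are not UnROOT.jl's actual implementation.

```julia
# Sketch only: map an event index to its basket, fetch the decompressed basket
# from the cache (reading it on a miss), then index into it.
struct MyLazyBranch{T}
    entry_starts::Vector{Int}   # first event number stored in each basket (sorted)
    read_basket::Function       # ib -> Vector{T}: decompressed contents of basket ib
    cache::Dict{Int,Vector{T}}  # stand-in for the real (LRU) basket cache
end

function Base.getindex(b::MyLazyBranch{T}, i::Int) where {T}
    ib = searchsortedlast(b.entry_starts, i)              # basket containing event i
    basket = get!(() -> b.read_basket(ib), b.cache, ib)   # cache hit avoids the read
    return basket[i - b.entry_starts[ib] + 1]             # local index inside the basket
end

# Example: 3 baskets of 100 events each, read lazily and cached on first touch.
b = MyLazyBranch{Float64}([1, 101, 201],
                          ib -> fill(Float64(ib), 100),
                          Dict{Int,Vector{Float64}}())
b[150]   # reads basket 2 on the first access; later accesses hit `cache`
```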

