
Comments (7)

Moelf commented on May 21, 2024

There is some information from TBranch we may be able to use; I could try something later. We may need to move away from @memoize and DIY more.
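For illustration, a minimal sketch of what such a DIY cache could look like using LRUCache.jl; the (branch path, seek offset) key layout and the `_readbasketseek_raw` helper are assumptions for this example, not UnROOT.jl's actual internals.

```julia
# Sketch only: an explicit, byte-bounded LRU cache in place of `@memoize`.
# The (branch path, seek offset) key and `_readbasketseek_raw` are assumed names.
using LRUCache

# Cache decompressed basket payloads, bounded by total bytes rather than entry count.
const BASKET_CACHE = LRU{Tuple{String,Int},Vector{UInt8}}(
    maxsize = 100 * 1024^2,  # ~100 MiB budget (tunable)
    by = sizeof,             # charge each entry by its byte size
)

# Stand-in for a low-level basket reader that returns decompressed bytes.
_readbasketseek_raw(io, branchpath, seek) = error("placeholder for illustration")

# Look up a basket by (branch path, seek offset), reading it on a cache miss.
function cached_basket(io, branchpath::String, seek::Int)
    get!(BASKET_CACHE, (branchpath, seek)) do
        _readbasketseek_raw(io, branchpath, seek)
    end
end
```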


Moelf commented on May 21, 2024

Refine the cache size calculation in the LRU cache, as it might not be accurate for nested objects.

A complicated sizing algorithm will slow this down by a lot (tested). We could simply have no cache by default (other than the mandatory latest-basket cache) and let users turn caching on if they're doing interactive work with files comparable in size to RAM?

I know how to achieve specific objectives, but I don't know what we want, because as it is now it's unlikely to overrun users' memory, even though the size accounting is technically not 100% accurate.

Example objectives: run over these files in a single loop, keeping peak memory below XX; run over these files in parallel, keeping peak memory below XX.
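As a sketch of how the opt-in behaviour and the nested-object sizing could be exposed (the `enable_cache!`/`disable_cache!` names are hypothetical, not UnROOT.jl's API):

```julia
# Sketch only (not UnROOT.jl's API): nested-object-aware sizing via
# Base.summarysize, plus an opt-in switch so the default stays "latest basket only".
using LRUCache

# Base.summarysize walks nested objects (vectors of vectors, structs, ...),
# so it is a more accurate `by` than sizeof when cached values aren't flat byte buffers.
make_basket_cache(maxbytes::Int) =
    LRU{Tuple{String,Int},Any}(maxsize = maxbytes, by = Base.summarysize)

const GLOBAL_BASKET_CACHE = Ref{Union{Nothing,LRU}}(nothing)  # caching off by default

"Opt in to cross-basket caching, capping total cached bytes at `maxbytes`."
enable_cache!(maxbytes::Int = 2 * 1024^3) =
    (GLOBAL_BASKET_CACHE[] = make_basket_cache(maxbytes); nothing)

"Turn caching back off; only the mandatory latest-basket cache remains."
disable_cache!() = (GLOBAL_BASKET_CACHE[] = nothing)
```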


tamasgal commented on May 21, 2024

Yes, I fully agree with your points.

I think that some utility functions, such as reporting the current estimated cache size and clearing the cache, are definitely something we need.

Regarding the default settings, I think that caching should be part of them.
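For reference, a minimal sketch of such utilities on top of an LRUCache.jl cache; the function names here are hypothetical, not an existing UnROOT.jl API.

```julia
# Sketch only: utility helpers for an LRUCache.jl-based basket cache.
using LRUCache

"Estimated bytes currently held by `cache` (walks nested objects)."
cachesize(cache::LRU) = Base.summarysize(cache)

"Drop every cached entry, handing the memory back to the GC."
clearcache!(cache::LRU) = empty!(cache)
```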


grasph commented on May 21, 2024

Hello @Moelf and @tamasgal, I confirm that reducing the LRU cache size has solved the memory issue I had.

Using a custom cache instead of @memoize sounds like a good idea. Why do you need to cache more than the latest-read basket?


Moelf commented on May 21, 2024

Two factors:

  1. When a user is working in an interactive environment, they get a much better experience, especially if the file lives on an HDD or even on /eos, where the read speed is rather unreliable.
  2. On top of 1., it doesn't slow things down much even when it's "useless" (i.e. the total data read in each loop is bigger than the cache size).

Basically, I aimed for something that automatically makes "quick check" type work smooth while posing no danger in full-scale usage, even if, looking at memory usage, it may "appear" to be wasteful[1].

[1]: still less wasteful than submitting O(100) jobs and downloading that many .root files to each worker machine (different files every time, so a high cache-miss rate) many times, right?


grasph commented on May 21, 2024

For the interactive environment, do you mean the case of an event loop that will be repeated several times? In the dataframe case, the basket cache will be useless, as the data will already be stored in the dataframe/typedtable. Am I correct?

Beware that implementing a cache can be counterproductive. On Linux, the kernel uses free memory for its I/O buffers, so the memory used by the cache is taken away from what is available for the kernel's I/O buffering. On the other hand, data retrieved through I/O needs to be decompressed, while your own cache can hold already-decompressed data.

Final remark: for the interactive use case, you might prefer to cache the first N events instead of the last-read ones. This will give good responsiveness when testing code on a limited number of events, even after having made a run over the full sample.
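A minimal sketch of that "keep the first N events" idea, assuming basket values are flat byte buffers and using illustrative names (this is not existing UnROOT.jl code):

```julia
# Sketch only: a cache that pins the earliest baskets read (up to a byte budget)
# and never evicts them; later baskets are simply not retained here.
struct FirstNCache{K,V}
    pinned::Dict{K,V}
    maxbytes::Int
end
FirstNCache{K,V}(maxbytes::Int) where {K,V} = FirstNCache{K,V}(Dict{K,V}(), maxbytes)

# Bytes currently pinned; `sizeof` is fine here because values are flat byte buffers.
current_bytes(c::FirstNCache) = sum(sizeof, values(c.pinned); init = 0)

"Return the value for `key`, computing it with `f()`; pin it while the budget allows."
function get_pinned!(f, c::FirstNCache{K,V}, key::K) where {K,V}
    haskey(c.pinned, key) && return c.pinned[key]
    val = f()
    if current_bytes(c) + sizeof(val) <= c.maxbytes
        c.pinned[key] = val   # still within the "first N" budget: keep it for good
    end
    return val
end
```

Keyed by (branch path, basket index) with decompressed basket bytes as values, this keeps the start of the file hot for repeated short test runs while leaving a full-sample pass unaffected.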


Moelf commented on May 21, 2024

as the data will already be stored in the dataframe/typedtable.

No, both the DataFrame and our internal TypedTable are lazy; they only know each column's type and eltype. So this cache will benefit everything, because everything goes through getindex(::LazyBranch, ...).

while your own cache can hold already-decompressed data.

We are caching uncompressed data; that's what readbasketseek returns.

Final remark: for the interactive use case, you might prefer to cache the first N events instead of the last-read ones.

This is very true and we were talking about it; I will update this when I look at the cache very soon.
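To illustrate the point about getindex(::LazyBranch, ...), a simplified sketch of how an event-index lookup can route through a basket cache; the struct, field, and function names are invented for this example and are not UnROOT.jl's actual implementation.

```julia
# Sketch only: map an event index to its basket, fetch the decompressed basket
# from the cache (reading it on a miss), then index into it.
struct MyLazyBranch{T}
    entry_starts::Vector{Int}   # first event number stored in each basket (sorted)
    read_basket::Function       # ib -> Vector{T}: decompressed contents of basket ib
    cache::Dict{Int,Vector{T}}  # stand-in for the real (LRU) basket cache
end

function Base.getindex(b::MyLazyBranch{T}, i::Int) where {T}
    ib = searchsortedlast(b.entry_starts, i)              # basket containing event i
    basket = get!(() -> b.read_basket(ib), b.cache, ib)   # cache hit avoids the read
    return basket[i - b.entry_starts[ib] + 1]             # local index inside the basket
end

# Example: 3 baskets of 100 events each, read lazily and cached on first touch.
b = MyLazyBranch{Float64}([1, 101, 201],
                          ib -> fill(Float64(ib), 100),
                          Dict{Int,Vector{Float64}}())
b[150]   # reads basket 2 on the first access; later accesses hit `cache`
```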

