Comments (7)
There is some information from TBranch we may be able to use; I could try something later. We may need to move away from @memoize and DIY more.
from unroot.jl.
> refine the cache size calculation in LRU as it might be not accurate for nested objects.
A more complicated algorithm will slow this down by a lot (tested). We could simply have no cache (other than the mandatory latest-basket cache) by default, and let users turn caching on if they're doing interactive work with files of roughly RAM size?
I know how to achieve certain objectives but I don't know what we want, because as it is now, it's unlikely to overrun users' memory, even though it's technically not 100% accurate.
Example objectives: run over these files in a single loop, limit peak memory to below XX; run over these files in parallel, limit peak memory to below XX.
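To illustrate the nested-object point, here is a minimal sketch in plain Julia (it assumes nothing about UnROOT.jl internals, and the jagged array is made-up stand-in data). It compares sizeof, which only counts the outer array of references, with Base.summarysize, which walks the whole object:

```julia
# Stand-in for jagged branch data: a vector of small vectors.
nested = [rand(Float32, 100) for _ in 1:1000]

shallow = sizeof(nested)            # only the outer array of references
deep    = Base.summarysize(nested)  # traverses inner arrays as well

println("sizeof:      ", shallow, " bytes")
println("summarysize: ", deep, " bytes")
```

The traversal is also why an accurate estimate costs more: it has to visit every inner object instead of reading a single length field.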
from unroot.jl.
Yes, I fully agree with your points.
I think that some utility functions, like the current estimated cache size and cache clearing, are definitely something we need.
Regarding the default settings, I think that caching should be part of them.
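As a rough idea of what such utilities could look like, here is a self-contained sketch of a byte-budgeted LRU cache. All names in it (BasketCache, cached!, cachesize) are hypothetical and not part of UnROOT.jl; the size estimate uses Base.summarysize, and eviction is a plain linear scan, which is fine for a sketch but not tuned for speed:

```julia
# Hypothetical byte-budgeted LRU cache; none of these names exist in UnROOT.jl.
mutable struct BasketCache{K,V}
    data::Dict{K,V}
    order::Vector{K}   # least recently used key first
    maxbytes::Int
    bytes::Int         # running estimate of the cached payload size
end

BasketCache{K,V}(maxbytes::Integer) where {K,V} =
    BasketCache{K,V}(Dict{K,V}(), K[], maxbytes, 0)

# Utility 1: current estimated cache size in bytes.
cachesize(c::BasketCache) = c.bytes

# Utility 2: drop everything that is cached.
function Base.empty!(c::BasketCache)
    empty!(c.data)
    empty!(c.order)
    c.bytes = 0
    return c
end

# Return the cached value for `key`, calling `loader()` on a miss.
function cached!(loader, c::BasketCache, key)
    if haskey(c.data, key)
        filter!(!=(key), c.order)   # move key to the most recent position
        push!(c.order, key)
        return c.data[key]
    end
    value = loader()
    nbytes = Base.summarysize(value)
    nbytes > c.maxbytes && return value   # too large to cache at all
    # Evict least recently used entries until the new value fits.
    while !isempty(c.order) && c.bytes + nbytes > c.maxbytes
        old = popfirst!(c.order)
        c.bytes -= Base.summarysize(c.data[old])
        delete!(c.data, old)
    end
    c.data[key] = value
    push!(c.order, key)
    c.bytes += nbytes
    return value
end

# Usage: a 10 kB budget holds two ~4 kB "baskets" but not three.
cache = BasketCache{Int,Vector{Float64}}(10_000)
for i in 1:3
    cached!(() -> zeros(500), cache, i)
end
println("estimated size: ", cachesize(cache), " bytes")
empty!(cache)
```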
from unroot.jl.
Hello @Moelf and @tamasgal, I confirm that reducing the LRU cache has solved the memory issue I had.
Using a custom cache instead of @memoize sounds like a good idea. Why do you need to cache more than the latest read Basket?
from unroot.jl.
Two factors:
1. When users work in an interactive environment, they get a much better experience, especially if the file lives on an HDD or even /eos, where the speed is rather unreliable.
2. On top of 1., it doesn't slow things down much even if it's "useless" (i.e. the total size in each loop is bigger than the cache size).
Basically I aimed for something that automatically makes "quick check" type work smooth while posing no danger in full-scale usage, although if you look at memory usage it may "appear" to be wasteful[1].
[1]: still less wasteful than submitting O(100) jobs and downloading that many .root files to each worker machine (different every time, high cache miss rate) many times, right?
from unroot.jl.
For the interactive environment, do you mean the case where an event loop is repeated several times? In the case of a dataframe, the basket cache will be useless, as data will already be stored in the dataframe/typedtable. Am I correct?
Beware that implementing a cache can be counterproductive. On Linux, the kernel uses free memory for its I/O buffer, so the memory used by your cache is taken from what is available to the kernel for I/O buffering. On the other hand, data retrieved from I/O needs to be decompressed, while in your own cache you can store uncompressed data.
Final remark, for the interactive use case, you might prefer to cache the first N events instead of the last read ones. This will give good responsiveness when testing code on a limited number of events, even after having made a run on the full sample.
from unroot.jl.
> as data will already be stored in the dataframe/typedtable.
No, both DataFrame and our internal TypedTable are lazy; they only know each column's type and eltype. So this cache benefits everything, because everything goes through getindex(::LazyBranch, ...).
> while in your own cache you can store uncompressed data.
We are caching uncompressed data; that's what readbasketseek returns.
> Final remark, for the interactive use case, you might prefer to cache the first N events instead of the last read ones.
This is very true and we were talking about this; I will update this when I look at the cache very soon.
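One possible way to frame the "first N events" idea, as a sketch only: given the first entry number of each basket (a hypothetical, 1-based, sorted vector here; how it would be obtained from a branch is left open), work out which baskets cover entries 1 to N so that a cache could exempt them from eviction. The function name is made up for illustration:

```julia
# basket_starts[i] is the first entry held by basket i (1-based, sorted).
# Returns the range of basket indices needed to cover entries 1:n.
function baskets_to_pin(basket_starts::AbstractVector{<:Integer}, n::Integer)
    last_basket = searchsortedlast(basket_starts, n)
    return 1:last_basket
end

starts = [1, 101, 201, 301]          # four baskets of 100 entries each
println(baskets_to_pin(starts, 150)) # baskets covering the first 150 entries
```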
from unroot.jl.