Comments (14)
Thanks. It's returning a sparse array, where the simple version without `dims` just converts to a dense array.
Turns out SparseArrays defines its own version of `reducedim_initarray` that calls `Array` instead of `similar`, for the speed improvements you have pointed out:
https://github.com/JuliaLang/julia/blob/34d2b87b65b1643b1055b10aa5ea7d2bdbcf6cd2/stdlib/SparseArrays/src/sparsematrix.jl#L1698
But we are calling `mean` on a `DimensionalArray` wrapper that is running `similar`, which in turn calls `similar` on the parent array. I guess these methods should actually call `mean` on the sparse array directly; the problem is that `similar` handles dimension rebuilding for us.
I'm not sure what the cleanest solution is; we might need to add a method for `reducedim_initarray(A::DimensionalArray{T,N,SparseArray})` to override the base case. Nothing else in Base has a method for that, so it's a one-off, but I hope nothing in the ecosystem does either...
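For illustration, a hedged sketch of what such an override might look like. The names `data`, `dims`, `rebuild`, and `reducedims` here are assumptions about the package's internals, not its actual API, and the `reducedim_initarray` signature is Base's internal one, which could change:

```julia
using SparseArrays

# Hypothetical override: when the wrapped parent is sparse, build the
# reduction's init array the way SparseArrays does (a dense Array),
# then re-wrap it with the reduced dimensions.
function Base.reducedim_initarray(A::DimensionalArray{T,N,<:AbstractSparseArray},
                                  region, init, ::Type{R}) where {T,N,R}
    # Delegate to the sparse-specialised method on the parent, which
    # allocates a dense Array instead of calling `similar` on the wrapper.
    parentinit = Base.reducedim_initarray(data(A), region, init, R)
    # `reducedims` is a hypothetical helper that shrinks the reduced
    # dims to length 1.
    rebuild(A, parentinit, reducedims(dims(A), region))
end
```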
I'm not sure the approach I've taken to `mean` etc. and dim rebuilding makes sense in the long term. It was an experiment that worked surprisingly well for simple things, but might break due to my lack of knowledge of Base methods.
from dimensionaldata.jl.
By the way, `["var$i" for i in 1:10000]` isn't a very efficient lookup (i.e. string comparisons); why not just use a range?
Thanks for the quick response, and the package! It's really nice to see progress being made on labelled arrays in Julia.
I don't think I've been using Julia heavily enough recently to provide much insight for the implementation. That said, I was expecting this to look much more like `DimensionalArray(mean(a.data, dimnum(a, dims)), #= calculated output dimensions =#)`. I think this would make sense for a number of cases, especially since there would be some space to deal with removing labels for the reduced dimension (e.g. if you reduce over latitude, I wouldn't expect any latitude label in the output).
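A sketch of that expected pattern (here `data`, `dimnum`, and the `reduced_dims` helper are assumptions about the package API, and Base reductions take the region as a `dims` keyword):

```julia
using Statistics

# Hypothetical: reduce on the parent array directly, so that any
# parent-specialised method (e.g. for sparse arrays) gets dispatched,
# then reattach the recalculated output dimensions.
function Statistics.mean(a::DimensionalArray; dims)
    newdata = mean(data(a); dims=dimnum(a, dims))     # parent-array reduction
    DimensionalArray(newdata, reduced_dims(a, dims))  # calculated output dims
end
```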
Are you sure the reduction is causing the performance issue? I agree it explains the difference in output type, but there are similar performance issues with other operations, like `copy` or broadcasting scalar functions (`abs.`):
`copy` benchmark
```
julia> @benchmark copy($sparsear)
BenchmarkTools.Trial:
  memory estimate:  152.64 MiB
  allocs estimate:  7
  --------------
  minimum time:     29.745 ms (2.43% GC)
  median time:      40.765 ms (30.34% GC)
  mean time:        41.141 ms (31.30% GC)
  maximum time:     91.379 ms (68.68% GC)
  --------------
  samples:          122
  evals/sample:     1

julia> @benchmark copy($sparsed)
BenchmarkTools.Trial:
  memory estimate:  152.64 MiB
  allocs estimate:  8
  --------------
  minimum time:     3.843 s (0.02% GC)
  median time:      3.849 s (0.20% GC)
  mean time:        3.849 s (0.20% GC)
  maximum time:     3.855 s (0.38% GC)
  --------------
  samples:          2
  evals/sample:     1
```
`abs.` benchmark
```
julia> @benchmark abs.(sparsear)
BenchmarkTools.Trial:
  memory estimate:  152.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     41.412 ms (1.47% GC)
  median time:      52.996 ms (25.29% GC)
  mean time:        53.401 ms (25.88% GC)
  maximum time:     110.823 ms (65.40% GC)
  --------------
  samples:          94
  evals/sample:     1

julia> @benchmark abs.(sparsed)
BenchmarkTools.Trial:
  memory estimate:  763.09 MiB
  allocs estimate:  13
  --------------
  minimum time:     1.969 s (0.49% GC)
  median time:      2.025 s (3.01% GC)
  mean time:        2.028 s (2.98% GC)
  maximum time:     2.090 s (5.29% GC)
  --------------
  samples:          3
  evals/sample:     1
```
To me, the memory usage suggests an intermediate dense array is being created.
> By the way `["var$i" for i in 1:10000]` isn't a very efficient lookup (ie string comparisons), why not just use a range?
For my use case, the observations are barcodes for cells and the variables are gene identifiers. Access via string names is important for the user interface, though any heavy computation should be accessing the values by position.
I think I was expecting xarray/pandas-like behavior, where an index (probably a hash table) would be used for string-labelled dimensions.
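The hash-table idea is easy to sketch independently of any package: build a `Dict` from label to position once, and each lookup stops being a linear scan of string comparisons:

```julia
labels = ["var$i" for i in 1:10_000]

# One-off O(n) index build; afterwards each lookup is O(1) expected,
# versus an O(n) scan with a string comparison per element.
index = Dict(label => i for (i, label) in enumerate(labels))

index["var9999"]  # -> 9999
```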
The sparse array created in `similar()` is very likely to be causing the performance issue for `mean`; that's why Base avoids it. But there may be more issues with `copy` etc., for sure. I haven't even thought about sparse arrays before this conversation because I don't use them. There might need to be some architectural changes.
The dims do actually change with `mean`, but Julia arrays don't drop the dimension, just reduce it to size 1. So the lat dim has to stay, with a length of 1. What will eventually happen is there will be a cell-size field on the dim that shows that the single index now covers the whole original latitude span in one cell, which is good information to keep (and will show on the plot).
The lookup is very rudimentary at the moment. It will be fast enough with numbers. If you want fast hashes, AcceleratedArrays.jl should work inside dims, and trying it would be interesting. But I haven't worked on that yet. Xarray and Pandas have developer teams and this was just me hacking away for a week lol!! I just need GeoData.jl for the modelling packages that I'm actually paid to write and others I do research with, and DimensionalData.jl is a bonus.
It won't be practically comparable for a few years, and only then if enough people jump in and help (or the same happens with an alternative package).
Thanks for recommending AcceleratedArrays.jl! I've been wondering if a package like that existed for a while now.
> Xarray and Pandas have developer teams and this was just me hacking away for a week lol!!
For sure! I'm definitely not expecting a single-developer v0.1.1 package to be totally done.
> The dims do actually change with `mean`, but julia arrays don't drop the dimension, just reduce it to size 1. So the lat dim has to stay, with a length of 1.
Definitely agree that the dimension should stay. The main thing was that a single label on that dimension stayed, which you've got a plan for.
> The sparse array created in `similar()` is very likely to be causing the performance issue for `mean`.
I did a little more digging into what's happening in `mean`, and I think there is a small effect from using `similar`, but there's a more general issue of dispatch. Swapping out the call to `reducedim_initarray` with a precomputed value shows very similar result times:
Benchmarks
```
julia> ddinit = Base.reducedim_init(t -> t/2, +, sparsed, 1);

julia> sainit = Base.reducedim_init(t -> t/2, +, sparsear, 1);

julia> @benchmark Statistics.mean!($ddinit, $sparsed)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.666 s (0.00% GC)
  median time:      1.673 s (0.00% GC)
  mean time:        1.674 s (0.00% GC)
  maximum time:     1.685 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark Statistics.mean!($sainit, $sparsed)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.677 s (0.00% GC)
  median time:      1.677 s (0.00% GC)
  mean time:        1.696 s (0.00% GC)
  maximum time:     1.734 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1
```
It looks like the big change in time is actually coming from the `sum!` call inside `mean!`, and that time is independent of the preallocated output array.
Benchmarks
```
julia> @benchmark sum!($sainit, $sparsed)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.672 s (0.00% GC)
  median time:      1.675 s (0.00% GC)
  mean time:        1.680 s (0.00% GC)
  maximum time:     1.692 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark sum!($ddinit, $sparsear)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.222 ms (0.00% GC)
  median time:      4.337 ms (0.00% GC)
  mean time:        4.365 ms (0.00% GC)
  maximum time:     7.741 ms (0.00% GC)
  --------------
  samples:          1144
  evals/sample:     1
```
The main time discrepancy ends up coming from different methods being called by `mapreduce`. The one-argument `sum` (which sees a performance difference and was easier to trace) calls `Base._mapreduce(f, op, ::Base.IndexCartesian, A::SparseMatrixCSC{T})` for the sparse array, while `mapfoldl(f, op, A)` ends up being called on the DimensionalArray. When a dimension is provided, `sum` calls `_mapreducedim!(f, op, R::AbstractArray, A::AbstractArray)` on the DimensionalArray, while `_mapreducedim!(f, op, R::AbstractArray, A::SparseMatrixCSC{T})` is called on the sparse array.
The reason I say this is a more general issue of dispatch is that methods specialized on the wrapped array aren't being called. I imagine similar issues will crop up with other array types, since there isn't really a step that allows dispatch based on the array holding the data. This does seem tough to solve in an elegant way.
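One way to see the divergence directly is `@which`, using the same variables as in the benchmarks above (`sparsear` the bare sparse matrix, `sparsed` the wrapped one):

```julia
using SparseArrays, Statistics, InteractiveUtils

# Each call prints the method actually selected by dispatch: the bare
# sparse matrix hits the SparseArrays-specialised method, while the
# wrapper falls through to the generic AbstractArray fallback.
@which sum(sparsear)
@which sum(sparsed)
```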
Btw, AxisArrays seems to have the same performance issue 😉.
Thanks for digging into that. That makes sense. It's specialised methods being dispatched on the SparseArray vs general methods on AbstractDimensionalArray. Fixing this will be a design goal when I rewrite the math/stats methods. But I feel better if AxisArrays does it too lol
Let me know how you go using AcceleratedArrays, and do put in an issue if any changes need to happen here for it to work.
This should be fixed by fda792b
Thanks for that! It fixes some cases, but others, like `mean(sparsed)` and `copy(sparsed)`, still dispatch to an inefficient AbstractArray method.
Without using something like Cassette, can you think of any way this could be done less manually? Maybe just a public interface for "registering" a function with DimensionalData could make this easier?
BTW, I gave AcceleratedArrays a shot, but didn't see a speedup. I think this is due to how `DimensionalData.at` is written. It looks simple enough to fix, but I think it might be worth waiting for the next release of AcceleratedArrays since the API is meant to change.
Really?
https://travis-ci.org/rafaqz/DimensionalData.jl/jobs/597286088#L312
https://github.com/rafaqz/DimensionalData.jl/blob/master/test/benchmarks.jl#L110
It should be even closer using `Var()` instead of `Var` (which isn't type stable).
Also, thanks for checking out AcceleratedArrays. Yeah, `At` needs work; `Near` is at least using binary search already. Once the API changes I'll have a look, although I'm concerned about depending on AcceleratedArrays; maybe a glue package will be required if it can't be generic.
Ok, `copy` still needs a method, but `mean` should be fine now.
Edit: Copy should also be fixed in #13.
Let me know if any other methods give you trouble. I guess this process has to be piecemeal, as it's hard to know where generic methods aren't good enough, and I don't want bloat for no reason, or have time to test everything.
Benchmarks ok
https://travis-ci.org/rafaqz/DimensionalData.jl/jobs/597477387#L313
> Ok, `copy` still needs a method, but `mean` should be fine now.
Oh, I meant the single-argument version of `mean`. For the same variables defined above, using f9eeb37:
Benchmark
```
julia> @benchmark mean($sparsed)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.614 s (0.00% GC)
  median time:      1.623 s (0.00% GC)
  mean time:        1.623 s (0.00% GC)
  maximum time:     1.633 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark mean($sparsear)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  4
  --------------
  minimum time:     4.018 ms (0.00% GC)
  median time:      4.112 ms (0.00% GC)
  mean time:        4.185 ms (0.00% GC)
  maximum time:     9.738 ms (0.00% GC)
  --------------
  samples:          1193
  evals/sample:     1
```
I think it just needs something like `Statistics.mean(A::DimensionalArray) = mean(A.data)`. I think other reductions could use this as well; at least `sum` has similar issues.
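That one-liner, extended to a couple of the other single-argument reductions, might look like this (assuming `A.data` is the parent-array field, as used earlier in the thread; these are sketches, not the package's actual methods):

```julia
using Statistics

# Forward whole-array reductions straight to the parent so that
# sparse-specialised methods get dispatched; a full reduction drops
# all dims anyway, so no dimension rebuilding is needed.
Statistics.mean(A::DimensionalArray) = mean(A.data)
Base.sum(A::DimensionalArray) = sum(A.data)
Base.prod(A::DimensionalArray) = prod(A.data)
```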
> It should be even closer using `Var()` instead of `Var` (which isn't type stable).
Huh. Why isn't `Var` type-stable? I'm not sure I see how an instance would be, but the type wouldn't be.
Oh right, without dims. Those are just missing methods. There are probably 50 of them; let's make a list... Obviously everything I already have dims methods for.
`Var` can have issues if it's put in a tuple. UnionAll dispatch or something. I should be clearer on why... It might not apply here, but often does in DD.
Ok I've added single argument methods for everything that already had dims methods to that pull request.
Thanks for the input, keep them coming :)