Comments (14)

rafaqz commented on May 27, 2024

Thanks. It's returning a sparse array, whereas the simple version without dims just converts to a dense array.

It turns out SparseArrays defines its own version of reducedim_initarray that calls Array instead of similar, for exactly the speed reasons you have pointed out:
https://github.com/JuliaLang/julia/blob/34d2b87b65b1643b1055b10aa5ea7d2bdbcf6cd2/stdlib/SparseArrays/src/sparsematrix.jl#L1698

But we are calling mean on a DimensionalArray wrapper, which runs similar and in turn calls similar on the parent array. I guess these methods should actually call mean on the sparse array directly - the problem is that similar handles dimension rebuilding for us.

I'm not sure what the cleanest solution is; we might need to add a method for reducedim_initarray(A::DimensionalArray{T,N,SparseArray}) to override the base case. Nothing else in Base has a method for that, so it's a one-off, but I hope nothing in the ecosystem does either...
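Something like this, maybe (a rough sketch only - the exact DimensionalArray type parameters and the rebuild step are assumptions about internals, and the dims would also need reducing to match the new shape):

using SparseArrays
import Base: reducedim_initarray

# Hypothetical override: when the parent data is sparse, delegate to the
# sparse-specialised method (which allocates a dense Array, as in the link
# above), then re-wrap the result.
function reducedim_initarray(A::DimensionalArray{T,N,<:AbstractSparseArray},
                             region, v0, ::Type{R}) where {T,N,R}
    data = reducedim_initarray(A.data, region, v0, R)
    rebuild(A, data)  # assumed helper that reattaches the (reduced) dims
end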

I'm not sure the approach I've taken to mean etc. and dim rebuilding makes sense in the long term. It was an experiment that worked surprisingly well for simple things, but it might break due to my lack of knowledge of Base methods.

rafaqz commented on May 27, 2024

By the way ["var$i" for i in 1:10000] isn't a very efficient lookup (ie string comparisons), why not just use a range?

ivirshup commented on May 27, 2024

Thanks for the quick response, and the package! It's really nice to see progress being made on labelled arrays in Julia.

I don't think I've been using Julia heavily enough recently to provide much insight for the implementation. That said, I was expecting the call to look much more like DimensionalArray(mean(a.data, dimnum(a, dims)), #= calculated output dimensions =#). I think this would make sense for a number of cases, especially since there would be some room to deal with removing labels from the reduced dimension (e.g. if you reduce over latitude, I wouldn't expect any latitude label in the output).
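Spelled out, the shape I had in mind is roughly this (a sketch only; reducedims is a made-up helper that recalculates the output dims, and I've used the keyword form of mean):

using Statistics

# Reduce on the parent data directly, so methods specialised on the parent
# array type (e.g. SparseMatrixCSC) get to dispatch, then rebuild the wrapper.
function Statistics.mean(a::DimensionalArray; dims)
    dnums = dimnum(a, dims)  # translate dim types/names to integer axes
    newdims = reducedims(DimensionalData.dims(a), dnums)  # assumed helper
    DimensionalArray(mean(a.data; dims=dnums), newdims)
end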

Are you sure the reduction is causing the performance issue? I agree it explains the difference in output type, but there are similar performance issues with other operations like copy or broadcasting scalar functions (abs.):

`copy` benchmark
julia> @benchmark copy($sparsear)
BenchmarkTools.Trial: 
  memory estimate:  152.64 MiB
  allocs estimate:  7
  --------------
  minimum time:     29.745 ms (2.43% GC)
  median time:      40.765 ms (30.34% GC)
  mean time:        41.141 ms (31.30% GC)
  maximum time:     91.379 ms (68.68% GC)
  --------------
  samples:          122
  evals/sample:     1

julia> @benchmark copy($sparsed)
BenchmarkTools.Trial: 
  memory estimate:  152.64 MiB
  allocs estimate:  8
  --------------
  minimum time:     3.843 s (0.02% GC)
  median time:      3.849 s (0.20% GC)
  mean time:        3.849 s (0.20% GC)
  maximum time:     3.855 s (0.38% GC)
  --------------
  samples:          2
  evals/sample:     1
`abs.` benchmark
julia> @benchmark abs.(sparsear)
BenchmarkTools.Trial: 
  memory estimate:  152.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     41.412 ms (1.47% GC)
  median time:      52.996 ms (25.29% GC)
  mean time:        53.401 ms (25.88% GC)
  maximum time:     110.823 ms (65.40% GC)
  --------------
  samples:          94
  evals/sample:     1

julia> @benchmark abs.(sparsed)
BenchmarkTools.Trial: 
  memory estimate:  763.09 MiB
  allocs estimate:  13
  --------------
  minimum time:     1.969 s (0.49% GC)
  median time:      2.025 s (3.01% GC)
  mean time:        2.028 s (2.98% GC)
  maximum time:     2.090 s (5.29% GC)
  --------------
  samples:          3
  evals/sample:     1

To me, the memory usage suggests an intermediate dense array is being created.

By the way ["var$i" for i in 1:10000] isn't a very efficient lookup (ie string comparisons), why not just use a range?

For my use case, the observations are barcodes for cells and the variables are gene identifiers. Access via string names is important for the user interface, though any heavy computation should access the values by position.

I think I was expecting xarray/pandas-like behavior, where an index (probably a hash table) would be used for string-labelled dimensions.
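The idea, roughly, in plain Julia (not anything DimensionalData provides):

# Precompute a hash index from label to position, so a string lookup is a
# single Dict access instead of a scan with string comparisons.
labels = ["var$i" for i in 1:10000]
index = Dict(l => i for (i, l) in enumerate(labels))
index["var9999"]  # == 9999, expected O(1)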

rafaqz commented on May 27, 2024

The sparse array created in similar() is very likely causing the performance issue for mean - that's why Base avoids it. But there may well be more issues with copy etc. I hadn't even thought about sparse arrays before this conversation because I don't use them. There might need to be some architectural changes.

The dims do actually change with mean, but Julia arrays don't drop the dimension, they just reduce it to size 1. So the lat dim has to stay, with a length of 1. What will eventually happen is that there will be a cell-size field on the dim showing that the single index now covers the whole original latitude span in one cell, which is good information to keep (and will show on the plot).

The lookup is very rudimentary at the moment. It will be fast enough with numbers. If you want fast hashes, AcceleratedArrays.jl should work inside dims, and trying it would be interesting. But I haven't worked on that yet. Xarray and Pandas have developer teams and this was just me hacking away for a week lol!! I just need GeoData.jl for the modelling packages that I'm actually paid to write and others I do research with, and DimensionalData.jl is a bonus.
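For reference, usage looks roughly like this (going by the AcceleratedArrays README; I haven't tried it inside dims yet):

using AcceleratedArrays

# Attach a hash index to the label vector; findall then uses the index
# rather than scanning.
labels = accelerate(["var$i" for i in 1:10000], UniqueHashIndex)
findall(isequal("var9999"), labels)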

It won't be practically comparable for a few years, and only then if enough people jump in and help (or the same happens with an alternative package).

ivirshup commented on May 27, 2024

Thanks for recommending AcceleratedArrays.jl! I've been wondering if a package like that existed for a while now.

Xarray and Pandas have developer teams and this was just me hacking away for a week lol!!

For sure! I'm definitely not expecting a single-developer v0.1.1 package to be totally done.

The dims do actually change with mean, but julia arrays don't drop the dimension, just reduce it to size 1. So the lat dim has to stay, with a length of 1.

Definitely agree that the dimension should stay. The main thing was that a single label on that dimension stayed, which you've got a plan for.

The sparse array created in similar() is very likely to be causing the performance issue for mean.

I did a little more digging into what's happening in mean, and I think there is a small effect from using similar, but there's a more general issue of dispatch. Swapping out the call to reducedim_init with a precomputed value shows very similar times:

Benchmarks
julia> ddinit = Base.reducedim_init(t -> t/2, +, sparsed, 1)
       sainit = Base.reducedim_init(t -> t/2, +, sparsear, 1);

julia> @benchmark Statistics.mean!($ddinit, $sparsed)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.666 s (0.00% GC)
  median time:      1.673 s (0.00% GC)
  mean time:        1.674 s (0.00% GC)
  maximum time:     1.685 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark Statistics.mean!($sainit, $sparsed)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.677 s (0.00% GC)
  median time:      1.677 s (0.00% GC)
  mean time:        1.696 s (0.00% GC)
  maximum time:     1.734 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

It looks like the big change in time is actually coming from the sum! call inside mean!, and that time is independent of the preallocated output array.

Benchmarks
julia> @benchmark sum!($sainit, $sparsed)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.672 s (0.00% GC)
  median time:      1.675 s (0.00% GC)
  mean time:        1.680 s (0.00% GC)
  maximum time:     1.692 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark sum!($ddinit, $sparsear)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.222 ms (0.00% GC)
  median time:      4.337 ms (0.00% GC)
  mean time:        4.365 ms (0.00% GC)
  maximum time:     7.741 ms (0.00% GC)
  --------------
  samples:          1144
  evals/sample:     1

The main time discrepancy ends up coming from different methods being called by mapreduce. The one-argument sum (which sees the same performance difference and was easier to trace) calls Base._mapreduce(f, op, ::Base.IndexCartesian, A::SparseMatrixCSC{T}) for the sparse array, while mapfoldl(f, op, A) ends up being called on the DimensionalArray. When a dimension is provided, sum calls _mapreducedim!(f, op, R::AbstractArray, A::AbstractArray) on the DimensionalArray, while _mapreducedim!(f, op, R::AbstractArray, A::SparseMatrixCSC{T}) is called on the sparse array.

The reason I say this is a more general issue of dispatch is that methods specialized on the wrapped array aren't being called. I imagine similar issues will crop up with other array types, since there isn't really a step that allows dispatch based on the array holding the data. This does seem tough to solve in an elegant way.
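For anyone retracing this, @which makes the split visible (a sketch using the variables from above, not output from my session):

using InteractiveUtils  # for @which

@which Base._mapreduce(identity, +, IndexCartesian(), sparsear)  # SparseArrays-specialised method
@which Base._mapreduce(identity, +, IndexCartesian(), sparsed)   # generic Base fallback via mapfoldl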

Btw, AxisArrays seems to have the same performance issue 😉.

rafaqz commented on May 27, 2024

Thanks for digging into that. That makes sense: it's specialised methods being dispatched on the SparseArray vs general methods on AbstractDimensionalArray. Fixing this will be a design goal when I rewrite the math/stats methods. But I feel better knowing AxisArrays does it too lol

Let me know how you go using AcceleratedArrays, and do put in an issue if any changes need to happen here for it to work.

rafaqz commented on May 27, 2024

This should be fixed by fda792b

ivirshup commented on May 27, 2024

Thanks for that! It fixes some cases, but others, like mean(sparsed) and copy(sparsed), still dispatch to an inefficient AbstractArray method.

Without using something like Cassette, can you think of any way this could be done less manually? Maybe a public interface for "registering" a function with DimensionalData could make this easier?
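What I'm picturing is something like this hypothetical macro (invented here, not existing API), so each forward is one line:

using Statistics

# Generate a single-argument method that delegates to the wrapped data,
# letting methods specialised on the parent array type dispatch.
macro forward_to_parent(f)
    :( $(esc(f))(A::DimensionalArray) = $(esc(f))(A.data) )
end

@forward_to_parent Statistics.mean
@forward_to_parent Base.sum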

BTW, I gave AcceleratedArrays a shot, but didn't see a speedup. I think this is due to how DimensionalData.at is written. It looks simple enough to fix, but I think it might be worth waiting for the next release of AcceleratedArrays since the API is meant to change.

rafaqz commented on May 27, 2024

Really?
https://travis-ci.org/rafaqz/DimensionalData.jl/jobs/597286088#L312
https://github.com/rafaqz/DimensionalData.jl/blob/master/test/benchmarks.jl#L110

It should be even closer using Var() instead of Var (which isn't type stable).

Also, thanks for checking out AcceleratedArrays. Yeah, At needs work; Near is at least using binary search already. Once the API changes I'll have a look, although I'm concerned about depending on AcceleratedArrays - maybe a glue package will be required if it can't be generic.

rafaqz commented on May 27, 2024

Ok copy still needs a method, but mean should be fine now.

Edit: Copy should also be fixed in #13.

Let me know if any other methods give you trouble. I guess this process has to be piecemeal, as it's hard to know where the generic methods aren't good enough, and I don't want bloat for no reason or have the time to test everything.
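For reference, the shape of such a method is probably something like this sketch (not necessarily what #13 does; rebuild reconstructs the wrapper around new data):

# Copy the parent data with its own specialised copy method, then re-wrap.
Base.copy(A::DimensionalArray) = rebuild(A, copy(A.data))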

rafaqz commented on May 27, 2024

Benchmarks are ok:
https://travis-ci.org/rafaqz/DimensionalData.jl/jobs/597477387#L313

ivirshup commented on May 27, 2024

Ok copy still needs a method, but mean should be fine now.

Oh, I meant the single-argument version of mean. For the same variables defined above, using f9eeb37:

Benchmark
julia> @benchmark mean($sparsed)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.614 s (0.00% GC)
  median time:      1.623 s (0.00% GC)
  mean time:        1.623 s (0.00% GC)
  maximum time:     1.633 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark mean($sparsear)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  4
  --------------
  minimum time:     4.018 ms (0.00% GC)
  median time:      4.112 ms (0.00% GC)
  mean time:        4.185 ms (0.00% GC)
  maximum time:     9.738 ms (0.00% GC)
  --------------
  samples:          1193
  evals/sample:     1

I think it just needs something like Statistics.mean(A::DimensionalArray) = mean(A.data). Other reductions could use this as well; at least sum has similar issues.

It should be even closer using Var() instead of Var (which isn't type stable).

Huh. Why isn't Var type-stable? I'm not sure I see how an instance would be type-stable but the type wouldn't be.

rafaqz commented on May 27, 2024

Oh right, without dims. Those are just missing methods. There are probably 50 of them - let's make a list... Obviously everything I already have dims methods for.

Var can have issues if it's put in a tuple - UnionAll dispatch or something. I should be clearer on why... It might not apply here, but it often does in DD.
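Roughly the distinction, assuming Var is a parametric dim type (as the @dim macro creates): the type object itself is a UnionAll, so inside a tuple the specific dim is erased from the tuple's type, while an instance keeps it:

typeof((Var,))    # Tuple{UnionAll} - which dim it was is gone from the type
typeof((Var(),))  # concrete tuple type - dispatch can still see Var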

rafaqz commented on May 27, 2024

Ok, I've added single-argument methods to that pull request for everything that already had dims methods.

Thanks for the input, keep them coming :)
