Comments (6)

hadley commented on August 15, 2024

This is a pathological example because there are so many groups - ~90,000 for 100,000 observations. This is a situation where customised transform_by and summarise_by functions would be useful, because the output types are known in advance.
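
To illustrate why known output types could help, here is a minimal base R sketch. The helper name `summarise_mean_by` is hypothetical (it is not a plyr function); it assumes a single numeric result per group, so the result vector can be typed up front with `vapply()` instead of building one data frame per group.

```r
# Hypothetical helper, not part of plyr: a per-group mean where the output
# type (one double per group) is declared in advance.
summarise_mean_by <- function(x, g) {
  # Split the values by group, dropping any unused factor levels
  pieces <- split(x, factor(g), drop = TRUE)
  # vapply() checks that every group yields exactly one numeric value
  vapply(pieces, mean, numeric(1))
}

# Toy input, made up for illustration
x <- c(1, 2, 3, 4)
g <- c("a", "b", "a", "b")
res <- summarise_mean_by(x, g)
print(res)
```

This returns a named numeric vector with one entry per observed group; extending it to several grouping variables or several summary columns is left out of the sketch.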

from plyr.

hadley commented on August 15, 2024

The idempotent data frame helps somewhat here:

> system.time(ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
103.711  21.499 125.523 
> i <- idata.frame(d)
> system.time(ddply(i, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
 73.654   0.251  74.008 

hadley commented on August 15, 2024

Naive use of ave is better, but using interaction directly is best:

> system.time({
+   d$avx <- ave(d$x, list(d$grp1, d$grp2))
+   d$avy <- ave(d$y, list(d$grp1, d$grp2))
+ })
   user  system elapsed 
 39.300   0.279  40.809 
> 
> system.time({
+   d$avx <- ave(d$x, interaction(d$grp1, d$grp2, drop = TRUE))
+   d$avy <- ave(d$y, interaction(d$grp1, d$grp2, drop = TRUE))
+ })
   user  system elapsed 
  6.735   0.209   7.064 
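
As a sanity check that the two forms compute the same thing, here is a small sketch on synthetic data. The column names mirror the thread, but the values and sizes are made up. The speed difference is presumably because `drop = TRUE` discards the empty level combinations, though that explanation is an assumption and is not measured here.

```r
# Synthetic stand-in for the thread's data frame `d` (values are invented)
set.seed(1)
n <- 1000
d <- data.frame(
  grp1 = sample(letters, n, replace = TRUE),
  grp2 = sample(LETTERS, n, replace = TRUE),
  x    = rnorm(n)
)

# Grouping by a list of vectors, as in the first timing
via_list <- ave(d$x, list(d$grp1, d$grp2))

# Grouping by an explicit interaction with unused combinations dropped
via_inter <- ave(d$x, interaction(d$grp1, d$grp2, drop = TRUE))

# Both should hold the per-group mean of x, repeated for every row
same <- isTRUE(all.equal(via_list, via_inter))
print(same)
```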

dkulp2 commented on August 15, 2024

I'm not sure whether this is related or just exacerbated by the large number of groups. If the summary function is passed without naming the argument, it is fast. Using summarize is OK, but it hangs for a long time at the end of the computation. If you try to give a name to the function argument, ddply typically runs so long and uses so much memory that it's not worth timing.

> m <- function(df) { mean(df$x) }
> system.time(foo <- ddply(d, .(grp1, grp2), m,.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
 45.826  21.846  69.497 
> system.time(foo <- ddply(d, .(grp1, grp2), avx=m,.progress='text'))
(USER INTERRUPT)
Timing stopped at: 97.059 51.123 150.527 
> system.time(foo <- ddply(d, .(grp1, grp2), summarise, avx = mean(x),.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
115.335  63.607 180.758 

hadley commented on August 15, 2024

Yes, this is known: much of the overhead is in creating the data frames. (Because you've incorrectly named the argument in the second form, ddply uses the default, identity, which will be very slow because it doesn't do any reduction.)
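
The argument-matching behaviour can be seen with a stand-in function. `apply_groups` below is hypothetical and only imitates the shape of a signature with a `.fun` argument followed by `...`; it is not ddply itself. A function passed as `avx = m` does not match `.fun`, so it lands in `...` and the default is used.

```r
# Hypothetical stand-in with a ddply-like signature (not plyr code)
apply_groups <- function(.data, .variables, .fun = NULL, ...) {
  extra <- list(...)
  list(
    fun_supplied = !is.null(.fun),  # was .fun actually matched?
    extra_names  = names(extra)     # names of anything swallowed by ...
  )
}

m <- function(df) mean(df$x)

# Positional: m is matched to .fun
positional <- apply_groups(NULL, NULL, m)

# Named as avx: no formal is called avx, so m goes into ... instead
named <- apply_groups(NULL, NULL, avx = m)

print(positional$fun_supplied)
print(named$fun_supplied)
print(named$extra_names)
```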

hadley commented on August 15, 2024

See dplyr for a solution to this.