Comments (6)
This is a pathological example because there are so many groups: ~90,000 for 100,000 observations. This is a situation where customised transform_by and summarise_by functions would be useful, because the output types are known in advance.
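To illustrate why knowing the output type in advance helps, here is a minimal base R sketch. It is not plyr's API: `summarise_mean_by` is a hypothetical helper and the toy data stand in for the real `d`. `vapply` is told that each group yields exactly one numeric value, so the result is a plain numeric vector rather than something assembled from many small data frames.

```r
# Hypothetical helper (not part of plyr): group means with a declared output type.
summarise_mean_by <- function(x, groups) {
  # interaction(..., drop = TRUE) builds one factor from all grouping columns
  # and discards combinations that never occur.
  g <- interaction(groups, drop = TRUE)
  # FUN.VALUE = numeric(1) promises exactly one double per group.
  vapply(split(x, g), mean, FUN.VALUE = numeric(1))
}

# Toy data, assumed for illustration only.
d <- data.frame(
  grp1 = c("a", "a", "b", "b"),
  grp2 = c("u", "u", "u", "v"),
  x    = c(1, 3, 5, 7)
)

res <- summarise_mean_by(d$x, d[c("grp1", "grp2")])
print(res)
```

The result is a named numeric vector with one entry per observed group combination.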
from plyr.
The immutable data frame (idata.frame) helps somewhat here:
> system.time(ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
user system elapsed
103.711 21.499 125.523
> i <- idata.frame(d)
> system.time(ddply(i, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
user system elapsed
73.654 0.251 74.008
Naive use of ave is better, but using interaction directly is best:
> system.time({
+ d$avx <- ave(d$x, list(d$grp1, d$grp2))
+ d$avy <- ave(d$y, list(d$grp1, d$grp2))
+ })
user system elapsed
39.300 0.279 40.809
>
> system.time({
+ d$avx <- ave(d$x, interaction(d$grp1, d$grp2, drop = TRUE))
+ d$avy <- ave(d$y, interaction(d$grp1, d$grp2, drop = TRUE))
+ })
user system elapsed
6.735 0.209 7.064
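A likely reason for the gap is that `ave` builds the interaction of a list of factors without dropping unused combinations, so with many levels it iterates over a large number of empty groups. The sketch below uses a small toy data frame (assumed for illustration, not the thread's `d`) to show the level counts with and without `drop = TRUE`, and that both forms return the same per-row group means.

```r
# Toy data, assumed for illustration; the thread's `d` has ~100,000 rows.
d <- data.frame(
  grp1 = factor(c("a", "a", "b", "b", "c")),
  grp2 = factor(c("u", "u", "v", "v", "u")),
  x    = c(1, 3, 5, 7, 9)
)

# All level combinations, including ones that never occur in the data.
g_all  <- interaction(d$grp1, d$grp2)
# Only the combinations that are actually observed.
g_used <- interaction(d$grp1, d$grp2, drop = TRUE)

n_all  <- nlevels(g_all)   # every possible grp1 x grp2 combination
n_used <- nlevels(g_used)  # combinations present in d

# Both forms compute the same per-row group means.
avx_list <- ave(d$x, list(d$grp1, d$grp2))
avx_int  <- ave(d$x, g_used)

print(c(all = n_all, used = n_used))
print(isTRUE(all.equal(avx_list, avx_int)))
```

With ~90,000 observed groups drawn from two high-cardinality factors, the number of unused combinations can be far larger than the number of used ones, which is why dropping them matters.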
I'm not sure whether this is related or just exacerbated by the large number of groups. Passing the summary function as an unnamed argument is fast. Using summarise works, but it hangs for a long time at the end of the computation. If you pass the function as a named argument (avx=m), ddply typically runs so long and uses so much memory that it's not worth timing.
> m <- function(df) { mean(df$x) }
> system.time(foo <- ddply(d, .(grp1, grp2), m,.progress='text'))
|========================================================================| 100%
user system elapsed
45.826 21.846 69.497
> system.time(foo <- ddply(d, .(grp1, grp2), avx=m,.progress='text'))
(USER INTERRUPT)
Timing stopped at: 97.059 51.123 150.527
> system.time(foo <- ddply(d, .(grp1, grp2), summarise, avx = mean(x),.progress='text'))
|========================================================================| 100%
user system elapsed
115.335 63.607 180.758
Yes, this is known; much of the overhead is creating the data frames. (Because you've (incorrectly) named the argument in the second form, it is not matched as the function to apply, so ddply uses the default, identity, which will be very slow because it doesn't do any reduction.)
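The difference between a reducing function and identity can be sketched in base R. This only loosely mimics the split-apply-combine pattern and is not how ddply is implemented; `apply_by` is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: split a data frame by a grouping vector, apply `fun`
# to each piece, and row-bind the results.
apply_by <- function(df, g, fun) {
  pieces <- lapply(split(df, g, drop = TRUE), fun)
  do.call(rbind, pieces)
}

# Toy data, assumed for illustration only.
d <- data.frame(grp = c("a", "a", "b", "b", "b"), x = c(1, 3, 5, 7, 9))

# A reducing function returns one row per group.
reduced <- apply_by(d, d$grp, function(piece) data.frame(avx = mean(piece$x)))

# identity returns every piece unchanged, so the combine step has to
# reassemble all of the original rows.
unreduced <- apply_by(d, d$grp, identity)

print(nrow(reduced))
print(nrow(unreduced))
```

With identity, the combine step receives one full-width data frame per group instead of a single summary row, so the work grows with the total number of rows as well as with the number of groups.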
See dplyr for a solution to this.