hadley / plyr
An R package for splitting, applying and combining large problems into simpler problems
Home Page: plyr.had.co.nz
License: Other
mods <- dlply(mtcars, "cyl", lm, formula = mpg ~ wt)
ldply(mods, function(x) coef(summary(x)))
ldply(mods, function(x) as.data.frame(coef(summary(x))))
Should also capture rownames?
The following small example (extracted from the actual data) misbehaves for me.
m1<-data.frame(cl=c(1,2), file=c("hi", "low"))
m2<-data.frame(file=c("1776.txt", "About.txt"), actual=c(11.5, 4.5), stringsAsFactors=F)
join(m1, m2, "file")
*** caught segfault ***
address 0x4000000d, cause 'memory not mapped'
Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(keys$y), keys$y, keys$n)
3: join_ids(x, y, by, all = TRUE)
4: join_all(x, y, by, type)
5: join(m1, m2, "file")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: Selection: 3
Warning message:
In `[<-.factor`(`*tmp*`, rng, value = c(1L, 2L, NA, NA)) :
invalid factor level, NAs generated
sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.6
What are the consequences of http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations for making faster aggregation functions?
From Tim Bates:
a<-data.frame(x=1:5,y=1:5)
# x y
#1 1 1
#2 2 2
rename(a, c(x= "y"))
# might be nice to warn the user they just duplicated a column name (if they ask for warnings)
rename(a, c(foo= "y"))
# might be nice to warn the user that nothing happened
This is not a bug but a feature request.
It would be nice (for me, at least) to make rbind.fill work with lists, not only with data.frames, as in the following example:
l<- list( list(x=c(1,2), y=c(1,3)), list(x=4, z=4))
rbind.fill(l)
x y z
1 1 1 NA
2 2 3 NA
3 4 NA 4
In theory, the behavior of rbind.fill(l), when l is a list of lists, should be equivalent to rbind.fill(lapply(l, as.data.frame)), but the latter is much slower (I am working on a program which needs to apply rbind.fill to lists of about 10000 elements). Even using quickdf instead of as.data.frame, the overhead is not negligible.
I actually made a quick and dirty hack to make rbind.fill work in my case, which is to replace the function empty with
empty <- function(l) {
is.null(l) || length(l)==0 || (is.data.frame(l) && .row_names_info(l) == 0)
}
and the line
rows <- unlist(lapply(dfs, .row_names_info, 2L))
with
rows <- unlist(lapply(dfs, function(df) if (is.data.frame(df)) .row_names_info(df,2L) else length(df[[1]])))
It seems to work fine. If you are interested in applying these changes, I can debug them more deeply and produce a patch.
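The idea behind this request can be sketched in base R, without plyr. The helper below, rbind_fill_lists, is a hypothetical name and not part of plyr. It assumes every element is a named list whose components all have the same length, and it only illustrates the fill-with-NA logic, not the patch described above.

```r
# Hypothetical helper, not part of plyr: row-bind a list of named lists,
# filling columns that an element lacks with NA.
rbind_fill_lists <- function(lists) {
  # Drop NULL and zero-length elements.
  lists <- Filter(function(l) length(l) > 0, lists)
  if (length(lists) == 0) return(data.frame())

  # Assumption: the row count of an element is the length of its first component.
  rows <- vapply(lists, function(l) length(l[[1]]), integer(1))
  vars <- unique(unlist(lapply(lists, names)))
  total <- sum(rows)
  offsets <- cumsum(rows) - rows

  # Every output column starts as NA; assignment coerces it to the data's type.
  out <- stats::setNames(rep(list(rep(NA, total)), length(vars)), vars)
  for (i in seq_along(lists)) {
    idx <- offsets[i] + seq_len(rows[i])
    for (v in names(lists[[i]])) {
      out[[v]][idx] <- lists[[i]][[v]]
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

l <- list(list(x = c(1, 2), y = c(1, 3)), list(x = 4, z = 4))
res <- rbind_fill_lists(l)
print(res)
```

Because the lists are never converted to data frames one by one, the per-element as.data.frame overhead mentioned above does not arise in this sketch.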
The newly promoted rename function in plyr 1.5 does not map the new names correctly. The requested columns are renamed, but not to the correct new names.
This code example is drawn from coord-cartesian-flipped.r in ggplot2 which exposed the bug.
data <- structure(list(ymax = 0.819148936170213, ymin = 0.525145067698259,
x = 0.17741935483871, xmin = 0.0564516129032258, xmax = 0.298387096774194,
colour = "grey20", size = 0.5, linetype = 1, group = 1, alpha = 1,
fill = "#FFFFFFFF"), .Names = c("ymax", "ymin", "x", "xmin",
"xmax", "colour", "size", "linetype", "group", "alpha", "fill"
), row.names = c(NA, -1L), class = "data.frame")
plyr::rename(data, c(
x = "y", y = "x",
xend = "yend", yend = "xend",
xmin = "ymin", ymin = "xmin",
xmax = "ymax", ymax = "xmax")
)
y x yend xend ymin colour size linetype group
1 0.8191489 0.5251451 0.1774194 0.05645161 0.2983871 grey20 0.5 1 1
alpha fill
1 1 #FFFFFFFF
xmax xmin y ymin ymax colour size linetype group
1 0.8191489 0.5251451 0.1774194 0.05645161 0.2983871 grey20 0.5 1 1
alpha fill
1 1 #FFFFFFFF
Note that this is the result given by reshape::rename with the same arguments.
The new names that are given are pulled from the new names in the order they are listed, not in the order that they match the old names.
To fix the problem, in plyr::rename the line
setNames(x, ifelse(is.na(name_match), old_names, new_names))
should be
setNames(x, ifelse(is.na(name_match), old_names, new_names[name_match]))
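The difference between the two lines can be reproduced in a small base R sketch. rename_sketch and its index_by_match switch are hypothetical, and the way name_match is computed here (matching the existing column names against the names of the replacement vector) is an assumption about the surrounding code, not a quote from plyr.

```r
# Hypothetical re-creation of the renaming step; not the plyr source.
rename_sketch <- function(x, replace, index_by_match = TRUE) {
  old_names <- names(x)
  new_names <- unname(replace)
  # Assumption: position of each existing column name among the names to replace.
  name_match <- match(old_names, names(replace))
  # Reported behaviour: new names used in the order listed.
  # Proposed fix: new names looked up through name_match.
  candidates <- if (index_by_match) new_names[name_match] else new_names
  stats::setNames(x, ifelse(is.na(name_match), old_names, candidates))
}

df <- data.frame(ymax = 0.8, ymin = 0.5, x = 0.2, colour = "grey20")
replace <- c(x = "y", ymin = "xmin", ymax = "xmax")

buggy <- rename_sketch(df, replace, index_by_match = FALSE)
fixed <- rename_sketch(df, replace, index_by_match = TRUE)
print(names(buggy))
print(names(fixed))
```

In the unindexed variant, ymax picks up "y" simply because "y" is listed first. With the lookup through name_match, each column receives the name paired with it in the replacement vector.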
None known.
Inside the ggplot2 code, rename uses plyr::rename, while at the command prompt, rename uses reshape::rename. This is why the explicit package notation is needed in the function calls.
R version 2.13.0 RC (2011-04-10 r55401)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_0.8.9 proto_0.3-9.1 reshape_0.8.4 plyr_1.5
loaded via a namespace (and not attached):
[1] tools_2.13.0
qq <- data.frame(foo=character(), bar=character())
[1] foo bar
<0 rows> (or 0-length row.names)
d_ply(qq, .(foo))
Error in tapply(1:nrow(data), splitv, list) :
arguments must have same length
Expected behavior: d*ply should return a 0-length value of the appropriate type (df, list, array), and d_ply should return nothing, but there should not be a runtime error.
I'm not sure if this is fixed in the upcoming 1.0, but I don't see anything in the NEWS that would suggest it is. Thanks!
It can be useful
Normally, ddply will simply omit from the resulting data frame any split portions that return NULL:
> my_df <- data.frame(a = 1:10, b = 11:20)
> ddply(my_df, .(a), function (x) {
+ if (x$a < 5) data.frame(foo = 'bar') else NULL
+ })
a foo
1 1 bar
2 2 bar
3 3 bar
4 4 bar
However, if all of the splits are NULL:
> ddply(my_df, .(a), function (x) {
+ if (x$a < 0) data.frame(foo = 'bar') else NULL
+ })
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 10, 0
I would expect this to either return an empty data frame or NULL.
Edit: I dug a bit further and found that this is just a symptom of a problem with ldply, where it will tolerate all of the .fun results being NULL only if the items of the input list are unlabeled:
> ldply(list('foo'), function (x) NULL)
data frame with 0 columns and 0 rows
> ldply(list(a='foo'), function (x) NULL)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1, 0
This, in turn, appears to be an issue with the internal list_to_dataframe function:
> list_to_dataframe(list(1,2))
V1
1 1
2 2
> list_to_dataframe(list(1,2), data.frame(foo=101:102))
foo V1
1 101 1
2 102 2
> list_to_dataframe(list(NULL,NULL))
data frame with 0 columns and 0 rows
> list_to_dataframe(list(NULL,NULL), data.frame(foo=101:102))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 2, 0
I believe all that would need to be done to fix this is to adjust the first check in list_to_dataframe to read as follows, but I'm still something of an R newbie so I'm probably missing something:
if (length(res) == 0 || all(is.null(unlist(res))))
return(data.frame())
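A variant of that check, written as a standalone base R sketch: is.null() returns a single logical, so the all() in the proposed condition has no effect, and a per-element test states the intent more directly. The helper name all_results_null is hypothetical and is not part of plyr.

```r
# Hypothetical helper: TRUE when there are no results or every result is NULL.
all_results_null <- function(res) {
  length(res) == 0 || all(vapply(res, is.null, logical(1)))
}

print(all_results_null(list()))
print(all_results_null(list(NULL, NULL)))
print(all_results_null(list(1, NULL)))
```

Whether returning an empty data frame at that point is the right behaviour for every caller of list_to_dataframe is a separate question that this sketch does not settle.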
Using plyr 1.5.2, running:
dat <- data.frame(x = factor(LETTERS[1:3],ordered=TRUE),y=rep(1,3))
#TRUE
is.ordered(dat$x)
#This calls ddply but does nothing
datCopy <- ddply(dat,.(x),.fun = function(x){x})
#FALSE
is.ordered(datCopy$x)
Data frame returned by ddply should preserve ordered factors...thanks!
d1 = data.frame(a=runif(10), b=runif(10))
d2 = data.frame(a=runif(10), b=runif(10))
attr(d1$b, "foo") = "one"
attr(d2$b, "foo") = "two"
attributes(rbind(d1, d2)$b)
attributes(rbind.fill(d1, d2)$b)
rbind returns "foo", rbind.fill returns NULL. This is particularly a problem for objects of class "circular" which store a number of properties (rotation direction, units, etc.) as attributes.
Currently, sapply(df,f) includes rownames in the result, but colwise(f)(df) does not.
Suggest adding a (possibly optional) rowname column in colwise.
d1 = data.frame(a=1:3, b=1:3)
d2 = data.frame(a=1:4, c=1:4)
library(plyr)
join(d1, d2, type="right")
join(d2, d1, type="left")
The two joins should be equivalent except for the order of the columns. But they are not: the right join produces an NA instead of the number 4 in column "a".
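For comparison, the expected equivalence can be checked with base merge, which needs no plyr. This is only a reference point for the behaviour the report asks for. The all.y/all.x arguments stand in for type="right" and type="left".

```r
d1 <- data.frame(a = 1:3, b = 1:3)
d2 <- data.frame(a = 1:4, c = 1:4)

# Right join of d1 with d2, and the mirrored left join of d2 with d1.
right <- merge(d1, d2, by = "a", all.y = TRUE)
left <- merge(d2, d1, by = "a", all.x = TRUE)

# Compare after putting the columns in a common order.
cols <- c("a", "b", "c")
print(right[cols])
print(left[cols])
same <- isTRUE(all.equal(right[cols], left[cols]))
print(same)
```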
rbind.fill returns null for non-dataframe arguments. It should either give an error or return a useful result.
Base rbind works for vectors and lists:
rbind(c(a=1),c(b=2)) => matrix(1:2,2,1,dimnames=list(NULL,"a")) == as.matrix(data.frame(a=1:2))
but
rbind.fill(c(a=1),c(b=2)) => NULL
Shouldn't it give something like
matrix(c(1,NA,NA,2),2,2,dimnames=list(NULL,c("a","b")))
or
data.frame(a=c(1,NA),b=c(NA,2))
---------------test---------------
# Column order not defined by rbind.fill spec
sort_by_name <- function(a) if (length(names(a))==0) stop("sort_by_name: arg has no names") else a[order(names(a))]
expect_that( sort_by_name ( as.data.frame( rbind.fill(c(a=1),c(b=2)))),
is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )
expect_that( sort_by_name ( as.data.frame( rbind.fill(c(a=1),c(b=2),c(a=3,b=4)))),
is_identical_to( data.frame ( a=c(1, NA,3), b=c(NA,2,4) ) ) )
# mix data frame and vector?
expect_that( sort_by_name ( as.data.frame( rbind.fill(data.frame(a=1),c(b=2),c(a=3,b=4)))),
is_identical_to( data.frame ( a=c(1, NA,3), b=c(NA,2,4) ) ) )
# support lists?
expect_that( sort_by_name ( as.data.frame( rbind.fill(list(a=1),list(b=2)))),
is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )
# mixed lists and vectors?
expect_that( sort_by_name ( as.data.frame( rbind.fill(list(a=1),c(b=2)))),
is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )
summarize(data.frame(a=(1:5)^2),q=median(a),r=mean(a)-q)
As reported by Stavros Macrakis
Hi,
the join help made me believe join is like merge, just faster.
join indeed is faster, but it's not exactly like merge:
> merge(data.frame(x=1:2), data.frame(x=c(1,1), y=1:2), all=T)
x y
1 1 1
2 1 2
3 2 NA
> join(data.frame(x=1:2), data.frame(x=c(1,1), y=1:2), type="full")
Joining by: x
x y
1 1 1
2 2 NA
R version 2.11.1 Patched (2010-06-17 r52313)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid tcltk stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggplot2_0.8.8 reshape_0.8.3 plyr_1.2.1
[4] sqldf_0.3-5 chron_2.3-38 gsubfn_0.5-5
[7] proto_0.3-8 RSQLite.extfuns_0.0.1 RSQLite_0.9-4
[10] DBI_0.2-5
loaded via a namespace (and not attached):
[1] tools_2.11.1
this looks like a bug either in the docs or in the code.
S.
Thank you for the tremendously good work on this essential package.
My current script that causes the crash is too bulky for upload. I am working on an example script that will cause the same crash.
join() crashes my R session with:
*** caught segfault ***
address 0x0, cause 'memory not mapped'
Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(keys$y), keys$y, keys$n)
3: join_ids(x, y, by, all = TRUE)
4: join_all(x, y, by, type)
5: join(counts.transplant, counts.clamy, by = "Water.plot")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
within RStudio, this causes the whole app to crash.
Thank you and have an excellent day,
Etienne
Since it doesn't work.
Because plyr is a foundational package, it needs to export the absolute minimum of functions to avoid conflicts with other packages.
Although it does work for POSIX classes for example
A <- data.frame(time=Sys.time()+1:5, x=1:5)
B <- data.frame(time=Sys.time()+1:5, y=6:10)
llply(A, class)
llply(B, class)
llply(rbind(A,rename(B, c(y="x"))), class)
llply(rbind.fill(A,B), class)
# does honor POSIXct class
library(circular)
A <- data.frame(angle=circular(seq(0,pi,length=5)), x=1:5)
B <- data.frame(angle=circular(seq(0,-pi,length=5)), y=6:10)
llply(A, class)
llply(B, class)
llply(rbind(A,rename(B, c(y="x"))), class)
llply(rbind.fill(A,B), class)
# does not respect circular class
library(plyr)
# Generate some data
set.seed(321)
myD <- data.frame(
Place = sample(c("AWQ","DFR", "WEQ"), 10, replace=T),
Light = sample(LETTERS[1:2], 15, replace=T),
value=rnorm(30)
)
myfunc <- function(xdf, sctype) {
force(sctype)
ddply(xdf, .(Place, Light), transform,
rng = paste(value, sctype))
}
myfunc(myD, "range")
Maybe join_all?
When given a data.frame where one of the columns is a 1d array, rbind.fill will fail with the error:
Error in class(output[[var]]) <- class(value) :
cannot set class to "array" unless the dimension attribute has length > 0
This error is the (or at least a) root of the error in ggplot2 causing margins=TRUE in facet_grid to fail.
DF <- data.frame(x=1)
DF$x <- array(1, 1)
# note that DF <- data.frame(x=array(1,1)) converts the x column to a simple vector
rbind.fill(DF,DF)
x
1 1
2 1
Error in class(output[[var]]) <- class(value) :
cannot set class to "array" unless the dimension attribute has length > 0
The problem is in output_template in rbind.r. The error is thrown at line 87 when class(value) is "array". value does not match any of the previous is's (though is.array(value) is TRUE) and output[[var]] is a vector of NAs, which cannot be set to class array.
Possibly an additional else if block for is.array, though I don't know what form it should take.
None that I know of, other than making sure that 1d arrays are converted to vectors before rbind.fill is called.
So the following works readily
df <- data.frame(trt = rep(c("a", "b"), 10), a = 1:20, b = runif(20))
ddply(df, c("trt"), colwise(mean))
e.g. (from Paul Hiemstra)
bla = function(x) {
x*y
}
y = 10
ldply(dat, .(category), bla, .parallel = TRUE, .export = "y")
Could you modify create progress bar to accept additional arguments?
create_progress_bar<-function(name = "none",...) {
if (!is.character(name)) return(name)
name <- paste("progress", name, sep="_")
if (!exists(name, mode = "function")) {
warning("Cannot find progress bar ", name, call. = FALSE)
progress_none()
} else {
match.fun(name)(...)
}
}
That way titles and other details can be passed on when using create_progress_bar directly
We agreed that the best order would be: common columns, x-only columns, y-only columns. Currently the order is determined by the direction of the join, not by the order of the arguments.
With
> d1 = data.frame(a=1:3, b=1:3)
> d2 = data.frame(a=1:4, c=1:4)
> d1
a b
1 1 1
2 2 2
3 3 3
> d2
a c
1 1 1
2 2 2
3 3 3
4 4 4
Current
> join(d1, d2, type="right")
Joining by: a
a c b
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
Desired
> join(d1, d2, type="right")
Joining by: a
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 NA 4
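The agreed ordering rule (common columns, then x-only, then y-only) can be written as a small base R sketch. order_join_columns is a hypothetical helper, not a plyr function, and the "current" data frame is typed in by hand from the output shown above rather than produced by join.

```r
# Hypothetical helper: common columns first, then x-only, then y-only.
order_join_columns <- function(x_names, y_names,
                               by = intersect(x_names, y_names)) {
  c(by, setdiff(x_names, by), setdiff(y_names, c(by, x_names)))
}

d1 <- data.frame(a = 1:3, b = 1:3)
d2 <- data.frame(a = 1:4, c = 1:4)

# The "Current" right-join result from above, entered by hand.
current <- data.frame(a = 1:4, c = 1:4, b = c(1:3, NA))

# Reordering by the rule gives the "Desired" layout regardless of join type.
desired <- current[order_join_columns(names(d1), names(d2))]
print(desired)
```

Because the rule depends only on the argument order of x and y, the same column layout results for left, right and full joins.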
Is redundant - replace with incrementing counter in C code. Could probably also turn counts from an R vector to a C array for an additional small speed boost.
I noticed this first while using ddply(), but the code below demonstrates the issue with ldply(). I'm not sure if it extends to all *dply() functions or not, nor do I know if this extends beyond my mac setup.
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
foreach: simple, scalable parallel programming from REvolution Computing
Use REvolution R for scalability, fault tolerance and more.
http://www.revolution-computing.com
Loading required package: multicore
> registerDoMC(2)
> library(plyr)
> x = iris[which(iris[, 5] != "setosa"), c(1, 5)]
> iterations = 1e4 #at 1e4, each serial run will take about a minute
> f = function(...){
+ ind = sample(100,100,replace=T)
+ result1 = glm(
+ x[ind,2] ~ x[ind,1]
+ , family = binomial(logit)
+ )
+ return(coefficients(result1))
+ }
> #check to see if foreach works, first run the serial variant
> system.time(
+ foreach(icount(iterations), .combine = cbind) %do% f()
+ )
user system elapsed
75.629 1.670 77.535
> #now run the parallel %dopar% variant
> system.time(
+ foreach(icount(iterations), .combine = cbind) %dopar% f()
+ )
user system elapsed
44.533 1.304 50.126
> #%dopar% is faster, so we *can* do parallel computing...
> #now run a serial plyr job
> system.time(
+ ldply(
+ .data = 1:iterations
+ , .fun = f
+ )
+ )
user system elapsed
71.201 1.450 72.995
> #is the parallel version faster?...
> system.time(
+ ldply(
+ .data = 1:iterations
+ , .fun = f
+ , .parallel = T
+ )
+ )
user system elapsed
71.214 1.407 72.994
> #no! (indeed, my cpu monitor shows only 1cpu used)
> #now run a serial plyr job using llply
> system.time(
+ llply(
+ .data = 1:iterations
+ , .fun = f
+ )
+ )
user system elapsed
70.815 1.411 72.458
> #is the parallel version faster?...
> system.time(
+ llply(
+ .data = 1:iterations
+ , .fun = f
+ , .parallel = T
+ )
+ )
user system elapsed
87.049 2.611 53.604
> #yes! (indeed, my cpu monitor shows 2cpus used)
> #Strange!
library(plyr)
n<-100000
grp1<-sample(1:750, n, replace=T)
grp2<-sample(1:750, n, replace=T)
d<-data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)
ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y))
I have run into an issue where the first call to daply() is consistently faster than later calls. I verified that the issue exists under R 2.12.1 and plyr 1.4 on x86_64-pc-linux-gnu, i386-apple-darwin9.8.0, and x86_64-apple-darwin9.8.0 using the following code:
library(plyr)
dfData = data.frame(a=1:1e5, b=1)
system.time(daply(dfData, 'a', sum))
system.time(daply(dfData, 'a', sum))
system.time(daply(dfData, 'a', sum))
From a fresh R instance, the first call executes substantially faster than the later two (40s vs 65s on one system, 95s vs. 130s on another, 100s vs. 140s on a third). The problem persists when dfData is enlarged so that there are multiple b values per distinct a value.
i.e. should each row be treated as a single output dimension, or should each splitting variable become its own variable?
Suggested argument: .expand
Suggestion from Shawn Conway:
# required packages
library(plyr)
library(foreach)
# these two are optional (could instead be something like multicore, etc.)
library(snow)
library(doSNOW)
# parallel version of llply
llply <- function(.data, .fun = NULL, ..., .progress = "none", .inform = FALSE) {
if (inherits(.data, "split")) {
pieces <- .data
} else {
pieces <- as.list(.data)
}
if (is.null(.fun)) return(as.list(pieces))
n <- length(pieces)
if (n == 0) return(list())
if (is.character(.fun)) .fun <- each(.fun)
# .fun <- each(.fun)
if (!is.function(.fun)) stop(".fun is not a function.")
progress <- create_progress_bar(.progress)
progress$init(n)
on.exit(progress$term())
result <- vector("list", n)
do.ply <- function(i) {
piece <- pieces[[i]]
# Display informative error messages, if desired
if (.inform) {
res <- try(.fun(piece, ...))
if (inherits(res, "try-error")) {
piece <- paste(capture.output(print(piece)), collapse = "\n")
stop("with piece ", i, ": \n", piece, call. = FALSE)
}
} else {
res <- .fun(piece, ...)
}
progress$step()
if (!is.null(res)) return(res)
}
result <- foreach(i=seq_len(n)) %dopar% do.ply(i)
attributes(result)[c("split_type", "split_labels")] <- attributes(pieces)[c("split_type", "split_labels")]
names(result) <- names(pieces)
# Only set dimension if not null, otherwise names are removed
if (!is.null(dim(pieces))) {
dim(result) <- dim(pieces)
}
result
}
# run a quick test of different versions
cl <- makeCluster(2)
registerDoSNOW(cl)
system.time(x <- llply(baseball, summary))
system.time(y <- plyr::llply(baseball, summary))
stopCluster(cl)
identical(x, y)
library("plyr")
d <- data.frame(x=runif(5), y=runif(5))
aaply(d, 1, mean)
I would have expected the result to be
aaply(as.matrix(d), 1, mean)
just as
apply(d, 1, mean)
i.e. simply treat the data.frame as an array, since I use a*ply. Basically it seems to be building some sort of contingency table with additional computation (i.e. the mean) from the data.frame, but I don't understand from the documentation why it would do this.
e.g.
summarise(mtcars, cyl, vs)
Sorry if this is the wrong place, but I wondered why d_ply doesn't have a parallel argument, I'm splitting up a data frame and drawing graphs with it, seems a good use for parallelization ... did a quick search of the list and Hadley says it isn't supported back in Feb. Is this something that might be supported?
Thanks,
James
x <- array(1:(30*5*5),dim=c(30,5,5))
y <- aaply(x, c(1,2), mean)
all.equal(rowMeans(y), aaply(y, 1, mean))
all.equal(colMeans(y), aaply(y, 2, mean))
all.equal(apply(y, 1, mean), aaply(y, 1, mean))
all.equal(apply(y, 2, mean), aaply(y, 2, mean))
all.equal(apply(y, 3, mean), aaply(y, 3, mean))
That executes statements iteratively, not all at once.
d = data.frame(a=rep(1:3, each=5), x=runif(15))
ddply(d, ~a, function(x) {
mean(x$x)
})
ddply(d, ~a, function(x) {
if (any(x$a != 2)) {
mean(x$x)
}
})
There is a column "a" in the first case but not in the second one. There used to be one (based on some of my previous code, it worked 7 months ago... even though that's not very precise).
This
dlply(d, ~a, function(x) {
if (any(x$a != 2)) {
mean(x$x)
}
})
works so I guess the bug is in simplify-data-frame.R
Just a small annoyance. The latest stable version of plyr sorts in the order in which the arguments are given, e.g.
head(ddply(diamonds, ~cut+clarity, nrow), 10)
cut clarity V1
1 Fair I1 210
2 Fair SI2 466
3 Fair SI1 408
4 Fair VS2 261
5 Fair VS1 170
6 Fair VVS2 69
7 Fair VVS1 17
8 Fair IF 9
9 Good I1 96
10 Good SI2 1081
the devel version sorts in the reverse order, e.g.
head(ddply(diamonds, ~cut+clarity, nrow), 10)
cut clarity V1
1 Fair I1 210
2 Good I1 96
3 Very Good I1 84
4 Premium I1 205
5 Ideal I1 146
6 Fair SI2 466
7 Good SI2 1081
8 Very Good SI2 2100
9 Premium SI2 2949
10 Ideal SI2 2598
For the sake of backward consistency and also because it looks, to me, to be more natural in the original order, could this be changed?
a<-data.frame(x=1:11,y=c(1:5,NA,1:5))
var(a[,"y"],use="com") # works
ddply(a, "x", summarise, alsoworks = var(y,na.rm=T))
ddply(a, "x", summarise, fails = var(y,use="com"))
summarise(a[,"y"], alsofails = var(y,use="com"))
With some kind of attribute so that they can easily be filtered out
I got a weird error when using aaply on a data frame. Applying the same function on the data.frame transformed in matrix format works just fine
library(plyr)
x <- matrix(rnorm(100), ncol=10)
aaply(x, 1, mean)
1 2 3 4 5 6 7
0.03872459 -0.18346066 -0.04029930 -0.55565866 0.51274312 -1.00488549 -0.15646995
8 9 10
-0.12874699 -0.84689038 -0.11837196
This is ok. Now, the error.
xdf <- as.data.frame(x)
aaply(xdf, 1, mean)
Error in 1:n : result would be too long a vector
In addition: Warning message:
In id(rev(labels)) : NAs introduced by coercion
Here is my sessionInfo()
sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.2.1
e.g.
mF <- read.table(textConnection("
bearID YEAR Season SEX line54
5 1900 8 3 0 16.3923519
11 2270 5 1 0 233.7414014
12 2271 5 1 0 290.8207652
13 2271 5 2 0 244.7820844
15 2291 5 1 0 0.0000000
16 2291 5 2 0 14.5037795
17 2291 6 1 0 0.0000000
18 2293 5 2 0 144.7440752
19 2293 5 3 0 0.0000000
20 2293 6 1 0 16.0592270
21 2293 6 2 0 30.1383426
28 2298 5 1 0 0.9741067
29 2298 5 2 0 9.6641018
30 2298 6 2 0 8.6533828
31 2309 5 2 0 85.9781303
32 2325 6 1 0 110.8892153
35 2331 6 1 0 26.7335562
44 2390 7 2 0 7.1690620
45 2390 8 2 0 44.1109897
46 2390 8 3 0 503.9074898
47 2390 9 2 0 8.4393660
54 2416 7 3 0 48.6910907
58 2418 8 2 0 5.7951139"), header = TRUE)
# Test data...add multiple numeric variables
test <- data.frame(mF[, -5], x1 = rnorm(23), x2 = rnorm(23),
x3 = rnorm(23))
# Want min, mean and max of each numeric column vector
# Use numcolwise composed with each:
library(plyr)
summ <- ddply(test, .(Season), numcolwise(each(min, mean, max)))
# Add vector of names to distinguish stats in each group
summ$stat <- rep(c('Min', 'Mean', 'Max'), 3)
summ <- summ[, c(1, 5, 2:4)] # column rearrangement
summ
# Season stat x1 x2 x3
#1 1 Min -1.1084957 -0.89586438 -2.07369239
#2 1 Mean -0.2590727 -0.00485722 -0.05164301
#3 1 Max 1.6924681 0.65256782 1.61998433
#4 2 Min -1.6862610 -0.94842919 -0.43556155
#5 2 Mean 0.4655741 0.37197098 0.31221804
#6 2 Max 2.0972202 2.63259653 1.19390094
#7 3 Min -1.0935199 -0.68866127 -0.04847558
#8 3 Mean -0.2049273 0.49534667 0.15200570
#9 3 Max 0.3900610 2.42638021 0.43551022
Using plyr 1.5.2:
aa <- data.frame(aa=1:3, bb=4:6)
bb <- data.frame(aa=1:10, cc=20:29)
join(aa,bb)
join(subset(aa,subset=aa==0), bb)
For the last line, I get:
Joining by: aa
aa bb cc
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
Warning message:
In data.frame(..., check.names = FALSE) :
row names were found from a short variable and have been discarded
I was expecting to get an empty dataframe back.
Thanks!
I would expect
x <- data.frame(a=1)
d_ply(x, character(), print)
to be equivalent to
print(x)
but, I get this error instead (with version 1.2.1):
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(splitv), as.integer(splitv), attr(splitv, "n"))
3: splitter_d(.data, .variables)
4: d_ply(x, character(), print)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
On the other hand,
d_ply(data.frame(a=c(1,2)), character(), print)
does not error, but loops infinitely.
compare
arrange(mtcars, cyl, desc(disp))
mpg cyl disp hp drat wt qsec vs am gear carb
1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
2 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
mtcars[with(mtcars, order(cyl, disp)), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Compare splitter_a to split:
df <- data.frame(x = sample(100, 1e3, rep = T))
system.time(x <- as.list(splitter_a(df, 1)))
system.time(y <- split(df, 1:nrow(df)))
all.equal(x, y)
Should be able to speed up by specialising for common cases
"[[.indexed_array" <- function(x, i) {
if (ncol(x$index) == 2) {
x1 <- x$index[[1]]
x2 <- x$index[[2]]
if (x$subs[1] == "[") {
return(x$env$data[x1[i], x1[2], drop = x$drop])
}
if (x$subs[2] == "[[") {
return(x$env$data[[x1[i], x1[2], drop = x$drop]])
}
}
indices <- x$index[i, ,drop=TRUE]
call <- paste("x$env$data", x$subs[1], paste(indices, collapse = ","), ",
drop = ", x$drop, x$subs[2], sep = "")
eval(parse(text = call))
}