hadley / plyr

A R package for splitting, applying and combining large problems into simpler problems

License: Other

R 98.69% C++ 1.04% C 0.27%

plyr's Issues

ldply fails to coerce to data frame

mods <- dlply(mtcars, "cyl", lm, formula = mpg ~ wt)
ldply(mods, function(x) coef(summary(x)))
ldply(mods, function(x) as.data.frame(coef(summary(x))))

Should also capture rownames?

join() causes R to seg. fault

The following small example (extracted from the actual data) misbehaves for me.

m1<-data.frame(cl=c(1,2), file=c("hi", "low"))
m2<-data.frame(file=c("1776.txt", "About.txt"), actual=c(11.5, 4.5), stringsAsFactors=F)
join(m1, m2, "file")

*** caught segfault ***
address 0x4000000d, cause 'memory not mapped'

Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(keys$y), keys$y, keys$n)
3: join_ids(x, y, by, all = TRUE)
4: join_all(x, y, by, type)
5: join(m1, m2, "file")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: Selection: 3
Warning message:
In [<-.factor(*tmp*, rng, value = c(1L, 2L, NA, NA)) :
invalid factor level, NAs generated

sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyr_1.6

Better aggregation

What are the consequences of http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations for making faster aggregation functions?

Rename should (optionally) warn about mismatching keys

From Tim Bates:

a<-data.frame(x=1:5,y=1:5)
#   x y
#1 1 1
#2 2 2

rename(a, c(x= "y"))
# might be nice to warn the user they just duplicated a column name (if they ask for warnings)

rename(a, c(foo= "y"))
# might be nice to warn the user that nothing happened
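
One way the optional check could look, as a rough sketch in base R. The helper name check_rename_keys is made up for illustration and is not part of plyr; it only reports the two situations above (keys that match no column, and new names that collide with existing ones).

```r
# Hypothetical helper, not part of plyr: reports problems with a rename spec.
check_rename_keys <- function(x, replace) {
  # Keys in `replace` that match no existing name: renaming would do nothing
  missing_keys <- setdiff(names(replace), names(x))
  if (length(missing_keys) > 0) {
    warning("No such names: ", paste(missing_keys, collapse = ", "),
            call. = FALSE)
  }

  # Work out what the names would be after renaming
  new_names <- names(x)
  idx <- match(names(replace), new_names)
  found <- !is.na(idx)
  new_names[idx[found]] <- unname(replace)[found]

  # Names that would occur more than once after renaming
  dup_names <- unique(new_names[duplicated(new_names)])
  if (length(dup_names) > 0) {
    warning("Duplicated names after rename: ",
            paste(dup_names, collapse = ", "), call. = FALSE)
  }

  invisible(list(missing = missing_keys, duplicated = dup_names))
}

a <- data.frame(x = 1:5, y = 1:5)
dup_check  <- suppressWarnings(check_rename_keys(a, c(x = "y")))
miss_check <- suppressWarnings(check_rename_keys(a, c(foo = "y")))
```

Returning the findings invisibly as well as warning keeps the check usable from code that wants to decide for itself whether to stop.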

make rbind.fill work with lists

This is not a bug but a feature request.

It would be nice (for me, at least) to make rbind.fill work with lists, not only with data.frames, as in the following example:

l<- list( list(x=c(1,2), y=c(1,3)), list(x=4, z=4))
rbind.fill(l)
  x  y  z
1 1  1 NA
2 2  3 NA
3 4 NA  4

In theory, the behavior of rbind.fill(l), when l is a list of lists, should be equivalent to rbind.fill(lapply(l, as.data.frame)), but the latter is much slower (I am working on a program which needs to apply rbind.fill to lists of about 10000 elements). Even using quickdf instead of as.data.frame, the overhead is not negligible.

I actually made a quick and dirty hack to make rbind.fill work in my case, which is to replace the function empty with

empty <- function(l) {
is.null(l) || length(l)==0 || (is.data.frame(l) && .row_names_info(l) == 0)
}

and the line

rows <- unlist(lapply(dfs, .row_names_info, 2L))

with

rows <- unlist(lapply(dfs, function(df) if (is.data.frame(df)) .row_names_info(df,2L) else length(df[[1]])))

It seems to work fine. If you are interested in applying these changes, I can debug them more deeply and produce a patch.

rename maps wrong new names

Summary

The newly promoted rename function in plyr 1.5 does not map the new names correctly. The requested columns are renamed, but not to the correct new names.

Reproducible code

This code example is drawn from coord-cartesian-flipped.r in ggplot2 which exposed the bug.

data <- structure(list(ymax = 0.819148936170213, ymin = 0.525145067698259, 
  x = 0.17741935483871, xmin = 0.0564516129032258, xmax = 0.298387096774194, 
  colour = "grey20", size = 0.5, linetype = 1, group = 1, alpha = 1, 
  fill = "#FFFFFFFF"), .Names = c("ymax", "ymin", "x", "xmin", 
  "xmax", "colour", "size", "linetype", "group", "alpha", "fill"
  ), row.names = c(NA, -1L), class = "data.frame")
plyr::rename(data, c(
  x = "y",       y = "x", 
  xend = "yend", yend = "xend", 
  xmin = "ymin", ymin = "xmin",
  xmax = "ymax", ymax = "xmax")
)

Actual results

          y         x      yend       xend      ymin colour size linetype group
1 0.8191489 0.5251451 0.1774194 0.05645161 0.2983871 grey20  0.5        1     1
  alpha      fill
1     1 #FFFFFFFF

Expected results

       xmax      xmin         y       ymin      ymax colour size linetype group
1 0.8191489 0.5251451 0.1774194 0.05645161 0.2983871 grey20  0.5        1     1
  alpha      fill
1     1 #FFFFFFFF

Note that this is the result given by reshape::rename with the same arguments.

Source of problem

The new names that are given are pulled from the new names in the order they are listed, not in the order that they match the old names.

To fix the problem, in plyr::rename the line

setNames(x, ifelse(is.na(name_match), old_names, new_names))

should be

setNames(x, ifelse(is.na(name_match), old_names, new_names[name_match]))
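
To see why indexing by name_match matters, here is a minimal base R sketch of the corrected logic. The function name rename_sketch and the way name_match is computed are assumptions for illustration; only the final setNames() line comes from the fix quoted above.

```r
# Hypothetical stand-in for the corrected rename, written with base R only.
rename_sketch <- function(x, replace) {
  old_names <- names(x)
  new_names <- unname(replace)
  # For each existing name, its position among the keys of `replace`
  # (NA when the name is not being renamed). Assumed definition.
  name_match <- match(old_names, names(replace))
  # Index new_names by name_match so each column gets *its own* new name
  stats::setNames(x, ifelse(is.na(name_match), old_names,
                            new_names[name_match]))
}

df <- data.frame(ymax = 1, ymin = 2, x = 3, xmin = 4, xmax = 5)
out <- rename_sketch(df, c(
  x = "y",       y = "x",
  xmin = "ymin", ymin = "xmin",
  xmax = "ymax", ymax = "xmax"))
names(out)
```

Without the [name_match] index, ifelse() walks new_names in the order they were listed, which is how the first column ends up being called "y" in the actual results.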

Workarounds

None known.

Caveats

Inside the ggplot2 code, rename uses plyr::rename, while at the command prompt, rename uses reshape::rename. This is why the explicit package notation is needed in the function calls.

Session info

R version 2.13.0 RC (2011-04-10 r55401)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] ggplot2_0.8.9 proto_0.3-9.1 reshape_0.8.4 plyr_1.5

loaded via a namespace (and not attached):
[1] tools_2.13.0

d*ply returns error on 0-length dataframe

qq <- data.frame(foo=character(), bar=character())
qq
[1] foo bar
<0 rows> (or 0-length row.names)
d_ply(qq, .(foo))
Error in tapply(1:nrow(data), splitv, list) :
arguments must have same length

Expected behavior: d*ply should return a 0-length value of the appropriate type (df, list, array), and d_ply should return nothing, but there should not be a runtime error.

I'm not sure if this is fixed in the upcoming 1.0, but I don't see anything in the NEWS that would suggest it is. Thanks!

Parallelise m*ply

It can be useful

ddply allows NULL for some splits, but not all

Normally, ddply will simply omit from the resulting data frame any split portions that return NULL:

> my_df <- data.frame(a = 1:10, b = 11:20)
> ddply(my_df, .(a), function (x) {
+   if (x$a < 5) data.frame(foo = 'bar') else NULL
+ })
  a foo
1 1 bar
2 2 bar
3 3 bar
4 4 bar

However, if all of the splits are NULL:

> ddply(my_df, .(a), function (x) {
+   if (x$a < 0) data.frame(foo = 'bar') else NULL
+ })
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 10, 0

I would expect this to either return an empty data frame or NULL.

Edit: I dug a bit further and found that this is just a symptom of a problem with ldply, where it will tolerate all of the .fun results being NULL only if the items of the input list are unlabeled:

> ldply(list('foo'), function (x) NULL)
data frame with 0 columns and 0 rows
> ldply(list(a='foo'), function (x) NULL)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 1, 0

This, in turn, appears to be an issue with the internal list_to_dataframe function:

> list_to_dataframe(list(1,2))
  V1
1  1
2  2
> list_to_dataframe(list(1,2), data.frame(foo=101:102))
  foo V1
1 101  1
2 102  2
> list_to_dataframe(list(NULL,NULL))
data frame with 0 columns and 0 rows
> list_to_dataframe(list(NULL,NULL), data.frame(foo=101:102))
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 2, 0

I believe all that would need to be done to fix this is to adjust the first check in list_to_dataframe to read as follows, but I'm still something of an R newbie so I'm probably missing something:

if (length(res) == 0 || all(is.null(unlist(res)))) 
  return(data.frame())
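
A small aside on the proposed check, as a sketch only: is.null(unlist(res)) is already a single TRUE/FALSE, so the all() around it does nothing, and unlist() also returns NULL for a list of empty lists. Testing each element separately is stricter. The helper name all_null is an assumption for illustration, not something in plyr.

```r
# Hypothetical helper: TRUE when `res` is empty or every element is NULL.
all_null <- function(res) {
  length(res) == 0 || all(vapply(res, is.null, logical(1)))
}

all_null(list(NULL, NULL))      # every piece returned NULL
all_null(list(1, NULL))         # at least one real result
all_null(list(list(), list()))  # empty lists are not NULL

# The unlist()-based test cannot tell the last case apart from all-NULL
is.null(unlist(list(list(), list())))
```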

ddply doesn't preserve ordered factors

Using plyr 1.5.2, running:

dat <- data.frame(x = factor(LETTERS[1:3],ordered=TRUE),y=rep(1,3))

#TRUE
is.ordered(dat$x)

#This calls ddply but does nothing
datCopy <- ddply(dat,.(x),.fun = function(x){x})

#FALSE
is.ordered(datCopy$x)

Data frame returned by ddply should preserve ordered factors...thanks!

rbind.fill does not preserve attributes

d1 = data.frame(a=runif(10), b=runif(10))
d2 = data.frame(a=runif(10), b=runif(10))

attr(d1$b, "foo") = "one"
attr(d2$b, "foo") = "two"

attributes(rbind(d1, d2)$b)

attributes(rbind.fill(d1, d2)$b)

rbind returns "foo", rbind.fill returns NULL. This is particularly a problem for objects of class "circular" which store a number of properties (rotation direction, units, etc.) as attributes.

Add rowname column to colwise

Currently, sapply(df,f) includes rownames in the result, but colwise(f)(df) does not.

Suggest adding a (possibly optional) rowname column in colwise.

Missing values in joining column when using type="right"

d1 = data.frame(a=1:3, b=1:3)
d2 = data.frame(a=1:4, c=1:4)

library(plyr)
join(d1, d2, type="right")
join(d2, d1, type="left")

The two joins should be equivalent except for the order of the columns. But they are not: the right join produces an NA instead of the number 4 in column "a".
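
For comparison, the expected equivalence can be cross-checked with base R's merge(), which needs no plyr. This is only a reference point for what the two joins should contain, not a statement about how join() is implemented.

```r
d1 <- data.frame(a = 1:3, b = 1:3)
d2 <- data.frame(a = 1:4, c = 1:4)

# Right join of d1 onto d2, and the mirrored left join
right <- merge(d1, d2, by = "a", all.y = TRUE)
left  <- merge(d2, d1, by = "a", all.x = TRUE)

right  # key column "a" keeps the value 4; "b" is NA for that row
left   # same rows, columns in a different order
```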

rbind.fill should work on vectors with names (or give an error)

rbind.fill returns NULL for non-data-frame arguments. It should either give an error or return a useful result.

Base rbind works for vectors and lists:

rbind(c(a=1),c(b=2)) => matrix(1:2,2,1,dimnames=list(NULL,"a")) == as.matrix(data.frame(a=1:2))

but

 rbind.fill(c(a=1),c(b=2)) => NULL

Shouldn't it give something like

 matrix(c(1,NA,NA,2),2,2,dimnames=list(NULL,c("a","b")))

or
data.frame(a=c(1,NA),b=c(NA,2))

---------------test---------------
# Column order not defined by rbind.fill spec
sort_by_name <- function(a) if (length(names(a))==0) stop("sort_by_name: arg has no names") else a[order(names(a))]

expect_that(  sort_by_name ( as.data.frame( rbind.fill(c(a=1),c(b=2)))), 
       is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )

expect_that(  sort_by_name ( as.data.frame( rbind.fill(c(a=1),c(b=2),c(a=3,b=4)))), 
       is_identical_to( data.frame ( a=c(1, NA,3), b=c(NA,2,4) ) ) )

# mix data frame and vector?
expect_that(  sort_by_name ( as.data.frame( rbind.fill(data.frame(a=1),c(b=2),c(a=3,b=4)))), 
       is_identical_to( data.frame ( a=c(1, NA,3), b=c(NA,2,4) ) ) )

# support lists?
expect_that(  sort_by_name ( as.data.frame( rbind.fill(list(a=1),list(b=2)))), 
       is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )

# mixed lists and vectors?
expect_that(  sort_by_name ( as.data.frame( rbind.fill(list(a=1),c(b=2)))), 
       is_identical_to( data.frame ( a=c(1, NA), b=c(NA,2) ) ) )
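
As a rough idea of the requested behaviour, here is a base R sketch that fills named vectors (or lists of scalars) row-wise. The name rbind_fill_named is hypothetical; this is not how rbind.fill works internally, and it only handles pieces whose elements are all named and numeric.

```r
# Hypothetical helper, not plyr's rbind.fill: combines named vectors or
# lists of scalars row-wise, filling absent names with NA.
rbind_fill_named <- function(...) {
  pieces <- lapply(list(...), unlist)
  # Union of all names, in order of first appearance
  cols <- unique(unlist(lapply(pieces, names)))
  rows <- lapply(pieces, function(p) {
    row <- p[cols]        # names missing from this piece come back as NA
    names(row) <- cols
    row
  })
  as.data.frame(do.call(rbind, rows))
}

res <- rbind_fill_named(c(a = 1), c(b = 2), list(a = 3, b = 4))
res
```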

Summarise should work sequentially

summarize(data.frame(a=(1:5)^2),q=median(a),r=mean(a)-q)

As reported by Stavros Macrakis
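
A minimal sketch of what "sequentially" could mean here, assuming each expression is evaluated in turn and its result is made visible to the ones after it. summarise_seq is a made-up name for illustration and is not plyr's summarise.

```r
# Hypothetical sequential summarise: later expressions can use earlier results.
summarise_seq <- function(.data, ...) {
  # Capture the named expressions without evaluating them
  exprs <- eval(substitute(alist(...)))
  # Columns of .data become variables; the caller's frame is the fallback
  env <- list2env(as.list(.data), parent = parent.frame())
  out <- list()
  for (nm in names(exprs)) {
    value <- eval(exprs[[nm]], env)
    assign(nm, value, envir = env)  # expose the result to later expressions
    out[[nm]] <- value
  }
  as.data.frame(out)
}

res <- summarise_seq(data.frame(a = (1:5)^2), q = median(a), r = mean(a) - q)
res
```

Here a is 1, 4, 9, 16, 25, so q is 9 and r is 11 - 9 = 2.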

join vs merge

Hi,

the join help made me believe join is like merge, just faster.
join indeed is faster, but it's not exactly like merge:

merge

> merge(data.frame(x=1:2), data.frame(x=c(1,1), y=1:2), all=T)
  x  y
1 1  1
2 1  2
3 2 NA

join

> join(data.frame(x=1:2), data.frame(x=c(1,1), y=1:2), type="full")
Joining by: x
  x  y
1 1  1
2 2 NA

session info

R version 2.11.1 Patched (2010-06-17 r52313)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      tcltk     stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_0.8.8         reshape_0.8.3         plyr_1.2.1
 [4] sqldf_0.3-5           chron_2.3-38          gsubfn_0.5-5
 [7] proto_0.3-8           RSQLite.extfuns_0.0.1 RSQLite_0.9-4
[10] DBI_0.2-5

loaded via a namespace (and not attached):
[1] tools_2.11.1

this looks like a bug either in the docs or in the code.

join() crash

Thank you for the tremendously good work on this essential package.

My current script that causes the crash is too bulky for upload. I am working on an example script that will cause the same crash.

join() crashes my R session with:

*** caught segfault ***
address 0x0, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

within RStudio, this causes the whole app to crash.

Thank you and have an excellent day,

Etienne

Improve `match_df` documentation

The 'match' criterion is not specified. It is apparently '==' and not 'identical'.
It doesn't work for some data types, e.g.
al <- data.frame(b=I(as.list(1:3)))
match_df(al,al) => ERROR
...same for as.raw(...)
If the result is one column, it returns the column, not the data frame.
Describing the function as a 'match' implies that the semantics will be similar to the match function, but they aren't: the match function returns NAs in case of no match.

No progress if parallel = TRUE

Since it doesn't work.

Only export necessary functions

Because plyr is a foundational package, it needs to export the absolute minimum of functions to avoid conflicts with other packages.

rbind.fill does not preserve circular class

Although it does work for POSIX classes for example

A <- data.frame(time=Sys.time()+1:5, x=1:5)
B <- data.frame(time=Sys.time()+1:5, y=6:10)
llply(A, class)
llply(B, class)
llply(rbind(A,rename(B, c(y="x"))), class)
llply(rbind.fill(A,B), class)
# does honor POSIXct class

A <- data.frame(angle=circular(seq(0,pi,length=5)), x=1:5)
B <- data.frame(angle=circular(seq(0,-pi,length=5)), y=6:10)
llply(A, class)
llply(B, class)
llply(rbind(A,rename(B, c(y="x"))), class)
llply(rbind.fill(A,B), class)
# does not respect circular class

Fix scoping issues

library(plyr)

# Generate some data
set.seed(321)
myD <- data.frame( 
  Place = sample(c("AWQ","DFR", "WEQ"), 10, replace=T),
  Light = sample(LETTERS[1:2], 15, replace=T),
  value=rnorm(30) 
)

myfunc <- function(xdf, sctype) {
  force(sctype)
  ddply(xdf, .(Place, Light), transform, 
    rng = paste(value, sctype))
}

myfunc(myD, "range")

Add equivalent of merge_all

Maybe join_all?

rbind.fill does not handle 1d arrays as columns of data frames

Summary

When given a data.frame where one of the columns is a 1d array, rbind.fill will fail with the error:

Error in class(output[[var]]) <- class(value) : 
  cannot set class to "array" unless the dimension attribute has length > 0

This error is the (or at least a) root of the error in ggplot2 causing margins=TRUE in facet_grid to fail.

Reproducible code:

DF <- data.frame(x=1)
DF$x <- array(1, 1)
# note that DF <- data.frame(x=array(1,1)) converts the x column to a simple vector
rbind.fill(DF,DF)

Expected outcome

  x
1 1
2 1

Actual outcome

Error in class(output[[var]]) <- class(value) : 
  cannot set class to "array" unless the dimension attribute has length > 0

Source of problem

The problem is in output_template in rbind.r. The error is thrown at line 87 when class(value) is "array". value does not match any of the previous is.* checks (though is.array(value) is TRUE) and output[[var]] is a vector of NAs, which cannot be set to class array.

Fix

Possibly an additional else if block for is.array, though I don't know what form it should take.

Workaround

None that I know of, other than making sure that 1d arrays are converted to vectors before rbind.fill is called.
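
The conversion mentioned as a workaround could be sketched like this. drop_1d_arrays is a hypothetical helper name, and base rbind() stands in for rbind.fill so the example runs without plyr; with plyr loaded, the cleaned data frames would be passed to rbind.fill instead.

```r
# Hypothetical helper: turn 1d array columns into plain vectors.
drop_1d_arrays <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.array(col) && length(dim(col)) == 1L) {
      df[[nm]] <- as.vector(col)  # drops the dim attribute (and any names)
    }
  }
  df
}

DF <- data.frame(x = 1)
DF$x <- array(1, 1)

fixed <- drop_1d_arrays(DF)
combined <- rbind(fixed, fixed)  # stand-in for rbind.fill(fixed, fixed)
combined
```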

colwise functions should remove split vars

So the following works readily

df <- data.frame(trt = rep(c("a", "b"), 10), a = 1:20, b = runif(20))
ddply(df, c("trt"), colwise(mean))

Expose more foreach settings

e.g. (from Paul Hiemstra)

bla = function(x) {
   x*y
}
y = 10
ldply(dat, .(category), bla, .parallel = TRUE, .export =  "y")

extra ... arguments to create_progress_bar()

Could you modify create_progress_bar to accept additional arguments?


create_progress_bar<-function(name = "none",...) {
  if (!is.character(name)) return(name)
  name <- paste("progress", name, sep="_")
  
  if (!exists(name, mode = "function")) {
    warning("Cannot find progress bar ", name, call. = FALSE)
    progress_none()
  } else {
    match.fun(name)(...)
  }
}

That way titles and other details can be passed on when using create_progress_bar directly

Order of columns in join()

We agreed that the best order would be: common columns, x-only columns, y-only columns. Currently the order is determined by the direction of the join, not by the order of the arguments.

With

> d1 = data.frame(a=1:3, b=1:3)
> d2 = data.frame(a=1:4, c=1:4)
> d1
  a b
1 1 1
2 2 2
3 3 3
> d2
  a c
1 1 1
2 2 2
3 3 3
4 4 4

Current

> join(d1, d2, type="right")
Joining by: a
  a c  b
1 1 1  1
2 2 2  2
3 3 3  3
4 4 4 NA

Desired

> join(d1, d2, type="right")
Joining by: a
  a  b  c
1 1  1  1
2 2  2  2
3 3  3  3
4 4 NA  4

(wrong issue, ignore)

split_indices index

Is redundant - replace with incrementing counter in C code. Could probably also turn counts from an R vector to a C array for an additional small speed boost.

.parallel=T uses only one cpu in ldply() and ddply()

I noticed this first while using ddply(), but the code below demonstrates the issue with ldply(). I'm not sure if it extends to all *dply() functions or not, nor do I know if this extends beyond my mac setup.

> sessionInfo()
R version 2.11.1 (2010-05-31) 
x86_64-apple-darwin9.8.0 

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
foreach: simple, scalable parallel programming from REvolution Computing
Use REvolution R for scalability, fault tolerance and more.
http://www.revolution-computing.com
Loading required package: multicore
> registerDoMC(2)
> library(plyr)
> x = iris[which(iris[, 5] != "setosa"), c(1, 5)]
> iterations = 1e4 #at 1e4, each serial run will take about a minute
> f = function(...){
+ ind = sample(100,100,replace=T)
+ result1 = glm(
+ x[ind,2] ~ x[ind,1]
+ , family = binomial(logit)
+ )
+ return(coefficients(result1))
+ }
> #check to see if foreach works, first run the serial variant
> system.time(
+ foreach(icount(iterations), .combine = cbind) %do% f()
+ )
   user  system elapsed 
 75.629   1.670  77.535 
> #now run the parallel %dopar% variant
> system.time(
+ foreach(icount(iterations), .combine = cbind) %dopar% f()
+ )
   user  system elapsed 
 44.533   1.304  50.126 
> #%dopar% is faster, so we *can* do parallel computing...
> #now run a serial plyr job
> system.time(
+ ldply(
+ .data = 1:iterations
+ , .fun = f
+ )
+ )
   user  system elapsed 
 71.201   1.450  72.995 
> #is the parallel version faster?...
> system.time(
+ ldply(
+ .data = 1:iterations
+ , .fun = f
+ , .parallel = T
+ )
+ )
   user  system elapsed 
 71.214   1.407  72.994 
> #no! (indeed, my cpu monitor shows only 1cpu used)
> #now run a serial plyr job using llply
> system.time(
+ llply(
+ .data = 1:iterations
+ , .fun = f
+ )
+ )
   user  system elapsed 
 70.815   1.411  72.458 
> #is the parallel version faster?...
> system.time(
+ llply(
+ .data = 1:iterations
+ , .fun = f
+ , .parallel = T
+ )
+ )
   user  system elapsed 
 87.049   2.611  53.604 
> #yes! (indeed, my cpu monitor shows 2cpus used)
> #Strange!

Slow example

library(plyr)

n<-100000
grp1<-sample(1:750, n, replace=T)
grp2<-sample(1:750, n, replace=T)
d<-data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)

ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y))

Repeated daply slowdown

I have run into an issue where the first call to daply() is consistently faster than later calls. I verified that the issue exists under R 2.12.1 and plyr 1.4 on x86_64-pc-linux-gnu, i386-apple-darwin9.8.0, and x86_64-apple-darwin9.8.0 using the following code:
library(plyr)
dfData = data.frame(a=1:1e5, b=1)
system.time(daply(dfData, 'a', sum))
system.time(daply(dfData, 'a', sum))
system.time(daply(dfData, 'a', sum))

From a fresh R instance, the first call executes substantially faster than the later two (40s vs 65s on one system, 95s vs. 130s on another, 100s vs. 140s on a third). The problem persists when dfData is enlarged so that there are multiple b values per distinct a value.

d*ply and m*ply need option to control expansion of splits

i.e. should each row be treated as a single output dimension, or should each splitting variable become its own variable?

Suggested argument: .expand

Parallelise with foreach

Suggestion from Shawn Conway:

# required packages
library(plyr)
library(foreach)

# these two are optional (could instead be something like multicore, etc.)
library(snow)
library(doSNOW)

# parallel version of llply
llply <- function(.data, .fun = NULL, ..., .progress = "none", .inform = FALSE) {
  if (inherits(.data, "split")) {
    pieces <- .data
  } else {
    pieces <- as.list(.data)
  }
  if (is.null(.fun)) return(as.list(pieces))
  n <- length(pieces)
  if (n == 0) return(list())

  if (is.character(.fun)) .fun <- each(.fun)
  # .fun <- each(.fun)
  if (!is.function(.fun)) stop(".fun is not a function.")

  progress <- create_progress_bar(.progress)
  progress$init(n)
  on.exit(progress$term())

  result <- vector("list", n)
  do.ply <- function(i) {
    piece <- pieces[[i]]

    # Display informative error messages, if desired
    if (.inform) {
      res <- try(.fun(piece, ...))
      if (inherits(res, "try-error")) {
        piece <- paste(capture.output(print(piece)), collapse = "\n")
        stop("with piece ", i, ": \n", piece, call. = FALSE)
      }      
    } else {
      res <- .fun(piece, ...)
    }
    progress$step()
    if (!is.null(res)) return(res) 
  }
  result <- foreach(i=seq_len(n)) %dopar% do.ply(i)

  attributes(result)[c("split_type", "split_labels")] <- attributes(pieces)[c("split_type", "split_labels")]
  names(result) <- names(pieces)

  # Only set dimension if not null, otherwise names are removed
  if (!is.null(dim(pieces))) {
    dim(result) <- dim(pieces)    
  }

  result
}

# run a quick test of different versions
cl <- makeCluster(2)
registerDoSNOW(cl)
system.time(x <- llply(baseball, summary))
system.time(y <- plyr::llply(baseball, summary))
stopCluster(cl)
identical(x, y)

aaply on a data.frame is not consistent with apply

library("plyr")

d <- data.frame(x=runif(5), y=runif(5))
aaply(d, 1, mean)

I would have expected the result to be

aaply(as.matrix(d), 1, mean)

just as

apply(d, 1, mean)

i.e. simply treat the data.frame as an array since I use a*ply. Basically it seems to be building some sort of contingency table from the data.frame with additional computation (i.e. the mean), but I don't understand from the documentation why it would do this.

summarise should capture variable names

e.g.
summarise(mtcars, cyl, vs)

Request for *_ply to take parallel?

Sorry if this is the wrong place, but I wondered why d_ply doesn't have a parallel argument. I'm splitting up a data frame and drawing graphs with it, which seems a good use for parallelization. A quick search of the list shows Hadley saying back in Feb that it isn't supported. Is this something that might be supported?

Thanks,
James

Arrays not combined in correct order

x <- array(1:(30*5*5),dim=c(30,5,5))
y <- aaply(x, c(1,2), mean)

all.equal(rowMeans(y), aaply(y, 1, mean))
all.equal(colMeans(y), aaply(y, 2, mean))
all.equal(apply(y, 1, mean), aaply(y, 1, mean))
all.equal(apply(y, 2, mean), aaply(y, 2, mean))
all.equal(apply(y, 3, mean), aaply(y, 3, mean))

New version of transform

That executes statements iteratively, not all at once.

No identifying column when the function returns NULL in ddply

d = data.frame(a=rep(1:3, each=5), x=runif(15))

ddply(d, ~a, function(x) {
    mean(x$x)
})

ddply(d, ~a, function(x) {
    if (any(x$a != 2)) {
        mean(x$x)
    }
})

There is a column "a" in the first case but not in the second one. There used to be one (based on some of my previous code, which worked 7 months ago, though that's not very precise).

This

dlply(d, ~a, function(x) {
    if (any(x$a != 2)) {
        mean(x$x)
    }
})

works so I guess the bug is in simplify-data-frame.R

Change in the order of sorting of the resulting data.frame after ddply

Just a small annoyance. The latest stable version of plyr sorts in the order in which the arguments are given, e.g.

head(ddply(diamonds, ~cut+clarity, nrow), 10)

    cut clarity   V1
1  Fair      I1  210
2  Fair     SI2  466
3  Fair     SI1  408
4  Fair     VS2  261
5  Fair     VS1  170
6  Fair    VVS2   69
7  Fair    VVS1   17
8  Fair      IF    9
9  Good      I1   96
10 Good     SI2 1081

the devel version sorts in the reverse order, e.g.

head(ddply(diamonds, ~cut+clarity, nrow), 10)

         cut clarity   V1
1       Fair      I1  210
2       Good      I1   96
3  Very Good      I1   84
4    Premium      I1  205
5      Ideal      I1  146
6       Fair     SI2  466
7       Good     SI2 1081
8  Very Good     SI2 2100
9    Premium     SI2 2949
10     Ideal     SI2 2598

For the sake of backward consistency and also because it looks, to me, to be more natural in the original order, could this be changed?

summarise action fails (also when inside ddply)

a<-data.frame(x=1:11,y=c(1:5,NA,1:5))

var(a[,"y"],use="com") # works
ddply(a, "x", summarise, alsoworks = var(y,na.rm=T))

but

ddply(a, "x", summarise, fails = var(y,use="com"))
summarise(a[,"y"], alsofails = var(y,use="com"))

d*ply should label split variables

With some kind of attribute so that they can easily be filtered off

error in aaply on a data.frame

I got a weird error when using aaply on a data frame. Applying the same function on the data.frame transformed in matrix format works just fine

library(plyr)
x <- matrix(rnorm(100), ncol=10)
aaply(x, 1, mean)
          1           2           3           4           5           6           7 
 0.03872459 -0.18346066 -0.04029930 -0.55565866  0.51274312 -1.00488549 -0.15646995 
          8           9          10 
-0.12874699 -0.84689038 -0.11837196

This is ok. Now, the error.

xdf <- as.data.frame(x)
aaply(xdf, 1, mean)
Error in 1:n : result would be too long a vector
In addition: Warning message:
In id(rev(labels)) : NAs introduced by coercion

Here is my sessionInfo()
sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.2.1

colwise and each should work better together

e.g.

mF <- read.table(textConnection("
  bearID YEAR Season SEX      line54
5    1900    8      3   0  16.3923519
11   2270    5      1   0 233.7414014
12   2271    5      1   0 290.8207652
13   2271    5      2   0 244.7820844
15   2291    5      1   0   0.0000000
16   2291    5      2   0  14.5037795
17   2291    6      1   0   0.0000000
18   2293    5      2   0 144.7440752
19   2293    5      3   0   0.0000000
20   2293    6      1   0  16.0592270
21   2293    6      2   0  30.1383426
28   2298    5      1   0   0.9741067
29   2298    5      2   0   9.6641018
30   2298    6      2   0   8.6533828
31   2309    5      2   0  85.9781303
32   2325    6      1   0 110.8892153
35   2331    6      1   0  26.7335562
44   2390    7      2   0   7.1690620
45   2390    8      2   0  44.1109897
46   2390    8      3   0 503.9074898
47   2390    9      2   0   8.4393660
54   2416    7      3   0  48.6910907
58   2418    8      2   0   5.7951139"), header = TRUE)

# Test data...add multiple numeric variables
test <- data.frame(mF[, -5], x1 = rnorm(23), x2 = rnorm(23), 
                   x3 = rnorm(23))

# Want min, mean and max of each numeric column vector
# Use numcolwise composed with each:

library(plyr)
summ <- ddply(test, .(Season), numcolwise(each(min, mean, max)))
# Add vector of names to distinguish stats in each group
summ$stat <- rep(c('Min', 'Mean', 'Max'), 3)   
summ <- summ[, c(1, 5, 2:4)]   # column rearrangement
summ
#   Season stat         x1          x2          x3
#1      1  Min -1.1084957 -0.89586438 -2.07369239
#2      1 Mean -0.2590727 -0.00485722 -0.05164301
#3      1  Max  1.6924681  0.65256782  1.61998433
#4      2  Min -1.6862610 -0.94842919 -0.43556155
#5      2 Mean  0.4655741  0.37197098  0.31221804
#6      2  Max  2.0972202  2.63259653  1.19390094
#7      3  Min -1.0935199 -0.68866127 -0.04847558
#8      3 Mean -0.2049273  0.49534667  0.15200570
#9      3  Max  0.3900610  2.42638021  0.43551022

left-joining to empty data.frame yields NAs instead of empty data.frame

Using plyr 1.5.2:

aa <- data.frame(aa=1:3, bb=4:6)
bb <- data.frame(aa=1:10, cc=20:29)
join(aa,bb)
join(subset(aa,subset=aa==0), bb)

For the last line, I get:

Joining by: aa
   aa bb cc
1  NA NA NA
2  NA NA NA
3  NA NA NA
4  NA NA NA
5  NA NA NA
6  NA NA NA
7  NA NA NA
8  NA NA NA
9  NA NA NA
10 NA NA NA
Warning message:
In data.frame(..., check.names = FALSE) :
  row names were found from a short variable and have been discarded

I was expecting to get an empty dataframe back.

Thanks!

d*ply errors on 0-length .variables arg

I would expect

x <- data.frame(a=1)
d_ply(x, character(), print)

to be equivalent to

print(x)

but, I get this error instead (with version 1.2.1):

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(splitv), as.integer(splitv), attr(splitv,     "n"))
3: splitter_d(.data, .variables)
4: d_ply(x, character(), print)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

On the other hand,

d_ply(data.frame(a=c(1,2)), character(), print)

does not error, but loops infinitely.

plyr::arrange() does not preserve row names

compare

arrange(mtcars, cyl, desc(disp))
   mpg cyl  disp hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
2 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2

mtcars[with(mtcars, order(cyl, disp)), ]
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2

Make `[[.indexed_array` faster

Compare splitter_a to split:

df <- data.frame(x = sample(100, 1e3, rep = T))
system.time(x <- as.list(splitter_a(df, 1)))
system.time(y <- split(df, 1:nrow(df)))
all.equal(x, y)

Should be able to speed up by specialising for common cases

"[[.indexed_array" <- function(x, i) {

  if (ncol(x$index) == 2) {
    x1 <- x$index[[1]]
    x2 <- x$index[[2]]
    if (x$subs[1] == "[") {
      return(x$env$data[x1[i], x2[i], drop = x$drop])
    }
    if (x$subs[1] == "[[") {
      return(x$env$data[[x1[i], x2[i], drop = x$drop]])
    }
  }

  indices <- x$index[i, ,drop=TRUE]
  call <- paste("x$env$data", x$subs[1], paste(indices, collapse = ","), ",
    drop = ", x$drop, x$subs[2], sep = "")
  eval(parse(text = call))
}
