
plyrmr's People

Contributors

piccolbo


plyrmr's Issues

different types of grouping

Right now we have grouping by column or by a function of the columns, gather (everything into a single group), the recursive variety, and an advanced windowing construct that works with regular-interval time series. What is important next? Resampling? Binning? Hashing? What are the use cases? Some of these ideas are also being developed for the dplyr project.
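
For reference, a minimal sketch of the constructs we already have, using calls that appear elsewhere in these issues (the binning line is hypothetical):

library(plyrmr)
by.col = group(input(mtcars), carb)   # grouping by column
as.one = gather(input(mtcars))        # everything gathered into a single group
# hypothetical, if we went the binning way:
# group(input(mtcars), mpg.bin = cut(mpg, 5))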

ordering

In a distributed context, with the data in general partitioned, it's not clear what sorting means exactly. In the case of terasort, it means that the data is partitioned by range and then sorted within each partition. What are the alternatives to ordering, and what use cases do they solve? Hive has both an ORDER BY and a SORT BY: SORT BY sorts within each reducer; ORDER BY has to be followed by a LIMIT clause, which makes it equivalent to top-k, which plyrmr already has.
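
To pin down the terasort semantics, here is an in-memory analogue in plain R (a sketch, not a proposed plyrmr API):

x = data.frame(key = runif(100))
breaks = quantile(x$key, c(.25, .5, .75))             # range boundaries
parts = split(x, findInterval(x$key, breaks))         # partition by range
parts = lapply(parts, function(p) p[order(p$key), ])  # sort within each partition
# concatenating the parts in order now yields a totally ordered data set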

merge spark dev branches

This is a complex merge which requires careful planning.

First the rationale:

  • pros:
    • a single code base to maintain, with maybe 80% shared
    • a single syntax and semantics to document, teach and learn, though there will be backend-specific variations, hopefully just in the form of backend-specific features
    • avoids progressive drift between the two branches (both expected to change quickly)
    • the multiple-backend implementation has proved a good prod toward modularity and a bug detector (because of equivalence tests between backends)
  • cons:
    • a single code base is more complex than either of the separate ones
    • the merge is going to break things that currently work (hopefully this is temporary and will be confined to a dedicated branch)
    • the merge will take a lot of work
    • SparkR is still incomplete, a dev preview, not ready for prime time; the merged package won't be ready for release, so release and development lines will have to be maintained in parallel for an unpredictable amount of time

backend specific features

  • SparkR:
    • cache
    • indexed partitions
    • not available yet: accumulators
  • rmr2:
    • local backend
    • I/O formats (for now)
    • writing files (for now)

Merge issues

  • What do we do with deps unique to each branch? Importing all of them creates a burden on installers who need only one backend: installing rmr2 means installing Hadoop, and installing SparkR means installing Spark. If we go with the Suggests: field, we may end up with an installation that doesn't work. Suggests is also the dplyr approach, but there the local mode sort of comes built in; here it is part of rmr2 and can't be unbundled (see the sketch after this list).
  • Same for namespace issues: what do we do with functions and methods that are backend specific? In this case there is no Suggests-like alternative, but we can probably keep them all without harm.
  • The big.data class and big.data.R, albeit a misnomer at this point, can be used to interface with rmr2. We could rename it in a more specific way for clarity.
  • The pipe class and pipe.R file are the tough ones. Let's break them down function by function, or by groups thereof (for short: SS is Spark-specific, RS is rmr2-specific and BI is backend independent).
    • comp is a little functional-programming helper used for delayed evaluation. It can go to common.R or anywhere.
    • the options env is SS, but it was on the todo list for the dev branch to also have configurable options and shield the user from accessing rmr2 options directly
    • drop.gather changes based on the different representations of keyvals in the two backends
    • the make*fun functions are RS
    • is.data is probably dead code, as.character.pipe for sure
    • set.context should become part of the options subsystem, SS
    • print.pipe is BI, but git diff didn't catch that
    • make.f1 is BI
    • mergeable, vectorized and related functions are BI
    • gapply, group.f and ungroup have been completely rewritten for Spark
    • the keyval-related functions are SS and replace the keyval logic in rmr2. There is a little duplication, but unless we spin off keyvals into a separate package, what else can we do?
    • group is BI
    • run, mrexec and mr.options are RS
    • output can't be implemented in Spark right now
    • the various as.* conversions are mostly backend specific in their implementation
    • the tutorial has been modified to avoid writing to a file in SparkR; this could be factored out, but it would make the tutorial less intuitive. Probably best to keep two versions and kick this down the road until the output feature is in SparkR.
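
On the dependency question above, a minimal sketch of the Suggests-plus-runtime-check route (the plyrmr.backend option name is an assumption, not current API):

backend = getOption("plyrmr.backend", "rmr2")   # assumed single selection point
if (backend == "rmr2" && !requireNamespace("rmr2", quietly = TRUE))
  stop("the rmr2 backend needs rmr2 (and Hadoop) installed")
if (backend == "spark" && !requireNamespace("SparkR", quietly = TRUE))
  stop("the spark backend needs SparkR (and Spark) installed")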

Merge preparations

If we just did a merge of one branch into the other as is, it could be a disaster. It's probably best to have backend-specific code in separate files and to have very few selection points where control depends on the backend. This could be implemented the OO way, with two classes inheriting from pipe, say pipespark and pipemr, that differ in the methods gapply, group.f and ungroup (see the sketch below). Once these preparations are completed in both branches, the merge should be a lot easier. Another possible preparation could be to somewhat normalize the order of functions, as git diff seems to be confounded by simple code reordering.
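
A minimal S3 sketch of that structure, with placeholder bodies (pipemr and pipespark are just the names floated above):

gapply = function(.data, ...) UseMethod("gapply")
gapply.pipemr = function(.data, ...) stop("rmr2-specific implementation goes here")
gapply.pipespark = function(.data, ...) stop("SparkR-specific implementation goes here")
# tag the backend once at construction; everything else dispatches on class
as.pipemr = function(x) structure(x, class = c("pipemr", "pipe"))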

where with an empty result fails when grouped

as.data.frame(where(group(input(mtcars), carb), FALSE))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1, 0
as.data.frame(where(input(mtcars), FALSE))
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)

more flexible powerful summaries

Right now we have counting and quantiles. They cannot be mixed, and they are done column by column. With counting, one may well want to count combinations of columns, specified possibly with a formula-type language.
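
For instance, counting a combination of columns could look like the hypothetical formula call in the comment below; the last line is today's in-memory equivalent:

# count(input(mtcars), ~ cyl + gear)   # hypothetical formula interface
as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))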

extending the magic.wand function

The idea is to have the simple cases of conversion from sequential in-memory computation to MR covered and to chip away at more complex examples. magic.wand covers the simplest level, map only: $f(\mathrm{rbind}(x_1, \ldots, x_n)) = \mathrm{rbind}(f(x_1), \ldots, f(x_n))$. Can we extend it to cover the easiest summaries, such as sums? Let's say we have a data.frame method for sum: can we have magic.wand(sum) do the right thing on big data? It should work in combination with grouping, that is, magic.wand(sum)(group(data, col1, col2)) should return a sum for each group. What else can we hope to automagically port to the big data case?
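
The reason sum is a good candidate: a data.frame sum is mergeable, that is, applying it to partial results gives the same answer, which is exactly what a combiner needs. A quick in-memory check, with sum.df standing in for the hypothetical data.frame method:

sum.df = function(x) as.data.frame(lapply(x, sum))
x1 = mtcars[1:16, c("mpg", "cyl")]
x2 = mtcars[17:32, c("mpg", "cyl")]
all.equal(sum.df(rbind(x1, x2)), sum.df(rbind(sum.df(x1), sum.df(x2))))  # TRUE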

magic.wand issue with summarize

One needs to magic-wand summarise, or both summarize and summarise, because @hadley implemented summarize as UseMethod("summarise"). I think this wanding will eventually be included in the package by default, so that the user doesn't have to deal with it, and that will make this issue redundant.
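
In other words, a sketch, assuming magic.wand adds the pipe method for the wanded generic:

magic.wand(summarise)  # adds summarise.pipe; summarize then works too,
                       # since per the above it just does UseMethod("summarise")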

mergeability information

We've moved from recursive grouping being an option to group to it being a possible type of grouping, selected when at least the first transformation on the reduce side has the mergeability property. The mergeability property is set with the mergeable function, and the examples in the documentation and tutorial have been upgraded to use it. Now we need to look at all other uses of group where recursive was on, provide the mergeability information, and encapsulate it at the correct level.
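
A sketch of the intended usage pattern (the calls are illustrative, not lifted from the tutorial):

col.sums = mergeable(function(x) as.data.frame(lapply(x, sum)))
# do(group(input(mtcars), cyl), col.sums) can then run with a combiner,
# because the first reduce-side transformation carries the mergeability property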

select col by vector

As handy as the ... args are, sometimes one would like to specify a range of cols like 3:5, or has the col names in a character vector. To support those cases we could add a .columns argument that would be used as in

x[, .columns]

or some such
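
A sketch of what .columns would buy us, shown in-memory with a hypothetical stand-in:

select.cols = function(.data, .columns) .data[, .columns, drop = FALSE]
select.cols(mtcars, 3:5)                # a range of columns
select.cols(mtcars, c("mpg", "carb"))   # names held in a character vector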

Error in transform function

I tried loading an HDFS file and doing some transformations on it as you mention in your document, and it gives some errors. Please find the image below for the same.
[image]

partial ungroup

an additional argument that changes the grouping rather than resetting it
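
For instance (hypothetical signature):

# grouped = group(input(mtcars), cyl, gear)
# ungroup(grouped, gear)   # partial: still grouped by cyl
# ungroup(grouped)         # current behavior: grouping fully reset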

Problems reading from HDFS

Hi,

I have a problem reading CSV files using the plyrmr package and the input() function. It works if I pass the R object directly as the input (see example code). I get the following error in the stderr output. Any ideas for a solution?

Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

My R code:

library(plyrmr)
Sys.setenv(HADOOP_CMD="/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar")

mtcars

transform(mtcars, carb.per.cyl = carb/cyl)

# working

as.data.frame( transform(input(mtcars), carb.per.cyl = carb/cyl) )

write.table(mtcars, file = "/mapr/my.cluster.com/tmp/test.csv", sep = ",", row.names=TRUE, col.names=FALSE)

air_format = make.input.format("csv", sep = ",")

# not working

as.data.frame( transform(input("/mapr/my.cluster.com/tmp/test.csv", input.format=air_format), carb.per.cyl = carb/cyl) )
packageJobJar: [/tmp/Rtmpg1L4rx/rmr-local-env2ec8d6a90cb, /tmp/Rtmpg1L4rx/rmr-global-env2ec810159853, /tmp/Rtmpg1L4rx/rmr-streaming-map2ec831a0084b, /tmp/hadoop-schmidbm/hadoop-unjar1606066649349702739/] [] /tmp/streamjob4216188614721465077.jar tmpDir=null
14/03/24 12:39:34 INFO fs.JobTrackerWatcher: Current running JobTracker is: ex4s-dev01.devproof.org/144.76.60.132:9001
14/03/24 12:39:36 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's output directory at maprfs:/tmp/file2ec88e61485
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's user history location directory at maprfs:/tmp/file2ec88e61485/_logs
14/03/24 12:39:37 INFO streaming.StreamJob: getLocalDirs(): [/tmp/mapr-hadoop/mapred/local]
14/03/24 12:39:37 INFO streaming.StreamJob: Running job: job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:39:37 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:39:39 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:14 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:21 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:26 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:36 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:41:02 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:41:08 INFO streaming.StreamJob: map 100% reduce 100%
14/03/24 12:41:08 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:41:08 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:41:08 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:41:08 ERROR streaming.StreamJob: Job not successful. Error: NA
14/03/24 12:41:08 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Deleted maprfs:/tmp/file2ec86959dfd6
Deleted maprfs:/tmp/file2ec85d6e85fe

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
[6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyrmr_0.1.0 hydroPSO_0.3-3 R.methodsS3_1.6.1 pryr_0.1 rmr2_3.1.0 caTools_1.16
[7] plyr_1.8.1 stringr_0.6.2 reshape2_1.2.2 functional_0.4 digest_0.6.4 bitops_1.0-6
[13] RJSONIO_1.0-3 Rcpp_0.11.1

loaded via a namespace (and not attached):
[1] Formula_1.1-1 Hmisc_3.14-3 RColorBrewer_1.0-5 cluster_1.15.1 codetools_0.2-8 grid_3.0.1
[7] lattice_0.20-27 latticeExtra_0.6-26 sp_1.0-14 splines_3.0.1 survival_2.37-7 tools_3.0.1
[13] zoo_1.7-11

Complete Hadoop logs:
Loading objects:
.Random.seed
air_format
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Loading required package: reshape2
Loading required package: stringr
Loading required package: plyr
Loading required package: caTools
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
Loading required package: hydroPSO
(C) 2011-2013 M. Zambrano-Bigiarini and R. Rojas (GPL >=2 license)
Type 'citation('hydroPSO')' to see how to cite this package

Attaching package: ‘plyrmr’

The following object is masked from ‘package:pryr’:

where

The following object is masked from ‘package:rmr2’:

gather

The following objects are masked from ‘package:plyr’:

mutate, summarize

The following object is masked from ‘package:reshape2’:

dcast

The following objects are masked from ‘package:base’:

intersect, rbind, sample, union

Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

syslog logs

2014-03-24 12:41:03,837 INFO org.apache.hadoop.mapred.Child: JVM: jvm_201403201120_0338_m_-442050359 pid: 12813
2014-03-24 12:41:04,175 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-global-env2ec810159853 <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-global-env2ec810159853
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-local-env2ec8d6a90cb <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-local-env2ec8d6a90cb
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/job.jar <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/job.jar
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/.job.jar.crc <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/.job.jar.crc
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-streaming-map2ec831a0084b <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-streaming-map2ec831a0084b
2014-03-24 12:41:04,198 INFO org.apache.hadoop.mapred.Child: Starting task attempt_201403201120_0338_m_000001_3
2014-03-24 12:41:04,199 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2014-03-24 12:41:04,308 INFO org.apache.hadoop.mapreduce.util.ProcessTree: setsid exited with exit code 0
2014-03-24 12:41:04,311 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: /proc//status does not have information about swap space used(VmSwap). Can not track swap usage of a task.
2014-03-24 12:41:04,312 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@57271b36
2014-03-24 12:41:04,449 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2014-03-24 12:41:04,502 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/Rscript, --vanilla, ./rmr-streaming-map2ec831a0084b]
2014-03-24 12:41:04,559 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:04,563 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!

dfs.mv function not found

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(plyrmr)
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.4.4 (2013-05-19) successfully loaded. See ?R.methodsS3 for help.

Attaching package: ‘plyrmr’

The following objects are masked from ‘package:plyr’:

    mutate, summarize

> rmr.options(backend = 'local')
NULL
> big.mtcars = input(mtcars)
> as.data.frame(big.mtcars)
Error in as.big.data.pipe(x) : could not find function "dfs.mv"

Basic data frame functions fill in

In a quest to run before we walk, we seem to have forgotten basics like nrow, ncol, mean and some such. It is more glamorous to implement quantile, but we can't leave the obvious behind. Please comment here if there are basic functions from base that should be extended to big data sets ASAP.
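
For example, nrow on big data reduces to a per-chunk count plus a sum, using only do and gather from these issues (a sketch, not a commitment to an API):

counts = do(input(mtcars), function(x) data.frame(n = nrow(x)))  # one count per chunk
sum(as.data.frame(gather(counts))$n)                             # 32 for mtcars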

implementation of nested %|%

It doesn't do the right thing. For instance, this should fail:

4 %|% sqrt(9 %|% sqrt(..))

but it simply ignores the 4. Maybe the recursive implementation of find.. needs to be recovered and improved.

backend unsettled at start

It is set to "hadoop" but appears to be set to a list of 2 elements when inspected:

> plyrmr.options("backend")
2014-07-24 15:29:54.454 java[57317:1003] Unable to load realm info from SCDynamicStore
14/07/24 15:29:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[[1]]
[1] "local"

[[2]]
[1] "hadoop"

implicit as.data.frame for guaranteed small result operations

So far the plyrmr approach has been the same as rmr2's, which is no implicit DFS-to-master-RAM transfers and back. But what if the result of an operation is guaranteed to fit in main memory? Take gather: it groups everything into one group, hence the result must fit in memory for the last reduce call. No! gather could be composed with another grouping operation, hence there could be multiple reduce calls, hence the output might not fit in RAM. Score one for orthogonality! Anyway, let's document here that this possibility has been considered but so far rejected. Unlike SparkR's Reduce, which will zap any existing grouping, gather plays nicely with existing grouping (this allows writing operations like quantile that work for grouped and ungrouped data and use gather; the alternative would be, as it was in the past, for operations to check the grouping state explicitly before applying a gather, but that's boilerplate).

Installation error

I am trying to install plyrmr and it asks for the package "pryr" as a dependency. I searched the internet and couldn't find any R package named "pryr"; it might be plyr.
[image]

embed mergeability information

The idea is that we don't want to call mergeable all the time when passing an argument to do; it would then be an argument to do, not a higher-order function. We need to find all the mergeable data frame operations in plyrmr (not sure there are any) and outside it, at least the important ones. The ones we own need to have the mergeable attribute set; for the ones outside, one possibility is to shadow them, the other is to special-case them in is.mergeable. The former is cleaner, the latter probably less intrusive (see the sketch below).
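
A sketch of the attribute convention and the shadowing route (the definitions here only illustrate the idea):

mergeable = function(f) { attr(f, "mergeable") = TRUE; f }
is.mergeable = function(f) isTRUE(attr(f, "mergeable"))
# shadowing an outside operation we know to be mergeable:
colSums.df = mergeable(function(x) as.data.frame(t(colSums(x))))
is.mergeable(colSums.df)  # TRUE, so do could turn on the combiner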

Doc bug in docs/tutorial.md

In docs/tutorial.md, I see the following in the middle of the page:

...we need the output call, as in:

This is the real deal: we have performed a computation on the cluster, in parallel, and the data...

It looks like there's supposed to be a code example somewhere between those two paragraphs; did it go missing in a patch somewhere?

Conundrum applying magic wand to summarize

The idea is that data frame operations used in plyrmr can be marked with the mergeable attribute when amenable to the application of a combiner, with no further user action needed. magic.wand allows creating methods with this attribute. Unfortunately, summarise is general enough that it can be sometimes mergeable and sometimes not. transmute takes a .mergeable argument to delay this decision. Example:

summarize(mtcars, sum(cyl)) #mergeable
summarize(mtcars, mean(cyl)) #not mergeable

Functions can be marked mergeable with the higher-order function mergeable, and run will inspect the mergeable attribute and act accordingly. Maybe we need a mergeable.function and a mergeable.pipe method of a generic mergeable, and delay this decision to every time summarise is called, that is:

mergeable(summarize(mtcars, sum(cyl))) #mergeable
summarize(mtcars, mean(cyl)) #not mergeable
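
The two-method generic floated above would look something like this sketch:

mergeable = function(x, ...) UseMethod("mergeable")
mergeable.function = function(x, ...) { attr(x, "mergeable") = TRUE; x }
mergeable.pipe = function(x, ...) { attr(x, "mergeable") = TRUE; x }  # marks the last op in the pipe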

row names modified

It appears that some row names are sanitized going through do:

as.data.frame(do(input(mtcars), identity))

Spaces become "." and "(" becomes something else. Legal row names should not change unless the user changes them, or when they don't make sense (as in a group.by, where it would make more sense to name rows after the group).

delayed evaluation and environments

Delayed evaluation can have odd interactions with mutable environments. Look at this one

> p = mutate(input(data.frame(x = 1:10)), x = x + delta)
> as.data.frame(p)
     x
X1   1
X2   2
X3   3
X4   4
X5   5
X6   6
X7   7
X8   8
X9   9
X10 10
> delta = -10
> as.data.frame(p)
     x
X1  -9
X2  -8
X3  -7
X4  -6
X5  -5
X6  -4
X7  -3
X8  -2
X9  -1
X10  0

We want the value of delta to be frozen at the time the pipe is defined. There is a workaround: create an environment to hold the delta at pipe-definition time.

> p1 = mutate(input(data.frame(x = 1:10)), (function() {delta = 0; x + delta})())
> as.data.frame(p1)
     x
X1   1
X2   2
X3   3
X4   4
X5   5
X6   6
X7   7
X8   8
X9   9
X10 10
> delta
[1] -10

The question is: can we make this transparent to the user?
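
One candidate, a sketch of the idea rather than of plyrmr internals: snapshot the expression's free variables into a private environment at pipe-definition time and evaluate against that snapshot later.

delta = 1
expr = quote(x + delta)
snap = new.env(); snap$delta = delta  # capture free variables at definition time
delta = -10                           # later changes no longer leak in
eval(expr, list(x = 1:3), snap)       # 2 3 4, using the frozen delta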

more powerful recycling for select

Since select uses data.frame, it can only do whole-number recycling. There's a case for using the more powerful fractional recycling of cbind. Use case: mark the even rows of a data frame.

select(mtcars, even = c(F, T))
works only with an even number of rows.
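
The contrast in plain R: data.frame refuses fractional recycling, while cbind allows it (with a warning):

try(data.frame(x = 1:5, even = c(FALSE, TRUE)))  # error: 5 is not a multiple of 2
cbind(x = 1:5, even = c(FALSE, TRUE))            # recycles the pattern (as 0/1), warns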

fast vectorized summaries

To program plyrmr efficiently in the face of small groups, one has the option to go with vectorized reduce. Then each call to the reducer gets multiple groups as input; in plyrmr this takes the form of a data frame containing multiple groups. The problem is how to process such a data frame. If one goes for a split or some such, it is as slow as the non-vectorized mode, because the split creates many small data frames and is normally followed by an lapply that calls an interpreted R function, so there is no point going down that path.

The idea is to provide a few predefined fast vectorized reducers. The specific form of this idea explored in this issue is to use dplyr to help with that: dplyr has a system for handling summaries that sidesteps the interpreter for simple functions (vaguely described as "handlers" by the dplyr crew).

The general transformation of a non-vectorized reduce into a vectorized one could look like:

plyrmr::do(plyrmr::group(input(path), vars), f) => 
plyrmr::do(plyrmr::group(input(path), vars), function(x) dplyr::do(dplyr::group_by(x, vars), f))

Unfortunately, the semantics of dplyr::do is different from plyrmr::do (it returns lists, weird) and is described by @Romain as work in progress. So the only dplyr function we can use is summarize, AFAICT. The above transformation would then become:

plyrmr::summarize(plyrmr::group(input(path), vars), exp) =>
plyrmr::summarize(plyrmr::group(input(path), vars), function(x) dplyr::summarize(dplyr::group_by(x, vars), exp))

I am not totally clear on what the generality of the handler mechanism is. These are some experiments:

> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), mysum(x)))
   user  system elapsed 
  0.201   0.012   0.211 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), mysum(x)))
   user  system elapsed 
  0.007   0.001   0.008 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), sum(x)))
   user  system elapsed 
  0.090   0.005   0.094 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), sum(x)))
   user  system elapsed 
  0.006   0.000   0.006 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), median(x)))
   user  system elapsed 
  2.589   0.009   2.597 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), median(x)))
   user  system elapsed 
  0.008   0.001   0.008 

We are contrasting the performance on two vs 10^5 groups, while keeping the amount of data fixed. mysum is just function(x) sum(x), a plain interpreted wrapper around sum, which presumably is why it misses the fast path that sum itself hits.

API cleanup

First, select is a misnomer; the fact that SQL calls it that doesn't make it better. We could rename the current select with a name in the tradition of transform and mutate (both of which are way too general for what those functions do): calculate, compute, transmogrify, columns, new.columns, not sure. A select devoted to just selection could be based on dplyr's select, with the -colname and colname:colname syntax, which have different meanings in the current select.
