plyrmr's Issues
bind.cols should be identity when no args provided
different types of grouping
Right now we have grouping by column or function, gathering everything together, the recursive variety, and an advanced windowing construct that works with regular-interval time series. What is important next? Resampling? Binning? Hashing? What are the use cases? Some of these ideas are also being developed for the dplyr project.
ordering
In a distributed context, with the data in general partitioned, it's not clear what it means to sort, exactly. In the case of terasort, it means that the data is partitioned by range and then sorted within each partition. What are the alternatives to ordering and what use cases do they solve? Hive has an order by and a sort by: sort by sorts within each reducer, while order by has to be followed by a limit clause, which makes it equivalent to top-k, which plyrmr already has.
help file sqlish.Rd needs a rewrite
in the details section
find new default for argument R in moving window
merge spark dev branches
This is a complex merge which requires careful planning.
First the rationale:
- pros:
- single code base to maintain with maybe 80% shared
- single syntax and semantics to document, teach or learn, but there will be backend specific variations, hopefully just in the form of specific features
- avoid progressive drift between the two branches (both expected to change quickly)
- multiple backend implementations proved a good modularity prod and bug detector (because of tests of equivalence between backends)
- cons:
- single code base more complex than each of the separate ones
- merge is going to break things that currently work (hopefully this is temporary and will be confined to a dedicated branch)
- merge will take a lot of work
- SparkR is still incomplete, a dev preview, not ready for prime time. The merged package won't be ready for release, so the two branches will have to be maintained in parallel for an unpredictable amount of time
backend specific features
- sparkR
- cache
- indexed partitions
- not available yet: accumulators
- rmr2:
- local backend
- io formats (for now)
- writing files (for now)
Merge issues
- What do we do with deps unique to each branch? Importing all of them creates a burden on the installer, who needs only one backend: installing rmr2 means installing Hadoop, installing SparkR means installing Spark. If we go with the Suggests: field, we may end up with an installation that doesn't work. This is also the dplyr approach, but there the local mode sort of comes built in; here it is part of rmr2 and can't be unbundled.
- Same for namespace issues: what do we do with functions and methods that are backend specific? In this case there is no Suggests: alternative, but we can probably keep them all without harm
- the big.data class and big.data.R, albeit a misnomer at this point, can be used to interface with rmr2. We could rename it in a more specific way for clarity.
- the pipe class and pipe.R file are the tough ones. Let's break it down by function or group thereof.
- comp is a little functional programming helper used for delayed evaluation. Can go to common.R or anywhere
- options env is spark only, but it was on the todo list for the dev branch to also have configurable options and shield the user from accessing rmr2 options directly
- drop.gather changes based on different representation of keyvals in the two backends
- make*fun functions are rmr2 specific
- is.data is probably dead code, as.character.pipe for sure
- set.context should become part of the options subsystem, spark only
- print.pipe is backend independent (for short: SS is Spark-specific, RS is rmr2-specific and BI is backend-independent), but git diff didn't catch that
- make.f1 is BI
- mergeable, vectorized and related funcs are BI
- gapply and group.f and ungroup have been completely rewritten for spark
- the keyval related functions are SS and replace the keyval logic in rmr2. There is a little duplication, but unless we spin off keyvals into a separate package, what else can we do?
- group is BI
- run, mrexec mr.options are RS
- output can't be implemented in Spark right now
- the various as. conversions are mostly BS (backend specific) in their implementation
- the tutorial has been modified to avoid writing to a file in SparkR; this could be factored out but would make the tutorial less intuitive. Probably best to keep two versions and kick this down the road until the output feature is in SparkR
Merge preparations
If we just did a merge of one branch into the other as is, it could be a disaster. It's probably best to have BS code in separate files and very few selection points where control depends on the backend. This could be implemented the OO way, with two classes inheriting from pipe, say pipespark and pipemr, that differ in the methods gapply, group.f and ungroup. Once these preparations are completed in both branches, the merge should be a lot easier. Another possible preparation would be to somewhat normalize the order of functions, as git diff seems to be confounded by simple code reordering.
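The class-based preparation could be sketched like this (a minimal S3 illustration; pipespark, pipemr and the exact method set are hypothetical names taken from the discussion above, not the actual plyrmr code):

```r
# Backend-specific pipe classes inherit from a common "pipe" class, and
# only the few backend-dependent operations dispatch on the subclass.
make.pipe <- function(data, backend = c("mr", "spark")) {
  backend <- match.arg(backend)
  structure(list(data = data), class = c(paste0("pipe", backend), "pipe"))
}

gapply <- function(p, f) UseMethod("gapply")
gapply.pipemr    <- function(p, f) f(p$data)  # the rmr2 keyval-based path would go here
gapply.pipespark <- function(p, f) f(p$data)  # the SparkR RDD-based path would go here

class(make.pipe(mtcars, "spark"))  # "pipespark" "pipe"
```

Everything else dispatches on the shared "pipe" class, so git diff between branches would only see the small method files.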
select with no .. args fails when .replace FALSE
select(mtcars, .replace = FALSE) should be the same as identity; instead it fails
where with an empty result fails when grouped
as.data.frame(where(group(input(mtcars), carb), FALSE))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1, 0
as.data.frame(where(input(mtcars), FALSE))
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)
review presentation for clarity and consistency
as a consequence of #24
combine non.standard.eval.patch and magic.wand
to make
magic.wand(transform, TRUE)
just work
handle special case of output(input(something))
In this case we don't really need an identity MR job, just to move the tmp file to the destination. Right now it just returns the input, which is a bug. Waiting on RevolutionAnalytics/rmr2#63
more flexible powerful summaries
Right now we have counting and quantiles. They cannot be mixed and they are done column by column. With counting, one may want to count combinations of columns, specified possibly with a formula-type language.
turn 0.1 function from other packages into extensions
The functions that were axed in 0.2, like transform, can be brought back as extensions for the people who want them.
extending the magic.wand function
The idea is to have the simple cases of conversion from sequential in-memory computation to MR covered, and to chip away at more complex examples. magic.wand covers the simplest level, map only. Does magic.wand(sum) do the right thing on big data? It should work in combination with grouping, that is, magic.wand(sum)(group(data, col1, col2)) should return a sum for each group. What else can we hope to automagically port to the big data case?
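As a reference point for the grouped semantics, base R's aggregate already expresses what magic.wand(sum)(group(data, col1, col2)) should compute on a plain data frame (this is only an analogy, not plyrmr code):

```r
# One row per (cyl, gear) group, with the per-group sum of mpg: the
# behavior magic.wand(sum) should have on grouped big data.
aggregate(mpg ~ cyl + gear, data = mtcars, FUN = sum)
```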
make quantile function more flexible
should resemble stats::quantile as far as possible.
document new vectorized group features
magic.wand issue with summarize
One needs to magic wand summarise, or both summarize and summarise, because @hadley implemented summarize as UseMethod("summarise"). I think this will be included in the package by default, so that the user doesn't have to deal with it, and that will make this issue redundant.
mergeability information
We've moved from recursive grouping being an option to group, to it being a possible type of grouping selected when at least the first transformation on the reduce side has the mergeability property. The mergeability property is set with the mergeable function, and examples in the documentation and tutorial have been upgraded to use it. Now we need to look at all other uses of group where recursive was on, provide the mergeability information and encapsulate it at the correct level.
find proper places to link to new doc about vectorized aggregation
select col by vector
As handy as the ... args are, sometimes one would like to specify a range of cols like 3:5, or has the col names in a character vector. To support those cases we could add a .columns argument that would be used as in x[, .columns] or some such.
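In base R terms, the hypothetical .columns argument would just forward to single-bracket indexing:

```r
# The two use cases a .columns argument would cover, in base R:
mtcars[, 3:5]                # select a range of columns by position
mtcars[, c("mpg", "carb")]   # select columns named in a character vector
```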
Error in transform function
partial ungroup
an additional argument that changes the grouping rather than resetting it
Problems reading from HDFS
Hi,
I have a problem reading csv files using the plyrmr package and the input() function. It works if I pass the R object directly as the input object (see example code). I get the following error in the stderr output. Any ideas for a solution?
Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
My R code:
library(plyrmr)
Sys.setenv(HADOOP_CMD="/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar")
mtcars
transform(mtcars, carb.per.cyl = carb/cyl)
# working
as.data.frame( transform(input(mtcars), carb.per.cyl = carb/cyl) )
write.table(mtcars, file = "/mapr/my.cluster.com/tmp/test.csv", sep = ",", row.names=TRUE, col.names=FALSE)
air_format =
make.input.format(
"csv",
sep = ","
)
# not working
as.data.frame( transform(input("/mapr/my.cluster.com/tmp/test.csv", input.format=air_format), carb.per.cyl = carb/cyl) )
packageJobJar: [/tmp/Rtmpg1L4rx/rmr-local-env2ec8d6a90cb, /tmp/Rtmpg1L4rx/rmr-global-env2ec810159853, /tmp/Rtmpg1L4rx/rmr-streaming-map2ec831a0084b, /tmp/hadoop-schmidbm/hadoop-unjar1606066649349702739/] [] /tmp/streamjob4216188614721465077.jar tmpDir=null
14/03/24 12:39:34 INFO fs.JobTrackerWatcher: Current running JobTracker is: ex4s-dev01.devproof.org/144.76.60.132:9001
14/03/24 12:39:36 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's output directory at maprfs:/tmp/file2ec88e61485
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's user history location directory at maprfs:/tmp/file2ec88e61485/_logs
14/03/24 12:39:37 INFO streaming.StreamJob: getLocalDirs(): [/tmp/mapr-hadoop/mapred/local]
14/03/24 12:39:37 INFO streaming.StreamJob: Running job: job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:39:37 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:39:39 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:14 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:21 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:26 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:36 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:41:02 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:41:08 INFO streaming.StreamJob: map 100% reduce 100%
14/03/24 12:41:08 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:41:08 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:41:08 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:41:08 ERROR streaming.StreamJob: Job not successful. Error: NA
14/03/24 12:41:08 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Deleted maprfs:/tmp/file2ec86959dfd6
Deleted maprfs:/tmp/file2ec85d6e85fe
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
[6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyrmr_0.1.0 hydroPSO_0.3-3 R.methodsS3_1.6.1 pryr_0.1 rmr2_3.1.0 caTools_1.16
[7] plyr_1.8.1 stringr_0.6.2 reshape2_1.2.2 functional_0.4 digest_0.6.4 bitops_1.0-6
[13] RJSONIO_1.0-3 Rcpp_0.11.1
loaded via a namespace (and not attached):
[1] Formula_1.1-1 Hmisc_3.14-3 RColorBrewer_1.0-5 cluster_1.15.1 codetools_0.2-8 grid_3.0.1
[7] lattice_0.20-27 latticeExtra_0.6-26 sp_1.0-14 splines_3.0.1 survival_2.37-7 tools_3.0.1
[13] zoo_1.7-11
Complete Hadoop logs:
Loading objects:
.Random.seed
air_format
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Loading required package: reshape2
Loading required package: stringr
Loading required package: plyr
Loading required package: caTools
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
Loading required package: hydroPSO
(C) 2011-2013 M. Zambrano-Bigiarini and R. Rojas (GPL >=2 license)
Type 'citation('hydroPSO')' to see how to cite this package
Attaching package: ‘plyrmr’
The following object is masked from ‘package:pryr’:
where
The following object is masked from ‘package:rmr2’:
gather
The following objects are masked from ‘package:plyr’:
mutate, summarize
The following object is masked from ‘package:reshape2’:
dcast
The following objects are masked from ‘package:base’:
intersect, rbind, sample, union
Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
syslog logs
2014-03-24 12:41:03,837 INFO org.apache.hadoop.mapred.Child: JVM: jvm_201403201120_0338_m_-442050359 pid: 12813
2014-03-24 12:41:04,175 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-global-env2ec810159853 <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-global-env2ec810159853
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-local-env2ec8d6a90cb <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-local-env2ec8d6a90cb
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/job.jar <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/job.jar
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/.job.jar.crc <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/.job.jar.crc
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-streaming-map2ec831a0084b <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-streaming-map2ec831a0084b
2014-03-24 12:41:04,198 INFO org.apache.hadoop.mapred.Child: Starting task attempt_201403201120_0338_m_000001_3
2014-03-24 12:41:04,199 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2014-03-24 12:41:04,308 INFO org.apache.hadoop.mapreduce.util.ProcessTree: setsid exited with exit code 0
2014-03-24 12:41:04,311 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: /proc//status does not have information about swap space used(VmSwap). Can not track swap usage of a task.
2014-03-24 12:41:04,312 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@57271b36
2014-03-24 12:41:04,449 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2014-03-24 12:41:04,502 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/Rscript, --vanilla, ./rmr-streaming-map2ec831a0084b]
2014-03-24 12:41:04,559 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:04,563 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
dfs.mv function not found
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(plyrmr)
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.4.4 (2013-05-19) successfully loaded. See ?R.methodsS3 for help.
Attaching package: ‘plyrmr’
The following object is masked from ‘package:plyr’:
mutate, summarize
> rmr.options(backend = 'local')
NULL
> big.mtcars = input(mtcars)
> as.data.frame(big.mtcars)
Error in as.big.data.pipe(x) : could not find function "dfs.mv"
Basic data frame functions fill in
In a quest to run before we walk, we seem to have forgotten basics like nrow, ncol, mean and some such. It is more glamorous to implement quantile, but we can't leave the obvious behind. Please comment here if there are basic functions from base that should be extended to big data sets ASAP.
implementation of nested %|%
It doesn't do the right thing; for instance this should fail:
4 %|% sqrt(9 %|% sqrt(..))
but it simply ignores the 4. Maybe the recursive implementation of find.. needs to be recovered and improved.
Write document to explain grouping
simple, nested, recursive, vectorized, ungroup
change depends to imports
See equivalent issue for RevolutionAnalytics/rmr2#87 . That has to be done first
backend unsettled at start
it is set to "hadoop" but appears to be set to a list of 2 elements if inspected.
> plyrmr.options("backend")
2014-07-24 15:29:54.454 java[57317:1003] Unable to load realm info from SCDynamicStore
14/07/24 15:29:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[[1]]
[1] "local"
[[2]]
[1] "hadoop"
better naming for freq columns
as they come out of count. Numbering is not acceptable for this package.
implicit as.data.frame for guaranteed small result operations
So far the plyrmr approach has been the same as rmr2's: no implicit DFS-to-master-RAM transfers and back. But what if the result of an operation is guaranteed to fit in main memory? Take gather: it groups everything into one group, hence the result must fit in memory for the last reduce call. No! gather could be composed with another grouping operation, hence there could be multiple reduce calls, hence the output might not fit in RAM. Score one for orthogonality! Anyway, let's document here that this possibility has been considered but so far rejected. Unlike sparkR's Reduce, which will zap any existing grouping, gather plays nicely with existing grouping (which allows writing operations like quantile that work for grouped and ungrouped data and use gather; the alternative would be, as it was in the past, for operations to check the grouping state explicitly before applying a gather, but that's boilerplate).
support for row names
expose vectorized reduce feature in a plyrmr-esque way
non standard eval document outdated
And a fix was attempted by blind search and replace. It needs a human being to look into it.
dummy col logic should move to safe cbind
there should never be a reason to cbind a col of 1s to anything else
what happens to the output format when an expression requires two jobs?
it should delay the format change to the last job, but I don't think that's what happens
Installation error
embed mergeability information
The idea is that we don't want to call mergeable all the time when passing an argument to do; then it would be an argument to do, not a higher-level function. We need to find all data frame operations that are mergeable, in plyrmr (not sure there are any) and outside, at least the ones of importance. The ones that we own need to have the mergeable attribute set; for the ones outside, one possibility is to shadow them, the other to special-case them in is.mergeable. The former is cleaner, the latter probably less intrusive.
make sure the summarize example works
review tutorial for consistency and clarity
after API changes #24
Doc bug in docs/tutorial.md
In docs/tutorial.md, I see the following in the middle of the page:
...we need the output call, as in:
This is the real deal: we have performed a computation on the cluster, in parallel, and the data...
It looks like there's supposed to be a code example somewhere between those two paragraphs; did it go missing in a patch somewhere?
Conundrum applying magic wand to summarize
The idea is that data frame operations used in plyrmr can be marked with the mergeable attribute if amenable to the application of a combiner, with no further user action needed. magic wand allows one to create methods with this attribute. Unfortunately summarise is general enough that it can be sometimes mergeable and sometimes not; transmute takes a .mergeable argument to delay this decision. Example:
summarize(mtcars, sum(cyl)) # mergeable
summarize(mtcars, mean(cyl)) # not mergeable
Functions can be marked mergeable with the higher order function mergeable, and run will inspect the mergeable attribute and act accordingly. Maybe we need a mergeable.function and a mergeable.pipe method of a generic mergeable, and delay this decision to every time summarise is called, that is
mergeable(summarize(mtcars, sum(cyl))) # mergeable
summarize(mtcars, mean(cyl)) # not mergeable
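A minimal sketch of the attribute mechanism under discussion (the real plyrmr internals may differ; mergeable and is.mergeable here are modeled on the names used above):

```r
# mergeable() marks a function as safe for a combiner; is.mergeable()
# is what the runtime would inspect before enabling one.
mergeable <- function(f) { attr(f, "mergeable") <- TRUE; f }
is.mergeable <- function(f) isTRUE(attr(f, "mergeable"))

my.sum <- mergeable(function(x) sum(x))
is.mergeable(my.sum)               # TRUE: a sum of partial sums is the total sum
is.mergeable(function(x) mean(x))  # FALSE: a mean of partial means is not the mean
```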
remove dead cpp code
row names modified
It appears that some row names are sanitized going through do:
as.data.frame(do(input(mtcars), identity))
Spaces become "." and "(" becomes something else. Legal row names should not change unless the user changes them, or when they don't make sense (like in a group.by, where it would make more sense to have the row named after the group).
delayed evaluation and environments
Delayed evaluation can have odd interactions with mutable environments. Look at this one
> p = mutate(input(data.frame(x = 1:10)), x = x + delta)
> as.data.frame(p)
x
X1 1
X2 2
X3 3
X4 4
X5 5
X6 6
X7 7
X8 8
X9 9
X10 10
> delta = -10
> as.data.frame(p)
x
X1 -9
X2 -8
X3 -7
X4 -6
X5 -5
X6 -4
X7 -3
X8 -2
X9 -1
X10 0
We want the value of delta to be frozen at the time the pipe is defined. There is a workaround: create an environment to hold delta at pipe definition time.
> p1 = mutate(input(data.frame(x = 1:10)), (function() {delta = 0; x + delta})())
> as.data.frame(p1)
x
X1 1
X2 2
X3 3
X4 4
X5 5
X6 6
X7 7
X8 8
X9 9
X10 10
> delta
[1] -10
The question is can we make this transparent to the user?
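Short of making it fully transparent, one base R idiom that freezes the value at definition time is to copy it into a local environment (a sketch of the workaround, not of plyrmr internals):

```r
# Capture the current value of delta in the closure's own environment,
# so later changes to the global delta are invisible.
delta <- 0
add.delta <- local({
  frozen <- delta            # evaluated now, at definition time
  function(x) x + frozen
})
delta <- -10
add.delta(1)                 # still 1: uses the frozen value 0
```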
support combiners
more powerful recycling for select
Since it uses data.frame, it can only do whole-number recycling. This is a case for using the more powerful fractional recycling of cbind. Use case: mark the even rows of a data frame with
select(mtcars, even = c(F, T))
which currently works only when the number of rows is even.
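The difference between the two recycling rules can be seen directly in base R:

```r
# data.frame requires whole-number recycling; cbind on atomic vectors
# also recycles fractionally (with a warning).
nrow(data.frame(x = 1:4, even = c(FALSE, TRUE)))              # 4: lengths divide evenly
nrow(suppressWarnings(cbind(x = 1:5, even = c(FALSE, TRUE)))) # 5: fractional recycling
try(data.frame(x = 1:5, even = c(FALSE, TRUE)))               # error: differing number of rows
```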
replace uses of ddply with dplyr
should be mostly summarize, for speed
fast vectorized summaries
To program plyrmr efficiently in the face of small groups, one has the option of the vectorized reduce. Then each call to the reducer gets multiple groups as input; in plyrmr this takes the form of a data frame containing multiple groups. The problem is how to process such a data frame. If one goes for a split or some such, it is as slow as the non-vectorized mode, because the split creates many small data frames and is normally followed by an lapply that calls an interpreted R function, so there is no point going down that path.
The idea is to provide a few predefined fast vectorized reducers. The specific form of this idea explored in this issue is to use dplyr to help with that: dplyr has a system for handling summaries that sidesteps the interpreter for simple functions (vaguely described as handlers by the dplyr crew).
The general transformation of a non vectorized reduce to a vectorized one could look like:
plyrmr::do(plyrmr::group(input(path), vars), f) =>
plyrmr::do(plyrmr::group(input(path), vars), function(x) dplyr::do(dplyr::group_by(x, vars), f))
Unfortunately, the semantics of dplyr::do is different from plyrmr::do (it returns lists, weird) and is described by @Romain as work in progress. So the only dplyr function we can use is summarize, AFAICT. The above transformation would then become
plyrmr::summarize(plyrmr::group(input(path), vars), exp) =>
plyrmr::summarize(plyrmr::group(input(path), vars), function(x) dplyr::summarize(dplyr::group_by(x, vars), f))
I am not totally clear on what the generality of the handler mechanism is. These are some experiments:
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), mysum(x)))
user system elapsed
0.201 0.012 0.211
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), mysum(x)))
user system elapsed
0.007 0.001 0.008
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), sum(x)))
user system elapsed
0.090 0.005 0.094
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), sum(x)))
user system elapsed
0.006 0.000 0.006
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), median(x)))
user system elapsed
2.589 0.009 2.597
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), median(x)))
user system elapsed
0.008 0.001 0.008
We are contrasting the performance on two vs 10^5 groups, while keeping the amount of data fixed. mysum is just function(x) sum(x).
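The same interpreter-sidestepping effect exists in base R: rowsum aggregates all groups in one vectorized pass, while split plus sapply pays one interpreted call per group, which is what dominates in the many-small-groups case.

```r
# Two ways to sum x within groups g: per-group interpreted calls vs one
# vectorized pass. With 10^5 singleton groups the difference is dramatic.
x <- as.numeric(1:10^5)
g <- x                                       # every value its own group
system.time(s1 <- sapply(split(x, g), sum))  # slow: 10^5 calls to sum()
system.time(s2 <- rowsum(x, g))              # fast: one vectorized pass
```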
API cleanup
First, select is a misnomer; the fact that SQL calls it that doesn't make it better. We could rename the current select with a name in the tradition of transform and mutate (both of which are way too general for what the functions do): it could be calculate, compute, transmogrify, columns, new.columns, not sure. A select devoted to just selection could be based on dplyr's select, with -colname and colname:colname syntax, which have different meanings in the current select.