
plyrmr's People

Contributors

piccolbo


plyrmr's Issues

different types of grouping

Right now we have grouping by column or by a function of the columns, gather (everything into a single group), the recursive variety, and an advanced windowing construct that works with regular-interval time series. What is important next? Resampling? Binning? Hashing? What are the use cases? Some of these ideas are also being developed for the dplyr project.
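
For reference, a minimal sketch of the constructs we already have, using calls that appear elsewhere in these issues (the binning line is hypothetical):

library(plyrmr)
by.col = group(input(mtcars), carb)   # grouping by column
as.one = gather(input(mtcars))        # everything gathered into a single group
# hypothetical, if we went the binning way:
# group(input(mtcars), mpg.bin = cut(mpg, 5))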

ordering

In a distributed context, with the data in general partitioned, it's not clear what sorting means exactly. In the case of terasort, it means that the data is partitioned by range and then sorted within each partition. What are the alternatives to ordering, and what use cases do they solve? Hive has both an ORDER BY and a SORT BY: SORT BY sorts within each reducer; ORDER BY has to be followed by a LIMIT clause, which makes it equivalent to top-k, which plyrmr already has.
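
To pin down the terasort semantics, here is an in-memory analogue in plain R (a sketch, not a proposed plyrmr API):

x = data.frame(key = runif(100))
breaks = quantile(x$key, c(.25, .5, .75))             # range boundaries
parts = split(x, findInterval(x$key, breaks))         # partition by range
parts = lapply(parts, function(p) p[order(p$key), ])  # sort within each partition
# concatenating the parts in order now yields a totally ordered data set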

merge spark dev branches

This is a complex merge which requires careful planning.

First the rationale:

  • pros:
    • a single code base to maintain, with maybe 80% shared
    • a single syntax and semantics to document, teach and learn, though there will be backend-specific variations, hopefully just in the form of backend-specific features
    • avoids progressive drift between the two branches (both expected to change quickly)
    • the multiple-backend implementation has proved a good prod toward modularity and a bug detector (because of equivalence tests between backends)
  • cons:
    • a single code base is more complex than either of the separate ones
    • the merge is going to break things that currently work (hopefully this is temporary and will be confined to a dedicated branch)
    • the merge will take a lot of work
    • SparkR is still incomplete, a dev preview, not ready for prime time; the merged package won't be ready for release, so release and development lines will have to be maintained in parallel for an unpredictable amount of time

backend specific features

  • SparkR:
    • cache
    • indexed partitions
    • not available yet: accumulators
  • rmr2:
    • local backend
    • I/O formats (for now)
    • writing files (for now)

Merge issues

  • What do we do with deps unique to each branch? Importing all of them creates a burden on installers who need only one backend: installing rmr2 means installing Hadoop, and installing SparkR means installing Spark. If we go with the Suggests: field, we may end up with an installation that doesn't work. Suggests is also the dplyr approach, but there the local mode sort of comes built in; here it is part of rmr2 and can't be unbundled (see the sketch after this list).
  • Same for namespace issues: what do we do with functions and methods that are backend specific? In this case there is no Suggests-like alternative, but we can probably keep them all without harm.
  • The big.data class and big.data.R, albeit a misnomer at this point, can be used to interface with rmr2. We could rename it in a more specific way for clarity.
  • The pipe class and pipe.R file are the tough ones. Let's break them down function by function, or by groups thereof (for short: SS is Spark-specific, RS is rmr2-specific and BI is backend independent).
    • comp is a little functional-programming helper used for delayed evaluation. It can go to common.R or anywhere.
    • the options env is SS, but it was on the todo list for the dev branch to also have configurable options and shield the user from accessing rmr2 options directly
    • drop.gather changes based on the different representations of keyvals in the two backends
    • the make*fun functions are RS
    • is.data is probably dead code, as.character.pipe for sure
    • set.context should become part of the options subsystem, SS
    • print.pipe is BI, but git diff didn't catch that
    • make.f1 is BI
    • mergeable, vectorized and related functions are BI
    • gapply, group.f and ungroup have been completely rewritten for Spark
    • the keyval-related functions are SS and replace the keyval logic in rmr2. There is a little duplication, but unless we spin off keyvals into a separate package, what else can we do?
    • group is BI
    • run, mrexec and mr.options are RS
    • output can't be implemented in Spark right now
    • the various as.* conversions are mostly backend specific in their implementation
    • the tutorial has been modified to avoid writing to a file in SparkR; this could be factored out, but it would make the tutorial less intuitive. Probably best to keep two versions and kick this down the road until the output feature is in SparkR.
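
On the dependency question above, a minimal sketch of the Suggests-plus-runtime-check route (the plyrmr.backend option name is an assumption, not current API):

backend = getOption("plyrmr.backend", "rmr2")   # assumed single selection point
if (backend == "rmr2" && !requireNamespace("rmr2", quietly = TRUE))
  stop("the rmr2 backend needs rmr2 (and Hadoop) installed")
if (backend == "spark" && !requireNamespace("SparkR", quietly = TRUE))
  stop("the spark backend needs SparkR (and Spark) installed")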

Merge preparations

If we just did a merge of one branch into the other as is, it could be a disaster. It's probably best to have backend-specific code in separate files and to have very few selection points where control depends on the backend. This could be implemented the OO way, with two classes inheriting from pipe, say pipespark and pipemr, that differ in the methods gapply, group.f and ungroup (see the sketch below). Once these preparations are completed in both branches, the merge should be a lot easier. Another possible preparation could be to somewhat normalize the order of functions, as git diff seems to be confounded by simple code reordering.
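
A minimal S3 sketch of that structure, with placeholder bodies (pipemr and pipespark are just the names floated above):

gapply = function(.data, ...) UseMethod("gapply")
gapply.pipemr = function(.data, ...) stop("rmr2-specific implementation goes here")
gapply.pipespark = function(.data, ...) stop("SparkR-specific implementation goes here")
# tag the backend once at construction; everything else dispatches on class
as.pipemr = function(x) structure(x, class = c("pipemr", "pipe"))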

where with an empty result fails when grouped

as.data.frame(where(group(input(mtcars), carb), FALSE))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1, 0
as.data.frame(where(input(mtcars), FALSE))
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)

more flexible powerful summaries

Right now we have counting and quantiles. They cannot be mixed, and they are done column by column. With counting, one may well want to count combinations of columns, specified possibly with a formula-type language.
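
For instance, counting a combination of columns could look like the hypothetical formula call in the comment below; the last line is today's in-memory equivalent:

# count(input(mtcars), ~ cyl + gear)   # hypothetical formula interface
as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))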

extending the magic.wand function

The idea is to have the simple cases of conversion from sequential in-memory computation to MR covered and to chip away at more complex examples. magic.wand covers the simplest level, map only: $f(\mathrm{rbind}(x_1, \ldots, x_n)) = \mathrm{rbind}(f(x_1), \ldots, f(x_n))$. Can we extend it to cover the easiest summaries, such as sums? Let's say we have a data.frame method for sum: can we have magic.wand(sum) do the right thing on big data? It should work in combination with grouping, that is, magic.wand(sum)(group(data, col1, col2)) should return a sum for each group. What else can we hope to automagically port to the big data case?
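
The reason sum is a good candidate: a data.frame sum is mergeable, that is, applying it to partial results gives the same answer, which is exactly what a combiner needs. A quick in-memory check, with sum.df standing in for the hypothetical data.frame method:

sum.df = function(x) as.data.frame(lapply(x, sum))
x1 = mtcars[1:16, c("mpg", "cyl")]
x2 = mtcars[17:32, c("mpg", "cyl")]
all.equal(sum.df(rbind(x1, x2)), sum.df(rbind(sum.df(x1), sum.df(x2))))  # TRUE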

magic.wand issue with summarize

One needs to magic-wand summarise, or both summarize and summarise, because @hadley implemented summarize as UseMethod("summarise"). I think this wanding will eventually be included in the package by default, so that the user doesn't have to deal with it, and that will make this issue redundant.
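
In other words, a sketch, assuming magic.wand adds the pipe method for the wanded generic:

magic.wand(summarise)  # adds summarise.pipe; summarize then works too,
                       # since per the above it just does UseMethod("summarise")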

mergeability information

We've moved from recursive grouping being an option to group to it being a possible type of grouping, selected when at least the first transformation on the reduce side has the mergeability property. The mergeability property is set with the mergeable function, and the examples in the documentation and tutorial have been upgraded to use it. Now we need to look at all other uses of group where recursive was on, provide the mergeability information, and encapsulate it at the correct level.
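
A sketch of the intended usage pattern (the calls are illustrative, not lifted from the tutorial):

col.sums = mergeable(function(x) as.data.frame(lapply(x, sum)))
# do(group(input(mtcars), cyl), col.sums) can then run with a combiner,
# because the first reduce-side transformation carries the mergeability property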

select col by vector

As handy as the ... args are, sometimes one would like to specify a range of cols like 3:5, or has the col names in a character vector. To support those cases we could add a .columns argument that would be used as in

x[, .columns]

or some such
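
A sketch of what .columns would buy us, shown in-memory with a hypothetical stand-in:

select.cols = function(.data, .columns) .data[, .columns, drop = FALSE]
select.cols(mtcars, 3:5)                # a range of columns
select.cols(mtcars, c("mpg", "carb"))   # names held in a character vector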

Error in transform function

I tried loading an HDFS file and doing some transformations on it as you mention in your document, and it gives some errors. Please find the image below for the same.
[image]

partial ungroup

an additional argument that changes the grouping rather than resetting it
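
For instance (hypothetical signature):

# grouped = group(input(mtcars), cyl, gear)
# ungroup(grouped, gear)   # partial: still grouped by cyl
# ungroup(grouped)         # current behavior: grouping fully reset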

Problems reading from HDFS

Hi,

I have a problem reading CSV files using the plyrmr package and the input() function. It works if I pass the R object directly as the input (see example code). I get the following error in the stderr output. Any ideas for a solution?

Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

My R code:

library(plyrmr)
Sys.setenv(HADOOP_CMD="/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar")

mtcars

transform(mtcars, carb.per.cyl = carb/cyl)

# working

as.data.frame( transform(input(mtcars), carb.per.cyl = carb/cyl) )

write.table(mtcars, file = "/mapr/my.cluster.com/tmp/test.csv", sep = ",", row.names=TRUE, col.names=FALSE)

air_format = make.input.format("csv", sep = ",")

# not working

as.data.frame( transform(input("/mapr/my.cluster.com/tmp/test.csv", input.format=air_format), carb.per.cyl = carb/cyl) )
packageJobJar: [/tmp/Rtmpg1L4rx/rmr-local-env2ec8d6a90cb, /tmp/Rtmpg1L4rx/rmr-global-env2ec810159853, /tmp/Rtmpg1L4rx/rmr-streaming-map2ec831a0084b, /tmp/hadoop-schmidbm/hadoop-unjar1606066649349702739/] [] /tmp/streamjob4216188614721465077.jar tmpDir=null
14/03/24 12:39:34 INFO fs.JobTrackerWatcher: Current running JobTracker is: ex4s-dev01.devproof.org/144.76.60.132:9001
14/03/24 12:39:36 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's output directory at maprfs:/tmp/file2ec88e61485
14/03/24 12:39:36 INFO mapred.JobClient: Creating job's user history location directory at maprfs:/tmp/file2ec88e61485/_logs
14/03/24 12:39:37 INFO streaming.StreamJob: getLocalDirs(): [/tmp/mapr-hadoop/mapred/local]
14/03/24 12:39:37 INFO streaming.StreamJob: Running job: job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:39:37 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:39:37 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:39:39 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:14 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:21 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:40:26 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:40:36 INFO streaming.StreamJob: map 0% reduce 0%
14/03/24 12:41:02 INFO streaming.StreamJob: map 50% reduce 0%
14/03/24 12:41:08 INFO streaming.StreamJob: map 100% reduce 100%
14/03/24 12:41:08 INFO streaming.StreamJob: To kill this job, run:
14/03/24 12:41:08 INFO streaming.StreamJob: /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201403201120_0338
14/03/24 12:41:08 INFO streaming.StreamJob: Tracking URL: http://ex4s-dev01.devproof.org:50030/jobdetails.jsp?jobid=job_201403201120_0338
14/03/24 12:41:08 ERROR streaming.StreamJob: Job not successful. Error: NA
14/03/24 12:41:08 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Deleted maprfs:/tmp/file2ec86959dfd6
Deleted maprfs:/tmp/file2ec85d6e85fe

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
[6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyrmr_0.1.0 hydroPSO_0.3-3 R.methodsS3_1.6.1 pryr_0.1 rmr2_3.1.0 caTools_1.16
[7] plyr_1.8.1 stringr_0.6.2 reshape2_1.2.2 functional_0.4 digest_0.6.4 bitops_1.0-6
[13] RJSONIO_1.0-3 Rcpp_0.11.1

loaded via a namespace (and not attached):
[1] Formula_1.1-1 Hmisc_3.14-3 RColorBrewer_1.0-5 cluster_1.15.1 codetools_0.2-8 grid_3.0.1
[7] lattice_0.20-27 latticeExtra_0.6-26 sp_1.0-14 splines_3.0.1 survival_2.37-7 tools_3.0.1
[13] zoo_1.7-11

Complete Hadoop logs:
Loading objects:
.Random.seed
air_format
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Loading required package: reshape2
Loading required package: stringr
Loading required package: plyr
Loading required package: caTools
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
Loading required package: hydroPSO
(C) 2011-2013 M. Zambrano-Bigiarini and R. Rojas (GPL >=2 license)
Type 'citation('hydroPSO')' to see how to cite this package

Attaching package: ‘plyrmr’

The following object is masked from ‘package:pryr’:

where

The following object is masked from ‘package:rmr2’:

gather

The following objects are masked from ‘package:plyr’:

mutate, summarize

The following object is masked from ‘package:reshape2’:

dcast

The following objects are masked from ‘package:base’:

intersect, rbind, sample, union

Error: !is.null(template) is not TRUE
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:365)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

syslog logs

2014-03-24 12:41:03,837 INFO org.apache.hadoop.mapred.Child: JVM: jvm_201403201120_0338_m_-442050359 pid: 12813
2014-03-24 12:41:04,175 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-global-env2ec810159853 <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-global-env2ec810159853
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-local-env2ec8d6a90cb <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-local-env2ec8d6a90cb
2014-03-24 12:41:04,176 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/job.jar <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/job.jar
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/.job.jar.crc <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/.job.jar.crc
2014-03-24 12:41:04,177 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/jars/rmr-streaming-map2ec831a0084b <- /tmp/mapr-hadoop/mapred/local/taskTracker/schmidbm/jobcache/job_201403201120_0338/attempt_201403201120_0338_m_000001_3/work/rmr-streaming-map2ec831a0084b
2014-03-24 12:41:04,198 INFO org.apache.hadoop.mapred.Child: Starting task attempt_201403201120_0338_m_000001_3
2014-03-24 12:41:04,199 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2014-03-24 12:41:04,308 INFO org.apache.hadoop.mapreduce.util.ProcessTree: setsid exited with exit code 0
2014-03-24 12:41:04,311 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: /proc//status does not have information about swap space used(VmSwap). Can not track swap usage of a task.
2014-03-24 12:41:04,312 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@57271b36
2014-03-24 12:41:04,449 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2014-03-24 12:41:04,502 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/Rscript, --vanilla, ./rmr-streaming-map2ec831a0084b]
2014-03-24 12:41:04,559 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:04,563 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2014-03-24 12:41:06,941 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!

dfs.mv function not found

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(plyrmr)
Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.4.4 (2013-05-19) successfully loaded. See ?R.methodsS3 for help.

Attaching package: ‘plyrmr’

The following objects are masked from ‘package:plyr’:

    mutate, summarize

> rmr.options(backend = 'local')
NULL
> big.mtcars = input(mtcars)
> as.data.frame(big.mtcars)
Error in as.big.data.pipe(x) : could not find function "dfs.mv"

Basic data frame functions fill in

In a quest to run before we walk, we seem to have forgotten basics like nrow, ncol, mean and some such. It is more glamorous to implement quantile, but we can't leave the obvious behind. Please comment here if there are basic functions from base that should be extended to big data sets ASAP.
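
For example, nrow on big data reduces to a per-chunk count plus a sum, using only do and gather from these issues (a sketch, not a commitment to an API):

counts = do(input(mtcars), function(x) data.frame(n = nrow(x)))  # one count per chunk
sum(as.data.frame(gather(counts))$n)                             # 32 for mtcars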

implementation of nested %|%

It doesn't do the right thing. For instance, this should fail:

4 %|% sqrt(9 %|% sqrt(..))

but it simply ignores the 4. Maybe the recursive implementation of find.. needs to be recovered and improved.

backend unsettled at start

It is set to "hadoop" but appears to be set to a list of 2 elements when inspected:

> plyrmr.options("backend")
2014-07-24 15:29:54.454 java[57317:1003] Unable to load realm info from SCDynamicStore
14/07/24 15:29:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[[1]]
[1] "local"

[[2]]
[1] "hadoop"

implicit as.data.frame for guaranteed small result operations

So far the plyrmr approach has been the same as rmr2's, which is no implicit DFS-to-master-RAM transfers and back. But what if the result of an operation is guaranteed to fit in main memory? Take gather: it groups everything into one group, hence the result must fit in memory for the last reduce call. No! gather could be composed with another grouping operation, hence there could be multiple reduce calls, hence the output might not fit in RAM. Score one for orthogonality! Anyway, let's document here that this possibility has been considered but so far rejected. Unlike SparkR's Reduce, which will zap any existing grouping, gather plays nicely with existing grouping (this allows writing operations like quantile that work for grouped and ungrouped data and use gather; the alternative would be, as it was in the past, for operations to check the grouping state explicitly before applying a gather, but that's boilerplate).

Installation error

I am trying to install plyrmr and it asks for the package "pryr" as a dependency. I searched the internet and couldn't find any R package named "pryr"; it might be plyr.
[image]

embed mergeability information

The idea is that we don't want to call mergeable all the time when passing an argument to do; it would then be an argument to do, not a higher-order function. We need to find all the mergeable data frame operations in plyrmr (not sure there are any) and outside it, at least the important ones. The ones we own need to have the mergeable attribute set; for the ones outside, one possibility is to shadow them, the other is to special-case them in is.mergeable. The former is cleaner, the latter probably less intrusive (see the sketch below).
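
A sketch of the attribute convention and the shadowing route (the definitions here only illustrate the idea):

mergeable = function(f) { attr(f, "mergeable") = TRUE; f }
is.mergeable = function(f) isTRUE(attr(f, "mergeable"))
# shadowing an outside operation we know to be mergeable:
colSums.df = mergeable(function(x) as.data.frame(t(colSums(x))))
is.mergeable(colSums.df)  # TRUE, so do could turn on the combiner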

Doc bug in docs/tutorial.md

In docs/tutorial.md, I see the following in the middle of the page:

...we need the output call, as in:

This is the real deal: we have performed a computation on the cluster, in parallel, and the data...

It looks like there's supposed to be a code example somewhere between those two paragraphs; did it go missing in a patch somewhere?

Conundrum applying magic wand to summarize

The idea is that data frame operations used in plyrmr can be marked with the mergeable attribute when amenable to the application of a combiner, with no further user action needed. magic.wand allows creating methods with this attribute. Unfortunately, summarise is general enough that it can be sometimes mergeable and sometimes not. transmute takes a .mergeable argument to delay this decision. Example:

summarize(mtcars, sum(cyl)) #mergeable
summarize(mtcars, mean(cyl)) #not mergeable

Functions can be marked mergeable with the higher-order function mergeable, and run will inspect the mergeable attribute and act accordingly. Maybe we need a mergeable.function and a mergeable.pipe method of a generic mergeable, and delay this decision to every time summarise is called, that is:

mergeable(summarize(mtcars, sum(cyl))) #mergeable
summarize(mtcars, mean(cyl)) #not mergeable
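
The two-method generic floated above would look something like this sketch:

mergeable = function(x, ...) UseMethod("mergeable")
mergeable.function = function(x, ...) { attr(x, "mergeable") = TRUE; x }
mergeable.pipe = function(x, ...) { attr(x, "mergeable") = TRUE; x }  # marks the last op in the pipe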

row names modified

It appears that some row names are sanitized going through do:

as.data.frame(do(input(mtcars), identity))

Spaces become "." and "(" becomes something else. Legal row names should not change unless the user changes them, or when they don't make sense (as in a group.by, where it would make more sense to name rows after the group).

delayed evaluation and environments

Delayed evaluation can have odd interactions with mutable environments. Look at this one

> p = mutate(input(data.frame(x = 1:10)), x = x + delta)
> as.data.frame(p)
     x
X1   1
X2   2
X3   3
X4   4
X5   5
X6   6
X7   7
X8   8
X9   9
X10 10
> delta = -10
> as.data.frame(p)
     x
X1  -9
X2  -8
X3  -7
X4  -6
X5  -5
X6  -4
X7  -3
X8  -2
X9  -1
X10  0

We want the value of delta to be frozen at the time the pipe is defined. There is a workaround: create an environment to hold the delta at pipe-definition time.

> p1 = mutate(input(data.frame(x = 1:10)), (function() {delta = 0; x + delta})())
> as.data.frame(p1)
     x
X1   1
X2   2
X3   3
X4   4
X5   5
X6   6
X7   7
X8   8
X9   9
X10 10
> delta
[1] -10

The question is: can we make this transparent to the user?
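
One candidate, a sketch of the idea rather than of plyrmr internals: snapshot the expression's free variables into a private environment at pipe-definition time and evaluate against that snapshot later.

delta = 1
expr = quote(x + delta)
snap = new.env(); snap$delta = delta  # capture free variables at definition time
delta = -10                           # later changes no longer leak in
eval(expr, list(x = 1:3), snap)       # 2 3 4, using the frozen delta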

more powerful recycling for select

Since select uses data.frame, it can only do whole-number recycling. There's a case for using the more powerful fractional recycling of cbind. Use case: mark the even rows of a data frame.

select(mtcars, even = c(F, T))
works only with an even number of rows.
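
The contrast in plain R: data.frame refuses fractional recycling, while cbind allows it (with a warning):

try(data.frame(x = 1:5, even = c(FALSE, TRUE)))  # error: 5 is not a multiple of 2
cbind(x = 1:5, even = c(FALSE, TRUE))            # recycles the pattern (as 0/1), warns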

fast vectorized summaries

To program plyrmr efficiently in the face of small groups, one has the option to go with vectorized reduce. Then each call to the reducer gets multiple groups as input; in plyrmr this takes the form of a data frame containing multiple groups. The problem is how to process such a data frame. If one goes for a split or some such, it is as slow as the non-vectorized mode, because the split creates many small data frames and is normally followed by an lapply that calls an interpreted R function, so there is no point going down that path.

The idea is to provide a few predefined fast vectorized reducers. The specific form of this idea explored in this issue is to use dplyr to help with that: dplyr has a system for handling summaries that sidesteps the interpreter for simple functions (vaguely described as "handlers" by the dplyr crew).

The general transformation of a non-vectorized reduce into a vectorized one could look like:

plyrmr::do(plyrmr::group(input(path), vars), f) => 
plyrmr::do(plyrmr::group(input(path), vars), function(x) dplyr::do(dplyr::group_by(x, vars), f))

Unfortunately, the semantics of dplyr::do is different from plyrmr::do (it returns lists, weird) and is described by @Romain as work in progress. So the only dplyr function we can use is summarize, AFAICT. The above transformation would then become:

plyrmr::summarize(plyrmr::group(input(path), vars), exp) =>
plyrmr::summarize(plyrmr::group(input(path), vars), function(x) dplyr::summarize(dplyr::group_by(x, vars), exp))

I am not totally clear on what the generality of the handler mechanism is. These are some experiments:

> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), mysum(x)))
   user  system elapsed 
  0.201   0.012   0.211 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), mysum(x)))
   user  system elapsed 
  0.007   0.001   0.008 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), sum(x)))
   user  system elapsed 
  0.090   0.005   0.094 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), sum(x)))
   user  system elapsed 
  0.006   0.000   0.006 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), median(x)))
   user  system elapsed 
  2.589   0.009   2.597 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), median(x)))
   user  system elapsed 
  0.008   0.001   0.008 

We are contrasting the performance on two vs 10^5 groups, while keeping the amount of data fixed. mysum is just function(x) sum(x), a plain interpreted wrapper around sum, which presumably is why it misses the fast path that sum itself hits.

API cleanup

First, select is a misnomer; the fact that SQL calls it that doesn't make it better. We could rename the current select with a name in the tradition of transform and mutate (both of which are way too general for what those functions do): calculate, compute, transmogrify, columns, new.columns, not sure. A select devoted to just selection could be based on dplyr's select, with the -colname and colname:colname syntax, which have different meanings in the current select.
