
rmr2's Introduction

rmr2

A package that allows R developers to use Hadoop MapReduce, developed as part of the RHadoop project. Please see the RHadoop wiki for information.

rmr2's People

Contributors

hughdevlin, jamiefolson, khharut, oreh, piccolbo, russell-datascience


rmr2's Issues

Equijoin multiple files as input.

There is a warning in equijoin: "Doesn't work with multiple inputs like mapreduce".
I am guessing this means that when multiple files are present inside the folder, mapreduce takes them as multiple input arguments.
If this is not supported, would equijoin also be unable to read the output of a reduce job when multiple part files are created?

OO refactor for keyval objects

I would like to propose the definition of a keyval object and an interface for objects that can be keys or values. The goal is to provide an extensible definition of keyval, which right now is limited to containing vectors, lists, matrices or data frames. Not too shabby, but what about sparse matrices? The design is already sketched in the file keyval.R. The functions starting with rmr. define an interface for things that can be keys or values. It actually doesn't have to be the same interface for the two, and I think for the sake of generality the symmetry should be dropped. The functions ending in .keyval define an interface for the keyval class, and hopefully also an implementation that relies completely on the rmr interface to hide the differences between the concrete data structures. There are some exceptions, like the c.or.rbind function, which is an rmr.-type function whose name is maybe too tied to a possible implementation, or key.normalize, which should be part of a key interface as the name suggests. The goal here is to allow the user to mapreduce any data structure that satisfies a small number of reasonable properties, like the ability to be split, unsplit, sliced, measured, serialized and deserialized, and a few more.
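
A minimal sketch of what such an interface could look like, assuming plain S3 generics; the names mirror the rmr./keyval split described above but are illustrative, not the actual rmr2 code:

# Illustrative only: S3 generics for "things that can be keys or values"
rmr.length <- function(x) UseMethod("rmr.length")
rmr.slice <- function(x, r) UseMethod("rmr.slice")  # extract records r
rmr.length.default <- function(x) length(x)
rmr.length.data.frame <- function(x) nrow(x)
rmr.slice.default <- function(x, r) x[r]
rmr.slice.data.frame <- function(x, r) x[r, , drop = FALSE]
# A keyval implementation written purely against these generics would gain
# support for, say, sparse matrices by adding new methods, not new keyval code.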

Should use `read.csv` instead of `read.table` for "csv" format?

Since the rmr2 format is referred to as "csv", shouldn't it actually call read.csv, so that it has the expected default parameters? Of particular importance is comment.char = "", which I spent a surprising amount of time debugging before I finally noticed that rmr actually calls read.table. I think the documentation specifies somewhere that read.table is being called, but I still found it surprising that it's not calling read.csv.
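
A minimal demonstration of the difference; this reflects the documented defaults of base R (read.table defaults comment.char to "#", read.csv to ""), not anything rmr2-specific:

txt <- "id,note\n1,item #1\n2,item #2\n"
# read.table silently truncates each field at the "#" (comment.char = "#")
read.table(text = txt, sep = ",", header = TRUE)
# read.csv (comment.char = "") parses the fields in full
read.csv(text = txt)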

rmr2 failed - Rscript error

I tried to run the rmr2 tutorial on CDH4. I compiled R from source and made it executable by all users. Now I can run the WordCount Java example, but I am stuck at the rmr2 package, because I saw this message first:

Error: java.lang.RuntimeException: Error in configuring object

I wonder how I can debug the rmr2 package; any thoughts?

Here are the details.
Source code:

[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))

Rscript is installed in /usr/bin:

[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript

Error message:

[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_0, Status : FAILED
13/04/01 17:36:12 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_1, Status : FAILED
13/04/01 17:36:14 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_1, Status : FAILED
13/04/01 17:36:23 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_2, Status : FAILED
13/04/01 17:36:27 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_2, Status : FAILED
[each failed attempt prints the same "Error in configuring object" stack trace as above]

13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted
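
A diagnostic sketch, offered as an assumption rather than a known fix: error=13 means the OS user running the map task (often yarn or mapred, not cloudera) cannot execute Rscript or search some directory on its resolved path; /home/cloudera is commonly mode 700, which would block the task user even though Rscript itself is world-executable.

# returns 0 if Rscript is executable by the current user; run as the task user
file.access("/usr/bin/Rscript", mode = 1)
# show the mode of every path component the symlink resolves through
system("namei -m /home/cloudera/software/R/bin/Rscript")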

0-length argument error at keyval

Hi,
I've written a MapReduce job that takes as input the output of a previous MapReduce job.
The first few lines of the job look something like this:

map_1 = function(k, input_data) {
  index = which(k == 1)
  if (length(index) > 0) {
    input_data = do.call("rbind", input_data[index])
    input_data = as.data.frame(input_data)
    if (nrow(input_data) != 0) {
      # some processing
      keyval("constant_key", processed_input_data)
    }
  }
}

Now, in the processing part I've clipped some columns, but there is nothing that would change the number of rows. Yet the following is the error I am getting.

Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
data ENTERING FOR at risk19
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `/usr/lib/hadoop/bin/hadoop dfs -put  /mnt/data/mapred/local/taskTracker/musigma/jobcache/job_201303201047_0032/attempt_201303201047_0032_m_000000_0/work/tmp/RtmpF1qXQp/file708455830e3d c(" new/output/alltags-v33_1/AtRisk/part-00001", " new/output/alltags-v33_1/AtRisk/part-00002")'
Length of Keys = 1
Length of Values = 3
Error in rmr.recycle(k, v) : Can't recycle 0-length argument
Calls: <Anonymous> ... keyval.writer -> format -> recycle.keyval -> keyval -> rmr.recycle
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

Since there is already an if condition that should guard against zero rows, we cannot figure out why the error is being thrown at keyval.
What could be the problem?
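
One possible explanation, stated as an assumption since the processing code is elided: if the processing itself produces a zero-row processed_input_data, keyval is asked to recycle a length-1 key against a 0-length value, which matches the error reported. A tiny illustration:

# keyval recycles the shorter of key and value to the length of the longer...
keyval(1, 1:3)                        # fine: the key is recycled to length 3
# ...which is impossible when one side is 0-length and the other is not;
# this is expected to fail with the "Can't recycle 0-length argument" error
keyval("constant_key", data.frame())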

rmr with output.format="native" failing with java.io.EOFException

I'm using rmr2 and I'd like to serialize multiple randomForest models to HDFS.

reducer <- function(k, v) {
  rf <- randomForest(formula = model.formula,
                     data = v,
                     na.action = na.roughfix,
                     ntree = number.trees,
                     do.trace = TRUE)
  keyval(k, list(forest = rf))
}

I'm calling the reducer like this

mapreduce(input = "train_clean.csv",
          input.format = titanic.input.format,
          map = mapper,
          reduce = reducer,
          output.format = "native",
          output = "titanic-out")

When I run this, the reducers fail like this:

2013-05-02 08:18:27,372 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2013-05-02 08:18:27,375 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376

If I change output.format to "text", the same code works.
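
A hedged workaround sketch, not from the rmr2 docs: serialize each forest to a raw vector yourself, so the "native" writer only ever sees a list of raw vectors, and unserialize the values after from.dfs.

reducer <- function(k, v) {
  rf <- randomForest(formula = model.formula, data = v,
                     na.action = na.roughfix, ntree = number.trees)
  # serialize() with connection = NULL returns the forest as a raw vector
  keyval(k, list(serialize(rf, connection = NULL)))
}
# later, assuming values() yields the raw vectors back:
# forests <- lapply(values(from.dfs("titanic-out")), unserialize)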

Install: Cannot find hadoop-core jar file in hadoop home

Hi,
While installing rmr2 on a system with CDH4, I get the following message stating "Cannot find hadoop-core jar file in hadoop home".

This seems to be more of a warning, as the package runs as expected regardless.
Is this something that we should be bothered about?

The exact install log is as follows:

* installing to library ‘/usr/local/lib64/R/library’
* installing *source* package ‘rmr2’ ...
** libs
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c extras.cpp -o extras.o
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c hbase-to-df.cpp -o hbase-to-df.o
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c typed-bytes.cpp -o typed-bytes.o
g++ -shared -L/usr/local/lib64 -o rmr2.so extras.o hbase-to-df.o typed-bytes.o -L/usr/local/lib64/R/library/Rcpp/lib -lRcpp -Wl,-rpath,/usr/local/lib64/R/library/Rcpp/lib -L/usr/local/lib64/R/lib -lR
((which hbase && (mkdir -p ../inst; cd hbase-io; sh build_linux.sh; cp build/dist/* ../../inst)) || echo "can't build hbase IO classes, skipping" >&2)
/usr/bin/hbase
build_linux.sh: line 159: [: missing `]'
Using /usr/lib/hadoop as hadoop home
Using /usr/lib/hbase as hbase home

Copying libs into local build directory
ls: cannot access /usr/lib/hadoop/hadoop-*-core.jar: No such file or directory
ls: cannot access /usr/lib/hadoop/hadoop-core-*.jar: No such file or directory
Cannot find hadoop-core jar file in hadoop home
cp: cannot stat `build/dist/*': No such file or directory
can't build hbase IO classes, skipping
installing to /usr/local/lib64/R/library/rmr2/libs
** R
** preparing package for lazy loading
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘quickcheck’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded

* DONE (rmr2)

map side joins

After reading http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance (slide 19 in particular), not only was I reminded of the large performance advantages of map-side joins, but also that they have natural use cases in things like star schemas. Moreover, it seems like an rmr implementation shouldn't be all that difficult. One decision is whether we should hide it behind the regular equijoin interface as an implementation change, with at most an API hint to use the map-side algorithm, or add something to the API. The latter is less conservative and makes the API more complex, but it also allows the big-to-many-small joins typical of a star schema in one step, if all the small tables fit in memory, which skips persisting one intermediate result per small table.
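
A sketch of the shape a map-side equijoin could take, assuming the small table fits in memory; the paths and the "key" column are illustrative, and this is not an existing rmr2 API:

small <- values(from.dfs("/tmp/small.table"))  # read the small side once
mapreduce(
  input = "/tmp/big.table",
  map = function(k, v)
    # the join happens entirely in the mapper; the big table is never shuffled
    keyval(NULL, merge(v, small, by = "key")))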

from.dfs should use hdfs.getmerge for receiving multiple files

When you use from.dfs to load a directory from hadoop, rmr spawns a separate hadoop dfs -get invocation for each individual file. Instead it could use hdfs.getmerge to do this more efficiently, drastically decreasing the JVM startup overhead.
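
A sketch of the proposed behavior in terms of the underlying command; hdfs.getmerge would wrap something like the call below, and the paths are illustrative:

src <- "/user/me/job-output"  # HDFS directory of part-* files
dest <- tempfile()
# one JVM startup for all part files instead of one "hadoop dfs -get" each
system2("hadoop", c("fs", "-getmerge", src, dest))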

to.dfs(...) fails with "Permission denied"

I am facing the following problem:
these two R commands:

library(rmr2)
small.ints = to.dfs(1:1000)

produce the following output:

library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
small.ints = to.dfs(1:1000)
sh: 1: /usr/local/hadoop: Permission denied
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key

HADOOP_CMD is set to /usr/local/hadoop. The directory should be accessible; I even used chmod -R 777...
I use hadoop v1.1.2, which resides in /usr/local/hadoop.

R is self-compiled v3.0.1 and installed in /usr/local/R.

Are there any good ideas, or can anyone advise what I can try next?
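
A guess at the cause, offered as an assumption: HADOOP_CMD must point at the hadoop executable, not the installation directory; "sh: 1: /usr/local/hadoop: Permission denied" is what the shell prints when asked to execute a directory.

# set before loading rmr2
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
library(rmr2)
small.ints = to.dfs(1:1000)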

rmr2 failed because Rscript error

I tried to run rmr2 tutorial on CDH4, and I compiled source code of R and made it executable by all users. Now I can run WordCount jave code, but stucked at rmr2 package, because I saw this message first:

Error: java.lang.RuntimeException: Error in configuring object

Here are details.
Source code:

[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
input = small.ints,
map = function(k, v) cbind(v, v^2))

Rscript is installed on /usr/bin:

[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript

Error message:

[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:12 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:14 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:23 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:27 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted

configure vectorization based on size

The current setting of N records doesn't generalize from small to large records. What we want is to load as much data as possible in one shot without running out of memory. A size-based limit is actually easier to implement for binary formats than a fixed record count. The short-term approach could be to keep the limit as a number of lines for text formats and make it a number of bytes or MB for binary formats.
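
A back-of-the-envelope sketch of how a size-based setting could be derived from a sample (object.size only approximates the serialized size, and the 10 MB target is illustrative):

records.per.chunk = function(sample.records, target.bytes = 10 * 2^20) {
  # sample.records: a list holding a sample of records;
  # estimate bytes per record, then fit as many records as possible
  # under the target chunk size
  bytes.per.record =
    as.numeric(object.size(sample.records)) / length(sample.records)
  max(1, floor(target.bytes / bytes.per.record))
}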

Fail on mix with zero-length keyvals

According to changes in 2.1, we cannot return NULL keys from the reduce function. Instead, we should use zero-length keyvals. This works with the local backend but fails on hadoop:

WORKS:

rmr.options(backend = "local")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

WORKS:

rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:2, 1:2)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

FAILS:

rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

reduce calls counter counts wrong thing

The counter doesn't report the number of calls; it reports the number of records that go through any number of calls. This doesn't help with the vectorization issue, because what we need to keep in check is the number of reduce calls, not the number of records, which can be big and yet imply no inefficiency when the number of distinct keys is small and the code is vectorized. Unfortunately an efficient fix is not obvious, because the split into distinct keys happens inside the reduce.keyval function, which doesn't return the number of calls to reduce. Moving the counter call inside reduce.keyval would, it seems to me, violate separation of concerns (the keyval concept doesn't depend on any backend or any other part of mapreduce).
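
One way to keep the counter out of reduce.keyval might be to wrap the user's reduce function before handing it over; a hedged sketch, assuming rmr2's increment.counter helper is callable at that point (the group and counter names are illustrative):

count.reduce.calls = function(reduce)
  function(k, v) {
    # one increment per actual invocation of the user's reduce
    increment.counter("rmr", "reduce calls", 1)
    reduce(k, v)
  }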

Duplicate values returned from the mapper under rmr 2.1.0

opening on behalf of @everdark

Hi,

Recently I updated the package to 2.1.0 (from 1.3.1!) and found something unexpected in even the simplest form of a mapreduce job. Here it is:

test <- from.dfs(
  mapreduce(
    input = fname.sample,
    map = function(., obs) keyval(NULL, 1),
    input.format = make.input.format(format = "csv", sep = ","),
    reduce = NULL,
    combine = NULL))
test

where fname.sample is a string indicating the path of a .csv file stored in Hadoop.
The result given was:

test
$key
NULL

$val
[1] 1 1

which is quite weird; I can't understand what's going on.
Why did the values get duplicated? The same thing happens in my code for a more serious job (where the key unnecessarily recycles...), which in turn makes the result quite unpredictable.

Does anyone have ideas on this issue?

The complete log in the R console is as follows.

packageJobJar: [/tmp/RtmpVpDPTq/rmr-local-env16ec54892904, /tmp/RtmpVpDPTq/rmr-global-env16ec49ffa7b3, /tmp/RtmpVpDPTq/rmr-streaming-map16ec5fd2c921, /tmp/hadoop-mis/hadoop-unjar6866723637623951400/] [] /tmp/streamjob7363209296587117900.jar tmpDir=null
13/03/15 00:09:31 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/15 00:09:32 INFO streaming.StreamJob: getLocalDirs(): [/data/hadoop/mapred/temp/]
13/03/15 00:09:32 INFO streaming.StreamJob: Running job: job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: To kill this job, run:
13/03/15 00:09:32 INFO streaming.StreamJob: /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=s1dhd02.buyabs.corp:8021 -kill job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: Tracking URL: XXXXXX
13/03/15 00:09:33 INFO streaming.StreamJob: map 0% reduce 0%
13/03/15 00:09:43 INFO streaming.StreamJob: map 100% reduce 0%
13/03/15 00:09:45 INFO streaming.StreamJob: map 100% reduce 100%
13/03/15 00:09:45 INFO streaming.StreamJob: Job complete: job_201302201645_2649
13/03/15 00:09:45 INFO streaming.StreamJob: Output: /tmp/RtmpVpDPTq/file16ec62d597d8
13/03/15 00:09:52 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:52 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:54 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor

rbind.fill must be a data frame

Hi,

I was trying to use the equijoin function of rmr 2.1. While using equijoin I encountered the error "rbind.fill must be a data frame". Only a few of my reduce tasks were affected by this error.
I checked the code, and my suspicion was that keys might be passed with NA or null values; I checked, and that wasn't the case.
Then I even coerced my object to a data frame and ran the equijoin again; it still fails.
Can you help me figure out the possible other causes for this problem?

Thanks,
Mayank

OO refactor for big data objects

Right now a big data object (returned by mapreduce when the output is NULL) is an alternative to an explicit HDFS path with some interesting properties (it is garbage collected). I would like to explore the possibility of turning this concept into a class, to unify support of different I/O possibilities and better hide the implementation. A big data object could be an explicit path, a temporary garbage-collected file or directory, a managed file that is refreshed when it is stale compared to its inputs and generating function, an HBase table, and what not. It would incorporate a notion of format, which would no longer be handled separately.
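
A hedged S4 sketch of the shape such a class could take; every name here is illustrative, not an actual rmr2 API:

library(methods)

setClass("big.data.object",
         representation(
           path = "character",     # explicit or temporary HDFS path
           format = "ANY",         # I/O format, folded into the object
           temporary = "logical",  # garbage-collect the path when TRUE
           managed = "logical"))   # refresh when stale w.r.t. inputs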

[Error] segfault: memory not mapped

I wrote some sample code for equijoin, and I am getting a 'segfault: memory not mapped' error. My R session crashes if I run it from RStudio, but from plain R I can exit gracefully (please see the output below).

The code I am trying to run is:

library("rmr2")

authors = data.frame(
    surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    deceased = c("yes", rep("no", 4)))

books = data.frame(
    name = I(c("Tukey", "Venables", "Tierney",
               "Ripley", "Ripley", "McNeil", "R Core")),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA, "Ripley", NA, NA, NA, NA,
                     "Venables & Smith"))

to.dfs(kv=authors, output="authors.csv", format=make.output.format("csv", sep=","))
to.dfs(kv=books, output="books.csv", format=make.output.format("csv", sep=","))

eqj = function(left.input="authors.csv", right.input="books.csv", output=NULL) {
    map.left  = function(k, v) {
        names(v) = c("surname", "nationality", "deceased")
        keyval(v[, "surname"], v[, drop=FALSE])
    }

    map.right = function(k, v) {
        names(v) = c("name", "title", "other.author")
        keyval(v[, "name"], v[, "title", drop=FALSE])
    }

    equijoin(left.input="authors.csv", right.input="books.csv",
             input.format=make.input.format("csv", sep=",", as.is=TRUE),
             outer="left", map.left=map.left, map.right=map.right)
}

merged = eqj()

The output while running through R is:

> library("rmr2")
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3))))

 *** caught segfault ***
address 0x8, cause 'memory not mapped'

Traceback:
 1: .Call("typedbytes_writer", objects, native, PACKAGE = "rmr2")
 2: writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),     con)
 3: typedbytes.writer(interleave(keys(kvs), values(kvs)), con, native)
 4: format(kv, con)
 5: keyval.writer(kv)
 6: write.file(kv, tmp)
 7: to.dfs(keyval(1:10, 1:10^2))
 8: xor(!is.null(left.input), !is.null(input) && (is.null(left.input) ==     is.null(right.input)))
 9: stopifnot(xor(!is.null(left.input), !is.null(input) && (is.null(left.input) ==     is.null(right.input))))
10: equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10,     1:10^3)))
11: to.dfs.path(input)
12: from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)),     right.input = to.dfs(keyval(1:10, 1:10^3))))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3
[user@localhost ~]$

Any clue on why this is happening?

secondary keys

see https://issues.apache.org/jira/browse/HADOOP-5528
use hashing to make it more user-friendly

With the current reduce interface this doesn't matter that much, because one can sort the values associated with a key in memory. If we revisit the idea of an iterator-type interface for reduce (for when values are big and cannot be held in memory more than a few at a time), then this will go on the short track. The reference to hashing above means the following. Since this binary partitioner is very low level and only allows specifying the keys as a number of bytes to consider or skip, it would be very hard to support complex keys and provide a user-friendly API. If we take two lists of primary and secondary keys, hash the former and then prepend the hash to the key, we can use this simple binary partitioner even with complex keys. The next hurdle, though, is ordering, as the byte ordering that Java would perform is unlikely to be the correct ordering for the original key domain. An additional hurdle is the efficient implementation of all of this. One wonders why the author of the patch for the above issue didn't use typedbytes serialization for this case, as he did with the multiplefileoutputformat. In that case the key is an ArrayList whose first element is the filename and whose second is the actual key. The same could have been done for primary and secondary keys, and we should consider submitting our own patch to make it that way.
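
A hedged sketch of the hashing idea, using the digest package (crc32 is chosen only because it yields a short, fixed-width prefix):

library(digest)

# prepend a fixed-width hash of the primary key, so the low-level binary
# partitioner can split on the leading bytes even for complex keys
composite.key = function(primary, secondary)
  mapply(
    function(p, s) list(digest(p, algo = "crc32"), p, s),
    primary, secondary,
    SIMPLIFY = FALSE)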

The dumbo author comments

No grand reasons for it really, it just seemed sensible to keep it general/low-level. Writing a custom partitioner using typed bytes isn't hard though, e.g.:

https://github.com/klbostee/feathers/blob/ae854f6b4f78fc42e8b3fbb8e216319cbdae1343/src/partition/Prefix.java

Naming a mapreduce job

I'd like to be able to specify the name of my mapreduce job, to identify it in the JobTracker among all the others.

While this can be done by specifying "mapred.job.name" in the backend.parameters, that feature is deprecated, as discussed in #9.
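
For reference, the deprecated route looks roughly like this (a hedged example; in.path and my.map are placeholders and the exact property name depends on the Hadoop version):

mapreduce(input = in.path,
          map = my.map,
          backend.parameters = list(
            hadoop = list(D = "mapred.job.name=my.rmr.job")))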

`backend.parameters` for `rmr.options`?

If backend.parameters is considered deprecated, what is the best practice for rmr-wide options? For example, a multi-purpose Hadoop cluster is unlikely to have its configuration optimized for R tasks. In particular, settings like mapred.child.java.opts are likely tuned for Java jobs, allocating a large amount of memory to the JVM. mapreduce jobs need either to reduce the maximum heap space allocated to the JVM (128 MB is as low as I could go) or to increase mapred.job.map.memory.mb, which is likely to make everyone else angry at you, since it is probably configured for your cluster's specific hardware to allow efficient distribution of tasks.

Is there/should there be another way to set rmr Hadoop parameters (as opposed to job-specific parameters)?
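
Lacking an rmr-wide mechanism, the only route I know of is the same per-job one (hedged sketch; property names vary across Hadoop versions, and in.path and my.map are placeholders):

mapreduce(input = in.path,
          map = my.map,
          backend.parameters = list(
            hadoop = list(D = "mapred.child.java.opts=-Xmx128m")))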

Converting to.dfs argument to keyval with a NULL key

Hi,

I was using the rmr 2.1 package, and while calling to.dfs I encountered the error below:

Converting to.dfs argument to keyval with a NULL key
log4j:ERROR Could not find value for key log4j.appender.

This error is specific to rmr 2.1; the same commands run fine on rmr 2.0.2.
Is there any reason for this to fail in rmr 2.1? Also, what could fix this issue?

Thanks,
Mayank

backend.parameters is deprecated.

Hi,

I was reading the earlier issues about backend.parameters being deprecated and your plan to remove it from rmr 2.1.
I think there is a case for keeping the backend.parameters option. I agree that on a properly set up cluster it is not required, but security and access restrictions sometimes make it hard to tell a misconfigured cluster apart from erroneous code, and in those cases backend.parameters helps you figure it out.
Recently I had an issue where only one reducer was being spawned; it turned out to be a misconfigured cluster, which I was able to guess only after setting the number of reduce tasks explicitly through backend.parameters.
So I would really appreciate it if you kept that option. It is helpful.

Custom Directory for temp Files on HDFS

rmr2 uses the tempdir() function for creating a name for a temporary directory on HDFS. While it is possible to modify the base directory by setting the environment variable TMPDIR (or alternatively TMP or TEMP) before starting R, there are some limitations. In particular, the problem is that the directory containing temp files has to be the same both on the local file system as well as on HDFS. The following paragraph illustrates this with an example.

For example, I cannot set the environment variable TMPDIR to a location that exists only on HDFS, e.g. /user/hduser/tmp, because there is no /user/hduser/tmp on my local file system. (R does not allow changing the temp dir unless the directory exists on the local file system and is writable.) This is a problem in itself; furthermore, it gets exacerbated if I also need to use tempdir() for my own code.

There is also another reason why changing any of TMPDIR, TMP, or TEMP might not be a good idea: It also affects other parts of the system. For example, it also affects LXDE and Openbox; if you set any of those variables to a non-existing local directory, your background wallpaper will not load after logging in, and you also cannot change the desktop preferences any more.
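
A small illustration of the constraint (the HDFS path is hypothetical; run in a fresh session):

# tempdir() is fixed at session startup from TMPDIR/TMP/TEMP,
# so changing the variable from within R is too late:
Sys.setenv(TMPDIR = "/user/hduser/tmp")
tempdir()   # still the original location

# and setting TMPDIR before startup only takes effect if the same path
# also exists and is writable on the local file system:
dir.create("/user/hduser/tmp", recursive = TRUE)   # local twin of the HDFS path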

Multiple dataframe outputs from the same reducer

Picking up from RevolutionAnalytics/RHadoop#177

It does seem a bit complicated to do. Your data is structured, but has two structures, so we need to deal with it as unstructured data, that is, to use lists as arguments to keyval. One way of doing that is to split the data frames by the key you want, as in

splitcars = split(mtcars, mtcars$mpg)
keyval(names(splitcars), splitcars)

The same goes for the other type of frame. One caveat is that if you have many distinct keys this is going to create lots of small data frames, which is very inefficient; rmr2 has the same limitation, which comes straight from R. I am not sure I understand the overall goals of your program, so if this doesn't do it, please give me a bit more context.
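
For the two-structures case specifically, one hedged possibility is to tag each data frame with its kind in the key, so a single reducer emits both and a consumer can tell them apart:

reduce = function(k, v) {
  # v is the data frame of records for key k; emit a summary frame
  # and the detail frame under distinguishable composite keys
  summary.df = data.frame(key = k, n = nrow(v))
  keyval(list(list(k, "summary"), list(k, "detail")),
         list(summary.df, v))
}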

Output format providing trailing "\t"

Hi,

I was using the output.format argument for my mapreduce output with rmr 2.1. If I specify my output format as

output.format = make.output.format("csv", sep = ",")

the file gets created with no issue. But if I specify it as

output.format = make.output.format("csv", sep = "|")

the last column of my data frame has a "\t" appended at the end of each row.
Is there a reason for the "\t" to be present?

Thanks,
Mayank

Optimizer or planner for mapreduce jobs

It is common practice to apply transformations to mapreduce programs that change the number and nature of the jobs involved, usually to minimize I/O while preserving the same function. This is done in Hive, Pig, and Cascading, for example. In rmr it is a little more challenging because

  • the I/O-bound assumption behind, for instance, the Cascading optimizer (called a planner in that context) is not necessarily true for complex analytics programs.
  • the variety of programs that can be written with rmr. It is not a little crippled special-purpose language; it allows the full power of R, so it's going to be difficult to apply general transformations while preserving semantic equivalence.
  • the unavailability of some advanced Java-only features such as multiple output formats.

On the positive side are the reflection capabilities of R, which allow one to inspect the parse tree, for instance. A little example of what could be done is in a function named optimize in the source, completely untested. The only optimization applied is to collapse a chain of mapreduce calls with a reduce only at the end into a single mapreduce job by composing the mappers.
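
A minimal sketch of that single optimization, assuming map functions with the standard (k, v) signature and rmr2's keys() and values() accessors:

# fuse two map functions into one, so two map-only jobs become one job
compose.maps = function(map1, map2)
  function(k, v) {
    kv = map1(k, v)
    if (is.null(kv)) NULL
    else map2(keys(kv), values(kv))
  }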

Negative length vector in .Call("typed_bytes_reader", data, nobjs, PACKAGE = "rmr2")

Hi,

I'm getting the following error while running MapReduce jobs in rmr2 v2.0.2. The output of the standard error stream is as follows:

Loading required package: rmr2
Loading required package: Rcpp
Loading required package: methods
Loading required package: int64
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Warning: NAs introduced by coercion
Error in .Call("typed_bytes_reader", data, nobjs, PACKAGE = "rmr2") :
  negative length vectors are not allowed
Calls: ... keyval.reader -> format -> typed.bytes.reader -> .Call
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
 at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
 at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:502)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Can you throw some light on this issue?
Thanks for your help.

rmr2 silently loses data when presented with a csv input file that has a header.

Here is some code:

#!/usr/bin/env Rscript

#
# Based on Breen's Example 2: airline
#

library(rmr2)

# assumes 'airline' and airline/data exists on HDFS under user's home directory

hdfs.data.root = 'airline'
hdfs.data = file.path(hdfs.data.root, 'data')

# unless otherwise specified, directories on HDFS should be relative to user's home
hdfs.out.root = hdfs.data.root
hdfs.out = file.path(hdfs.out.root, 'out')

mapper.year.market.enroute_time = function(k, fields) {
  # Skip header line in csv formatted file
  if (!(as.character(fields[[1]]) == "Year")) {
    keyval(as.character(fields[[9]]), 1)
  }
}

reducer.year.market.enroute_time = function(key, vv) {
  # count values for each key
  keyval(key, sum(as.numeric(vv), na.rm = TRUE))
}

mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep = ","),
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time)
}

out = from.dfs(mr.year.market.enroute_time(hdfs.data, hdfs.out))
results.df = as.data.frame(out, stringsAsFactors = F)
colnames(results.df) = c('carrier', 'count')
print(results.df)

Here is a sample csv file:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2004,3,25,4,848,840,1241,1225,HA,1,N587HA,353,345,218,16,8,LAX,HNL,2556,4,11,0,,0,16,0,0,0,0
2004,3,25,4,1426,1425,2135,2140,HA,2,N592HA,309,315,402,-5,1,HNL,LAX,2556,10,17,0,,0,0,0,0,0,0
2004,3,25,4,1222,1220,1551,1605,HA,3,N583HA,329,345,192,-14,2,LAX,HNL,2556,4,13,0,,0,0,0,0,0,0
2004,3,25,4,2220,2225,524,525,HA,4,N583HA,304,300,400,-1,-5,HNL,LAX,2556,5,19,0,,0,0,0,0,0,0
2004,3,25,4,1016,1010,1431,1430,HA,7,N591HA,375,380,228,1,6,LAS,HNL,2762,3,24,0,,0,0,0,0,0,0
2004,3,25,4,2243,2250,617,615,HA,8,N584HA,334,325,434,2,-7,HNL,LAS,2762,7,13,0,,0,0,0,0,0,0
2004,3,25,4,1717,1725,2046,2110,HA,9,N584HA,329,345,196,-24,-8,LAX,HNL,2556,6,7,0,,0,0,0,0,0,0

Approximately half of the real rows of the file will be "missed" when processed by the script. There is no warning or error message.
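
One plausible reading, offered here only as a hedged guess: rmr2 hands the mapper a data frame containing many rows at a time, so the if() above tests only the first row of each chunk, and any chunk whose first row happens to be the header is dropped wholesale. Under that assumption, a vectorized filter avoids the loss:

# hedged fix sketch: test every row of the chunk, not just the first
mapper.year.market.enroute_time = function(k, fields) {
  keep = as.character(fields[[1]]) != "Year"
  keyval(as.character(fields[[9]][keep]), rep(1, sum(keep)))
}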
