
rmr2's Introduction

rmr2

A package that allows R developers to use Hadoop MapReduce, developed as part of the RHadoop project. Please see the RHadoop wiki for information.

rmr2's People

Contributors

hughdevlin, jamiefolson, khharut, oreh, piccolbo, russell-datascience


rmr2's Issues

Equijoin multiple files as input.

There is a warning in equijoin: "Doesn't work with multiple inputs like mapreduce".
I am guessing this means that when multiple files are present inside the folder, mapreduce takes them as multiple input arguments.
If this is not supported, would equijoin also be unable to read the output of a reduce job when multiple part files are created?

OO refactor for keyval objects

I would like to propose the definition of a keyval object and an interface for objects that can be keys or values. The goal is to provide an extensible definition of keyval, which right now is limited to containing vectors, lists, matrices or data frames. Not too shabby, but what about sparse matrices? The design is already sketched in the file keyval.R. The functions starting with rmr. define an interface for things that can be keys or values. It actually doesn't have to be the same interface for the two, and I think for the sake of generality the symmetry should be dropped. The functions ending in .keyval define an interface for the keyval class, and hopefully also an implementation that relies completely on the rmr interface to hide the differences between the concrete data structures. There are some exceptions, like the c.or.rbind function, which is an rmr.-type function whose name is maybe too tied to a possible implementation, or key.normalize, which should be part of a key interface as the name suggests. The goal here is to allow the user to mapreduce any data structure that satisfies a small number of reasonable properties, like the ability to be split, unsplit, sliced, measured, serialized and deserialized, and a few more.
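
A minimal sketch of what such an interface could look like, assuming plain S3 generics; the names mirror the rmr./keyval split described above but are illustrative, not the actual rmr2 code:

# Illustrative only: S3 generics for "things that can be keys or values"
rmr.length <- function(x) UseMethod("rmr.length")
rmr.slice <- function(x, r) UseMethod("rmr.slice")  # extract records r
rmr.length.default <- function(x) length(x)
rmr.length.data.frame <- function(x) nrow(x)
rmr.slice.default <- function(x, r) x[r]
rmr.slice.data.frame <- function(x, r) x[r, , drop = FALSE]
# A keyval implementation written purely against these generics would gain
# support for, say, sparse matrices by adding new methods, not new keyval code.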

Should use `read.csv` instead of `read.table` for "csv" format?

Since the rmr2 format is referred to as "csv", shouldn't it actually call read.csv, so that it has the expected default parameters? Of particular importance is comment.char = "", which I spent a surprising amount of time debugging before I finally noticed that rmr actually calls read.table. I think the documentation specifies somewhere that read.table is being called, but I still found it surprising that it's not calling read.csv.
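
A minimal demonstration of the difference; this reflects the documented defaults of base R (read.table defaults comment.char to "#", read.csv to ""), not anything rmr2-specific:

txt <- "id,note\n1,item #1\n2,item #2\n"
# read.table silently truncates each field at the "#" (comment.char = "#")
read.table(text = txt, sep = ",", header = TRUE)
# read.csv (comment.char = "") parses the fields in full
read.csv(text = txt)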

rmr2 failed - Rscript error

I tried to run the rmr2 tutorial on CDH4. I compiled R from source and made it executable by all users. Now I can run the WordCount Java example, but I am stuck at the rmr2 package, because I saw this message first:

Error: java.lang.RuntimeException: Error in configuring object

I wonder how I can debug the rmr2 package; any thoughts?

Here are the details.
Source code:

[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))

Rscript is installed in /usr/bin:

[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript

Error message:

[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_0, Status : FAILED
13/04/01 17:36:12 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_1, Status : FAILED
13/04/01 17:36:14 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_1, Status : FAILED
13/04/01 17:36:23 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_2, Status : FAILED
13/04/01 17:36:27 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_2, Status : FAILED
[each failed attempt prints the same "Error in configuring object" stack trace as above]

13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted
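
A diagnostic sketch, offered as an assumption rather than a known fix: error=13 means the OS user running the map task (often yarn or mapred, not cloudera) cannot execute Rscript or search some directory on its resolved path; /home/cloudera is commonly mode 700, which would block the task user even though Rscript itself is world-executable.

# returns 0 if Rscript is executable by the current user; run as the task user
file.access("/usr/bin/Rscript", mode = 1)
# show the mode of every path component the symlink resolves through
system("namei -m /home/cloudera/software/R/bin/Rscript")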

0-length argument error at keyval

Hi,
I've written a MapReduce job that takes as input the output of a previous MapReduce job.
The first few lines of the job look something like this:

map_1 = function(k, input_data) {
  index = which(k == 1)
  if (length(index) > 0) {
    input_data = do.call("rbind", input_data[index])
    input_data = as.data.frame(input_data)
    if (nrow(input_data) != 0) {
      # some processing
      keyval("constant_key", processed_input_data)
    }
  }
}

Now, in the processing part I've clipped some columns, but there is nothing that would change the number of rows. Yet the following is the error I am getting.

Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
data ENTERING FOR at risk19
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `/usr/lib/hadoop/bin/hadoop dfs -put  /mnt/data/mapred/local/taskTracker/musigma/jobcache/job_201303201047_0032/attempt_201303201047_0032_m_000000_0/work/tmp/RtmpF1qXQp/file708455830e3d c(" new/output/alltags-v33_1/AtRisk/part-00001", " new/output/alltags-v33_1/AtRisk/part-00002")'
Length of Keys = 1
Length of Values = 3
Error in rmr.recycle(k, v) : Can't recycle 0-length argument
Calls: <Anonymous> ... keyval.writer -> format -> recycle.keyval -> keyval -> rmr.recycle
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

Since there is already an if condition that should guard against zero rows, we cannot figure out why the error is being thrown at keyval.
What could be the problem?
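
One possible explanation, stated as an assumption since the processing code is elided: if the processing itself produces a zero-row processed_input_data, keyval is asked to recycle a length-1 key against a 0-length value, which matches the error reported. A tiny illustration:

# keyval recycles the shorter of key and value to the length of the longer...
keyval(1, 1:3)                        # fine: the key is recycled to length 3
# ...which is impossible when one side is 0-length and the other is not;
# this is expected to fail with the "Can't recycle 0-length argument" error
keyval("constant_key", data.frame())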

rmr with output.format="native" failing with java.io.EOFException

I'm using rmr2 and I'd like to serialize multiple randomForest models to HDFS.

reducer <- function(k, v) {
  rf <- randomForest(formula = model.formula,
                     data = v,
                     na.action = na.roughfix,
                     ntree = number.trees,
                     do.trace = TRUE)
  keyval(k, list(forest = rf))
}

I'm calling the reducer like this

mapreduce(input = "train_clean.csv",
          input.format = titanic.input.format,
          map = mapper,
          reduce = reducer,
          output.format = "native",
          output = "titanic-out")

When I run this, the reducers fail like this:

2013-05-02 08:18:27,372 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2013-05-02 08:18:27,375 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376

If I change output.format to "text", the same code works.
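
A hedged workaround sketch, not from the rmr2 docs: serialize each forest to a raw vector yourself, so the "native" writer only ever sees a list of raw vectors, and unserialize the values after from.dfs.

reducer <- function(k, v) {
  rf <- randomForest(formula = model.formula, data = v,
                     na.action = na.roughfix, ntree = number.trees)
  # serialize() with connection = NULL returns the forest as a raw vector
  keyval(k, list(serialize(rf, connection = NULL)))
}
# later, assuming values() yields the raw vectors back:
# forests <- lapply(values(from.dfs("titanic-out")), unserialize)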

Install: Cannot find hadoop-core jar file in hadoop home

Hi,
While installing rmr2 on a system with CDH4, I get the following message stating "Cannot find hadoop-core jar file in hadoop home".

This seems to be more of a warning, as the package runs as expected regardless.
Is this something that we should be bothered about?

The exact install log is as follows:

* installing to library ‘/usr/local/lib64/R/library’
* installing *source* package ‘rmr2’ ...
** libs
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c extras.cpp -o extras.o
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c hbase-to-df.cpp -o hbase-to-df.o
g++ -I/usr/local/lib64/R/include -DNDEBUG  -I/usr/local/include   `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic  -g -O2  -c typed-bytes.cpp -o typed-bytes.o
g++ -shared -L/usr/local/lib64 -o rmr2.so extras.o hbase-to-df.o typed-bytes.o -L/usr/local/lib64/R/library/Rcpp/lib -lRcpp -Wl,-rpath,/usr/local/lib64/R/library/Rcpp/lib -L/usr/local/lib64/R/lib -lR
((which hbase && (mkdir -p ../inst; cd hbase-io; sh build_linux.sh; cp build/dist/* ../../inst)) || echo "can't build hbase IO classes, skipping" >&2)
/usr/bin/hbase
build_linux.sh: line 159: [: missing `]'
Using /usr/lib/hadoop as hadoop home
Using /usr/lib/hbase as hbase home

Copying libs into local build directory
ls: cannot access /usr/lib/hadoop/hadoop-*-core.jar: No such file or directory
ls: cannot access /usr/lib/hadoop/hadoop-core-*.jar: No such file or directory
Cannot find hadoop-core jar file in hadoop home
cp: cannot stat `build/dist/*': No such file or directory
can't build hbase IO classes, skipping
installing to /usr/local/lib64/R/library/rmr2/libs
** R
** preparing package for lazy loading
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘quickcheck’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded

* DONE (rmr2)

map side joins

After reading http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance (slide 19 in particular), not only was I reminded of the large performance advantages of map-side joins, but also that they have natural use cases in things like star schemas. Moreover, it seems like an rmr implementation shouldn't be all that difficult. One decision is whether we should hide it behind the regular equijoin interface as an implementation change, with at most an API hint to use the map-side algorithm, or add something to the API. The latter is less conservative and makes the API more complex, but it also allows the big-to-many-small joins typical of a star schema in one step, if all the small tables fit in memory, which skips persisting one intermediate result per small table.
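
A sketch of the shape a map-side equijoin could take, assuming the small table fits in memory; the paths and the "key" column are illustrative, and this is not an existing rmr2 API:

small <- values(from.dfs("/tmp/small.table"))  # read the small side once
mapreduce(
  input = "/tmp/big.table",
  map = function(k, v)
    # the join happens entirely in the mapper; the big table is never shuffled
    keyval(NULL, merge(v, small, by = "key")))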

from.dfs should use hdfs.getmerge for receiving multiple files

When you use from.dfs to load a directory from hadoop, rmr spawns a separate hadoop dfs -get invocation for each individual file. Instead it could use hdfs.getmerge to do this more efficiently, drastically decreasing the JVM startup overhead.
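
A sketch of the proposed behavior in terms of the underlying command; hdfs.getmerge would wrap something like the call below, and the paths are illustrative:

src <- "/user/me/job-output"  # HDFS directory of part-* files
dest <- tempfile()
# one JVM startup for all part files instead of one "hadoop dfs -get" each
system2("hadoop", c("fs", "-getmerge", src, dest))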

to.dfs(...) fails with "Permission denied"

I am facing the following problem:
these two R commands:

library(rmr2)
small.ints = to.dfs(1:1000)

produce the following output:

library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
small.ints = to.dfs(1:1000)
sh: 1: /usr/local/hadoop: Permission denied
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key

HADOOP_CMD is set to /usr/local/hadoop. The directory should be accessible; I even used chmod -R 777...
I use hadoop v1.1.2, which resides in /usr/local/hadoop.

R is self-compiled v3.0.1 and installed in /usr/local/R.

Are there any good ideas, or can anyone advise what I can try next?
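
A guess at the cause, offered as an assumption: HADOOP_CMD must point at the hadoop executable, not the installation directory; "sh: 1: /usr/local/hadoop: Permission denied" is what the shell prints when asked to execute a directory.

# set before loading rmr2
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
library(rmr2)
small.ints = to.dfs(1:1000)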

rmr2 failed because Rscript error

I tried to run rmr2 tutorial on CDH4, and I compiled source code of R and made it executable by all users. Now I can run WordCount jave code, but stucked at rmr2 package, because I saw this message first:

Error: java.lang.RuntimeException: Error in configuring object

Here are details.
Source code:

[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
input = small.ints,
map = function(k, v) cbind(v, v^2))

Rscript is installed on /usr/bin:

[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript

Error message:

[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:12 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:14 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:23 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:27 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted

configure vectorization based on size

The current setting of N records doesn't generalize from small to large records. What we want is to load as much data as possible in one shot without running out of memory. A size-based limit is actually easier to implement for binary formats than a fixed record count. The short-term approach could be to keep the limit as a number of lines for text formats and make it a number of bytes or MB for binary formats.
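
A back-of-the-envelope sketch of how a size-based setting could be derived from a sample (object.size only approximates the serialized size, and the 10 MB target is illustrative):

records.per.chunk = function(sample.records, target.bytes = 10 * 2^20) {
  # sample.records: a list holding a sample of records;
  # estimate bytes per record, then fit as many records as possible
  # under the target chunk size
  bytes.per.record =
    as.numeric(object.size(sample.records)) / length(sample.records)
  max(1, floor(target.bytes / bytes.per.record))
}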

Fail on mix with zero-length keyvals

According to changes in 2.1, we cannot return NULL keys from the reduce function. Instead, we should use zero-length keyvals. This works with the local backend but fails on hadoop:

WORKS:

rmr.options(backend = "local")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

WORKS:

rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:2, 1:2)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

FAILS:

rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))

reduce calls counter counts wrong thing

The counter doesn't report the number of calls; it reports the number of records that go through any number of calls. This doesn't help with the vectorization issue, because what we need to keep in check is the number of reduce calls, not the number of records, which can be big and yet imply no inefficiency when the number of distinct keys is small and the code is vectorized. Unfortunately an efficient fix is not obvious, because the split into distinct keys happens inside the reduce.keyval function, which doesn't return the number of calls to reduce. Moving the counter call inside reduce.keyval would, it seems to me, violate separation of concerns (the keyval concept doesn't depend on any backend or any other part of mapreduce).
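
One way to keep the counter out of reduce.keyval might be to wrap the user's reduce function before handing it over; a hedged sketch, assuming rmr2's increment.counter helper is callable at that point (the group and counter names are illustrative):

count.reduce.calls = function(reduce)
  function(k, v) {
    # one increment per actual invocation of the user's reduce
    increment.counter("rmr", "reduce calls", 1)
    reduce(k, v)
  }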

Duplicate values returned from the mapper under rmr 2.1.0

opening on behalf of @everdark

Hi,

Recently I updated the package to 2.1.0 (from 1.3.1!) and found something unexpected in even the simplest form of a mapreduce job. Here it is:

test <- from.dfs(
  mapreduce(
    input = fname.sample,
    map = function(., obs) keyval(NULL, 1),
    input.format = make.input.format(format = "csv", sep = ","),
    reduce = NULL,
    combine = NULL))
test

where fname.sample is a string indicating the path of a .csv file stored in Hadoop.
The result given was:

test
$key
NULL

$val
[1] 1 1

which is quite weird; I can't understand what's going on.
Why did the values get duplicated? The same thing happens in my code for a more serious job (where the key unnecessarily recycles...), which in turn makes the result quite unpredictable.

Does anyone have ideas on this issue?

The complete log in the R console is as follows.

packageJobJar: [/tmp/RtmpVpDPTq/rmr-local-env16ec54892904, /tmp/RtmpVpDPTq/rmr-global-env16ec49ffa7b3, /tmp/RtmpVpDPTq/rmr-streaming-map16ec5fd2c921, /tmp/hadoop-mis/hadoop-unjar6866723637623951400/] [] /tmp/streamjob7363209296587117900.jar tmpDir=null
13/03/15 00:09:31 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/15 00:09:32 INFO streaming.StreamJob: getLocalDirs(): [/data/hadoop/mapred/temp/]
13/03/15 00:09:32 INFO streaming.StreamJob: Running job: job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: To kill this job, run:
13/03/15 00:09:32 INFO streaming.StreamJob: /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=s1dhd02.buyabs.corp:8021 -kill job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: Tracking URL: XXXXXX
13/03/15 00:09:33 INFO streaming.StreamJob: map 0% reduce 0%
13/03/15 00:09:43 INFO streaming.StreamJob: map 100% reduce 0%
13/03/15 00:09:45 INFO streaming.StreamJob: map 100% reduce 100%
13/03/15 00:09:45 INFO streaming.StreamJob: Job complete: job_201302201645_2649
13/03/15 00:09:45 INFO streaming.StreamJob: Output: /tmp/RtmpVpDPTq/file16ec62d597d8
13/03/15 00:09:52 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:52 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:54 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor

rbind.fill must be a data frame

Hi,

I was trying to use the equijoin function of rmr 2.1. While using equijoin I encountered the error "rbind.fill must be a data frame". Only a few of my reduce tasks were affected by this error.
I checked the code, and my suspicion was that keys might be passed with NA or null values; I checked, and that wasn't the case.
Then I even coerced my object to a data frame and ran the equijoin again; it still fails.
Can you help me figure out the possible other causes for this problem?

Thanks,
Mayank

OO refactor for big data objects

Right now a big data object (returned by mapreduce when the output is NULL) is an alternative to an explicit HDFS path with some interesting properties (it is garbage collected). I would like to explore the possibility of turning this concept into a class, to unify support of different I/O possibilities and better hide the implementation. A big data object could be an explicit path, a temporary garbage-collected file or directory, a managed file that is refreshed when it is stale compared to its inputs and generating function, an HBase table, and what not. It would incorporate a notion of format, which would no longer be handled separately.
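
A hedged S4 sketch of the shape such a class could take; every name here is illustrative, not an actual rmr2 API:

library(methods)

setClass("big.data.object",
         representation(
           path = "character",     # explicit or temporary HDFS path
           format = "ANY",         # I/O format, folded into the object
           temporary = "logical",  # garbage-collect the path when TRUE
           managed = "logical"))   # refresh when stale w.r.t. inputs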

[Error] segfault: memory not mapped

I wrote some sample code for equijoin, and I am getting a 'segfault: memory not mapped' error. My R session crashes if I run it from RStudio, but from plain R I can exit gracefully (please see the output below).

The code I am trying to run is:

library("rmr2")

authors = data.frame(
    surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    deceased = c("yes", rep("no", 4)))

books = data.frame(
    name = I(c("Tukey", "Venables", "Tierney",
               "Ripley", "Ripley", "McNeil", "R Core")),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA, "Ripley", NA, NA, NA, NA,
                     "Venables & Smith"))

to.dfs(kv=authors, output="authors.csv", format=make.output.format("csv", sep=","))
to.dfs(kv=books, output="books.csv", format=make.output.format("csv", sep=","))

eqj = function(left.input="authors.csv", right.input="books.csv", output=NULL) {
    map.left  = function(k, v) {
        names(v) = c("surname", "nationality", "deceased")
        keyval(v[, "surname"], v[, drop=FALSE])
    }

    map.right = function(k, v) {
        names(v) = c("name", "title", "other.author")
        keyval(v[, "name"], v[, "title", drop=FALSE])
    }

    equijoin(left.input="authors.csv", right.input="books.csv",
             input.format=make.input.format("csv", sep=",", as.is=TRUE),
             outer="left", map.left=map.left, map.right=map.right)
}

merged = eqj()

The output while running through R is:

> library("rmr2")
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3))))

 *** caught segfault ***
address 0x8, cause 'memory not mapped'

Traceback:
 1: .Call("typedbytes_writer", objects, native, PACKAGE = "rmr2")
 2: writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),     con)
 3: typedbytes.writer(interleave(keys(kvs), values(kvs)), con, native)
 4: format(kv, con)
 5: keyval.writer(kv)
 6: write.file(kv, tmp)
 7: to.dfs(keyval(1:10, 1:10^2))
 8: xor(!is.null(left.input), !is.null(input) && (is.null(left.input) ==     is.null(right.input)))
 9: stopifnot(xor(!is.null(left.input), !is.null(input) && (is.null(left.input) ==     is.null(right.input))))
10: equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10,     1:10^3)))
11: to.dfs.path(input)
12: from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)),     right.input = to.dfs(keyval(1:10, 1:10^3))))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3
[user@localhost ~]$

Any clue on why this is happening?

secondary keys

see https://issues.apache.org/jira/browse/HADOOP-5528
use hashing to make it more user-friendly

With the current reduce interface this doesn't matter that much, because one can sort the values associated with a key in memory. If we revisit the idea of an iterator-type interface for reduce (for when values are big and cannot be held in memory more than a few at a time), then this will go on the short track. The reference to hashing above means the following. Since this binary partitioner is very low level and only allows specifying the keys as a number of bytes to consider or skip, it would be very hard to support complex keys and provide a user-friendly API. If we take two lists of primary and secondary keys, hash the former and then prepend the hash to the key, we can use this simple binary partitioner even with complex keys. The next hurdle, though, is ordering, as the byte ordering that Java would perform is unlikely to be the correct ordering for the original key domain. An additional hurdle is the efficient implementation of all of this. One wonders why the author of the patch for the above issue didn't use typedbytes serialization for this case, as he did with the multiplefileoutputformat. In that case the key is an ArrayList whose first element is the filename and whose second is the actual key. The same could have been done for primary and secondary keys, and we should consider submitting our own patch to make it that way.
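
A hedged sketch of the hashing idea, using the digest package (crc32 is chosen only because it yields a short, fixed-width prefix):

library(digest)

# prepend a fixed-width hash of the primary key, so the low-level binary
# partitioner can split on the leading bytes even for complex keys
composite.key = function(primary, secondary)
  mapply(
    function(p, s) list(digest(p, algo = "crc32"), p, s),
    primary, secondary,
    SIMPLIFY = FALSE)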

The dumbo author comments

No grand reasons for it really, it just seemed sensible to keep it general/low-level. Writing a custom partitioner using typed bytes isn't hard though, e.g.:

https://github.com/klbostee/feathers/blob/ae854f6b4f78fc42e8b3fbb8e216319cbdae1343/src/partition/Prefix.java

Naming a mapreduce job

I'd like to be able to specify the name of my mapreduce job, to identify it in the JobTracker among all the others.

While this can be done by specifying "mapred.job.name" in the backend.parameters, that feature is deprecated, as discussed in #9.
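
For reference, the deprecated route looks roughly like this (a hedged example; in.path and my.map are placeholders and the exact property name depends on the Hadoop version):

mapreduce(input = in.path,
          map = my.map,
          backend.parameters = list(
            hadoop = list(D = "mapred.job.name=my.rmr.job")))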

`backend.parameters` for `rmr.options`?

If backend.parameters is considered deprecated, what is the best practice for rmr-wide options? For example, a multi-purpose Hadoop cluster is unlikely to have its configuration optimized for R tasks. In particular, settings like mapred.child.java.opts are likely tuned for Java jobs, allocating a large amount of memory to the JVM. mapreduce jobs need either to reduce the maximum heap space allocated to the JVM (128 MB is as low as I could go) or to increase mapred.job.map.memory.mb, which is likely to make everyone else angry at you, since it is probably configured for your cluster's specific hardware to allow efficient distribution of tasks.

Is there/should there be another way to set rmr Hadoop parameters (as opposed to job-specific parameters)?
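
Lacking an rmr-wide mechanism, the only route I know of is the same per-job one (hedged sketch; property names vary across Hadoop versions, and in.path and my.map are placeholders):

mapreduce(input = in.path,
          map = my.map,
          backend.parameters = list(
            hadoop = list(D = "mapred.child.java.opts=-Xmx128m")))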

Converting to.dfs argument to keyval with a NULL key

Hi,

I was using the rmr 2.1 package, and while calling to.dfs I encountered the error below:

Converting to.dfs argument to keyval with a NULL key
log4j:ERROR Could not find value for key log4j.appender.

This error is specific to rmr 2.1; the same commands run fine on rmr 2.0.2.
Is there any reason for this to fail in rmr 2.1? Also, what could fix this issue?

Thanks,
Mayank

backend.parameters is deprecated.

Hi,

I was reading the earlier issues about backend.parameters being deprecated and your plan to remove it from rmr 2.1.
I think there is a case for keeping the backend.parameters option. I agree that on a properly set up cluster it is not required, but security and access restrictions sometimes make it hard to tell a misconfigured cluster apart from erroneous code, and in those cases backend.parameters helps you figure it out.
Recently I had an issue where only one reducer was being spawned; it turned out to be a misconfigured cluster, which I was able to guess only after setting the number of reduce tasks explicitly through backend.parameters.
So I would really appreciate it if you kept that option. It is helpful.

Custom Directory for temp Files on HDFS

rmr2 uses the tempdir() function for creating a name for a temporary directory on HDFS. While it is possible to modify the base directory by setting the environment variable TMPDIR (or alternatively TMP or TEMP) before starting R, there are some limitations. In particular, the problem is that the directory containing temp files has to be the same both on the local file system as well as on HDFS. The following paragraph illustrates this with an example.

For example, I cannot set the environment variable TMPDIR to a location that exists only on HDFS, e.g. /user/hduser/tmp, because there is no /user/hduser/tmp on my local file system. (R does not allow changing the temp dir unless the directory exists on the local file system and is writable.) This is a problem in itself; furthermore, it gets exacerbated if I also need to use tempdir() for my own code.

There is also another reason why changing any of TMPDIR, TMP, or TEMP might not be a good idea: It also affects other parts of the system. For example, it also affects LXDE and Openbox; if you set any of those variables to a non-existing local directory, your background wallpaper will not load after logging in, and you also cannot change the desktop preferences any more.
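
A small illustration of the constraint (the HDFS path is hypothetical; run in a fresh session):

# tempdir() is fixed at session startup from TMPDIR/TMP/TEMP,
# so changing the variable from within R is too late:
Sys.setenv(TMPDIR = "/user/hduser/tmp")
tempdir()   # still the original location

# and setting TMPDIR before startup only takes effect if the same path
# also exists and is writable on the local file system:
dir.create("/user/hduser/tmp", recursive = TRUE)   # local twin of the HDFS path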

Multiple dataframe outputs from the same reducer

Picking up from RevolutionAnalytics/RHadoop#177

It does seem a bit complicated to do. Your data is structured, but has two structures, so we need to deal with it as unstructured data, that is, to use lists as arguments to keyval. One way of doing that is to split the data frames by the key you want, as in

splitcars = split(mtcars, mtcars$mpg)
keyval(names(splitcars), splitcars)

The same goes for the other type of frame. One caveat is that if you have many distinct keys this is going to create lots of small data frames, which is very inefficient; rmr2 has the same limitation, which comes straight from R. I am not sure I understand the overall goals of your program, so if this doesn't do it, please give me a bit more context.
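
For the two-structures case specifically, one hedged possibility is to tag each data frame with its kind in the key, so a single reducer emits both and a consumer can tell them apart:

reduce = function(k, v) {
  # v is the data frame of records for key k; emit a summary frame
  # and the detail frame under distinguishable composite keys
  summary.df = data.frame(key = k, n = nrow(v))
  keyval(list(list(k, "summary"), list(k, "detail")),
         list(summary.df, v))
}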

Output format providing trailing "\t"

Hi,

I was using the output.format argument for my mapreduce output with rmr 2.1. If I specify my output format as

output.format = make.output.format("csv", sep = ",")

the file gets created with no issue. But if I specify it as

output.format = make.output.format("csv", sep = "|")

the last column of my data frame has a "\t" appended at the end of each row.
Is there a reason for the "\t" to be present?

Thanks,
Mayank

Optimizer or planner for mapreduce jobs

It is common practice to apply transformations to mapreduce programs that change the number and nature of the jobs involved, usually to minimize I/O while preserving the same function. This is done in Hive, Pig, and Cascading, for example. In rmr it is a little more challenging because

  • the I/O-bound assumption behind, for instance, the Cascading optimizer (called a planner in that context) is not necessarily true for complex analytics programs.
  • the variety of programs that can be written with rmr. It is not a little crippled special-purpose language; it allows the full power of R, so it's going to be difficult to apply general transformations while preserving semantic equivalence.
  • the unavailability of some advanced Java-only features such as multiple output formats.

On the positive side are the reflection capabilities of R, which allow one to inspect the parse tree, for instance. A little example of what could be done is in a function named optimize in the source, completely untested. The only optimization applied is to collapse a chain of mapreduce calls with a reduce only at the end into a single mapreduce job by composing the mappers.
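
A minimal sketch of that single optimization, assuming map functions with the standard (k, v) signature and rmr2's keys() and values() accessors:

# fuse two map functions into one, so two map-only jobs become one job
compose.maps = function(map1, map2)
  function(k, v) {
    kv = map1(k, v)
    if (is.null(kv)) NULL
    else map2(keys(kv), values(kv))
  }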

Negative length vector in .Call("typed_bytes_reader", data, nobjs, PACKAGE = "rmr2")

Hi,

I'm getting the following error while running MapReduce jobs in rmr2 v2.0.2. The output of the standard error stream is as follows:

Loading required package: rmr2
Loading required package: Rcpp
Loading required package: methods
Loading required package: int64
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Warning: NAs introduced by coercion
Error in .Call("typed_bytes_reader", data, nobjs, PACKAGE = "rmr2") :
  negative length vectors are not allowed
Calls: ... keyval.reader -> format -> typed.bytes.reader -> .Call
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
 at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
 at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:502)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Can you throw some light on this issue?
Thanks for your help.

rmr2 silently loses data when presented with a csv input file that has a header.

Here is some code:

#!/usr/bin/env Rscript

#
# Based on Breen's Example 2: airline
#

library(rmr2)

# assumes 'airline' and airline/data exists on HDFS under user's home directory

hdfs.data.root = 'airline'
hdfs.data = file.path(hdfs.data.root, 'data')

# unless otherwise specified, directories on HDFS should be relative to user's home
hdfs.out.root = hdfs.data.root
hdfs.out = file.path(hdfs.out.root, 'out')

mapper.year.market.enroute_time = function(k, fields) {
  # Skip header line in csv formatted file
  if (!(as.character(fields[[1]]) == "Year")) {
    keyval(as.character(fields[[9]]), 1)
  }
}

reducer.year.market.enroute_time = function(key, vv) {
  # count values for each key
  keyval(key, sum(as.numeric(vv), na.rm = TRUE))
}

mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep = ","),
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time)
}

out = from.dfs(mr.year.market.enroute_time(hdfs.data, hdfs.out))
results.df = as.data.frame(out, stringsAsFactors = F)
colnames(results.df) = c('carrier', 'count')
print(results.df)

Here is a sample csv file:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2004,3,25,4,848,840,1241,1225,HA,1,N587HA,353,345,218,16,8,LAX,HNL,2556,4,11,0,,0,16,0,0,0,0
2004,3,25,4,1426,1425,2135,2140,HA,2,N592HA,309,315,402,-5,1,HNL,LAX,2556,10,17,0,,0,0,0,0,0,0
2004,3,25,4,1222,1220,1551,1605,HA,3,N583HA,329,345,192,-14,2,LAX,HNL,2556,4,13,0,,0,0,0,0,0,0
2004,3,25,4,2220,2225,524,525,HA,4,N583HA,304,300,400,-1,-5,HNL,LAX,2556,5,19,0,,0,0,0,0,0,0
2004,3,25,4,1016,1010,1431,1430,HA,7,N591HA,375,380,228,1,6,LAS,HNL,2762,3,24,0,,0,0,0,0,0,0
2004,3,25,4,2243,2250,617,615,HA,8,N584HA,334,325,434,2,-7,HNL,LAS,2762,7,13,0,,0,0,0,0,0,0
2004,3,25,4,1717,1725,2046,2110,HA,9,N584HA,329,345,196,-24,-8,LAX,HNL,2556,6,7,0,,0,0,0,0,0,0

Approximately half of the real rows of the file will be "missed" when processed by the script. There is no warning or error message.
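
One plausible reading, offered here only as a hedged guess: rmr2 hands the mapper a data frame containing many rows at a time, so the if() above tests only the first row of each chunk, and any chunk whose first row happens to be the header is dropped wholesale. Under that assumption, a vectorized filter avoids the loss:

# hedged fix sketch: test every row of the chunk, not just the first
mapper.year.market.enroute_time = function(k, fields) {
  keep = as.character(fields[[1]]) != "Year"
  keyval(as.character(fields[[9]][keep]), rep(1, sum(keep)))
}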
