klbostee / dumbo
Python module that allows one to easily write and run Hadoop programs.
Home Page: http://projects.dumbotics.com/dumbo
DESCRIPTION
Dumbo is a Python module that makes writing and running Hadoop Streaming programs very easy. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.

MORE INFO
http://projects.dumbotics.com/dumbo
http://wikis.dumbotics.com/dumbo
Hi,
I found that the system() function in Dumbo always returns zero if the actual return value is less than 256. As you know, the command 'hadoop fs -test -e' returns 1 if the given path does not exist, but the system() function returns zero in that case. I think the '/ 256' should be removed from the code.
def system(cmd, stdout=sys.stdout, stderr=sys.stderr):
    if sys.version[:3] == '2.4':
        return os.system(cmd) / 256
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    return os.waitpid(proc.pid, 0)[1] / 256
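For clarity, one way to make the exit code explicit (a sketch of the idea, not the project's actual fix) is to rely on Popen.wait(), which returns the child's exit code directly:

import subprocess
import sys

def system(cmd, stdout=sys.stdout, stderr=sys.stderr):
    # Popen.wait() returns the command's exit code, so e.g.
    # 'hadoop fs -test -e' on a missing path would yield 1 here.
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    return proc.wait()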
Please let me know what you think of this. Thanks.
Seongsu
If your starter dies, its exit code is parsed out of the os.waitpid() call in util.py:system, then passed back via util.py:execute, cmd.py:start and cmd.py:dumbo.
If you're running dumbo from the entry script generated by easy_install, all is fine, as it captures the return value of the entry point (cmd.py:dumbo) and passes this to sys.exit().
But if you're using the wrapper generated by buildout, this return code is ignored, because of this issue:
https://bugs.launchpad.net/zc.buildout/+bug/697913
... meaning that your dumbo task as a whole still exits with 0, despite the exception in your starter.
I know this isn't really a bug in dumbo, but since dumbo is distributed via buildout, I thought it worth reporting. One workaround would be to add a line like this to cmd.py:dumbo:
if retval != 0:
    raise RuntimeError('%s command died with code %d' % (sys.argv[1], retval))
Arguably this is a more pythonic way to communicate an error condition than passing integers around anyway.
PS Thanks for fixing those last ones so quickly, Klaas :-)
Hello!
Let me explain that one:
I have two jobs creating their own output files.
Then I want to merge those two files using a third job.
In my first attempt, the first two jobs were yielding Python structures (dict) as values and unicode strings as keys, which turned out to be dumb: I would have had to eval the keys and values in my third job, and I'm not sure anyone would want to do that.
Now I try to output pure strings through encode('utf-8') and some json.dumps. I now have strings everywhere; dumbo cat confirmed it.
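To make the setup concrete, the first two jobs now emit something roughly like this (a sketch; the field names are made up):

import json

def mapper(key, value):
    # key is a unicode string, value is assumed to be a dict of fields
    record = {'some_field': value}
    yield key.encode('utf-8'), json.dumps(record)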
But if I try to use those two files as input for the third merge job, keys and values are single-quoted, which is quite a pain when testing my code locally. Of course, I can use dumbo cat out > out.txt to test the merge job locally, but the code driving those three jobs won't be testable unless run on a real Hadoop cluster.
Did I miss something?
Thanks a lot for your help!
When
dumbo start ./someprog.py ...
is executed and the file someprog.py doesn't exist, Dumbo returns the error "relative module names not supported", which isn't very helpful.
Hi,
I'm attempting to patch hadoop and install dumbo, but whenever I try "ant package" I get a few "class, interface, or enum expected" errors in the BinaryComparable.java file. Shown below is the code. The areas that are causing the issue are the 2nd, 3rd, and 4th package statements.
Any help you can provide would be greatly appreciated.
Thanks,
Yigael
/**
 * ...Apache license header...
 * http://www.apache.org/licenses/LICENSE-2.0
 */
package org.apache.hadoop.io;

/**
 * Interface supported by {@link org.apache.hadoop.io.WritableComparable}
 * types supporting ordering/permutation by a representative set of bytes.
 */
public abstract class BinaryComparable implements Comparable {
    /** ...method declarations and their javadoc elided... */
}

(The block above appears four times in a row in the patched file, which is where the repeated package statements come from.)
Pre-outputs aren't deleted automatically anymore since we introduced the DAG capabilities for job flows. The reason for this is that
backend = get_backend(opts)
fs = backend.create_filesystem(opts)
gets executed at a point where 'opts' doesn't contain the 'hadoop' option anymore (since this option gets removed as part of running the iteration).
Hi, the readme basically says "Python module that allows you to easily write and run Hadoop programs" and points to the wiki, which contains some config files and sample programs.
If you really want to attract users, you should state more explicitly why they should use this rather than writing their own Python programs against the streaming API.
AFAICT, dumbo provides:
that seems to be it, from a quick glance.
If you recommend that users add this 2500-line Python dependency, you should IMHO justify it a bit more.
One of the frustrating problems I've been running into is that if I have print statements in code called by my mapper/reducer, they will break the pipe used by my streaming job.
It seems like a simple change to dumbo can fix this.
In core.py change
typedbytes.PairedOutput(sys.stdout).writes(outputs)
to
typedbytes.PairedOutput(sys.stdout).writes(outputs)
This way all we have to do is redirect stdout to stderr and extraneous
print statements will no longer cause problems.
I've tried this out and it seems to work for me.
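For reference, the overall idea can be sketched like this (a sketch of the approach, not the exact patch; the name of the saved handle is made up):

import os
import sys

# Keep a handle on the real stdout, then point sys.stdout at stderr so that
# stray print statements go to the task logs instead of corrupting the
# typed-bytes stream that Hadoop Streaming reads.
real_stdout = os.fdopen(os.dup(sys.stdout.fileno()), 'w')
sys.stdout = sys.stderr

# ...and have core.py write the outputs to the saved handle instead:
# typedbytes.PairedOutput(real_stdout).writes(outputs)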
When running on unix in local mode, dumbo uses a simple pipe to sort without any command-line options. Consequently, it might fill up the local /tmp directory (which sort uses by default) and/or cause performance/memory problems if sort is not limited correctly. Therefore it would be useful to add two new command-line/dumborc options, -sorttmpdir and -sortbufsize, that correspond to the unix sort -T and -S parameters, respectively. Those can then be used in the config under the [unix] section or specified on the command line, e.g.:
dumbo start prog.py -input .. -output .. -sorttmpdir ~/hdfs/tmp -sortbufsize 5%
It would be useful if a multimapper would inherit all the opt decorations of the mappers that get added to it.
Hey, I really love the job management stuff in dumbo. However, it seems like the inner-core of hadoopy is more highly optimized. (I get a factor of 2 better performance in my tests.) So it seems to me like the right way of combining the two is to write a hadoopy backend for dumbo. Would this be something you'd be interested in adding to dumbo? I'm happy to work on it in some capacity if there is interest.
Hi,
I have a program consisting of 3 jobs.
I have 3 additer calls to create the job sequence, and I would like to save the output of the 2nd job.
I looked at the code and saw that the preoutputs option should be set to "yes", but I cannot figure out how to set it without getting an error when the Hadoop jobs are started.
Thank you for helping me.
Sometimes it's handy to have the equality test between join keys be something other than strict equality; moving the key equality test into its own method would make it easier for subclasses of JoinReducer to customize.
Also, the key_check method can be defined in JoinCombiner to always return True, in which case it's no longer necessary to duplicate the logic of the call method in both JoinCombiner and JoinReducer.
I'm issuing a pull request with my patch.
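For illustration, the kind of customization this would enable looks roughly like this (a sketch; the key_check name comes from the patch, but its exact signature here is an assumption):

from dumbo.lib import JoinReducer

class CaseInsensitiveJoinReducer(JoinReducer):
    def key_check(self, key1, key2):
        # join on case-insensitive keys instead of strict equality
        return key1.lower() == key2.lower()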
For my task, I want to use Dumbo to do the following: the mapper function extracts an IP address and an occurrence count from the server's log, and the reducer function emits an SQL statement for each key (IP) and value (occurrence count), like so:
def mapper(key, value):
    key = """get the ip from value"""
    yield key, 1

def combiner(key, values):
    yield key, sum(values)

def reducer(key, values):
    yield """sql with key and value"""
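(A minimal sketch of how such a script is usually wired up, assuming the three functions above live in the same file:)

import dumbo

if __name__ == '__main__':
    dumbo.run(mapper, reducer, combiner=combiner)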
How can I do this?
Please forgive my bad English.
Your friend from China is waiting for your answer. Many thanks!
$ bin/test
Traceback (most recent call last):
File "bin/test", line 47, in <module>
import zope.testrunner
File "/media/sf_Workspace/dumbo/eggs/zope.testrunner-4.0.3-py2.6.egg/zope/testrunner/__init__.py", line 21, in <module>
import zope.testrunner.interfaces
File "/media/sf_Workspace/dumbo/eggs/zope.testrunner-4.0.3-py2.6.egg/zope/testrunner/interfaces.py", line 21, in <module>
import zope.interface
ImportError: No module named interface
It should be possible to tell dumbo to overwrite the output path if it already exists by simply providing an -overwrite yes option.
At some point we should add a proper interface between the dumbo core and the unix and hadoop streaming backends we currently have (both for the code that runs mapreduce iterations and the code that executes commands). In addition to cleaning up the code, this would also make it possible to easily add additional backends.
When -overwrite yes is specified Dumbo should also overwrite intermediate _pre dirs instead of failing on them.
As Python strings are mapped to typedbytes (UTF-8) strings, it isn't currently possible to output binary data directly without any intermediate transformations of the bytes.
See also https://github.com/tims/lasthbase/issues#issue/3 for more info.
First one's just a typo: on line 209, it has 'sdterr' instead of 'stderr'.
The second one might not be a bug but it looks a bit suspect to me :-) On line 197, program.start() is called regardless of whether starter(program) succeeded or failed.
Shouldn't it test this first, and not try to start the job if the starter failed? e.g.
retval = program.start() if status == 0 else status
This is presumably due to the options refactoring, since this works in 0.21.31
Traceback (most recent call last):
  File "/usr/local/bin/dumbo", line 9, in <module>
    load_entry_point('dumbo==0.21.32', 'console_scripts', 'dumbo')()
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/__init__.py", line 32, in execute_and_exit
    sys.exit(dumbo())
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/cmd.py", line 42, in dumbo
    retval = cat(sys.argv[2], parseargs(sys.argv[2:]))
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/cmd.py", line 101, in cat
    return create_filesystem(opts).cat(path, opts)
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/backends/unix.py", line 114, in cat
    return decodepipe(opts + [('file', path)])
TypeError: unsupported operand type(s) for +: 'Options' and 'list'
How do I import external libraries in a Dumbo MapReduce job?
Thank you
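Not part of the original question, but for orientation: dependencies are usually shipped along with the job, e.g. as a Python egg via -libegg or as plain files via -file (hedged; check the wiki for the options your dumbo version supports):

dumbo start myprog.py -input input -output output \
    -libegg /path/to/somepackage-1.0-py2.6.egg \
    -file mymodule.py \
    -hadoop /usr/lib/hadoop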
Currently the only way to specify a local run is by omitting the -hadoop option. However this doesn't work if there is a ~/.dumborc file that specifies a default hadoop option; we have to comment out and uncomment this line in the config for toggling between local and cluster runs. Essentially we need a 'nohadoop' (or 'local') option to override from the command line the hadoop option in the config.
Another quick and dirty workaround would be to denote the local run with a special '-hadoop' value, e.g. " -hadoop '!' ". That's what I ended up implementing in my fork; if you're interested I can upload it and make a pull request.
Adam wrote a small module, inspired by cloudera's MRUnit, that lets one easily create unit-tests for dumbo mapreduce tasks.
The code is currently at:
http://github.com/adamhadani/dumbo/blob/master/dumbo/mapredtest.py
He also added a unit test for it that serves the double purpose of unit-testing the unit-testing itself,
as well as serving as an example of how to work with it:
http://github.com/adamhadani/dumbo/blob/master/tests/testmapredtest.py
The nice thing about it is that it takes care of some things behind the scenes, e.g. deriving the mapper/reducer classes from mapredbase when needed and making sure input/output is iterable (allowing for arbitrarily large input/output test cases, which need not fit in memory, as seems to be the case with MRUnit), and so on.
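Not the module's actual API, but for orientation, the kind of test it is meant to make easy looks roughly like this plain-unittest sketch (see testmapredtest.py for the real usage):

import unittest

def wordcount_mapper(key, value):
    for word in value.split():
        yield word, 1

class WordCountMapperTest(unittest.TestCase):
    def test_single_line(self):
        # feed one (key, value) pair through the mapper and check the emitted pairs
        output = list(wordcount_mapper(0, "foo bar foo"))
        self.assertEqual(output, [("foo", 1), ("bar", 1), ("foo", 1)])

if __name__ == '__main__':
    unittest.main()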
Hello,
I'm having the same issue described in this thread, but with the more recent CDH3B4:
http://groups.google.com/group/dumbo-user/browse_thread/thread/d5440880a5588278
Namely:
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -input '$HDFS_USERDIR/access.log' -output 'ipcounts' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -mapper 'python -m ipcount map 0 262144000' -reducer 'python -m ipcount red 0 262144000' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=ipcount.py (1/1)' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -cmdenv 'PYTHONPATH=common.pyc' -file '$HOME/ipcount.py' -file '$HOME/dumbo/dumbo/backends/common.pyc' -jobconf 'tmpfiles=$VIRTUALENV_INSTALL/typedbytes.pyc'
11/02/26 20:11:20 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [$HOME/ipcount.py, $HOME/dumbo/dumbo/backends/common.pyc, $HDFS_PATH//hadoop-unjar44840/] [] /tmp/streamjob44841.jar tmpDir=null
11/02/26 20:11:21 INFO mapred.JobClient: Cleaning up the staging area hdfs://$HOSTNAME/$HDFS_PATH//mapred/staging/romanvg/.staging/job_201102242242_0010
11/02/26 20:11:21 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: $VIRTUALENV_INSTALL/typedbytes.pyc
Streaming Command Failed!
easy_install does not seem to be the reason. I've installed it via pip, easy_install, and now git clone. Seems to me that the jobconf 'tmpfiles' is what causes the problem.
Commenting the offending code allows the mapreduce job to start but fails shortly after, on the hadoop side (not dumbo):
diff --git a/dumbo/backends/streaming.py b/dumbo/backends/streaming.py
index 1f03b13..df2d3d5 100644
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -180,15 +180,15 @@ class StreamingIteration(Iteration):
         hadenv = envdef('HADOOP_CLASSPATH', addedopts['libjar'], 'libjar',
                         self.opts, shortcuts=dict(configopts('jars', self.prog)))
         fileopt = getopt(self.opts, 'file')
-        if fileopt:
-            tmpfiles = []
-            for file in fileopt:
-                if file.startswith('file://'):
-                    self.opts.append(('file', file[7:]))
-                else:
-                    tmpfiles.append(file)
-            if tmpfiles:
-                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
+#        if fileopt:
+#            tmpfiles = []
+#            for file in fileopt:
+#                if file.startswith('file://'):
+#                    self.opts.append(('file', file[7:]))
+#                else:
+#                    tmpfiles.append(file)
+#            if tmpfiles:
+#                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
         libjaropt = getopt(self.opts, 'libjar')
         if libjaropt:
             tmpjars = []
The result of the above modification is:
11/02/26 20:23:32 INFO streaming.StreamJob: map 0% reduce 0%
11/02/26 20:23:40 INFO streaming.StreamJob: map 50% reduce 0%
11/02/26 20:23:41 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:02 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:06 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:13 INFO streaming.StreamJob: map 100% reduce 33%
11/02/26 20:24:17 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:26 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:29 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:32 INFO streaming.StreamJob: map 100% reduce 100%
(...)
11/02/26 20:24:32 ERROR streaming.StreamJob: Job not successful. Error: NA
Whenever a dumbo job is run, this warning appears:
11/07/07 13:23:25 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
As originally reported by Elias Pampalk:
The following scripts demonstrate a failure to fail when executed on a hadoop cluster (fails fine if executed locally):
import dumbo

def mapper(k, v):
    yield 1, 1

if __name__ == "__main__":
    dumbo.run(mapper, dumbo.sumsreducer, combiner=dumbo.sumsreducer)
The test uses dumbo.sumsreducer where dumbo.sumreducer should be used. A TypeError should be thrown by dumbo.sumsreducer (in the combiner). Instead, the Hadoop reports show no error and zero output from the mapper.
See recent commits on https://github.com/dangra/dumbo
Hi Klbostee,
I got bitten by this problem when running a script. I'm using the latest dumbo version, but I see the same problem in 0.21.29. Using PDB I could find that the duplicates appear here: https://github.com/klbostee/dumbo/blob/master/dumbo/core.py#L384.
I worked around it with this: http://paste.pocoo.org/show/539376/. I think the best fix would be for the options to be a set of (key, value) pairs instead of a list, but I don't know whether the options can hold non-hashable values.
Well, what do you think? I think there is some room to improve options handling, but it would be good to first learn a bit more from you about its values.
Btw, I'm running this line (I have to change the names, sorry) on the cluster:
python2.5 -m dumbo.cmd start mylib.mymodule.script -cachefile hdfspath/to/myfile.db#myfile.db -param locale=en-gb -param workdir=/some/other/hdfspath -param dataversion=2012-01-23T01-59-59.797550 -input /data/input/2012/01/20 -output /data/output/2012/01/20/transformed -hadoop system
Well, thanks,
Andrés
It'd be really useful to have a command-line switch to say "don't read any config files" or "read this specific config file instead of the default ones".
That way, you could run dumbo from a virtualenv or buildout without worrying that eggs, jars etc. specified in dumbo.conf or .dumborc would sneak in.
When actually using -memlimit, it's cumbersome to specify the limit in terms of bytes.
A simple format for specifying byte sizes could be used instead, e.g.:
-memlimit 100k
-memlimit 256m
-memlimit 2g
This should default to bytes when no unit specifier is given (e.g. -memlimit 1024).
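A minimal sketch of the parsing this would require (a hypothetical helper, not dumbo code):

def parse_memlimit(value):
    """Parse '100k', '256m', '2g' or plain bytes ('1024') into a byte count."""
    suffixes = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in suffixes:
        return int(float(value[:-1]) * suffixes[value[-1]])
    return int(value)  # no unit suffix: the value is already in bytes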
It would be convenient to provide a -queue <name> alternative for -jobconf mapred.job.queue.name=<name>.
If a parser is accidentally used on the wrong type of data file (a pretty common mistake among the analysts I'm supporting with Dumbo), the entire contents of the input file get dumped into the job logs.
I've seen this bring nodes down repeatedly, when the partition holding the log files gets completely filled up. Parsers are very useful, but I think it would be better if bad values just incremented a counter.
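The suggested behaviour could look roughly like this (a sketch assuming a class-based mapper and dumbo's counter facility; the parse_record helper is made up):

class SafeParsingMapper:
    def __call__(self, key, value):
        try:
            record = parse_record(value)  # hypothetical parser for the expected format
        except ValueError:
            # count the bad line instead of dumping it into the job logs
            self.counters['Unparsable input lines'] += 1
            return
        yield record['ip'], record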
Hadoop version: official release of 0.21.0
Dumbo version: dumbo-0.21.28-py2.6.egg
MR Program is the example found in the Wiki (https://github.com/klbostee/dumbo/wiki/Short-tutorial)
stderr log:
java.io.IOException: Cannot run program "python": java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more
Jobs that consist of only one iteration and don't use the "Job" object are now shown as having 0 iterations in the Hadoop web UI.
hi,
I want to run Dumbo with a specific input format (to read from Avro files).
It seems Dumbo does not use the input format specified by '-inputformat' when it is run locally (without specifying '-hadoop'). Instead it uses its default input format.
To check that, I specify a unknown class with '-inputformat foo.bar.UnknownClass'. It fails on hadoop but passes in local mode.
Hadoop mode:
$ dumbo start cat.py
-input word-count.avro
-output tmp
-libjar avro-1.4.1.jar
-libjar avro-utils-1.5.3-SNAPSHOT.jar
-inputformat foo.bar.UnknownClass
-python /home/sites/sci-env/0.0.5/bin/python
-hadoop /usr/lib/hadoop
...
-inputformat : class not found : foo.bar.UnknownClass
Streaming Command Failed!
Local mode:
$ dumbo start cat.py
-input word-count.avro
-output tmp
-libjar avro-1.4.1.jar
-libjar avro-utils-1.5.3-SNAPSHOT.jar
-inputformat foo.bar.UnknownClass
-python /home/sites/sci-env/0.0.5/bin/python
INFO: buffersize = 168960
=> no error, tmp was created but it contains the content of the binary avro file as it was read as text...
Is it a limitation of Dumbo that '-inputformat' only works in Hadoop mode, or is it a bug?
thanks,
jeff
Starting with CDH4b2 (nightlies at http://nightly.cloudera.com/cdh4/), /usr/lib/hadoop has been split out into a bunch of different directories, like hadoop-hdfs and hadoop-mapreduce, so a lot of the assumptions made in dumbo no longer work.
A few examples:
With JAVA_HOME=/usr/lib/jvm/default-java and HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec, I can't even run dumbo ls / -hadoop /usr/lib/hadoop, since it will complain about JAVA_HOME not being set or about not being able to find libexec.
dumbo start wordcount.py -input foo -output bar -hadoop /usr/lib/hadoop results in "ERROR: Streaming jar not found", since the streaming jar is now under /usr/lib/hadoop-mapreduce.
I might be able to fix this if I find some free time, but it might require some structural changes.
Replacing "sum" by (the non-existing function name) "summ" in "reducer1" in "examples/itertwice.py" and then running "bin/test" leads to 1 failure and 1 error, whereas doing the same change in "reducer2" (and removing "reducer2" as combiner for the first iteration) leads to 0 failures and 1 error.
This causes frustration with new users who don't know why it's mysteriously failing and haven't figured out either of the workarounds yet.
Proposed patch included; perhaps not the best solution, but better than failing silently.
diff --git a/dumbo/backends/streaming.py b/dumbo/backends/streaming.py
index f71e611..7755fbe 100644
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -230,6 +230,8 @@ class StreamingFileSystem(FileSystem):
             subpaths = [path]
         ls.close()
         for subpath in subpaths:
+            if subpath.endswith("/_logs"):
+                continue
             dumptb = os.popen('%s %s/bin/hadoop jar %s dumptb %s 2> /dev/null'
                               % (hadenv, self.hadoop, streamingjar, subpath))
             ascodeopt = getopt(opts, 'ascode')
If a mapper class has a map function, or a reducer or combiner class has a reduce function, neither configure() nor close() functions on the class will be called.
(Ref: http://groups.google.com/group/dumbo-user/browse_frm/thread/432db8171497d7f6)
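For illustration, the class shape being described (a sketch, not a fix; the setup body is made up):

class MyMapper:
    def configure(self):
        # with a map() method present, this is never invoked
        self.prefix = 'x'
    def map(self, key, value):
        # would fail at runtime because configure() never ran
        yield key, self.prefix + str(value)
    def close(self):
        # also never invoked in this case
        pass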
It will look for the first directory that exists in its search path and then assume the jar file should be found there. A better approach would be to do an explicit file check on each directory.
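A sketch of the suggested approach (a hypothetical helper; dumbo's real search paths and jar naming differ between versions):

import glob
import os

def find_streaming_jar(candidate_dirs):
    """Return a streaming jar from the first directory that actually contains one."""
    for directory in candidate_dirs:
        matches = glob.glob(os.path.join(directory, 'hadoop*streaming*.jar'))
        if matches:
            return matches[0]
    return None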
By making JoinReducer do a natural join by default (http://dumbo.assembla.com/spaces/dumbo/tickets/52), we made it not suitable anymore for combining. It would be good to add a JoinCombiner that doesn't have the natural join restriction.
See the traceback from the logs below.
Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/data/0/hdfs/local/taskTracker/jobcache/job_201011021825_0760/attempt_201011021825_0760_r_000000_1/work/rec2.py", line 156, in
    main(runner)
  File "dumbo/core.py", line 211, in main
    job.run()
  File "dumbo/core.py", line 61, in run
    run(*args, **kwargs)
  File "dumbo/core.py", line 366, in run
    for output in dumpcode(inputs):
UnboundLocalError: local variable 'inputs' referenced before assignment
(Used to be "parser attribute on a single mapper gets applied to others in same MultiMapper".)
The following script should generate the same output regardless of the input but it doesn't when the input is empty, despite yielding the same 100 entries as evidenced by the stderr messages:
import sys
import dumbo
from dumbo.lib import identitymapper

def reducer(data):
    for _ in data:
        pass
    for i in xrange(100):
        yield "foo", i
        print >> sys.stderr, "yielding", i

if __name__ == '__main__':
    dumbo.run(identitymapper, reducer)
After upgrading hadoop (to hadoop-0.20-0.20.2+320-1.noarch using cloudera's yum repo), dumbo no longer finds the hadoop streaming jar.
It seems that the jar filename has changed to:
hadoop-streaming-0.20.2+320.jar
which is not matched by the regular expression in util.py:
re.compile(r'hadoop.*%s.jar' % name)
When I run any dumbo script (I have release-0.21.28) with "-addpath yes" in the arguments, my map jobs fail with the following error: "KeyError: 'map_input_file'"
It appears that the environment variable map_input_file is no longer used in hadoop 0.21.0, and has been replaced with mapreduce_map_input_file.
This diagnosis is supported by a comment on HADOOP-5973 (https://issues.apache.org/jira/browse/HADOOP-5973) that mentions that map.input.file is only available in the older (deprecated) version of the MapReduce API in Hadoop 0.20.0.
I was able to make it work by replacing all instances of "map_input_file" with "mapreduce_map_input_file" in dumbo/core.py, but perhaps a longer-term solution would be to check both variables to see which one exists.
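The "check both variables" approach could look roughly like this (a sketch, not dumbo's actual code):

import os

def get_map_input_file():
    # Hadoop 0.20 exposes map.input.file as 'map_input_file';
    # Hadoop 0.21 renames it to 'mapreduce_map_input_file'.
    for var in ('map_input_file', 'mapreduce_map_input_file'):
        if var in os.environ:
            return os.environ[var]
    return None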
Hi there,
I'm having trouble making dumbo work with virtualenv.
My setting is a hadoop grid running cloudera cdh3b3 on debian lenny (yes I know it's old). In order to run a decent, fixed version of python I have a virtualenv with python2.6 and dumbo installed into it.
When running in local mode a simple example (ipcount), everything is fine.
But when I try to run in grid mode, I get this error:
"""
(0.0.5)ediemert@bom-dev-hadn9:~/sources/scratch/dumbo$ dumbo start count.py -input /benchmark/data/dbpedia/long_abstracts_en.nt -output wcounts -hadoop /usr/lib/hadoop/
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop//bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+737.jar -input '/benchmark/data/dbpedia/long_abstracts_en.nt' -output 'wcounts' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -mapper 'python -m count map 0 262144000' -reducer 'python -m count red 0 262144000' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=count.py (1/1)' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -cmdenv 'PYTHONPATH=common.pyc' -file '/home/ediemert/sources/scratch/dumbo/count.py' -file '/home/sites/sci-env/0.0.5/lib/python2.6/site-packages/dumbo/backends/common.pyc' -jobconf 'tmpfiles=/home/sites/sci-env/0.0.5/lib/python2.6/site-packages/typedbytes.pyc'
11/05/23 16:02:13 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/ediemert/sources/scratch/dumbo/count.py, /home/sites/sci-env/0.0.5/lib/python2.6/site-packages/dumbo/backends/common.pyc, /tmp/hadoop-ediemert/hadoop-unjar5768148904382263570/] [] /tmp/streamjob6800517112258694028.jar tmpDir=null
11/05/23 16:02:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://bom-dev-hann1.xxx.com/tmp/hadoop-mapred/mapred/staging/ediemert/.staging/job_201104261107_0645
11/05/23 16:02:13 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: /home/sites/sci-env/0.0.5/lib/python2.6/site-packages/typedbytes.pyc
Streaming Command Failed!
"""
It seems to me that this error is related to my python2.6 and dumbo being executed from a virtualenv.
Any ideas on how to tackle this problem?
Thanks a lot!
oddskool
It would be nice if it were possible, when calling Job.additer, to specify that the input for an iteration should be the output of one or more previous iterations of the Job. Something like...
job = dumbo.Job()
job.input # is an id for job's input (i.e., specified by -input on the command line)
o0 = job.additer(mapper, reducer) # returns an id for the iteration's output
o1 = job.additer(mapper, reducer, input=job.input) # take input from the job input instead of iteration 0
o2 = job.additer(mapper, reducer, input=[o0,o1]) # take input from both iteration 0 and 1
The job's output would be the output of the last iteration as always.
It seems to me this would be a fairly easy modification that would add a lot of flexibility.
As originally reported by Zak Stone:
It appears that
dumbo cat /hdfs/path/part*
does not actually concatenate all of the parts in an HDFS directory — instead, it silently emits only the key-value pairs from the first part.
Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.
Current workarounds include using dumbo cat without the star by manually deleting the _logs directory or configuring Hadoop not to create it. It may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly to ensure that each one is processed as expected.