klbostee / dumbo
Python module that allows one to easily write and run Hadoop programs.
Home Page: http://projects.dumbotics.com/dumbo
DESCRIPTION
Dumbo is a Python module that makes writing and running Hadoop Streaming programs very easy. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.

MORE INFO
http://projects.dumbotics.com/dumbo
http://wikis.dumbotics.com/dumbo
Hi,
I found that the system() function in Dumbo always returns zero if the actual return value is less than 256. As you know, the command 'hadoop fs -test -e' returns 1 if the given path does not exist, but the system() function returns zero in that case. I think the '/ 256' should be removed from the code.
def system(cmd, stdout=sys.stdout, stderr=sys.stderr):
    if sys.version[:3] == '2.4':
        return os.system(cmd) / 256
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    return os.waitpid(proc.pid, 0)[1] / 256
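For clarity, one way to make the exit code explicit (a sketch of the idea, not the project's actual fix) is to rely on Popen.wait(), which returns the child's exit code directly:

import subprocess
import sys

def system(cmd, stdout=sys.stdout, stderr=sys.stderr):
    # Popen.wait() returns the command's exit code, so e.g.
    # 'hadoop fs -test -e' on a missing path would yield 1 here.
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    return proc.wait()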
Please let me know what you think of this. Thanks.
Seongsu
If your starter dies, its exit code is parsed out of the os.waitpid() call in util.py:system, then passed back via util.py:execute, cmd.py:start and cmd.py:dumbo.
If you're running dumbo from the entry script generated by easy_install, all is fine, as it captures the return value of the entry point (cmd.py:dumbo) and passes this to sys.exit().
But if you're using the wrapper generated by buildout, this return code is ignored, because of this issue:
https://bugs.launchpad.net/zc.buildout/+bug/697913
... meaning that your dumbo task as a whole still exits with 0, despite the exception in your starter.
I know this isn't really a bug in dumbo, but since dumbo is distributed via buildout, I thought it worth reporting. One workaround would be to add a line like this to cmd.py:dumbo:
if retval != 0:
    raise RuntimeError('%s command died with code %d' % (sys.argv[1], retval))
Arguably this is a more pythonic way to communicate an error condition than passing integers around anyway.
PS Thanks for fixing those last ones so quickly, Klaas :-)
Hello!
Let me explain that one:
I have two jobs creating their own output files.
Then I want to merge those two files using a third job.
In my first attempt, the first two jobs were yielding Python structures (dict) as values and unicode strings as keys, which turned out to be dumb: I would have had to eval the keys and values in my third job, and I'm not sure anyone would want to do that.
Now I try to output pure strings through encode('utf-8') and some json.dumps. I now have strings everywhere; dumbo cat confirmed it.
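To make the setup concrete, the first two jobs now emit something roughly like this (a sketch; the field names are made up):

import json

def mapper(key, value):
    # key is a unicode string, value is assumed to be a dict of fields
    record = {'some_field': value}
    yield key.encode('utf-8'), json.dumps(record)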
But if I try to use those two files as input for the third merge job, keys and values are single-quoted, which is quite a pain when testing my code locally. Of course, I can use dumbo cat out > out.txt to test the merge job locally, but the code driving those three jobs won't be testable unless run on a real Hadoop cluster.
Did I miss something?
Thanks a lot for your help!
When
dumbo start ./someprog.py ...
is executed and the file someprog.py doesn't exist, Dumbo returns the error "relative module names not supported", which isn't very helpful.
Hi,
I'm attempting to patch hadoop and install dumbo, but whenever I try "ant package" I get a few "class, interface, or enum expected" errors in the BinaryComparable.java file. Shown below is the code. The areas that are causing the issue are the 2nd, 3rd, and 4th package statements.
Any help you can provide would be greatly appreciated.
Thanks,
Yigael
/**
 * ...Apache license header...
 * http://www.apache.org/licenses/LICENSE-2.0
 */
package org.apache.hadoop.io;

/**
 * Interface supported by {@link org.apache.hadoop.io.WritableComparable}
 * types supporting ordering/permutation by a representative set of bytes.
 */
public abstract class BinaryComparable implements Comparable {
    /** ...method declarations and their javadoc elided... */
}

(The block above appears four times in a row in the patched file, which is where the repeated package statements come from.)
Pre-outputs aren't deleted automatically anymore since we introduced the DAG capabilities for job flows. The reason for this is that
backend = get_backend(opts)
fs = backend.create_filesystem(opts)
gets executed at a point where 'opts' doesn't contain the 'hadoop' option anymore (since this option gets removed as part of running the iteration).
Hi, the readme basically says "Python module that allows you to easily write and run Hadoop programs" and points to the wiki, which contains some config files and sample programs.
If you really want to attract users, you should state more explicitly why they should use this rather than writing their own Python programs against the streaming API.
AFAICT, dumbo provides:
that seems to be it, from a quick glance.
If you recommend that users add this 2500-line Python dependency, you should IMHO justify it a bit more.
One of the frustrating problems I've been running into is that if I have print statements in code called by my mapper/reducer, they will break the pipe used by my streaming job.
It seems like a simple change to dumbo can fix this.
In core.py change
typedbytes.PairedOutput(sys.stdout).writes(outputs)
to
typedbytes.PairedOutput(sys.stdout).writes(outputs)
This way all we have to do is redirect stdout to stderr and extraneous
print statements will no longer cause problems.
I've tried this out and it seems to work for me.
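For reference, the overall idea can be sketched like this (a sketch of the approach, not the exact patch; the name of the saved handle is made up):

import os
import sys

# Keep a handle on the real stdout, then point sys.stdout at stderr so that
# stray print statements go to the task logs instead of corrupting the
# typed-bytes stream that Hadoop Streaming reads.
real_stdout = os.fdopen(os.dup(sys.stdout.fileno()), 'w')
sys.stdout = sys.stderr

# ...and have core.py write the outputs to the saved handle instead:
# typedbytes.PairedOutput(real_stdout).writes(outputs)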
When running on unix in local mode, dumbo uses a simple pipe to sort without any command-line options. Consequently, it might fill up the local /tmp directory (which sort uses by default) and/or cause performance/memory problems if sort is not limited correctly. Therefore it would be useful to add two new command-line/dumborc options, -sorttmpdir and -sortbufsize, that correspond to the unix sort -T and -S parameters, respectively. Those can then be used in the config under the [unix] section or specified on the command line, e.g.:
dumbo start prog.py -input .. -output .. -sorttmpdir ~/hdfs/tmp -sortbufsize 5%
It would be useful if a multimapper would inherit all the opt decorations of the mappers that get added to it.
Hey, I really love the job management stuff in dumbo. However, it seems like the inner-core of hadoopy is more highly optimized. (I get a factor of 2 better performance in my tests.) So it seems to me like the right way of combining the two is to write a hadoopy backend for dumbo. Would this be something you'd be interested in adding to dumbo? I'm happy to work on it in some capacity if there is interest.
Hi,
I have a program consisting of 3 jobs.
I have 3 additer calls to create the job sequence, and I would like to save the output of the 2nd job.
I looked at the code and saw that the preoutputs option should be set to "yes", but I cannot figure out how to set it without getting an error when the Hadoop jobs are started.
Thank you for helping me.
Sometimes it's handy to have the equality test between join keys be something other than strict equality; moving the key equality test into its own method would make it easier for subclasses of JoinReducer to customize.
Also, the key_check method can be defined in JoinCombiner to always return True, in which case it's no longer necessary to duplicate the logic of the call method in both JoinCombiner and JoinReducer.
I'm issuing a pull request with my patch.
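For illustration, the kind of customization this would enable looks roughly like this (a sketch; the key_check name comes from the patch, but its exact signature here is an assumption):

from dumbo.lib import JoinReducer

class CaseInsensitiveJoinReducer(JoinReducer):
    def key_check(self, key1, key2):
        # join on case-insensitive keys instead of strict equality
        return key1.lower() == key2.lower()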
For my task, I want to use Dumbo to do the following: the mapper function extracts an IP address and an occurrence count from the server's log, and the reducer function emits an SQL statement for each key (IP) and value (occurrence count), like so:
def mapper(key, value):
    key = """get the ip from value"""
    yield key, 1

def combiner(key, values):
    yield key, sum(values)

def reducer(key, values):
    yield """sql with key and value"""
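(A minimal sketch of how such a script is usually wired up, assuming the three functions above live in the same file:)

import dumbo

if __name__ == '__main__':
    dumbo.run(mapper, reducer, combiner=combiner)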
How can I do this?
Please forgive my bad English.
Your friend from China is waiting for your answer. Many thanks!
$ bin/test
Traceback (most recent call last):
File "bin/test", line 47, in <module>
import zope.testrunner
File "/media/sf_Workspace/dumbo/eggs/zope.testrunner-4.0.3-py2.6.egg/zope/testrunner/__init__.py", line 21, in <module>
import zope.testrunner.interfaces
File "/media/sf_Workspace/dumbo/eggs/zope.testrunner-4.0.3-py2.6.egg/zope/testrunner/interfaces.py", line 21, in <module>
import zope.interface
ImportError: No module named interface
It should be possible to tell dumbo to overwrite the output path if it already exists by simply providing an -overwrite yes option.
At some point we should add a proper interface between the dumbo core and the unix and hadoop streaming backends we currently have (both for the code that runs mapreduce iterations and the code that executes commands). In addition to cleaning up the code, this would also make it possible to easily add additional backends.
When -overwrite yes is specified Dumbo should also overwrite intermediate _pre dirs instead of failing on them.
As Python strings are mapped to typedbytes (UTF-8) strings, it isn't currently possible to output binary data directly without any intermediate transformations of the bytes.
See also https://github.com/tims/lasthbase/issues#issue/3 for more info.
First one's just a typo: on line 209, it has 'sdterr' instead of 'stderr'.
The second one might not be a bug but it looks a bit suspect to me :-) On line 197, program.start() is called regardless of whether starter(program) succeeded or failed.
Shouldn't it test this first, and not try to start the job if the starter failed? e.g.
retval = program.start() if status == 0 else status
This is presumably due to the options refactoring, since this works in 0.21.31
Traceback (most recent call last):
  File "/usr/local/bin/dumbo", line 9, in <module>
    load_entry_point('dumbo==0.21.32', 'console_scripts', 'dumbo')()
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/__init__.py", line 32, in execute_and_exit
    sys.exit(dumbo())
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/cmd.py", line 42, in dumbo
    retval = cat(sys.argv[2], parseargs(sys.argv[2:]))
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/cmd.py", line 101, in cat
    return create_filesystem(opts).cat(path, opts)
  File "/usr/local/lib/python2.6/dist-packages/dumbo-0.21.32-py2.6.egg/dumbo/backends/unix.py", line 114, in cat
    return decodepipe(opts + [('file', path)])
TypeError: unsupported operand type(s) for +: 'Options' and 'list'
How do I import external libraries in a Dumbo MapReduce job?
Thank you
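Not part of the original question, but for orientation: dependencies are usually shipped along with the job, e.g. as a Python egg via -libegg or as plain files via -file (hedged; check the wiki for the options your dumbo version supports):

dumbo start myprog.py -input input -output output \
    -libegg /path/to/somepackage-1.0-py2.6.egg \
    -file mymodule.py \
    -hadoop /usr/lib/hadoop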
Currently the only way to specify a local run is by omitting the -hadoop option. However this doesn't work if there is a ~/.dumborc file that specifies a default hadoop option; we have to comment out and uncomment this line in the config for toggling between local and cluster runs. Essentially we need a 'nohadoop' (or 'local') option to override from the command line the hadoop option in the config.
Another quick and dirty workaround would be to denote the local run with a special '-hadoop' value, e.g. " -hadoop '!' ". That's what I ended up implementing in my fork; if you're interested I can upload it and make a pull request.
Adam wrote a small module, inspired by cloudera's MRUnit, that lets one easily create unit-tests for dumbo mapreduce tasks.
The code is currently at:
http://github.com/adamhadani/dumbo/blob/master/dumbo/mapredtest.py
He also added a unit test for it that serves the double purpose of unit-testing the unit-testing itself,
as well as serving as an example of how to work with it:
http://github.com/adamhadani/dumbo/blob/master/tests/testmapredtest.py
The nice thing about it is that it takes care of some things behind the scenes, e.g. deriving the mapper/reducer classes from mapredbase when needed and making sure input/output is iterable (allowing for arbitrarily large input/output test cases, which need not fit in memory, as seems to be the case with MRUnit), and so on.
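Not the module's actual API, but for orientation, the kind of test it is meant to make easy looks roughly like this plain-unittest sketch (see testmapredtest.py for the real usage):

import unittest

def wordcount_mapper(key, value):
    for word in value.split():
        yield word, 1

class WordCountMapperTest(unittest.TestCase):
    def test_single_line(self):
        # feed one (key, value) pair through the mapper and check the emitted pairs
        output = list(wordcount_mapper(0, "foo bar foo"))
        self.assertEqual(output, [("foo", 1), ("bar", 1), ("foo", 1)])

if __name__ == '__main__':
    unittest.main()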
Hello,
I'm having the same issue described in this thread, but with the more recent CDH3B4:
http://groups.google.com/group/dumbo-user/browse_thread/thread/d5440880a5588278
Namely:
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -input '$HDFS_USERDIR/access.log' -output 'ipcounts' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -mapper 'python -m ipcount map 0 262144000' -reducer 'python -m ipcount red 0 262144000' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=ipcount.py (1/1)' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -cmdenv 'PYTHONPATH=common.pyc' -file '$HOME/ipcount.py' -file '$HOME/dumbo/dumbo/backends/common.pyc' -jobconf 'tmpfiles=$VIRTUALENV_INSTALL/typedbytes.pyc'
11/02/26 20:11:20 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [$HOME/ipcount.py, $HOME/dumbo/dumbo/backends/common.pyc, $HDFS_PATH//hadoop-unjar44840/] [] /tmp/streamjob44841.jar tmpDir=null
11/02/26 20:11:21 INFO mapred.JobClient: Cleaning up the staging area hdfs://$HOSTNAME/$HDFS_PATH//mapred/staging/romanvg/.staging/job_201102242242_0010
11/02/26 20:11:21 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: $VIRTUALENV_INSTALL/typedbytes.pyc
Streaming Command Failed!
easy_install does not seem to be the reason. I've installed it via pip, easy_install, and now git clone. Seems to me that the jobconf 'tmpfiles' is what causes the problem.
Commenting the offending code allows the mapreduce job to start but fails shortly after, on the hadoop side (not dumbo):
diff --git a/dumbo/backends/streaming.py b/dumbo/backends/streaming.py
index 1f03b13..df2d3d5 100644
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -180,15 +180,15 @@ class StreamingIteration(Iteration):
         hadenv = envdef('HADOOP_CLASSPATH', addedopts['libjar'], 'libjar',
                         self.opts, shortcuts=dict(configopts('jars', self.prog)))
         fileopt = getopt(self.opts, 'file')
-        if fileopt:
-            tmpfiles = []
-            for file in fileopt:
-                if file.startswith('file://'):
-                    self.opts.append(('file', file[7:]))
-                else:
-                    tmpfiles.append(file)
-            if tmpfiles:
-                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
+#        if fileopt:
+#            tmpfiles = []
+#            for file in fileopt:
+#                if file.startswith('file://'):
+#                    self.opts.append(('file', file[7:]))
+#                else:
+#                    tmpfiles.append(file)
+#            if tmpfiles:
+#                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
         libjaropt = getopt(self.opts, 'libjar')
         if libjaropt:
             tmpjars = []
The result of the above modification is:
11/02/26 20:23:32 INFO streaming.StreamJob: map 0% reduce 0%
11/02/26 20:23:40 INFO streaming.StreamJob: map 50% reduce 0%
11/02/26 20:23:41 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:02 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:06 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:13 INFO streaming.StreamJob: map 100% reduce 33%
11/02/26 20:24:17 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:26 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:29 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:32 INFO streaming.StreamJob: map 100% reduce 100%
(...)
11/02/26 20:24:32 ERROR streaming.StreamJob: Job not successful. Error: NA
Whenever a dumbo job is run, this warning appears:
11/07/07 13:23:25 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
As originally reported by Elias Pampalk:
The following scripts demonstrate a failure to fail when executed on a hadoop cluster (fails fine if executed locally):
import dumbo

def mapper(k, v):
    yield 1, 1

if __name__ == "__main__":
    dumbo.run(mapper, dumbo.sumsreducer, combiner=dumbo.sumsreducer)
The test uses dumbo.sumsreducer where dumbo.sumreducer should be used. A TypeError should be thrown by dumbo.sumsreducer (in the combiner). Instead, the Hadoop reports show no error and zero output from the mapper.
See recent commits on https://github.com/dangra/dumbo
Hi Klbostee,
I got bitten by this problem when running a script. I'm using the latest dumbo version, but I see the same problem in 0.21.29. Using PDB I could find that the duplicates appear here: https://github.com/klbostee/dumbo/blob/master/dumbo/core.py#L384.
I worked around it with this: http://paste.pocoo.org/show/539376/. I think the best fix would be for the options to be a set of (key, value) pairs instead of a list, but I don't know whether the options can hold non-hashable values.
Well, what do you think? I think there is some room to improve options handling, but it would be good to first learn a bit more from you about its values.
Btw, I'm running this line (I have to change the names, sorry) on the cluster:
python2.5 -m dumbo.cmd start mylib.mymodule.script -cachefile hdfspath/to/myfile.db#myfile.db -param locale=en-gb -param workdir=/some/other/hdfspath -param dataversion=2012-01-23T01-59-59.797550 -input /data/input/2012/01/20 -output /data/output/2012/01/20/transformed -hadoop system
Well, thanks,
Andrés
It'd be really useful to have a command-line switch to say "don't read any config files" or "read this specific config file instead of the default ones".
That way, you could run dumbo from a virtualenv or buildout without worrying that eggs, jars etc. specified in dumbo.conf or .dumborc would sneak in.
When actually using -memlimit, it's cumbersome to specify the limit in terms of bytes.
A simple format for specifying byte sizes could be used instead, e.g.:
-memlimit 100k
-memlimit 256m
-memlimit 2g
This should default to bytes when no unit specifier is given (e.g. -memlimit 1024).
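A minimal sketch of the parsing this would require (a hypothetical helper, not dumbo code):

def parse_memlimit(value):
    """Parse '100k', '256m', '2g' or plain bytes ('1024') into a byte count."""
    suffixes = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in suffixes:
        return int(float(value[:-1]) * suffixes[value[-1]])
    return int(value)  # no unit suffix: the value is already in bytes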
It would be convenient to provide a -queue <name> alternative for -jobconf mapred.job.queue.name=<name>.
If a parser is accidentally used on the wrong type of data file (a pretty common mistake among the analysts I'm supporting with Dumbo), the entire contents of the input file get dumped into the job logs.
I've seen this bring nodes down repeatedly, when the partition holding the log files gets completely filled up. Parsers are very useful, but I think it would be better if bad values just incremented a counter.
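The suggested behaviour could look roughly like this (a sketch assuming a class-based mapper and dumbo's counter facility; the parse_record helper is made up):

class SafeParsingMapper:
    def __call__(self, key, value):
        try:
            record = parse_record(value)  # hypothetical parser for the expected format
        except ValueError:
            # count the bad line instead of dumping it into the job logs
            self.counters['Unparsable input lines'] += 1
            return
        yield record['ip'], record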
Hadoop version: official release of 0.21.0
Dumbo version: dumbo-0.21.28-py2.6.egg
MR Program is the example found in the Wiki (https://github.com/klbostee/dumbo/wiki/Short-tutorial)
stderr log:
java.io.IOException: Cannot run program "python": java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more
Jobs that consist of only one iteration and don't use the "Job" object are now shown as having 0 iterations in the Hadoop web UI.
hi,
I want to run Dumbo with a specific input format (to read from Avro files).
It seems Dumbo does not use the input format specified by '-inputformat' when it is run locally (without specifying '-hadoop'). Instead it uses its default input format.
To check that, I specify a unknown class with '-inputformat foo.bar.UnknownClass'. It fails on hadoop but passes in local mode.
Hadoop mode:
$ dumbo start cat.py
-input word-count.avro
-output tmp
-libjar avro-1.4.1.jar
-libjar avro-utils-1.5.3-SNAPSHOT.jar
-inputformat foo.bar.UnknownClass
-python /home/sites/sci-env/0.0.5/bin/python
-hadoop /usr/lib/hadoop
...
-inputformat : class not found : foo.bar.UnknownClass
Streaming Command Failed!
Local mode:
$ dumbo start cat.py
-input word-count.avro
-output tmp
-libjar avro-1.4.1.jar
-libjar avro-utils-1.5.3-SNAPSHOT.jar
-inputformat foo.bar.UnknownClass
-python /home/sites/sci-env/0.0.5/bin/python
INFO: buffersize = 168960
=> no error, tmp was created but it contains the content of the binary avro file as it was read as text...
Is it a limitation of Dumbo that '-inputformat' only works in Hadoop mode, or is it a bug?
thanks,
jeff
Starting with CDH4b2 (nightlies at http://nightly.cloudera.com/cdh4/), /usr/lib/hadoop has been split out into a bunch of different directories, like hadoop-hdfs and hadoop-mapreduce, so a lot of the assumptions made in dumbo no longer work.
A few examples:
With JAVA_HOME=/usr/lib/jvm/default-java and HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec, I can't even run dumbo ls / -hadoop /usr/lib/hadoop, since it will complain about JAVA_HOME not being set or about not being able to find libexec.
dumbo start wordcount.py -input foo -output bar -hadoop /usr/lib/hadoop results in "ERROR: Streaming jar not found", since the streaming jar is now under /usr/lib/hadoop-mapreduce.
I might be able to fix this if I find some free time, but it might require some structural changes.
Replacing "sum" by (the non-existing function name) "summ" in "reducer1" in "examples/itertwice.py" and then running "bin/test" leads to 1 failure and 1 error, whereas doing the same change in "reducer2" (and removing "reducer2" as combiner for the first iteration) leads to 0 failures and 1 error.
This causes frustration with new users who don't know why it's mysteriously failing and haven't figured out either of the workarounds yet.
Proposed patch included; perhaps not the best solution, but better than failing silently.
diff --git a/dumbo/backends/streaming.py b/dumbo/backends/streaming.py
index f71e611..7755fbe 100644
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -230,6 +230,8 @@ class StreamingFileSystem(FileSystem):
             subpaths = [path]
         ls.close()
         for subpath in subpaths:
+            if subpath.endswith("/_logs"):
+                continue
             dumptb = os.popen('%s %s/bin/hadoop jar %s dumptb %s 2> /dev/null'
                               % (hadenv, self.hadoop, streamingjar, subpath))
             ascodeopt = getopt(opts, 'ascode')
If a mapper class has a map function, or a reducer or combiner class has a reduce function, neither configure() nor close() functions on the class will be called.
(Ref: http://groups.google.com/group/dumbo-user/browse_frm/thread/432db8171497d7f6)
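For illustration, the class shape being described (a sketch, not a fix; the setup body is made up):

class MyMapper:
    def configure(self):
        # with a map() method present, this is never invoked
        self.prefix = 'x'
    def map(self, key, value):
        # would fail at runtime because configure() never ran
        yield key, self.prefix + str(value)
    def close(self):
        # also never invoked in this case
        pass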
It will look for the first directory that exists in its search path and then assume the jar file should be found there. A better approach would be to do an explicit file check on each directory.
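A sketch of the suggested approach (a hypothetical helper; dumbo's real search paths and jar naming differ between versions):

import glob
import os

def find_streaming_jar(candidate_dirs):
    """Return a streaming jar from the first directory that actually contains one."""
    for directory in candidate_dirs:
        matches = glob.glob(os.path.join(directory, 'hadoop*streaming*.jar'))
        if matches:
            return matches[0]
    return None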
By making JoinReducer do a natural join by default (http://dumbo.assembla.com/spaces/dumbo/tickets/52), we made it not suitable anymore for combining. It would be good to add a JoinCombiner that doesn't have the natural join restriction.
See the traceback from the logs below.
Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/data/0/hdfs/local/taskTracker/jobcache/job_201011021825_0760/attempt_201011021825_0760_r_000000_1/work/rec2.py", line 156, in
    main(runner)
  File "dumbo/core.py", line 211, in main
    job.run()
  File "dumbo/core.py", line 61, in run
    run(*args, **kwargs)
  File "dumbo/core.py", line 366, in run
    for output in dumpcode(inputs):
UnboundLocalError: local variable 'inputs' referenced before assignment
(Used to be "parser attribute on a single mapper gets applied to others in same MultiMapper".)
The following script should generate the same output regardless of the input but it doesn't when the input is empty, despite yielding the same 100 entries as evidenced by the stderr messages:
import sys
import dumbo
from dumbo.lib import identitymapper

def reducer(data):
    for _ in data:
        pass
    for i in xrange(100):
        yield "foo", i
        print >> sys.stderr, "yielding", i

if __name__ == '__main__':
    dumbo.run(identitymapper, reducer)
After upgrading hadoop (to hadoop-0.20-0.20.2+320-1.noarch using cloudera's yum repo), dumbo no longer finds the hadoop streaming jar.
It seems that the jar filename has changed to:
hadoop-streaming-0.20.2+320.jar
which is not matched by the regular expression in util.py:
re.compile(r'hadoop.*%s.jar' % name)
When I run any dumbo script (I have release-0.21.28) with "-addpath yes" in the arguments, my map jobs fail with the following error: "KeyError: 'map_input_file'"
It appears that the environment variable map_input_file is no longer used in hadoop 0.21.0, and has been replaced with mapreduce_map_input_file.
This diagnosis is supported by a comment on HADOOP-5973 (https://issues.apache.org/jira/browse/HADOOP-5973) that mentions that map.input.file is only available in the older (deprecated) version of the MapReduce API in Hadoop 0.20.0.
I was able to make it work by replacing all instances of "map_input_file" with "mapreduce_map_input_file" in dumbo/core.py, but perhaps a longer-term solution would be to check both variables to see which one exists.
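The "check both variables" approach could look roughly like this (a sketch, not dumbo's actual code):

import os

def get_map_input_file():
    # Hadoop 0.20 exposes map.input.file as 'map_input_file';
    # Hadoop 0.21 renames it to 'mapreduce_map_input_file'.
    for var in ('map_input_file', 'mapreduce_map_input_file'):
        if var in os.environ:
            return os.environ[var]
    return None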
Hi there,
I'm having trouble making dumbo work with virtualenv.
My setting is a hadoop grid running cloudera cdh3b3 on debian lenny (yes I know it's old). In order to run a decent, fixed version of python I have a virtualenv with python2.6 and dumbo installed into it.
When running in local mode a simple example (ipcount), everything is fine.
But when I try to run in grid mode, I get this error:
"""
(0.0.5)ediemert@bom-dev-hadn9:~/sources/scratch/dumbo$ dumbo start count.py -input /benchmark/data/dbpedia/long_abstracts_en.nt -output wcounts -hadoop /usr/lib/hadoop/
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop//bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+737.jar -input '/benchmark/data/dbpedia/long_abstracts_en.nt' -output 'wcounts' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -mapper 'python -m count map 0 262144000' -reducer 'python -m count red 0 262144000' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=count.py (1/1)' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -cmdenv 'PYTHONPATH=common.pyc' -file '/home/ediemert/sources/scratch/dumbo/count.py' -file '/home/sites/sci-env/0.0.5/lib/python2.6/site-packages/dumbo/backends/common.pyc' -jobconf 'tmpfiles=/home/sites/sci-env/0.0.5/lib/python2.6/site-packages/typedbytes.pyc'
11/05/23 16:02:13 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/ediemert/sources/scratch/dumbo/count.py, /home/sites/sci-env/0.0.5/lib/python2.6/site-packages/dumbo/backends/common.pyc, /tmp/hadoop-ediemert/hadoop-unjar5768148904382263570/] [] /tmp/streamjob6800517112258694028.jar tmpDir=null
11/05/23 16:02:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://bom-dev-hann1.xxx.com/tmp/hadoop-mapred/mapred/staging/ediemert/.staging/job_201104261107_0645
11/05/23 16:02:13 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: /home/sites/sci-env/0.0.5/lib/python2.6/site-packages/typedbytes.pyc
Streaming Command Failed!
"""
It seems to me that this error is related to my python2.6 and dumbo being executed from a virtualenv.
Any ideas on how to tackle this problem?
Thanks a lot!
oddskool
It would be nice if it were possible, when calling Job.additer, to specify that the input for an iteration should be the output of one or more previous iterations of the Job. Something like...
job = dumbo.Job()
job.input # is an id for job's input (i.e., specified by -input on the command line)
o0 = job.additer(mapper, reducer) # returns an id for the iteration's output
o1 = job.additer(mapper, reducer, input=job.input) # take input from the job input instead of iteration 0
o2 = job.additer(mapper, reducer, input=[o0,o1]) # take input from both iteration 0 and 1
The job's output would be the output of the last iteration as always.
It seems to me this would be a fairly easy modification that would add a lot of flexibility.
As originally reported by Zak Stone:
It appears that
dumbo cat /hdfs/path/part*
does not actually concatenate all of the parts in an HDFS directory — instead, it silently emits only the key-value pairs from the first part.
Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.
Current workarounds include using dumbo cat without the star by manually deleting the _logs directory or configuring Hadoop not to create it. It may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly to ensure that each one is processed as expected.