Comments (22)
What were the limitations of dumbo that made you write mrjob? Maybe outlining those differences, whether philosophical or technical, in the docs somewhere would be useful. At the moment I'm just confused about which one I'd use.
from mrjob.
Good question! Will do. The big advantages mrjob has over dumbo are:
- works with EMR
- JSON and other protocols for communication between steps and with other processes
- add custom switches to your jobs, including file options
That's not quite true. I haven't looked at mrjob much, but:
- dumbo programs can run on EMR too: http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr
- one could easily implement a dumbo backend that uses a different protocol, but most people will want to stick to the so-called "typed bytes" it uses by default on Hadoop, because they are implemented very efficiently on both the Java and Python sides and make it possible to (properly) consume non-text input files such as sequence files: http://static.last.fm/johan/huguk-20090414/klaas-hadoop-1722.pdf
- dumbo programs can have custom switches too: http://github.com/klbostee/dumbo/wiki/Short-tutorial#programs_and_starters
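For readers who haven't seen it, the typed bytes format mentioned above is a simple tagged binary encoding: a one-byte type code followed by a big-endian payload. A rough sketch in Python for a few scalar types (type codes follow the Hadoop typed bytes spec; this is an illustration, not the typedbytes library itself):

```python
import struct

# Minimal typed-bytes-style encoder/decoder sketch (scalars only).
# Type codes per the Hadoop typed bytes format: 4 = long, 6 = double,
# 7 = UTF-8 string. Vectors, lists, and maps are omitted here.
def tb_encode(obj):
    if isinstance(obj, bool):
        raise TypeError('bools omitted from this sketch')
    if isinstance(obj, int):
        return struct.pack('>bq', 4, obj)       # 1-byte tag + 8-byte long
    if isinstance(obj, float):
        return struct.pack('>bd', 6, obj)       # 1-byte tag + 8-byte double
    if isinstance(obj, str):
        data = obj.encode('utf-8')
        return struct.pack('>bi', 7, len(data)) + data  # tag + length + bytes
    raise TypeError('unsupported type: %r' % type(obj))

def tb_decode(buf):
    tag = buf[0]
    if tag == 4:
        return struct.unpack('>q', buf[1:9])[0]
    if tag == 6:
        return struct.unpack('>d', buf[1:9])[0]
    if tag == 7:
        (n,) = struct.unpack('>i', buf[1:5])
        return buf[5:5 + n].decode('utf-8')
    raise ValueError('unsupported tag: %d' % tag)
```

Note the fixed cost: a long is always 9 bytes on the wire, with no text parsing on either side, which is where the efficiency claim comes from.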
Oh, awesome, a visit from the dumbo maintainer!
Neat, didn't know that dumbo has custom switches.
It looks like dumbo's EMR support is more like running dumbo by hand inside EMR, whereas mrjob jobs will actually start up an EMR job flow and launch themselves on EMR (installing mrjob automatically) when you run them with -r emr. But yeah, I stand corrected on EMR as well.
I'll make a point of running my statements about what dumbo can and can't do by you first. I think there's room for more than one Python MapReduce library, and I'm hoping we can work together and learn from each other. :)
By the way, I've taken a look at your typedbytes library. I imagine it'll be pretty useful for mrjob once we support mixing Python scripts and Java (right now it's all Python).
Are there any forks that try to get mrjob to use typed bytes as the interface instead of JSON? I need to use numeric data almost exclusively in a project, and I'm concerned that JSON will introduce too much overhead.
Not that I know of, but I'd be happy to work with you on getting it to work.
One assumption you might bump up against is that mrjob assumes you're using a line-based format (since it's a Hadoop streaming library). However, it would not be at all difficult to extend the protocol definition to allow arbitrary reading and writing to file handles.
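To make the line-based assumption concrete, here is a standalone sketch (not mrjob's actual classes) of the read/write pair a line-based protocol boils down to, with key and value encoded as tab-separated JSON:

```python
import json

class JSONLineProtocol:
    """Standalone sketch of a line-based protocol: each record is one
    line, with JSON-encoded key and value separated by a tab."""

    def read(self, line):
        # Decode one input line into a (key, value) pair.
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        # Encode a (key, value) pair as one output line (no newline).
        return '%s\t%s' % (json.dumps(key), json.dumps(value))
```

Swapping in a binary codec would mean replacing this per-line read/write with reads and writes against a file handle instead.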
All the code that controls input/output format is in mrjob.job.MRJob.run_mapper(), mrjob.job.MRJob.run_reducer(), and mrjob.job.MRJob.run_combiner(). I would recommend forking master, hacking this code however you need to in order to get typed bytes to work, and then we can figure out the best way to clean it up and generalize once you have something working.
Let me know how it goes!
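For orientation, the loop those run_*() methods implement is roughly the following (a simplified standalone sketch, not mrjob's actual code); a typed bytes version would replace the per-line read/write calls with binary reads and writes on the raw file handles:

```python
import sys

def run_mapper(mapper, read, write, stdin=sys.stdin, stdout=sys.stdout):
    # Simplified sketch of a streaming mapper loop: decode each input
    # record with the protocol's read function, apply the user's mapper,
    # and encode every emitted pair with the protocol's write function.
    for line in stdin:
        key, value = read(line.rstrip('\n'))
        for out_key, out_value in mapper(key, value):
            stdout.write(write(out_key, out_value) + '\n')
```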
Hadoop streaming doesn't actually require a line-based format; its default InputWriter and OutputReader just happen to be line-based. With typed bytes you simply use a different InputWriter and OutputReader that rely on a more efficient binary format (and those typed bytes IO classes ship with Hadoop streaming, by the way).
Also, by the Java and Python side I meant streaming and dumbo above. mrjob might be all Python (as is dumbo), but you still need to talk to streaming's Java code. Making that communication more efficient can lead to very substantial performance gains.
It looks like it won't be too difficult to come up with a proof of concept that this works. The user interface for picking a protocol will be the tricky bit. To start with, I'll try to force everything to use typed bytes.
One issue that'll arise is how to distribute the typedbytes Python module with the mrjob.tar.gz file.
I have a fork (dgleich/mrjob@def0452) that does this in a really hacky way. At the moment, it assumes all IO from Python is in typed bytes.
- Tested on CDH3 in single-node pseudo-distributed mode.
- typedbytes module installed system-wide.
- dumbo installed via pip install dumbo.
(pyenv)dgleich@recurrent:~/devextern/mrjob$ HADOOP_HOME=/usr/lib/hadoop-0.20/ python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop -v --no-output -o counts --hadoop-output-format=org.apache.hadoop.mapred.SequenceFileOutputFormat
If I cat the output file, I get garbage now.
Now, let's see the output via dumbo cat (which should convert from typed bytes back to the usual...)
dumbo cat counts/part-* -hadoop /usr/lib/hadoop-0.20
12/02/07 12:57:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
_ 5
a 2
e 1
g 1
r 2
an 2
by 1
Oh, awesome.
--hadoop-output-format only sets the output format for the last step. You probably want to set the input/output format directly, e.g. --hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat (and the input format as well), so that you'll be using that format to pass sequence file data between your mapper and reducer.
(And thank you for the tips, @klbostee!)
As I'm thinking about how to implement this in a more structured way, I'm wondering if there is any reason to use a line-based interface between hadoop streaming and mrjob.
Klaas, if you use typedbytes as the communication between the streaming jar and python, you can still write out line-based data by setting the right outputformat, right?
I'm also struggling with how the typed bytes interface fits into mrjob. That is, Python can communicate with Java via typed bytes, records can be stored as typed bytes, and I think any standard Writables get converted to typed bytes when communicating with streaming.
If the answer to the above question is yes, then I don't see any reason to keep the old line based interface around. Am I missing something?
(As an aside, I'll note that one reason to keep an OPTION for a line-based mode around would be to promote compatibility with other languages. For instance, if mrjob has the ability to specify other executables to run to process the data... it might not be so easy to use typed bytes with those.)
We picked JSON and a line-based interface because:
- The default interface to Hadoop Streaming is line-based
- The line-based record format is easy to understand
- Basically all our input files are line-based log files
- JSON is human-readable
- JSON is about the most interoperable format around
That being said, I'd love to know if there are significant efficiency gains to be had by using typed bytes, and I'd love to have a good use case to guide the design of the protocol interface for non line-based formats.
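On the efficiency question for numeric data, the size overhead at least is easy to estimate. A quick sketch comparing a JSON-encoded line with a fixed-width tagged binary packing of the same (long, double) pair, using the 1-byte-tag + 8-byte-payload layout typed bytes uses for those types:

```python
import json
import struct

pair = (1234567890, 3.141592653589793)

# JSON line: decimal text plus a tab delimiter and trailing newline.
json_size = len(json.dumps(pair[0]) + '\t' + json.dumps(pair[1]) + '\n')

# Typed-bytes-style packing: 1-byte tag + 8-byte big-endian payload each.
binary_size = len(struct.pack('>bq', 4, pair[0]) + struct.pack('>bd', 6, pair[1]))

print(json_size, binary_size)  # → 29 18
```

The gap in bytes is modest here, but the binary form also skips number-to-text parsing on every record, which is where much of the per-record cost goes.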
I don't see any way around radically modifying the functions job.pick_protocols and job._wrap_protocols.
I don't think that modifying _wrap_protocols is an issue, as it's an "internal" function. However, is modifying pick_protocols an issue? What (if any) backwards compatibility should I consider when changing this function?
It's totally fine to override pick_protocols() (or to rewrite it while hacking out a proof of concept). See http://packages.python.org/mrjob/job.html#mrjob.job.MRJob.pick_protocols.
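For anyone following along, the idea of that hook is: given a step number and step type, hand back the read/write pair to use. Here is a hypothetical standalone sketch of such a dispatch (the names and signature are illustrative, not mrjob's actual code), with a raw-value read for the first mapper's input and tab-separated JSON everywhere else:

```python
import json

def pick_protocols(step_num, step_type):
    # Hypothetical standalone sketch: raw lines for the very first mapper
    # input, JSON-encoded tab-separated key/value pairs everywhere else.
    def raw_read(line):
        return None, line          # key is None, value is the raw line

    def json_read(line):
        k, v = line.split('\t', 1)
        return json.loads(k), json.loads(v)

    def json_write(key, value):
        return json.dumps(key) + '\t' + json.dumps(value)

    read = raw_read if (step_num == 0 and step_type == 'mapper') else json_read
    return read, json_write
```

A typed bytes variant would return binary readers/writers from the same dispatch point, which is why overriding this one function is a natural place to start.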
Two questions:
- Do we ever plan to actually add this to the docs?
- Should we open a new issue about typedbytes? How's that going?
I'm working with a student on implementing it ... here's the current repository he's been working on. I just haven't had a chance to test it yet.
See ... joshi-prahlad/mrjob@b0b005a for the most recent changes.
David
Cool, but it's really confusing to me that we're doing all this typedbytes discussion on the "dumbo migration in mrjob docs" issue. Can you please open a new one with what you actually plan to do?
And my "are we actually doing this" question was about the docs, not typedbytes.
Just added #430 about the typedbytes piece of this issue.
IMO we should close this issue. If someone wants to volunteer to write it, they have my endless thanks, but we'll never do it ourselves and no one has asked for it.
Yep.