
Comments (22)

jwheare commented on July 22, 2024

What were the limitations of dumbo that made you write mrjob? Maybe outlining those differences, whether philosophical or technical, in the docs somewhere would be useful. At the moment I'm just confused which one I'd use.

from mrjob.

coyotemarin commented on July 22, 2024

Good question! Will do. The big advantages mrjob has over dumbo are:

  • works with EMR
  • JSON and other protocols for communication between steps and with other processes
  • add custom switches to your jobs, including file options


klbostee commented on July 22, 2024

That's not quite true. I haven't looked at mrjob much, but:


coyotemarin commented on July 22, 2024

Oh, awesome, a visit from the dumbo maintainer!

Neat, didn't know that dumbo has custom switches.

It looks like dumbo's EMR support is more like running dumbo by hand inside EMR, whereas mrjob jobs will actually start up an EMR job flow and launch themselves on EMR (and install mrjob automatically) when you run them with -r emr. But yeah, I stand corrected on EMR as well.

I'll make a point of running my statements about what dumbo can and can't do by you first. I think there's room for more than one Python MapReduce library, and I'm hoping we can work together and learn from each other. :)

By the way, I've taken a look at your typedbytes library. I imagine it'll be pretty useful for mrjob once we support mixing of python scripts and Java (right now it's all Python).


dgleich commented on July 22, 2024

Are there any forks that try to get mrjob to use typedbytes as the interface instead of json? I need to use numeric data almost exclusively in a project and I'm concerned that json will introduce too much overhead.


coyotemarin commented on July 22, 2024

Not that I know of, but I'd be happy to work with you on getting it to work.

One assumption you might bump up against is that mrjob assumes you're using a line-based format (since it's a Hadoop streaming library). However, it would not be at all difficult to extend the protocol definition to allow arbitrary reading and writing to file handles.
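To make the difference concrete, here is a minimal sketch contrasting line-based reading with reading arbitrary records from a file handle. It is not mrjob code: the function names (`read_lines`, `write_framed`, `read_framed`) and the 4-byte length-prefix framing are assumptions chosen for illustration only.

```python
import io
import struct


def read_lines(handle):
    """Line-based reading: one record per newline-terminated line."""
    for line in handle:
        yield line.rstrip(b"\n")


def write_framed(handle, payload):
    """Write one record as a 4-byte big-endian length followed by the bytes."""
    handle.write(struct.pack(">I", len(payload)))
    handle.write(payload)


def read_framed(handle):
    """Handle-based reading: records may contain any byte, including newlines."""
    while True:
        header = handle.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = handle.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


records = [b"plain", b"has\nnewline", b""]

# Framed records survive a round trip unchanged.
buf = io.BytesIO()
for record in records:
    write_framed(buf, record)
buf.seek(0)
framed_result = list(read_framed(buf))

# The same records written as lines: the embedded newline splits one record in two.
line_result = list(read_lines(io.BytesIO(b"\n".join(records) + b"\n")))
```

The line-based reader cannot tell a record separator from a newline inside the data, which is why a binary format needs a reader that works on the handle directly.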

All the code that controls input/output format is in mrjob.job.MRJob.run_mapper(), mrjob.job.MRJob.run_reducer(), and mrjob.job.MRJob.run_combiner(). I would recommend forking master, hacking this code however you need to in order to get typed bytes to work, and then we can figure out the best way to clean it up and generalize once you have something working.

Let me know how it goes!


klbostee commented on July 22, 2024

Hadoop streaming doesn't require a line-based format actually, its default InputWriter and OutputReader just happen to be line-based. With typed bytes you simply use a different InputWriter and OutputReader that rely on a more efficient binary format (and those typed bytes IO classes are also shipped with Hadoop streaming by the way).

Also, with the java and python side I meant streaming and dumbo above. Mrjob might be all python (as is dumbo), but you still need to talk to streaming's java code then. Making this more efficient can lead to very substantial performance gains.
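As a rough illustration of why a binary format can be cheaper for numeric data, the sketch below encodes an integer key and a double value in a tag-plus-payload layout modeled on typed bytes, next to the equivalent tab-separated JSON line. The type codes (3 for a 4-byte int, 6 for an 8-byte double) are written from memory of the Hadoop typed bytes documentation, so treat them as assumptions; `encode` and `decode` are hypothetical helpers, not the API of the typedbytes module.

```python
import json
import struct

# Assumed type codes, modeled on the Hadoop typed bytes format.
TYPE_INT = 3      # followed by a 4-byte signed big-endian integer
TYPE_DOUBLE = 6   # followed by an 8-byte big-endian IEEE 754 double


def encode(value):
    """Encode one int or float as a 1-byte type code plus a fixed-size payload."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return struct.pack(">bi", TYPE_INT, value)
    if isinstance(value, float):
        return struct.pack(">bd", TYPE_DOUBLE, value)
    raise TypeError("unsupported type: %r" % type(value))


def decode(data, offset=0):
    """Decode one value starting at offset; return (value, next_offset)."""
    code = data[offset]
    if code == TYPE_INT:
        (value,) = struct.unpack_from(">i", data, offset + 1)
        return value, offset + 5
    if code == TYPE_DOUBLE:
        (value,) = struct.unpack_from(">d", data, offset + 1)
        return value, offset + 9
    raise ValueError("unknown type code: %d" % code)


key, value = 7, 0.1234567890123456

# Binary record: key and value back to back, no separators needed.
binary_record = encode(key) + encode(value)

# Line-based record: JSON key, tab, JSON value, newline.
text_record = (json.dumps(key) + "\t" + json.dumps(value) + "\n").encode("utf-8")

decoded_key, pos = decode(binary_record)
decoded_value, pos = decode(binary_record, pos)
```

Comparing `len(binary_record)` with `len(text_record)` gives a feel for the size difference on one record. It does not measure parsing cost, which is likely where much of the gain would come from.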


dgleich commented on July 22, 2024

It looks like it won't be too difficult to come up with a proof-of-concept that this works. The user interface for picking a protocol will be a tricky bit. To start with, I'll try and force everything to use typedbytes.


dgleich commented on July 22, 2024

One issue that'll arise is how to distribute the typedbytes python module with the mrjob.tar.gz file.


dgleich commented on July 22, 2024

I have a fork, dgleich/mrjob@def0452, that does this in a really hacky way. At the moment, it assumes all IO from python is in typedbytes.

  • Tested on CDH3 in single node pseudo-distributed mode.

  • typedbytes module installed system-wide

  • dumbo installed via pip install dumbo

    (pyenv)dgleich@recurrent:~/devextern/mrjob$ HADOOP_HOME=/usr/lib/hadoop-0.20/ python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop -v --no-output -o counts --hadoop-output-format=org.apache.hadoop.mapred.SequenceFileOutputFormat

If I cat the output file, I get garbage now.

Now, let's see the output via dumbo cat (which should convert from typed bytes back to the usual...)

dumbo cat counts/part-* -hadoop /usr/lib/hadoop-0.20
12/02/07 12:57:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
_   5
a   2
e   1
g   1
r   2
an  2
by  1


coyotemarin commented on July 22, 2024

Oh, awesome.

--hadoop-output-format only sets the output format for the last step. You probably want to set input/output format directly with e.g. --hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat (and input format as well) so that you'll be using that format to pass sequence file data between your mapper and reducer.

(And thank you for the tips, @klbostee!)


dgleich commented on July 22, 2024

As I'm thinking about how to implement this in a more structured way, I'm wondering if there is any reason to use a line-based interface between hadoop streaming and mrjob.

Klaas, if you use typedbytes as the communication between the streaming jar and python, you can still write out line-based data by setting the right outputformat, right?

I'm also struggling with thinking about how the typedbytes interface fits into mrjob. That is, python can communicate with java via typedbytes. Records can be stored as typedbytes, and I think any standard writeables get converted to typedbytes to communicate with streaming.

If the answer to the above question is yes, then I don't see any reason to keep the old line based interface around. Am I missing something?

(As an aside, I'll note that one reason to keep an OPTION for a line-based mode around would be to promote compatibility with other languages. For instance, if mrjob has the ability to specify other executables to run to process the data... it might not be so easy to use typedbytes with those.)


coyotemarin commented on July 22, 2024

We picked JSON and a line-based interface because:

  • The default interface to Hadoop Streaming is line-based
  • The line-based record format is easy to understand
  • Basically all our input files are line-based log files
  • JSON is human-readable
  • JSON is about the most interoperable format around

That being said, I'd love to know if there are significant efficiency gains to be had by using typed bytes, and I'd love to have a good use case to guide the design of the protocol interface for non line-based formats.
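For reference, a line-based JSON record of the kind described in the list above can be sketched in a few lines: the key and value are each JSON-encoded and joined by a tab, one record per line. The helper names `write_record` and `read_record` are hypothetical and this is not mrjob's own protocol code, only an illustration of the idea.

```python
import json


def write_record(key, value):
    """Encode a key/value pair as one line: JSON key, tab, JSON value."""
    return json.dumps(key) + "\t" + json.dumps(value)


def read_record(line):
    """Decode a line produced by write_record back into (key, value)."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


# A word-count style record stays human-readable.
simple_line = write_record("an", 2)

# JSON escapes tabs and newlines inside strings, so the record still
# fits on one line with exactly one literal tab as the separator.
tricky_key = "a\tb"
tricky_value = {"n": 1, "tags": ["x\ny"]}
tricky_line = write_record(tricky_key, tricky_value)
round_trip = read_record(tricky_line + "\n")
```

Because the JSON encoder escapes control characters, splitting on the first tab is safe even when keys contain tabs, which is part of what makes the format easy to use with line-oriented tools.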


dgleich commented on July 22, 2024

I don't see any way around radically modifying the functions job.pick_protocols and job._wrap_protocols.

I don't think that modifying _wrap_protocols is an issue as it's an "internal" function. However, is modifying pick_protocols an issue? What (if any) backwards compatibility should I consider with changing this function?


coyotemarin commented on July 22, 2024

It's totally fine to override pick_protocols() (or to rewrite it while hacking out a proof-of-concept). See http://packages.python.org/mrjob/job.html#mrjob.job.MRJob.pick_protocols.


irskep commented on July 22, 2024

Two questions:

  • Do we ever plan to actually add this to the docs?
  • Should we open a new issue about typedbytes? How's that going?


dgleich commented on July 22, 2024

I'm working with a student on implementing it ... here's the current repository he's been working on. I just haven't had a chance to test it yet.

See ... joshi-prahlad/mrjob@b0b005a for the most recent changes.

David


irskep commented on July 22, 2024

Cool, but it's really confusing to me that we're doing all this typedbytes discussion on the "dumbo migration in mrjob docs" issue. Can you please open a new one with what you actually plan to do?


irskep commented on July 22, 2024

And my "are we actually doing this" question was about the docs, not typedbytes.


dgleich commented on July 22, 2024

Just added #430 about the typedbytes piece of this issue.


irskep commented on July 22, 2024

IMO we should close this issue. If someone wants to volunteer to write it, they have my endless thanks, but we'll never do it ourselves and no one has asked for it.


coyotemarin commented on July 22, 2024

Yep.

