Comments (22)
What were the limitations of dumbo that made you write mrjob? Maybe outlining those differences, whether philosophical or technical, in the docs somewhere would be useful. At the moment I'm just confused about which one I'd use.
from mrjob.
Good question! Will do. The big advantages mrjob has over dumbo are:
- works with EMR
- JSON and other protocols for communication between steps and with other processes
- add custom switches to your jobs, including file options
That's not quite true. I haven't looked at mrjob much, but:
- dumbo programs can run on EMR too: http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr
- one could easily implement a dumbo backend that uses a different protocol, but most people will want to stick to the so-called "typed bytes" it uses by default on Hadoop, because they are implemented very efficiently on both the Java and Python sides and make it possible to (properly) consume non-text input files such as sequence files: http://static.last.fm/johan/huguk-20090414/klaas-hadoop-1722.pdf
- dumbo programs can have custom switches too: http://github.com/klbostee/dumbo/wiki/Short-tutorial#programs_and_starters
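For readers who haven't seen it, the typed bytes format mentioned above is a simple tagged binary encoding: a one-byte type code followed by a big-endian payload. A rough sketch in Python for a few scalar types (type codes follow the Hadoop typed bytes spec; this is an illustration, not the typedbytes library itself):

```python
import struct

# Minimal typed-bytes-style encoder/decoder sketch (scalars only).
# Type codes per the Hadoop typed bytes format: 4 = long, 6 = double,
# 7 = UTF-8 string. Vectors, lists, and maps are omitted here.
def tb_encode(obj):
    if isinstance(obj, bool):
        raise TypeError('bools omitted from this sketch')
    if isinstance(obj, int):
        return struct.pack('>bq', 4, obj)       # 1-byte tag + 8-byte long
    if isinstance(obj, float):
        return struct.pack('>bd', 6, obj)       # 1-byte tag + 8-byte double
    if isinstance(obj, str):
        data = obj.encode('utf-8')
        return struct.pack('>bi', 7, len(data)) + data  # tag + length + bytes
    raise TypeError('unsupported type: %r' % type(obj))

def tb_decode(buf):
    tag = buf[0]
    if tag == 4:
        return struct.unpack('>q', buf[1:9])[0]
    if tag == 6:
        return struct.unpack('>d', buf[1:9])[0]
    if tag == 7:
        (n,) = struct.unpack('>i', buf[1:5])
        return buf[5:5 + n].decode('utf-8')
    raise ValueError('unsupported tag: %d' % tag)
```

Note the fixed cost: a long is always 9 bytes on the wire, with no text parsing on either side, which is where the efficiency claim comes from.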
Oh, awesome, a visit from the dumbo maintainer!
Neat, didn't know that dumbo has custom switches.
It looks like dumbo's EMR support is more like running dumbo by hand inside EMR, whereas mrjob jobs will actually start up an EMR job flow and launch themselves on EMR (installing mrjob automatically) when you run them with -r emr. But yeah, I stand corrected on EMR as well.
I'll make a point of running my statements about what dumbo can and can't do by you first. I think there's room for more than one Python MapReduce library, and I'm hoping we can work together and learn from each other. :)
By the way, I've taken a look at your typedbytes library. I imagine it'll be pretty useful for mrjob once we support mixing Python scripts and Java (right now it's all Python).
Are there any forks that try to get mrjob to use typed bytes as the interface instead of JSON? I need to use numeric data almost exclusively in a project, and I'm concerned that JSON will introduce too much overhead.
Not that I know of, but I'd be happy to work with you on getting it to work.
One assumption you might bump up against is that mrjob assumes you're using a line-based format (since it's a Hadoop streaming library). However, it would not be at all difficult to extend the protocol definition to allow arbitrary reading and writing to file handles.
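To make the line-based assumption concrete, here is a standalone sketch (not mrjob's actual classes) of the read/write pair a line-based protocol boils down to, with key and value encoded as tab-separated JSON:

```python
import json

class JSONLineProtocol:
    """Standalone sketch of a line-based protocol: each record is one
    line, with JSON-encoded key and value separated by a tab."""

    def read(self, line):
        # Decode one input line into a (key, value) pair.
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        # Encode a (key, value) pair as one output line (no newline).
        return '%s\t%s' % (json.dumps(key), json.dumps(value))
```

Swapping in a binary codec would mean replacing this per-line read/write with reads and writes against a file handle instead.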
All the code that controls input/output format is in mrjob.job.MRJob.run_mapper(), mrjob.job.MRJob.run_reducer(), and mrjob.job.MRJob.run_combiner(). I would recommend forking master, hacking this code however you need to in order to get typed bytes to work, and then we can figure out the best way to clean it up and generalize once you have something working.
Let me know how it goes!
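For orientation, the loop those run_*() methods implement is roughly the following (a simplified standalone sketch, not mrjob's actual code); a typed bytes version would replace the per-line read/write calls with binary reads and writes on the raw file handles:

```python
import sys

def run_mapper(mapper, read, write, stdin=sys.stdin, stdout=sys.stdout):
    # Simplified sketch of a streaming mapper loop: decode each input
    # record with the protocol's read function, apply the user's mapper,
    # and encode every emitted pair with the protocol's write function.
    for line in stdin:
        key, value = read(line.rstrip('\n'))
        for out_key, out_value in mapper(key, value):
            stdout.write(write(out_key, out_value) + '\n')
```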
Hadoop streaming doesn't actually require a line-based format; its default InputWriter and OutputReader just happen to be line-based. With typed bytes you simply use a different InputWriter and OutputReader that rely on a more efficient binary format (and those typed bytes IO classes ship with Hadoop streaming, by the way).
Also, by the Java and Python side I meant streaming and dumbo above. mrjob might be all Python (as is dumbo), but you still need to talk to streaming's Java code. Making that communication more efficient can lead to very substantial performance gains.
It looks like it won't be too difficult to come up with a proof of concept that this works. The user interface for picking a protocol will be the tricky bit. To start with, I'll try to force everything to use typed bytes.
One issue that'll arise is how to distribute the typedbytes Python module with the mrjob.tar.gz file.
I have a fork (dgleich/mrjob@def0452) that does this in a really hacky way. At the moment, it assumes all IO from Python is in typed bytes.
- Tested on CDH3 in single-node pseudo-distributed mode.
- typedbytes module installed system-wide.
- dumbo installed via pip install dumbo.
(pyenv)dgleich@recurrent:~/devextern/mrjob$ HADOOP_HOME=/usr/lib/hadoop-0.20/ python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop -v --no-output -o counts --hadoop-output-format=org.apache.hadoop.mapred.SequenceFileOutputFormat
If I cat the output file, I get garbage now.
Now, let's see the output via dumbo cat (which should convert from typed bytes back to the usual...)
dumbo cat counts/part-* -hadoop /usr/lib/hadoop-0.20
12/02/07 12:57:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
_ 5
a 2
e 1
g 1
r 2
an 2
by 1
Oh, awesome.
--hadoop-output-format only sets the output format for the last step. You probably want to set the input/output format directly, e.g. --hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat (and the input format as well), so that you'll be using that format to pass sequence file data between your mapper and reducer.
(And thank you for the tips, @klbostee!)
As I'm thinking about how to implement this in a more structured way, I'm wondering if there is any reason to use a line-based interface between hadoop streaming and mrjob.
Klaas, if you use typedbytes as the communication between the streaming jar and python, you can still write out line-based data by setting the right outputformat, right?
I'm also struggling with how the typed bytes interface fits into mrjob. That is, Python can communicate with Java via typed bytes, records can be stored as typed bytes, and I think any standard Writables get converted to typed bytes when communicating with streaming.
If the answer to the above question is yes, then I don't see any reason to keep the old line based interface around. Am I missing something?
(As an aside, I'll note that one reason to keep an OPTION for a line-based mode around would be to promote compatibility with other languages. For instance, if mrjob has the ability to specify other executables to run to process the data... it might not be so easy to use typed bytes with those.)
We picked JSON and a line-based interface because:
- The default interface to Hadoop Streaming is line-based
- The line-based record format is easy to understand
- Basically all our input files are line-based log files
- JSON is human-readable
- JSON is about the most interoperable format around
That being said, I'd love to know if there are significant efficiency gains to be had by using typed bytes, and I'd love to have a good use case to guide the design of the protocol interface for non line-based formats.
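On the efficiency question for numeric data, the size overhead at least is easy to estimate. A quick sketch comparing a JSON-encoded line with a fixed-width tagged binary packing of the same (long, double) pair, using the 1-byte-tag + 8-byte-payload layout typed bytes uses for those types:

```python
import json
import struct

pair = (1234567890, 3.141592653589793)

# JSON line: decimal text plus a tab delimiter and trailing newline.
json_size = len(json.dumps(pair[0]) + '\t' + json.dumps(pair[1]) + '\n')

# Typed-bytes-style packing: 1-byte tag + 8-byte big-endian payload each.
binary_size = len(struct.pack('>bq', 4, pair[0]) + struct.pack('>bd', 6, pair[1]))

print(json_size, binary_size)  # → 29 18
```

The gap in bytes is modest here, but the binary form also skips number-to-text parsing on every record, which is where much of the per-record cost goes.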
I don't see any way around radically modifying the functions job.pick_protocols and job._wrap_protocols.
I don't think that modifying _wrap_protocols is an issue, as it's an "internal" function. However, is modifying pick_protocols an issue? What (if any) backwards compatibility should I consider when changing this function?
It's totally fine to override pick_protocols() (or to rewrite it while hacking out a proof of concept). See http://packages.python.org/mrjob/job.html#mrjob.job.MRJob.pick_protocols.
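For anyone following along, the idea of that hook is: given a step number and step type, hand back the read/write pair to use. Here is a hypothetical standalone sketch of such a dispatch (the names and signature are illustrative, not mrjob's actual code), with a raw-value read for the first mapper's input and tab-separated JSON everywhere else:

```python
import json

def pick_protocols(step_num, step_type):
    # Hypothetical standalone sketch: raw lines for the very first mapper
    # input, JSON-encoded tab-separated key/value pairs everywhere else.
    def raw_read(line):
        return None, line          # key is None, value is the raw line

    def json_read(line):
        k, v = line.split('\t', 1)
        return json.loads(k), json.loads(v)

    def json_write(key, value):
        return json.dumps(key) + '\t' + json.dumps(value)

    read = raw_read if (step_num == 0 and step_type == 'mapper') else json_read
    return read, json_write
```

A typed bytes variant would return binary readers/writers from the same dispatch point, which is why overriding this one function is a natural place to start.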
Two questions:
- Do we ever plan to actually add this to the docs?
- Should we open a new issue about typedbytes? How's that going?
I'm working with a student on implementing it ... here's the current repository he's been working on. I just haven't had a chance to test it yet.
See ... joshi-prahlad/mrjob@b0b005a for the most recent changes.
David
Cool, but it's really confusing to me that we're doing all this typedbytes discussion on the "dumbo migration in mrjob docs" issue. Can you please open a new one with what you actually plan to do?
And my "are we actually doing this" question was about the docs, not typedbytes.
Just added #430 about the typedbytes piece of this issue.
IMO we should close this issue. If someone wants to volunteer to write it, they have my endless thanks, but we'll never do it ourselves and no one has asked for it.
Yep.