mincemeatpy's Introduction

mincemeat.py: MapReduce on Python

Introduction

mincemeat.py is a Python implementation of the MapReduce distributed computing framework.

mincemeat.py is:

  • Lightweight - All of the code is contained in a single Python file (currently weighing in at <13kB) that depends only on the Python Standard Library. Any computer with Python and mincemeat.py can be a part of your cluster.
  • Fault tolerant - Workers (clients) can join and leave the cluster at any time without affecting the entire process.
  • Secure - mincemeat.py authenticates both ends of every connection, ensuring that only authorized code is executed.
  • Open source - mincemeat.py is distributed under the MIT License, and consequently is free for all use, including commercial, personal, and academic, and can be modified and redistributed without restriction.

Download

  • Just mincemeat.py (v 0.1.4)
  • The full 0.1.4 release (includes documentation and examples)
  • Clone this git repository: git clone https://github.com/michaelfairley/mincemeatpy.git

Example

Let's look at the canonical MapReduce example, word counting:

example.py:

#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]
# The data source can be any dictionary-like object
datasource = dict(enumerate(data))

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    result = sum(vs)
    return result

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print(results)

Execute this script on the server:

python example.py

Run mincemeat.py as a worker on a client:

python mincemeat.py -p changeme [server address]

And the server will print out:

{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, "Couldn't": 1, 'fall': 1, 'and': 1, 'the': 2, 'put': 1, 'sat': 1}
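To see where these counts come from, the map, group, and reduce steps can be reproduced in a single process. This is only a sketch of the MapReduce data flow, not mincemeat.py's own code: run_local is a hypothetical helper, and no network or workers are involved.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]
datasource = dict(enumerate(data))

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    return sum(vs)

def run_local(datasource, mapfn, reducefn):
    """Hypothetical single-process stand-in for a MapReduce run."""
    # Map phase: call mapfn once per (key, value) pair and group the
    # emitted values by their output key.
    grouped = defaultdict(list)
    for key in datasource:
        for out_key, out_value in mapfn(key, datasource[key]):
            grouped[out_key].append(out_value)
    # Reduce phase: call reducefn once per output key.
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

results = run_local(datasource, mapfn, reducefn)
print(results["Humpty"])  # 3
```

Each line contributes one (word, 1) pair per word, the pairs are grouped by word, and reducefn sums each group.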

The word-count example above is deliberately simple, but the same code works just as well when the datasource is a collection of large files and the client runs on multiple machines. In fact, mincemeat.py has been used to produce word frequency lists for many gigabytes of text using a slightly modified version of this code.
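For file-based input, one option is a small dictionary-like class that uses file paths as keys and reads each file only when its value is requested, so the server never holds the whole corpus in memory. The sketch below assumes the server needs nothing more than iteration over keys and lookup by key. FileSource and the temporary files are hypothetical names made up for illustration.

```python
import os
import tempfile

class FileSource(object):
    """Hypothetical dict-like datasource: keys are file paths, values are
    file contents, read lazily on lookup."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        return iter(self.paths)

    def keys(self):
        return list(self.paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, path):
        if path not in self.paths:
            raise KeyError(path)
        with open(path) as f:
            return f.read()

# Build two small files so the example can run anywhere.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "Humpty Dumpty sat on a wall"),
                   ("b.txt", "Humpty Dumpty had a great fall")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

datasource = FileSource(paths)
for key in datasource:
    print(os.path.basename(key), len(datasource[key].split()))
```

An instance like this would be assigned to s.datasource in place of the dict used in example.py.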

Clients

You can run the client manually from within other Python scripts (rather than running mincemeat.py directly):

import mincemeat

client = mincemeat.Client()
client.password = "changeme"
client.conn("localhost", mincemeat.DEFAULT_PORT)

shepherd.py provides more sophisticated ways to run clients, including clients that poll or are forked on the same machine.

Imports

One potential gotcha when using mincemeat.py: your mapfn and reducefn functions don't have access to their enclosing environment, including imported modules. If you need to use an imported module in one of these functions, be sure to put the import statement (import whatever) inside the functions themselves.
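The effect can be reproduced locally by rebuilding a function from its code object alone, which is roughly what happens when only the function's code is sent to another process. This is an illustration of the general mechanism written for Python 3, not a copy of mincemeat.py's transport. ship, bad_mapfn, and good_mapfn are hypothetical names.

```python
import builtins
import marshal
import re  # module-level import: visible here, lost once the code is shipped
import types

def bad_mapfn(k, v):
    # Relies on the module-level "re", which the shipped copy cannot see.
    for w in re.findall(r"\w+", v):
        yield w, 1

def good_mapfn(k, v):
    import re  # imported inside the function, so it travels with the code
    for w in re.findall(r"\w+", v):
        yield w, 1

def ship(fn):
    """Serialize only fn's code and rebuild it with fresh, empty globals."""
    code = marshal.loads(marshal.dumps(fn.__code__))
    return types.FunctionType(code, {"__builtins__": builtins}, fn.__name__)

good_pairs = list(ship(good_mapfn)(0, "Humpty Dumpty"))

try:
    list(ship(bad_mapfn)(0, "Humpty Dumpty"))
    bad_error = None
except NameError as exc:
    bad_error = exc

print(good_pairs)
print(type(bad_error).__name__)
```

The shipped good_mapfn works because its import runs wherever the function runs. The shipped bad_mapfn fails with a NameError as soon as it touches re.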

Python 3 support

ziyuang has a fork of mincemeat.py that's compatible with Python 3: ziyuang/mincemeatpy

mincemeatpy's People

Contributors

apendleton, garyelephant, michaelfairley, rs2

mincemeatpy's Issues

socket.error: [Errno 48] Address already in use

I am new to mincemeat. When I try to run a client after running the server, I get this error: socket.error: [Errno 48] Address already in use.

Basically what I do is:

  1. python example.py
  2. In another terminal: python example.py -p changeme localhost

I am using Snow Leopard 10.6.8

Please advise.

Thanks
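A likely cause, going by the README's instructions rather than anything confirmed in this report: step 2 runs example.py a second time, which starts a second server on the same port, instead of running mincemeat.py as the worker. The sketch below reproduces that class of error with two plain sockets bound to the same address. The port is chosen by the operating system and has nothing to do with mincemeat.py's default.

```python
import errno
import socket

# First "server": bind to a free port picked by the OS and start listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
host, port = first.getsockname()

# Second "server": binding to the same address fails while the first is open.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind((host, port))
    bind_errno = None
except socket.error as exc:
    bind_errno = exc.errno
finally:
    second.close()
    first.close()

print(bind_errno == errno.EADDRINUSE)
```

Following the README, the worker in the second terminal would be started as python mincemeat.py -p changeme localhost.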

duplication of work?

It seems that shepherd is handing over the same piece of work to multiple workers?

Minimally modified example.py:

http://dpaste.com/hold/1501864/

Then I ran shepherd this way:

% python shepherd.py -n 3 -8
map K 1 map V Humpty Dumpty had a great fall PID 25153
map K 0 map V Humpty Dumpty sat on a wall PID 25152
map K 2 map V All the King's horses and all the King's men PID 25153
map K 3 map V Couldn't put Humpty together again PID 25152
map K 3 map V Couldn't put Humpty together again PID 25153
reduce k a vs: [1, 1] PID 25152
reduce k on vs: [1] PID 25153
reduce k great vs: [1] PID 25152
reduce k Humpty vs: [1, 1, 1] PID 25153
reduce k all vs: [1] PID 25152
reduce k wall vs: [1] PID 25153
reduce k Dumpty vs: [1, 1] PID 25152
reduce k King's vs: [1, 1] PID 25153
reduce k men vs: [1] PID 25152
reduce k had vs: [1] PID 25153
reduce k All vs: [1] PID 25152
reduce k together vs: [1] PID 25153
reduce k and vs: [1] PID 25152
reduce k horses vs: [1] PID 25153
reduce k Couldn't vs: [1] PID 25152
reduce k fall vs: [1] PID 25153
reduce k put vs: [1] PID 25152
reduce k again vs: [1] PID 25153
reduce k the vs: [1, 1] PID 25152
reduce k sat vs: [1] PID 25153
reduce k sat vs: [1] PID 25152

As you can see, the same piece of work was handed over to more than one worker. Is this working as designed?
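One possible explanation, offered as an assumption rather than a statement about mincemeat.py's internals: a scheduler that tolerates workers leaving mid-task may hand an outstanding task to a second idle worker and keep whichever result arrives first. In the log above, the duplicated items are the last map task and the last reduce task, which fits that pattern. TinyTaskManager below is a hypothetical sketch of such a policy.

```python
import random

class TinyTaskManager(object):
    """Hypothetical scheduler: hands out each task once, then re-issues
    tasks that are still outstanding to any worker that asks for more."""

    def __init__(self, datasource):
        self.pending = list(datasource.items())  # not yet handed out
        self.working = {}                         # handed out, no result yet
        self.results = {}                         # first result per key wins

    def next_task(self):
        if self.pending:
            key, value = self.pending.pop(0)
            self.working[key] = value
            return key, value
        if self.working:
            # Nothing new to hand out: duplicate an outstanding task so a
            # slow or vanished worker cannot stall the job.
            key = random.choice(list(self.working))
            return key, self.working[key]
        return None  # everything has a result

    def task_done(self, key, result):
        if key not in self.working:
            return False  # duplicate result, ignored
        del self.working[key]
        self.results[key] = result
        return True

manager = TinyTaskManager({0: "a b", 1: "c"})
first = manager.next_task()
second = manager.next_task()
third = manager.next_task()  # a duplicate of a task still in progress
accepted = [manager.task_done(k, len(v.split()))
            for k, v in (first, second, third)]
print(accepted)
```

Under this policy duplicated work costs some extra computation but does not change the final result, because only the first result for each key is recorded.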

I get "Connection refused" when running the worker.

When I try to run the worker, I get this.

./mincemeat.py -p aoeuaoeu localhost
error: uncaptured python exception, closing channel <__main__.Client at 0xb6fceeec> (<class 'socket.error'>:[Errno 111] Connection refused [/usr/lib/python2.7/asyncore.py|read|83] [/usr/lib/python2.7/asyncore.py|handle_read_event|441] [/usr/lib/python2.7/asyncore.py|handle_connect_event|449])

This happens on two computers. Both are bizarrely virtualized; one is a Debian chroot jail inside Ubuntu on Xen. The other is CentOS on KVM.

This also happens on multiple Python versions. I tried 2.7.3 on both computers and 2.6.6 on CentOS.
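Errno 111 generally means nothing is listening at the address the worker tried to reach, so a first check is whether the server script is running and reachable. A plain TCP connection attempt is enough for that. can_connect below is a hypothetical helper, and the demo uses an OS-chosen port rather than mincemeat.py's own.

```python
import socket

def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True

# Demo: a listening socket is reachable; after it closes, it is not.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
host, port = listener.getsockname()

reachable_while_open = can_connect(host, port)
listener.close()
reachable_after_close = can_connect(host, port)

print(reachable_while_open, reachable_after_close)
```

For the setup in this report, the equivalent check would be can_connect("localhost", mincemeat.DEFAULT_PORT), run while the server script is supposed to be up.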

Trying to cascade MapReduce

Hello,

I am trying to cascade MapReduce operations using mincemeatpy. The solution I tried is to create a Server within a mapfn. The Server uses a different port than the global Server object. It doesn't work because asyncore uses a global socket_map object.

So I am trying to pass my own map into asyncore.loop.

However, I am having difficulties and it seems to be spiraling out of control.

So my first question: Using mincemeatpy, how could I cascade MapReduce operations? Can this be done "out-of-the-box" without any modifications to mincemeatpy?

The goal is to use MapReduce to first send out filenames to clients. Each of the first layer clients should open its file, then send chunks of that file to other second layer clients. The second layer clients should process the lines they receive and send the result back to the first layer clients (note the first layer clients will be acting as MapReduce second layer servers for this). The first layer clients should collect these results, then reduce and send the final result for each file back to the first layer server.

The second question is, if modifications are required, do you have any insight into how/what to modify?

Like I said before, I believe the answer is providing maps to the second layer clients, so they are each using their own socket map and not the global socket map.

If I did not explain myself clearly, or you would like an example, I can provide more details and a simple example.

TIA
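One approach that avoids nested servers altogether, offered as a suggestion and not something the project documents: run the stages one after another from a single driver, passing the first job's results in as the second job's datasource. Since the results come back as a dictionary, they already have the shape a datasource needs. In the sketch below, run_job is a hypothetical in-process stand-in for configuring a Server and calling run_server. The two stages count words and then count how many distinct words share each frequency.

```python
def run_job(datasource, mapfn, reducefn):
    """Hypothetical stand-in for one complete MapReduce job."""
    grouped = {}
    for key in datasource:
        for out_key, out_value in mapfn(key, datasource[key]):
            grouped.setdefault(out_key, []).append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

# Stage 1: word -> number of occurrences.
def count_map(k, line):
    for word in line.split():
        yield word, 1

# Stage 2: number of occurrences -> how many distinct words have it.
def histogram_map(word, count):
    yield count, 1

def total(k, vs):
    return sum(vs)

lines = {0: "Humpty Dumpty sat on a wall",
         1: "Humpty Dumpty had a great fall"}

word_counts = run_job(lines, count_map, total)
histogram = run_job(word_counts, histogram_map, total)  # chained stage
print(histogram)
```

This flattens the two layers into sequential jobs. It does not give each file its own second-layer cluster, so it only fits when the per-file work can be expressed as a later stage over the earlier stage's output.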

A new implementation of mincemeatpy using node.js

Hi:
I'm interested in your mincemeatpy, and I think it's useful for sudden, heavy jobs. So I implemented it using node.js. It's easy to use and takes full advantage of node.js's networking ability. I also made some improvements.
Can you add my repo to your README? Thanks!
mincemeat-node
