Giter VIP home page Giter VIP logo

spread's Introduction

Spread

Spread is a missing link in the unix toolchain to do simple distributed data mining. Spread partitions data according to key across a fleet. Spread shines when used in conjunction with fleet management/coordination/persistence tools like:

Quickstart

You'll need CMake and Boost to build spread:

sudo apt-get install -y cmake libboost-all-dev

Then you can fetch and install spread:

git clone [email protected]:wavii/spread.git
cd spread
cmake . && make && sudo make install

Examples

Given a hostfile on each machine with the format host1\nhost2\nhost3\n..., here is a distributed wordcount using spread:

tr ' ' '\n' < input | spread -f hosts | sort | uniq -c > output

Spread can also be used as a library for writing your own map programs in C++:

void map(tcp::endpoint& local_endpoint, vector<tcp::endpoint>& remote_endpoints)
{
   asio::io_service io;
   spreader<> s(io, local_endpoint, remote_endpoints, cout);
   string line;
   while (getline(cin, line))
      s << line.substr(line.find("USER: ")); // map the user in a log line
}

Spread works well with cloud concepts. Here's an example of a bash script you could run on each host to operate on S3 data in parallel:

#!/bin/bash

# grab fleet index and size, perhaps this was provided on each machine by Chef
fleetsize=$(wc -l .fleet-hosts | awk '{print $1}')
fleetsize=$(echo "obase=16; $fleetsize" | bc)
id=$(cat .fleet-id)

for uri in `s3cmd ls s3://data-warehouse/ | awk '{print $4}'`; do
   md5id=$(echo $uri | md5sum | awk '{print toupper($1)}')
   remainder=$(echo "ibase=16; $md5id % $fleetsize" | bc)
   if [ $remainder -eq $id ]; then
      s3cmd get --no-progress --force $uri /mnt/input/
   fi
done

# now spread the input we've collected on each machine
cat /mnt/input/* | tr ' ' '\n' | spread -f hosts | sort | uniq -c > output

License

Spread via the MIT License. Have fun!

spread's People

Contributors

erikfrey avatar nova77 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.