dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.

Home Page: http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

Elixir 11.33% Ruby 5.98% Shell 6.56% Python 4.25% Scala 11.67% Go 6.92% Rust 5.51% Nim 2.33% JavaScript 4.17% PHP 1.55% Erlang 11.76% C# 9.84% PowerShell 0.42% Perl 4.76% Mathematica 0.68% OCaml 2.67% Makefile 0.58% Haskell 9.02%

etl-language-comparison's Introduction

Update

Please see the following blog posts for the latest updates:

  1. ETL Language Showdown - Sept. 2014
  2. ETL Language Showdown Part 2 - Now with Python - May 2015
  3. ETL Language Showdown Part 3 - 10 Languages and growing - Nov. 2015

Wins

Analyses and discussions done here have led to the following language pull requests:

  1. Add BIF binary:split/2,3 to Erlang
  2. Improve case insensitive regex to Golang

ETL Language Showdown

This repo implements the same MapReduce ETL (Extract-Transform-Load) task in multiple languages in an effort to compare language productivity, terseness and readability. The performance comparisons should not be taken seriously. If anything, they say more about my skill in each language than about the languages' performance capabilities.

The Task

Count the number of tweets that mention 'knicks' in their message and bucket based on the neighborhood of origin. The ~1GB dataset for this task, sampled below, contains a tweet's message and its NYC neighborhood.

To get the dataset, run fetch_tweets in the repo directory or download it here.

91	west-brighton	Brooklyn	Uhhh
121	turtle-bay-east-midtown	Manhattan	Say anything
175	morningside-heights	Manhattan	It feels half-cheating half-fulfilling to cite myself.
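A minimal single-threaded sketch of the task in Python may help make the data layout concrete. It assumes each row has four tab-separated fields (a leading number, neighborhood, borough, message), which is inferred from the sample above rather than documented. The function name `count_mentions` and the sample messages containing 'knicks' are hypothetical and only there for illustration.

```python
import re
from collections import Counter

# Case-insensitive match for the search term from the task description.
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def count_mentions(lines):
    """Count tweets mentioning 'knicks', bucketed by neighborhood.

    Assumption: each line is "<number>\t<neighborhood>\t<borough>\t<message>".
    """
    counts = Counter()
    for line in lines:
        # Split into at most 4 fields so tabs inside the message survive.
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # skip rows that do not match the assumed layout
        neighborhood, message = fields[1], fields[3]
        if KNICKS.search(message):
            counts[neighborhood] += 1
    return counts


# Hypothetical rows in the same shape as the sample above.
sample = [
    "91\twest-brighton\tBrooklyn\tUhhh\n",
    "121\tturtle-bay-east-midtown\tManhattan\tGo Knicks!\n",
    "175\tmorningside-heights\tManhattan\tknicks game tonight\n",
    "176\tmorningside-heights\tManhattan\tKNICKS win\n",
]

result = count_mentions(sample)
for neighborhood, n in result.most_common():
    print(f"{neighborhood}\t{n}")
```

Because `count_mentions` accepts any iterable of lines, an open file object can be passed in directly, so the input does not need to be loaded into memory at once.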

Initial Assumption

  • These tasks are not run on Hadoop but do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on disk IO.
  • UPDATE: Boy, was the IO-bound assumption wrong.

The Languages

Below you will find the languages run. Note that frameworks also play a big role: for example, the Scala implementation compares parallel collections to futures and the Akka framework. Click through on each language to read more.

Language    Owner
Ruby
Golang      matttproud
Scala
Nim
Node
PHP
Erlang
Elixir      josevalim
Rust
Python
C#          mganss
shell       mganss
perl        sitaramc

etl-language-comparison's People

Contributors

314eter, aheld, dimroc, egonelbre, gasche, gbulmer, ksmth, matttproud, maxgrenderjones, potatosalad, pqr, sitaramc, tippenein, tkawachi

etl-language-comparison's Issues

Create an (automatic?) process for updating the benchmark results

It would be nice to have the "Results" table in the README kept up to date. For instance, a Rust implementation was recently added (#14), and there's a pull request to improve said implementation (#5).

Obviously, this could be a CI task, but maybe the simpler thing to do is update the README in any pull request which changes/adds implementation(s)? For consistency across machines, probably a complete rewrite of the benchmark results table would be required, perhaps noting the machine specs as well.

This is just a thought and is open for discussion.

Include memory consumption in the benchmarks

Hi @dimroc, nice repo and blog posts!

It would be nice to also see a memory consumption comparison. For example, the parallel Ruby implementation got close in speed to the Go sub-string implementation, but Go's goroutines are supposed to have much lower memory consumption than spawning multiple copies of the Ruby program, so it would be good to see this shown here.
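One possible way to collect such a number, sketched here as an assumption rather than anything the repo does, is to launch each implementation as a child process and read its peak resident set size from the operating system. The helper name `peak_child_rss` is hypothetical, the `resource` module is Unix-only, and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return the peak RSS reported for children.

    Note: RUSAGE_CHILDREN reports the largest value over all child
    processes waited for so far, so run one benchmark per harness process
    to keep readings from mixing.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Stand-in workload: a child interpreter that allocates about 50 MB.
# A real run would pass the command line of one implementation instead.
peak = peak_child_rss([sys.executable, "-c", "x = bytearray(50_000_000)"])
print(f"peak RSS of child: {peak} (KB on Linux, bytes on macOS)")
```

For implementations that spawn several worker processes, this reading reflects only the largest single process, not the sum, so it would understate the footprint of a multi-process design such as the parallel Ruby one.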

Add README.md to Rust implementation

Hi @potatosalad,
Would it be possible to add a README.md to the rust/ folder describing the implementations? A lot of it has already been covered in the PRs, so a cleaned-up copy and paste would be great.

Alternative single-threaded python implementation (using library)

For fun I implemented the task using streamutils, a Python library I've been working on that makes text processing very quick to write (but it is single-threaded and uses generators, so it is not particularly fast to run). It makes for pretty short code though - read it as if it were some bash commands chained together:

import streamutils as su
import re

bag=su.find('tmp/tweets/tweets_*') | su.read() | su.search(r'.*?\t(.*?)\t.*?\t.*knicks.*', group=1, flags=re.IGNORECASE, match=True) | su.bag()
bag.most_common() | su.smap(lambda x: '%s\t%s\n' % (x[0], x[1])) | su.write('tmp/python_streamoutput')

Alternative version, parallelized using multiprocessing:

from multiprocessing import Pool, cpu_count
from collections import Counter
import streamutils as su
import re

def process(f):
    return su.read(fname=f) | su.search(r'.*?\t(.*?)\t.*?\t.*knicks.*', group=1, flags=re.IGNORECASE, match=True) | su.bag()

if __name__=='__main__':
    bag=Pool(cpu_count()).map(process, su.find('tmp/tweets/tweets_*')) | su.sreduce(lambda x, y: x+y, Counter())
    bag.most_common() | su.smap(lambda x: '%s\t%s\n' % (x[0], x[1])) | su.write('tmp/python_parallelstreamoutput')

Not sure if it's worthy of a pull request though - more just to show that python can be terse too :)

Add README.md to Elixir implementation

Hi @josevalim,
Would it be possible to add a README.md to the elixir/ folder discussing the different implementations?

I'm trying to go through each language's implementation and add a README, and I think you're the best guy for the job 👍

Standardize algorithms

I see that contributions have taken different approaches to solving the same problem, so in the end the benchmark is not comparing the languages themselves.

My suggestion would be to set a guideline for contributing which explains the standard approach, like:

  • It should read from and write to files.
  • It should take the number of workers/threads to use as a parameter.
  • It can buffer writes, but the buffer size must stay under a fixed limit.
  • It should use regular expressions, or include both versions: with and without regexps.

Maybe also allow submitting a non-standard approach that takes advantage of specific language features but keep that one marked as the special one.

So in the end there would be two sets of solutions: (1) the standard one that follows the rules and (2) the optimized or non-standard one.
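To make the proposed guideline concrete, here is a rough sketch of what a shared command-line contract could look like in Python. Every flag name (`--workers`, `--max-buffer-bytes`, `--matcher`) and every default is a hypothetical choice made for illustration; the issue itself does not specify them.

```python
import argparse
import os


def build_parser():
    """Build a hypothetical common CLI that each implementation would mirror."""
    parser = argparse.ArgumentParser(
        description="Hypothetical standard entry point for an implementation"
    )
    # Rule: it should use files.
    parser.add_argument("input_glob", help="glob of input files")
    # Rule: the number of workers/threads is a parameter.
    parser.add_argument("--workers", type=int, default=os.cpu_count() or 1)
    # Rule: write buffering is allowed but capped (the cap value is assumed).
    parser.add_argument("--max-buffer-bytes", type=int, default=64 * 1024)
    # Rule: offer both the regexp and the non-regexp version.
    parser.add_argument("--matcher", choices=["regex", "substring"], default="regex")
    return parser


# Example invocation with explicit arguments instead of sys.argv.
args = build_parser().parse_args(
    ["tmp/tweets/tweets_*", "--workers", "4", "--matcher", "substring"]
)
print(args.workers, args.matcher, args.max_buffer_bytes)
```

With a contract like this, a benchmark runner could invoke every implementation with the same worker count and matcher, which is what would let the standard set of results be compared directly.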
