dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.

Home Page: http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

Elixir 11.33% Ruby 5.98% Shell 6.56% Python 4.25% Scala 11.67% Go 6.92% Rust 5.51% Nim 2.33% JavaScript 4.17% PHP 1.55% Erlang 11.76% C# 9.84% PowerShell 0.42% Perl 4.76% Mathematica 0.68% OCaml 2.67% Makefile 0.58% Haskell 9.02%

etl-language-comparison's Introduction

Update

Please see the following blog posts for the latest updates:

ETL Language Showdown - Sept. 2014
ETL Language Showdown Part 2 - Now with Python - May. 2015
ETL Language Showdown Part 3 - 10 Languages and growing - Nov. 2015

Wins

Analyses and discussions done here have led to the following language pull requests:

ETL Language Showdown

This repo implements the same MapReduce ETL (Extract-Transform-Load) task in multiple languages in an effort to compare language productivity, terseness, and readability. The performance comparisons should not be taken seriously. If anything, they say more about my skill in each language than about the language's performance capabilities.

The Task

Count the number of tweets that mention 'knicks' in their message and bucket based on the neighborhood of origin. The ~1GB dataset for this task, sampled below, contains a tweet's message and its NYC neighborhood.

Simply run fetch_tweets in the repo directory, or download the dataset here.

91	west-brighton	Brooklyn	Uhhh
121	turtle-bay-east-midtown	Manhattan	Say anything
175	morningside-heights	Manhattan	It feels half-cheating half-fulfilling to cite myself.
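
As a minimal single-threaded sketch of the task, assuming each row is tab-separated in the order shown in the sample (id, neighborhood, borough, message): the function name `count_mentions` and the third sample row are hypothetical, added only for illustration, and this is not taken from any of the repo's implementations.

```python
from collections import Counter


def count_mentions(lines, word="knicks"):
    """Count tweets whose message mentions `word`, bucketed by neighborhood.

    Assumes tab-separated rows: id, neighborhood, borough, message.
    """
    needle = word.lower()
    counts = Counter()
    for line in lines:
        # Split at most three times so tabs inside the message are preserved.
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # skip malformed rows
        _id, neighborhood, _borough, message = fields
        if needle in message.lower():
            counts[neighborhood] += 1
    return counts


sample = [
    "91\twest-brighton\tBrooklyn\tUhhh",
    "121\tturtle-bay-east-midtown\tManhattan\tSay anything",
    "200\tchelsea\tManhattan\tGo Knicks!",  # hypothetical row for illustration
]
print(count_mentions(sample))
```

Case-insensitive substring matching is used here for simplicity; a regular expression would work equally well.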

Initial Assumption

These tasks are not run on Hadoop but do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on disk IO.
UPDATE: Boy, was the IO-bound assumption wrong.
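
To illustrate what "run concurrently" can look like, here is a hedged Python sketch of a map step per file and a reduce step that merges the partial counts. The names `count_file` and `count_files` are hypothetical, and the tab-separated layout (id, neighborhood, borough, message) is assumed from the sample rows; the repo's actual implementations may be structured differently.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_file(path, word="knicks"):
    """Map step: count matching tweets per neighborhood in one file."""
    needle = word.lower()
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t", 3)
            if len(fields) == 4 and needle in fields[3].lower():
                counts[fields[1]] += 1
    return counts


def count_files(paths, workers=None, executor_cls=ProcessPoolExecutor):
    """Reduce step: merge the per-file counters into a single Counter.

    `executor_cls` is a parameter so a thread pool can be swapped in
    when comparing IO-bound and CPU-bound behavior.
    """
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        for partial in pool.map(count_file, paths):
            total.update(partial)
    return total
```

With the default process pool, call `count_files` from under an `if __name__ == "__main__":` guard, for example on `glob.glob("tmp/tweets/tweets_*")`. If the work really were IO-bound, threads and processes would perform about the same; a clear win for processes suggests the matching itself is the bottleneck.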

The Languages

Below you will find the languages that were run. Note that frameworks also play a big role. For example, the Scala implementation compares parallel collections to futures and the Akka framework. Click through on each language to read more.

Language	Owner
Ruby
Golang	matttproud
Scala
Nim
Node
PHP
Erlang
Elixir	josevalim
Rust
Python
C#	mganss
shell	mganss
perl	sitaramc


etl-language-comparison's Issues

Create an (automatic?) process for updating the benchmark results

It would be nice to have the "Results" table in the README kept up to date. For instance, a Rust implementation was recently added (#14), and there's a pull request to improve said implementation (#5).

Obviously, this could be a CI task, but maybe the simpler thing to do is update the README in any pull request which changes/adds implementation(s)? For consistency across machines, probably a complete rewrite of the benchmark results table would be required, perhaps noting the machine specs as well.

This is just a thought and is open for discussion.
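
One possible shape for such a process, as a rough sketch only: time each implementation's command and regenerate the whole table in one pass. The command list, the function names, and the Markdown table layout are all assumptions for illustration, not the repo's actual tooling.

```python
import subprocess
import sys
import time


def time_command(command, runs=3):
    """Return the best wall-clock time in seconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def render_results(timings):
    """Render {label: seconds} as a Markdown table, fastest first."""
    rows = ["| Implementation | Time (s) |", "| --- | --- |"]
    for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
        rows.append(f"| {label} | {seconds:.2f} |")
    return "\n".join(rows)


# Hypothetical placeholder command; substitute each implementation's
# real entry point here.
commands = {
    "python-noop": [sys.executable, "-c", "pass"],
}
timings = {label: time_command(cmd, runs=1) for label, cmd in commands.items()}
print(render_results(timings))
```

Because every row is re-measured on the same machine in one run, the table stays internally consistent; the machine specs could be printed alongside it.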

Include memory consumption in the benchmarks

Hi @dimroc, nice repo and blog posts!

It would be nice to also see a memory consumption comparison. For example the Parallel Ruby implementation got close in speed to the Go sub-string implementation, but Go's goroutines are supposed to have a much lower memory consumption than spawning multiple copies of the Ruby program, so it would be good to see this shown here.
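
A rough sketch of how peak memory could be captured on Unix-like systems, using Python's standard `resource` module. The helper name `peak_child_rss` and the placeholder command are hypothetical. Two caveats: `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, and it is a high-water mark for the largest single child process, not a sum across processes, so it would understate an implementation that spawns several copies of itself.

```python
import resource
import subprocess
import sys


def peak_child_rss(command):
    """Run `command` and return the peak resident set size of child processes.

    The value is a high-water mark over every child this process has
    waited for so far, so measure one command per Python process.
    """
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Hypothetical placeholder command; substitute a real implementation.
peak = peak_child_rss([sys.executable, "-c", "pass"])
print(f"peak RSS: {peak} (KB on Linux, bytes on macOS)")
```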

Add README.md to Rust implementation

Hi @potatosalad,
Would it be possible to add a README.md to the rust/ folder talking about the implementations? A lot of it has already been covered in the PRs. I figured a cleaned-up copy and paste of those discussions would be great.

Review Elixir implementation

Hi @josevalim,
If you have a minute, could you review my elixir code here?
https://github.com/dimroc/etl-language-comparison/tree/master/elixir/lib

It'll be used in my second write up of http://www.dimroc.com/2014/09/29/etl-language-showdown/

Add README.md to PHP implementation

@pqr

Alternative single-threaded python implementation (using library)

For fun I implemented the task using streamutils, a Python library I've been working on that makes text processing very quick to write (but it is single-threaded and uses generators, so it is not particularly fast to run). It makes for pretty short code though. Read it as if it were some bash commands chained together:

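import streamutils as su
import re

# Read every tweet file, keep the neighborhood (capture group 1) of each line
# whose message mentions "knicks", and tally the neighborhoods.
bag=su.find('tmp/tweets/tweets_*') | su.read() | su.search(r'.*?\t(.*?)\t.*?\t.*knicks.*', group=1, flags=re.IGNORECASE, match=True) | su.bag()
# Write one "neighborhood<TAB>count" line per bucket, most common first.
bag.most_common() | su.smap(lambda x: '%s\t%s\n' % (x[0], x[1])) | su.write('tmp/python_streamoutput')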

Alternative version, parallelized using multiprocessing:

from multiprocessing import Pool, cpu_count
from collections import Counter
import streamutils as su
import re

def process(f):
    # Map step: tally the neighborhoods of matching tweets in one file.
    return su.read(fname=f) | su.search(r'.*?\t(.*?)\t.*?\t.*knicks.*', group=1, flags=re.IGNORECASE, match=True) | su.bag()

if __name__=='__main__':
    # Reduce step: sum the per-file tallies returned by the worker processes.
    bag=Pool(cpu_count()).map(process, su.find('tmp/tweets/tweets_*')) | su.sreduce(lambda x, y: x+y, Counter())
    bag.most_common() | su.smap(lambda x: '%s\t%s\n' % (x[0], x[1])) | su.write('tmp/python_parallelstreamoutput')

Not sure if it's worthy of a pull request though - more just to show that python can be terse too :)

Add README.md to Elixir implementation

Hi @josevalim,
Would it be possible to add a README.md to the elixir/ folder discussing the different implementations?

I'm trying to go through each language's implementation and add a README and think you're the best guy for the job 👍

Standardize algorithms

I see that contributions have taken different approaches to solving the same problem, so in the end the benchmark is not comparing the languages themselves.

My suggestion would be to set a guideline for contributing which explains the standard approach, like:

It should use files.
It should take the number of workers/threads to use as a parameter.
It can buffer writes, but the buffer size should have a fixed limit.
It should use regular expressions, or include both versions: with and without regexps.

Maybe also allow submitting a non-standard approach that takes advantage of specific language features but keep that one marked as the special one.

So in the end there would be two sets of solutions: (1) the standard one that follows the rules and (2) the optimized or non-standard one.
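
As a sketch of what such a guideline could translate to in practice, the following Python outlines a common command-line contract that each "standard" implementation might honor. The flag names, defaults, and helper names are hypothetical, chosen only to mirror the rules proposed above.

```python
import argparse
import re


def build_parser():
    """Command-line contract a standard implementation might share."""
    parser = argparse.ArgumentParser(description="Count 'knicks' tweets per neighborhood.")
    parser.add_argument("--input-glob", default="tmp/tweets/tweets_*",
                        help="input files to read")
    parser.add_argument("--workers", type=int, default=1,
                        help="number of workers/threads to use")
    parser.add_argument("--matcher", choices=["regex", "substring"], default="regex",
                        help="matching strategy, so both variants can be compared")
    parser.add_argument("--buffer-size", type=int, default=64 * 1024,
                        help="upper bound on the write buffer, in bytes")
    return parser


def make_matcher(kind, word="knicks"):
    """Return a predicate that tests whether a message mentions `word`."""
    if kind == "regex":
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        return lambda message: pattern.search(message) is not None
    needle = word.lower()
    return lambda message: needle in message.lower()


args = build_parser().parse_args(["--workers", "4", "--matcher", "substring"])
matches = make_matcher(args.matcher)
print(args.workers, matches("Go Knicks!"))
```

Keeping the matcher selectable means the regexp and non-regexp variants share everything else, so any timing difference can be attributed to the matching strategy alone.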