Giter VIP home page Giter VIP logo

freq's Introduction

Simple frequency dict benchmarks

Or, "a million ways to skin a corpus".

Here there be descriptions and test results. Eventually.

In the meantime, here's a cookie, and a test file: https://yadi.sk/d/xTFWy38y3GBAgS

pg.txt, 336 183 276 bytes, MD5 9ca4282866bb919e03cca2ed2b6ce429

What

Briefly, the task is as follows. Write a program that

  • takes exactly 2 command line arguments, myprog in.txt out.txt
  • counts all unique input words, and print them in freq desc, word asc order
  • words are [a-zA-Z]{1,256}, ie. latin letters only, and should be lowercased
  • input can contain arbitrary bytes, assume nothing about encoding, zeroes, etc
  • input might not fit in RAM, output however must fit in RAM
  • single-threaded only, for simplicity etc

And here's more crazy.

  • assume (for now) that 32 bits is enough for frequencies, ie. that all freqs are under 4e9
  • assume that any 64 bit hash function will easily collide, and that any 128 bit hash function might actually collide, either "by random luck" or by malicious attack

Last but not least, have a tiny example.

in.txt:

The time has come, the Walrus said,
to talk of many things

out.txt:

2 the
1 come
1 has
1 many
1 of
1 said
1 talk
1 things
1 time
1 to
1 walrus

Why

This is a community project with the only ultimate goal: to explore.

  • to explore different programming languages and syntaxes
  • to explore "typical" and idiomatic solutions
  • to explore elegant and concise solutions
  • to explore algorithms and programming techniques
  • to explore performance and optimization tricks

That, and to have some fun fiddling with a simple quick task in a myriad different ways.

Benchmark results

i5-4690 @ 3.7 GHz, Windows 7 x64

min..max MB/s program compiler or runtime
1.310..1.325s 256.6 freq03.cpp vs2019 16.1.0
1.747..1.778s 192.5 freq02.cpp vs2019 16.1.0
2.462..2.491s 136.6 freq01.cpp vs2019 16.1.0
2.655..2.675s 126.6 freq01.rs rustc 1.43.1
4.397..4.512s 76.5 freq01.go go 1.14.1
14.089..14.901s 23.9 freq01.scala scala 2.13.1, jdk 1.8.0_144
30.973..31.318s 10.9 freq01.php php 7.0.20-zts
99.880..102.353s 3.4 freq01.py python 3.7.0
1.407..1.416s 239.0 hack01.cpp vs2019 16.1.0

Xeon Gold 6240 @ 2.6 GHz, Debian 9 x64

min..max MB/s program compiler or runtime
1.260..1.278s 266.8 freq03.cpp gcc 6.3.0
1.492..1.521s 225.4 freq02.cpp gcc 6.3.0
2.197..2.245s 153.0 freq01.rs rustc 1.43.1
2.480..2.737s 135.5 freq01.cpp gcc 6.3.0
3.880..4.187s 86.6 freq01.go go 1.14.3
9.037..9.173s 37.2 freq01.php php 7.4.6-nts
20.152..22.151s 16.7 freq01.scala scala 2.13.1, openjdk 1.8.0_242
83.251..85.987s 4.0 freq01.py python 3.5.3
1.349..1.442s 249.2 hack01.cpp gcc 6.3.0

C++

g++ -O3 -o freq01 freq01.cpp
./freq01 pg.txt out.txt

Go

go build freq01.go
./freq01 pg.txt out.txt

Rust

cd build/rust
cargo build --release

cd ../..
./junk/freq01rust pg.txt out.txt

Scala

cd build/scala
sbt assembly
cd ../..
java -jar ./junk/freq01scala.jar pg.txt out.txt

freq's People

Contributors

aodinaev avatar borisbat avatar kevgs avatar rabodaber avatar seigert avatar shodanium avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.