
Simple/basic/limited/incomplete benchmark for dplyr and data.table

For parameters n = 10M, 100M and m = 100, 10K, 1M, create data.frames

d <- data.frame(x = sample(m, n, replace=TRUE), y = runif(n))
dm <- data.frame(x = sample(m))

and corresponding data.tables with and without key on x (d's size in RAM is around 100MB and 1GB, respectively).
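The keyed and unkeyed copies can be set up along these lines (a minimal sketch of the setup described above; the exact code, including timings, is in bm.Rmd, and the variable names for the keyed copies here are only illustrative):

```r
library(data.table)

n <- 1e7    # also 1e8
m <- 100    # also 1e4 and 1e6

d  <- data.frame(x = sample(m, n, replace = TRUE), y = runif(n))
dm <- data.frame(x = sample(m))

# data.table versions without a key
dt  <- as.data.table(d)
dtm <- as.data.table(dm)

# keyed versions: setkey() sorts the table in place by x
dt_k  <- copy(dt);  setkey(dt_k, x)
dtm_k <- copy(dtm); setkey(dtm_k, x)
```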

The basic tabular operations (filter, aggregate, join etc.) are applied using base, dplyr (with data.frame and data.table backends, with and without key for data.table) and standard data.table (with and without key).

This is just a simple/basic/limited/incomplete benchmark; one could do more with various data types (e.g. character), several grouping variables (x1, x2, ...), more values for the size parameters (n, m), different distributions of values in the data.frames etc. (or with real-world datasets).

| Operation   | base                     | dplyr                                          | data.table           |
|-------------|--------------------------|------------------------------------------------|----------------------|
| Filter      | `d[d$x>=10 & d$x<20,]`   | `d %>% filter(x>=10, x<20)`                    | `dt[x>=10 & x<20]`   |
| Sort        | `d[order(d$x),]`         | `d %>% arrange(x)`                             | `dt[order(x)]`       |
| New column  | `d$y2 <- 2*d$y`          | `d %>% mutate(y2 = 2*y)`                       | `dt[,y2 := 2*y]`     |
| Aggregation | `tapply(d$y, d$x, mean)` | `d %>% group_by(x) %>% summarize(ym = mean(y))`| `dt[, mean(y), by=x]`|
| Join        | `merge(d, dm, by="x")`   | `d %>% inner_join(dm, by="x")`                 | `dt[dtm, nomatch=0]` |
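The dplyr-dt and dplyr-dt-k variants in the results below run the same dplyr verbs against a data.table source. The benchmark itself used dplyr 0.3, which still shipped its data.table support built in; with current package versions that backend lives in the dtplyr package, so a rough present-day sketch (not the original benchmark code) looks like this:

```r
library(dplyr)
library(dtplyr)
library(data.table)

dt <- as.data.table(d)

# dplyr verbs are translated into data.table calls behind the scenes
dt %>%
  lazy_dt() %>%
  group_by(x) %>%
  summarize(ym = mean(y)) %>%
  as.data.table()
```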

Results

Full code is in bm.Rmd and results for each n,m are in the bm-nxx-mxx.md files in the repo. The latest CRAN versions of R, dplyr and data.table were used (R 3.1.1, dplyr 0.3.0.2 and data.table 1.9.4). A summary of the results (relative running times, lower is better) is below:

|             | base  | dplyr-df | dplyr-dt | dplyr-dt-k | dt    | dt-k |
|-------------|-------|----------|----------|------------|-------|------|
| Filter      | 2     | 1        | 1        | 1          | 1     | 1    |
| Sort        | 30-60 | 20-30    | 1.5-3    | 1          | 1.5-3 | 1    |
| New column  | 1     | 1        | 6        | 6          | 4     | 4    |
| Aggregation | 8-100 | 4-30     | 4-6      | 1.5        | 1.5-5 | 1    |
| Join        | >100  | 4-15     | 4-6      | 1.5-2.5    | -     | 1    |

(the larger numbers are usually for larger m, i.e. lots of small groups)

Discussion:
  • Having a key (which for data.table means having the data pre-sorted in place) obviously helps with sorting, aggregation and joins (though depending on the use case, the time needed to generate the key should be added to the timings; see the timing sketch after this list)

  • dplyr with a data.table backend/source is almost as fast as plain data.table (because in this case dplyr acts as a wrapper and calls data.table functions behind the scenes) - so you can kind of have both: the dplyr API (my personal preference) and speed

  • dplyr with a data.frame source is slower than data.table for sort, aggregation and joins. Some of this apparently has to do with radix sort and binary-search joins (data.table) being faster than hash-table-based joins (dplyr), as described here, but some of it is likely to improve, as Hadley said here.

  • Defining a new column in data.table (or in dplyr with the data.table backend) is slower. I pointed this out to the data.table developers Matt and Arun, and this can be fixed. The extra slowdown in creating a new column with dplyr with a data.table source (vs plain data.table) can also be fixed.
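To make the first point concrete, here is a minimal timing sketch (an illustration under the same setup as above, not the benchmark code itself) showing that the keyed aggregation is only cheaper once the one-time cost of setkey() has been paid:

```r
library(data.table)

n <- 1e7; m <- 1e4
dt <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))

# aggregation without a key
system.time(dt[, mean(y), by = x])

# one-time cost of building the key (sorts dt in place by x) ...
system.time(setkey(dt, x))

# ... after which the keyed aggregation is faster
system.time(dt[, mean(y), by = x])
```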

More info:

I gave a short 15-min talk at the LA R meetup about dplyr, and I talked about these results as well; slides here.

There are several other benchmarks, for example Matt's benchmark of group-by, or Brodie Gaslam's benchmark of group-by and mutate. My goal was to look at a wider range of operations (but keep the work minimal, so I had to concentrate on just a few cases) - and I also wanted to understand the reasons behind the performance differences, and in this respect I'd like to thank the developers for the useful pointers.

Python's pandas:

Besides R, Python is almost as widely used for data analysis nowadays (see how the two dominate the DataScience.LA data science toolbox survey).

It looks like Python's pandas (0.15.1) is slower than data.table for both aggregation and joins (contrary to measurements/claims from almost 3 years ago). For example, for n = 10M and m = 1M, the runtimes (in seconds, lower is better) are:

|                  | pandas | data.table |
|------------------|--------|------------|
| Aggregate        | 1.5    | 1          |
| Aggregate (keys) | 0.4    | 0.2        |
| Join             | 5.9    | -          |
| Join (keys)      | 2.1    | 0.5        |
| Creating keys    | 3.7    | 0.7        |
