cookbook-rpolars's Issues

pictures or diagrams of data and query patterns?

Hi Damien
(not an issue, just a question, please move to Discussions)
I think I saw pictures or diagrams of data and query patterns, little trees,
somewhere under polars, but I can't find them again and don't know the right terms.
I'm not a database guy, but there are so many different ways of storing and querying data
that little pictures might be fun and help intuition.
Does that ring a bell, or am I way off?
Thanks, cheers
-- denis

Benchmark comment "from a CSV"

Hello,

On your CSV benchmark, you use read.csv() (the slow base function) for all three versions base/dplyr/data.table, while comparing them against the Polars-specific CSV reader. Most of the time is spent in read.csv(), so the three versions get essentially the same timings, which is not representative of each implementation: the tidyverse would use readr::read_csv() and data.table would use fread() instead.
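
A fairer version of the CSV benchmark would let each ecosystem use its own reader. A minimal sketch (the file name "data.csv" is a placeholder, and pl$read_csv() is assumed to be available in the installed polars version):

library(microbenchmark)

microbenchmark(
  "base R"     = read.csv("data.csv"),                                # base reader
  "readr"      = readr::read_csv("data.csv", show_col_types = FALSE), # tidyverse reader
  "data.table" = data.table::fread("data.csv"),                       # multi-threaded reader
  "polars"     = polars::pl$read_csv("data.csv"),                     # polars eager reader
  times = 5
)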

Also, in your comparison between eager and lazy polars, you forgot to collect() the lazy version. It should be:

microbenchmark(
  "eager mode" = csv_eager_polars(),
  "lazy mode" = csv_lazy_polars()$collect(),
  times = 5
 )

Otherwise, excellent work!

PhG

Possible Error in piped expressions

On this page, this function:

pl$col("bar")$filter(pl.col("foo") == 1)$sum()

should actually be

pl$col("bar")$filter(pl$col("foo") == 1)$sum()

I think? (I'm still new to polars, so apologies if I missed something!)
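
For what it's worth, the corrected expression behaves as expected on a small made-up DataFrame (a sketch, not from the book):

library(polars)

# toy data: "foo" tags the rows, "bar" holds the values
df <- pl$DataFrame(foo = c(1, 1, 2), bar = c(10, 20, 30))

df$select(
  pl$col("bar")$filter(pl$col("foo") == 1)$sum()
)
# returns 30, i.e. the sum of "bar" where "foo" == 1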

Unfairness in benchmarks

Hello!

First, I'd like to say thanks for a great book and for bringing knowledge about polars to the R community.
I do have a concern about the benchmarks in the "From an R object" section, though.

Currently you pre-initialize the polars object before running your query, while not converting the data.frame to a data.table or to duckdb/arrow beforehand.

robject_polars <- function() {
DataMultiTypes_pl$
# Filter rows

One could argue that DataMultiTypes_pl is no more of an R object than a duckdb connection, as both are external references and can't be directly serialized to RDS. Creating a data.table object also takes additional time (albeit negligible compared to polars and duckdb).

So I propose either starting all benchmarks from a base data.frame or pre-initializing all objects and connections.

In my testing I also found that polars has a substantial initialization overhead compared to duckdb, which moves it down the ranking when initialization happens inside the benchmarked call.
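
As an illustration of the second option, every backend's object could be built once outside the timed code. A sketch with illustrative names (only DataMultiTypes comes from the book; the duckdb/DBI calls are the standard registration pattern):

# build each backend's object up front, outside the benchmark
DataMultiTypes_pl <- polars::pl$DataFrame(DataMultiTypes)        # polars DataFrame
DataMultiTypes_dt <- data.table::as.data.table(DataMultiTypes)   # data.table copy
con <- DBI::dbConnect(duckdb::duckdb())                          # duckdb connection
duckdb::duckdb_register(con, "DataMultiTypes", DataMultiTypes)   # register the data.frame as a view

# ...then benchmark only the query itself against each pre-built object or connection.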

Start a new section about using functions with polars

An example that works:

fn_transformation <- function(data) {

  data$
    # Convert Categorical columns into Strings 
    with_columns(
      pl$col(pl$Categorical)$cast(pl$Utf8))$
    # Make all Strings columns uppercase
    with_columns(
      pl$col(pl$Utf8)$str$to_uppercase())$
    # Keep only the first three rows
    head(3)
  
}

fn_transformation(pl$DataFrame(iris))

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ SETOSA  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ SETOSA  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ SETOSA  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
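
The same function should also work unchanged in lazy mode, since with_columns() and head() exist on LazyFrame too; a sketch (not part of the original example):

# run the same transformation lazily and collect the result at the end
fn_transformation(pl$DataFrame(iris)$lazy())$collect()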

Memory usage outside of R

I have read the description of memory usage on the benchmark page. The likely problem is that memory allocated outside of R cannot be observed from within the R package, so I recommend checking carefully whether the memory usage of DuckDB and the other out-of-process engines is being underestimated.
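
One way to cross-check is to ask DuckDB itself, since the memory held by its buffer manager is invisible to R-side profilers. A sketch, assuming the DBI and duckdb packages:

con <- DBI::dbConnect(duckdb::duckdb())
# PRAGMA database_size reports, among other fields, memory_usage and memory_limit
DBI::dbGetQuery(con, "PRAGMA database_size;")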

inefficient data.table code in benchmarks

the data.table code is a bit unfair...

In the first code block,

robject_dt <- function() {
  
  as.data.table(DataMultiTypes)[
    
    colInt > 2000 & colInt < 8000
    
  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),
    
    by = colString
  ]
}

as.data.table() makes a full copy of the data, so for a fair comparison with polars you could build the data.table beforehand (see the sketch below);
with that change, data.table gets closer to dplyr in my benchmark.
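
For instance, the benchmarked function could reuse a data.table created once outside the timing (DataMultiTypes_dt is an assumed name for as.data.table(DataMultiTypes) built beforehand):

robject_dt <- function() {
  # DataMultiTypes_dt was built once beforehand, so no copy happens inside the benchmark
  DataMultiTypes_dt[

    colInt > 2000 & colInt < 8000

  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),

    by = colString
  ]
}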

In the CSV example you do not need as.data.table() at all, since fread() already returns a data.table;
with that change the data.table method is about 2.5x faster than dplyr (on my machine, with 10 threads for data.table) and probably beats polars (eager).

I could not run the polars code, as it was throwing errors like:

syntax error: days is not a method/attribute of the class RPolarsExprDTNameSpace 
       when calling method:
       (pl$col("colDate2") - pl$col("colDate1"))$dt$days
