cookbook-rpolars's Issues

pictures or diagrams of data and query patterns?

Hi Damien
(not an issue, just a question, please move to Discussions)
I think I saw pictures or diagrams of data and query patterns, little trees,
somewhere under polars, but I can't find them again and don't know the right terms.
I'm not a database guy, but there are so many different ways of storing and querying data
that little pictures might be fun and help intuition.
Does that ring a bell, or am I way off?
Thanks, cheers
-- denis

Benchmark comment "from a CSV"

Hello,

On your CSV benchmark, you use read.csv() (the slow base function) for all three versions base/dplyr/data.table, while comparing them against the Polars-specific CSV reader. Most of the time is spent in read.csv(), so the three versions get essentially the same timings, which is not representative of each implementation: the tidyverse would use readr::read_csv() and data.table would use fread() instead.
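
A fairer version of the CSV benchmark would let each ecosystem use its own reader. A minimal sketch (the file name "data.csv" is a placeholder, and pl$read_csv() is assumed to be available in the installed polars version):

library(microbenchmark)

microbenchmark(
  "base R"     = read.csv("data.csv"),                                # base reader
  "readr"      = readr::read_csv("data.csv", show_col_types = FALSE), # tidyverse reader
  "data.table" = data.table::fread("data.csv"),                       # multi-threaded reader
  "polars"     = polars::pl$read_csv("data.csv"),                     # polars eager reader
  times = 5
)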

Also, in your comparison between eager and lazy polars, you forgot to collect() the lazy version. It should be:

microbenchmark(
  "eager mode" = csv_eager_polars(),
  "lazy mode" = csv_lazy_polars()$collect(),
  times = 5
 )

Otherwise, excellent work!

PhG

Possible Error in piped expressions

On this page, this function:

pl$col("bar")$filter(pl.col("foo") == 1)$sum()

should actually be

pl$col("bar")$filter(pl$col("foo") == 1)$sum()

I think? (I'm still new to polars, so apologies if I missed something!)
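
For what it's worth, the corrected expression behaves as expected on a small made-up DataFrame (a sketch, not from the book):

library(polars)

# toy data: "foo" tags the rows, "bar" holds the values
df <- pl$DataFrame(foo = c(1, 1, 2), bar = c(10, 20, 30))

df$select(
  pl$col("bar")$filter(pl$col("foo") == 1)$sum()
)
# returns 30, i.e. the sum of "bar" where "foo" == 1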

Unfairness in benchmarks

Hello!

First, I'd like to say thanks for a great book and for bringing knowledge about polars to the R community.
I do have a concern about the benchmarks in the "From an R object" section, though.

Currently you pre-initialize the polars object before running your query, while not converting the data.frame to a data.table or to duckdb/arrow beforehand.

robject_polars <- function() {
DataMultiTypes_pl$
# Filter rows

One could argue that DataMultiTypes_pl is no more of an R object than a duckdb connection, as both are external references and can't be directly serialized to RDS. Creating a data.table object also takes additional time (albeit negligible compared to polars and duckdb).

So I propose either starting all benchmarks from a base data.frame or pre-initializing all objects and connections.

In my testing I also found that polars has a substantial initialization overhead compared to duckdb, which moves it down the ranking when initialization happens inside the benchmarked call.
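
As an illustration of the second option, every backend's object could be built once outside the timed code. A sketch with illustrative names (only DataMultiTypes comes from the book; the duckdb/DBI calls are the standard registration pattern):

# build each backend's object up front, outside the benchmark
DataMultiTypes_pl <- polars::pl$DataFrame(DataMultiTypes)        # polars DataFrame
DataMultiTypes_dt <- data.table::as.data.table(DataMultiTypes)   # data.table copy
con <- DBI::dbConnect(duckdb::duckdb())                          # duckdb connection
duckdb::duckdb_register(con, "DataMultiTypes", DataMultiTypes)   # register the data.frame as a view

# ...then benchmark only the query itself against each pre-built object or connection.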

Start a new section about using functions with polars

An example that works:

fn_transformation <- function(data) {

  data$
    # Convert Categorical columns into Strings 
    with_columns(
      pl$col(pl$Categorical)$cast(pl$Utf8))$
    # Make all Strings columns uppercase
    with_columns(
      pl$col(pl$Utf8)$str$to_uppercase())$
    # Keep only the first three rows
    head(3)
  
}

fn_transformation(pl$DataFrame(iris))

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ SETOSA  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ SETOSA  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ SETOSA  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
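
The same function should also work unchanged in lazy mode, since with_columns() and head() exist on LazyFrame too; a sketch (not part of the original example):

# run the same transformation lazily and collect the result at the end
fn_transformation(pl$DataFrame(iris)$lazy())$collect()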

Memory usage outside of R

I have read the description of memory usage on the benchmark page. The likely problem is that memory allocated outside of R cannot be observed from within the R package, so I recommend checking carefully whether the memory usage of DuckDB and the other out-of-process engines is being underestimated.
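
One way to cross-check is to ask DuckDB itself, since the memory held by its buffer manager is invisible to R-side profilers. A sketch, assuming the DBI and duckdb packages:

con <- DBI::dbConnect(duckdb::duckdb())
# PRAGMA database_size reports, among other fields, memory_usage and memory_limit
DBI::dbGetQuery(con, "PRAGMA database_size;")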

inefficient data.table code in benchmarks

the data.table code is a bit unfair...

In the first code block,

robject_dt <- function() {
  
  as.data.table(DataMultiTypes)[
    
    colInt > 2000 & colInt < 8000
    
  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),
    
    by = colString
  ]
}

as.data.table() makes a full copy of the data, so for a fair comparison with polars you could build the data.table beforehand (see the sketch below);
with that change, data.table gets closer to dplyr in my benchmark.
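
For instance, the benchmarked function could reuse a data.table created once outside the timing (DataMultiTypes_dt is an assumed name for as.data.table(DataMultiTypes) built beforehand):

robject_dt <- function() {
  # DataMultiTypes_dt was built once beforehand, so no copy happens inside the benchmark
  DataMultiTypes_dt[

    colInt > 2000 & colInt < 8000

  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),

    by = colString
  ]
}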

In the CSV example you do not need as.data.table() at all, since fread() already returns a data.table;
with that change the data.table method is about 2.5x faster than dplyr (on my machine, with 10 threads for data.table) and probably beats polars (eager).

I could not run the polars code, as it was throwing errors like:

syntax error: days is not a method/attribute of the class RPolarsExprDTNameSpace 
       when calling method:
       (pl$col("colDate2") - pl$col("colDate1"))$dt$days
