
polars-book's Introduction

Warning

The user guide has been moved.

It is now part of the main Polars repository. Please open new issues there.

Find the user guide at its new location here.

Polars Book

This repo contains the User Guide for the Polars DataFrame library.

Getting Started

The User Guide is made with Material for MkDocs. To get started with building this book, run:

make requirements

To serve the book, run make serve. This will run all the Python examples and display the output inline using the markdown-exec plugin.

Deployment

Deployment of the book is done using GitHub Pages and GitHub workflows. The book is automatically deployed on each push to the main branch. There are a number of checks in the CI pipeline to avoid non-working examples:

  • Run all Python examples and fail on any errors (see the sketch below)
  • Check all links in the markdown
  • Run the black formatter
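
A hedged sketch of what the first check might look like as a standalone script. This is a hypothetical helper, not the repo's actual CI configuration; the docs/ glob pattern is an assumption.

```python
# Hypothetical helper: run every Python example and fail if any of them error.
import subprocess
import sys
from pathlib import Path

failures = []
for script in sorted(Path("docs").rglob("*.py")):  # assumed location of the examples
    result = subprocess.run([sys.executable, str(script)], capture_output=True, text=True)
    if result.returncode != 0:
        failures.append((script, result.stderr))

for script, stderr in failures:
    print(f"FAILED: {script}\n{stderr}")

if failures:
    sys.exit(1)
```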

polars-book's People

Contributors

andrewpollack, avimallu, braaannigan, c-peters, carnarez, cnpryer, corneliusroemer, ghuls, henrikig, hofer-julian, hpux735, lucazanna, marcogorelli, mcrumiller, mjclarke94, nickray, paauw, pea-sys, purfakt, r-brink, ritchie46, stefanbras, stinodego, trippy3, uinelj, universalmind303, yuuuxt, zm711, zundertj, zzzzaakk


polars-book's Issues

Expressions page links to itself in "Next Chapter" button

When clicking on the "Next Chapter" button (the big arrow on the right) from this page: https://pola-rs.github.io/polars-book/user-guide/dsl/intro.html, it goes back to that same page, while it should be going to https://pola-rs.github.io/polars-book/user-guide/dsl/contexts.html

This happens when the chapter page and first sub chapter page are the same (in SUMMARY.md):

- [Polars expressions](dsl/intro.md)
  - [Expressions](dsl/intro.md)

It might be a bug (or feature) in mdBook, but I couldn't find anything about it.

Docs on concat and join

There is no mention of concat or join in the current book. I propose adding a section to howcani with a first version of this.
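
A minimal sketch of what such a section could cover (frame and column names are illustrative, not taken from the book):

```python
import polars as pl

df1 = pl.DataFrame({"id": [1, 2], "x": ["a", "b"]})
df2 = pl.DataFrame({"id": [3, 4], "x": ["c", "d"]})
df3 = pl.DataFrame({"id": [1, 2], "y": [0.5, 1.5]})

# Stack frames with the same schema vertically.
stacked = pl.concat([df1, df2])

# Join two frames on a shared key column.
joined = df1.join(df3, on="id", how="inner")

print(stacked)
print(joined)
```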

Cleaning up the getting started steps

Repurposing this issue to yank Docker from the instructions to prevent confusion.

Problem

As a new contributor my number one goal on this repo is to get set up. I like the idea of using a Docker image, but I'm not sure the README provides complete instructions. I'd like to use Docker more, but my experience with it is limited to using it for a couple of projects in the past.

Note the mdBook executable is downloaded rather than compiled, to speed up building the image.

In order to compile mdBook wouldn't Rust need to be installed (along with cargo) on the image?

Suggested improvements for User Guide

Copied from: pola-rs/polars#2540

Hi,

I was going through the user guide (https://pola-rs.github.io/polars-book/user-guide/index.html) to get a feel for the library and ran into some discrepancies I thought you might want to know about. I wasn't sure how to provide this information otherwise, so here is just an overview of the things I noticed:

  • Some snippets have "print(df)" where they should have "print(out)": on the Expressions page (3.1), the Contexts page (3.2), and the Time-series page (8).
  • On the Expressions page (3.1), the right arrow to go to the next page returns you to the Expressions page again.
  • The code sample for getting the US congress dataset on the GroupBy page (3.3) is very unclear about how to replicate ("from .dataset import dataset" doesn't work, and it is unclear how to execute it other than via that snippet).
  • The Gotchas link on the NumPy universal functions page (3.6) gives a 404 error.
  • On the Performance - Strings page (10.1), the final lazy join on categorical data snippet results in an error: "Any(ValueError("joins on categorical dtypes can only happen if they are created under the same global string cache"))" (see the sketch after this message).

Cheers,
Pieter
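
For the categorical join error in the last bullet, a hedged sketch of the usual workaround, assuming a Polars version that provides pl.StringCache and with_columns: create or cast both categorical columns under the same global string cache before joining.

```python
import polars as pl

# Categorical columns can only be joined if they share a string cache,
# so construct (or cast) both frames inside the same StringCache context.
with pl.StringCache():
    left = pl.DataFrame({"key": ["a", "b", "c"]}).with_columns(
        pl.col("key").cast(pl.Categorical)
    )
    right = pl.DataFrame({"key": ["a", "c"], "value": [1, 2]}).with_columns(
        pl.col("key").cast(pl.Categorical)
    )
    out = left.lazy().join(right.lazy(), on="key", how="left").collect()

print(out)
```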

`black` is failing in tests

Problem

As mentioned on Discord

Run black --check .
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.12/x64/bin/black", line 8, in <module>
    sys.exit(patched_main())
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/black/__init__.py", line 6606, in patched_main
    patch_click()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/black/__init__.py", line 6595, in patch_click
    from click import _unicodefun  # type: ignore
ImportError: cannot import name '_unicodefun' from 'click' (/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/click/__init__.py)

Documenting this to either pick up myself or for someone else to grab it before me.

Solution

TBD

Alternative Documentation with Mkdocs

Hi everyone. I have developed alternative documentation with mkdocs, Google Colab, and GitHub Actions. The repository lives at fralfaro/polars-book (documentation: link). Each chapter has a Jupyter notebook connected to Google Colab. The project only covers some chapters right now.

This documentation is similar to mdBook, since mkdocs also uses .md files. The main difference is that it can compile .ipynb files. This is a benefit, because you can use an .ipynb to replicate an example, and it is not necessary to use this kind of syntax:

{{#include ../examples/expressions/window_1.py:0:}}
{{#include ../outputs/expressions/window_1.txt}}

For this reason, it is not necessary to have a folder for the python output.

I hope this is a good alternative. If you have any questions, happy to answer them!

Rename snippet folder and subfolders

The user_guide/src/_examples/ does not need a _ prefix.
The names of the subfolders contained in this directory should not contain -, but _ instead (to allow python -m user_guide.src.examples.<SUBFOLDER>.snippetX.py calls).
Those changes involve fixing paths over the whole repo, several files would be impacted.

Coming from pandas

In the Python cookbook the "Coming from Pandas" section is key for new users. However, it could be structured better. I propose to rework it so it starts with the key conceptual differences (Rust, Arrow, no index, lazy evaluation, query optimization...) before going into differences in writing queries.

@ritchie46 - I can start a PR on this basis if there are no objections.
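
As one illustration of those conceptual differences, a minimal hedged sketch contrasting eager and lazy execution. Column names are made up, and groupby may be spelled group_by depending on the Polars version.

```python
import polars as pl

df = pl.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})

# Eager: each call runs immediately, much like pandas.
eager = df.filter(pl.col("sales") > 15).groupby("city").agg(pl.col("sales").sum())

# Lazy: build a query plan, let the optimizer rearrange it, then collect.
lazy = (
    df.lazy()
    .filter(pl.col("sales") > 15)
    .groupby("city")  # spelled group_by in newer releases
    .agg(pl.col("sales").sum())
    .collect()
)
```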

Documentation uses "col" without mentioning how to get the function

The documentation nowhere mentions the feature flags needed to actually get started using polars. It jumps straight into using "col" which is not available.

One seems to at least need the "lazy" flag but I'm not sure.
The docs also switch between polars::prelude and polars_core::prelude which is also something that's never mentioned anywhere.

The Getting Started guide for example: https://pola-rs.github.io/polars-book/user-guide/quickstart/intro.html
It mentions cargo add polars and then jumps right into something called lazy Polars without ever explaining what that is, and it uses the col function as well. None of that works without enabling at least one feature.

Adding walk-through content, more documentation hand-holding, and examples completeness

This issue covers some of the feedback I have as a new (and currently limited) user of Polars. There's a lot to unpack for these proposals, so I'll start with some context for a foundation.

I'm a long time Pandas user who's worked on projects ranging from discovery work to implementing packaged compute pipelines in Python. I love what Pandas can do, but not how it's designed and often used. This is a major factor behind why I've grown interested in Polars and hope to get more involved to help polish and promote it as the viable DataFrame library it is.

Motivation

Before listing my feedback I want to establish some core assumptions behind my perspective.

First I believe that (1) as an onboarding user the onus is on you to learn the library. (2) Sufficient documentation isn't absolutely necessary to learn Polars, but it will scale Polars' ability to onboard new users (in turn hopefully surrounding the project with more support), and thus should be prioritized within the context of documentation. (3) The API design is a primary feature of a DataFrame library and is a great selling point for a library like Polars. (4) Some people are thorough and will read through the documentation entirely, while others will look for quicker ways to get hands on. IME some people are just hands on and once you get them started it's easier to guide them through more details. It's really difficult to incentivize these users to read through the docs in their entirety without being more creative with the presentation/flow of the docs. FWIW I started out by reading through the entire user guide.

Feedback

  • Introduction page is perfect. I wouldn't change it.
  • Getting started feels short. This is where I think the gap from motivation#4 can be filled. Getting Started content itself can and probably should be compact and to the point, but we can bridge into details and concepts more effectively from here.
  • For example Expressions could be a new concept to both Pandas and non-Pandas users. I'd guess that's a gigantic chunk of potential new users (if it's not the majority). Leading into expressions immediately may start some users off with more information overhead to digest than they should realize. This can be improved by providing 5-10 minutes of opinionated walk-through style content that covers an unspecific workflow using familiar DataFrame library behaviors (Create a dataframe, modify a dataframe, view data from a df, common operations on df, important and relevant data types for dfs, etc.)
  • User Guide needs more directional hand-holding to educate new users. We can use links to redirect users to needed context/details for different topics that are covered. This leads into 2 major directions for documentation: (1) Topic introduction and (2) example completeness. Having actually been introduced to relatable DataFrame library behaviors, we can help users connect the dots between things like typical DataFrame usage behaviors and concepts like expressions, contexts, etc.
  • I've been thinking a lot about example robustness and I'm struggling to see it as a net positive for the User Guide specifically. IMO the User Guide should focus on introducing users to Polars enough to get them moving and familiar enough to prevent anti-patterns or minimize user turnover. Otherwise you can end up with a pretty busy Cookbook. Cookbooks shouldn't be too busy and should have clear objectives and recipes for visiting users (another IMO). My feedback is that I'd expect example robustness to be an initiative for the Reference Guide. That doesn't mean it can't be augmented or supported in some form by the User Guide.
  • At times I'd expect to find examples in the Reference Guide when I don't. This can be solved with the example completeness initiative.
  • Maybe refining the Cookbook within the User Guide could help organize some of these changes. In other words making User Guide vs Cookbook more clear.

Proposals

I'm putting this issue together to gauge interest. Please feel free to tear this apart.

  • #154
  • Add Fundamentals setup page
  • Fill in some API docs gaps

Improve Getting Started page by adding a guided walk-through

The walkthrough could be composed of the following:

  1. Standards - Includes import, API philosophy, references.
  2. Create data - Create a DataFrame, create a column/Series.
  3. Modify data - Modify columns and values in a DataFrame.
  4. View data - Selection of data from DataFrames with and without filters. Could include config references here.
  5. General Ops - Merge, group, morph, etc.
  6. Datetime behaviors and usage - Units, methods, etc.
  7. Additional data types - Categoricals, struct, etc.
  8. I/O - .csv, .parquet, SQL, why not Excel.

This gives the Cookbook a chance to provide surface-level explanations for certain decisions Polars makes through a relatable medium that includes pointers, recommendations, or just links to more relevant content. To me this is closer to actually guiding users through their onboarding rather than just giving them a topic-by-topic guide.
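
As a hedged sketch of what the I/O step (8) in the list above could show (file names are placeholders):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Round-trip through the common formats the walkthrough would cover.
df.write_csv("example.csv")
df.write_parquet("example.parquet")

from_csv = pl.read_csv("example.csv")
from_parquet = pl.read_parquet("example.parquet")
```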

Filling in API Reference gaps

Edit: I'm getting this out just to get it out. Reading this over, I want to start by clearly defining directions for User Guide content, Cookbook content, and Reference content. I think that's the common denominator here.

User Guide organizational enhancements

Successor to #153

tl;dr There's a lot of great content in Cookbook 1.0 which, if reorganized and updated, could provide a powerful revised flow for new and existing users reading through the book.

graph LR
  A[Intro] --> B[Getting Started] --> C[Fundamentals] --> D[Walkthrough] --> E[Recipes] --> F[References and Contributing]

Motivation

If you're aware of what you're about to begin learning it's easier to spot where the fundamental information to learn is located. And from there you can branch out. For broader success we can revise the Cookbook to emphasize what should be emphasized (like Expressions) and refine the surrounding flow for the user to follow.

Solution

| Step | Content | Description |
| --- | --- | --- |
| 1 | Intro | An overview of the Polars DataFrame library. |
| 2 | Getting Started | Create a DataFrame and get started. |
| 3 | Fundamentals | Foundations and fundamental concepts for new and existing Polars users. |
| 4 | Walk-through | A 5-minute walk-through of basic Polars API usage. |
| 5 | Recipes, References, and Contributing | More material to refer to while learning and using Polars. |

To do this we can break up the work into a few PRs.

  1. Update index tree structure to emphasize number 3 above and get the organization right (adds refined table above).
  2. Add a fundamentals page leading into the existing expressions page.
  3. Add #154 but as its own page.
  4. Organize recipes and remaining updates.

Also, the other topic I'd love to solve here would be the User Guide vs Cookbook confusion. IMO the book should be one or the other. If a third option, The Polars Book, could be considered, that'd be sweet 😎.

The challenge we'll have with doing this is coordinating the cohesive revision of the book. If you're interested in coordinating with me on this please reach out πŸ‘‹.

Coming from Javascript

The Polars book could benefit from a JavaScript section with a few basic examples.

Some good ones that come to mind are:

  • reading a csv from a fetch call
  • select/filter operations
  • simple transforms like csv -> json
  • nodejs stream based approach vs polars

Docs on how to programmatically create a DataFrame

A section on how to create a DataFrame in code (Rust) would be great.
There is some material in the rustdocs on Series etc., but I believe a section in the book makes sense.
Also, neither the rustdocs nor the book show an example of how to create a series/column of type List.

How to add a column to a polars DataFrame using .with_columns()

Creating new columns in Polars is not very intuitive when coming from pandas.

Hence I would recommend adding the content of this Stack Overflow answer to the Coming from Pandas guide.
https://stackoverflow.com/questions/72245243/polars-how-to-add-a-column-with-numerical/72245435#72245435

Especially the part on how to add a list:

my_list = [10, 20, 30, 40, 50]
df.with_column(pl.Series(name="col_list", values=my_list))

This would fit well into the Column assignment section.
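
A hedged variant of that recipe, written with the plural with_columns used by newer Polars releases (the exact method name depends on the version installed):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
my_list = [10, 20, 30, 40, 50]

# Add a column from a plain Python list...
df = df.with_columns(pl.Series(name="col_list", values=my_list))

# ...and a column derived from an expression.
df = df.with_columns((pl.col("a") * 2).alias("a_doubled"))
print(df)
```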

Window function examples (ordering within group) seem to produce wrong results

Given the Pokemon dataset example, the code snippet

out = filtered.with_columns(
    [
        pl.col(["Name", "Speed"]).sort(reverse=True).over("Type 1"),
    ]
)
print(out)

apparently sorts Name and Speed individually. While not obviously meaningful (rows now combine names of Pokemon with speed values of different Pokemon, see for example the suggested "Speed of Slowbro" now being 15 instead of 30), this is probably still correct.

Furthermore, the last example should probably order descending, or call the first 3 by "Speed" the slowest 3 instead of the fastest 3.

I would like to understand and see the connection to the usual 'window functions': how to write things such as lead or lag known from SQL, and the explicit correspondence to SQL's OVER (PARTITION BY x ORDER BY y). The documentation currently only illustrates the PARTITION BY x part (the API calls it over), while the ordering part remains unclear, given the seemingly meaningless results mentioned above.
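
A hedged sketch of what a corrected snippet could look like, keeping rows aligned by sorting both columns by Speed via sort_by. The keyword is descending in newer releases (reverse in older ones), and the small DataFrame below is only a stand-in for the Pokemon data.

```python
import polars as pl

# Stand-in for the filtered Pokemon frame used in the snippet above.
filtered = pl.DataFrame(
    {
        "Name": ["Slowpoke", "Slowbro", "Abra"],
        "Type 1": ["Water", "Water", "Psychic"],
        "Speed": [15, 30, 90],
    }
)

out = filtered.with_columns(
    [
        # Sort both columns by Speed within each group so rows stay aligned.
        pl.col(["Name", "Speed"]).sort_by("Speed", descending=True).over("Type 1"),
    ]
)
print(out)
```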

Remove static files from this repo

All images (.png logo and .svg diagrams) should be found in https://github.com/ritchie46/static.
All images created during the make run step remain in the user_guide/src/examples_outputs/ folder.
The user_guide/src/_images/ folder could be entirely removed.

Proposed changes to syntax highlight on "Getting started"

user_guide/src/quickstart/intro.md

```text
# Rust Cargo.toml dependencies
[dependencies]
polars = { version = "0.24.3", features = ["lazy"] }
reqwest = { version = "0.11.12", features = ["blocking"] }
color-eyre = "0.6"
```

I think it should be toml rather than text.

```toml
# Rust Cargo.toml dependencies

[dependencies]
polars = { version = "0.24.3", features = ["lazy"] }
reqwest = { version = "0.11.12", features = ["blocking"] }
color-eyre = "0.6"
```

Operation with custom data types in Coming from Pandas

In the chapter Coming from Pandas, would you consider including an example comparing manipulation of custom data types? The point here is to highlight that Polars is as handy as pandas when dealing with different data types.
For instance

data = {"x": [1, 2, 3], "d": [{"k": True}, {"k": False}, {"k": True, "l": [1, "2"]}]}

Pandas

(
    pd.DataFrame(data)
    .assign(y=lambda df: df.apply(lambda row: [row["x"], len(row["d"])], axis=1))
    .loc[lambda df: df["d"].apply(lambda d: d["k"])]
)

Polars

(
    pl.DataFrame(data)
    .with_column(pl.map(["x", "d"], lambda ls: pl.Series([[x, len(d)] for x, d in zip(*ls)])).alias("y"))
    .filter(pl.col("d").map(lambda s: pl.Series([d["k"] for d in list(s)])))
)

Selecting with Expressions examples do not match associated sample code

See e.g. here:

(screenshot omitted)

In the first example, the assignment variable is wrong, as are the input arguments to df.select:

single_select_df = df.select(["id"]) # not list_select_df
print(single_select_df)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”
β”‚ id  β”‚
β”‚ --- β”‚
β”‚ i64 β”‚
β•žβ•β•β•β•β•β•‘
β”‚ 1   β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 2   β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 3   β”‚
β””β”€β”€β”€β”€β”€β”˜

In the second example, the assigned variable should be list_select_df.

In the section Selecting Rows and Columns, the example only selects rows, not rows and columns.
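
A hedged sketch of an example that selects rows and columns together (column names are hypothetical):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "value": [10, 20, 30]})

# Filter rows and then select a subset of columns in one chain.
rows_and_cols_df = df.filter(pl.col("value") > 10).select(["id", "name"])
print(rows_and_cols_df)
```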

Fix mdBook preprocessor

The preprocessor fails during CI, likely due to a wrong path when invoking mdbook outside of the user_guide/ folder (?).

Note the ghp-import: command not found issue is fixed by this PR.

Time series docs

There are a number of different topics for time series. I propose an intro page with a table of contents and pages for:

  • parsing datetime data e.g. auto-parsing files, converting strings to dates
  • indexing
  • transformations using the .dt namespace
  • temporal groupbys
  • resampling

Any thoughts?
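
A hedged sketch touching a few of these topics, assuming an API version that exposes str.strptime, the .dt namespace, and groupby_dynamic (newer releases rename some of these):

```python
import polars as pl

df = pl.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-02", "2021-01-08", "2021-01-09"],
        "value": [1, 2, 3, 4],
    }
)

# Parsing: convert strings into a proper Date dtype, then sort on the index column.
parsed = df.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")).sort("date")

# Transformations via the .dt namespace.
weeks = parsed.with_columns(pl.col("date").dt.week().alias("week"))

# Temporal groupby / resampling into weekly buckets.
weekly = parsed.groupby_dynamic("date", every="1w").agg(pl.col("value").sum())

print(weeks)
print(weekly)
```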
