
polars-book's Introduction

Warning

The user guide has been moved.

It is now part of the main Polars repository. Please open new issues there.

Find the user guide at its new location here.

Polars Book

This repo contains the User Guide for the Polars DataFrame library.

Getting Started

The User Guide is made with Material for MkDocs. To get started with building this book, run:

make requirements

To serve the book, run make serve. This will run all the Python examples and display the output inline using the markdown-exec plugin.

Deployment

Deployment of the book is done using GitHub Pages and GitHub workflows. The book is automatically deployed on each push to the main branch. There are a number of checks in the CI pipeline to avoid non-working examples:

  • Run all Python examples and fail on any errors (see the sketch below)
  • Check all links in the markdown
  • Run the black formatter
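
A hedged sketch of what the first check might look like as a standalone script. This is a hypothetical helper, not the repo's actual CI configuration; the docs/ glob pattern is an assumption.

```python
# Hypothetical helper: run every Python example and fail if any of them error.
import subprocess
import sys
from pathlib import Path

failures = []
for script in sorted(Path("docs").rglob("*.py")):  # assumed location of the examples
    result = subprocess.run([sys.executable, str(script)], capture_output=True, text=True)
    if result.returncode != 0:
        failures.append((script, result.stderr))

for script, stderr in failures:
    print(f"FAILED: {script}\n{stderr}")

if failures:
    sys.exit(1)
```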

polars-book's People

Contributors

andrewpollack, avimallu, braaannigan, c-peters, carnarez, cnpryer, corneliusroemer, ghuls, henrikig, hofer-julian, hpux735, lucazanna, marcogorelli, mcrumiller, mjclarke94, nickray, paauw, pea-sys, purfakt, r-brink, ritchie46, stefanbras, stinodego, trippy3, uinelj, universalmind303, yuuuxt, zm711, zundertj, zzzzaakk


polars-book's Issues

Expressions page links to itself in "Next Chapter" button

When clicking on the "Next Chapter" button (the big arrow on the right) from this page: https://pola-rs.github.io/polars-book/user-guide/dsl/intro.html, it goes back to that same page, while it should be going to https://pola-rs.github.io/polars-book/user-guide/dsl/contexts.html

This happens when the chapter page and first sub chapter page are the same (in SUMMARY.md):

- [Polars expressions](dsl/intro.md)
  - [Expressions](dsl/intro.md)

It might be a bug (or feature) in mdBook, but I couldn't find anything about it.

Docs on concat and join

There is no mention of concat or join in the current book. I propose adding a section to howcani with a first version of this.
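
A minimal sketch of what such a section could cover (frame and column names are illustrative, not taken from the book):

```python
import polars as pl

df1 = pl.DataFrame({"id": [1, 2], "x": ["a", "b"]})
df2 = pl.DataFrame({"id": [3, 4], "x": ["c", "d"]})
df3 = pl.DataFrame({"id": [1, 2], "y": [0.5, 1.5]})

# Stack frames with the same schema vertically.
stacked = pl.concat([df1, df2])

# Join two frames on a shared key column.
joined = df1.join(df3, on="id", how="inner")

print(stacked)
print(joined)
```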

Cleaning up the getting started steps

Repurposing this issue to yank Docker from the instructions to prevent confusion.

Problem

As a new contributor my number one goal on this repo is to get set up. I like the idea of using a Docker image, but I'm not sure the README provides complete instructions. I'd like to use Docker more, but my experience with it is limited to using it for a couple of projects in the past.

Note the mdBook executable is downloaded rather than compiled, to speed up building the image.

In order to compile mdBook wouldn't Rust need to be installed (along with cargo) on the image?

Suggested improvements for User Guide

Copied from: pola-rs/polars#2540

Hi,

I was going through the user guide (https://pola-rs.github.io/polars-book/user-guide/index.html) to get a feel for the library and ran into some discrepancies I thought you might want to know about. I wasn't sure how to provide this information otherwise, so here is just an overview of the things I noticed:

  • Some snippets have "print(df)" where they should have "print(out)": on the Expressions page (3.1), the Contexts page (3.2), and the Time-series page (8).
  • On the Expressions page (3.1), the right arrow to go to the next page returns you to the Expressions page again.
  • The code sample for getting the US congress dataset on the GroupBy page (3.3) is very unclear about how to replicate ("from .dataset import dataset" doesn't work, and it is unclear how to execute it other than via that snippet).
  • The Gotchas link on the NumPy universal functions page (3.6) gives a 404 error.
  • On the Performance - Strings page (10.1), the final lazy join on categorical data snippet results in an error: "Any(ValueError("joins on categorical dtypes can only happen if they are created under the same global string cache"))" (see the sketch after this message).

Cheers,
Pieter
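
For the categorical join error in the last bullet, a hedged sketch of the usual workaround, assuming a Polars version that provides pl.StringCache and with_columns: create or cast both categorical columns under the same global string cache before joining.

```python
import polars as pl

# Categorical columns can only be joined if they share a string cache,
# so construct (or cast) both frames inside the same StringCache context.
with pl.StringCache():
    left = pl.DataFrame({"key": ["a", "b", "c"]}).with_columns(
        pl.col("key").cast(pl.Categorical)
    )
    right = pl.DataFrame({"key": ["a", "c"], "value": [1, 2]}).with_columns(
        pl.col("key").cast(pl.Categorical)
    )
    out = left.lazy().join(right.lazy(), on="key", how="left").collect()

print(out)
```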

`black` is failing in tests

Problem

As mentioned on Discord

Run black --check .
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.12/x64/bin/black", line 8, in <module>
    sys.exit(patched_main())
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/black/__init__.py", line 6606, in patched_main
    patch_click()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/black/__init__.py", line 6595, in patch_click
    from click import _unicodefun  # type: ignore
ImportError: cannot import name '_unicodefun' from 'click' (/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/click/__init__.py)

Documenting this to either pick up myself or for someone else to grab it before me.

Solution

TBD

Alternative Documentation with Mkdocs

Hi everyone. I have developed alternative documentation with mkdocs, Google Colab, and GitHub Actions. The repository lives at fralfaro/polars-book (documentation: link). Each chapter has a Jupyter notebook connected to Google Colab. The project only covers some chapters right now.

This documentation is similar to mdBook, since mkdocs also uses .md files. The main difference is that it can compile .ipynb files. This is a benefit, because you can use an .ipynb to replicate an example, and it is not necessary to use this kind of syntax:

{{#include ../examples/expressions/window_1.py:0:}}
{{#include ../outputs/expressions/window_1.txt}}

For this reason, it is not necessary to have a folder for the python output.

I hope this is a good alternative. If you have any questions, happy to answer them!

Rename snippet folder and subfolders

The user_guide/src/_examples/ does not need a _ prefix.
The names of the subfolders contained in this directory should not contain -, but _ instead (to allow python -m user_guide.src.examples.<SUBFOLDER>.snippetX.py calls).
Those changes involve fixing paths over the whole repo, several files would be impacted.

Coming from pandas

In the Python cookbook the "Coming from Pandas" section is key for new users. However, it could be structured better. I propose to rework it so it starts with the key conceptual differences (Rust, Arrow, no index, lazy evaluation, query optimization...) before going into differences in writing queries.

@ritchie46 - I can start a PR on this basis if there are no objections.
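
As one illustration of those conceptual differences, a minimal hedged sketch contrasting eager and lazy execution. Column names are made up, and groupby may be spelled group_by depending on the Polars version.

```python
import polars as pl

df = pl.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})

# Eager: each call runs immediately, much like pandas.
eager = df.filter(pl.col("sales") > 15).groupby("city").agg(pl.col("sales").sum())

# Lazy: build a query plan, let the optimizer rearrange it, then collect.
lazy = (
    df.lazy()
    .filter(pl.col("sales") > 15)
    .groupby("city")  # spelled group_by in newer releases
    .agg(pl.col("sales").sum())
    .collect()
)
```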

Documentation uses "col" without mentioning how to get the function

The documentation nowhere mentions the feature flags needed to actually get started using polars. It jumps straight into using "col" which is not available.

One seems to at least need the "lazy" flag but I'm not sure.
The docs also switch between polars::prelude and polars_core::prelude which is also something that's never mentioned anywhere.

The Getting Started guide for example: https://pola-rs.github.io/polars-book/user-guide/quickstart/intro.html
It mentions cargo add polars and then jumps right into something called lazy Polars without ever explaining what that is, and it uses the col function as well. None of that works without enabling at least one feature.

Adding walk-through content, more documentation hand-holding, and examples completeness

This issue covers some of the feedback I have as a new (and currently limited) user of Polars. There's a lot to unpack for these proposals, so I'll start with some context for a foundation.

I'm a long time Pandas user who's worked on projects ranging from discovery work to implementing packaged compute pipelines in Python. I love what Pandas can do, but not how it's designed and often used. This is a major factor behind why I've grown interested in Polars and hope to get more involved to help polish and promote it as the viable DataFrame library it is.

Motivation

Before listing my feedback I want to establish some core assumptions behind my perspective.

First I believe that (1) as an onboarding user the onus is on you to learn the library. (2) Sufficient documentation isn't absolutely necessary to learn Polars, but it will scale Polars' ability to onboard new users (in turn hopefully surrounding the project with more support), and thus should be prioritized within the context of documentation. (3) The API design is a primary feature of a DataFrame library and is a great selling point for a library like Polars. (4) Some people are thorough and will read through the documentation entirely, while others will look for quicker ways to get hands on. IME some people are just hands on and once you get them started it's easier to guide them through more details. It's really difficult to incentivize these users to read through the docs in their entirety without being more creative with the presentation/flow of the docs. FWIW I started out by reading through the entire user guide.

Feedback

  • Introduction page is perfect. I wouldn't change it.
  • Getting started feels short. This is where I think the gap from motivation#4 can be filled. Getting Started content itself can and probably should be compact and to the point, but we can bridge into details and concepts more effectively from here.
  • For example Expressions could be a new concept to both Pandas and non-Pandas users. I'd guess that's a gigantic chunk of potential new users (if it's not the majority). Leading into expressions immediately may start some users off with more information overhead to digest than they should realize. This can be improved by providing 5-10 minutes of opinionated walk-through style content that covers an unspecific workflow using familiar DataFrame library behaviors (Create a dataframe, modify a dataframe, view data from a df, common operations on df, important and relevant data types for dfs, etc.)
  • User Guide needs more directional hand-holding to educate new users. We can use links to redirect users to needed context/details for different topics that are covered. This leads into 2 major directions for documentation: (1) Topic introduction and (2) example completeness. Having actually been introduced to relatable DataFrame library behaviors, we can help users connect the dots between things like typical DataFrame usage behaviors and concepts like expressions, contexts, etc.
  • I've been thinking a lot about example robustness and I'm struggling to see it as a net positive for the User Guide specifically. IMO the User Guide should focus on introducing users to Polars enough to get them moving and familiar enough to prevent anti-patterns or minimize user turnover. Otherwise you can end up with a pretty busy Cookbook. Cookbooks shouldn't be too busy and should have clear objectives and recipes for visiting users (another IMO). My feedback is that I'd expect example robustness to be an initiative for the Reference Guide. That doesn't mean it can't be augmented or supported in some form by the User Guide.
  • At times I'd expect to find examples in the Reference Guide when I don't. This can be solved with the example completeness initiative.
  • Maybe refining the Cookbook within the User Guide could help organize some of these changes. In other words making User Guide vs Cookbook more clear.

Proposals

I'm putting this issue together to gauge interest. Please feel free to tear this apart.

  • #154
  • Add Fundamentals setup page
  • Fill in some API docs gaps

Improve Getting Started page by adding a guided walk-through

The walkthrough could be composed of the following:

  1. Standards - Includes import, API philosophy, references.
  2. Create data - Create a DataFrame, create a column/Series.
  3. Modify data - Modify columns and values in a DataFrame.
  4. View data - Selection of data from DataFrames with and without filters. Could include config references here.
  5. General Ops - Merge, group, morph, etc.
  6. Datetime behaviors and usage - Units, methods, etc.
  7. Additional data types - Categoricals, struct, etc.
  8. I/O - .csv, .parquet, SQL, why not Excel.

This gives the Cookbook a chance to provide surface-level explanations for certain decisions Polars makes through a relatable medium that includes pointers, recommendations, or just links to more relevant content. To me this is closer to actually guiding users through their onboarding rather than just giving them a topic-by-topic guide.
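
As a hedged sketch of what the I/O step (8) in the list above could show (file names are placeholders):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Round-trip through the common formats the walkthrough would cover.
df.write_csv("example.csv")
df.write_parquet("example.parquet")

from_csv = pl.read_csv("example.csv")
from_parquet = pl.read_parquet("example.parquet")
```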

Filling in API Reference gaps

Edit: I'm getting this out just to get it out. Reading this over, I want to start by clearly defining directions for User Guide content, Cookbook content, and Reference content. I think that's the common denominator here.

User Guide organizational enhancements

Successor to #153

tl;dr There's a lot of great content in Cookbook 1.0 which, if reorganized and updated, could provide a powerful revised flow for new and existing users reading through the book.

graph LR
  A[Intro] --> B[Getting Started] --> C[Fundamentals] --> D[Walkthrough] --> E[Recipes] --> F[References and Contributing]

Motivation

If you're aware of what you're about to begin learning it's easier to spot where the fundamental information to learn is located. And from there you can branch out. For broader success we can revise the Cookbook to emphasize what should be emphasized (like Expressions) and refine the surrounding flow for the user to follow.

Solution

| Step | Content | Description |
| --- | --- | --- |
| 1 | Intro | An overview of the Polars DataFrame library. |
| 2 | Getting Started | Create a DataFrame and get started. |
| 3 | Fundamentals | Foundations and fundamental concepts for new and existing Polars users. |
| 4 | Walk-through | A 5-minute walk-through of basic Polars API usage. |
| 5 | Recipes, References, and Contributing | More material to refer to while learning and using Polars. |

To do this we can break up the work into a few PRs.

  1. Update index tree structure to emphasize number 3 above and get the organization right (adds refined table above).
  2. Add a fundamentals page leading into the existing expressions page.
  3. Add #154 but as its own page.
  4. Organize recipes and remaining updates.

Also, the other topic I'd love to solve here would be the User Guide vs Cookbook confusion. IMO the book should be one or the other. If a third option, The Polars Book, could be considered, that'd be sweet 😎.

The challenge we'll have with doing this is coordinating the cohesive revision of the book. If you're interested in coordinating with me on this please reach out πŸ‘‹.

Coming from Javascript

The Polars book could benefit from a JavaScript section with a few basic examples.

Some good ones that come to mind are:

  • reading a csv from a fetch call
  • select/filter operations
  • simple transforms like csv -> json
  • nodejs stream based approach vs polars

Docs on how to programmatically create a DataFrame

A section on how to create a DataFrame in code (Rust) would be great.
There is some material in the rustdocs on Series etc., but I believe a section in the book makes sense.
Also, neither the rustdocs nor the book show an example of how to create a series/column of type List.

How to add a column to a polars DataFrame using .with_columns()

Creating new columns in Polars is not very intuitive when coming from pandas.

Hence I would recommend adding the content of this Stack Overflow answer to the Coming from Pandas guide.
https://stackoverflow.com/questions/72245243/polars-how-to-add-a-column-with-numerical/72245435#72245435

Especially the part on how to add a list:

my_list = [10, 20, 30, 40, 50]
df.with_column(pl.Series(name="col_list", values=my_list))

This would fit well into the Column assignment section.
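
A hedged variant of that recipe, written with the plural with_columns used by newer Polars releases (the exact method name depends on the version installed):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
my_list = [10, 20, 30, 40, 50]

# Add a column from a plain Python list...
df = df.with_columns(pl.Series(name="col_list", values=my_list))

# ...and a column derived from an expression.
df = df.with_columns((pl.col("a") * 2).alias("a_doubled"))
print(df)
```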

Window function examples (ordering within group) seem to produce wrong results

Given the Pokemon dataset example, the code snippet

out = filtered.with_columns(
    [
        pl.col(["Name", "Speed"]).sort(reverse=True).over("Type 1"),
    ]
)
print(out)

apparently sorts Name and Speed individually. While not obviously meaningful (rows now combine names of Pokemon with speed values of different Pokemon, see for example the suggested "Speed of Slowbro" now being 15 instead of 30), this is probably still correct.

Furthermore, the last example should probably order descending, or call the first 3 by "Speed" the slowest 3 instead of the fastest 3.

I would like to understand and see the connection to the usual 'window functions': how to write things such as lead or lag known from SQL, and the explicit correspondence to SQL's OVER (PARTITION BY x ORDER BY y). The documentation currently only illustrates the PARTITION BY x part (the API calls it over), while the ordering part remains unclear, given the seemingly meaningless results mentioned above.
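
A hedged sketch of what a corrected snippet could look like, keeping rows aligned by sorting both columns by Speed via sort_by. The keyword is descending in newer releases (reverse in older ones), and the small DataFrame below is only a stand-in for the Pokemon data.

```python
import polars as pl

# Stand-in for the filtered Pokemon frame used in the snippet above.
filtered = pl.DataFrame(
    {
        "Name": ["Slowpoke", "Slowbro", "Abra"],
        "Type 1": ["Water", "Water", "Psychic"],
        "Speed": [15, 30, 90],
    }
)

out = filtered.with_columns(
    [
        # Sort both columns by Speed within each group so rows stay aligned.
        pl.col(["Name", "Speed"]).sort_by("Speed", descending=True).over("Type 1"),
    ]
)
print(out)
```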

Remove static files from this repo

All images (.png logo and .svg diagrams) should be found in https://github.com/ritchie46/static.
All images created during the make run step remain in the user_guide/src/examples_outputs/ folder.
The user_guide/src/_images/ folder could be entirely removed.

Proposed changes to syntax highlight on "Getting started"

user_guide/src/quickstart/intro.md

```text
# Rust Cargo.toml dependencies
[dependencies]
polars = { version = "0.24.3", features = ["lazy"] }
reqwest = { version = "0.11.12", features = ["blocking"] }
color-eyre = "0.6"
```

I think it should be toml rather than text.

```toml
# Rust Cargo.toml dependencies

[dependencies]
polars = { version = "0.24.3", features = ["lazy"] }
reqwest = { version = "0.11.12", features = ["blocking"] }
color-eyre = "0.6"
```

Operation with custom data types in Coming from Pandas

In the chapter Coming from Pandas, would you consider including an example comparing manipulation of custom data types? The point here is to highlight that Polars is as handy as pandas when dealing with different data types.
For instance

data = {"x": [1, 2, 3], "d": [{"k": True}, {"k": False}, {"k": True, "l": [1, "2"]}]}

Pandas

(
    pd.DataFrame(data)
    .assign(y=lambda df: df.apply(lambda row: [row["x"], len(row["d"])], axis=1))
    .loc[lambda df: df["d"].apply(lambda d: d["k"])]
)

Polars

(
    pl.DataFrame(data)
    .with_column(pl.map(["x", "d"], lambda ls: pl.Series([[x, len(d)] for x, d in zip(*ls)])).alias("y"))
    .filter(pl.col("d").map(lambda s: pl.Series([d["k"] for d in list(s)])))
)

Selecting with Expressions examples do not match associated sample code

See e.g. here:

(screenshot omitted)

In the first example, the assignment variable is wrong, as are the input arguments to df.select:

single_select_df = df.select(["id"]) # not list_select_df
print(single_select_df)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”
β”‚ id  β”‚
β”‚ --- β”‚
β”‚ i64 β”‚
β•žβ•β•β•β•β•β•‘
β”‚ 1   β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 2   β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 3   β”‚
β””β”€β”€β”€β”€β”€β”˜

In the second example, the assigned variable should be list_select_df.

In the section Selecting Rows and Columns, the example only selects rows, not rows and columns.
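
A hedged sketch of an example that selects rows and columns together (column names are hypothetical):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "value": [10, 20, 30]})

# Filter rows and then select a subset of columns in one chain.
rows_and_cols_df = df.filter(pl.col("value") > 10).select(["id", "name"])
print(rows_and_cols_df)
```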

Fix mdBook preprocessor

The preprocessor fails during CI, likely due to a wrong path when invoking mdbook outside of the user_guide/ folder (?).

Note the ghp-import: command not found issue is fixed by this PR.

Time series docs

There are a number of different topics for time series. I propose an intro page with a table of contents and pages for:

  • parsing datetime data e.g. auto-parsing files, converting strings to dates
  • indexing
  • transformations using the .dt namespace
  • temporal groupbys
  • resampling

Any thoughts?
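
A hedged sketch touching a few of these topics, assuming an API version that exposes str.strptime, the .dt namespace, and groupby_dynamic (newer releases rename some of these):

```python
import polars as pl

df = pl.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-02", "2021-01-08", "2021-01-09"],
        "value": [1, 2, 3, 4],
    }
)

# Parsing: convert strings into a proper Date dtype, then sort on the index column.
parsed = df.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")).sort("date")

# Transformations via the .dt namespace.
weeks = parsed.with_columns(pl.col("date").dt.week().alias("week"))

# Temporal groupby / resampling into weekly buckets.
weekly = parsed.groupby_dynamic("date", every="1w").agg(pl.col("value").sum())

print(weeks)
print(weekly)
```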
