pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Home Page: https://docs.pola.rs

License: Other

Rust 65.32% Python 34.59% Makefile 0.08% CSS 0.01%

dataframe-library dataframe dataframes rust arrow python out-of-core polars

polars's Introduction

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

Lazy | eager execution
Multi-threaded
SIMD
Query optimization
Powerful expression API
Hybrid Streaming (larger-than-RAM datasets)
Rust | Python | NodeJS | R | ...

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
...     SELECT species,
...       AVG(sepal_length) AS avg_sepal_length
...     FROM self
...     GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...    "species": ["Setosa", "Versicolor", "Virginica"],
...    "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;

Refer to the Polars CLI repository for more information.

Performance 🚀🚀

Blazingly fast

Polars is very fast. In fact, it is one of the best-performing solutions available. See the TPC-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

polars: 70ms
numpy: 104ms
pandas: 520ms
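
Import times vary with the machine, Python version, and disk cache, so figures like these are best re-measured locally. Below is a minimal sketch, using only the standard library, that times a cold import in a fresh interpreter. The helper name cold_import_ms is hypothetical, and json stands in for any installed module such as polars, numpy, or pandas.

```python
import subprocess
import sys


def cold_import_ms(module_name):
    """Time `import module_name` in a fresh interpreter, in milliseconds."""
    # A new process is used so that modules already cached in the current
    # interpreter do not make the import look faster than it is.
    code = (
        "import time, importlib;"
        "t0 = time.perf_counter();"
        f"importlib.import_module({module_name!r});"
        "print((time.perf_counter() - t0) * 1000)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


if __name__ == "__main__":
    # Replace "json" with "polars", "numpy", or "pandas" if they are installed.
    print(f"json: {cold_import_ms('json'):.1f}ms")
```

A single run is noisy; repeating the measurement a few times and taking the minimum gives a steadier figure.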

Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine can process your query (or parts of it) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Call collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)
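
The general idea of streaming execution can be illustrated without Polars itself. The sketch below is not Polars' engine or API. It is a standard-library illustration, with a hypothetical function name and made-up sample rows, of how a grouped average can be computed one row at a time so that only the per-group running totals stay in memory.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(lines, key, value):
    """Compute the mean of `value` per `key`, reading one CSV row at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # csv.DictReader pulls rows lazily from `lines`, so the whole input
    # never has to be held in memory at once.
    for row in csv.DictReader(lines):
        sums[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Made-up sample data; a real input would be an open file handle.
sample = io.StringIO(
    "species,sepal_length\n"
    "Setosa,5.0\n"
    "Setosa,5.2\n"
    "Virginica,6.5\n"
)
means = streaming_group_mean(sample, "species", "sepal_length")
```

Memory use here grows with the number of distinct groups, not with the number of rows, which is the property that makes larger-than-RAM aggregation possible.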

Setup

Python

Install the latest Polars version with:

pip install polars

We also have a conda package (conda install -c conda-forge polars); however, pip is the preferred way to install Polars.

Install Polars with all optional dependencies:

pip install 'polars[all]'

You can also install a subset of all optional dependencies.

pip install 'polars[numpy,pandas,pyarrow]'

See the User Guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bugfixes / features might not be a bad idea.

Rust

You can take the latest release from crates.io or, if you want the latest features and performance improvements, point to the main branch of this repo.

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.79.

Contributing

Want to contribute? Read our contributing guide.

Python: compile Polars from source

If you want a bleeding edge release or maximal performance you should compile Polars from source.

This can be done by going through the following steps in sequence:

1. Install the latest Rust compiler
2. Install maturin: pip install maturin
3. cd py-polars and choose one of the following:
   - make build-release, fastest binary, very long compile times
   - make build-opt, fast binary with debug symbols, long compile times
   - make build-debug-opt, medium-speed binary with debug assertions and symbols, medium compile times
   - make build, slow binary with debug assertions and symbols, fast compile times
Append -native (e.g. make build-release-native) to enable further optimizations specific to your CPU. This produces a non-portable binary/wheel however.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.

Using custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for DataFrame and Series data structures. See more in https://github.com/pola-rs/pyo3-polars.

Going big...

Do you expect more than 2^32 (~4.3 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, run pip install polars-u64-idx.

Don't use this unless you hit the row boundary as the default build of Polars is faster and consumes less memory.
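
The row boundary is simple arithmetic. The sketch below takes the 2^32 figure above at face value; the helper name needs_big_index is hypothetical and is not part of Polars.

```python
# Row limit stated above for the default build.
DEFAULT_ROW_LIMIT = 2**32  # 4_294_967_296


def needs_big_index(n_rows):
    """Return True if a frame with n_rows rows exceeds the default limit."""
    return n_rows > DEFAULT_ROW_LIMIT


one_billion = needs_big_index(1_000_000_000)   # well under the limit
five_billion = needs_big_index(5_000_000_000)  # over the limit
```

Only in the second case would the bigidx build be worth its extra memory and speed cost.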

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Run pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.

polars's Issues

outer join

Dataframe Union.

Series cast

Cast numerical types option. This can be used for checking equality between different types.

If single chunk, create an iterator with a single bounds check.

This probably needs some runtime checks to make it possible. Should iter() return Box<dyn Iterator>?

postgres connection

Docker image with rust Notebook + Polars

https://github.com/google/evcxr

Nice project

This is a pretty interesting project; it shows how powerful things can be built using Rust + Arrow.
A few basic questions regarding your future plans:

There is some overlap across the Polars, ndarray and DataFusion projects. E.g. sum() is implemented by all of them. Do you plan to converge in the long run, or is this the intended level of abstraction and separation?
Compared to Pandas, will polars run multi-threaded, using all the cores e.g. using rayon? Like Modin or Dask.
Do you plan to create a Python API (e.g. pyo3)?

static dispatch for intoiterator chunkedarray

Can use enum_dispatch for that

Make Utf8 nullable

Currently nullable strings are represented as an empty string. It probably entails more info if we use Option<String>

Python bindings

Not a great priority, but would be nice to create Python bindings with Pyo3

[Docs] Series documentation to Series enum

The docs that are currently at module level would fit better on the Series enum.

[rust]|[python] add power/ log arithmetic

Add all rust native numeric types

Int8,
Int16,
UInt8,
UInt16,
UInt64,

are currently missing.

add all arrow temporal types

Create RecordBatches from DF

Needed for serialization to CSV, JSON and Parquet

outer join. hash shorter relation

[Python] add true division

In-place true division by casting i32/u32 to f32 and i64 to f64

groupby aggregate functions

Can also implement median for ChunkedArray

[Python] Utilize numpy ufunc

Series can be used by ufunc by using array
https://numpy.org/devdocs/user/basics.dispatch.html

Create an aligned array in the Rust heap and use numpy out keyword https://numpy.org/doc/stable/reference/ufuncs.html to write ufunc output to Rust. This buffer can be used to create an arrow array.

slice method on df

create nullable in dataframe [python]

resample functionality

Support for reading Feather format.

It looks like polars is still in early stages, so it might be too soon for this feature request, but it would be nice if polars can read and write Dataframes to Apache Arrow Feather format.

I currently have some pandas code that reads a Feather file with 25000 columns and 1 million rows (of float32, so about 93 GB), on which I need to apply a function to each column:

create a random permutation of the same length as the number of rows
read a column but access the elements in the order created in the random permutation step (this is to make sure that, when argsorting later, tied scores that appear closer to the top of the column are not ranked higher)
argsort that array in reverse
undo the random permutation step on the previous array (so we have a ranking for the input column)

The problem with pandas is that pd.read_feather consolidates the data (so it makes a memory copy) as it stores all data in a big numpy array, while Feather should be zero copy, so if polars would support zero copy operations on feather files, it would be great.

As far as I can see, argsort was recently implemented.
I guess the zero copy numpy view would be important too.

import numpy as np
import pandas as pd

df_scores__motifs_vs_regions_or_genes = pd.read_feather('motifs_vs_regions_or_genes.scores.feather')


def rank_CRM_scores_and_assign_random_ranking_in_range_for_ties_func(crm_scores_with_ties_for_motif_numpy):
    # Create random permutation so tied scores will have a different ranking each time.
    random_permutations_to_break_ties_numpy = np.random.permutation(crm_scores_with_ties_for_motif_numpy.shape[0])

    # Argsort the shuffled scores in descending order, map the result back to
    # the original row indices, then argsort again to get one rank per row.
    rank_column_with_broken_ties_numpy = random_permutations_to_break_ties_numpy[
        (-crm_scores_with_ties_for_motif_numpy)[random_permutations_to_break_ties_numpy].argsort()
    ].argsort().astype(np.int32)

    return rank_column_with_broken_ties_numpy


# Create feature table ranking.
df_ranking__motifs_vs_regions_or_genes = df_scores__motifs_vs_regions_or_genes.apply(
    rank_CRM_scores_and_assign_random_ranking_in_range_for_ties_func,
    axis='index',
    raw=True
)

df_ranking__motifs_vs_regions_or_genes.reset_index(inplace=True)

df_ranking__motifs_vs_regions_or_genes.to_feather(path='motifs_vs_regions_or_genes.rankings.feather')
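
The four ranking steps described in this issue can be checked on a toy input without pandas or numpy. The sketch below is a pure-Python restatement under the assumption that higher scores should get lower (better) ranks. The names argsort and rank_with_random_ties are hypothetical helpers, not part of any library.

```python
import random


def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=values.__getitem__)


def rank_with_random_ties(scores, rng):
    """Rank scores in descending order, breaking ties at random."""
    # Step 1: random permutation with one entry per row.
    perm = list(range(len(scores)))
    rng.shuffle(perm)
    # Step 2: read the scores in permuted order (negated for descending sort).
    shuffled_neg = [-scores[i] for i in perm]
    # Step 3: argsort the shuffled scores, then map back to original indices.
    original_indices_by_rank = [perm[j] for j in argsort(shuffled_neg)]
    # Step 4: invert that ordering so each original row gets its rank.
    return argsort(original_indices_by_rank)


scores = [0.9, 0.5, 0.9, 0.1]
ranks = rank_with_random_ties(scores, random.Random(0))
```

Rows 0 and 2 are tied, so they share ranks 0 and 1 in an order that depends on the permutation, while rows 1 and 3 always receive ranks 2 and 3.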