pola-rs / polars
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Home Page: https://docs.pola.rs
License: MIT License
Sort merge join can be faster than hash join when Series are sorted and maybe when they are not.
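As a rough illustration of why sorted input helps, here is a minimal sketch of a sort-merge inner join over two key slices that are assumed to be sorted ascending already. The function name `sort_merge_inner_join` and the `(left_index, right_index)` output shape are assumptions made for this example, not the polars API. Both inputs are walked once with two cursors, so no hash table has to be built.

```rust
use std::cmp::Ordering;

/// Inner join of two key slices that are already sorted ascending.
/// Returns (left_index, right_index) pairs for every matching key.
/// Hypothetical helper for illustration only.
fn sort_merge_inner_join(left: &[i32], right: &[i32]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            // Advance whichever side currently holds the smaller key.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i];
                // Find the run of equal keys on the right side.
                let run_start = j;
                let mut run_end = j;
                while run_end < right.len() && right[run_end] == key {
                    run_end += 1;
                }
                // Pair every left row holding this key with every right row in the run.
                while i < left.len() && left[i] == key {
                    for r in run_start..run_end {
                        out.push((i, r));
                    }
                    i += 1;
                }
                j = run_end;
            }
        }
    }
    out
}
```

For unsorted input the same routine could be applied after an argsort of both sides, at the cost of the two sorts.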
Fixed in apache/arrow#7554.
Maybe apply the fix in the vendored Arrow version until the next release.
ChunkedArray.take has a parameter TakeOptions, which is currently ignored.
We can currently create a Series with iterator.collect(). This doesn't give you the option to pass a known capacity and can be more expensive than it has to be. Add a constructor for iterators with a known capacity.
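The idea can be sketched on a plain Vec, which stands in here for the buffer behind a Series. The name `from_iter_with_capacity` is hypothetical. It is useful when the iterator cannot report its own length (for example after a filter) but the caller knows how many items to expect.

```rust
/// Hypothetical constructor: collect an iterator into a Vec whose buffer is
/// allocated up front with a capacity supplied by the caller.
fn from_iter_with_capacity<T, I>(iter: I, capacity: usize) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    // Reserve once instead of growing repeatedly while collecting.
    let mut values = Vec::with_capacity(capacity);
    values.extend(iter);
    values
}
```

If the capacity turns out to be too small, the Vec still grows as usual, so the argument is only a hint.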
Not a great priority, but it would be nice to create Python bindings with PyO3.
Int8, Int16, UInt8, UInt16, and UInt64 are currently missing.
Can use enum_dispatch for that
Can also implement median for ChunkedArray
Arithmetic is now only implemented for numerical types. Add boolean arithmetic with casting on the fly.
In-place true division by casting i32/u32 to f32 and i64 to f64.
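A minimal sketch of the casting step on plain slices follows. The function names are hypothetical, and the sketch returns a new Vec rather than working in place, because the output element type differs from the input type. Note that casting to f32 is lossy for i32 values beyond 2^24 in magnitude.

```rust
/// True division of 32-bit integers by casting both operands to f32.
/// Hypothetical helper; division by zero follows IEEE 754 (inf or NaN).
fn true_div_i32(lhs: &[i32], rhs: &[i32]) -> Vec<f32> {
    lhs.iter()
        .zip(rhs)
        .map(|(&a, &b)| a as f32 / b as f32)
        .collect()
}

/// True division of 64-bit integers by casting both operands to f64.
fn true_div_i64(lhs: &[i64], rhs: &[i64]) -> Vec<f64> {
    lhs.iter()
        .zip(rhs)
        .map(|(&a, &b)| a as f64 / b as f64)
        .collect()
}
```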
Apply currently collects to standard Vecs. This leads to a copy once the Arrow array gets created.
Instead of df.select(&str), use a Selection trait that can create a Vec<&str>. Implement it for &str, Vec<&str>, (String, String), etc.
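One possible shape for such a trait is sketched below. The method name `to_selection_vec` and the free function `select_names` are assumptions made for this example. The tuple implementation is written for a borrowed `&(String, String)`, since the returned `&str` values have to borrow from something that outlives the call.

```rust
/// Anything that can be turned into a list of column names.
/// Hypothetical trait, sketched from the proposal above.
trait Selection<'a> {
    fn to_selection_vec(self) -> Vec<&'a str>;
}

impl<'a> Selection<'a> for &'a str {
    fn to_selection_vec(self) -> Vec<&'a str> {
        vec![self]
    }
}

impl<'a> Selection<'a> for Vec<&'a str> {
    fn to_selection_vec(self) -> Vec<&'a str> {
        self
    }
}

impl<'a> Selection<'a> for &'a (String, String) {
    fn to_selection_vec(self) -> Vec<&'a str> {
        vec![self.0.as_str(), self.1.as_str()]
    }
}

/// Stand-in for df.select: it only resolves the requested column names.
fn select_names<'a, S: Selection<'a>>(selection: S) -> Vec<&'a str> {
    selection.to_selection_vec()
}
```

With this in place, a single generic select method could accept one name, a list of names, or a pair of owned strings without separate overloads.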
This is a pretty interesting project; it shows how powerful things can be built using Rust + Arrow.
A few basic questions regarding your future plans:
The Arrow comparison function seems to give flawed results for sliced arrays. Fall back to our own iterators in f12c920.
Make an MWE and test whether this is an upstream bug.
The docs that are now at module level would fit better on the Series enum.
Added in aff668e
Needed for serialization to CSV, JSON and Parquet
This probably needs some runtime checks to be possible. Should iter() return Box<dyn Iterator ?
Series can be used by ufunc by using array
https://numpy.org/devdocs/user/basics.dispatch.html
Create an aligned array on the Rust heap and use the NumPy out keyword (https://numpy.org/doc/stable/reference/ufuncs.html) to write the ufunc output to Rust. This buffer can then be used to create an Arrow array.
Use UInt32Chunked for a left join directly (it has the concept of null); this saves a full iteration over the data.
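To illustrate the idea, the sketch below uses `Vec<Option<u32>>` as a stand-in for a nullable index array. For brevity it keeps only the first matching right row per left row, which is a simplification of a real left join. The names `left_join_indexes` and `take_opt` are hypothetical.

```rust
use std::collections::HashMap;

/// For every left key, the index of the first matching right key,
/// or None when the key does not occur on the right.
fn left_join_indexes(left: &[i32], right: &[i32]) -> Vec<Option<u32>> {
    let mut first_seen: HashMap<i32, u32> = HashMap::with_capacity(right.len());
    for (idx, key) in right.iter().enumerate() {
        // Keep the first index at which each right key appears.
        first_seen.entry(*key).or_insert(idx as u32);
    }
    left.iter().map(|key| first_seen.get(key).copied()).collect()
}

/// Gather right-side values with nullable indexes; a None index yields a None value.
fn take_opt(values: &[f64], indexes: &[Option<u32>]) -> Vec<Option<f64>> {
    indexes
        .iter()
        .map(|idx| idx.and_then(|i| values.get(i as usize).copied()))
        .collect()
}
```

Because the missing matches are already encoded as nulls in the index array, the gather step can produce the nullable output column in a single pass.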
Currently nullable strings are represented as an empty string. It would probably carry more information if we used Option<String>.
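A small sketch of what the extra information buys: with Option<String>, an empty string and a missing value stay distinguishable. The helper `null_count` is hypothetical and works on a plain slice rather than a polars type.

```rust
/// Count missing values in a string column where null is modelled as None.
/// An empty string counts as a present value.
fn null_count(column: &[Option<String>]) -> usize {
    column.iter().filter(|value| value.is_none()).count()
}
```

For a column holding "a", an empty string, and a null, this reports one null, whereas the empty-string representation could not tell the last two apart.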
It looks like polars is still in its early stages, so it might be too soon for this feature request, but it would be nice if polars could read and write DataFrames in the Apache Arrow Feather format.
I currently have some pandas code that reads a Feather file with 25000 columns and 1 million rows (of float32 ==> 93 GB) on which I need to apply a function on each column:
The problem with pandas is that pd.read_feather consolidates the data (so it makes a memory copy), as it stores all data in one big numpy array, while Feather should be zero copy. If polars supported zero-copy operations on Feather files, that would be great.
As far as I can see, argsort was recently implemented.
I guess the zero copy numpy view would be important too.
import numpy as np
import pandas as pd

df_scores__motifs_vs_regions_or_genes = pd.read_feather('motifs_vs_regions_or_genes.scores.feather')


def rank_CRM_scores_and_assign_random_ranking_in_range_for_ties_func(crm_scores_with_ties_for_motif_numpy):
    # Create random permutation so tied scores will have a different ranking each time.
    random_permutations_to_break_ties_numpy = np.random.permutation(crm_scores_with_ties_for_motif_numpy.shape[0])

    # Sort the shuffled scores in descending order, map back to the original
    # positions, then argsort again to turn the ordering into 0-based ranks.
    rank_column_with_broken_ties_numpy = random_permutations_to_break_ties_numpy[
        (-crm_scores_with_ties_for_motif_numpy)[random_permutations_to_break_ties_numpy].argsort()
    ].argsort().astype(np.int32)

    return rank_column_with_broken_ties_numpy


# Create feature table ranking.
df_ranking__motifs_vs_regions_or_genes = df_scores__motifs_vs_regions_or_genes.apply(
    rank_CRM_scores_and_assign_random_ranking_in_range_for_ties_func,
    axis='index',
    raw=True
)

df_ranking__motifs_vs_regions_or_genes.reset_index(inplace=True)
df_ranking__motifs_vs_regions_or_genes.to_feather(path='motifs_vs_regions_or_genes.rankings.feather')
Cast numerical types option. This can be used for checking equality between different types.
Outer join left and right hashing can be done in parallel.
Currently all serialization is done in one go, which will lead to memory problems.
Get the indexes of unique values in a ChunkedArray and Series. These can be used to filter a whole DataFrame on the unique values of one Series.
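One way to read this request is "return the index of the first occurrence of each distinct value". A minimal sketch under that assumption, on a plain slice and with the hypothetical name `arg_unique`:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Indexes of the first occurrence of every distinct value, in input order.
/// Hypothetical helper; a slice stands in for a ChunkedArray here.
fn arg_unique<T: Hash + Eq>(values: &[T]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(values.len());
    values
        .iter()
        .enumerate()
        // HashSet::insert returns true only the first time a value is seen.
        .filter(|(_, value)| seen.insert(*value))
        .map(|(idx, _)| idx)
        .collect()
}
```

The resulting indexes could then be passed to a take operation on every column to keep one row per unique value.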