Comments (5)
Currently, `rechunk` only makes sure that each individual Series is contiguous in memory.
It would make sense to have a parameter on `rechunk` that ensures the entire DataFrame is contiguous. Thanks for the suggestion!
I think you can hack around this by converting to numpy and then back. The resulting DataFrame will be contiguous in memory.
I don't think we should want that @stinodego. That requires unsafe allocations and will be super hard to enforce throughout the engine.
Besides, it would be much more costly than simply paying for the copy at the end.
Either way, a copy needs to be made. It doesn't matter whether we do it internally or when moving out to numpy. I will close this as it will have no benefit.
Right, I was thinking that if you do many `to_numpy` calls, it would be cheaper to first convert the DataFrame to Fortran layout once, which would save the copy in subsequent calls.
However, any operations you do on the DataFrame in between those calls will not guarantee that the Fortran layout is preserved. So it is better not to give any guarantees about it.
Thank you so much for the quick answer. I did assume that the layout is always Fortran internally (except maybe for some weird numpy arrays), but I can see how this could create complications in other places. Makes sense not to implement it then.
By the way, why does `to_numpy` need to make a copy if the layout is not Fortran? If you support other layouts throughout the engine, couldn't you just move it back to numpy in that format?
> Right, I was thinking that if you do many `to_numpy` calls, it would be cheaper to first convert the DataFrame to Fortran layout once, and it will save the copy in subsequent calls.
In such a case, I think people should cache their numpy array. I don't think our methods should be concerned with caching.
> If you support other layouts throughout the engine, couldn't you just move it back to numpy in that format?
If we could, we would. A numpy array is backed by a single contiguous allocation. Polars DataFrames are backed by multiple buffers.
Don't worry about that copy too much. Copies like this happen implicitly all the time.
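The point about separate buffers can be illustrated with plain NumPy. This is only an analogy: the two arrays below stand in for two independently allocated DataFrame columns and say nothing about how Polars stores its data internally.

```python
import numpy as np

# Two independently allocated buffers, standing in for two columns.
col_a = np.array([1.0, 2.0, 3.0])
col_b = np.array([4.0, 5.0, 6.0])

# A single 2D array needs one allocation, so both columns are copied into it.
combined = np.column_stack([col_a, col_b])

# The combined array does not share memory with either source buffer.
shares_a = np.shares_memory(combined, col_a)
shares_b = np.shares_memory(combined, col_b)

# Going the other way is cheap: a column taken from the 2D array is a view.
view = combined[:, 0]
view_shares = np.shares_memory(combined, view)
```

Here `shares_a` and `shares_b` are `False` while `view_shares` is `True`: slicing one allocation into columns is free, but gathering several allocations into one is not.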