Comments (9)
@ritchie46 See attached - the dataframe from df = pl.read_database_uri(...) contains chunks of size 155, which matches the error message. I wrote out 2 columns using write_parquet("tasks.parquet", row_group_size=155),
which I assume preserves the chunks.
To reproduce
pl.read_parquet("tasks.parquet").filter(~pl.col("task_id").is_in([1]))
ShapeError: filter's length: 155 differs from that of the series: 0
(I think the JSON encoding was a red herring, but I included that column anyway)
from polars.
Thanks for the report. Would love to get a repro on this.
Sorry, saving the dataframe (to parquet or json) and then reloading it "fixes" the issue, otherwise I'd be happy to privately share the data. I can reproduce at will on both my Ubuntu 22.04 workstation and my M3 MacBook Pro.
If I have time I'll try spinning up a postgres docker image and see if I can create a simpler reproduction.
I also couldn't work out which specific version introduced the issue, but it's not present in 0.20.26.
So I just discovered that
df = pl.read_database_uri(...)
df.with_columns(pl.col("status").str.json_decode()).filter(pl.col("id") == 1)
fails with the error above, while
df = pl.read_database_uri(...)
df.rechunk().with_columns(pl.col("status").str.json_decode()).filter(pl.col("id") == 1)
works - I'm assuming that's why saving/loading also "fixes" it.
Although I found I had to add rechunk at multiple points in the query to make every query work (in general before any filter operation, so when I filtered twice I had to rechunk twice?)
What is the schema? I think if you create a chunked dataframe with the same schema you should be able to create a repro.
I think I have an issue that may share similarities with this one: after concatenating dataframes, some operations raise a ShapeError. My issue is also "fixed" by writing the dataframe to parquet and loading it back, or by using rechunk.
I'm not sure if it is the same underlying issue, but it provides an easy repro: #16516
I can reproduce the error with the attached file.
I'm not sure if it is the same issue, but I also get a panic when trying to write it back to disk:
>>> pl.read_parquet("Downloads/tasks.parquet").write_parquet("1.parquet")
thread 'polars-0' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n However, the values at index 7 have a length of 161, which is different from values at index 0, 0."))
Note that join on chunked dataframes also raises errors, i.e.
thread 'polars-9' panicked at crates/polars-ops/src/chunked_array/gather/chunked.rs:84:5:
assertion `left == right` failed: implementation error
left: 1
right: 12
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Edit: I'm seeing this on v0.30.26 as well
Got a complete query, @david-waterworth?