
`.alias()` causes `ComputeError` when applied to expression in GroupBy context `agg()` about polars HOT 2 CLOSED

TNieuwdorp commented on September 24, 2024

`.alias()` causes `ComputeError` when applied to expression in GroupBy context `agg()`


Comments (2)

cmdlineluser commented on September 24, 2024

Can reproduce.

1000 seems to be the minimum number of rows needed for it to error on my system.

import polars as pl

(pl.read_csv("sales_data.csv").head(1000)
    .select("Country", "Profit")
    .group_by("Country")
    .agg(
        (pl.col("Profit") > 1000).alias("Profit > 1000")
    )
)
# ComputeError: returned aggregation is of different length: 125 than the groups length: 6

.head(999) works as expected.


mcrumiller commented on September 24, 2024

There is a bug here, but I think there is also confusion about what you're trying to do.

Calling .agg(pl.col("Profit")) returns a list of all profits--one for each country value:

>>> df.group_by("Country").agg(col("Profit"))
shape: (6, 2)
┌────────────────┬─────────────────────┐
│ Country        ┆ Profit              │
│ ---            ┆ ---                 │
│ str            ┆ list[i64]           │
╞════════════════╪═════════════════════╡
│ Canada         ┆ [590, 590, … 630]   │
│ Germany        ┆ [160, 53, … 746]    │
│ Australia      ┆ [1366, 1188, … 655] │
│ United Kingdom ┆ [1053, 1053, … 112] │
│ France         ┆ [427, 427, … 655]   │
│ United States  ┆ [524, 407, … 542]   │
└────────────────┴─────────────────────┘

Calling col("Profit") > 1000 then returns lists of True/False values:

>>> df.group_by("Country").agg(col("Profit") > 1000)
shape: (6, 2)
┌────────────────┬─────────────────────────┐
│ Country        ┆ Profit                  │
│ ---            ┆ ---                     │
│ str            ┆ list[bool]              │
╞════════════════╪═════════════════════════╡
│ Germany        ┆ [false, false, … false] │
│ United States  ┆ [false, false, … false] │
│ United Kingdom ┆ [true, true, … false]   │
│ Canada         ┆ [false, false, … false] │
│ France         ┆ [false, false, … false] │
│ Australia      ┆ [true, true, … false]   │
└────────────────┴─────────────────────────┘

This is probably not what you intended (my guess is you meant col("Profit").sum() > 1000). But regardless, here is where the alias fails:

df.group_by("Country").agg((col("Profit") > 1000).alias("test"))

polars.exceptions.ComputeError: returned aggregation is of different length: 193 than the groups length: 2

This appears to be failing because the lists are of different lengths. The 193 changes if you rerun the command over and over.

This error doesn't occur if you simply aggregate into a list, but once you do another operation, the alias appears to fail.
