Comments (5)
For the moment, I use this temporary fix before the group by:
first_timestamp = df["t"][0]
floored = floor_timestamp(timestamp=first_timestamp, offset=5)
prefix = pl.DataFrame(
    data={
        "t": pl.Series([floored]),
        "v": [0.0],
    }
)
df = prefix.vstack(df)
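floor_timestamp is not shown in the thread; a minimal stdlib sketch, assuming it truncates to midnight, adds the hour offset, and never returns a value after its input:

```python
from datetime import datetime, timedelta

def floor_timestamp(timestamp: datetime, offset: int) -> datetime:
    # Truncate to midnight, then shift by the daily offset in hours.
    floored = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
    floored += timedelta(hours=offset)
    # If the input falls before the offset boundary, step back one day
    # so the returned timestamp is not after the input.
    if floored > timestamp:
        floored -= timedelta(days=1)
    return floored
```

This keeps the prefixed dummy row at (or before) the first real timestamp, so the first dynamic window starts early enough to cover it.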
from polars.
Alternatively, you could do:
resampled = df.group_by_dynamic(
    index_column="t", every="1d", offset=timedelta(hours=5-24)
).agg(
    [
        pl.sum("v").alias("v"),
    ]
)

print(resampled)
shape: (2, 2)
┌─────────────────────────┬──────┐
│ t ┆ v │
│ --- ┆ --- │
│ datetime[ms, UTC] ┆ i64 │
╞═════════════════════════╪══════╡
│ 2024-03-21 05:00:00 UTC ┆ 11 │
│ 2024-03-22 05:00:00 UTC ┆ 1100 │
└─────────────────────────┴──────┘
I'm wondering if Polars should do this for you. Currently the rule is
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
but maybe it would be more user-friendly to do
‘window’: Start by taking the earliest timestamp, truncating it with every, subtracting ‘every’, and then adding ‘offset’. Note that weekly windows start on Monday.
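To make the two rules concrete, here is the window-start arithmetic for a hypothetical earliest timestamp of 2024-03-21 03:00 with a daily every and a 5-hour offset (stdlib only, with the '1d' truncation written out by hand):

```python
from datetime import datetime, timedelta

earliest = datetime(2024, 3, 21, 3, 0)
every = timedelta(days=1)
offset = timedelta(hours=5)

# '1d' truncation: snap the earliest timestamp to midnight.
truncated = earliest.replace(hour=0, minute=0)

# Current rule: truncate, then add offset. The resulting start is *after*
# the earliest timestamp, so that row falls before the first window.
current_start = truncated + offset

# Proposed rule: truncate, subtract 'every', then add offset,
# so the first window still covers the earliest timestamp.
proposed_start = truncated - every + offset
```

Under the current rule the first window starts at 2024-03-21 05:00 and the 03:00 row is lost; under the proposed rule it starts at 2024-03-20 05:00 and covers it.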
Thanks for the solution! It solves my problem when every is also daily. But if every is monthly, this offsets the label of the windows.
I believe that there is an assumption, just from the name of that method, that every row is taken into account.
Nevertheless, if you consider the behavior to be as expected, I think the documentation should be extra clear about that. For the moment, it only warns that "Different from a normal group by is that a row can be member of multiple groups." It should also say that rows may not belong to any window and will be dropped.
You are right that the first window is computed as documented. But the major consequence, that some rows may not belong to any window, is implicit and cannot easily be deduced. A warning could also be added.
But if every is monthly, this offsets the label of the windows.
Could you show an example, please?
Of course:
from datetime import UTC, datetime, timedelta
import polars as pl
df = pl.DataFrame(
    data={
        "t": pl.Series(
            [
                datetime(2024, 2, 28, 3, 0, tzinfo=UTC),
                datetime(2024, 2, 29, 3, 0, tzinfo=UTC),
                datetime(2024, 3, 1, 3, 0, tzinfo=UTC),
                datetime(2024, 3, 2, 3, 0, tzinfo=UTC),
            ]
        ).dt.cast_time_unit("ms"),
        "v": [1, 10, 100, 1000],
    }
).set_sorted("t")
resampled = df.group_by_dynamic(
    index_column="t", every="1mo", offset=timedelta(hours=5-24)
).agg(
    [
        pl.sum("v").alias("v"),
    ]
)
print(resampled)
shows:
shape: (2, 2)
┌─────────────────────────┬──────┐
│ t ┆ v │
│ --- ┆ --- │
│ datetime[ms, UTC] ┆ i64 │
╞═════════════════════════╪══════╡
│ 2024-01-31 05:00:00 UTC ┆ 11 │
│ 2024-02-29 05:00:00 UTC ┆ 1100 │
└─────────────────────────┴──────┘
while I would expect:
shape: (2, 2)
┌─────────────────────────┬──────┐
│ t ┆ v │
│ --- ┆ --- │
│ datetime[ms, UTC] ┆ i64 │
╞═════════════════════════╪══════╡
│ 2024-02-01 05:00:00 UTC ┆ 111 │
│ 2024-03-01 05:00:00 UTC ┆ 1000 │
└─────────────────────────┴──────┘
I get the latter result when offset=timedelta(hours=5).
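The label shift follows from the documented rule: with a monthly every, the earliest timestamp (2024-02-28 03:00) truncates to 2024-02-01, and the negative offset pushes the window start, and thus its label, back into January. A stdlib sketch of the arithmetic:

```python
from datetime import datetime, timedelta

earliest = datetime(2024, 2, 28, 3, 0)

# '1mo' truncation snaps to the first of the month at midnight.
truncated = earliest.replace(day=1, hour=0, minute=0)

# offset=timedelta(hours=5-24) is -19h: the label lands in January.
label_with_negative_offset = truncated + timedelta(hours=5 - 24)

# offset=timedelta(hours=5) keeps the label on the first of February.
label_with_positive_offset = truncated + timedelta(hours=5)
```

The -19h offset is harmless for daily windows, where it only shifts the start by one day, but for monthly windows it relabels every window to the last day of the previous month.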