
Comments (21)

ritchie46 commented on August 22, 2024

Can it be that the order of the filtering criteria determines whether partition filtering will be used or not?

I think this is the case. We should keep hive partitioned predicates separate.

@nameexhaustion another data point.
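
To make the question concrete, a minimal sketch of the two orderings being compared (the path and hive_partitioning layout are illustrative; CALENDAR_DATE and SEC_ID are the column names used later in this thread):

import polars as pl
from datetime import datetime

lf = pl.scan_parquet("/mypath/my_transaction/**/*.parquet", hive_partitioning=True)

# hive-partition predicate first...
a = lf.filter(
    (pl.col("CALENDAR_DATE") >= datetime(2023, 12, 1)) & (pl.col("SEC_ID") > 3)
)

# ...versus the same predicates with the hive-partition predicate last;
# the question is whether only one of these prunes partitions
b = lf.filter(
    (pl.col("SEC_ID") > 3) & (pl.col("CALENDAR_DATE") >= datetime(2023, 12, 1))
)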


ritchie46 commented on August 22, 2024

Hmm.. We did some checks and we do apply predicate pushdown based on hive partitions. Can you create a that shows the difference? That helps us to fix it.


lmocsi commented on August 22, 2024

Create a what? A test dataset?
There is one similar dataset that can be created: apache/arrow#39768
But I do not know if you can reproduce the problem on that.
How could I test it on the real data?


deanm0000 commented on August 22, 2024

@ritchie46 See #13908 and subsequently #14244

@lmocsi it would be more helpful if you brought up the old issues from the start. Did it work in 0.20.7, which is the first version that the above PR was in?


lmocsi commented on August 22, 2024

The performance difference is there with older versions (e.g. 0.20.3) as well.
I did not go further back in time.


lmocsi commented on August 22, 2024

Could it be that the datetime partition name formatting is confusing scan_parquet?
I am thinking of the 00%3A00%3A00 part.
E.g.:
/my_path/my_table/CALENDAR_DATE=2019-01-01 00%3A00%3A00/part-00000-35c18ead-whatever.c000.snappy.parquet
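
For reference, %3A is just a URL-encoded colon, so the partition value is a plain datetime string; a quick check with the standard library, using the directory name quoted above:

from urllib.parse import unquote

# decode the hive directory name; %3A is a URL-encoded ":"
print(unquote("CALENDAR_DATE=2019-01-01 00%3A00%3A00"))
# CALENDAR_DATE=2019-01-01 00:00:00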


ritchie46 commented on August 22, 2024

Create a what? A test dataset?

A full reproducible example that shows the problem. So test data and the query that goes with it.


lmocsi commented on August 22, 2024

Create a what? A test dataset?

A full reproducible example that shows the problem. So test data and the query that goes with it.

I tried, but it is not clear which part of the dataset the error comes from.
Unfortunately I cannot disclose the dataset.
Is there a way I can test what you'd be testing on a full reproducible example?


ritchie46 commented on August 22, 2024

Try to create some fake data with the same hive partition schema. It's probably only the hive partitions and the query that matter.


lmocsi commented on August 22, 2024

See the sample data creation and reading below:

#create data
import polars as pl
from faker import Faker
import random as rnd
from datetime import datetime,date
import pyarrow.dataset as ds

print(f"polars version: {pl.__version__}")

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

fake = Faker()

print(ido(),'Started')

path = '/mypath/'
dflen = 10000000
df = pl.DataFrame({'ID': pl.Series(fake.unique.random_int(min=13127924000, max=14127924000) for i in range(dflen)),
                   'BA_ID': pl.Series(fake.unique.random_int(min=2, max=2585456410) for i in range(dflen)),
                   'PART_ID': pl.Series(fake.unique.random_int(min=2163520, max=16320804) for i in range(dflen)),
                   'CU_ID': pl.Series(rnd.choice([1096, 3342, 3374, 4272, 3098, 3099]) for i in range(dflen)),
                   'DEA_ID': pl.Series(fake.unique.random_int(min=996000, max=53237133) for i in range(dflen)),
                   'AM_CY': pl.Series(fake.pyfloat(min_value=10000.0, max_value=990000.0, right_digits=1) for i in range(dflen)),
                   'CR_FL': pl.Series(rnd.choice(['Y', 'N']) for i in range(dflen)),
                   'PA_COM': pl.Series(rnd.choice(["######", None]) for i in range(dflen)),
                   'CO_TE': pl.Series(rnd.choice(["######", "A:Techn. part######", " -5755.00 MAD -5755.0"]) for i in range(dflen)),
                   'OT_AC': pl.Series(rnd.choice(["121223234545565678788989", "111122224444555577778888",  None]) for i in range(dflen)),
                   'OP_AC_NA': pl.Series(rnd.choice(["Donald Arthur Biden###########", "Joe William Trump#############",  None]) for i in range(dflen)),
                   'DWS_ID': pl.Series(rnd.choice([198, 1395, 5121, 2473]) for i in range(dflen)),
                   'ADT_ID': pl.Series(rnd.choice([570, 1309, 1680, 1798, 1916, 13856, 355136]) for i in range(dflen)),
                   'ADC_ID': pl.Series(rnd.choice([1019, 1134, 1455]) for i in range(dflen)),
                   'ADK_ID': pl.Series(rnd.choice([2058, 2185, 160279, 240274]) for i in range(dflen)),
                   'ABDO_ID': pl.Series(rnd.choice([2, 31248967]) for i in range(dflen)),
                   'ADS_ID': pl.Series(rnd.choice([1271, 1265, 1399, 1342, 1652, 1266]) for i in range(dflen)),
                   'INT_FL': pl.Series(rnd.choice(['Y', None]) for i in range(dflen)),
                   'MN_DIR': pl.Series(rnd.choice(['INT', 'DOM']) for i in range(dflen)),
                   'ADC_ID': pl.Series(rnd.choice([2, 2688, 2689, 24605]) for i in range(dflen)),
                   'ADO_ID': pl.Series(rnd.choice([2, 3126]) for i in range(dflen)),
                   'REF': pl.Series(rnd.choice(['12345679801','AD789789_12345645','DAS7894561230315','12345678','81051314_239_02_01_00_4566']) for i in range(dflen)),
                   'SEC_ID': pl.Series(rnd.choice([2, 93708]) for i in range(dflen)),
                   'ADL_ID': pl.Series(rnd.choice([2, 1125, 1134, 1364, 20834]) for i in range(dflen)),
                   'CH_ID': pl.Series(rnd.choice([50141, 50016, 49904, 49838, None]) for i in range(dflen)),
                   #'CALENDAR_DATE': pl.Series(fake.date_between_dates(date(2023,1,1),date(2024,4,30)) for i in range(dflen)), # calendar_date:date -> scan_parquet reads it very fast
                   'CALENDAR_DATE': pl.Series(fake.date_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)),
                  }).with_columns(AM=pl.col('AM_CY'))

print(ido(),dflen,'records created')

ds.write_dataset(
        df.to_arrow(),
        path+'my_transaction',
        format="parquet",
        partitioning=["CALENDAR_DATE"],
        partitioning_flavor="hive",
        existing_data_behavior="delete_matching",
    )

print(ido(),'finished')
# 2024.06.19. 17:26:12 Started
# 2024.06.19. 17:41:46 10000000 records created
# 2024.06.19. 17:43:02 finished

Then reading the data shows the difference: scan_pyarrow_dataset runs in 2 seconds, while scan_parquet runs in 21 seconds:

import polars as pl
from datetime import datetime
import pyarrow.dataset as ds

print(f"polars version: {pl.__version__}")

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

parq_path = '/mypath/'
ext = '/**/*.parquet'

tr = pl.scan_pyarrow_dataset(ds.dataset(parq_path+"my_transaction", partitioning='hive'))
df = (tr.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                   (pl.col('CR_FL') == 'I') &
                   (pl.col('SEC_ID') > 3) &
                   (pl.col('ADL_ID') == 2905) &
                   (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                   (pl.col('ADT_ID') != 7010)
                   )
           .select('PART_ID').unique()
           .rename({'PART_ID':'PARTY_ID'})
           .with_columns(pl.lit(1).alias('LU_NEXT_FL'))
         )
print(ido(),'scan_pyarrow_dataset started') # 2 sec
df.collect()
print(ido(),'scan_pyarrow_dataset finished')

# explain plan:
# WITH_COLUMNS:
#  [dyn int: 1.alias("LU_NEXT_FL")] 
#   RENAME
#     UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#       simple π 1/6 ["PART_ID"]
#           PYTHON SCAN 
#           PROJECT 6/26 COLUMNS
#           SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADT_ID")) != (7010)])]) & ([(col("CR_FL")) == (String(I))])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]


tr2 = pl.scan_parquet(parq_path+"my_transaction"+ext)
df2 = (tr2.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                   (pl.col('CR_FL') == 'I') &
                   (pl.col('SEC_ID') > 3) &
                   (pl.col('ADL_ID') == 2905) &
                   (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                   (pl.col('ADT_ID') != 7010)
                   )
           .select('PART_ID').unique()
           .rename({'PART_ID':'PARTY_ID'})
           .with_columns(pl.lit(1).alias('LU_NEXT_FL'))
         )
print(ido(),'scan_parquet started') # 21 sec
df2.collect()
print(ido(),'scan_parquet finished')

# explain plan:
 # WITH_COLUMNS:
 # [dyn int: 1.alias("LU_NEXT_FL")] 
 #  RENAME
 #    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
 #      simple π 1/6 ["PART_ID"]
 #          Parquet SCAN 485 files: first file: /mypath/my_transaction/CALENDAR_DATE=2023-01-01/part-0.parquet
 #          PROJECT 6/26 COLUMNS
 #          SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & ([(col("ADT_ID")) != (7010)])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]) & ([(col("SEC_ID")) > (3)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("CR_FL")) == (String(I))])]


lmocsi commented on August 22, 2024

One small addition: in the data creation part the CALENDAR_DATE field should be calculated like this:

   'CALENDAR_DATE': pl.Series(fake.date_time_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)).dt.truncate('1d'),


lmocsi commented on August 22, 2024

@ritchie46 Any chance of this issue being solved soon?
I guess this bug hinders usage of polars in corporate environments, since without partition filtering all queries run almost for an eternity...


ritchie46 commented on August 22, 2024

I would recommend using polars scan_parquet directly. It now has proper hive partitioning support and can deal with all Polars predicates.
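
For what it's worth, a minimal sketch of that recommendation against the hive-partitioned layout from the repro above (the path and date bounds are illustrative; later comments converge on filtering with datetime objects rather than strings):

import polars as pl
from datetime import datetime

# hive_partitioning=True parses the CALENDAR_DATE=... directory names into a column,
# so a predicate on that column can be used to skip whole partitions
lf = pl.scan_parquet("/mypath/my_transaction/**/*.parquet", hive_partitioning=True)

result = (
    lf.filter(pl.col("CALENDAR_DATE").is_between(datetime(2023, 12, 1), datetime(2023, 12, 31)))
    .select("PART_ID")
    .unique()
    .collect()
)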


lmocsi commented on August 22, 2024

I would recommend using polars scan_parquet directly. It now has proper hive partitioning support and can deal with all Polars predicates.

Unfortunately it still lags behind scan_pyarrow_dataset in performance on polars==1.3.0
Here are some run times on real data (which I cannot disclose, sorry):

polars==1.3.0, scan_pyarrow_ds: 8 sec
polars==1.3.0, scan_pq:         1 min 31 sec
polars==1.4.1, scan_pyarrow_ds: <runs out of 32 GB memory>
polars==1.4.1, scan_pq:         1 min 32 sec

So it seems that something has changed for the worse from polars 1.3.0 to 1.4.1 regarding scan_pyarrow_dataset: partition filtering is no longer pushed down in 1.4.1. :(

Regarding scan_pq vs scan_pyarrow_ds, the only difference I see is that scan_pyarrow_ds uses PYTHON SCAN [], whereas scan_pq uses Parquet SCAN [/mypath/mytable/part-00000-whatever.snappy.parquet] in the explain plan.
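
(For reference, such plans can be printed with LazyFrame.explain(); using the two lazy queries from the repro earlier in the thread:)

print(df.explain())   # scan_pyarrow_dataset query -> PYTHON SCAN []
print(df2.explain())  # scan_parquet query -> Parquet SCAN [...]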


ritchie46 commented on August 22, 2024

If you can make a reproducible query, we can fix the optimization.


deanm0000 commented on August 22, 2024

Here's a small example

Make dataset:

import polars as pl
import numpy as np
from pathlib import Path
from urllib.parse import quote

df = (
    pl.select(
        cal_date=pl.datetime_range(
            pl.datetime(2024, 1, 1), pl.datetime(2024, 2, 5), "1d"
        )
    )
    .with_columns(pl.Series("data", np.random.normal(0, 1, 360)).reshape((36, 10)))
    .explode("data")
)

for uniq_date in df["cal_date"].unique():
    folder_path = Path(f"./example/CALENDAR_DATE={quote(uniq_date.isoformat())}")
    folder_path.mkdir(parents=True, exist_ok=True)
    df.filter(pl.col("cal_date") == uniq_date).drop("cal_date").write_parquet(
        folder_path / "0000.parquet"
    )

Do scan

pl.Config.set_verbose(True)

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        pl.col("CALENDAR_DATE").is_between(pl.lit("2024-01-02"), pl.lit("2024-01-04"))
    )
    .collect()
)

and then just a wall of "parquet file must be read, statistics not sufficient for predicate."

Doing it with pl.datetime instead of string

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        pl.col("CALENDAR_DATE").is_between(pl.datetime(2024,1,2), pl.datetime(2024,1,4))
    )
    .collect()
)

same result.

If I do ge and le then I don't get any verbose printing, so I assume it's also reading all the files, but it's hard to tell with a small local example (a quick sanity check is sketched after the snippet below):

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        (pl.col("CALENDAR_DATE")>=pl.datetime(2024,1,2)) & (pl.col("CALENDAR_DATE")<=pl.datetime(2024,1,4))
    )
    .collect()
)
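
Since it is hard to tell from the verbose log alone with such a small dataset, a rough sanity check (standard library only, against the ./example layout created above) is to count how many partition directories actually overlap the date bounds and compare that with how many "must be read" lines get printed:

import glob
from urllib.parse import unquote

dirs = sorted(glob.glob("./example/CALENDAR_DATE=*"))
# decode the URL-encoded partition value and compare its date part against the filter bounds
keep = [d for d in dirs if "2024-01-02" <= unquote(d.rsplit("=", 1)[1])[:10] <= "2024-01-04"]
print(f"{len(dirs)} partitions on disk, {len(keep)} overlap the filter")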


lmocsi commented on August 22, 2024

If you can make a reproducible query, we can fix the optimization.

My dataset creation and querying are up here in the comments, but I can give you the link: #17045 (comment)

Running it on 1.3.0 polars:

  • scan_pyarrow_dataset: < 1 sec
  • scan_parquet: 8 sec

On 1.4.1:

  • scan_pyarrow_dataset: 37 sec
  • scan_parquet: 7 sec


nameexhaustion commented on August 22, 2024

is_between casts to supertypes by default. This means that the following filter:

  • pl.col("date1").is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))

Is effectively translated to:

  • pl.col("date1").cast(pl.String).is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))

and casting is currently not supported in the optimizer.

@lmocsi I'd advise for you to use filter against datetime objects instead, i.e. pl.col("date1").is_between(datetime(2023, 12, 1), datetime(2023, 12, 1)).

The following is a minimal reproducible example showing the different filters and their outputs:

from pathlib import Path
from datetime import datetime
import polars as pl

root = Path(".env/data")
partition = root / "date1=1970-01-01%2000%3A00%3A00.000000"
partition.mkdir(exist_ok=True, parents=True)

pl.DataFrame({"x": 1}).write_parquet(partition / "1.parquet")

lf = pl.scan_parquet(root, hive_partitioning=True).filter(
    pl.col("date1").is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))
)
### Output
# FILTER col("date1").cast(String).is_between([String(2023-12-01), String(2023-12-01)]) FROM
#   Parquet SCAN [.env/data/date1=1970-01-01%2000%3A00%3A00.000000/1.parquet]
#   PROJECT */2 COLUMNS
# parquet file must be read, statistics not sufficient for predicate.
# parquet file must be read, statistics not sufficient for predicate.
print(lf.explain(optimized=False))
lf.collect()

lf = pl.scan_parquet(root, hive_partitioning=True).filter(
    pl.col("date1").is_between(datetime(2023, 12, 1), datetime(2023, 12, 1))
)
print(lf.explain(optimized=False))
print(lf.collect())
### Output
# FILTER col("date1").is_between([2023-12-01 00:00:00, 2023-12-01 00:00:00]) FROM
#   Parquet SCAN [.env/data/date1=1970-01-01%2000%3A00%3A00.000000/1.parquet]
#   PROJECT */2 COLUMNS
# parquet file can be skipped, the statistics were sufficient to apply the predicate.

There are a few things we could consider doing about this:

  • Change is_between to not cast to supertypes by default; this could be a breaking change.
  • Support type casting in the stats_evaluator framework so that predicates with type-casting can run.


ritchie46 commented on August 22, 2024

I will close this as @nameexhaustion states it should work on datetimes directly.

Working on strings was also semantically wrong, as you probably don't want to check whether the column's string repr is between the given strings. You want to work on temporal types.


lmocsi commented on August 22, 2024

On the above dataset, I tried filtering on the partition column as a string and as a datetime. Here are the results.
They show that scan_pyarrow_dataset in polars 1.3.0 was the fastest, whether filtering as a string or as a datetime.

polars version: 1.3.0:

method                filter    time
scan_pyarrow_dataset  str       < 1 sec
scan_pyarrow_dataset  datetime  < 1 sec
scan_parquet          str       3 sec
scan_parquet          datetime  3 sec

polars version: 1.4.1:

method                filter    time
scan_pyarrow_dataset  str       15 sec
scan_pyarrow_dataset  datetime  15 sec
scan_parquet          str       3 sec
scan_parquet          datetime  3 sec

scan_pyarrow_dataset str explain plans:

# 1.3.0: # < 1 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 15 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        FILTER [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])] FROM
#          PYTHON SCAN []
#          PROJECT 6/26 COLUMNS

scan_pyarrow_dataset datetime explain plans:

# 1.3.0: # < 1 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(1701388800000000), String(1703980800000000)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 15 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        FILTER [([([([([(col("CALENDAR_DATE").is_between([String(1701388800000000), String(1703980800000000)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])] FROM
#          PYTHON SCAN []
#          PROJECT 6/26 COLUMNS

scan_parquet str explain plans:

# 1.3.0: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").cast(String).is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").cast(String).is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

scan_parquet datetime explain plans:

# 1.3.0: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([2023-12-01 00:00:00, 2023-12-31 00:00:00])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([2023-12-01 00:00:00, 2023-12-31 00:00:00])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

So it seems that this filter, pushed down into the scan, is the fastest, either with strings like String(2023-12-01) or with dates converted to Unix epochs like String(1701388800000000):

#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]


lmocsi commented on August 22, 2024

@nameexhaustion Can you have a look at the above execution plans?

