
Comments (21)

ritchie46 commented on August 22, 2024

Can it be that the order of the filtering criteria determines whether partition filtering will be used or not?

I think this is the case. We should keep hive partitioned predicates separate.

@nameexhaustion another data point.
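
To make the question concrete, a minimal sketch of the two orderings being compared (the path and hive_partitioning layout are illustrative; CALENDAR_DATE and SEC_ID are the column names used later in this thread):

import polars as pl
from datetime import datetime

lf = pl.scan_parquet("/mypath/my_transaction/**/*.parquet", hive_partitioning=True)

# hive-partition predicate first...
a = lf.filter(
    (pl.col("CALENDAR_DATE") >= datetime(2023, 12, 1)) & (pl.col("SEC_ID") > 3)
)

# ...versus the same predicates with the hive-partition predicate last;
# the question is whether only one of these prunes partitions
b = lf.filter(
    (pl.col("SEC_ID") > 3) & (pl.col("CALENDAR_DATE") >= datetime(2023, 12, 1))
)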


ritchie46 commented on August 22, 2024

Hmm.. We did some checks and we do apply predicate pushdown based on hive partitions. Can you create a that shows the difference? That helps us to fix it.


lmocsi commented on August 22, 2024

Create a what? A test dataset?
There is one similar dataset that can be created: apache/arrow#39768
But I do not know if you can reproduce the problem on that.
How could I test it on the real data?


deanm0000 commented on August 22, 2024

@ritchie46 See #13908 and subsequently #14244

@lmocsi it would be more helpful if you brought up the old issues from the start. Did it work in 0.20.7, which is the first version that the above PR was in?


lmocsi commented on August 22, 2024

The performance difference is there with older versions (e.g. 0.20.3) as well.
I did not go further back in time.


lmocsi commented on August 22, 2024

Could it be that the datetime partition name formatting is confusing scan_parquet?
I am thinking of the 00%3A00%3A00 part.
E.g.:
/my_path/my_table/CALENDAR_DATE=2019-01-01 00%3A00%3A00/part-00000-35c18ead-whatever.c000.snappy.parquet
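
For reference, %3A is just a URL-encoded colon, so the partition value is a plain datetime string; a quick check with the standard library, using the directory name quoted above:

from urllib.parse import unquote

# decode the hive directory name; %3A is a URL-encoded ":"
print(unquote("CALENDAR_DATE=2019-01-01 00%3A00%3A00"))
# CALENDAR_DATE=2019-01-01 00:00:00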


ritchie46 commented on August 22, 2024

Create a what? A test dataset?

A full reproducible example that shows the problem. So test data and the query that goes with it.


lmocsi commented on August 22, 2024

Create a what? A test dataset?

A full reproducible example that shows the problem. So test data and the query that goes with it.

I tried, but it is not clear which part of the dataset the error comes from.
Unfortunately I cannot disclose the dataset.
Is there a way I can test what you'd be testing on a full reproducible example?


ritchie46 commented on August 22, 2024

Try to create some fake data with the same hive partition schema. It's probably only the hive partitions and the query that matter.


lmocsi commented on August 22, 2024

See the sample data creation and reading below:

#create data
import polars as pl
from faker import Faker
import random as rnd
from datetime import datetime,date
import pyarrow.dataset as ds

print(f"polars version: {pl.__version__}")

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

fake = Faker()

print(ido(),'Started')

path = '/mypath/'
dflen = 10000000
df = pl.DataFrame({'ID': pl.Series(fake.unique.random_int(min=13127924000, max=14127924000) for i in range(dflen)),
                   'BA_ID': pl.Series(fake.unique.random_int(min=2, max=2585456410) for i in range(dflen)),
                   'PART_ID': pl.Series(fake.unique.random_int(min=2163520, max=16320804) for i in range(dflen)),
                   'CU_ID': pl.Series(rnd.choice([1096, 3342, 3374, 4272, 3098, 3099]) for i in range(dflen)),
                   'DEA_ID': pl.Series(fake.unique.random_int(min=996000, max=53237133) for i in range(dflen)),
                   'AM_CY': pl.Series(fake.pyfloat(min_value=10000.0, max_value=990000.0, right_digits=1) for i in range(dflen)),
                   'CR_FL': pl.Series(rnd.choice(['Y', 'N']) for i in range(dflen)),
                   'PA_COM': pl.Series(rnd.choice(["######", None]) for i in range(dflen)),
                   'CO_TE': pl.Series(rnd.choice(["######", "A:Techn. part######", " -5755.00 MAD -5755.0"]) for i in range(dflen)),
                   'OT_AC': pl.Series(rnd.choice(["121223234545565678788989", "111122224444555577778888",  None]) for i in range(dflen)),
                   'OP_AC_NA': pl.Series(rnd.choice(["Donald Arthur Biden###########", "Joe William Trump#############",  None]) for i in range(dflen)),
                   'DWS_ID': pl.Series(rnd.choice([198, 1395, 5121, 2473]) for i in range(dflen)),
                   'ADT_ID': pl.Series(rnd.choice([570, 1309, 1680, 1798, 1916, 13856, 355136]) for i in range(dflen)),
                   'ADC_ID': pl.Series(rnd.choice([1019, 1134, 1455]) for i in range(dflen)),
                   'ADK_ID': pl.Series(rnd.choice([2058, 2185, 160279, 240274]) for i in range(dflen)),
                   'ABDO_ID': pl.Series(rnd.choice([2, 31248967]) for i in range(dflen)),
                   'ADS_ID': pl.Series(rnd.choice([1271, 1265, 1399, 1342, 1652, 1266]) for i in range(dflen)),
                   'INT_FL': pl.Series(rnd.choice(['Y', None]) for i in range(dflen)),
                   'MN_DIR': pl.Series(rnd.choice(['INT', 'DOM']) for i in range(dflen)),
                   'ADC_ID': pl.Series(rnd.choice([2, 2688, 2689, 24605]) for i in range(dflen)),
                   'ADO_ID': pl.Series(rnd.choice([2, 3126]) for i in range(dflen)),
                   'REF': pl.Series(rnd.choice(['12345679801','AD789789_12345645','DAS7894561230315','12345678','81051314_239_02_01_00_4566']) for i in range(dflen)),
                   'SEC_ID': pl.Series(rnd.choice([2, 93708]) for i in range(dflen)),
                   'ADL_ID': pl.Series(rnd.choice([2, 1125, 1134, 1364, 20834]) for i in range(dflen)),
                   'CH_ID': pl.Series(rnd.choice([50141, 50016, 49904, 49838, None]) for i in range(dflen)),
                   #'CALENDAR_DATE': pl.Series(fake.date_between_dates(date(2023,1,1),date(2024,4,30)) for i in range(dflen)), # calendar_date:date -> scan_parquet reads it very fast
                   'CALENDAR_DATE': pl.Series(fake.date_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)),
                  }).with_columns(AM=pl.col('AM_CY'))

print(ido(),dflen,'records created')

ds.write_dataset(
        df.to_arrow(),
        path+'my_transaction',
        format="parquet",
        partitioning=["CALENDAR_DATE"],
        partitioning_flavor="hive",
        existing_data_behavior="delete_matching",
    )

print(ido(),'finished')
# 2024.06.19. 17:26:12 Started
# 2024.06.19. 17:41:46 10000000 records created
# 2024.06.19. 17:43:02 finished

Then reading the data shows the difference: scan_pyarrow_dataset runs in 2 seconds, while scan_parquet runs in 21 seconds:

import polars as pl
from datetime import datetime
import pyarrow.dataset as ds

print(f"polars version: {pl.__version__}")

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

parq_path = '/mypath/'
ext = '/**/*.parquet'

tr = pl.scan_pyarrow_dataset(ds.dataset(parq_path+"my_transaction", partitioning='hive'))
df = (tr.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                   (pl.col('CR_FL') == 'I') &
                   (pl.col('SEC_ID') > 3) &
                   (pl.col('ADL_ID') == 2905) &
                   (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                   (pl.col('ADT_ID') != 7010)
                   )
           .select('PART_ID').unique()
           .rename({'PART_ID':'PARTY_ID'})
           .with_columns(pl.lit(1).alias('LU_NEXT_FL'))
         )
print(ido(),'scan_pyarrow_dataset started') # 2 sec
df.collect()
print(ido(),'scan_pyarrow_dataset finished')

# explain plan:
# WITH_COLUMNS:
#  [dyn int: 1.alias("LU_NEXT_FL")] 
#   RENAME
#     UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#       simple π 1/6 ["PART_ID"]
#           PYTHON SCAN 
#           PROJECT 6/26 COLUMNS
#           SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADT_ID")) != (7010)])]) & ([(col("CR_FL")) == (String(I))])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]


tr2 = pl.scan_parquet(parq_path+"my_transaction"+ext)
df2 = (tr2.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                   (pl.col('CR_FL') == 'I') &
                   (pl.col('SEC_ID') > 3) &
                   (pl.col('ADL_ID') == 2905) &
                   (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                   (pl.col('ADT_ID') != 7010)
                   )
           .select('PART_ID').unique()
           .rename({'PART_ID':'PARTY_ID'})
           .with_columns(pl.lit(1).alias('LU_NEXT_FL'))
         )
print(ido(),'scan_parquet started') # 21 sec
df2.collect()
print(ido(),'scan_parquet finished')

# explain plan:
 # WITH_COLUMNS:
 # [dyn int: 1.alias("LU_NEXT_FL")] 
 #  RENAME
 #    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
 #      simple π 1/6 ["PART_ID"]
 #          Parquet SCAN 485 files: first file: /mypath/my_transaction/CALENDAR_DATE=2023-01-01/part-0.parquet
 #          PROJECT 6/26 COLUMNS
 #          SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & ([(col("ADT_ID")) != (7010)])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]) & ([(col("SEC_ID")) > (3)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("CR_FL")) == (String(I))])]


lmocsi commented on August 22, 2024

One small addition: in the data creation part the CALENDAR_DATE field should be calculated like this:

   'CALENDAR_DATE': pl.Series(fake.date_time_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)).dt.truncate('1d'),


lmocsi commented on August 22, 2024

@ritchie46 Any chance of this issue being solved soon?
I guess this bug hinders usage of polars in corporate environments, since without partition filtering all queries run almost for an eternity...


ritchie46 commented on August 22, 2024

I would recommend using polars scan_parquet directly. It now has proper hive partitioning support and can deal with all Polars predicates.
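
For what it's worth, a minimal sketch of that recommendation against the hive-partitioned layout from the repro above (the path and date bounds are illustrative; later comments converge on filtering with datetime objects rather than strings):

import polars as pl
from datetime import datetime

# hive_partitioning=True parses the CALENDAR_DATE=... directory names into a column,
# so a predicate on that column can be used to skip whole partitions
lf = pl.scan_parquet("/mypath/my_transaction/**/*.parquet", hive_partitioning=True)

result = (
    lf.filter(pl.col("CALENDAR_DATE").is_between(datetime(2023, 12, 1), datetime(2023, 12, 31)))
    .select("PART_ID")
    .unique()
    .collect()
)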


lmocsi commented on August 22, 2024

I would recommend using polars scan_parquet directly. It now has proper hive partitioning support and can deal with all Polars predicates.

Unfortunately it still lags behind scan_pyarrow_dataset in performance on polars==1.3.0
Here are some run times on real data (which I cannot disclose, sorry):

polars==1.3.0, scan_pyarrow_ds: 8 sec
polars==1.3.0, scan_pq:         1 min 31 sec
polars==1.4.1, scan_pyarrow_ds: <runs out of 32 GB memory>
polars==1.4.1, scan_pq:         1 min 32 sec

So it seems that something has changed for the worse from polars 1.3.0 to 1.4.1 regarding scan_pyarrow_dataset: partition filtering is no longer pushed down in 1.4.1. :(

Regarding scan_pq vs scan_pyarrow_ds, the only difference I see is that scan_pyarrow_ds uses PYTHON SCAN [], whereas scan_pq uses Parquet SCAN [/mypath/mytable/part-00000-whatever.snappy.parquet] in the explain plan.
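
(For reference, such plans can be printed with LazyFrame.explain(); using the two lazy queries from the repro earlier in the thread:)

print(df.explain())   # scan_pyarrow_dataset query -> PYTHON SCAN []
print(df2.explain())  # scan_parquet query -> Parquet SCAN [...]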


ritchie46 commented on August 22, 2024

If you can make a reproducible query, we can fix the optimization.


deanm0000 commented on August 22, 2024

Here's a small example

Make dataset:

import polars as pl
import numpy as np
from pathlib import Path
from urllib.parse import quote

df = (
    pl.select(
        cal_date=pl.datetime_range(
            pl.datetime(2024, 1, 1), pl.datetime(2024, 2, 5), "1d"
        )
    )
    .with_columns(pl.Series("data", np.random.normal(0, 1, 360)).reshape((36, 10)))
    .explode("data")
)

for uniq_date in df["cal_date"].unique():
    folder_path = Path(f"./example/CALENDAR_DATE={quote(uniq_date.isoformat())}")
    folder_path.mkdir(parents=True, exist_ok=True)
    df.filter(pl.col("cal_date") == uniq_date).drop("cal_date").write_parquet(
        folder_path / "0000.parquet"
    )

Do scan

pl.Config.set_verbose(True)

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        pl.col("CALENDAR_DATE").is_between(pl.lit("2024-01-02"), pl.lit("2024-01-04"))
    )
    .collect()
)

and then just a wall of "parquet file must be read, statistics not sufficient for predicate."

Doing it with pl.datetime instead of string

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        pl.col("CALENDAR_DATE").is_between(pl.datetime(2024,1,2), pl.datetime(2024,1,4))
    )
    .collect()
)

same result.

If I do ge and le then I don't get any verbose printing, so I assume it's also reading all the files, but it's hard to tell with a small local example (a quick sanity check is sketched after the snippet below):

lf = (
    pl.scan_parquet("./example/**/*.parquet", hive_partitioning=True)
    .filter(
        (pl.col("CALENDAR_DATE")>=pl.datetime(2024,1,2)) & (pl.col("CALENDAR_DATE")<=pl.datetime(2024,1,4))
    )
    .collect()
)
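
Since it is hard to tell from the verbose log alone with such a small dataset, a rough sanity check (standard library only, against the ./example layout created above) is to count how many partition directories actually overlap the date bounds and compare that with how many "must be read" lines get printed:

import glob
from urllib.parse import unquote

dirs = sorted(glob.glob("./example/CALENDAR_DATE=*"))
# decode the URL-encoded partition value and compare its date part against the filter bounds
keep = [d for d in dirs if "2024-01-02" <= unquote(d.rsplit("=", 1)[1])[:10] <= "2024-01-04"]
print(f"{len(dirs)} partitions on disk, {len(keep)} overlap the filter")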


lmocsi commented on August 22, 2024

If you can make a reproducible query, we can fix the optimization.

My dataset creation and querying are up here in the comments, but I can give you the link: #17045 (comment)

Running it on 1.3.0 polars:

  • scan_pyarrow_dataset: < 1 sec
  • scan_parquet: 8 sec

On 1.4.1:

  • scan_pyarrow_dataset: 37 sec
  • scan_parquet: 7 sec


nameexhaustion commented on August 22, 2024

is_between casts to supertypes by default. This means that the following filter:

  • pl.col("date1").is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))

Is effectively translated to:

  • pl.col("date1").cast(pl.String).is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))

and casting is currently not supported in the optimizer.

@lmocsi I'd advise for you to use filter against datetime objects instead, i.e. pl.col("date1").is_between(datetime(2023, 12, 1), datetime(2023, 12, 1)).

The following is a minimal reproducible example showing the different filters and their outputs:

from pathlib import Path
from datetime import datetime
import polars as pl

root = Path(".env/data")
partition = root / "date1=1970-01-01%2000%3A00%3A00.000000"
partition.mkdir(exist_ok=True, parents=True)

pl.DataFrame({"x": 1}).write_parquet(partition / "1.parquet")

lf = pl.scan_parquet(root, hive_partitioning=True).filter(
    pl.col("date1").is_between(pl.lit("2023-12-01"), pl.lit("2023-12-01"))
)
### Output
# FILTER col("date1").cast(String).is_between([String(2023-12-01), String(2023-12-01)]) FROM
#   Parquet SCAN [.env/data/date1=1970-01-01%2000%3A00%3A00.000000/1.parquet]
#   PROJECT */2 COLUMNS
# parquet file must be read, statistics not sufficient for predicate.
# parquet file must be read, statistics not sufficient for predicate.
print(lf.explain(optimized=False))
lf.collect()

lf = pl.scan_parquet(root, hive_partitioning=True).filter(
    pl.col("date1").is_between(datetime(2023, 12, 1), datetime(2023, 12, 1))
)
print(lf.explain(optimized=False))
print(lf.collect())
### Output
# FILTER col("date1").is_between([2023-12-01 00:00:00, 2023-12-01 00:00:00]) FROM
#   Parquet SCAN [.env/data/date1=1970-01-01%2000%3A00%3A00.000000/1.parquet]
#   PROJECT */2 COLUMNS
# parquet file can be skipped, the statistics were sufficient to apply the predicate.

There are a few things we could consider doing about this:

  • Change is_between to not cast to supertypes by default; this could be a breaking change.
  • Support type casting in the stats_evaluator framework so that predicates with type-casting can run.


ritchie46 commented on August 22, 2024

I will close this as @nameexhaustion states it should work on datetimes directly.

Working on strings was also semantically wrong, as you probably don't want to check whether the column's string repr is between the given strings. You want to work on temporal types.


lmocsi commented on August 22, 2024

On the above dataset, I tried filtering on the partition column as a string and as a datetime. Here are the results.
They show that scan_pyarrow_dataset in polars 1.3.0 was the fastest, whether filtering as a string or as a datetime.

polars version: 1.3.0:

method                filter    time
scan_pyarrow_dataset  str       < 1 sec
scan_pyarrow_dataset  datetime  < 1 sec
scan_parquet          str       3 sec
scan_parquet          datetime  3 sec

polars version: 1.4.1:

method                filter    time
scan_pyarrow_dataset  str       15 sec
scan_pyarrow_dataset  datetime  15 sec
scan_parquet          str       3 sec
scan_parquet          datetime  3 sec

scan_pyarrow_dataset str explain plans:

# 1.3.0: # < 1 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 15 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        FILTER [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])] FROM
#          PYTHON SCAN []
#          PROJECT 6/26 COLUMNS

scan_pyarrow_dataset datetime explain plans:

# 1.3.0: # < 1 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(1701388800000000), String(1703980800000000)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 15 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        FILTER [([([([([(col("CALENDAR_DATE").is_between([String(1701388800000000), String(1703980800000000)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])] FROM
#          PYTHON SCAN []
#          PROJECT 6/26 COLUMNS

scan_parquet str explain plans:

# 1.3.0: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").cast(String).is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").cast(String).is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

scan_parquet datetime explain plans:

# 1.3.0: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([2023-12-01 00:00:00, 2023-12-31 00:00:00])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

# 1.4.1: # 3 sec
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
#  RENAME
#    UNIQUE[maintain_order: false, keep_strategy: Any] BY None
#      simple π 1/6 ["PART_ID"]
#        Parquet SCAN [my_transaction\CALENDAR_DATE=2023-01-01%2000%3A00%3A00.000000\part-0.parquet, ... 485 other files]
#        PROJECT 5/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([2023-12-01 00:00:00, 2023-12-31 00:00:00])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]

So it seems that this filter, pushed down into the scan, is the fastest, either with strings like String(2023-12-01) or with dates converted to Unix epochs like String(1701388800000000):

#      simple π 1/6 ["PART_ID"]
#        PYTHON SCAN []
#        PROJECT 6/26 COLUMNS
#        SELECTION: [([([([([(col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)])) & ([(col("CR_FL")) == (String(I))])]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADL_ID")) == (2905)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("ADT_ID")) != (7010)])]


lmocsi commented on August 22, 2024

@nameexhaustion Can you have a look at the above execution plans?

