SQL Query Hangs When Using scan_parquet off S3 about polars HOT 5 CLOSED

stephenskory commented on September 24, 2024

SQL Query Hangs When Using scan_parquet off S3

from polars.

Comments (5)

nameexhaustion commented on September 24, 2024

I believe this is a regression from #15083.

Local MRE:

import os

os.environ["POLARS_FORCE_ASYNC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl

df = pl.Series("x", [1]).to_frame()

p = ".env/x.parquet"
df.write_parquet(p)

print(pl.collect_all([pl.scan_parquet(p)]))

Basically, every thread in the rayon pool blocks on an async task, and then one of those async tasks ends up spawning a new task on the rayon pool:

polars/crates/polars-io/src/parquet/read_impl.rs

Line 673 in 474ac34

POOL.spawn(move || {

and then blocks on that task, which never executes because all the rayon threads are blocked, so we deadlock.
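The same starvation pattern can be reproduced outside Polars. Below is a minimal sketch, assuming Python's `ThreadPoolExecutor` as a stand-in for the rayon pool; it is not Polars' actual code, and the helper name `starves` is hypothetical. A task running on the pool submits a second task to the same pool and waits for it. A timeout is used so the demo reports the starvation instead of hanging forever.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def starves(pool_size: int, timeout: float = 0.5) -> bool:
    """Return True if a nested task cannot run because the pool is exhausted."""
    pool = ThreadPoolExecutor(max_workers=pool_size)

    def inner() -> str:
        return "done"

    def outer() -> bool:
        # Spawn a task on the same pool, then block waiting for it.
        fut = pool.submit(inner)
        try:
            fut.result(timeout=timeout)
            return False
        except FutureTimeout:
            # Without the timeout this wait would never return.
            fut.cancel()
            return True

    try:
        return pool.submit(outer).result()
    finally:
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print("1 worker starves:", starves(1))
    print("2 workers starve:", starves(2))
```

With a single worker the outer task occupies the only thread, so the inner task stays queued; with two workers the inner task gets a thread and completes. This mirrors why the MRE sets POLARS_MAX_THREADS to 1.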

stephenskory commented on September 24, 2024

FWIW I just ran the above examples on my laptop (M1 Mac) and all examples including the lazy-S3 version worked. This indicates that something is amiss on the EC2 instance with my Python/Polars stack, but I'm still stuck. Any ideas or suggestions? Just re-install everything? Is there a way to get any kind of useful error message?
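On getting some output from a hang: one generic option is the standard library's `faulthandler`, which can dump the Python stack of every thread if a call has not returned after a timeout. This is a sketch, not a Polars feature; the wrapper name `run_with_hang_dump` is hypothetical, and the dump shows Python frames only, so it would reveal which Python call is stuck but not what the native threads are doing.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(fn, timeout_s: float, file=sys.stderr):
    """Run fn(); if it is still running after timeout_s, dump all thread stacks."""
    # The file must stay open until the dump fires or is cancelled.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return fn()
    finally:
        # Cancel the pending dump if fn() returned in time.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a slow or hanging call such as a .collect().
    run_with_hang_dump(lambda: time.sleep(0.3), timeout_s=0.1)
```

In practice the lambda would wrap the hanging query, with a timeout long enough that a healthy run finishes first.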

ritchie46 commented on September 24, 2024

Does this occur in the dataframe API as well?

stephenskory commented on September 24, 2024

Yes, the example below hangs, but the same code with read_parquet and no .collect() works fine.

import boto3
import polars as pl

# Resolve AWS credentials from the default boto3 session.
session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}

# Lazily scan both files from S3.
df0 = pl.scan_parquet("s3://bucket/df0.parquet", storage_options=storage_options)
df1 = pl.scan_parquet("s3://bucket/df1.parquet", storage_options=storage_options)

# The hang occurs when the lazy query is collected.
dfboth = df0.join(df1, on="bar", how="inner")
dfboth = dfboth.filter(pl.col("bar").is_in((0, 1))).collect()
print(dfboth)

I tried upgrading a number of the packages related to Polars but it didn't change anything:

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.0-1055-aws-aarch64-with-glibc2.31
Python:               3.11.5 (main, Sep 11 2023, 13:14:08) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0 # upgraded
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1 # upgraded
gevent:               <not installed>
hvplot:               0.9.0
matplotlib:           3.8.0
numpy:                1.24.3
openpyxl:             3.0.10
pandas:               2.1.4
pyarrow:              15.0.2 # upgraded
pydantic:             2.6.4 # upgraded
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.21
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9

aberres commented on September 24, 2024

I am running into this issue with 0.20.16 when reading Parquet files from Google Cloud Storage buckets with code like this:

df_pl = pl.concat(
    (pl.scan_parquet(c.url, storage_options=storage_options) for c in results),
    how="diagonal",
    # Ideally this reduces the memory usage
    rechunk=False,
).collect()
