
Comments (5)

nameexhaustion commented on September 24, 2024

I believe this is a regression from #15083.

Local MRE:

import os

os.environ["POLARS_FORCE_ASYNC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl

df = pl.Series("x", [1]).to_frame()

p = ".env/x.parquet"
df.write_parquet(p)

print(pl.collect_all([pl.scan_parquet(p)]))

Basically, every thread in the rayon pool blocks on an async task, and then one of those async tasks ends up spawning a new task on the rayon pool and blocks on that task. The new task never executes, since all the rayon threads are blocked, so we deadlock.
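
The same shape of problem can be reproduced in plain Python, as a rough analogy only. The sketch below is not Polars or rayon code: it uses `concurrent.futures.ThreadPoolExecutor` with a single worker as a stand-in for the thread pool, and the helper name `demo_pool_starvation` is made up for illustration. A timeout is used so the example reports the starvation instead of hanging forever.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def demo_pool_starvation(timeout_s: float = 0.5) -> bool:
    """Return True if a nested wait on the same pool timed out.

    Hypothetical helper: a one-worker pool stands in for a fully
    occupied thread pool. Without the timeout, outer() would block forever.
    """
    pool = ThreadPoolExecutor(max_workers=1)

    def inner() -> str:
        return "done"

    def outer() -> bool:
        # The only worker is busy running outer(), so inner() just sits
        # in the queue and cannot start while we wait on it.
        fut = pool.submit(inner)
        try:
            fut.result(timeout=timeout_s)
            return False
        except TimeoutError:
            return True

    starved = pool.submit(outer).result()
    # Once outer() returns, the worker is free again and runs inner().
    pool.shutdown(wait=True)
    return starved


if __name__ == "__main__":
    print("nested wait starved:", demo_pool_starvation())
```

With more free workers than blocked tasks the nested wait completes, which is consistent with the MRE needing `POLARS_MAX_THREADS=1` to trigger the hang reliably.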

stephenskory commented on September 24, 2024

FWIW I just ran the above examples on my laptop (M1 Mac) and all examples including the lazy-S3 version worked. This indicates that something is amiss on the EC2 instance with my Python/Polars stack, but I'm still stuck. Any ideas or suggestions? Just re-install everything? Is there a way to get any kind of useful error message?

ritchie46 commented on September 24, 2024

Does this occur in the dataframe API as well?

stephenskory commented on September 24, 2024

Yes, this example below hangs, but with read_parquet and no .collect() it works fine.

import polars as pl
import boto3

session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}
df0 = pl.scan_parquet("s3://bucket/df0.parquet", storage_options=storage_options)
df1 = pl.scan_parquet("s3://bucket/df1.parquet", storage_options=storage_options)
dfboth = df0.join(df1, on="bar", how="inner")
dfboth = dfboth.filter(pl.col("bar").is_in((0, 1))).collect()
print(dfboth)

I tried upgrading a number of the packages related to Polars but it didn't change anything:

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.0-1055-aws-aarch64-with-glibc2.31
Python:               3.11.5 (main, Sep 11 2023, 13:14:08) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0 # upgraded
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1 # upgraded
gevent:               <not installed>
hvplot:               0.9.0
matplotlib:           3.8.0
numpy:                1.24.3
openpyxl:             3.0.10
pandas:               2.1.4
pyarrow:              15.0.2 # upgraded
pydantic:             2.6.4 # upgraded
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.21
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9

aberres commented on September 24, 2024

I am running into this issue with 0.20.16 when reading Parquet files from Google Cloud buckets with code like this:

df_pl = pl.concat(
    (pl.scan_parquet(c.url, storage_options=storage_options) for c in results),
    how="diagonal",
    # Ideally this reduces the memory usage
    rechunk=False,
).collect()
