👷 ML/MLOps Engineer by day
❄️ NixOS ricer by night
I'm currently learning Rust
My website: gafni.dev
[Project moved] Polars integration for Dagster
License: Apache License 2.0
Some parameters, like `overwrite_schema` from deltalake, should be settable from the IOManager config, not only from the metadata.
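One way the precedence could work (a minimal sketch; `resolve_write_kwargs` and its precedence rule are assumptions for illustration, not the project's actual API) is merging IOManager-level config with per-asset metadata, letting metadata override the config defaults:

```python
# Hypothetical helper: merge IOManager config kwargs with per-asset metadata
# kwargs, with metadata taking precedence. Names are assumptions, not
# dagster-polars' real API.
def resolve_write_kwargs(config_kwargs: dict, metadata_kwargs: dict) -> dict:
    merged = dict(config_kwargs)    # IOManager-wide defaults, e.g. overwrite_schema
    merged.update(metadata_kwargs)  # per-asset overrides win
    return merged

# The IOManager config sets overwrite_schema once; an asset adds its own option.
kwargs = resolve_write_kwargs({"overwrite_schema": True}, {"compression": "snappy"})
```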
I'm seeing this with dagster 1.5.5, pyarrow 13.0.0, and dagster-polars 0.1.3. It could be a local problem, but looking through the versions, pyarrow's behavior here may differ between releases, so I'm asking anyway.
The context is a polars Parquet IO manager pointing at Google Cloud Storage.
`AttributeError: module 'pyarrow' has no attribute 'dataset'`

Stack trace:

```
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
  yield
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/execution/plan/inputs.py", line 831, in _load_input_with_input_manager
  value = input_manager.load_input(context)
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/storage/upath_io_manager.py", line 402, in load_input
  return self._load_single_input(path, context)
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/storage/upath_io_manager.py", line 239, in _load_single_input
  obj = self.load_from_path(context=context, path=path)
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster_polars/io_managers/base.py", line 193, in load_from_path
  ldf = self.scan_df_from_path(path=path, context=context)  # type: ignore
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster_polars/io_managers/parquet.py", line 83, in scan_df_from_path
  ds = pyarrow.dataset.dataset(
File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/pyarrow/__init__.py", line 317, in __getattr__
  raise AttributeError(
```
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.Statistics.html

Currently `dagster-polars` calculates similar statistics manually. The pyarrow statistics might be useful, and they would always be consistent with the actual Parquet file rather than the polars DataFrame.
Hi there! I'm really liking this project.
As I was reading through your code, I noticed a lot of duplication around the metadata columns: the same functions appear in both `base.py` and `utils.py`.
To make the code easier to read, it would be great to remove these functions from `base.py` and keep them only in `utils.py`.
The current implementation is unnecessarily susceptible to breaking changes. Say a new version of Polars (or `pyarrow.dataset`) removes or renames a given `write_parquet` argument. Because the arguments are currently hard-coded in the `PolarsParquetIOManager`, this would break user scripts when they upgrade to the newer version of Polars.
I think a good solution would be switching to a dictionary (called something along the lines of `polars_io_args` -- I'm sure there's a better name) as follows:
```python
@asset(
    metadata={
        "polars_io_args": {
            "compression": "snappy",
        },
    },
)
def some_asset():
    ...
```
The `dump_df_to_path` and `scan_df_from_path` methods then forward these args to Polars without hard-coding their names:
```python
from typing import Union

import fsspec
import polars as pl
import pyarrow.dataset as ds
from dagster import InputContext, OutputContext
from upath import UPath


class PolarsParquetIOManager(BasePolarsUPathIOManager):
    ...

    def dump_df_to_path(self, context: OutputContext, df: pl.DataFrame, path: UPath):
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})
        with path.open("wb") as file:
            df.write_parquet(file, **io_args)

    def scan_df_from_path(self, path: UPath, context: InputContext) -> pl.LazyFrame:
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})
        io_args.setdefault("format", "parquet")  # default to Parquet if not set
        fs: Union[fsspec.AbstractFileSystem, None] = None
        try:
            fs = path._accessor._fs
        except AttributeError:
            pass
        return pl.scan_pyarrow_dataset(
            ds.dataset(str(path), filesystem=fs, **io_args),
            allow_pyarrow_filter=context.metadata.get("allow_pyarrow_filter", True),
        )
```
Finally, this might also enable (in the future) 'hybrid' IO managers that save the same asset in two ways, because you can separate each IO manager's args under different keys, even if the names of the child args clash or overlap.
The only problem with the above solution is that I can't think of a "clean" way to also pass `allow_pyarrow_filter` values, because `io_args` is unpacked in a different place.
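One possible workaround (just an idea, not necessarily clean): reserve the key inside `polars_io_args` and pop it before unpacking the rest:

```python
# Reserve allow_pyarrow_filter inside the args dict and extract it before the
# remaining kwargs are forwarded to ds.dataset. Values here are examples only.
io_args = {"format": "parquet", "allow_pyarrow_filter": False}
allow_filter = io_args.pop("allow_pyarrow_filter", True)
# io_args can now be unpacked safely; allow_filter goes to scan_pyarrow_dataset.
```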
Hi there!
I've found this integration very useful and hope it'll gain much more steam in the future. As such, it would be great to have expanded docs that ease new users into the following:
- The `load_from_path` method of the base IO manager does a `collect()` if the input is a `LazyFrame`, but given that any upstream asset would have to be materialized to write to disk anyway, how would one receive a lazy frame as an input in the first place? This bit is rather ambiguous to me.
- The `dump_df_to_path` method also only works with `pl.DataFrame`, but there is presumably a use case for the user simply returning a `LazyFrame` as the output of an asset. Perhaps it might be helpful to have a `collect()` in `dump_df_to_path` (or to use `sink_parquet`) for those scenarios?