
dagster-polars's Introduction

Hey there 👋

👷 ML/MLOps Engineer by day
❄️ NixOS ricer by night
📚 I'm currently learning Rust

My website: gafni.dev

🔥 My GitHub Stats

dagster-polars's People

Contributors

danielgafni, seandavi


dagster-polars's Issues

pyarrow has no attribute dataset

I'm seeing this with dagster 1.5.5, pyarrow 13.0.0, and dagster-polars 0.1.3. It could be a local problem, but pyarrow's behavior here seems to differ between versions, so I'm asking anyway.

The context is a Polars Parquet IO manager pointing at Google Cloud Storage.

AttributeError: module 'pyarrow' has no attribute 'dataset'

Stack Trace:
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/execution/plan/inputs.py", line 831, in _load_input_with_input_manager
    value = input_manager.load_input(context)
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/storage/upath_io_manager.py", line 402, in load_input
    return self._load_single_input(path, context)
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster/_core/storage/upath_io_manager.py", line 239, in _load_single_input
    obj = self.load_from_path(context=context, path=path)
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster_polars/io_managers/base.py", line 193, in load_from_path
    ldf = self.scan_df_from_path(path=path, context=context)  # type: ignore
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/dagster_polars/io_managers/parquet.py", line 83, in scan_df_from_path
    ds = pyarrow.dataset.dataset(
  File "/Users/seandavis/Library/Caches/pypoetry/virtualenvs/dagster-infra-h7rMZSN8-py3.9/lib/python3.9/site-packages/pyarrow/__init__.py", line 317, in __getattr__
    raise AttributeError(
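
A likely explanation (an assumption based on the trace, not a confirmed diagnosis): pyarrow only exposes the dataset submodule after it has been imported explicitly, and pyarrow's __init__.py raises AttributeError for attributes that have not been loaded. A minimal sketch of the workaround, importing the submodule somewhere before the IO manager runs (the path below is just a placeholder):

import pyarrow

# `import pyarrow` alone does not load the dataset submodule in some versions;
# the explicit import below registers pyarrow.dataset on the pyarrow package.
import pyarrow.dataset

ds = pyarrow.dataset.dataset("some/table.parquet")  # resolves once the submodule is imported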

Code duplication in `utils.py` and `base.py`

Hi there! I'm really liking this project.

As I was reading through your code, I noticed that there seems to be a lot of duplication around the metadata-column handling: the same functions appear in both base.py and utils.py.

To make the code easier to read, it would be great if you removed these functions from base.py and kept them in utils.py only.

Change how read/write args are read from metadata

The current implementation is unnecessarily susceptible to breaking changes. Say a new version of Polars (or pyarrow.dataset) removes or renames a write_parquet argument. Because the arguments are currently hard-coded in the PolarsParquetIOManager, upgrading to that newer Polars version would break user code.

I think a good solution would be switching to a dictionary (called something along the lines of polars_io_args -- I'm sure there's a better name) as follows:

from dagster import asset


@asset(
    metadata={
        "polars_io_args": {
            "compression": "snappy",
        }
    }
)
def some_asset():
    ...

The dump_df_to_path and scan_df_from_path methods would then forward these args to Polars, without hard-coding their names:

from typing import Union

import fsspec
import polars as pl
import pyarrow.dataset as ds
from dagster import InputContext, OutputContext
from upath import UPath

from dagster_polars.io_managers.base import BasePolarsUPathIOManager


class PolarsParquetIOManager(BasePolarsUPathIOManager):
    ...

    def dump_df_to_path(self, context: OutputContext, df: pl.DataFrame, path: UPath):
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})

        with path.open("wb") as file:
            df.write_parquet(
                file,
                **io_args,
            )

    def scan_df_from_path(self, path: UPath, context: InputContext) -> pl.LazyFrame:
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})
        io_args.setdefault("format", "parquet")  # default to Parquet if not set explicitly

        fs: Union[fsspec.AbstractFileSystem, None] = None

        try:
            fs = path._accessor._fs
        except AttributeError:
            pass

        return pl.scan_pyarrow_dataset(
            ds.dataset(
                str(path),
                filesystem=fs,
                **io_args,
            ),
            allow_pyarrow_filter=context.metadata.get("allow_pyarrow_filter", True),
        )

This

  • reduces the coupling to a specific set of expected argument names
  • is overall "cleaner", as IO-related metadata lives in its own 'namespace', separate from other metadata
  • keeps the metadata section in Dagit less cluttered with IO minutiae by hiding it under a single JSON entry.

Finally, this might also enable (in the future) 'hybrid' IO managers that save the same asset in two ways, because each IO manager's args can live under a separate key, even if the individual argument names clash or overlap. A hypothetical example of such namespacing follows.
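
For illustration only (these metadata keys are hypothetical and not part of the current API), two IO managers could each read their own namespace from the same asset's metadata:

from dagster import asset


@asset(
    metadata={
        # hypothetical keys: each IO manager reads only its own namespace,
        # so both can define e.g. "compression" without clashing
        "polars_parquet_io_args": {"compression": "snappy"},
        "polars_delta_io_args": {"compression": "zstd"},
    }
)
def some_hybrid_asset():
    ...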

The only problem with the above is that I can't think of a 'clean' way to also pass allow_pyarrow_filter values, because io_args is unpacked in a different place.

Suggestion: expand the docs

Hi there!

I've found this integration very useful and hope it'll pick up much more steam in the future. As such, it would be great to have expanded docs that ease new users into the following:

  • how to use metadata to select columns in the InputContext
  • how to use metadata to pass configuration options like compression type, etc., to the IO manager
  • how the IO managers interact with lazy frames. For example, the load_from_path method of the base IO manager does a collect() if the input is a LazyFrame, but given that any upstream asset would have to be materialized to write to disk anyway, how would one receive a lazy frame as an input in the first place? This bit is rather ambiguous to me.
    • The dump_df_to_path function also only works with pl.DataFrame, but there is presumably a use case for the user simply returning a LazyFrame as the output of an asset. Perhaps it would be helpful to have a collect() in dump_df_to_path (or to use sink_parquet) for those scenarios? A rough sketch of this idea follows this list.
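
The following is only an illustrative sketch, not current dagster_polars behavior: the class name is made up, the base class and method signature are taken from the code above, and the LazyFrame handling is an assumption about how such support could look.

from typing import Union

import polars as pl
from dagster import OutputContext
from upath import UPath

from dagster_polars.io_managers.base import BasePolarsUPathIOManager


class LazyFrameAwareParquetIOManager(BasePolarsUPathIOManager):  # hypothetical name
    ...

    def dump_df_to_path(
        self, context: OutputContext, df: Union[pl.DataFrame, pl.LazyFrame], path: UPath
    ):
        if isinstance(df, pl.LazyFrame):
            # Simplest approach: materialize before writing. A streaming
            # df.sink_parquet(...) could avoid holding the full frame in memory,
            # but it writes to a plain path rather than an open file object,
            # so the collect() route is shown here.
            df = df.collect()

        with path.open("wb") as file:
            df.write_parquet(file)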
