data-as-code / dac

Python Data as Code core implementation
License: MIT License
As a dac package user
I would like to have the possibility to pass optional arguments to the load() function
so that, for example, I can specify what kind of DataFrame engine I want to use.
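One way this could look is a minimal sketch, assuming a hypothetical `engine` keyword and loader registry (neither is part of dac today); placeholder strings stand in for real DataFrame objects to keep the example self-contained:

```python
from typing import Any, Callable, Dict


def _load_pandas(**kwargs: Any) -> str:
    # Placeholder for e.g. pd.read_parquet(...)
    return "pandas-dataframe"


def _load_dask(**kwargs: Any) -> str:
    # Placeholder for e.g. dd.read_parquet(...)
    return "dask-dataframe"


_LOADERS: Dict[str, Callable[..., Any]] = {"pandas": _load_pandas, "dask": _load_dask}


def load(engine: str = "pandas", **kwargs: Any) -> Any:
    """Dispatch to the requested engine, forwarding any extra options."""
    return _LOADERS[engine](**kwargs)
```

With this shape, `load()` keeps its zero-argument default while `load(engine="dask")` selects a different backend.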
Currently, when dac creates a package from load.py and schema.py, it exposes the load function and the Schema class, but any other object defined in those two modules is hidden under DAC_PKG_NAME._load and DAC_PKG_NAME._schema.

In many scenarios it is useful to have access to the other objects defined in those two modules. Therefore it would be better to make them accessible as DAC_PKG_NAME.load and DAC_PKG_NAME.schema.
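A sketch of what the generated package's `__init__.py` might contain under this proposal (the module names mirror the issue; this is an assumption about dac's packaging, not its current behavior):

```python
# Hypothetical DAC_PKG_NAME/__init__.py: re-export the producer's modules
# under their original names, so every object they define stays reachable
# as DAC_PKG_NAME.load.<obj> and DAC_PKG_NAME.schema.<obj>.
from . import load as load
from . import schema as schema
```

Note the naming tension this introduces: if `DAC_PKG_NAME.load` is the module, the top-level `load` function and `Schema` class would need to be reached as `DAC_PKG_NAME.load.load` and `DAC_PKG_NAME.schema.Schema`, or re-exported under different aliases.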
As dac needs to be installed along with all generated packages, it is crucial to keep the dependency list as short as possible.
Because of this, it would be very helpful to remove the unnecessary dependency on pydantic.
As a dac package producer
I would like to see example implementations of load.py and schema.py
so that I can understand what I am expected to implement.

Open question: is this sufficient? Perhaps some documentation is also needed?
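A minimal pair of example files could look like the sketch below, shown together so it stays self-contained. All column names and the record-based return type are illustrative assumptions; a real implementation would typically return a DataFrame.

```python
from typing import Dict, List

Records = List[Dict[str, float]]


# --- load.py ---
def load() -> Records:
    # A real load() would read from storage (parquet, SQL, ...).
    return [{"id": 1, "value": 10.0}, {"id": 2, "value": 20.0}]


# --- schema.py ---
class Schema:
    id = "id"
    value = "value"

    @classmethod
    def validate(cls, data: Records) -> Records:
        # A real schema would check types and constraints, not just presence.
        for row in data:
            assert {cls.id, cls.value} <= set(row)
        return data

    @classmethod
    def example(cls) -> Records:
        return [{cls.id: 0, cls.value: 0.0}]
```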
In most scenarios, a new data release corresponds to a minor version update.
It would be useful to have a way to achieve this with a dac command.
As of now I can think of a new subcommand that can be called like this:

dac get-next-minor-v --pkg-name PKG_NAME --major-v MAJOR

and that returns the next full minor version (e.g. 1.4.0).
Here is a sample script:
```python
import re
import subprocess

import typer
from rich import print

app = typer.Typer()


@app.command()
def main(
    library_name: str = typer.Option(..., envvar="DAC_PKG_NAME"),
    major: int = typer.Option(..., envvar="DAC_PKG_MAJOR"),
) -> str:
    """
    Determine what should be the next version of a library, assuming a minor version increase.
    """
    latest_version = determine_latest_version(library_name, major)
    next_version = increase_version_minor(latest_version, str(major))
    print(next_version)
    return next_version


def determine_latest_version(library_name: str, major: int) -> str:
    try:
        # A dry-run install resolves the latest published release for this major.
        output = subprocess.check_output(
            ["pip", "install", f"{library_name}=={major}.*", "--dry-run"],
            stderr=subprocess.DEVNULL,
        )
        last_line = output.decode("utf-8").splitlines()[-1]
        # pip normalizes underscores to dashes in distribution names.
        regex_rule = rf"{library_name.replace('_', '-')}-{major}\.[^ ]+"
        match = re.search(regex_rule, last_line)
        assert match is not None
        return match[0][len(f"{library_name}-") :]
    except Exception:
        # No release of this major version has been published yet.
        return "None"


def increase_version_minor(version: str, major: str = "0") -> str:
    if version == "None":
        return f"{major}.0.0"
    major, minor, _ = version.split(".")
    return f"{major}.{int(minor) + 1}.0"


if __name__ == "__main__":
    app()
```
As a dac package user
I would like to have the Schema.example() functionality guaranteed to work
so that I can trust it.
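One way such a guarantee could be enforced (an assumption, not current dac behavior) is for dac pack to run a check that the schema's own example passes its own validation. The Schema below is a stand-in with illustrative names:

```python
class Schema:
    col = "column_1_name"

    @classmethod
    def validate(cls, data: list) -> None:
        assert all(cls.col in row for row in data)

    @classmethod
    def example(cls) -> list:
        return [{cls.col: "a"}]


def check_example_is_valid(schema: type) -> bool:
    """Return True if the schema's example passes its own validation."""
    try:
        schema.validate(schema.example())
        return True
    except Exception:
        return False
```

Running this at pack time would turn "example() works" from a convention into a verified property of every published package.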
As a dac package producer
I would like to have the possibility to insert parameters in the load.py and schema.py files that will be filled when running dac pack
so that I can re-use the same templates.

```python
import dask.dataframe as dd


def load() -> dd.DataFrame:
    return dd.read_parquet(
        path="az://STORAGE_CONTAINER_NAME/DATA_FOLDER/*.parquet",
        storage_options={"account_name": "STORAGE_ACCOUNT_NAME", "anon": False},
    )
```

Here, STORAGE_ACCOUNT_NAME, STORAGE_CONTAINER_NAME, and DATA_FOLDER could be parametrized to allow re-usability.
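A minimal sketch of how `dac pack` could fill such placeholders, assuming $-style placeholders and the standard-library `string.Template` (dac might well choose a different templating mechanism):

```python
from string import Template

# Template fragment of load.py with the three parametrized values.
template = Template(
    'dd.read_parquet(path="az://$container/$folder/*.parquet", '
    'storage_options={"account_name": "$account", "anon": False})'
)

# At pack time, the concrete values would come from CLI options or config.
rendered = template.substitute(
    account="myaccount", container="mycontainer", folder="mydata"
)
```

`Template.substitute` raises `KeyError` on a missing parameter, which would surface an incomplete `dac pack` invocation early.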
Currently the implementation relies on pandera. This introduces some limitations:

Because of this, I would like to make the dac core implementation independent from pandera, and rather provide a plugin option that accepts a pandera schema and makes it fit the alternative requirements.

The problem is: what should the independent implementation look like? For sure we want the following functionalities:
Currently I am thinking to stick to a Schema class that looks like this:

```python
class Schema:
    col1 = "column_1_name"
    col2 = ...

    @classmethod
    def validate(cls) -> None:
        pass

    @classmethod
    def example(cls) -> object:
        pass
```
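The pandera plugin could then be an adapter producing such a class from a pandera schema. The sketch below is an assumption about that plugin's shape; a duck-typed stand-in replaces a real pandera `DataFrameSchema` (which exposes `validate()` and `example()`) so the example runs without the dependency:

```python
class FakePanderaSchema:
    """Stand-in for a pandera DataFrameSchema, to keep the sketch runnable."""

    def validate(self, data):
        return data

    def example(self, size=1):
        return [{"column_1_name": None}] * size


def schema_from_pandera(pandera_schema) -> type:
    """Wrap a pandera-style schema into the dac-independent Schema interface."""

    class Schema:
        col1 = "column_1_name"

        @classmethod
        def validate(cls, data) -> None:
            pandera_schema.validate(data)

        @classmethod
        def example(cls) -> object:
            return pandera_schema.example(size=1)

    return Schema
```

This keeps pandera out of the core: the core only ever sees a class with `validate` and `example`, and the adapter lives in an optional plugin.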