
dac's Issues

Expose load and schema modules

Currently, when dac creates a package from load.py and schema.py, it exposes the load function and the Schema class, but any other object defined in those two modules is hidden under DAC_PKG_NAME._load and DAC_PKG_NAME._schema.

In many scenarios it is useful to have access to the other objects defined in those two modules. It would therefore be better to make them accessible as DAC_PKG_NAME.load and DAC_PKG_NAME.schema.
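A hypothetical illustration of the difference, assuming a dac package named my_data that defines a helper constant COLUMN_DESCRIPTIONS in schema.py:

# Today: only the two entry points are public.
from my_data import load, Schema

# Anything else defined in schema.py is reachable only through a private module.
from my_data._schema import COLUMN_DESCRIPTIONS  # works, but relies on a private path

# Proposed: make the full modules part of the public API.
from my_data.schema import COLUMN_DESCRIPTIONS   # public, discoverable access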

Make dac independent from pydantic

As dac needs to be installed along with all the packages it creates, it is crucial to keep its dependency list as short as possible.

Because of this, it would be very helpful to remove the dependency on pydantic, which is not strictly necessary.

`load.py` and `schema.py` examples from cli

As a dac package producer
I would like to see example implementations of load.py and schema.py
so that I can understand what I am expected to implement

Open question: is this sufficient? Perhaps some documentation is also needed?
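For reference, a minimal sketch of what such examples could look like, assuming the pandas + pandera stack the current implementation relies on; the column names and checks are purely illustrative:

# load.py: return the dataset as a dataframe.
import pandas as pd


def load() -> pd.DataFrame:
    # A real package would read from a database, object storage, etc.
    return pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})


# schema.py: describe and validate the dataframe returned by load().
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    id: Series[int]
    value: Series[float] = pa.Field(ge=0)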

`dac next-v`

In most scenarios, a new data release corresponds to a minor version update.
It would be useful to have a way to achieve this with a dac command.

As of now I can think of a new subcommand that can be called like this:

dac get-next-minor-v --pkg-name PKG_NAME --major-v MAJOR

and that returns the next full minor version (e.g. 1.4.0).

Here is a sample script:

import re
import subprocess

import typer
from rich import print

app = typer.Typer()


@app.command()
def main(
    library_name: str = typer.Option(..., envvar="DAC_PKG_NAME"),
    major: int = typer.Option(..., envvar="DAC_PKG_MAJOR"),
) -> str:
    """
    Determine what should be the next version of a library, assuming a minor version increase.
    """
    latest_version = determine_latest_version(library_name, major)
    next_version = increase_version_minor(latest_version, str(major))
    print(next_version)
    return next_version


def determine_latest_version(library_name: str, major: int) -> str:
    """Return the latest published version within the given major, or "None" if unavailable."""
    try:
        # `pip install --dry-run` resolves the newest matching release without installing it.
        output = subprocess.check_output(
            ["pip", "install", f"{library_name}=={major}.*", "--dry-run"],
            stderr=subprocess.DEVNULL,
        )
        # The last line names the resolved distribution, e.g. "Would install pkg-name-1.4.2".
        last_line = output.decode("utf-8").splitlines()[-1]
        # Distribution names use dashes even if the import name uses underscores.
        regex_rule = rf"{library_name.replace('_', '-')}-{major}\.[^ ]+"
        match = re.search(regex_rule, last_line)
        assert match is not None
        return match[0][len(f"{library_name}-") :]
    except Exception:
        return "None"


def increase_version_minor(version: str, major: str = "0") -> str:
    """Bump the minor version, or start a fresh major if no published version was found."""
    if version == "None":
        return f"{major}.0.0"
    major, minor, _ = version.split(".")
    return f"{major}.{int(minor) + 1}.0"


if __name__ == "__main__":
    app()
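For reference, the pure version-bumping helper above behaves as follows (illustrative values only):

increase_version_minor("1.3.2")            # -> "1.4.0"
increase_version_minor("None", major="2")  # -> "2.0.0"  (no release found for that major)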

Support `load.py` and `schema.py` as templates with values injected during `dac pack`

As a dac package producer
I would like to be able to insert placeholders in the load.py and schema.py files that get filled in when running dac pack
so that I can re-use the same templates

Example

import dask.dataframe as dd


def load() -> dd.DataFrame:
    return dd.read_parquet(path="az://STORAGE_CONTAINER_NAME/DATA_FOLDER/*.parquet", 
                           storage_options={"account_name": "STORAGE_ACCOUNT_NAME", "anon": False})

Here, STORAGE_ACCOUNT_NAME, STORAGE_CONTAINER_NAME, and DATA_FOLDER could be parametrized to allow re-use; one possible rendering mechanism is sketched below.
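One possible way dac pack could fill such placeholders, sketched here with Python's string.Template; the placeholder syntax (${STORAGE_CONTAINER_NAME} instead of a bare name), the function name, and the template file names are assumptions, not dac's actual behaviour:

from pathlib import Path
from string import Template


def render_template(template_path: str, output_path: str, **values: str) -> None:
    """Fill ${...} placeholders in a load.py/schema.py template and write the result."""
    text = Template(Path(template_path).read_text())
    Path(output_path).write_text(text.substitute(**values))


# Hypothetical usage at pack time:
# render_template(
#     "load.py.template",
#     "load.py",
#     STORAGE_ACCOUNT_NAME="myaccount",
#     STORAGE_CONTAINER_NAME="mycontainer",
#     DATA_FOLDER="sales-data",
# )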

Abstract away from pandera

Currently the implementation relies on pandera. This introduces some limitations:

  1. pandera does not support all dataframe engines (e.g. polars and vaex are currently not supported)
  2. pandera's support for example generation is limited in the case of ad-hoc constraints, and there is no easy or elegant way to work around it (see this discussion)

Because of this, I would like to make the dac core implementation independent from pandera, and instead provide a plugin option that accepts a pandera schema and adapts it to the new interface.

The problem is: what should the independent implementation look like?
We certainly want the following functionalities:

  • validation (types + custom constraints)
  • column names
  • example generation

Currently I am thinking of sticking with a Schema class that looks like this:

class Schema:
    # Column names are exposed as class attributes.
    col1 = "column_1_name"
    col2 = ...

    @classmethod
    def validate(cls, data: object) -> None:
        # Check column types and custom constraints, raising on failure.
        pass

    @classmethod
    def example(cls) -> object:
        # Return a small dataframe that satisfies the schema.
        pass
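For concreteness, a hypothetical pandas-only implementation of that interface (the column names, types, and constraints below are made up for illustration) might look like:

import pandas as pd


class UserSchema(Schema):
    user_id = "user_id"
    age = "age"

    @classmethod
    def validate(cls, data: object) -> None:
        # Types + custom constraints.
        assert isinstance(data, pd.DataFrame)
        assert {cls.user_id, cls.age} <= set(data.columns)
        assert pd.api.types.is_integer_dtype(data[cls.age])
        assert (data[cls.age] >= 0).all()

    @classmethod
    def example(cls) -> pd.DataFrame:
        # Example generation: a small frame that satisfies the constraints above.
        return pd.DataFrame({cls.user_id: [1, 2], cls.age: [30, 41]})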
