data-as-code / dac

Python Data as Code core implementation
License: MIT License
As a dac package user
I would like to have the possibility to pass optional arguments to the load() function
so that, for example, I can specify what kind of DataFrame engine I want to use.
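One way this could look is a minimal sketch, assuming a hypothetical `engine` keyword and loader registry (neither is part of dac today); placeholder strings stand in for real DataFrame objects to keep the example self-contained:

```python
from typing import Any, Callable, Dict


def _load_pandas(**kwargs: Any) -> str:
    # Placeholder for e.g. pd.read_parquet(...)
    return "pandas-dataframe"


def _load_dask(**kwargs: Any) -> str:
    # Placeholder for e.g. dd.read_parquet(...)
    return "dask-dataframe"


_LOADERS: Dict[str, Callable[..., Any]] = {"pandas": _load_pandas, "dask": _load_dask}


def load(engine: str = "pandas", **kwargs: Any) -> Any:
    """Dispatch to the requested engine, forwarding any extra options."""
    return _LOADERS[engine](**kwargs)
```

With this shape, `load()` keeps its zero-argument default while `load(engine="dask")` selects a different backend.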
Currently, when dac creates a package from load.py and schema.py, it exposes the load function and the Schema class, but any other object defined in those two modules is hidden under DAC_PKG_NAME._load and DAC_PKG_NAME._schema.

In many scenarios it is useful to have access to the other objects defined in those two modules. Therefore it would be better to make them accessible as DAC_PKG_NAME.load and DAC_PKG_NAME.schema.
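A sketch of what the generated package's `__init__.py` might contain under this proposal (the module names mirror the issue; this is an assumption about dac's packaging, not its current behavior):

```python
# Hypothetical DAC_PKG_NAME/__init__.py: re-export the producer's modules
# under their original names, so every object they define stays reachable
# as DAC_PKG_NAME.load.<obj> and DAC_PKG_NAME.schema.<obj>.
from . import load as load
from . import schema as schema
```

Note the naming tension this introduces: if `DAC_PKG_NAME.load` is the module, the top-level `load` function and `Schema` class would need to be reached as `DAC_PKG_NAME.load.load` and `DAC_PKG_NAME.schema.Schema`, or re-exported under different aliases.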
As dac needs to be installed along with all generated packages, it is crucial to keep the dependency list as short as possible.
Because of this, it would be very helpful to remove the unnecessary dependency on pydantic.
As a dac package producer
I would like to see example implementations of load.py and schema.py
so that I can understand what I am expected to implement.

Open question: is this sufficient? Perhaps some documentation is also needed?
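A minimal pair of example files could look like the sketch below, shown together so it stays self-contained. All column names and the record-based return type are illustrative assumptions; a real implementation would typically return a DataFrame.

```python
from typing import Dict, List

Records = List[Dict[str, float]]


# --- load.py ---
def load() -> Records:
    # A real load() would read from storage (parquet, SQL, ...).
    return [{"id": 1, "value": 10.0}, {"id": 2, "value": 20.0}]


# --- schema.py ---
class Schema:
    id = "id"
    value = "value"

    @classmethod
    def validate(cls, data: Records) -> Records:
        # A real schema would check types and constraints, not just presence.
        for row in data:
            assert {cls.id, cls.value} <= set(row)
        return data

    @classmethod
    def example(cls) -> Records:
        return [{cls.id: 0, cls.value: 0.0}]
```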
In most scenarios, a new data release corresponds to a minor version update.
It would be useful to have a way to achieve this with a dac command.
As of now I can think of a new subcommand that can be called like this:

dac get-next-minor-v --pkg-name PKG_NAME --major-v MAJOR

and that returns the next full minor version (e.g. 1.4.0).
Here is a sample script:
```python
import re
import subprocess

import typer
from rich import print

app = typer.Typer()


@app.command()
def main(
    library_name: str = typer.Option(..., envvar="DAC_PKG_NAME"),
    major: int = typer.Option(..., envvar="DAC_PKG_MAJOR"),
) -> str:
    """
    Determine what should be the next version of a library, assuming a minor version increase.
    """
    latest_version = determine_latest_version(library_name, major)
    next_version = increase_version_minor(latest_version, str(major))
    print(next_version)
    return next_version


def determine_latest_version(library_name: str, major: int) -> str:
    try:
        # A dry-run install resolves the latest published release for this major.
        output = subprocess.check_output(
            ["pip", "install", f"{library_name}=={major}.*", "--dry-run"],
            stderr=subprocess.DEVNULL,
        )
        last_line = output.decode("utf-8").splitlines()[-1]
        # pip normalizes underscores to dashes in distribution names.
        regex_rule = rf"{library_name.replace('_', '-')}-{major}\.[^ ]+"
        match = re.search(regex_rule, last_line)
        assert match is not None
        return match[0][len(f"{library_name}-") :]
    except Exception:
        # No release of this major version has been published yet.
        return "None"


def increase_version_minor(version: str, major: str = "0") -> str:
    if version == "None":
        return f"{major}.0.0"
    major, minor, _ = version.split(".")
    return f"{major}.{int(minor) + 1}.0"


if __name__ == "__main__":
    app()
```
As a dac package user
I would like to have the Schema.example() functionality guaranteed to work
so that I can trust it.
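One way such a guarantee could be enforced (an assumption, not current dac behavior) is for dac pack to run a check that the schema's own example passes its own validation. The Schema below is a stand-in with illustrative names:

```python
class Schema:
    col = "column_1_name"

    @classmethod
    def validate(cls, data: list) -> None:
        assert all(cls.col in row for row in data)

    @classmethod
    def example(cls) -> list:
        return [{cls.col: "a"}]


def check_example_is_valid(schema: type) -> bool:
    """Return True if the schema's example passes its own validation."""
    try:
        schema.validate(schema.example())
        return True
    except Exception:
        return False
```

Running this at pack time would turn "example() works" from a convention into a verified property of every published package.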
As a dac package producer
I would like to have the possibility to insert parameters in the load.py and schema.py files that will be filled when running dac pack
so that I can re-use the same templates.

```python
import dask.dataframe as dd


def load() -> dd.DataFrame:
    return dd.read_parquet(
        path="az://STORAGE_CONTAINER_NAME/DATA_FOLDER/*.parquet",
        storage_options={"account_name": "STORAGE_ACCOUNT_NAME", "anon": False},
    )
```

Here, STORAGE_ACCOUNT_NAME, STORAGE_CONTAINER_NAME, and DATA_FOLDER could be parametrized to allow re-usability.
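A minimal sketch of how `dac pack` could fill such placeholders, assuming $-style placeholders and the standard-library `string.Template` (dac might well choose a different templating mechanism):

```python
from string import Template

# Template fragment of load.py with the three parametrized values.
template = Template(
    'dd.read_parquet(path="az://$container/$folder/*.parquet", '
    'storage_options={"account_name": "$account", "anon": False})'
)

# At pack time, the concrete values would come from CLI options or config.
rendered = template.substitute(
    account="myaccount", container="mycontainer", folder="mydata"
)
```

`Template.substitute` raises `KeyError` on a missing parameter, which would surface an incomplete `dac pack` invocation early.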
Currently the implementation relies on pandera. This introduces some limitations:

Because of this, I would like to make the dac core implementation independent from pandera, and rather provide a plugin option that accepts a pandera schema and makes it fit the alternative requirements.

The problem is: what should the independent implementation look like? For sure we want the following functionalities:
Currently I am thinking to stick to a Schema class that looks like this:

```python
class Schema:
    col1 = "column_1_name"
    col2 = ...

    @classmethod
    def validate(cls) -> None:
        pass

    @classmethod
    def example(cls) -> object:
        pass
```
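The pandera plugin could then be an adapter producing such a class from a pandera schema. The sketch below is an assumption about that plugin's shape; a duck-typed stand-in replaces a real pandera `DataFrameSchema` (which exposes `validate()` and `example()`) so the example runs without the dependency:

```python
class FakePanderaSchema:
    """Stand-in for a pandera DataFrameSchema, to keep the sketch runnable."""

    def validate(self, data):
        return data

    def example(self, size=1):
        return [{"column_1_name": None}] * size


def schema_from_pandera(pandera_schema) -> type:
    """Wrap a pandera-style schema into the dac-independent Schema interface."""

    class Schema:
        col1 = "column_1_name"

        @classmethod
        def validate(cls, data) -> None:
            pandera_schema.validate(data)

        @classmethod
        def example(cls) -> object:
            return pandera_schema.example(size=1)

    return Schema
```

This keeps pandera out of the core: the core only ever sees a class with `validate` and `example`, and the adapter lives in an optional plugin.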