cdf's Introduction

Hi there 👋

  • 🔭 I'm currently working on managing data systems at scale to deliver maximum business value at work, and on open source work to maximize developer delight outside work!
  • 🌱 I'm currently learning more Rust
  • 👯 I'm looking to collaborate on anything dealing with analytical data systems
  • 💬 Ask me about Software Delivery
  • 📫 How to reach me: LinkedIn
  • 😄 Pronouns: He/Him
  • ⚡ Fun fact: I love martial arts

cdf's People

Contributors

z3z1ma

Forkers

cr-lough

cdf's Issues

Component names should be unique across workspaces when a project level sink exists

This is more a callout for myself, because the implications of a global sink are somewhat more nuanced than those of workspace-specific sinks.

In the presence of a global sink, enforcement necessarily becomes stricter. We could argue that this should always be enforced, but I am not so sure.

Given:

alexb.hackernews && tobym.hackernews

in their respective workspaces with isolated sinks, we don't care that they share a name. If the workspace-specific sinks actually point to the same place, then that is user error. Therefore it is safe to have the same component name, since workspace qualification is enough to delineate both components and physical destinations.

The only logical way to share a single physical destination between workspaces would be to use a project-level sink.
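
To make the stricter enforcement concrete, here is a minimal sketch in Python, assuming a hypothetical dict-of-workspaces shape (not cdf's actual internals):

from collections import Counter

# Hypothetical project shape for illustration.
workspaces = {
    "alexb": {"components": ["hackernews"]},
    "tobym": {"components": ["hackernews"]},
}

def validate_component_names(workspaces: dict, has_project_sink: bool) -> None:
    """Names must be globally unique only when a project-level sink exists;
    otherwise workspace qualification (alexb.hackernews vs tobym.hackernews)
    is enough to delineate components and physical destinations."""
    if not has_project_sink:
        return
    counts = Counter(
        name for ws in workspaces.values() for name in ws["components"]
    )
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(
            f"component names must be unique across workspaces "
            f"when a project-level sink exists: {dupes}"
        )

validate_component_names(workspaces, has_project_sink=True)  # raises on 'hackernews'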

Split out automatically aggregated metadata by sink name

Currently, cdf uses data-based configuration for sinks.

We are moving to Python-based configuration, which provides the actual objects and collects them into a dataclass to logically group them. Config injection means we are still leveraging data in our config file / env, but with drastically increased flexibility, and the concrete code makes sinks less ephemeral.
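
As a sketch of what that could look like, assuming sinks still pull credentials from the environment (the Sink dataclass and its fields here are illustrative, not cdf's actual API):

import os
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Sink:
    """Illustrative grouping of the concrete objects behind a sink.
    In practice these would be e.g. pipeline destination and sqlmesh
    gateway objects rather than plain dicts."""
    name: str
    destination: Any                              # hypothetical destination object
    gateway: dict = field(default_factory=dict)   # hypothetical gateway config

# Config injection: values still come from env / config files,
# but the sink itself is concrete, inspectable Python.
prod = Sink(
    name="prod",
    destination={"type": "snowflake", "credentials": os.environ.get("SNOWFLAKE_DSN")},
)
dev = Sink(
    name="dev",
    destination={"type": "duckdb", "path": "local.duckdb"},
)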

Our assumption right now is that a single sink is labelled prod, and that sink is exclusively used for automatic metadata generation.
This burdens us with a disconnected development experience: we must deploy to prod before we can properly generate metadata. That is not ideal; we should be able to generate metadata without deploying to prod.

Therefore we can solve this by having our metadata folder structure be:
<workspace>/metadata/<sink_name>/*.yaml

In this case a dedicated development sink which writes to duckdb can generate metadata (gitignored by the user?).
We can even promote the metadata eagerly, if useful, by copying auto-derived files over to the appropriate sink folder...
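
A minimal sketch of both the path resolution and the eager promotion (helper names here are hypothetical):

import shutil
from pathlib import Path

def metadata_dir(workspace: str, sink_name: str) -> Path:
    """Resolve <workspace>/metadata/<sink_name>/ per the proposed layout."""
    return Path(workspace) / "metadata" / sink_name

def promote_metadata(workspace: str, src_sink: str, dst_sink: str) -> None:
    """Promote auto-derived metadata by copying YAML files from one sink's
    folder (e.g. a gitignored duckdb dev sink) into another's."""
    src, dst = metadata_dir(workspace, src_sink), metadata_dir(workspace, dst_sink)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.glob("*.yaml"):
        shutil.copy2(f, dst / f.name)

# e.g. promote_metadata("my_workspace", "dev", "prod")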

Now we need to consider, I suppose, whether a single destination should still be considered prod within a workspace, such that we can use it for generate-staging-layer, or whether we should leave that up to the user. Perhaps up to the user is good here, in which case we do cdf generate-staging-layer <workspace>.<sink_name>. That is quite nice, since cdf will not delete staging models during this process, only add them, which lets us eagerly add models for more holistic PRs and workflows.

The ONLY consideration is that we cannot do sqlmesh plans when prod does not yet have the data, even if we have done absolutely everything end-to-end in dev + staging.

So a disjoint flow may be unavoidable: pipeline development + deployment to prod must precede model development.
Unless we can dynamically trim the transformation subgraphs, which we technically could with our custom Loader by grabbing the appropriate /metadata folder and pruning all models whose depends_on is not found upstream.
There are a fair number of ways to "break" this, but I think it might actually tackle a sufficient number of use cases to make it worth putting behind a flag. Epic indeed 😄
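
A sketch of that pruning, reducing each model to its depends_on set and iterating to a fixed point so that models depending on other pruned models are dropped too:

def prune_models(models: dict[str, set[str]], available: set[str]) -> dict[str, set[str]]:
    """Keep only models whose dependencies are all either available upstream
    (per the sink's /metadata folder) or produced by a surviving model."""
    kept = dict(models)
    while True:
        producible = available | set(kept)
        dropped = [m for m, deps in kept.items() if not deps <= producible]
        if not dropped:
            return kept
        for m in dropped:
            del kept[m]

# Illustrative: stg_github is pruned (raw.github not loaded yet), and mart
# follows because it depends on stg_github.
models = {
    "stg_hackernews": {"raw.hackernews"},
    "stg_github": {"raw.github"},
    "mart": {"stg_hackernews", "stg_github"},
}
print(prune_models(models, available={"raw.hackernews"}))
# {'stg_hackernews': {'raw.hackernews'}}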

Generate staging layer should be a GUI with realtime feedback as you transform the AST

With this, perhaps it can be generalized to any model.
So generate-staging-layer creates SQL files which just select all columns from the underlying table.
A subsequent command enters a GUI where functions can be applied and the output visually inspected; GPT annotator output can be cached for the duration of the process unless re-requested, since it applies to the root column.
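
Since sqlmesh builds on sqlglot, the per-column AST edit behind the GUI could look something like this sketch (the helper and example model are illustrative):

import sqlglot
from sqlglot import exp

# The generated staging model: a plain select-all-columns projection.
tree = sqlglot.parse_one("SELECT id, created_at, title FROM raw.hackernews")

def apply_fn(tree: exp.Expression, column: str, fn: str) -> exp.Expression:
    """Wrap one projected column in a function call, preserving its output
    name via an alias -- the kind of edit the GUI would apply and re-render."""
    def transform(node: exp.Expression) -> exp.Expression:
        if isinstance(node, exp.Column) and node.name == column:
            return exp.alias_(exp.func(fn, node.copy()), column)
        return node
    return tree.transform(transform)

print(apply_fn(tree, "title", "LOWER").sql())
# SELECT id, created_at, LOWER(title) AS title FROM raw.hackernews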
