cdf's Introduction

Hi there 👋

  • 🔭 I'm currently working on managing data systems at scale to deliver maximum business value at work, and on open source work to maximize developer delight outside work!
  • 🌱 I'm currently learning more Rust
  • 👯 I'm looking to collaborate on anything dealing with analytical data systems
  • 💬 Ask me about Software Delivery
  • 📫 How to reach me: LinkedIn
  • 😄 Pronouns: He/Him
  • ⚡ Fun fact: I love martial arts

cdf's People

Contributors

z3z1ma

Forkers

cr-lough

cdf's Issues

Component names should be unique across workspaces when a project level sink exists

This is more a callout for myself, because the implications of a global sink are somewhat more nuanced than those of workspace-specific sinks.

In the presence of a global sink, enforcement necessarily becomes stricter. We could argue that this should always be enforced, but I am not so sure.

Given:

alexb.hackernews && tobym.hackernews

in their respective workspaces with isolated sinks, we don't care that they share a name. If the workspace-specific sinks actually point to the same place, then that is user error. Therefore it is safe to have the same component name, since workspace qualification is enough to delineate both components and physical destinations.

The only logical way to share a single physical destination between workspaces would be to use a project-level sink.
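
To make the stricter enforcement concrete, here is a minimal sketch in Python, assuming a hypothetical dict-of-workspaces shape (not cdf's actual internals):

from collections import Counter

# Hypothetical project shape for illustration.
workspaces = {
    "alexb": {"components": ["hackernews"]},
    "tobym": {"components": ["hackernews"]},
}

def validate_component_names(workspaces: dict, has_project_sink: bool) -> None:
    """Names must be globally unique only when a project-level sink exists;
    otherwise workspace qualification (alexb.hackernews vs tobym.hackernews)
    is enough to delineate components and physical destinations."""
    if not has_project_sink:
        return
    counts = Counter(
        name for ws in workspaces.values() for name in ws["components"]
    )
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(
            f"component names must be unique across workspaces "
            f"when a project-level sink exists: {dupes}"
        )

validate_component_names(workspaces, has_project_sink=True)  # raises on 'hackernews'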

Split out automatically aggregated metadata by sink name

Currently, cdf uses data-based configuration for sinks.

We are moving to Python-based configuration, which provides the actual objects and collects them into a dataclass to logically group them. Config injection means we are still leveraging data in our config file / env, but with drastically increased flexibility, and the concrete code makes sinks less ephemeral.
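
As a sketch of what that could look like, assuming sinks still pull credentials from the environment (the Sink dataclass and its fields here are illustrative, not cdf's actual API):

import os
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Sink:
    """Illustrative grouping of the concrete objects behind a sink.
    In practice these would be e.g. pipeline destination and sqlmesh
    gateway objects rather than plain dicts."""
    name: str
    destination: Any                              # hypothetical destination object
    gateway: dict = field(default_factory=dict)   # hypothetical gateway config

# Config injection: values still come from env / config files,
# but the sink itself is concrete, inspectable Python.
prod = Sink(
    name="prod",
    destination={"type": "snowflake", "credentials": os.environ.get("SNOWFLAKE_DSN")},
)
dev = Sink(
    name="dev",
    destination={"type": "duckdb", "path": "local.duckdb"},
)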

Our assumption right now is that a single sink is labelled prod, and that sink is exclusively used for automatic metadata generation.
This burdens us with a disconnected development experience: we must deploy to prod before we can properly generate metadata. That is not ideal; we should be able to generate metadata without deploying to prod.

Therefore we can solve this by having our metadata folder structure be:
<workspace>/metadata/<sink_name>/*.yaml

In this case a dedicated development sink which writes to duckdb can generate metadata (gitignored by the user?).
We can even promote the metadata eagerly, if useful, by copying auto-derived files over to the appropriate sink folder...
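
A minimal sketch of both the path resolution and the eager promotion (helper names here are hypothetical):

import shutil
from pathlib import Path

def metadata_dir(workspace: str, sink_name: str) -> Path:
    """Resolve <workspace>/metadata/<sink_name>/ per the proposed layout."""
    return Path(workspace) / "metadata" / sink_name

def promote_metadata(workspace: str, src_sink: str, dst_sink: str) -> None:
    """Promote auto-derived metadata by copying YAML files from one sink's
    folder (e.g. a gitignored duckdb dev sink) into another's."""
    src, dst = metadata_dir(workspace, src_sink), metadata_dir(workspace, dst_sink)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.glob("*.yaml"):
        shutil.copy2(f, dst / f.name)

# e.g. promote_metadata("my_workspace", "dev", "prod")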

Now we need to consider, I suppose, whether a single destination should still be considered prod within a workspace, such that we can use it for generate-staging-layer, or whether we should leave that up to the user. Perhaps up to the user is good here, in which case we do cdf generate-staging-layer <workspace>.<sink_name>. That is quite nice, since cdf will not delete staging models during this process, only add them, which lets us eagerly add models for more holistic PRs and workflows.

The ONLY consideration is that we cannot do sqlmesh plans when prod does not yet have the data, even if we have done absolutely everything end-to-end in dev + staging.

So a disjoint flow may be unavoidable: pipeline development + deployment to prod must precede model development.
Unless we can dynamically trim the transformation subgraphs, which we technically could with our custom Loader by grabbing the appropriate /metadata folder and pruning all models whose depends_on is not found upstream.
There are a fair number of ways to "break" this, but I think it might actually tackle a sufficient number of use cases to make it worth putting behind a flag. Epic indeed 😄
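
A sketch of that pruning, reducing each model to its depends_on set and iterating to a fixed point so that models depending on other pruned models are dropped too:

def prune_models(models: dict[str, set[str]], available: set[str]) -> dict[str, set[str]]:
    """Keep only models whose dependencies are all either available upstream
    (per the sink's /metadata folder) or produced by a surviving model."""
    kept = dict(models)
    while True:
        producible = available | set(kept)
        dropped = [m for m, deps in kept.items() if not deps <= producible]
        if not dropped:
            return kept
        for m in dropped:
            del kept[m]

# Illustrative: stg_github is pruned (raw.github not loaded yet), and mart
# follows because it depends on stg_github.
models = {
    "stg_hackernews": {"raw.hackernews"},
    "stg_github": {"raw.github"},
    "mart": {"stg_hackernews", "stg_github"},
}
print(prune_models(models, available={"raw.hackernews"}))
# {'stg_hackernews': {'raw.hackernews'}}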

Generate staging layer should be a GUI with realtime feedback as you transform the AST

With this, perhaps it can be generalized to any model.
So generate-staging-layer creates SQL files which just select all columns from the underlying table.
A subsequent command enters a GUI where functions can be applied and the output visually inspected; GPT annotator output can be cached for the duration of the process unless re-requested, since it applies to the root column.
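
Since sqlmesh builds on sqlglot, the per-column AST edit behind the GUI could look something like this sketch (the helper and example model are illustrative):

import sqlglot
from sqlglot import exp

# The generated staging model: a plain select-all-columns projection.
tree = sqlglot.parse_one("SELECT id, created_at, title FROM raw.hackernews")

def apply_fn(tree: exp.Expression, column: str, fn: str) -> exp.Expression:
    """Wrap one projected column in a function call, preserving its output
    name via an alias -- the kind of edit the GUI would apply and re-render."""
    def transform(node: exp.Expression) -> exp.Expression:
        if isinstance(node, exp.Column) and node.name == column:
            return exp.alias_(exp.func(fn, node.copy()), column)
        return node
    return tree.transform(transform)

print(apply_fn(tree, "title", "LOWER").sql())
# SELECT id, created_at, LOWER(title) AS title FROM raw.hackernews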
