
Comments (7)

dosubot commented on June 12, 2024

🤖

It's great to see your enthusiasm for improving the functionality of Cosmos! Both of your implementation ideas sound promising and could definitely enhance the user experience for running individual model runs.

Implementing these ideas would certainly make it easier for users to work with Cosmos and would be a valuable addition to the platform. Your suggestions show a clear understanding of the current limitations and how to address them.

I encourage you to go ahead and create a pull request with your proposed changes. Your contributions will be greatly appreciated by the Cosmos community. Keep up the great work!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


from astronomer-cosmos.

dwreeves commented on June 12, 2024

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?

The pattern is basically this:

  • Create a normal Airflow DAG (not a cosmos.DbtDag) with a single dbt operator.
  • The DAG is parameterized for the --select and --full-refresh.
  • This operator should be able to run seeds, snapshots, models, tests, etc.

^ Those are the requirements of our little system for manually scheduling dbt nodes.

So how do you achieve this with the current Cosmos API? The answer right now is you need to subclass a dbt operator, and you'd still need to do this even after adding the template fields.

Alternate approaches and why they don't work:

  • You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.
  • You can use the dbt base operator and parametrize the base_cmd, but this means no --full-refresh.

So basically, the only way to meet the requirements of the system is to subclass. Template fields alone don't fulfill the requirements of the system.

I think the requirements of the system are reasonable, as per the notes in my original post. It will usually be the case that you are running model nodes, but not always. For example, a downstream system like a dashboard may select directly from a seed, and you need to update the seed mid-day. Or maybe, to save time and compute, your company has a policy of only running seeds manually. I don't know, but there are various reasons to want to run both seeds and models using the same parametrized DAG.


I'm not necessarily saying the dbt operator should have --full-refresh. That is one option though. Another option is to have an operator for dbt build.
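The trade-off described above can be sketched in plain Python. This is a minimal illustration, not Cosmos code: `dbt_argv` and `FULL_REFRESH_COMMANDS` are hypothetical names, and the set of commands treated as accepting `--full-refresh` is an assumption that should be checked against `dbt <command> --help` for the dbt version in use.

```python
from __future__ import annotations

# Assumption for illustration: subcommands treated as accepting --full-refresh.
# Verify against `dbt <command> --help` for your dbt version.
FULL_REFRESH_COMMANDS = {"run", "seed", "build"}


def dbt_argv(command: str, select: str | None = None, full_refresh: bool = False) -> list[str]:
    """Build a dbt command line from the two manual-run parameters."""
    argv = ["dbt", command]
    if select:
        argv += ["--select", select]
    if full_refresh:
        if command not in FULL_REFRESH_COMMANDS:
            raise ValueError(f"dbt {command} is not listed as accepting --full-refresh")
        argv.append("--full-refresh")
    return argv


# `run` only executes models, so a seed matched by --select would be left out.
# `build` covers seeds, models, snapshots, and tests in a single invocation.
print(dbt_argv("build", select="my_seed+", full_refresh=True))
```

With a single `build` entry point, the node type no longer has to be chosen up front, which is why that command fits the "one operator, any node" requirement.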


dwreeves commented on June 12, 2024

With 1.4.0, I think we have an acceptable solution that makes this easier for users, and this issue can probably be closed. I will keep it open for now because of one caveat.

The DbtBuildOperator with full_refresh and select as templated fields means that setting up a manual, parametrized DAG which runs arbitrary node types does not require any subclassing. (Users who also want to parametrize the command will still need to subclass, but that's fine.)

The rest of the work to set something like that up (mostly passing params={} and then having a task like DbtBuildOperator(select="{{ params.select }}")) is very idiomatic within the Airflow world and, in my opinion, does not require further, potentially un-idiomatic or obtrusive abstraction and simplification.
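One detail worth checking when wiring up templated parameters: Airflow renders templated fields to strings by default, unless the DAG is created with render_template_as_native_obj=True. The sketch below uses plain Python to show why that matters for a boolean such as full_refresh. `render` is a hypothetical stand-in for Jinja rendering and `to_bool` is a hypothetical helper; whether the operator performs this coercion itself is not something this sketch assumes.

```python
def to_bool(value) -> bool:
    """Coerce a rendered template value to a real boolean."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes"}


def render(params: dict) -> dict:
    """Stand-in for string-based template rendering: every value becomes a string."""
    return {key: str(value) for key, value in params.items()}


# Trigger-time params as a user might enter them when starting a manual run.
rendered = render({"select": "+my_models", "full_refresh": False})

# The rendered string "False" is truthy, so a plain truth test would be wrong.
assert bool(rendered["full_refresh"]) is True
assert to_bool(rendered["full_refresh"]) is False

argv = ["dbt", "build", "--select", rendered["select"]]
if to_bool(rendered["full_refresh"]):
    argv.append("--full-refresh")
print(argv)
```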

Here's the caveat for maybe why this issue should stay open, or perhaps more appropriately this issue gets closed and we open a new issue: I do think that the documentation should document this pattern, since it is not clear to users (1) that they should even do it in the first place, and (2) how to do it, if they are new to Airflow and/or dbt. As I've mentioned elsewhere, there is not a great place to put something like this in the docs, which is part of the issue. Once this is documented, I would consider the issue fully complete.


dwreeves commented on June 12, 2024

  1. An implementation note: this would require running multiple dbt nodes in a single operator. This is feasible, but not a typical pattern in the Cosmos context. It also makes the threads arg in the profiles matter a little more than it currently does.

  2. Come to think of it, the first option would require a "build" operator, or something else, to work nicely. A big reason why is because you cannot just access --full-refresh via the base operator (unless that API decision were to change), meaning a single static operator type cannot handle everything users might want to run. The exception is the dbt build command, which doesn't have an associated operator.


joppevos commented on June 12, 2024

Great description, @dwreeves. I am really interested in this problem! For me, the addition I wrote in #623 has been usable in production. However, I agree that we can improve it even further. I prefer the first solution, as abstracting the parameters into their own DAG seems to offer limited benefit to the end user.

the first option would require a "build" operator, or something else, to work nicely. A big reason why is because you cannot just access --full-refresh via the base operator

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?


joppevos commented on June 12, 2024

What do you think of keeping the operations separated?
More like the example below (I need to check how to get this working):

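from airflow import DAG
from airflow.models.param import Param
from cosmos import DbtTaskGroup
from cosmos.operators.local import DbtSeedLocalOperator

# Proposal sketch only: CURRENT_DIR and profile_config are defined elsewhere.
with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "select": Param(default="+my_models"),
    },
) as dag:
    seed = DbtSeedLocalOperator(
        task_id="seed",
        project_dir=CURRENT_DIR,
        profile_config=profile_config,
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}",
    )

    # A task group is identified by group_id rather than task_id.
    tg = DbtTaskGroup(
        group_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}",
    )
    # Airflow sets task order with >>, not >.
    seed >> tg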

You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.

This could run whatever operators the person has within the DAG. Then the user does not need to provide a parametrized base_cmd.

A user gives the two requested parameters in the Airflow console. The DAG will still be rendered with all nodes, but would only run the selected nodes. If the select contains a seed, then it will run the seed. Otherwise dbt will skip over it. In my case, I have only seen users wanting to run dbt run from the console, mostly to backfill.

Do you see a strong case for having other dbt commands triggered from the console?


dwreeves commented on June 12, 2024

Sorry, just getting back to this since I am looking to just implement the templating of these fields.

@joppevos A note: the issue with that code example is that you cannot parametrize the task group. Although Airflow supports dynamically generated tasks, such tasks cannot have inter-dependencies. You could write code that dynamically generates all nodes selected via --select mymodel+, but the nodes would not have proper dependencies among each other.
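To make the dependency point concrete, here is a minimal sketch in plain Python (not Cosmos or Airflow code). The project graph `PARENTS` and the helper `select_with_descendants` are hypothetical. It shows what a selector like mymodel+ expands to and the order those nodes would need to run in, which is the information a flat set of dynamically generated tasks would lose.

```python
from graphlib import TopologicalSorter

# Hypothetical dbt project: each node maps to the nodes it depends on.
PARENTS = {
    "my_seed": set(),
    "mymodel": {"my_seed"},
    "child_a": {"mymodel"},
    "child_b": {"mymodel", "child_a"},
    "unrelated": set(),
}


def select_with_descendants(node: str) -> set[str]:
    """Return `node` plus everything downstream of it, like `--select node+`."""
    selected = {node}
    changed = True
    while changed:
        changed = False
        for name, parents in PARENTS.items():
            if name not in selected and parents & selected:
                selected.add(name)
                changed = True
    return selected


selected = select_with_descendants("mymodel")
# Keep only edges between selected nodes, then order them for execution.
subgraph = {name: PARENTS[name] & selected for name in selected}
order = list(TopologicalSorter(subgraph).static_order())
print(order)
```

Here child_b must wait for child_a, which must wait for mymodel. Running the three as independent tasks would drop exactly that ordering.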

You can have both a seed operator and a model run operator and pass the same {{ params.select }} to both, and that is supported; alternatively, you can just have two separate DAGs. Or just do what I am currently doing and subclass for dbt build. Those are all valid options.

I do feel that the most sensible way to have a manual run that triggers on multiple nodes, and multiple node types in a one-shot fashion that doesn't require subclassing is to have an operator that supports dbt build. I also do feel this is a good pattern, and I would advocate for it. But I also understand that this is a little more on the niche side, since an operator for dbt build would not be used anywhere other than for this specific use case.


I'm going to split template fields and either dbt build or a "DbtParametrizedDag" / "DbtBackfillDag" into separate PRs. I think adding template fields should be uncontroversial. I would like maintainers to give their $0.02 on the other thing regarding manual run patterns, as needing to manually run comes up a lot in practical settings, and needing to manually run seeds or snapshots specifically is rarer but still does happen.
