Comments (7)
🤖
It's great to see your enthusiasm for improving the functionality of Cosmos! Both of your implementation ideas sound promising and could definitely enhance the user experience for running individual model runs.
Implementing these ideas would certainly make it easier for users to work with Cosmos and would be a valuable addition to the platform. Your suggestions show a clear understanding of the current limitations and how to address them.
I encourage you to go ahead and create a pull request with your proposed changes. Your contributions will be greatly appreciated by the Cosmos community. Keep up the great work!
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from astronomer-cosmos.
> Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?
The pattern is basically this:
- Create a normal Airflow DAG (not a `cosmos.DbtDag`) with a single dbt operator.
- The DAG is parameterized for the `--select` and `--full-refresh` options.
- This operator should be able to run seeds, snapshots, models, tests, etc.
^ Those are the requirements of our little system for manually scheduling dbt nodes.
So how do you achieve this with the current Cosmos API? The answer right now is you need to subclass a dbt operator, and you'd still need to do this even after adding the template fields.
Alternate approaches and why they don't work:
- You can use a dbt run operator, but this means you cannot run other node types, like seeds, that you may want to trigger.
- You can use the dbt base operator and parametrize the `base_cmd`, but this means no `--full-refresh`.
So basically, the only way to meet the requirements of the system is to subclass. Template fields alone don't fulfill the requirements of the system.
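To make the gap concrete, here is a framework-free sketch of the dbt CLI invocations each approach can produce. This is not Cosmos code; the function name and signature are invented purely for illustration:

```python
def build_dbt_cli(command, select=None, full_refresh=False):
    """Assemble a dbt CLI invocation as an argument list.

    `command` plays the role of a parametrized base_cmd;
    `full_refresh` is the flag the base operator does not expose.
    """
    args = ["dbt", command]
    if full_refresh:
        args.append("--full-refresh")
    if select:
        args += ["--select", select]
    return args

# A parametrized base_cmd alone cannot emit --full-refresh:
print(build_dbt_cli("seed"))  # ['dbt', 'seed']

# dbt build covers seeds, snapshots, models, and tests in one command:
print(build_dbt_cli("build", select="+my_model", full_refresh=True))
# ['dbt', 'build', '--full-refresh', '--select', '+my_model']
```

The point of the sketch: only a command that accepts both `--select` and `--full-refresh` and runs every node type (i.e. `dbt build`) satisfies all three requirements without subclassing.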
I think the requirements of the system are reasonable, as per the notes in my original post. It will usually be the case that you are running model nodes, but not always; for example, sometimes a downstream system like a dashboard may be selecting directly from a seed and you need to update the seed mid-day. Or maybe, to save on time and compute, your company has a policy of only running seeds manually. I don't know, but there are various reasons to want to run both seeds and models using the same parametrized DAG.
I'm not necessarily saying the dbt operator should have `--full-refresh`. That is one option, though. Another option is to have an operator for `dbt build`.
With 1.4.0, I think we have an acceptable solution that makes this easier for users, and this issue can probably be closed, although I will keep it open for now, with one caveat.
The `DbtBuildOperator` with `full_refresh` and `select` as templated fields means that setting up a manual, parametrized DAG which runs arbitrary node types does not require any subclassing. (Users who also want to parametrize the command will still need to subclass, but that's fine.)
The rest of the work to set something like that up (mostly passing `params={}` and then having a task like `DbtBuildOperator(select="{{ params.select }}")`) is very idiomatic within the Airflow world and, in my opinion, does not require further, potentially un-idiomatic or obtrusive abstraction and simplification.
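For readers new to Airflow, the templated-field mechanism being relied on here can be sketched with a minimal stand-in. The class below is a toy renderer, not Airflow's actual Jinja engine: the contract is that any attribute listed in `template_fields` has its `{{ params.* }}` placeholders substituted from the DAG run's params before the task executes.

```python
class ToyTemplatedOperator:
    # Mimics Airflow's template_fields contract: only attributes
    # listed here are rendered against the run-time context.
    template_fields = ("select", "full_refresh")

    def __init__(self, select, full_refresh):
        self.select = select
        self.full_refresh = full_refresh

    def render(self, params):
        # Toy stand-in for Airflow's Jinja rendering of "{{ params.x }}".
        for field in self.template_fields:
            value = getattr(self, field)
            for key, param in params.items():
                value = value.replace("{{ params.%s }}" % key, str(param))
            setattr(self, field, value)


op = ToyTemplatedOperator(
    select="{{ params.select }}",
    full_refresh="{{ params.full_refresh }}",
)
# In Airflow this substitution happens automatically from the
# values the user enters when triggering the DAG.
op.render({"select": "+my_model", "full_refresh": "true"})
print(op.select)  # +my_model
```

In real Airflow the rendering is done by Jinja against the full task context; the sketch only shows why exposing `select` and `full_refresh` as template fields is enough to make the operator parametrizable from the UI.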
Here's the caveat for why this issue should maybe stay open (or, perhaps more appropriately, this issue gets closed and we open a new one): I do think the documentation should cover this pattern, since it is not clear to users (1) that they should do it in the first place, and (2) how to do it, if they are new to Airflow and/or dbt. As I've mentioned elsewhere, there is not a great place to put something like this in the docs, which is part of the issue. Once this is documented, I would consider the issue fully complete.
- An implementation note: this would require running multiple dbt nodes in a single operator. This is feasible, but not a typical pattern in the Cosmos context. This makes the `threads` arg in the profiles matter a little more than it currently does.
- Come to think of it, the first option would require a "build" operator, or something else, to work nicely. A big reason why is that you cannot just access `--full-refresh` via the base operator (unless that API decision were to change), meaning a single static operator type cannot handle every possible thing users would want to run, except for the `dbt build` command, which doesn't have an associated operator.
Great description @dwreeves. I am really interested in this problem! For me, the addition I wrote in #623 has been usable in production. However, I agree that we can improve it even further. I prefer the first solution, as abstracting the parameters into their own DAG seems of limited benefit to the end user.
> the first option would require a "build" operator, or something else, to work nicely. A big reason why is because you cannot just access --full-refresh via the base operator

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?
from astronomer-cosmos.
What do you think of keeping the operations separated? More like the example below (I need to check how to get this working):
```python
with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "select": Param(default="+my_models"),
    },
) as dag:
    seed = DbtSeedLocalOperator(
        task_id="seed",
        project_dir=CURRENT_DIR,
        profile_config=profile_config,
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}",
    )
    tg = DbtTaskGroup(
        task_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}",
    )
    seed >> tg  # Airflow's dependency operator is >>, not >
```
> You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.
This could run whatever operators the person has within the DAG. Then the user does not need to provide a parametrized `base_cmd`.
A user gives the two requested parameters in the Airflow console. The DAG will still be rendered with all nodes, but would only run the selected nodes. If the select contains a seed, then it will run the seed; otherwise dbt will skip over it. In my case, I have only experienced users wanting to run `dbt run` from the console, mostly to backfill.
Do you feel there is a strong case for having other dbt commands be triggered from the console?
Sorry, just getting back to this, since I am looking to implement the templating of these fields.
@joppevos A note: the issue with that code example is that you cannot parametrize the task group. Although Airflow supports dynamically generated tasks, such tasks cannot have inter-dependencies. You could write code that dynamically generates all nodes selected via `--select mymodel+`, but the nodes would not have proper dependencies among each other.
You can have both a seed operator and a model run operator and pass the same `{{ params.select }}` to both, and that does work; alternatively, you can just have two separate DAGs. Or just do what I am currently doing and subclass for `dbt build`. Those are all valid options.
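The subclassing workaround mentioned above looks roughly like the sketch below. The base class here (`FakeDbtBaseOperator`) is a hypothetical stand-in so the example is self-contained and runnable; the real version would subclass Cosmos's local base operator, whose exact attribute and hook names may differ:

```python
class FakeDbtBaseOperator:
    """Hypothetical stand-in for Cosmos's dbt base operator."""
    base_cmd: list = []  # subcommand, e.g. ["run"] or ["seed"]

    def __init__(self, select=None):
        self.select = select

    def build_cmd(self):
        # Assemble the dbt invocation from the subcommand, extra
        # flags contributed by subclasses, and the selector.
        cmd = ["dbt"] + self.base_cmd + self.add_cmd_flags()
        if self.select:
            cmd += ["--select", self.select]
        return cmd

    def add_cmd_flags(self):
        return []


class DbtBuildOperator(FakeDbtBaseOperator):
    """Subclass adding `dbt build` plus a --full-refresh flag."""
    base_cmd = ["build"]
    # Expose both knobs as (Airflow-style) templated fields.
    template_fields = ("select", "full_refresh")

    def __init__(self, full_refresh=False, **kwargs):
        super().__init__(**kwargs)
        self.full_refresh = full_refresh

    def add_cmd_flags(self):
        return ["--full-refresh"] if self.full_refresh else []


print(DbtBuildOperator(select="+my_model", full_refresh=True).build_cmd())
# ['dbt', 'build', '--full-refresh', '--select', '+my_model']
```

Because `dbt build` runs seeds, snapshots, models, and tests in dependency order, this one subclass covers every node type the parametrized DAG needs to trigger.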
I do feel that the most sensible way to have a manual run that triggers multiple nodes, and multiple node types, in a one-shot fashion without subclassing is to have an operator that supports `dbt build`. I also feel this is a good pattern, and I would advocate for it. But I understand that this is a little more on the niche side, since an operator for `dbt build` would not be used anywhere other than for this specific use case.
I'm going to split template fields and either `dbt build` or a "DbtParametrizedDag" / "DbtBackfillDag" into separate PRs. I think adding template fields should be uncontroversial. I would like maintainers to give their $0.02 on the other thing regarding manual-run patterns, as needing to manually run comes up a lot in practical settings, and needing to manually run seeds or snapshots specifically is rarer but still happens.