
dbt_artifacts's People

Contributors

adamantike, ahmedrad, alanmcruickshank, bendiktv2, calkan-cue, charles-astrafy, clrcrl, danthelion, davegot12, ehmartens, gaurav-atheon, glsdown, iribarnesy, jared-rimmer, jecolvin, kamilamarcinekpgs, lisandrococca, lukass29, ma1ster, mhmtsrmn, nbandre, nhrebinka, niallrees, poloaraujo, rtkfan, samw430, sptkl, tannerhopkins, whiteh, wintersrd


dbt_artifacts's Issues

I am not able to use Custom Schema for dbt-artifacts package

I am not able to configure a custom schema. Can somebody please help with this?

The public schema is still being used when I run dbt run -m dbt_artifacts.

I have set the dbt_artifacts schema name and materialization, but the models are still being built under the public schema.
My configuration in my root dbt project (dbt_project.yml) is below:

# DBT_ARTIFACTS ----------------------------------------
    dbt_artifacts:
      +schema: dbt_artifacts
      +materialized: table
      incremental:
        +schema: dbt_artifacts
        +materialized: incremental
      staging:
        +schema: dbt_artifacts
        +materialized: view
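For reference, a minimal sketch of how this block would normally nest inside dbt_project.yml, assuming the package's default sub-folder names; everything outside the keys already shown above is illustrative. Note also that dbt's default generate_schema_name concatenates a custom schema onto the target schema (e.g. public_dbt_artifacts) rather than replacing it, which may be part of what is being observed.

    # Hypothetical sketch - the package config has to sit under the top-level models: key
    models:
      dbt_artifacts:
        +schema: dbt_artifacts
        +materialized: table
        incremental:
          +schema: dbt_artifacts
          +materialized: incremental
        staging:
          +schema: dbt_artifacts
          +materialized: view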

Improve Performance of Source/Model/Exposure extraction

With ~1 year of historical manifest.json and run_results.json data, we have started experiencing timeouts running --full-refresh of dbt_artifacts.


2021-04-15 15:43:49,243 - root - INFO - Database Error in model dim_dbt__sources (models/incremental/dim_dbt__sources.sql)
2021-04-15 15:43:49,243 - root - INFO - 000630 (57014): Statement reached its statement or warehouse timeout of 1,200 second(s) and was canceled.

Possible solutions:

  • Archive old data.
  • Replace the 3 individual models that extract Sources, Models and (soon) Exposures into 1 wide table, so json extraction happens once per artefact, rather than at least 3 times.
  • Snapshot the extracted json(?)
  • Unpack all required fields in the COPY command executed by either the run-operation or by Snowpipe. This is probably the most performant but least flexible. What happens when we want to extract more details from previous runs? Gets a bit difficult 😬 It does however solve the case where JSON objects become too big to fit into a single VARIANT (16mb).
  • Do something with the orchestration layer to avoid ever running full-refresh on these models - effectively relying on incremental models to only extract JSON values once per artefact. We'd probably use tags.

0.7.0 version - Unique test failures on dim and fct models

We are seeing the below test failures on dbt_artifacts models.

Failure in test unique_dim_dbt__current_models_manifest_model_id (models/schemas.yml)
Failure in test unique_dim_dbt__models_manifest_model_id (models/schemas.yml)
Failure in test unique_dim_dbt__sources_manifest_source_id (models/schemas.yml)
Failure in test unique_fct_dbt__critical_path_node_id (models/schemas.yml)
Failure in test unique_fct_dbt__latest_full_model_executions_node_id (models/schemas.yml)
Failure in test unique_fct_dbt__model_executions_model_execution_id (models/schemas.yml)

On investigation, we found that the inner join below, added as a fix for issue #75, is doing a one-to-many join and causing entire-row duplicates.

-- #75 inner join run_results on dbt_sources.artifact_run_id = run_results.artifact_run_id )

This is because we upload the run results after every dbt invocation, so we end up with multiple run_results artifacts and a single manifest. If we change the join to a where exists, this duplication will be resolved, since the idea is only to enforce existence and not pull any other fields from the inner-joined view.
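A minimal sketch of that rewrite, reusing the CTE names from the quoted join (dbt_sources and run_results):

    -- Hypothetical sketch: enforce that a matching run_results row exists
    -- without multiplying rows, in place of the inner join added for #75.
    select dbt_sources.*
    from dbt_sources
    where exists (
        select 1
        from run_results
        where run_results.artifact_run_id = dbt_sources.artifact_run_id
    )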

We have fixed it for the time being by pinning to version 0.6.0.

Note: We are running this on a single tenancy dbt cloud

Support tap-dbt-artifacts

I think a really interesting use case for Meltano users would be to pair dbt_artifacts with https://github.com/prratek/tap-dbt-artifacts

Just from eyeballing the staging models in dbt_artifacts and comparing them with the schemas generated by tap-dbt-artifacts, it looks like it should be pretty easy to get these two talking to each other. I can imagine a few potential approaches, but one would be to create staging tables that shape the tap-dbt-artifacts-generated tables into the form expected by the sources defined at https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/sources.yml

I'm interested in taking a stab at this but would be curious for feedback on best approach from the perspective of this package's maintainers.

Make failures visible in fct_dbt__test_executions

Right now none of the models gives information about the number of failures for a given test. If we write a test that returns a number of faulty records, then having this information in a model would open up the possibility of presenting it on some kind of dashboard.

in stg_dbt__node_executions.sql:
surrogate_key as (

select
    {{ dbt_utils.surrogate_key(['fields.command_invocation_id', 'fields.node_id']) }} as node_execution_id,
    fields.command_invocation_id,
    fields.dbt_cloud_run_id,
    fields.artifact_run_id,
    fields.artifact_generated_at,
    fields.was_full_refresh,
    fields.node_id,
    base_nodes.resource_type,
    split(fields.result_json:thread_id::string, '-')[1]::integer as thread_id,
    fields.status,
    fields.result_json:message::string as message,
    fields.compile_started_at,
    fields.query_completed_at,
    fields.total_node_runtime,
    fields.result_json:adapter_response:rows_affected::int as rows_affected,
    fields.result_json:failures::int as failures,
    fields.result_json
from fields
-- Inner join so that we only represent results for nodes which definitely have a manifest
-- and visa versa.
inner join base_nodes on (
    fields.artifact_run_id = base_nodes.artifact_run_id
    and fields.node_id = base_nodes.node_id)

)

in fct_dbt__test_executions.sql:

fields as (

select
    node_execution_id as test_execution_id,
    command_invocation_id,
    dbt_cloud_run_id,
    artifact_run_id,
    artifact_generated_at,
    was_full_refresh,
    node_id,
    thread_id,
    status,
    compile_started_at,
    query_completed_at,
    total_node_runtime,
    rows_affected,
    failures
from test_executions_incremental

)

If you could add this, that would be awesome :)

Clarify purpose of fct_dbt__latest_full_model_executions

This is a question raised by #75. Incidentally, my suggestion here would be an alternate solution, which would fix the bug in that issue.

The documented intent of the fct_dbt__latest_full_model_executions model is:

A list of all models and execution times from the most recent, incremental run.

Our setup has a segmented approach to refreshing different marts, and so the most recent run overall will almost never contain all models. I'd like to raise the possibility of changing how this model works - or at least clarifying its purpose. In my mind there are two options:

  1. [and this is closest to the current implementation] This model has the runtimes of all the models which ran in the most recent incremental run.
  2. [and this would be what I think might be more useful] This model has the runtimes of all the models in the project, based on the most recent run in which they existed.

The second interpretation would be much more useful in our context, but would be a change in the intent of what's going on. For any users who refresh the whole project monolithically - the main difference would be that any models which had been removed from the project would still appear in this model, just with a much older "last run" timestamp.

@NiallRees - do you think this change makes sense, or should we make a new model for this purpose?

Proposal: Add exposure, source, and model metadata to `dim_` tables

I would like to be able to leverage the package to present data source information in downstream tools, including object descriptions, tags, and metadata.

Proposal would be to add the following columns to dim_dbt__sources, dim_dbt__models and dim_dbt__exposures:

  • meta (variant type)
  • description (text)
  • tags (array type)
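A minimal sketch of how these might be unpacked in the staging layer, assuming the raw manifest node JSON is exposed as a node_json variant column (the table and column names here are assumptions):

    -- Hypothetical sketch: pulling the proposed columns out of the manifest node JSON.
    select
        node_id,
        node_json:description::string as description,
        node_json:meta as meta,  -- variant
        node_json:tags as tags   -- array
    from stg_dbt__nodes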

Has this been considered before? I may have some time to contribute this if it's desired.

dim_dbt__exposures.manifest_model_id is not unique

The unique test for the field dim_dbt__exposures.manifest_model_id is failing when depends_on_nodes has more than one node. It seems like for this model, the depends_on shouldn't get expanded, and there should be a separate model to cross-reference exposures and their depends-on models?

Alternatively, the test could be switched to a dbt_utils.unique_combination_of_columns test with manifest_model_id and output_feeds.

I'm glad to contribute to this if you let me know which way you want it to go - split out output_feeds or keep it in and switch the test.

Proposal: Give seed and snapshot executions their own `fct_` models

It appears that nodes where resource_type = 'seed' and resource_type = 'snapshot' are excluded from this package, although these are important operations to track. I would propose adding these as additional dim_ and fct_ tables to complement the existing test_executions and model_executions models.

The reason these are currently being excluded is this line in stg_dbt__model_executions:

dbt_run as (

    select *
    from run_results
    where data:args:which = 'run'

),

Filtering on the run argument will filter out dbt snapshot and dbt seed calls.
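A minimal sketch of one way the filter could be widened so that seed and snapshot results remain available for their own fct_ models (the CTE names mirror the snippet above; how the results are then split into models is left open):

    -- Hypothetical sketch: keep each resource's executions instead of only 'run'.
    dbt_seed as (

        select *
        from run_results
        where data:args:which = 'seed'

    ),

    dbt_snapshot as (

        select *
        from run_results
        where data:args:which = 'snapshot'

    ),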

Overcome 16MB variant limit in Snowflake

Some users with very large projects have encountered issues in the https://github.com/brooklyn-data/dbt_artifacts/blob/main/macros/upload_artifacts.sql#L25 copy into step, where the manifest.json is too large to load into a variant column. We currently use a single table for all artifact types, and each artifact becomes a single row.

We can solve this issue by flattening the data before loading into a Snowflake table. The schema for each artifact differs at the first level of keys however, and so we will need a separate table for each artifact.

Transformations are heavily restricted in the existing copy into command (Snowflake docs). If we instead use an insert into command to load from stage to table, we can run any select statement needed without any transformation restrictions. For example, an illustrative command similar to https://docs.snowflake.com/en/user-guide/data-load-transform.html#flatten-semi-structured-data:

insert into {{ src_dbt_artifacts_manifest }} (name, unique_id, package_name, path, value, generated_at, stage_filename, artifact_type)
    select
        nodes.value:name::string as name,
        nodes.key as unique_id,
        nodes.value:package_name::string as package_name,
        nodes.value:path::string as path,
        nodes.value as value,
        artifacts.$1:metadata:generated_at::timestamp_ntz as generated_at,
        metadata$filename as stage_filename,
        regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
    from @{{ src_dbt_artifacts }} as artifacts,
        lateral flatten(input => artifacts.$1:nodes) as nodes
    where regexp_substr(metadata$filename, '([a-z_]+.json)') = 'manifest.json'

Different ids on upload? V2

I ran dbt build followed by dbt run-operation upload_dbt_artifacts_v2, four times, starting from empty tables.

The results from this query:

select distinct
      artifact_run_id, command_invocation_id, 'dbt_run_results_nodes' as _source 
from dbt_artifacts.dbt_run_results_nodes

union all select distinct
    artifact_run_id, command_invocation_id, 'dbt_run_results' as _source
from dbt_artifacts.dbt_run_results

union all select distinct
    artifact_run_id, command_invocation_id, 'dbt_manifest_nodes' as _source
from dbt_artifacts.dbt_manifest_nodes

order by  _source, command_invocation_id

show that the ids from the manifest do not match those from the run_results nodes.

The result is empty tables after running the models from dbt_artifacts.

Implement and document usage for dbt Cloud

A few questions have come through on the dbt Slack workspace about deployment on dbt Cloud relating to artefact uploads. It would be useful to have a definitive method for deployment on dbt Cloud, in addition to the manual upload and CLI run-operation options.

As I understand it, the key issue revolves around the invocation_id generated by dbt and stored in run_results.json and manifest.json. This is the same for a given command; however, run-operation (used to call the macro to upload artefacts) creates its own manifest.json, replacing the one from any previous operation 🤦‍♂️ This results in dissimilar invocation_ids in run_results.json and manifest.json, breaking functionality that relies on joining the two artefacts.

For those calling run-operation via the CLI, the --no-write-json global flag is reported to solve this particular issue. It would also be possible to configure an env_var prefixed with DBT_ENV_CUSTOM_ENV_ that is injected into each artefact as per the docs. These vars would appear in stg_dbt__run_results_env_keys and could be used in other artefacts to correlate artefacts from the same 'run' (this is not implemented in dbt_artifacts but could be if it was of use).

However, dbt Cloud supports neither --no-write-json nor custom env_vars 🤔 dbt Cloud does, however, provide env_vars with job and run ids that could potentially be used for this purpose. I cannot determine whether these make it into the artefacts or not. If they don't, an alternative approach might be to inject them directly via the upload_artifacts copy query. Something like:

begin;
        copy into {{ src_dbt_artifacts }} from
            (
                select
                    $1 as data,
                    $1:metadata:generated_at::timestamp_ntz as generated_at,
                    metadata$filename as path,
                    regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type,
                    -- dbt cloud env_vars
                    {{ env_var('DBT_CLOUD_PROJECT_ID', 'not on cloud') }} as dbt_cloud_project_id,
                    {{ env_var('DBT_CLOUD_JOB_ID', 'not on cloud') }} as dbt_cloud_job_id,
                    {{ env_var('DBT_CLOUD_RUN_ID', 'not on cloud') }} as dbt_cloud_run_id,
                    {{ env_var('DBT_CLOUD_RUN_REASON_CATEGORY', 'not on cloud') }} as dbt_cloud_run_reason_category,
                    {{ env_var('DBT_CLOUD_RUN_REASON', 'not on cloud') }} as dbt_cloud_run_reason,
                    {{ env_var('DBT_CLOUD_PR_ID', 'not on cloud') }} as dbt_cloud_pr_id,
                    {{ env_var('DBT_CLOUD_GIT_SHA', 'not on cloud') }} as dbt_cloud_git_sha
                from  @{{ src_dbt_artifacts }}
            )
            file_format=(type='JSON')
            on_error='skip_file';
        commit;

Help, testing and comments on this very welcome 🙏

Make use of artifacts produced by executing dbt test

Hey there!
After analysing dbt models from this package I came to the conclusion that all artifacts that are produced by running dbt test (or, simply representing tests in manifest.json with resource_type = 'test') are not being used at all.
In all stg models the filter is either data:args:which = 'run' for run_results.json or node.value:resource_type = 'model' for manifest.json.
Hence, if my assumption above is correct, running

$ dbt test
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [run_results]}'

will load run_results.json into Snowflake, but it won't be picked up by any of the downstream models (except for stg_dbt__artifacts, but it stops there).
Long story short, I suggest adding the following to the package:

  1. Add a 'which run' column (name TBD) to the source table, populated from data:args:which, which would allow distinguishing between the artifacts produced - this can also be taken from the data column, so it's a nice-to-have.
  2. Build some models to analyse the results produced by dbt test - initially there could be a model called stg_dbt__test_executions, and then some other models built on top of it (a minimal starting sketch follows this list).
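A minimal sketch of what such a staging model could start from, assuming the same run_results source CTE used by the existing stg models:

    -- Hypothetical sketch: keep only run_results rows produced by `dbt test`.
    dbt_test as (

        select *
        from run_results
        where data:args:which = 'test'

    ),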

Ideally, none of us would want any tests failing in production. However, in the real world they sometimes do, and it would be nice to know which ones, since when, some SLA-related metrics, etc.

Let me know what you think of this :)

Wrapper macro to upload artifacts by command

Right now, users upload artifacts by specifying which artifacts to upload (manifest, catalog, etc.). It can be a pain to remember which artifacts to upload after each dbt command.

I'd be interested in seeing a wrapper macro so I could upload artifacts via:

dbt run-operation upload_dbt_artifacts_by_command --args '{command: run}'

This would just be a convenience feature that would map the command to the appropriate artifacts and run the existing macro to upload them. You could still upload specific artifacts using the existing macro.
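A minimal Jinja sketch of what that wrapper might look like, assuming the existing upload_dbt_artifacts macro and its filenames argument; the command-to-artifact mapping is purely illustrative:

    {# Hypothetical sketch: map a dbt command to the artifacts it produces and
       delegate to the existing upload macro. #}
    {% macro upload_dbt_artifacts_by_command(command) %}
        {% set artifact_map = {
            'run': ['manifest', 'run_results'],
            'test': ['run_results'],
            'seed': ['manifest', 'run_results'],
            'snapshot': ['manifest', 'run_results']
        } %}
        {{ upload_dbt_artifacts(filenames=artifact_map[command]) }}
    {% endmacro %}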

I'm open to contributing if you're interested in adding this and don't want to spend the time on your end.

Documentation and Source Versioning out of Sync

The latest README says to update to 0.8, which is still in pre-release.

All references throughout commits and documentation refer to macros upload_dbt_artifacts and upload_dbt_artifacts_v2, but the macros do not have _dbt in the name (e.g. it's upload_artifacts_v2).

As such, users are unable to use the documentation without digging through source code to find the issues - the front-page docs for 0.7 reflect macros that are still in pre-release, as well as having name differences.

Incremental models not populating

I am executing the below steps with the latest dbt version and the latest pre-release of dbt_artifacts, but data fails to load into nearly all of the tables when running incrementally.

  1. dbt run-operation create_artifact_resources (builds out the starting objects as expected)
  2. dbt run (runs the entire project and builds my project objects as well as dbt_artifact objects)
  3. dbt run-operation upload_dbt_artifacts_v2 (uploads some data to the artifact tables correctly)
  4. dbt run (should incrementally add some data to the dims and facts, but nothing is added)

I have tried tracing all of it backward through the staging models and incremental loads, but can't seem to narrow down where it's breaking down.

I am getting a Database Error in model fct_dbt__critical_path

I am getting a compilation error in the model. Below is the message it generates; while running all models under the dbt_artifacts package, it failed on the model fct_dbt__critical_path.

2022-03-24T09:44:58.342+05:30 Processing aborted due to error 300010:391167117; incident 7185034.

Empty transient tables

I'm noticing some undesirable behavior for the incremental population of the transient tables.

Let's assume fct_dbt__test_executions doesn't get populated on a dbt run of dbt_artifacts, but gets created as a table with 0 records (which can easily happen if we only do model runs, without tests). With the incremental filter set up as below, any subsequent run is blocked from populating the table:

where artifact_generated_at > (select max(artifact_generated_at) from {{ this }})

The above will always be false, since a null will be returned from the transient table. A fix may be as simple as adding another condition to the is_incremental() check, like below:

    {% if is_incremental() and var_row_count > 0 %}
    -- this filter will only be applied on an incremental run
    where artifact_generated_at > (select max(artifact_generated_at) from {{ this }})
    {% endif %}

Here var_row_count is just a count of {{ this }}. I might be overthinking this... but it's strange when the incrementally populated table is empty while the view counterpart returns the current test executions.
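A minimal sketch of how var_row_count could be derived at the top of the model using run_query (names are illustrative):

    {# Hypothetical sketch: count the existing rows in the target before applying
       the incremental filter; guard with execute so it only runs at execution time. #}
    {% if execute and is_incremental() %}
        {% set row_count_result = run_query('select count(*) from ' ~ this) %}
        {% set var_row_count = row_count_result.columns[0].values()[0] %}
    {% else %}
        {% set var_row_count = 0 %}
    {% endif %}

An alternative with the same effect would be to wrap the max(artifact_generated_at) subquery in a coalesce with a floor date, so the filter never compares against null.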

Any suggestions or insights?

dbt_artifacts 0.7: FCT_DBT__SNAPSHOT_EXECUTIONS empty

Hi team,

I am scratching my head on this one. I am running dbt from dbt Cloud.

The first step is: dbt snapshot (I have also tried dbt snapshot -s dbt_artifacts)

The second step is: dbt run --exclude dbt_artifacts

The third step is: dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'

The fourth step is: dbt run -m dbt_artifacts

After all of this has processed, I have data in FCT_DBT__RUN_RESULTS and FCT_DBT__MODEL_EXECUTIONS, and the DIM_DBT__SNAPSHOTS dimension is populated, I think from the manifest.

I just don't know why FCT_DBT__SNAPSHOT_EXECUTIONS isn't populated.

Thanks!

fct_dbt__critical_path.sql flatten error

Hi folks,

I'm attempting to get artifacts running, but I am getting the following error when I run dbt run --models dbt_artifacts:

Database Error in model fct_dbt__critical_path (models/fct_dbt__critical_path.sql)
  000940 (22023): SQL compilation error: error line 46 at position 12
  missing required argument [INPUT] for function [flatten]
  compiled SQL at target/run/dbt_artifacts/models/fct_dbt__critical_path.sql

At a guess this is because we don't have any manifests loaded (ours are too big to fit into a single VARIANT). We were hoping to still get some value out of the package by checking model creation times, test statuses (when those land), etc. Can you take a look?

Allow for build command artifacts to populate by adjusting filter for data:args:which = 'run'

I recently implemented this package and it has been working great; however, we recently switched away from dbt run to dbt build. In doing so, I noticed that artifacts were no longer being parsed out into the fact and dimension models during a dbt_artifacts run despite being in the dbt_artifacts table.

Upon investigation, the model execution and run results models have a filter to limit the artifacts to those which are from run commands only, and not builds. I propose the filter be adjusted to data:args:which in ('run', 'build') for these models (or at least do the reverse and filter out non-run commands for future scalability):

  • stg_dbt__model_executions.sql
  • stg_dbt__run_results
  • stg_dbt__run_results_env_keys

Unless there is a particular reason why builds are not included - please let me know. Thanks!

schemas.yml: wildcard in column name (env_*) is incompatible with dbt's persist_docs option

Issue Description

Model fct_dbt__run_results has a wildcard-enhanced description of all dynamically built env_* columns.
However, if the dbt_project.yml is set to

models:
  +persist_docs:
    columns: true

this results in an error in dbt run when dbt tries to attach the description to the target table/view, as the column_name env_* is not a valid column name because of the *. Furthermore, the operation would fail anyway, since the column simply does not exist.

Summary:

15:05:40 | 41 of 62 START incremental model core_dbt_artifacts.fct_dbt__run_results [RUN]
15:05:46 | 41 of 62 ERROR creating incremental model core_dbt_artifacts.fct_dbt__run_results [ERROR in 6.16s]
Database Error in model fct_dbt__run_results (models/incremental/fct_dbt__run_results.sql)
  001003 (42000): SQL compilation error:
  syntax error line 19 at position 12 unexpected '*'.
  compiled SQL at target/run/dbt_artifacts/models/incremental/fct_dbt__run_results.sql

Detailed log:

2021-07-09T15:05:46.042098Z: On model.dbt_artifacts.fct_dbt__run_results: /* run by philipp.***** in dbt */

    alter table eh_account_dev_philipp_leufke.core_dbt_artifacts.fct_dbt__run_results alter
    
        command_invocation_id COMMENT $$The id of the command which resulted in the source artifact's generation.$$ ,
    
        artifact_generated_at COMMENT $$Timestamp of when the source artifact was generated.$$ ,
    
        dbt_version COMMENT $$The version of dbt used to generate the source artifact.$$ ,
    
        elapsed_time COMMENT $$The total run time of the command.$$ ,
    
        execution_command COMMENT $$The actual command used.$$ ,
    
        selected_models COMMENT $$A list of model selectors used in the command.$$ ,
    
        target COMMENT $$The configured target for the command.$$ ,
    
        was_full_refresh COMMENT $$Was the run executed with a --full-refresh flag?$$ ,
    
        env_* COMMENT $$Columns for the environment variables set when the command was executed.$$ ;
2021-07-09T15:05:46.232293Z: Snowflake query id: 019d7929-0200-982d-0000-649500806272
2021-07-09T15:05:46.232396Z: Snowflake error: 001003 (42000): SQL compilation error:
syntax error line 19 at position 12 unexpected '*'.
2021-07-09T15:05:46.232539Z: finished collecting timing info

Discussion

I am unsure if dbt is the culprit here, as I am not aware of how common it is to document a set of columns using wildcards in the column names.

Deactivating the persist_docs option for columns does fix the issue, but changes the result of the dbt run, of course.
We still would love to persist the descriptions in the column comments on the target DB.

Issue running `upload_dbt_artifacts` in production with dbt Cloud orchestrator

Hi there - this package fits our needs for monitoring model run times perfectly, and we were able to run all aspects of it locally using the dbt CLI and in our development environments on dbt Cloud using the dbt Cloud IDE; however, we're unable to get dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}' to run as a step in a dbt Cloud job due to this recurring error:

Running with dbt=0.19.1
Clearing existing files from Stage:
    remove @ANALYTICS.dbt_artifacts.artifacts pattern='.*.json.gz';

Encountered an error while running operation: Database Error
  002003 (02000): SQL compilation error:
  Stage 'ANALYTICS.DBT_ARTIFACTS.ARTIFACTS' does not exist or not authorized.

Some troubleshooting steps we've taken so far:

  • I ensured that the user & role associated with the environments that we've tried running this from in dbt Cloud have access to the table (I've even tried granting all privileges and ownership to the roles being used) and usage privileges on the schema where it's located
  • I tried running using various environments including our production environment, our pull request environment configured for GitHub CI checks, and some temporary environments I made to try to emulate my local development environment and my dbt Cloud development environment
  • I tried manually specifying the location of the artifacts in the dbt_project.yml file and reinstalling the package
  • I tried using dbt versions 0.19.1 and 0.19.0 (also 0.18.0, but that failed as expected due to the artifacts version)
  • I tried running this using the dev and prod targets that we use internally as well as the default target

All of those steps result in the same failure message shown above only when running the operation from a dbt Cloud job.

I've also tried using our dbt Cloud user's credentials, which we typically only use for dbt Cloud jobs, and was able to get the artifact upload to work properly when running locally.

Do you have any ideas on why we might be seeing this failure or other steps I could try to troubleshoot? I've tried doing most of the troubleshooting on uploads following some sort of dbt run operation (those run operations have succeeded). I am also able to run the artifact models from a dbt Cloud job successfully, it only seems to be the uploads that are causing issues.

Any advice would be appreciated - thanks for the help!

fct_dbt__critical_path is not showing the slowest path

Hey there!
We've been collecting dbt artifacts for a while now and already back when we started I noticed that the fct_dbt__critical_path table looked very suspicious (i.e. too good to be true) in our case, and it still does:
[Screenshot: fct_dbt__critical_path results, 2021-10-13]

I tried to investigate this issue by looking at the structure of the fct_dbt__critical_path model, and it seems that everything up to the search_path CTE (node_ids, total_time) produces the expected results, at least in our case. I was just confused by the fact that the total_time column seems to be a string instead of a float. However, starting with the next CTE, longest_path_node_ids, we get the wrong node sequence. There must be something that the case when there does that I don't understand.

Any help would be appreciated!

run_results json thread_id cannot be mapped to integer

thread_id in run_results.json can have a different format from the default, which causes the mapping to fail:

split(fields.result_json:thread_id::string, '-')[1]::integer as thread_id,

https://docs.python.org/3/library/threading.html#threading.Thread

name is the thread name. By default, a unique name is constructed of the form “Thread-N” where N is a small decimal number, or “Thread-N (target)” where “target” is target.name if the target argument is specified.

This causes errors such as:

Database Error in model fct_dbt__test_executions (models/incremental/fct_dbt__test_executions.sql)
    100071 (22000): Failed to cast variant value "6 (worker)" to FIXED

Database Error in model int_dbt__model_executions (models/incremental/int_dbt__model_executions.sql)
    100071 (22000): Failed to cast variant value "1 (worker)" to FIXED

Database Error in model fct_dbt__snapshot_executions (models/incremental/fct_dbt__snapshot_executions.sql)
    100071 (22000): Failed to cast variant value "4 (worker)" to FIXED

0.7.0 README references (forthcoming!) 0.8.0 functionality.

Hi! I hope this format is ok for a new issue, I didn't see any explicit guidelines in the README.

Currently, the README for 0.7.0 references ..._v2 functionality concerning artifact loads, but does not contain the code (which is in the 0.8.0 pre-release branches).

This is somewhat confusing for people currently integrating the package into their projects and should probably not already be present in the current 0.7.0 release documentation.

Thank you for all the hard work on this, very neatly solves observability issues with dbt/dbt-Cloud!!!

dbt invocation_id changes on upload_dbt_artifacts

When running:

dbt run-operation upload_dbt_artifacts ...

This causes manifest.json to get a different invocation_id. This is intended dbt behavior and it makes sense; however, it has the side effect of causing manifests and run results that would usually be joinable on invocation_id to not join.

I re-named the manifest file I wanted to retain the invocation_id for prior to running the upload operation and uploaded the renamed version, using some updates in Snowflake to change the PATH and ARTIFACT_TYPE back to manifest.json and manifest.json.gz accordingly.

I'm not aware of any way around this if uploading locally using the dbt CLI. I will probably put this in a pipeline and upload it via AWS S3 and Snowflake (as per Option 2 in the README).

I think that this should at least be documented or noted somewhere because this took me a few days of headache and I think it will be worth saving anyone else the trouble in future.

Thanks!

Adding select flag information to stg_dbt__run_results

stg_dbt__run_results shows the models arg as the selected_models column, but since moving to dbt 1.0, the models flag has been deprecated and the select flag is used instead. It would be great to have the information from the select flag available in the model.

An example implementation could be replacing this line:
https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/stg_dbt__run_results.sql#L36

by something like the following code:

coalesce(data:args:models, data:args:select) as selected_models,

to account for projects transitioning from pre-1.0 dbt versions to dbt 1.0.

I think the deployed package includes the test models

I was linting a project which uses this package, and got a curious error:

RuntimeError: File C:\Users\...\data_warehouse\dbt_packages\dbt_artifacts\integration_test_project\models\incremental.sql was not found in dbt project

Now this is evidently a sqlfluff problem, but it's failing to find a file, which is a test file within dbt artifacts. I don't think this file should even be present or rendering within the host project.

Sure enough when I check the dbt_packages folder I see the test project included in the installed package, which I suspect is why dbt is trying to compile it along with the rest of the project.

Is there something in the dbt package config which we should update so that the test project isn't included in the eventual parse DAG of a host project?

Test failure in fct_dbt__model_executions

The Unique test on this model is failing when I run on dbt Cloud:

Failure in test unique_fct_dbt__model_executions_model_execution_id (models/schemas.yml)

I suspect this is a result of two things, but it might still need more digging.

  1. If I run all the commands (seed, run, test, etc.) and upload artifacts after each one following the instructions on this package's home page, I end up with two manifests with the same dbt_cloud_run_id (one from seed and one from run) which in turn creates two records per node in dim_dbt__models.
  2. Using the dbt_cloud_run_id environment variable for this join in fct_dbt__model_executions:

model_executions_with_materialization as (

    select
        model_executions_incremental.*,
        models.model_materialization,
        models.model_schema,
        models.name
    from model_executions_incremental
    left join models on
        (
            model_executions_incremental.command_invocation_id = models.command_invocation_id
            or model_executions_incremental.dbt_cloud_run_id = models.dbt_cloud_run_id
        )
        and model_executions_incremental.node_id = models.node_id

),

This doesn't immediately get caught if you use dbt Cloud for CI, because the first dbt run doesn't have the artifacts from itself, so only the manifest from dbt seed exists the first time dbt_artifacts runs. Any subsequent run fails.

A simple workaround is not to upload manifest after dbt run (or seed, but I figure it's good to upload it at the earliest possible step), but I'm not sure if that breaks anything later on in this package? If not, it might be a simple documentation fix to take manifest out of that step. I tested this and it resolves the failing test, I just don't know if this loses any detail I should have.

Another possible solution would be to coalesce(dbt_cloud_run_id, command_invocation_id) into a single field so when the cloud_run_id is present it gets used, otherwise it falls back on the command_invocation_id. This might be a bigger task throughout the package, but I think would have the side effect of solving this issue.
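A minimal sketch of that coalesce approach applied to the join above (the casts are illustrative, since dbt_cloud_run_id is an integer and command_invocation_id a string):

    -- Hypothetical sketch: collapse the OR condition into a single run identifier per side.
    left join models on
        coalesce(model_executions_incremental.dbt_cloud_run_id::string,
                 model_executions_incremental.command_invocation_id)
            = coalesce(models.dbt_cloud_run_id::string, models.command_invocation_id)
        and model_executions_incremental.node_id = models.node_id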

Support Databricks

It would be epic to have this package support Databricks. One of the things we will need to discuss is how we will recommend loading into Databricks

Swap to result/graph objects as data source

Hi there!

I've been playing around with this package for a while, and really like all the info it provides! I see that there are currently a few issues out there to support other warehouses, and it looks like the main thing slowing that support down is the upload of the artifacts.

I'm curious if you have considered swapping to the context variables that dbt provides as the source of data. I haven't done an exhaustive review of the data between the two, but it looks like the results variable may have everything you need from run results, and the graph variable may have the things you need from the manifest. Since you can pull these directly from dbt without the need to import files, it may solve some problems on that side of things.

I'm in the very early stages of thinking about a stand-alone package to parse those two variables and load them to a database into audit style tables, and thought it may be helpful here as well.

Edit: A good example of this for the graph object is the recently released project evaluator package. These macros from that package show how you can parse the graph object to get info on the project.
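As a rough illustration of the idea (not this package's implementation), an on-run-end macro can walk the results context variable directly; the attribute names below follow dbt's documented result object, but the macro itself is hypothetical:

    {# Hypothetical sketch: collect one row per executed node from the `results`
       context variable, without reading run_results.json from disk. #}
    {% macro collect_run_results(results) %}
        {% set rows = [] %}
        {% for result in results %}
            {% do rows.append({
                'invocation_id': invocation_id,
                'node_id': result.node.unique_id,
                'status': result.status,
                'execution_time': result.execution_time
            }) %}
        {% endfor %}
        {{ return(rows) }}
    {% endmacro %}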

Critical path model doesn't make sense with segmented project

We very rarely refresh our whole dbt project in one go - much more likely is that individual marts are refreshed separately.

This makes the critical path model a little weird, because it currently only picks up the most recent critical path (for whichever mart ran most recently).

I have a suggestion to make this better:

  • In our setup, each mart is identifiable by a unique combination of model selectors (another option would be to use the dbt Cloud job id, but that wouldn't work for local deployments). That could be turned into an identifier with a schema something like <target>|<selector>. That results in default|* for most simple projects (no explicit selector is rendered as *), or e.g. my_target|models/mart_a|my_package for more complicated setups (multiple selectors are also delimited by |). I'm going to call this a job_selector.
  • Within each job_selector, the critical path makes sense again, so adding this as a column to that model, would then mean users could see the critical path for each job_selector. For users with a basic setup, this wouldn't add any additional granularity to that model, because they would only have one.
  • To limit explosion in larger projects we could allow configuration to limit which job_selector ids are allowed to materialise in this model (maybe not something to build in round 1 though).
  • This also helps to clear up the difference between the existing latest_full_model_executions model and the new current_models model - in that the latter just has one row per node, but the former has one row per node per job_selector.

Does this sound like a viable solution to the problem? @NiallRees

Support BigQuery

The dbt package is awesome. I would like to do the same thing for BigQuery. However, I am not sure we can achieve the same with only run_query because, for instance, we would have to upload artifact files to GCS in the case of BigQuery. As far as I know, there is no statement type to put them into GCS or BigQuery directly.

NOTE

Unfortunately, there is nothing like a DDL statement to load data from a local file. It would be worthwhile raising that with Google Cloud.

ENV property empty which means stg_dbt__run_results_env_keys and related tables do not populate

Hey, thanks so much for doing this; it has really helped me track our dbt run times.

In the run_results.json artefact, I see that the env property is null:
[Screenshot: run_results.json showing the env property as null]

This means that stg_dbt__run_results_env_keys is not being populated, because the SQL is looking for an env.key field. Is there any way we can populate this field so I can use all the models? I am running dbt on dbt Cloud. The CTE below references this field:

[Screenshot: the CTE referencing env.key]

Is the env variable a variable that I should be setting or an argument I should be passing in?

Inquiry about support for dbt v1.0.0

Hey dbt_artifacts Team!

So I haven't yet migrated to dbt v1.0.0 due to the fact that we use the SQLFluff linter (thank you Tails.com for that one too!) in our pre-commit checks. With that, I suspect dbt_artifacts doesn't support v1.0.0 right now. Is there a timeline for when this package will support v1.0.0?

If not, I'm curious to try and work on this if it seems like a relatively low lift from y'all's perspective to implement.

Support Redshift

It would be epic to have this package support Redshift. One of the things we will need to discuss is how we will recommend loading into Redshift (or into S3, which would be better for Redshift).

upload_dbt_artifacts_v2 does not overcome 16MB variant limit in Snowflake

Re: #62

Hi!

Just gave the V2 uploader a spin, and I can't seem to get around the 16MB limit still. Am I doing something wrong?

Debugging the SQL tells me that the raw_data CTE is where the 100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes error comes from. So I imagine that, at that point, the SQL is still trying to parse the manifest as one single variant?

I've run dbt run-operation create_artifact_resources beforehand.

Log:

dbt --no-write-json run-operation upload_dbt_artifacts_v2
13:52:38  Running with dbt=1.0.3
13:52:38  Clearing existing files from Stage: 
    remove @analytics_dev.dbt_larspetter.dbt_artifacts_stage pattern='.*\/c39ee8dc-197a-454e-9814-b4dddd921eea\/.*\.json.gz';

13:52:39  Uploading manifest.json to Stage: 
        put file://target/manifest.json @analytics_dev.dbt_larspetter.dbt_artifacts_stage/c39ee8dc-197a-454e-9814-b4dddd921eea auto_compress=true;
    
13:52:42  Persisting flattened manifest nodes manifest.json from Stage: 

    -- Merge to avoid duplicates
    merge into analytics_dev.dbt_larspetter.dbt_manifest_nodes as old_data using (
        with raw_data as (

            select
                manifests.$1:metadata as metadata,
                metadata:invocation_id::string as command_invocation_id,
                -- NOTE: DBT_CLOUD_RUN_ID is case sensitive here
                metadata:env:DBT_CLOUD_RUN_ID::int as dbt_cloud_run_id,
                
    sha2_hex(coalesce(dbt_cloud_run_id::string, command_invocation_id::string), 256)
 as artifact_run_id,
                metadata:generated_at::timestamp_tz as generated_at,
                manifests.$1 as data
            from @analytics_dev.dbt_larspetter.dbt_artifacts_stage/c39ee8dc-197a-454e-9814-b4dddd921eea as manifests

        )

        

    select
        manifests.command_invocation_id,
        manifests.dbt_cloud_run_id,
        manifests.artifact_run_id,
        manifests.generated_at::timestamp_tz as artifact_generated_at,
        node.key as node_id,
        node.value:resource_type::string as resource_type,
        node.value:database::string as node_database,
        node.value:schema::string as node_schema,
        node.value:name::string as name,
        -- Include the raw JSON to unpack other values.
        node.value as node_json
    from raw_data as manifests,
        lateral flatten(input => manifests.data:nodes) as node

    union all

    select
        manifests.command_invocation_id,
        manifests.dbt_cloud_run_id,
        manifests.artifact_run_id,
        manifests.generated_at::timestamp_tz as artifact_generated_at,
        exposure.key as node_id,
        'exposure' as resource_type,
        null as node_database,
        null as node_schema,
        exposure.value:name::string as name,
        -- Include the raw JSON to unpack other values.
        exposure.value as node_json
    from raw_data as manifests,
        lateral flatten(input => manifests.data:exposures) as exposure

    union all

    select
        manifests.command_invocation_id,
        manifests.dbt_cloud_run_id,
        manifests.artifact_run_id,
        manifests.generated_at::timestamp_tz as artifact_generated_at,
        source.key as node_id,
        'source' as resource_type,
        source.value:database::string as node_database,
        source.value:schema::string as node_schema,
        source.value:name::string::string as name,
        -- Include the raw JSON to unpack other values.
        source.value as node_json
    from raw_data as manifests,
        lateral flatten(input => manifests.data:sources) as source



    ) as new_data
    -- NB: We dedupe on artifact_run_id rather than command_invocation_id for manifest nodes
    -- to avoid holding duplicate data.
    on old_data.artifact_run_id = new_data.artifact_run_id and old_data.node_id = new_data.node_id
    -- NB: No clause for "when matched" - as matching rows should be skipped.
    when not matched then insert (
        command_invocation_id,
        dbt_cloud_run_id,
        artifact_run_id,
        artifact_generated_at,
        node_id,
        resource_type,
        node_database,
        node_schema,
        name,
        node_json
    ) values (
        new_data.command_invocation_id,
        new_data.dbt_cloud_run_id,
        new_data.artifact_run_id,
        new_data.artifact_generated_at,
        new_data.node_id,
        new_data.resource_type,
        new_data.node_database,
        new_data.node_schema,
        new_data.name,
        new_data.node_json
    )


13:52:44  Encountered an error while running operation: Database Error
  100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes
    File 'c39ee8dc-197a-454e-9814-b4dddd921eea/manifest.json.gz', line 1, character 16777216
    Row 0, column $1

Intermittent Database Error on upload

Occasionally I get one of these, when running the v2 upload on the new release branch (basically on master):

04:38:00  Encountered an error while running operation: Database Error
  100112 (22000): Remote file 'http://.../stages/5ef57c80-83d2-4383-94ed-7c3738a0a04b/manifest.json.gz' was not found. There are several potential causes. The file might not exist. The required credentials may be missing or invalid. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

This is a Snowflake error, and I suspect it's to do with collisions in the stage: either because the purge of the file is too aggressive (we're removing anything which matches pattern='.*.json.gz'), or because we need to wait for Snowflake to load the file properly before removing it.

Given I think each dbt cloud run is running in an isolated environment, I think it's most likely that the issue is happening in the stage rather than locally. Suggested mitigating measures:

  • More specific remove command.
  • Get rid of remove and use the PURGE option on the stage to try and get snowflake to remove files itself automatically post-load.
  • Use a run specific stage (or location within the stage), probably using the command_invocation_id.

Gut feel on my part is that the last option alone might be the most elegant solution, because that totally isolates each run.

Need to use --no-write-json to avoid empty STG_DBT__NODE_EXECUTIONS model

Hello, recently I was trying to work out why many of my models were empty, and I tracked it down to the command I was using to invoke the new upload macro:
dbt run-operation upload_dbt_artifacts_v2

I found that my manifest.json was being replaced by the macro's execution and that was causing the STG_DBT__NODE_EXECUTIONS model to have zero rows because it has an inner join between stg_dbt__artifacts and stg_dbt__nodes. In my case the issue was resolved by using this command to upload instead:
dbt --no-write-json run-operation upload_dbt_artifacts_v2

I recommend the README.md be updated to include the --no-write-json parameter.

I also recommend committing the fix for issue #64 because that was also one of my root causes.

Refactor of granularity in tables.

#81 migrated several OR statements into a consistent artifact_run_id. This was a good improvement but results in a few tricky issues about granularity.

Situation:

  • dbt cloud job, which runs seed, run and test (I don't believe the problem I'm highlighting would occur in the case of running build).
  • After each step we run upload_artifacts macros.
  • The created artifacts would have run_results for each step, with each of the models/tests/seeds executed at each step (which is good) and also a duplicate manifest for each, with a different command_invocation_id but the same artifact_run_id.
  • The MERGE statements currently proposed in #99 don't solve this situation, because we're merging on artifact_run_id AND artifact_generated_at.
  • This means that when we start joining on artifact_run_id in models we get a cartesian join and more rows than we need. Most specifically in stg_dbt__node_executions, but also in fct_dbt__snapshot_executions, fct_dbt__seed_executions & fct_dbt__model_executions.

I think the solution to this is to change the join in these cases to join only on command_invocation_id which is unique to a given artifact - and avoids the cartesian join.

Add relationship tests and/or an ERD

Currently it is not self-explanatory how the fact and dimension tables are to be joined.

It would be helpful to have an ERD available in the documentation of this project.

Or there should be relationship tests on the foreign-key columns, so the users can identify the suitable join conditions.
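A minimal schema.yml sketch of what one such test could look like, using dbt's built-in relationships test (the specific key columns are assumptions about the intended join):

    # Hypothetical sketch only - documents one join path explicitly.
    models:
      - name: fct_dbt__model_executions
        columns:
          - name: command_invocation_id
            tests:
              - relationships:
                  to: ref('fct_dbt__run_results')
                  field: command_invocation_id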

Failure in test not_null_fct_dbt__latest_full_model_executions_node_id

We're intermittently experiencing this failure:

Failure in test not_null_fct_dbt__latest_full_model_executions_node_id (models/schemas.yml)
  Got 1 result, configured to fail if != 0

  compiled SQL at target/compiled/dbt_artifacts/models/schemas.yml/schema_test/not_null_fct_dbt__latest_full_model_executions_node_id.sql

On inspection, this happens when a new upload coincides with an incremental run of the package. What I'm pretty sure is a race condition means that while the most recent run is present in fct_dbt__run_results, it is not present in fct_dbt__model_executions - and so the most recent run appears to have no model executions. The current workaround is to run a full rebuild, which takes significant time.

A better solution would be to only allow runs which have model executions in this model. It adds complexity but would explicitly eliminate this race condition.
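A minimal sketch of that restriction, assuming CTE names for the two upstream models (illustrative only):

    -- Hypothetical sketch: only surface runs that already have model executions,
    -- so a run caught mid-upload is excluded rather than appearing with none.
    select run_results.*
    from run_results
    where exists (
        select 1
        from model_executions
        where model_executions.command_invocation_id = run_results.command_invocation_id
    )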

Capturing exposures data

Thanks for creating this package - we have found it very useful in our dbt projects!

One thing we found missing was capturing information about exposures defined within our dbt projects. Especially, we were interested in exposing to our end users latest update times of different output models feeding the different exposures. We can extract the latest run of each model from the FCT_DBT__MODEL_EXECUTIONS model but exposures are not captured in any of the models.

Based on your work, we have created three models to capture this information and would be very keen to explore if you'd be interested in incorporating it into your package. Similar to your models capturing data on MODELS we built:

  • a staging model extracting metadata about exposures from manifest.json (STG_DBT__EXPOSURES),
  • a dimension model expanding exposure-model relations into separate rows (DIM_DBT__EXPOSURES),
  • a fact model joining exposures and latest update times of each output model feeding them (FCT_DBT__EXPOSURES_UPDATES).

Let me know if that's something you'd be interested in!

Error running stg_dbt__node_executions model

When I try to run all models by running dbt build -s dbt_artifacts, I get the following error:

Database Error in model stg_dbt__node_executions (models/staging/stg_dbt__node_executions.sql)
07:36:46    001789 (42601): SQL compilation error:
07:36:46    invalid number of result columns for set operator input branches, expected 12, got 15 in branch 2
07:36:46    compiled SQL at target/run/dbt_artifacts/models/staging/stg_dbt__node_executions.sql

When checking the model's compiled SQL, I noticed that those two branches do indeed produce a different number of columns, so the UNION can't work:

-- V1 uploads
    {{ flatten_results("run_results") }}

    union all

    -- V2 uploads
    -- NB: We can safely select * because we know the schemas are the same
    -- as they're made by the same macro.
    select * from base_v2

This is my first attempt at using your package, so I'm not sure whether I'm missing something. I'm also not sure what the 'V1 uploads' SQL comment means, since I've only ever used the V2 macro. I'm on version 0.8.0.

Steps to reproduce:

  • run dbt run-operation create_artifact_resources
  • run dbt run -m +model_name
  • run dbt --no-write-json run-operation upload_dbt_artifacts_v2
  • run dbt build -s dbt_artifacts

Fix CI/CD

CI/CD is failing on dbt seed for no obvious reason. Runs locally fine with the same credentials 🤦‍♂️

[Screenshot: failing dbt seed step in CI, 2021-04-16]

`fct_dbt__critical_path` table is empty for those using run-operations to upload artifacts

Hey there - I was working on installing this package and was running into some issues with the fct_dbt__critical_path table being empty. I took out the part of the CTE in fct_dbt__latest_full_model_executions that filters for full runs, as this is not super helpful to us, i.e.

latest_full as (
    select *
    from run_results
    where 
    --selected_models is null and
     was_full_refresh = false
    order by artifact_generated_at desc
    limit 1
),

but still no dice. I thought it was because the model joins on command_invocation_id, which seems to be different for each artifact created - as in, the run_results.json has a different id than the manifest.json even though we are running them from the same command line?
That join happens in the same model here:

joined as (
    select
        model_executions.*
    from latest_full
    left join model_executions on model_executions.command_invocation_id = latest_full.command_invocation_id

Any help is greatly appreciated.
