
dbt-codegen's Introduction

dbt-codegen

Macros that generate dbt code, and log it to the command line.

Contents

Installation instructions

New to dbt packages? Read more about them here.

  1. Include this package in your packages.yml file — check here for the latest version number:
packages:
  - package: dbt-labs/codegen
    version: X.X.X ## update to latest version here
  2. Run dbt deps to install the package.

Macros

generate_source (source)

This macro generates lightweight YAML for a Source, which you can then paste into a schema file.

Arguments

  • schema_name (required): The schema name that contains your source data
  • database_name (optional, default=target.database): The database that your source data is in.
  • table_names (optional, default=none): A list of tables that you want to generate the source definitions for.
  • generate_columns (optional, default=False): Whether you want to add the column names to your source definition.
  • include_descriptions (optional, default=False): Whether you want to add description placeholders to your source definition.
  • include_data_types (optional, default=True): Whether you want to add data types to your source columns definitions.
  • table_pattern (optional, default='%'): A table prefix / postfix that you want to subselect from all available tables within a given schema.
  • exclude (optional, default=''): A string you want to exclude from the selection criteria
  • name (optional, default=schema_name): The name of your source
  • include_database (optional, default=False): Whether you want to add the database to your source definition
  • include_schema (optional, default=False): Whether you want to add the schema to your source definition
  • case_sensitive_databases (optional, default=False): Whether you want database names to be in lowercase, or to match the case in the source table — not compatible with Redshift
  • case_sensitive_schemas (optional, default=False): Whether you want schema names to be in lowercase, or to match the case in the source table — not compatible with Redshift
  • case_sensitive_tables (optional, default=False): Whether you want table names to be in lowercase, or to match the case in the source table — not compatible with Redshift
  • case_sensitive_cols (optional, default=False): Whether you want column names to be in lowercase, or to match the case in the source table

Outputting to a file

If you use the dbt run-operation approach, you can write the output directly to a file by redirecting it to a new file and using the --quiet CLI flag:

dbt --quiet run-operation generate_source --args '{"table_names": ["orders"]}' > models/staging/jaffle_shop/_sources.yml

Usage:

  1. Copy the macro into a statement tab in the dbt Cloud IDE, or into an analysis file, and compile your code
{{ codegen.generate_source('raw_jaffle_shop') }}

or for multiple arguments

{{ codegen.generate_source(schema_name= 'jaffle_shop', database_name= 'raw') }}

Alternatively, call the macro as an operation:

$ dbt run-operation generate_source --args 'schema_name: raw_jaffle_shop'

or

# for multiple arguments, use the dict syntax
$ dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "database_name": "raw", "table_names":["table_1", "table_2"]}'

or if you want to include column names and data types:

$ dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "generate_columns": true}'

or if you want to include column names without data types (the behavior of dbt-codegen <= v0.9.0):

$ dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "generate_columns": true, "include_data_types": false}'
  2. The YAML for the source will be logged to the command line
version: 2

sources:
  - name: raw_jaffle_shop
    database: raw
    schema: raw_jaffle_shop
    tables:
      - name: customers
        description: ""
      - name: orders
        description: ""
      - name: payments
        description: ""
  3. Paste the output into a schema .yml file, and refactor as required.

generate_base_model (source)

This macro generates the SQL for a base model, which you can then paste into a model.

Arguments:

  • source_name (required): The source you wish to generate base model SQL for.
  • table_name (required): The source table you wish to generate base model SQL for.
  • leading_commas (optional, default=False): Whether you want your commas to be leading (vs trailing).
  • case_sensitive_cols (optional, default=False): Whether your source table has case sensitive column names. If true, keeps the case of the column names from the source.
  • materialized (optional, default=None): Set materialization style (e.g. table, view, incremental) inside of the model's config block. If not set, materialization style will be controlled by dbt_project.yml

Usage:

  1. Create a source for the table you wish to create a base model on top of.
  2. Copy the macro into a statement tab in the dbt Cloud IDE, or into an analysis file, and compile your code
{{ codegen.generate_base_model(
    source_name='raw_jaffle_shop',
    table_name='customers',
    materialized='table'
) }}

Alternatively, call the macro as an operation:

$ dbt run-operation generate_base_model --args '{"source_name": "raw_jaffle_shop", "table_name": "customers"}'
  3. The SQL for a base model will be logged to the command line
with source as (

    select * from {{ source('raw_jaffle_shop', 'customers') }}

),

renamed as (

    select
        id,
        first_name,
        last_name,
        email,
        _elt_updated_at

    from source

)

select * from renamed
  4. Paste the output into a model, and refactor as required.

create_base_models (source)

This macro generates a series of terminal commands (chained with && so they run one after another) that execute the base_model_creation bash script. That bash script writes the output of the generate_base_model macro into new model files in your local dbt project.

Note: This macro is not compatible with the dbt Cloud IDE.

Arguments:

  • source_name (required): The source you wish to generate base model SQL for.
  • tables (required): A list of all tables you want to generate the base models for.

Usage:

  1. Create a source for the table you wish to create a base model on top of.
  2. Copy the macro into a statement tab in your local IDE, and run your code
dbt run-operation codegen.create_base_models --args '{source_name: my-source, tables: ["this-table","that-table"]}'

base_model_creation (source)

This bash script, when executed from your local IDE, will create model files in your dbt project that contain the outputs of the generate_base_model macro.

Note: This macro is not compatible with the dbt Cloud IDE.

Arguments:

  • source_name (required): The source you wish to generate base model SQL for.
  • tables (required): A list of all tables you want to generate the base models for.

Usage:

  1. Create a source for the table you wish to create a base model on top of.
  2. Copy the command below into a terminal in your local dbt project, and run it
source dbt_packages/codegen/bash_scripts/base_model_creation.sh "source_name" ["this-table","that-table"]

generate_model_yaml (source)

This macro generates the YAML for a list of model(s), which you can then paste into a schema.yml file.

Arguments:

  • model_names (required): The model(s) you wish to generate YAML for.
  • upstream_descriptions (optional, default=False): Whether you want to include descriptions for identical column names from upstream models and sources.
  • include_data_types (optional, default=True): Whether you want to add data types to your model column definitions.

Usage:

  1. Create a model.
  2. Copy the macro into a statement tab in the dbt Cloud IDE, or into an analysis file, and compile your code
{{ codegen.generate_model_yaml(
    model_names=['customers']
) }}

You can use the helper function codegen.get_models and specify a directory and/or prefix to get a list of all matching models, to be passed into the model_names argument.

{% set models_to_generate = codegen.get_models(directory='marts', prefix='fct_') %}
{{ codegen.generate_model_yaml(
    model_names = models_to_generate
) }}

Alternatively, call the macro as an operation:

$ dbt run-operation generate_model_yaml --args '{"model_names": ["customers"]}'
  3. The YAML for the model(s) will be logged to the command line
version: 2

models:
  - name: customers
    description: ""
    columns:
      - name: customer_id
        data_type: integer
        description: ""
      - name: customer_name
        data_type: text
        description: ""
  4. Paste the output into a schema.yml file, and refactor as required.

generate_model_import_ctes (source)

This macro generates the SQL for a given model with all references pulled up into import CTEs, which you can then paste back into the model.

Arguments:

  • model_name (required): The model you wish to generate SQL with import CTEs for.
  • leading_commas (optional, default=False): Whether you want your commas to be leading (vs trailing).

Usage:

  1. Create a model with your original SQL query
  2. Copy the macro into a statement tab in the dbt Cloud IDE, or into an analysis file, and compile your code
{{ codegen.generate_model_import_ctes(
    model_name = 'my_dbt_model'
) }}

Alternatively, call the macro as an operation:

$ dbt run-operation generate_model_import_ctes --args '{"model_name": "my_dbt_model"}'
  3. The new SQL - with all references pulled up into import CTEs - will be logged to the command line
with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

customer_orders as (

    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount
    from payments
    left join orders on
         payments.order_id = orders.order_id
    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final
  4. Replace the contents of the model's current SQL file with the compiled or logged code

Contributing

To contribute code to this package, please follow the steps outlined in the integration_tests directory's README file.

dbt-codegen's People

Contributors

angelica-lastra, bodschut, calvingiles, clrcrl, cohms, dave-connors-3, davesgonechina, dbeatty10, djbelknapaw, erkanncelen, esegal, fivetran-joemarkiewicz, graciegoheen, gsokol, gwenwindflower, jaypeedevlin, jeremyholtzman, jeremyyeo, joellabes, jorgeng, jtalmi, jtcohen6, kbrock91, linbug, maayan-s, matt-winkler, pnadolny13, rahulj51, vijmen, yatsky


dbt-codegen's Issues

basic jinja templates

Describe the feature

A clear and concise description of what you want to happen.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.

Are you interested in contributing this feature?

Dash '-' in database name causes syntax error

Describe the bug

Dash '-' in the database name causes a syntax error while using the generate_source command.

Steps to reproduce

If the database name contains '-', it will cause a syntax error.

Expected results

Without syntax error.

Actual results

Syntax error.

Screenshots and log output


System information

packages:

  • package: dbt-labs/codegen
    version: 0.8.1

Which database are you using dbt with?
redshift

The output of dbt --version:
dbt cloud

The operating system you're using:

The output of python --version:

Additional context

The workaround here is to use '{ database_name: ""DATABASE_NAME"" }'

Macro Generate_Source is not working as expected for the generation of columns from BQ

Describe the bug

We are using the dbt-codegen macros to help us in our work. Unfortunately, the generate_source.sql macro is partially not working for some reason. The name of the source and the name of the table are retrieved normally, but the columns are not. We tried multiple ways to retrieve them, but it does not seem possible. We are trying to fetch data from BQ and we set the parameters "generate_columns=True" and "include_descriptions=True".

Steps to reproduce

We are using the following command to trigger the macro, and the output is returned without columns:
dbt run-operation generate_source --args '{schema_name: tiktok_ads}'

Expected results

We were expecting to receive the list of columns that are provided in BQ

Actual results

The list of columns is empty; no column names are listed.

Screenshots and log output


System information

packages:

  • package: dbt-labs/codegen
    version: 0.4.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:
installed version: 0.21.0
latest version: 1.0.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:

  • postgres: 0.21.0
  • redshift: 0.21.0
  • bigquery: 0.21.0
  • snowflake: 0.21.0

The operating system you're using:
Windows 10

The output of python --version:
Python 3.8.11

Additional context

Our hypothesis is that the problem comes from this part of the code, but we are unsure. Could you please assist us with the resolution?


Are you interested in contributing the fix?

We are at your disposal for any further information or help in order to achieve the resolution of this problem on our side.

Generate model yaml not importing descriptions from upstream

Describe the bug

As per the title, once I run the command, the YAML generated has all the columns in lower case but doesn't get the descriptions from upstream.

Steps to reproduce

It seems this has to do with it being case insensitive somehow. My upstream model has a column called 'currentSite', and the YAML generated creates it as 'currentsite' without any description.
This is done on DBT Cloud.

Screenshots and log output

System information

packages:
  - package: dbt-labs/codegen
    version: 0.7.0
  - package: dbt-labs/dbt_utils
    version: 0.8.6
  - package: calogica/dbt_expectations
    version: [">=0.5.0", "<0.6.0"]
  - package: tnightengale/dbt_meta_testing
    version: 0.3.5

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

1.0

Are you interested in contributing the fix?

If I could, but I've never contributed to a dbt package, so I'm not sure I can help.

Feature: bash functions

Improve the UX for these macros so they feel less manual. For example, write
bash functions that move the macro output into the file

Add new macro to generate markdown file for docs blocks

Describe the feature

As per #48 we are using docs blocks as our default documentation method in dbt. It would be extremely useful for us to be able to generate a markdown file with the relevant docs blocks from a model.
e.g. {{ generate_model_docs('data_user') }} would yield

{% docs user_hk %}
{% enddocs %}

{% docs user_id %}
{% enddocs %}

{% docs create_date %}
{% enddocs %}

{% docs email_address_text %}
{% enddocs %}

Describe alternatives you've considered

We have created our own macro to perform this function.

Additional context

Who will this benefit?

Anyone who uses docs blocks as the default documentation method in their project.

Are you interested in contributing this feature?

Yes, I have a working macro in my dbt project.

generate_source on Snowflake doesn't generate any table names

Describe the bug

generate_source on Snowflake doesn't generate any table names (or columns). I'm not seeing any errors anywhere. It simply doesn't finish outputting the yml.

This only happens for some schemas in my source DB. I don't see a pattern in why it happens in some but not others.

Steps to reproduce

I executed
dbt run-operation generate_source --args '{"schema_name": "my_actual_schema_name", "database_name": "my_actual_database_name", "generate_columns": "True"}'

Expected results

version: 2

sources:
  - name: [my_actual_source_name]
    database: my_actual_database_name
    tables:
      - name: accounts
        description: ""
        columns:
          - name: id
            description: ""
          - name: country_id
            description: ""

      - name: products
        description: ""
        columns:
          - name: id
            description: ""
          - name: name
            description: ""
          - name: record_type
            description: ""

Actual results

version: 2

sources:
  - name: [my_actual_source_name]
    database: my_actual_database_name
    tables:

Screenshots and log output

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

dbt --version
Core:
  - installed: 1.3.0
  - latest:    1.3.0 - Up to date!

Plugins:
  - snowflake: 1.3.0 - Up to date!

The operating system you're using:
macOS 12.6.1 (21G217)

The output of python --version:
3.9.15

Put the letters of the generate_source function in the same format as the table

Describe the feature

When I execute
dbt run-operation generate_source --args 'schema_name: exemple'
the tables are in lower case and my code doesn't work anymore.
I would like to have the letters written in the same format as the table.

Describe alternatives you've considered

{% do sources_yaml.append(' - name: ' ~ table | lower ) %}

I think it's possible to change this line: {% do sources_yaml.append(' - name: ' ~ table | lower ) %}
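A minimal sketch of the change being suggested, simply dropping the lower filter so the table's original case is preserved (indentation illustrative):

{% do sources_yaml.append('      - name: ' ~ table ) %}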

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.

I don't think so

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.

At least, all users of Bigquery who are working on Big Data

Are you interested in contributing this feature?

I would love to, but I am a beginner on dbt so I don't think that I can help a lot

Feature: Add param for making leading commas

Describe the feature

Some people prefer to have leading commas in their select statements. This optional parameter would generate the source SQL with leading commas.

Are you interested in contributing this feature?

I'll work on this sometime since I'm probably the only person who will want this (along with @Nerdenberger ) 😂

Add ' description: "" ' to output from generate_source to match generate_model_yaml

Describe the feature

generate_model_yaml is great! I like that it adds the description line even though it is blank.

I would love that same feature for generate_source.

Describe alternatives you've considered

I could probably cobble something together with a script.

Additional context

This would be especially powerful when combined with a script and yq to merge the yml files like I currently do.

Who will this benefit?

Anyone who wants to document their data with as few keystrokes as possible.

Are you interested in contributing this feature?

At this time that isn't an option due to legal reasons.

Thanks!

Offline installation / prevent callout to dbt hub

Due to an enterprise setting I cannot directly load packages from dbt hub.
I have mirrored dbt-codegen to my internal GitLab instance.

However,

- git: "[email protected]/mirrors/github/dbt-codegen.git" 
    revision: 0.9.0

fails with:

External connection exception occurred: HTTPSConnectionPool(host='hub.getdbt.com', port=443): Max retries exceeded with url: /api/v1/index.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000232D005CA60>, 'Connection to [hub.getdbt.com](https://hub.getdbt.com/) timed out. (connect timeout=30)'))

due to firewalling.

Other packages like dbt-utils work just fine in this way. How can dbt-codegen support such an offline installation as well?
I did not find the place where the dbt installer is calling out to dbt hub.

Pass an (optional) list of table-names to generate_source

Describe the feature

Sometimes, if the schema has several tables, one wants generate_source to generate source definitions for a selected set of tables only. Currently, this is not supported.

Provide an (optional) parameter to generate_source called table_names that takes a list of tables. If provided, generate_source will generate the definitions for only these tables.
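For illustration, an invocation with such a parameter could look like the following (this matches the table_names argument that the README above now documents):

dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "table_names": ["orders", "customers"]}'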

Describe alternatives you've considered

No alternatives exist except for manually removing the unwanted lines from the generated definitions.

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.
No.

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.
In cases where schemas contain 100s of tables but only a few of them are of interest.

Are you interested in contributing this feature?

Yes.

Add data_type to columns

Describe the feature

To make the generated documentation more informative it could be useful to add include_data_types flag to generate_source macro which leads to adding data_type property to column YAML.

Describe alternatives you've considered

That could be done manually or with another macro.

Additional context

Who will this benefit?

Anyone who got a huge legacy of the source tables to handle.

Are you interested in contributing this feature?

Yes, I have a patch, working for my Greenplum project.

[CT-1605] [Bug] generate_source sometimes fails without explanation

Is this a new bug in dbt-snowflake?

  • I believe this is a new bug in dbt-snowflake
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

If I run
dbt run-operation generate_source --args '{"schema_name": "my_schema_name", "database_name": "my_db_name"}'

The output will sometimes be:


version: 2

sources:
  - name: my_schema_name
    database: my_db_name
    tables:

Other times, it works great.

Expected Behavior

I would expect a full sources document to be created, with each table, always.

Steps To Reproduce

It consistently happens with the same schemas, but I'm not sure why.
Run
dbt run-operation generate_source --args '{"schema_name": "my_schema_name", "database_name": "my_db_name"}'
See that output never gets to the first table.

Both schemas for which I'm currently seeing this problem are populated via Fivetran's Google Sheets connector, and have _ROW as their first column

Still, I have other schemas populated by the same connector (in the same DB), with tables that have the same first column name, that work great with this command.

Relevant log output

I don't see any logs for this, nor error messages.

Environment

- OS: macOS 12.6.1 (21G217)
- Python: Python 3.10.7
- dbt-core:
- dbt-snowflake: 1.3.0

Additional Context

No response

Issue in BigQuery when source table has column name "source"

Pretty much what the issue says. I was running your nice codegen macros, but an issue came up: my source table already had a column called source.

After creating the base models as views, what happened was that the columns were duplicated, with the first set of columns named in the form source.<column name> and the second set as <column_name>.

I'll manually change the source that is auto-generated in the CTE, but just a heads up that this may be a bug for people.

`generate_model_yaml macro` to take original yaml file as an additional argument

Looking at the generate_model_yaml macro, I can see that it takes the following arguments:

  • model_name
  • upstream_descriptions

As I understand it, running

dbt run-operation generate_model_yaml --args '{"model_name": "dim_customer"}'

will produce a yaml similar to the following

version: 2

models:
  - name: dim_customer
    description: ""
    columns:
      - name: dim_customer_id
        description: ""

      - name: customer_id
        description: ""

      - name: customer_first_name
        description: ""

...

But what if there is an already existing yml file for dim_customers such as...

...
  - name: dim_customer
    description: ""
    columns:
      - name: dim_customer_id
        description: "this is my description"
        meta:
           - something_else:
        tests_and_other_things...:
...

Is it possible we can take these original descriptions, tests and meta things too for the new output produced by the generate_model_yaml macro?

Why someone would want this: most of the time you are not in fact generating a model yaml from scratch; there is an already existing one, and you have made changes to it. It's really a quality-of-life improvement, but it would greatly improve one's workflow and make it easier to keep to documentation best practices.

Support multiple arguments for generate_model_yaml

The package currently supports multiple arguments for generate_source using dict syntax - would love to be able to do the same for generate_model_yaml!

Something like:

dbt run-operation generate_model_yaml --args '{"model_name": ["model_1","model_2"]}'

Generate Model Yaml not seeing columns, just model name.

Apologies if this is not the correct place to ask this question.

When I run generate_model_yaml, the code that is returned doesn't have the column names, just the model name.

For example, running this:

{{ codegen.generate_model_yaml(
    model_name='fct_shopify_variant_metafield'
) }}

Produces this:

models:
  - name: fct_shopify_variant_metafield
    description: ""
    columns:

I must be doing something wrong, but cannot figure out what it is. Any assistance would be appreciated.

Unable to override database name

Describe the bug

I'm trying to use generate_source in dbt Cloud, running it in the prompt, to generate a source for my raw database in Snowflake, but it seems to be overridden by the default target database, analytics.

Steps to reproduce

I'm running the following script in the prompt:
dbt run-operation generate_source --args '{schema_name: BINGADS, database_name: raw}'

I've installed packages using dbt deps

Expected results

version: 2

sources:
  - name: bingads
    database: raw
    tables:
      - name: account_history
      - name: account_impression_performance_daily_report
      - name: etc

Actual results

version: 2

sources:
  - name: bingads
    database: raw
    tables:

Screenshots and log output

Log output:
On macro_generate_source: /* {"app": "dbt", "dbt_version": "0.18.0", "profile_name": "user", "target_name": "default", "connection_name": "macro_generate_source"} */

Part of log output showing the compiled SQL:

select distinct
    table_schema as "table_schema", table_name as "table_name"
from ANALYTICS.information_schema.tables
where table_schema ilike 'BINGADS'
and table_name ilike '%'
and table_name not ilike ''

System information

The contents of your packages.yml file:

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.1
  - package: fishtown-analytics/codegen
    version: 0.3.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

"dbt_version": "0.18.0"

The operating system you're using:

The output of python --version:

Additional context

Are you interested in contributing the fix?

generate_model_yaml + windows

The generate_model_yaml command example doesn't work on Windows.

DBT>dbt run-operation generate_model_yaml --args '{"model_name": "customers"}'
usage: dbt [-h] [--version] [-r RECORD_TIMING_INFO] [-d]
           [--log-format {text,json,default}] [--no-write-json]
           [--use-colors | --no-use-colors] [-S] [--warn-error]
           [--partial-parse | --no-partial-parse]
           {docs,source,init,clean,debug,deps,list,ls,snapshot,rpc,run,compile,parse,test,seed,run-operation}
           ...
dbt: error: unrecognized arguments: customers}'

Windows cmd doesn't like the space, but the usual trick of wrapping it in quotes gives this error:

DBT>dbt run-operation generate_model_yaml --args "'{"model_name": "customers"}'"
Running with dbt=0.19.0

Encountered an error while running operation: Compilation Error
  The --vars argument must be a YAML dictionary, but was of type 'str'

Is there a way to get this working on Windows?
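Not a confirmed fix, but two things that commonly work in Windows cmd are leaning on YAML flow syntax (which doesn't need inner quotes for simple values) or escaping the inner double quotes; both are sketches to try rather than documented behaviour:

dbt run-operation generate_model_yaml --args "{model_name: customers}"
dbt run-operation generate_model_yaml --args "{\"model_name\": \"customers\"}"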

Generate_model_yaml doesn't recognize nested BigQuery columns

Describe the bug

generate_model_yaml doesn't recognize nested bigquery columns

Steps to reproduce

Create a model with nested columns, either through STRUCT() or ARRAY(), for example:

STRUCT(
   source,
   medium,
   source_medium
) analytics

Expected results

The macro should print out the nested columns as well; currently it prints out only the "main" column/field. With the example above we should get source, medium, source_medium, and analytics as columns in the yaml output.

Actual results

The macro prints out only the "main" column/field which is analytics

System information

The contents of your packages.yml file:

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.5.0
  - package: fishtown-analytics/codegen
    version: 0.1.0

Which database are you using dbt with?

  • bigquery

The output of dbt --version:

installed version: 0.17.1
   latest version: 0.17.1

The operating system you're using:
Ubuntu 18.04
The output of python --version:
Python 3.8.1

Are you interested in contributing the fix?

I would love to, unfortunately my knowledge of Jinja isn't sufficient so I'd appreciate some pointers

generate source to have parameter to take existing descriptions in the project

Describe the feature

It would be good if the generate_source macro could have a boolean parameter, which if set to True, looks for existing documentation to use instead of having blank descriptions.

Describe alternatives you've considered

The alternative is to use generate_source filtering to one table and then inserting that carefully into the existing sources.yml file... but this could be error prone.

Additional context

Not database specific.

Who will this benefit?

Analytics Engineers who are using codegen to generate docs, but just want to regenerate the whole lot at once when adding a new table and therefore want to keep whatever they had before for existing tables.

Are you interested in contributing this feature?

Sure, I'd need to figure out how to set up an environment, etc. Or is there a devcontainer, so I could use Codespaces?

Generate dbt tests in generate_source for columns that have constraints in the datawarehouse

Describe the feature

Would be really cool to generate unique and not null tests in the generate_source output if there are primary key / unique / not null constraints on the source table in the data warehouse.
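For illustration, the enhanced output for a table whose order_id column carries a primary key constraint might look something like this (hypothetical output, not what the macro produces today):

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null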

Describe alternatives you've considered

The output of the current generate_source macro can be manually update to create these tests. This requires users to manually check if these constraints exist in the dw and then manually add them to the source yaml.

Additional context

This would likely be database-specific as there would be different syntax to query column metadata in each warehouse.

Who will this benefit?

This would benefit anyone setting up source yamls in their dbt project. It would enhance the output of the macro, by automatically including these unique and not null tests in the configuration file. This would lead to 1) better tested (and trusted) data and 2) improved documentation that is more in line with the state of the datawarehouse.

Are you interested in contributing this feature?

I looked into this a bit and happy to chat through what I have found, but I think this would require changes outside of the codegen package (potentially) and I am not sure I understand where to make these changes.

Snowflake - Shared Tables Do Not Appear In Information Schema

Tables created in a DB via a snowflake table share do not appear in the information schema and so the source codegen doesn't work. Still figuring out where they show up and how we can expand the initial table query to include them, but wanted to flag this here.

Throw an error when no columns are found during `generate_model_yaml`

Describe the feature

I would like generate_model_yaml to throw an error if the table it is attempting to generate a .yml file for does not exist.

Describe alternatives you've considered

I could create a wrapper script to run this macro that throws an error if no columns are found, but the number of people who would benefit from that effort would be pretty small in comparison to contributing to the repo.

Additional context

I don't believe this would be database-specific. In my mind the simple way to do this is to just add a check for whether columns are returned from the get_columns_in_relation macro here. If no columns are returned, throw an error that indicates that no columns were found.
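A minimal sketch of such a check, assuming it sits immediately after the existing get_columns_in_relation call (model_relation is a stand-in for whatever relation variable the macro already uses):

{% set columns = adapter.get_columns_in_relation(model_relation) %}

{% if columns | length == 0 %}
    {{ exceptions.raise_compiler_error(
        "No columns found for " ~ model_relation ~ ". Has the model been run since it was created or renamed?"
    ) }}
{% endif %}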

Who will this benefit?

I think this change will benefit dbt users who might be a bit newer to using dbt/codegen and may think the codegen package isn't working rather than realizing they simply haven't yet run dbt for the model they're working on or that they dropped or renamed the model since dbt was last run.

Are you interested in contributing this feature?

Yep!

Increase readability of `generate_model_import_ctes.sql` by using `VERBOSE` mode for `re`

Describe the feature

Modify generate_model_import_ctes.sql to use verbose mode (adding x to the i flag) for re, allowing regexes to be split across multiple lines and annotated with comments.

Additional context

Here is an example of regex without verbose (all in 1 line):

{% set re = modules.re %}
{% set my_string = <my_sql_code> %}
{% set re_pattern = '(?i)(from|join)\s+({{\s*ref\s*\(\s*[\'\"]?)([^)\'\"]+)([\'\"]?\s*)(\)\s*}})' 
%}

{% set is_match = re.findall(re_pattern, my_string) %}

and with verbose:

{% set re = modules.re %}
{% set my_string = <my_sql_code> %}
{% set re_pattern = "(?ix)

    # first matching group
    (from|join)\s+       # from or join followed by at least a sep character

    # second matching group
    (
    {{                   # opening {{
    \s*ref\s*\(          # the word ref followed by a (
    \s*[\'\"]?
    )

    ([^)\'\"]+)([\'\"]?\s*)(
    \)\s*                # closing ) for ref
    }}                   # closing {{
    )" 
%}

{% set is_match = re.findall(re_pattern, my_string) %}

Who will this benefit?

People wanting to understand/enhance regexes to find tables in models.

Are you interested in contributing this feature?

Why not but keen to let someone else look at it

`generate_base_model` should work with seeds

Describe the feature

Wrapping a base model around a seed is frequently useful, but generate_base_model expects sources only.

Describe alternatives you've considered

  • Faking up my seed as a source so that I can run the codegen.
  • Building up base models manually.
  • Manually modifying seeds to clean up column names and data.

Additional context

Not database specific, but I have seeds that are static extracts from other systems that just spit out CSV. I'd rather keep processing of the CSVs to a minimum to make updating them as simple as possible (reexport, reseed, rebuild) and so want to use a base model to e.g. rename unfriendly columns, clean up and retype data etc.

This would also benefit generate_source; a seed is a type of source, and the same needs to document it and wrap it in base models exist.

Who will this benefit?

Users of seeds and base models.

Are you interested in contributing this feature?

Yes, but need guidance.

Fix: Snowflake column selection

When running on Snowflake, if the schema and table names are lowercase, generate_base_model returns the structure but without the column names.

If you update the schema to be capitalized and the table to have a capitalized identifier, then it works as expected.

Feature: Alphabetized Columns in Base Model

Describe the feature

Have the generate_base_model macro output columns in alphabetical order.

Describe alternatives you've considered

Currently this is a manual step after the generate_base_model macro compiles.

Additional context

Our style guide requires alphabetizing column names in our base models. Currently, this has to be done as an additional step after running the generate_base_model macro.

Who will this benefit?

This will make generation of base models faster. It will benefit all users who are subject to a style guide which requires models to alphabetize column names.

`generate_source` not working on Snowflake due to column case

Describe the bug

When attempting to generate a source, an opaque error is generated. I've confirmed that the SELECT ... FROM INFORMATION_SCHEMA query does in fact return results, but the macro still fails.

Steps to reproduce

dbt run-operation generate_source --args 'schema_name: STG_BOOST'

Expected results

Generated sources.yml output

Actual results

Python runtime error:

$ dbt run-operation generate_source --args 'schema_name: STG_BOOST'
11:52:37  Running with dbt=1.2.2
11:52:37  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.insight_transformation.example

11:52:39  Encountered an error while running operation: Runtime Error
  Field "path" of type Path in SnowflakeRelation has invalid value {'database': 'INSIGHT_DEV2', 'schema': Undefined, 'identifier': Undefined}

Screenshots and log output

============================== 2022-10-17 11:52:37.691786 | 27ca7160-d499-4ada-920d-4e2036661d3a ==============================
11:52:37.691807 [info ] [MainThread]: Running with dbt=1.2.2
11:52:37.694314 [debug] [MainThread]: running dbt with arguments {'write_json': True, 'use_colors': True, 'printer_width': 80, 'version_check': True, 'partial_parse': True, 'static_parser': True, 'profiles_dir': '/dbt-runner/insight_transformation', 'send_anonymous_usage_stats': True, 'event_buffer_size': 100000, 'quiet': False, 'no_print': False, 'macro': 'generate_source', 'args': 'schema_name: STG_BOOST', 'which': 'run-operation', 'rpc_method': 'run-operation', 'indirect_selection': 'eager'}
11:52:37.697333 [debug] [MainThread]: Tracking: tracking
11:52:37.699636 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff77be4400>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff77be4b80>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff77be48e0>]}
11:52:37.781163 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
11:52:37.783317 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
11:52:37.785223 [warn ] [MainThread]: [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.insight_transformation.example

11:52:37.792267 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '27ca7160-d499-4ada-920d-4e2036661d3a', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff7789e0d0>]}
11:52:37.822391 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '27ca7160-d499-4ada-920d-4e2036661d3a', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff779fe5e0>]}
11:52:37.824686 [debug] [MainThread]: Acquiring new snowflake connection "macro_generate_source"
11:52:37.849402 [debug] [MainThread]: Using snowflake connection "macro_generate_source"
11:52:37.851521 [debug] [MainThread]: On macro_generate_source: /* {"app": "dbt", "dbt_version": "1.2.2", "profile_name": "insight_transformation", "target_name": "default", "connection_name": "macro_generate_source"} */
select distinct
            table_schema as "table_schema",
            table_name as "table_name",

            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as "table_type"

        from INSIGHT_DEV2.information_schema.tables
        where table_schema ilike 'STG_BOOST'
        and table_name ilike '%'
        and table_name not ilike ''
11:52:37.853695 [debug] [MainThread]: Opening a new connection, currently in state init
11:52:39.075223 [debug] [MainThread]: SQL status: SUCCESS 28 in 1.22 seconds
11:52:39.097273 [debug] [MainThread]: Snowflake adapter: Error running SQL: macro generate_source
11:52:39.104811 [debug] [MainThread]: Snowflake adapter: Rolling back transaction.
11:52:39.108034 [debug] [MainThread]: On macro_generate_source: Close
11:52:39.197696 [error] [MainThread]: Encountered an error while running operation: Runtime Error
  Field "path" of type Path in SnowflakeRelation has invalid value {'database': 'INSIGHT_DEV2', 'schema': Undefined, 'identifier': Undefined}
11:52:39.205033 [debug] [MainThread]:
11:52:39.208489 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff778abfa0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff778abbe0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0xffff778aba00>]}
11:52:39.213634 [debug] [MainThread]: Flushing usage events
11:52:39.784175 [debug] [MainThread]: Connection 'macro_generate_source' was properly closed.

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package: dbt-labs/codegen
    version: 0.8.1
  - package: Snowflake-Labs/dbt_constraints
    version: 0.5.1
  - package: calogica/dbt_expectations
    version: 0.7.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

$ dbt --version
Core:
  - installed: 1.2.2
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - snowflake: 1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:
Working in a docker container, using base image python:3.9-slim

$ uname -a
Linux 5d79fa681533 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 GNU/Linux

The output of python --version:

$ python --version
Python 3.9.15

Additional context

The problem seems to be here, in dbt_utils macro get_relations_by_pattern.

https://github.com/dbt-labs/dbt-utils/blob/064c2abee80816ea855de32e66e54e96d9916677/macros/sql/get_relations_by_pattern.sql#L18-L23

The columns in the query result are named correctly, but they're returned in all caps. If I change those lines to the following, then the command works as expected.

            {%- set tbl_relation = api.Relation.create(
                database=database,
                schema=row.TABLE_SCHEMA,
                identifier=row.TABLE_NAME,
                type=row.TABLE_TYPE
            ) -%}

I'm not 100% sure where a change needs to be made, it's possible that it's actually get_tables_by_pattern_sql.sql that needs to be changed. I'm not sure whether it's possible to force snowflake to return results with the correct case. I would have expected snowflake to use the quoted column aliases in the query, which are lower...

Are you interested in contributing the fix?

I probably won't have time any time soon.

Optional table_pattern on generate_source.sql

Describe the feature

Add a parameter to the generate_source macro which exposes the table_pattern parameter used in dbt_utils.get_relations_by_pattern.

Here is a starter implementation of the idea:

{% macro get_tables_in_schema(schema_name, database_name=target.database, table_pattern='') %}
    
    {% if table_pattern != '' %}
        {% set tables=dbt_utils.get_relations_by_pattern(
            database=database_name,
            schema_pattern=schema_name,
            table_pattern=table_pattern
        ) %}
    {% else %}
        {% set tables=dbt_utils.get_relations_by_pattern(
            database=database_name,
            schema_pattern=schema_name,
            table_pattern='%'
        ) %}

    {% endif %}

    {% set table_list= tables | map(attribute='identifier') %}

    {{ return(table_list | sort) }}

{% endmacro %}

Describe alternatives you've considered

  • Individual users could modify their own macros to override the default behavior
  • Users can parse larger yamls after the fact to target the sources they ultimately want to map
  • Users could move data from existing schemas to new "dbt specific" raw schemas and then use codegen against those. This option somewhat defeats a benefit of codegen in getting new users up-and-running quickly with dbt.

Additional context

Relevant for all databases

Who will this benefit?

Users with raw data schemas that contain more tables than they want to map into their dbt projects.

Are you interested in contributing this feature?

Sure - I have a branch sitting locally that I can push up if we have alignment.

Support for external tables (databricks)

Hi, I really like this package! Unfortunately it doesn't seem to support creating a source from external tables (e.g. Databricks Lakehouse).

I looked at the generate_source.sql code, which calls dbt_utils.get_relations_by_pattern to create a list of tables. Unfortunately that only seems to work for relational databases(?). We are using the dbt-external-tables package to create our external (hive) tables, but it requires manually putting in all the source information and this package would be very helpful.

Do you think it would be a lot of work to make it work for external tables?

Use `print` instead of `log`

Describe the feature

Rather than using the log Jinja function to output generated code, use the print Jinja function added in PR-4701

Describe alternatives you've considered

Leave as-is.

Additional context

This issue is based on a suggestion from @emmyoop (here) to allow for cleaner output from the macros using the new global --quiet flag in dbt.

@jaypeedevlin and I worked on PR-4701 together to implement the solution she outlined in dbt-core, and would like to complete this next part of it as well.
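Concretely (as the "Write to file" issue further down also notes), the change would essentially amount to swapping the final logging call in each macro, for example replacing

{{ log(joined, info=True) }}

with

{{ print(joined) }}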

Who will this benefit?

Users who would like to copy the output from dbt run-operation directly into a file without having to remove info and warnings.

Are you interested in contributing this feature?

Yep!

Package tidy up

Repo checklist:

Urgent:

  • A require-dbt-version parameter in the dbt_project.yml:
require-dbt-version: [">=0.17.0", "<0.18.0"]
  • The dbt_project.yml is in version 2 format (upgrading docs)
  • Any sample dbt_project.yml snippets in the README.md are in v2 syntax (especially for packages that use vars)
  • Release is of format major.minor.patch, e.g. 1.0.0
  • A license (example)
  • If this package relies on dbt-utils, make sure the range allows v0.4.x
packages:
  - package: fishtown-analytics/dbt_utils
    version: ["???", "<0.5.0"]

Integration tests

If this package has integration tests:

  • The dbt_project.yml of the integration tests is in v2 format
  • CircleCI uses python v3.6.3
  • The dbt_project.yml does not have a require-dbt-version config (it should be inherited from the package)

Quick wins:

  • Issue and PR templates (examples — make sure you update for the right package name)
  • A codeowner (example)
  • If required, a level of support header in the README

    ⚠️ This package is currently not actively maintained.

More in-depth improvements

  • Installation instructions (see segment), including:
    • Link to packages docs
    • Variables that need to be included in the dbt_project.yml file
    • Listed database support
  • A link to our package contribution guide (Might formalize this as a CONTRIBUTING.md file)
  • Models are documented
  • Macros are documented
  • Integration tests
  • Move model logic intro model files, rather than entirely in adapter macros

Requires admin rights / Claire

  • Repo is named dbt-<package_name>
  • The hub URL listed on the repo


  • A Zap to the #package-releases channel in dbt Slack

Reserved word table names break with quoting off at the project level

Describe the bug

When a table name is a reserved word and you set the generate_columns parameter to True, the generate_source macro fails with an error. You can avoid this by setting quoting at the project level, but that's not recommended especially for Snowflake according to the docs. It's manageable to flip it on during development and back off.

Seems related to #11.

Steps to reproduce

  1. Create a table named group
  2. Leave quoting unspecified in project.yml
  3. Run the generate_source macro on the database/schema that includes the group table and generate_columns=True

Expected results

Produce a source schema entry for the table, or fail more elegantly by skipping the table with a placeholder.

Actual results

The code fails with a Runtime Error - see the log output below.

Screenshots and log output

Encountered an error:
Runtime Error
  Database Error in analysis _codegen (analysis\_codegen.sql)
    001003 (42000): SQL compilation error:
    syntax error line 1 at position 40 unexpected 'GROUP'.
    syntax error line 1 at position 28 unexpected '.'.
    syntax error line 1 at position 39 unexpected '.'.

System information

The contents of your packages.yml file:

  - package: fishtown-analytics/codegen
    version: [">=0.3.0", "<0.4.0"]

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.19.1
   latest version: 0.19.1

Up to date!

Plugins:
  - bigquery: 0.19.1
  - postgres: 0.19.1
  - redshift: 0.19.1
  - snowflake: 0.19.1

The operating system you're using:
Windows 10

The output of python --version:

Python 3.8.9

Additional context

I haven't fully traced it but I think it's happening in the call to adapter.get_columns_in_relation. The issue does not happen if you don't set generate_columns=True.

Are you interested in contributing the fix?

Possibly. If I do I'll need a nudge on how to solve the issue. Is there a way to override quoting within a macro?

Write to file

For anyone looking, for example for the generate_source macro, in order to write to file you can change line 77 in generate_source.sql ({{ log(joined, info=True) }}) to {{ print(joined) }} and then run:

dbt -q run-operation your_codegen_operation > file.yaml

Snowflake CTE base model naming

Not entirely sure of the best approach here, but it makes sense to have more uniquely named CTEs in the base models (e.g. shopify_orders_source vs source and shopify_orders_renamed vs renamed) to avoid the issue of Snowflake's recursive CTEs.
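A sketch of what that naming could look like for a shopify orders base model (source and column names here are illustrative):

with shopify_orders_source as (

    select * from {{ source('shopify', 'orders') }}

),

shopify_orders_renamed as (

    select
        id as order_id,
        created_at

    from shopify_orders_source

)

select * from shopify_orders_renamed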

enhancements for generate_source to include source/table descriptions and source name

Describe the feature

Currently, generate_source has an option to set include_descriptions = True, but this parameter only includes descriptions at the column level. Ideally, description placeholders would also be generated for the source and tables as well. Additionally, the required parameter for the generate_source macro is the schema name, but there is no option to input a name value. It is possible that a user would like to give their source a different name from the schema name.

Describe alternatives you've considered

I can manually update the yaml that is generated from the current generate_source macro, but this is time consuming and prone to yaml formatting issues.

Additional context

I have working code in my own dbt project that I believe solves for both the descriptions at the source/table level, as well as the source name <> schema name. See below.

{% macro get_tables_in_schema(schema_name, database_name=target.database, table_pattern='%', exclude='') %}
    
    {% set tables=dbt_utils.get_relations_by_pattern(
        schema_pattern=schema_name,
        database=database_name,
        table_pattern=table_pattern,
        exclude=exclude
    ) %}

    {% set table_list= tables | map(attribute='identifier') %}

    {{ return(table_list | sort) }}

{% endmacro %}


---
{% macro generate_source(schema_name, name = schema_name, database_name=target.database, generate_columns=False, include_descriptions=False, table_pattern='%', exclude='') %}

{% set sources_yaml=[] %}
{% do sources_yaml.append('version: 2') %}
{% do sources_yaml.append('') %}
{% do sources_yaml.append('sources:') %}
{% do sources_yaml.append('  - name: ' ~ name | lower) %}

{% if include_descriptions %}
    {% do sources_yaml.append('    description: ""' ) %}
{% endif %}

{% if database_name != target.database %}
{% do sources_yaml.append('    database: ' ~ database_name | lower) %}
{% endif %}

{% if schema_name != name %}
{% do sources_yaml.append('    schema: ' ~ schema_name | lower) %}
{% endif %}

{% do sources_yaml.append('    tables:') %}

{% set tables=codegen.get_tables_in_schema(schema_name, database_name, table_pattern, exclude) %}

{% for table in tables %}
    {% do sources_yaml.append('      - name: ' ~ table | lower ) %}
    {% if include_descriptions %}
        {% do sources_yaml.append('        description: ""' ) %}
    {% endif %}
    {% if generate_columns %}
    {% do sources_yaml.append('        columns:') %}

        {% set table_relation=api.Relation.create(
            database=database_name,
            schema=schema_name,
            identifier=table
        ) %}

        {% set columns=adapter.get_columns_in_relation(table_relation) %}

        {% for column in columns %}
            {% do sources_yaml.append('          - name: ' ~ column.name | lower ) %}
            {% if include_descriptions %}
                {% do sources_yaml.append('            description: ""' ) %}
            {% endif %}
        {% endfor %}
            {% do sources_yaml.append('') %}

    {% endif %}

{% endfor %}

{% if execute %}

    {% set joined = sources_yaml | join ('\n') %}
    {{ log(joined, info=True) }}
    {% do return(joined) %}

{% endif %}

{% endmacro %}

A user would then be able to run something like this in the Cloud IDE to generate a more comprehensive source YAML:
{{ codegen.generate_source('tpch_sf001', name = 'tpch', database_name = 'raw', generate_columns = True, include_descriptions = True) }}

Who will this benefit?

This will benefit anyone setting up new sources for the first time in their dbt project, and will encourage those users to add descriptions at the source and table levels, improving their documentation. It will also eliminate confusion when a user passes include_descriptions = True without also setting generate_columns = True.

Currently, the following command:
{{ codegen.generate_source('tpch_sf001', database_name = 'raw', include_descriptions = True) }}
generates YAML with no descriptions at all:

version: 2

sources:
  - name: tpch_sf001
    database: raw
    tables:
      - name: customer
      - name: lineitem
      - name: nation
      - name: orders
      - name: part
      - name: partsupp
      - name: region
      - name: supplier

As a user, I would expect this to still generate description placeholders at the source and table level.

Are you interested in contributing this feature?

Yes, I would love to contribute to this feature! I have some working code locally, but would appreciate a hand getting this into the dbt-codegen repo the right way!

Not compatible with dbt 0.17

I am getting an error when trying to install this package with dbt 0.17; it seems the package needs to be updated:

RPCException(10006, Dependency Error, {'type': 'DependencyException', 'message': "Version error for package fishtown-analytics/dbt_utils: Could not find a satisfactory version from options: ['=0.3.0', '>=0.1.25', '>=0.1.25', '<0.3.0']"

generate_columns fails if no database provided for Databricks

Describe the bug

If you run generate_source with generate_columns = True, the operation fails because database is not supported for Databricks.

Steps to reproduce

If you run the following it fails:

dbt run-operation generate_source --args '{"schema_name": "default","table_pattern":"*","generate_columns":True}'

Expected results

It should run for Databricks even without a database, and you should see the following:

sources:
  - name: default
    tables:
      - name: my_first_dbt_model
        columns:
          - name: id

      - name: my_second_dbt_model
        columns:
          - name: id

Actual results

it returns the following error:

Encountered an error while running operation: Runtime Error
  Field "path" of type Path in DatabricksRelation has invalid value {'database': Undefined, 'schema': 'default', 'identifier': 'my_first_dbt_model'}

Screenshots and log output

See above

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: databricks)

The output of dbt --version:

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - databricks: 1.1.0 - Up to date!
  - spark:      1.1.0 - Up to date!

The operating system you're using:
Mac OS

The output of python --version:
Python 3.8.10

Additional context

I was able to fix this by adding the following to macros/generate_source.sql

{% if target.type == 'databricks' and target.database is not defined %}
{% set database_name = None %}
{% endif %}

Are you interested in contributing the fix?

I would like to contribute a fix, but I would need guidance on the best way to make database_name optional if the approach above is not the right one.

ilike not supported in T-SQL

Describe the bug

ilike is not supported in T-SQL. When running the generate_source macro, the compiled SQL throws an error.

Steps to reproduce

dbt run-operation generate_source --args '{"schema_name": "sink"}'

Expected results

.yml file containing a list of all tables in sink schema

Actual results

Encountered an error while running operation: Database Error
  ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]An expression of non-boolean type specified in a context where a condition is expected, near 'ilike'. (4145) (SQLExecDirectW)")

Screenshots and log output

17:06:09.584072 [info ] [MainThread]: Running with dbt=1.3.0
17:06:09.584072 [debug] [MainThread]: running dbt with arguments {'write_json': True, 'use_colors': True, 'printer_width': 80, 'version_check': True, 'partial_parse': True, 'static_parser': True, 'profiles_dir': 'C:\\Users\xxxxx\\.dbt', 'send_anonymous_usage_stats': True, 'event_buffer_size': 100000, 'quiet': False, 'no_print': False, 'macro': 'generate_source', 'args': '{schema_name: sink}', 'which': 'run-operation', 'rpc_method': 'run-operation', 'indirect_selection': 'eager'}
17:06:09.584072 [debug] [MainThread]: Tracking: tracking
17:06:09.601213 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x00000204239B7708>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x00000204239B7848>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x00000204239B7788>]}
17:06:10.319117 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
17:06:10.319117 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
17:06:10.348178 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '51552c34-aa33-467e-b0ca-b695db136455', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020423A57C48>]}
17:06:10.394193 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '51552c34-aa33-467e-b0ca-b695db136455', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020423CDAF08>]}
17:06:10.399190 [debug] [MainThread]: Acquiring new sqlserver connection "macro_generate_source"
17:06:10.401180 [debug] [MainThread]: On macro_generate_source: COMMIT
17:06:10.503190 [debug] [MainThread]: SQLServer adapter: Using sqlserver connection "macro_generate_source".
17:06:10.504188 [debug] [MainThread]: SQLServer adapter: On macro_generate_source: 

      

        select distinct
            table_schema as "table_schema",
            table_name as "table_name",
            
            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as "table_type"

        from sqlserver.information_schema.tables
        where table_schema ilike 'sink'
        and table_name ilike '%'
        and table_name not ilike ''


17:06:10.504188 [debug] [MainThread]: Opening a new connection, currently in state init
17:06:10.505184 [debug] [MainThread]: SQLServer adapter: Using connection string: DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlserver.database.windows.net,1433;Database=sqldb;UID={auser};PWD=***;encrypt=Yes;TrustServerCertificate=No;Application Name=dbt-sqlserver/1.3.0
17:06:11.768674 [debug] [MainThread]: SQLServer adapter: Connected to db: sqldb
17:06:11.881675 [debug] [MainThread]: SQLServer adapter: Database error: ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]An expression of non-boolean type specified in a context where a condition is expected, near 'ilike'. (4145) (SQLExecDirectW)")
17:06:11.882676 [debug] [MainThread]: On macro_generate_source: ROLLBACK
17:06:11.883675 [debug] [MainThread]: On macro_generate_source: Close
17:06:11.884673 [debug] [MainThread]: SQLServer adapter: Error running SQL: macro generate_source
17:06:11.885676 [debug] [MainThread]: SQLServer adapter: Rolling back transaction.
17:06:11.887677 [error] [MainThread]: Encountered an error while running operation: Database Error
  ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]An expression of non-boolean type specified in a context where a condition is expected, near 'ilike'. (4145) (SQLExecDirectW)")
17:06:11.889678 [debug] [MainThread]: 
17:06:11.891678 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020423B7E248>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020423B7E388>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020423B7EDC8>]}
17:06:11.892680 [debug] [MainThread]: Flushing usage events
17:06:12.288065 [debug] [MainThread]: Connection 'macro_generate_source' was properly closed.

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package: dbt-labs/codegen
    version: 0.8.1

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: Azure SQL)

The output of dbt --version:

Core:
  - installed: 1.3.0
  - latest:    1.3.0 - Up to date!

Plugins:
  - sqlserver: 1.3.0 - Up to date!

The operating system you're using:

The output of python --version:

py --version    
Python 3.7.8

Additional context

Issue with ilike
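
For reference, a portable rewrite of the filters in the compiled SQL above is to combine lower() with like, which T-SQL understands; the real fix likely belongs in the dispatched dbt_utils macro that builds this query, where an adapter-specific implementation can be supplied. A sketch of the where clause only:

-- portable alternative to the ilike filters shown in the log above
where lower(table_schema) like lower('sink')
  and lower(table_name) like lower('%')
  and lower(table_name) not like lower('')

Lowercasing both sides keeps the comparison case-insensitive, which is what ilike provides on Postgres, Redshift and Snowflake.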

Are you interested in contributing the fix?

Feature: Write a macro to convert a jinja dictionary to yaml

This will make generating yaml so much cleaner!

This is almost it, but it doesn't unnest things that are lists in the way we would expect:

{% macro dict_to_yaml(my_dict, indent=0, result="") %}
{% set ns=namespace(result=result) %}
{% for key, value in my_dict.items() %}
    {% set ns.result= ns.result ~ " " * indent ~  key %}

    {% if value is mapping %} {# if it's a dictionary, recurse #}
    {% set ns.result=dict_to_yaml(value, indent + 2, ns.result ~ ":\n") %}
    {% elif value is iterable and value is not string %} {# janky way to check for a list #}
    {% set ns.result = ns.result ~ ":\n" %}
    {% for item in value %}
        {% set ns.result = ns.result ~ " " * indent ~ "- " ~ item ~ "\n" %}
    {% endfor %}
    {% else %}
    {% set ns.result = ns.result ~ ": " ~ value ~ "\n" %}
    {% endif %}

{% endfor %}

{{ return(ns.result) }}
{% endmacro %}
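
As a point of comparison rather than part of the proposal above: dbt's Jinja context also ships a toyaml() method that already serializes nested lists and dictionaries, although it gives less control over key ordering and blank lines than codegen's hand-built output. A minimal sketch, using a purely illustrative dictionary:

{% macro demo_dict_to_yaml() %}
    {# illustrative dictionary only; not tied to any real project #}
    {% set example = {
        "version": 2,
        "sources": [
            {"name": "jaffle_shop", "tables": [{"name": "orders"}, {"name": "customers"}]}
        ]
    } %}
    {# toyaml() is a built-in dbt context method that handles the nested list case #}
    {{ log(toyaml(example), info=True) }}
{% endmacro %}

Running dbt run-operation demo_dict_to_yaml prints the serialized YAML to the console.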

Option for generate_model_yaml to add docs blocks references

Describe the feature

It would be useful to have an optional parameter in generate_model_yaml that also generates the corresponding docs block references. It should default to false (not generating the references), but writing something such as
{{ codegen.generate_model_yaml('data_user', docs_blocks = true) }}
should generate something like:

version: 2

models:
  - name: data_user
    description: ""
    columns:
      - name: user_hk
        description: '{{ doc("user_hk") }}'

      - name: user_id
        description: '{{ doc("user_id") }}'

      - name: create_date
        description: '{{ doc("create_date") }}'

      - name: email_address_text
        description: '{{ doc("email_address_text") }}'

We only use docs blocks (rather than plain strings) when documenting.
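
A minimal sketch of how the column loop inside generate_model_yaml could emit such a reference (the model_yaml list and column variables are assumptions about the macro's internals, and docs_blocks is the hypothetical flag proposed above):

{% if docs_blocks %}
    {# build the literal string '{{ doc("<column>") }}' so it lands verbatim in the YAML #}
    {% set doc_ref = "'{{ doc(\"" ~ column.name | lower ~ "\") }}'" %}
    {% do model_yaml.append('        description: ' ~ doc_ref) %}
{% else %}
    {% do model_yaml.append('        description: ""') %}
{% endif %}

When dbt later parses the generated YAML, the quoted {{ doc(...) }} expression is rendered as the referenced docs block.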

Describe alternatives you've considered

We are currently using an adapted version of the codegen macro in our dbt project to achieve this; previously we were manually copying and pasting. :P

Additional context

Who will this benefit?

Anyone who uses docs blocks as the default documentation method in their project.

Are you interested in contributing this feature?

Yes, I have a working macro in my dbt project.

Retrieve column and table comments for Redshift tables when running generate_source() with include_descriptions = True

Describe the feature

I am trying to use generate_source() to create source schema.yml files, and persist any description metadata in the Redshift database. We use this both to create the initial schema, and to update existing schema.yml's when columns are added or dropped. We spend too much effort maintaining code to "override" the generated descriptions with override files.

Currently, when running generate_source() with generate_columns=True and include_descriptions=True, the description fields in the YAML are placeholders containing empty strings. If a file already exists, we have to use our custom merging logic to update the descriptions with the actual known values, and then we use dbt docs to publish this info to users. Here is an example of the args used to generate the source schema YAML with descriptions: dbt run-operation generate_source --args '{"generate_columns": "True", "include_descriptions": "True", "schema_name": "some_schema", "database_name": "some_db"}'

I would like for generate_source() to also pull table and column metadata from Redshift comment fields, and include it in the generated yaml instead of the empty strings that are currently defaulted in for the descriptions.

Then separately in dbt-core, I would like to have the persist_docs config extended to support dbt sources so that the source schema.yaml's include any manually overridden descriptions. Those descriptions would be checked into our dbt git repo, deployed, and written back to the db. (To do this would need a separate PR in the dbt-core repo or our own custom logic).

This would enable a closed-loop workflow that looks like this:

  1. Run generate_source() to pull the latest table/column description metadata from the Redshift cluster and generate the yaml files
  2. Modify any table or column description values in the yaml files
  3. Commit -> push -> build. During the build we generate the dbt docs
  4. dbt docs uses the persist_docs config to write the description data to Redshift comments (requires another PR in that repo)
  5. repeat for any future description changes

Describe alternatives you've considered

Scripting out a cli tool with bash + python + macros to maintain manual "override" files for each generated yaml file. The manual files are merged into the generated yaml files before checking them in and rendering them in the dbt docs.

It is too much work to maintain this, and it would be very convenient for generate_source() to do this from one command.

Additional context

I am interested mostly in Redshift. I am not sure what approach other databases take to persist table and column descriptions. But it could probably be done for other databases too.

Here is an example query to get the table and column descriptions from Redshift:

WITH tables AS (
    SELECT c.oid,
           ns.nspname as schema_name,
           c.relname as table_name,
           d.description as table_description
    FROM pg_catalog.pg_class AS c
    JOIN pg_catalog.pg_namespace AS ns
      ON c.relnamespace = ns.oid
    LEFT JOIN pg_catalog.pg_description d
      on c.oid = d.objoid
     and d.objsubid = 0
     WHERE ns.nspname not in ('pg_catalog')
)
SELECT t.oid,
       c.table_schema as schema_name,
       c.table_name AS table_name,
       t.table_description,
       c.column_name AS column_name,
       c.ordinal_position,
       d.description as column_description
from tables t
join information_schema.columns c
    on c.table_schema = t.schema_name
   and c.table_name = t.table_name
left join pg_catalog.pg_description d
   ON d.objoid = t.oid
  AND d.objsubid = c.ordinal_position
  AND d.objsubid > 0
where 1=1
and coalesce(table_description, column_description) is not null
order by
    t.schema_name,
    t.table_name,
    c.ordinal_position
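
To make the integration concrete, here is a minimal sketch of how the table-level half of that query could be wrapped with run_query() and turned into a lookup for generate_source to consult (the macro name and return shape are illustrative assumptions, not the package's actual implementation):

{% macro get_redshift_table_descriptions(schema_name) %}
    {# illustrative helper: returns a {table_name: description} mapping for one schema #}
    {% set description_sql %}
        select c.relname as table_name,
               d.description as table_description
        from pg_catalog.pg_class as c
        join pg_catalog.pg_namespace as ns
          on c.relnamespace = ns.oid
        left join pg_catalog.pg_description as d
          on c.oid = d.objoid
         and d.objsubid = 0
        where ns.nspname = '{{ schema_name }}'
    {% endset %}

    {% set descriptions = {} %}
    {% if execute %}
        {% set results = run_query(description_sql) %}
        {% for row in results.rows %}
            {% do descriptions.update({row['table_name']: row['table_description'] or ''}) %}
        {% endfor %}
    {% endif %}
    {{ return(descriptions) }}
{% endmacro %}

generate_source could then substitute descriptions.get(table, '') for the empty-string placeholder whenever include_descriptions is true; the column-level half of the query could be handled the same way.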

Who will this benefit?

  1. Anyone integrating table and column description metadata into an end-to-end deployment workflow that includes dbt docs, using the source schema.yml files as the interface for maintaining the database metadata, with the database holding the source of truth
  2. Anyone persisting table and column descriptions in the database based on the info in the schema.yml files
  3. Anyone defaulting the table and column descriptions in the schema.yml files to the latest values in the database

Are you interested in contributing this feature?

I am open to contributing but may need some guidance for testing, and help with the related changes on the dbt-core side to enable persist_docs for sources. But I am not sure what kind of effort persist_docs would take for sources vs writing our own logic to sync yaml descriptions with our db.

Even without the related dbt-core changes we would still get a lot of value from the dbt-codegen changes as described above.
