
dbt-sugar's Introduction



Documentation on gitbook

If you want to help out and join the party feel free to:

dbt-sugar ๐Ÿฌ

โš ๏ธ dbt-sugar is pretty alpha, don't use on your prod models unless you have tested it in a safe place before.

What is dbt-sugar?

dbt-sugar is a CLI tool that lets dbt users easily (and with a bit of fun) perform actions on their dbt models, such as:

  • documentation

  • test enforcement, and probably more in the not-too-distant future.

    We keep track of our progress in the projects section.

dbt-sugar's People

Contributors

bastienboutonnet, danieldiamond, dependabot-preview[bot], dependabot[bot], paulmatthieu, pre-commit-ci[bot], stumelius, virvirlopez, wintersrd, z3z1ma


dbt-sugar's Issues

[feature] make test results more visually appealing

The summary of a run looks something like this:

? Do you want to change the model description of my_first_dbt_model No
? There are 1 undocumented columns. Do you want to document them all? Yes
? Column: 'question': Please write down your description: q desc
? Would you like to add any tests? Yes
? Please select one or more tests from the list below [unique]
? Would you like to add any tags? Yes
? Provide a comma-separated list of tags PII
The 'unique' test question, FAILED, and will NOT be added to your schema.yml.
                It migtht be a good idea to go and fix it.

I think it would be a good idea to add a new line and colour the word "PASSED" or "FAILED" in green and red respectively.

There is also a typo in the last sentence, and it shouldn't be centered.

We might also want to have some indication that dbt-sugar is going to be running tests, and which ones it is running.

Build barebones of a config file and object

We will probably want to store some configs such as paths and the dbt projects we might want to interact with.

This should be a yaml file somewhere; we should parse it and validate it, probably using pydantic because it's so good at validation.
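
A minimal sketch of what that could look like, assuming a hypothetical sugar_config.yml containing a dbt_projects list (file layout and field names are illustrative, not a final API):

# Hypothetical sketch only: parse sugar_config.yml and let pydantic validate it.
from pathlib import Path
from typing import List, Optional

import yaml
from pydantic import BaseModel


class DbtProjectConfig(BaseModel):
    name: str
    path: Path
    exclude: Optional[List[str]] = None


class SugarConfig(BaseModel):
    dbt_projects: List[DbtProjectConfig]


def load_config(config_path: Path) -> SugarConfig:
    with open(config_path, "r") as config_file:
        raw_config = yaml.safe_load(config_file)
    # pydantic raises a ValidationError with a readable message if anything is off
    return SugarConfig(**raw_config)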

Build the basic "document a model" functionality

Goal

Users should be able to do:

dbt-sugar doc -m my_awesome_model

This would trigger a query against the database to check what the columns of the model are, and kick off the interactive flow to:

  • write model description
  • write column descriptions
  • pick tests (let's start with dbt's built-ins -- later on we can find a way to easily make the tool register other custom ones, see #7)
  • maybe run the tests to ensure they pass
  • NOTE: we'll have to make sure we can give control over the dev and prod dbs and all that down the line, with some --target dev argument a bit like is done in dbt

Look into why I can't set up a docker container for PG on GitHub Actions

I had a few issues creating a Postgres container when running the test suite on the Windows platform. The issue was that no image could be found, which surprises me. Since the tox suite does seem to work when triggered locally (also using tox), I'm not going to sweat it too much right now.

How would we select scope of which projects and tables to document?

One thing we'll need to worry about fairly soon is how we manage scopes of dbt projects to document.

Some users might have one repo per dbt project and some might have several dbt projects in one repo (monorepo-like setup).

We'll probably want to offer, in a yaml file or so, an ability for users to list the dbt projects (and possibly tables) to include in or exclude from the documentation task. It would probably look something like this:

dbt_projects:
    - dbt_dwh:
        - exclude: ['table_1', 'table_2']
    - dbt_prediction

It'll probably change and I don't even know if that yaml is even valid but at least this captures the gist of it.

Develop API and Connector architecture

We will develop our own connectors. This means we should aim to support at least:

  • Postgres
  • Snowflake

We should build:

  • base connector
  • concrete adaptors for the different db types we want to build
  • a factory method to return the concrete adaptors based on the db_type from the profiles.yml
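
A rough sketch of the shape this architecture could take (class and function names are placeholders, not the final API):

# Illustrative sketch only: a base connector plus a factory keyed on db_type.
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseConnector(ABC):
    def __init__(self, credentials: Dict[str, str]) -> None:
        self.credentials = credentials

    @abstractmethod
    def get_columns_from_table(self, table: str, schema: str) -> List[str]:
        """Return the column names of a table so the doc task can iterate over them."""


class PostgresConnector(BaseConnector):
    def get_columns_from_table(self, table: str, schema: str) -> List[str]:
        raise NotImplementedError  # would query information_schema.columns via a pg driver


class SnowflakeConnector(BaseConnector):
    def get_columns_from_table(self, table: str, schema: str) -> List[str]:
        raise NotImplementedError  # would query Snowflake's information schema


CONNECTORS = {"postgres": PostgresConnector, "snowflake": SnowflakeConnector}


def connector_factory(db_type: str, credentials: Dict[str, str]) -> BaseConnector:
    # db_type would come from the profiles.yml entry for the active target
    if db_type not in CONNECTORS:
        raise NotImplementedError(f"No connector implemented for db type '{db_type}'.")
    return CONNECTORS[db_type](credentials)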

Implement Model and column documentation flow

The basic flow for the doc command should be as follows:

  1. dbt-sugar doc -m my_model
  2. User would be shown the model-level description and asked a [y/n] question of whether they want to alter the current documentation.
  3. Columns. We should start with the undocumented columns:
    1. Users are told "There are n undocumented columns" and presented with a summary of the columns that are up for documentation.
    2. Users are asked to choose which columns out of these they want to document. They would select them (a rough sketch of this prompt flow follows this list).
    3. From that list we would ask them in order to document the column via a text prompt.
    4. Once all the selected columns have been documented the list of columns + descriptions is passed back to the "back-end" and they are physically populated in the yaml files.
  4. Columns. Already documented columns:
    1. Users should also then be asked whether they want to amend or reword any already documented columns.
    2. If yes, we should show them the list (we'll have to think about how to make it visually appealing and all that).
    3. The flow from there is similar to the undocumented columns flow outlined above. Users select, document and we submit that back to the back end which goes and modifies the yaml files.
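
A minimal sketch of the undocumented-columns prompt flow described in step 3, assuming a prompt library such as questionary (purely illustrative, not the actual front-end code):

# Illustrative only: collect descriptions for the undocumented columns.
import questionary


def document_columns(undocumented_columns):
    print(f"There are {len(undocumented_columns)} undocumented columns.")
    to_document = questionary.checkbox(
        "Which columns would you like to document?", choices=undocumented_columns
    ).ask()
    descriptions = {}
    for column in to_document:
        descriptions[column] = questionary.text(
            f"Column: '{column}': Please write down your description:"
        ).ask()
    # hand the collected descriptions back to the back-end, which updates the yaml files
    return descriptions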

I think the next steps from there that we can build upon would be the ability to propagate the definitions across all models. @virvirlopez do you think that makes sense to follow this issue with that extension or do you think maybe we should leave that for later and move on to test enforcement next so we cover more feature ground? I'm open to suggestions and to your feeling on prioritisation.

Add tests UI flow

We probably need to create a simple "base" confirm prompt to support both test adding as well as tags.

Leverage "Column Names as Contracts" ideas to (auto)generate column tests

Noel, one of our community members posted an article from Emily (Capital One) on our discord: https://discord.com/channels/752101657218908281/789211881117057064/807763141526028338

Which refers to the idea of using column names as contracts:

The TL;DR of these articles is that if you use strong column naming conventions such as ID, DT, CAT, you can derive assumptions directly from the column name (e.g. categorical columns have a finite set of allowed terms, IDs should be unique and not null, dates should be within a certain range, etc.).

If this is something people choose to use in their data models, we should be able to auto generate tests for them.

Probably the easiest and most explicit thing to do in a first version is to give users a way to configure a series of mappings in a yaml document which could look like the following:

- auto_tests:
  - pattern: '^ID_'
    test_rules: ['unique', 'not_null']
  - pattern: '^CAT_'
    test_rules: ['allowable_values']
    test_arguments: ['cat_a', 'cat_b']

dbt-sugar could then "evaluate" (in a safe manner) these configs and propose tests based on those column conventions.
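
A rough sketch of how that evaluation could work, mirroring the illustrative config above (rule names and shapes are not final):

# Illustrative only: suggest tests by matching column names against user-defined patterns.
import re
from typing import Dict, List

AUTO_TEST_RULES = [
    {"pattern": r"^ID_", "test_rules": ["unique", "not_null"]},
    {"pattern": r"^CAT_", "test_rules": ["allowable_values"]},
]


def propose_tests(column_names: List[str]) -> Dict[str, List[str]]:
    proposals: Dict[str, List[str]] = {}
    for column in column_names:
        for rule in AUTO_TEST_RULES:
            if re.match(rule["pattern"], column, flags=re.IGNORECASE):
                proposals.setdefault(column, []).extend(rule["test_rules"])
    return proposals


# e.g. propose_tests(["ID_CUSTOMER", "CAT_PAYMENT_METHOD", "AMOUNT"])
# -> {"ID_CUSTOMER": ["unique", "not_null"], "CAT_PAYMENT_METHOD": ["allowable_values"]}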

A similar start of an idea is in discussion in #24, in which we propose that if a column is named as the primary key in the {{config()}} of a model, we could automatically propose a ['unique', 'not_null'] test on this column to the user as they document things.

This could also lead to the birth of a new dbt-sugar task, which could be called auto-tests (or we'll find something sweeter later), which would iterate over all of the models of a dbt project and add tests based on those user rules.

Later, later, we could offer a way for people to share their user pattern/configs so that people could maybe install "packages" of configs into their dbt-sugar projects.

What subtasks would make sense in the `DocumentTask`?

Just putting this out there to get some discussion @virvirlopez

What do we think we want to have in the first (and following) implementations? I could see all the following subtasks:

  • add model description from user input
  • add descriptions to empty columns (or modify existing ones) from user input
  • allow for existing columns with definitions to propagate onto new ones (I have a feeling this should be optional and down to user choice; maybe some teams do not want to homogenise column descriptions).
  • add tests (and have them run in the db)
  • add tags? (I think the prio on this might be quite low; it might be a pretty uncommonly used feature, but maybe something we could ask about).
  • homogenise definitions (this is a bit more long term but I think I definitely want to find a way to make this happen).

What are your thoughts and maybe what do you think about prioritisation on this?

How to keep state and definition information?

Ideally, we'll want to be able to offer a run type where we will be able to enforce a "one field == one definition" rule.

How might we keep state info and be aware of the whole definition set?

  • we can maintain a dictionary in a local .json file.
  • we can maintain a dictionary in a db table (preferred long term as it allows hooking it up to a GH CI action)
  • we maybe want to rely on dbt's manifest.json artefact to leverage things like:
    • deferring runs based on state:modified? This would allow people not to have to figure out which models they should select and document; rather, we would figure it out for them and iterate the documentation task throughout.

Something that is going to be annoying to figure out and nail: how do we solve, in an elegant and least db-dependent way, the issue of not having to force people to commit large changes to the main dictionary?
Here are some possibilities that we will have to discuss:

  • we could set up storage of the central dict into a few storage platforms. The first that comes to mind is probably S3. The annoying thing is that this requires users to set up S3 credentials which might not be the most convenient for most teams due to rights and stuff.
  • we could rely on the production db (as mentioned earlier) to be the storage of column | definition | models (see the sketch below). This would require building a db adaptor quite early on. We may be able to rely on some of dbt's adapter functions to talk to dbs, but this will need to be explored. I think we could maybe just make dbt create a model or a seed and expose it to the db. Not sure what will be easiest, but I think this might be pretty good since dbt will always be ahead in terms of its ability to talk to more dbs. The only issue is that we're dependent on potential changes in dbt's internals that we will not be able to control for if we don't keep an eye on things.
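
Whatever the storage backend, the central dictionary itself could be as simple as a column -> definition -> models mapping; a purely illustrative shape:

# Illustrative shape only, not a schema decision.
definitions = {
    "customer_id": {
        "definition": "Unique identifier of a customer.",
        "models": ["stg_customers", "fct_orders"],
    },
}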

Build audit task

We should be able to give users some statistics around test and documentation coverage for their models. We should have something around the following:

  • documented vs undocumented columns (per model + across the dbt project)
  • tested vs untested models (per model + across the dbt project).

Ideally we should present it a bit like a traditional test coverage report, à la pytest.

Users should do something like dbt-sugar audit [-m model_a] and get statistics about the model, or about all of the models in the dbt project that is in scope.

Documentation Coverage

table      % coverage   undocumented columns
table_a    90%          column_b
table_b    10%          column_a, column_b
=======    ==========   ====================
overall    80%

Test Coverage

table      % coverage   untested columns
table_a    90%          column_b
table_b    10%          column_a, column_b
=======    ==========   ====================
overall    80%

Caveat

  • When running audit on the entire project, we should not show the undocumented/untested columns
  • However, when we run audit on a single model (-m my_model), we can show the list of undocumented and untested columns.
  • We should maybe offer an optional argument like --save-verbose-output filename that would output the full list to a file?
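
For illustration, the coverage numbers themselves could be computed along these lines (names and shapes are made up for the sketch):

# Illustrative only: per-model and project-wide documentation coverage.
from typing import Dict, Tuple


def documentation_coverage(models: Dict[str, Dict[str, str]]) -> Tuple[Dict[str, float], float]:
    # models maps model name -> {column name: description or ""}
    per_model: Dict[str, float] = {}
    total_columns, total_documented = 0, 0
    for model, columns in models.items():
        documented = sum(1 for description in columns.values() if description)
        per_model[model] = 100 * documented / len(columns) if columns else 100.0
        total_columns += len(columns)
        total_documented += documented
    project_coverage = 100 * total_documented / total_columns if total_columns else 100.0
    return per_model, project_coverage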

Add test back-end flow

  • v0 should simply add the tests. If it makes sense to slot in the queries, why not, but no pressure on this for v0 of the app, I would say.

Find a way to test locally by spinning up our own non-cloud db (Postgres, SQLite...)

We'll want to be able to test our stuff and it might be a good idea to make sure we can spin up a local db against which to test our stuff.

I think it's possible to spin up a Postgres db locally (and in GitHub Actions for testing). If this is too complex, we should be able to spin up a SQLite db, but the SQL it can run is a bit shitty. Postgres looks more like the main cloud dbs such as Snowflake, BigQuery and Redshift.
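
For reference, a Postgres service container on GitHub Actions usually looks roughly like the snippet below; note that service containers only run on Linux runners, which is probably why the Windows job mentioned earlier could not find an image (versions and credentials here are placeholders):

# Sketch of a GitHub Actions job using a Postgres service container (Linux runners only).
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_USER: dbt_sugar_test
          POSTGRES_PASSWORD: magical_password
          POSTGRES_DB: dbt_sugar
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5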

[regression] properly support more than 1 dbt project under scope

Rationale

Originally we wanted to provide support for more than one dbt project under a given syrup config. We dropped that to avoid slowing our initial development, but it is a regression, as we strongly believe in dbt-sugar's main objective: keeping column definitions consistent across all of a user's tables.

We discussed a few points in the following issues:

But we made a conscious choice in the code to only support one dbt project for now, to allow us to have an MVP out sooner:

# TODO: Revise the index access here when we can support multiple dbt projects properly.
dbt_project = DbtProject(
    sugar_config.config.get("dbt_projects", list())[0].get("name", str()),
    sugar_config.config.get("dbt_projects", list())[0].get("path", str()),
)

Well, actually we are sketchy about it now and we should correct that and throw a proper error.

Challenge/Potential Solutions

Achieving this means we need to represent and store the model information we parse from the schema.yml files in a dict that is nested one level deeper and also records which dbt_project each model belongs to, so that we can target the right profiles (i.e. credentials and db info) when we query information schemas.
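
For example, the parsed schema.yml information could gain one extra level of nesting keyed by dbt project (a purely illustrative shape):

# Illustrative shape only: model info grouped by the dbt project it belongs to.
models_by_project = {
    "dbt_dwh": {
        "profile": "dbt_dwh",  # which profiles.yml entry (credentials/db) to use
        "models": {
            "my_first_dbt_model": {"columns": {"id": "Primary key.", "question": ""}},
        },
    },
    "dbt_prediction": {"profile": "dbt_prediction", "models": {}},
}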

What would be the best way to connect to various dbs?

The core of dbt-sugar's ability to generate documentation is being able to generate new yaml files from a model's columns. For this it needs to be able to talk to databases and either:

  • query for the columns of the current model
  • or query the entire information schema of a certain db (some databases might or might not have that concept and we'll probably want to make our own abstractions for this)

@virvirlopez and I discussed that we will likely want to build some connectors (or at least an abstraction layer), but we might also be able to leverage dbt's own adapter functions, as dbt already knows how to talk to many databases and has adaptors for them.

The issue with relying on dbt's adaptors is that they may break or change a lot until dbt releases a stable API in v1, so we'll have to do some research.

The good thing is that if we at least write our abstraction layer we'll be able to either go all the way and write our own connectors or we'll be able to replace any broken or changed adaptor fairly quickly. Also, it's probably a good idea not to have too many dependencies.

@virvirlopez feel free to drop some of your thoughts on here and whatever comes out of our research

Build a dbt project finder

We need to know which dbt project we are applying to; if it's more than one, we need to be able to read paths from a config or the CLI. This needs to be architected.

If the subprocess fails because stdout from dbt has changed, we still need to allow users to add tests

I'm sure we'll get in a spot where dbt-sugar fails on the subprocess parsing and stuff. I think we can do the following:

  • have it as an experimental feature and controlled by a flag in the sugar_config.yml
  • have a CLI (+ sugar_config.yml) argument which would allow people to not test that tests pass
  • have a way to propose that if a test fails to run because we fail to parse the stdout log, we output an apologetic message, encourage people to deactivate test enforcement at run time on the CLI with a --force argument, and ask that they send us a bug report asap.

Make use of the `DbtSugarConfig` in documentation to get dbt model path etc.

@virvirlopez as we discussed, we should make sure we hook your DbtSugarConfig into the flow in order to get:

  • the correct dbt_project paths
  • the correct target and credentials associated with it.

While hooking up to this it might make sense to look into how supporting more than one dbt project might work out and what kind of changes we might need to make.

If you figure this out maybe document your ideas in a comment here or make a separate task so that I can see what it takes and maybe help also with what I can.

How does that sound?

Think about how to support test queries without having to implement the test macros ourselves, perhaps by just using the dbt API for that.

As part of the dbt-sugar doc flow, we want to allow people to add tests to columns. While we can stick with just populating their yaml files, it would be great if we could test their model for them at add-time.

For example, if a user were to say "I want column foo to be unique" when we're about to add it to their model's schema.yml file we would actually fire the test query. If the test fails, we would tell them that it failed and not add it.

The issue with doing that is that we could end up going down the rabbit-hole of re-implementing dbt's built-ins, and we wouldn't be able to easily integrate or support other custom tests which users may write as macros or import via packages.

There is probably a way to just use dbt as an API or peek into the code to find the sql templates and test that way.

One potentially dirty way would be to generate subprocess calls that trigger stuff like dbt test -m model with a temporarily generated yaml or so. It would work and wouldn't make us rely too much on the dbt Python API, which isn't yet marked as stable, but it's dirty and involved because we'd have to parse logs etc.
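
A rough sketch of that subprocess route, assuming we only look at the exit code rather than parse logs (illustrative, not the actual implementation):

# Illustrative only: shell out to the dbt CLI and report whether the model's tests passed.
import subprocess


def run_dbt_tests(model_name: str) -> bool:
    result = subprocess.run(
        ["dbt", "test", "-m", model_name],
        capture_output=True,
        text=True,
    )
    # dbt exits non-zero when at least one test fails or errors out
    return result.returncode == 0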

Create a ROADMAP.md document

We should have a roadmap of all the things we plan to ship after the holiday. This will both help us keep track, and allow users to know what is coming and possibly help with it.

Of course we will have issues for each of these but sometimes it's hard to have a birds-eye view of what maintainers of the project are thinking about.

[bug] schema in run_test should not be read from flags

In #85 we merged code that obtains the schema of the model we want to document from the CLI. While we offer this argument on the CLI, its use is discouraged.

We should be reading the schema from the dbt profile. We require a target_profile argument to be given; if the user passes a schema on the CLI, we should override the field in the dbt profile object with what was passed. But the test backend should not read directly from flags; it should read from the dbt profile, which should be responsible for overriding things appropriately given the relevant CLI inputs.

don't crash if at least one dbt project is found

Currently, when we generate the DbtSugarConfig, we make sure that all of the declared dbt projects actually exist on disk (see the snippet below):

# TODO: Maybe we want to revisit this and not have a raise but rather a logger warning and says we'll ignore
if bogus_projects:
    raise MissingDbtProjects(
        f"The following dbt projects are missing: \n{list(bogus_projects.keys())}"
    )
return True

But we may also want to have a "riskier" run option, controlled in the sugar_config.yml or on the CLI as something like --soft-fails, which would turn non-run-critical issues such as this one into a warning rather than a crash.

We'd of course have to document this and be VERY explicit that it can be a bit dangerous, and also consider with care which errors get included in this "non-run-critical" category.

should we allow users to deactivate tags and/or tests questioning in the config/CLI?

I'm thinking some users may not want to use tags or, god forbid, tests.

Would it make sense to maybe have a CLI arg + a sugar_config option which would be something like
skip_tags: True (or skip_tests: True) which would skip dbt-sugar asking if users want to add tags or tests?

@virvirlopez how easy do you think it would be to skip tag or test adding in your backend control flow if you had access to those flags from the DbtSugarConfig object?

When several dbt projects are in scope which target should we decide to use?

The sugar_config.yaml allows users to say "I want to document for project X, Y, Z".

Currently the DbtProfile object actually tries to look for a "default" profile. But if this doesn't exist what should we do?

What if several projects have different targets? How should we decide which db credentials to use?
Should we maybe request another field in the syrups which, much like the paths to the dbt projects would point to a profile name and target to use? Should we require that people set up a default profile? What if they're using their default profile in their projects already and for some reason it's not compatible or has different use cases?

We could alternatively "solve" this by enforcing our own profile file, but for now we'd like to avoid this since it will require users to maintain a very similar profiles.yml file to the one they already have from dbt.

Implement credentials parser and validator

We will want to get credentials directly from the dbt credentials file.

We need to build a parser and validator (using pydantic) to get db info from people's dbt profiles.yml and make it available to the rest of the app (mainly the connectors).
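
A minimal sketch of what that could look like, assuming the standard profiles.yml layout (profile -> outputs -> target); everything beyond dbt's own field names is illustrative:

# Illustrative only: pull db credentials for a given profile/target out of profiles.yml.
from pathlib import Path
from typing import Dict

import yaml
from pydantic import BaseModel


class DbCredentials(BaseModel):
    type: str  # e.g. "postgres" or "snowflake"
    user: str
    password: str = ""
    database: str = ""
    schema_name: str = ""


def read_profile(profiles_path: Path, profile: str, target: str) -> DbCredentials:
    raw: Dict = yaml.safe_load(profiles_path.read_text())
    output = dict(raw[profile]["outputs"][target])
    # rename dbt's "schema" field to avoid clashing with BaseModel's own schema attribute
    output["schema_name"] = output.pop("schema", "")
    return DbCredentials(**output)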

Make it possible to register custom tests

Part of the tool will allow users to add tests to the models they are documenting. While it's easy to hardcode access to a few of the builtins of dbt we might want to find a way to have users register their own custom tests.

It could be read from a yaml or so. Something like this:

custom_tests:
    - dbt_utils.custom_test
    - another_custom_test_not_from_a_package

Should we support adding freshness tests to some sources?

Not much to add here, other than I think there might be a case for easily adding some freshness tests to sources when they are added or defined for the first time, or when we do something like

dbt-sugar sources --project dbt_prediction

We could have something like a quick summary of the sources in this project and offer the user the ability to enforce a freshness test.
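
For reference, a freshness config on a dbt source looks roughly like this (source, field and table names are made up):

sources:
  - name: raw_payments
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments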

I think it's not urgent but definitely a nice feature to think about, as I think we tend to forget to add freshness tests on sources far too often when we need to bring in new sources quickly while developing.

Flesh out DocumentTask implementation

The first task we want to support is the DocumentTask. The bare bones as well as basic Task API definition can be found here: https://github.com/bitpicky/dbt-sugar/tree/main/dbt_sugar/core/task

In its most basic form it should:

  • know what dbt model you want to document
  • check the table in prod or user's env for the table structure (col names)
  • generate an empty yaml file.
  • write this back into the proper schema files.

Very quickly after, we will want to make sure we can ask for user input. This means in this issue we should define the API between the document task and what it will require from the CLI front end.

It might ask for the following:

  • show cols that are documented
  • ask user which one they might want to document
  • collect the definitions
  • submit that back to the documentation object/task.
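
A bare-bones sketch of the shape this could take (method names and the hard-coded schema are placeholders; the real API lives in the repo linked above):

# Illustrative skeleton only: what a minimal DocumentTask could orchestrate.
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTask(ABC):
    @abstractmethod
    def run(self) -> int:
        """Run the task and return an exit code."""


class DocumentTask(BaseTask):
    def __init__(self, model_name: str, connector, schema_file_path: str) -> None:
        self.model_name = model_name
        self.connector = connector
        self.schema_file_path = schema_file_path

    def run(self) -> int:
        # 1. ask the db for the model's columns ("analytics" is a placeholder schema)
        columns: List[str] = self.connector.get_columns_from_table(self.model_name, "analytics")
        # 2. start with empty descriptions; the CLI front end would fill these from user input
        descriptions: Dict[str, str] = {column: "" for column in columns}
        # 3. write the descriptions back into the proper schema.yml file
        self.write_schema(descriptions)
        return 0

    def write_schema(self, descriptions: Dict[str, str]) -> None:
        raise NotImplementedError  # yaml writing left out of the sketch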
