
droughty's Introduction

droughty. adjective, drought·i·er, drought·i·est. dry.

Droughty helps keep your workflow, ahem, dry.


What is droughty?

droughty is an analytics engineering toolkit. It takes warehouse metadata and outputs semantic files.

Current tools and supported platforms are:

  • lookml - generates a .lkml file with views, explores, and measures from a warehouse schema
  • dbt - generates a base schema from specified warehouse schemas, including standard testing routines
  • dbml - generates an ERD based on the warehouse layer of your project, including pk and fk relationships
  • cube - generates a cube schema including dimensions, integrations, and measures

The purpose of this project is to automate the repetitive, dull elements of analytics engineering in the modern data stack. It turns out this also leads to cleaner projects, fewer human errors, and a higher likelihood of the basics getting done...

Documentation

Installation, configuration and usage documentation can be found on ReadTheDocs

Installation

droughty is available through pip:

pip install droughty


droughty's Issues

Label for PoP fields

Please can you add group_label: "Period over period" to all PoP measures and dimensions? Just to keep things nice and tidy.
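For illustration, the requested output on a generated PoP measure might look like this (LookML; the measure name and sql are hypothetical):

measure: total_revenue_previous_period {
  type: sum
  sql: ${TABLE}.revenue ;;
  group_label: "Period over period"
}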

The ability to extend the auto generated tests

Although the auto-generated tests are an amazing start, they are not sufficient for proper testing and expectation-change detection.
Every data model is built on a set of assumptions that might change over time; the only way to protect our data set's integrity is to properly test for unexpected cases.

In the current droughty project I don't believe there is any way to extend the auto-generated tests with more advanced tests; adding this feature would add a lot to droughty. A sketch of what that could enable follows.
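For illustration, an extension hook could let users layer standard dbt-style tests on top of the generated ones; the column name and test values here are hypothetical:

models:
  - name: wh_operations__events_fact
    columns:
      - name: event_status
        tests:
          - not_null
          - accepted_values:
              values: ['open', 'closed']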

Droughty LookML does not work with BigQuery Records

When running droughty against BigQuery records, we get the following error:

Traceback (most recent call last):
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/bin/droughty", line 8, in <module>
    sys.exit(start())
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/main.py", line 34, in start
    lookml_base()
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_cli.py", line 22, in lookml_base
    return output()
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_module.py", line 197, in output
    for value in get_all_values(get_base_dict(),get_field_dict()):
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_base_dict.py", line 53, in get_base_dict
    wrangled_dataframe = {n: grp.loc[n].to_dict('index')
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_base_dict.py", line 53, in <dictcomp>
    wrangled_dataframe = {n: grp.loc[n].to_dict('index')
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 2063, in to_dict
    raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.
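The underlying pandas behaviour is easy to reproduce: to_dict(orient='index') refuses duplicate index values, which is presumably what repeated RECORD field names produce. A minimal sketch, not droughty's actual dataframe:

import pandas as pd

# Duplicate index values, as nested RECORD fields could plausibly produce:
df = pd.DataFrame({"type": ["string", "string"]}, index=["event", "event"])
df.to_dict("index")  # ValueError: DataFrame index must be unique for orient='index'.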

Figure out how to decouple user-specific project variables from project-specific ones

CI/CD and governance are not possible for test_overwrite and test_ignore when the project yaml is located outside the repo.

That file often lives outside the repo because the test_schemas and dbml_schemas values would otherwise change whenever different developers commit.

My proposal: move developer-specific variables such as test_schemas and dbml_schemas to the profile yaml, as sketched below.
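A sketch of the proposed split (values illustrative):

# droughty_project.yaml (stays in the repo, governed via CI/CD)
test_overwrite: ...
test_ignore: ...

# profile yaml (per developer, outside the repo)
test_schemas:
  - 'main_marts'
dbml_schemas:
  - 'main_marts'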

Support Cube's YAML syntax

Great project, kudos! 🙏

As far as I see, Droughty generates JavaScript-based Cube data models.

It might be more friendly to Droughty's end users (e.g., data engineers) if it's also able to generate Cube data models using new YAML syntax (https://cube.dev/blog/introducing-cube-support-for-yaml-data-modeling, https://cube.dev/docs/schema/fundamentals/working-with-yaml).

It would be great to have both options available since, as of today, YAML is still not on par with JavaScript, feature-wise. For instance, dynamic data models are only possible with JavaScript.
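For reference, the YAML equivalent of a generated JavaScript cube might look roughly like this (keys follow Cube's YAML syntax; field names are borrowed from examples elsewhere on this page):

cubes:
  - name: events_fct
    sql: SELECT * FROM analytics.events_fct
    dimensions:
      - name: event_pk
        sql: event_pk
        type: string
        primary_key: true
      - name: event_ts
        sql: event_ts
        type: time
    measures:
      - name: count_of_event_pk
        sql: event_pk
        type: count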

OpenAI model as a parameter

OpenAI allows us to fine-tune models so that completions are more in line with our domain. Fine-tuning creates a new model from a base model (for example, the text-davinci-003 base model).

This means that if we wanted to generate field descriptions using our new OpenAI model, we would need to set it as config. Would it be possible to allow users to pass a value for the model_engine variable in the openai_descriptions.py file? I assume it could be set as config just like ProjectVariables.openai_secret. A sketch follows.
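A sketch of what that could look like in the profile yaml, alongside the existing secret (the openai_model key name is an assumption):

openai_secret: {redacted}
openai_model: text-davinci-003  # would be read into model_engine in openai_descriptions.py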

Add Looker refinement-type functionality to test generation

When fks are generated null_safe, the automatically generated not_null tests fail.


To allow for this, it would be great to be able to define in droughty_project which columns to skip certain tests for. Syntax could be something like this:

ignore_tests:
  - not_null:
      - model: wh_operations__events_fact
        columns:
          - document_fk
          - case_fk
      - model: wh_operations__cases_fact
        columns:
          - employee_case_owner_fk

Suggestion: don't make test_overwrites overwrite all columns for model

The current setup makes the droughty_project.yaml file very long and minimises the automation value: for each table that needs at least one column overwrite, all columns must be filled in. For the warehouse layer of one of our clients, that constitutes the whole warehouse layer, removing the benefit of droughty's automation.

Being able to overwrite just individual columns would make for a much better workflow, as sketched below.
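A sketch of a column-level overwrite (structure hypothetical; model and column names reused from the previous issue):

test_overwrite:
  wh_operations__events_fact:
    document_fk:
      - not_null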

Documentation is not updated

Using version 0.11.0.

I am looking to test out this project, and when running the command droughty docs --project-dir droughty_project.yaml
I see the following error:

Exception: You have defined test_schemas in your project YAML file. As of 0.9.6 this is defined within the profile YAML file

I then moved test_schemas into my profile yaml file and saw the following error:

  File "/Users/.../.pyenv/versions/3.11.2/lib/python3.11/site-packages/droughty/droughty_core/config.py", line 393, in assign_explore_variables
    if ExploresVariables.test_schemas == None:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'ExploresVariables' has no attribute 'test_schemas'

My profile.yaml file has the following format:

bitbucket_cicd:
  test_schemas:
    - 'main_marts'
  host: {redacted}
  key_file: {redacted}
  password: {redacted}
  port: 443
  project_name:
  schema_name: {redacted}
  user: {redacted}
  warehouse_name: snowflake
  openai_secret: {redacted}
  account: {redacted}
  warehouse: {redacted}
  database: {redacted}
  role: {redacted}


Parse and translate dbt metric definitions

As dbt starts deploying their semantic layer, there is a need to parse not only the entities defined in dbt, but also the metrics. The biggest advantage is that Droughty users could control all semantic functions from within one tool. We can currently expose entities and relationships to Looker, Cube, and dbml; it would be great if we could also expose the metrics as well.

There probably could be multiple ways of accomplishing this. A few options would be:

  • When dbt runs on dbt Cloud, we could read directly from their metadata api.
  • We could materialize artifacts to the data warehouse directly (using existing dbt packages) and read the metric definitions from there.
  • Parse the yaml files directly from the dbt project and select metric definitions from there.

There is, however, another way that might be far simpler, and it is currently used by Cube itself. Its implementation can be found in the loadMetricCubesFromDbtProject method of their dbt extension package.

They currently only document how to read dbt metrics when using dbt Cloud, but after digging into the above repo, we can see that they have a few options in place that they seem to have not documented (nor maybe tested) yet.

Regardless, the way they approach the parsing of dbt metric definitions for projects hosted outside of dbt Cloud is by reading the manifest.json file generated after a dbt run job. The bit where they do the parsing starts on line 147:

Object.keys(manifest.metrics).forEach(metric => {
  const regex = /^ref\('(\S+)'\)$/;
  const metricDef = manifest.metrics[metric];
  const match = metricDef.model.match(regex);
  if (!match) {
    throw new UserError(`Expected reference to the model in format ref('model_name') but found '${metricDef.model}'`);
  }
  // eslint-disable-next-line prefer-destructuring
  const modelName = match[1];
  metricDef.model = modelName.indexOf('.') !== -1 ? modelName : `model.${metricDef.package_name}.${modelName}`;
});

If I read the manifest.json file of the discursus project where I have defined a metric, we have an example of what is being parsed:

"metrics": {
        "metric.discursus_dw.event_count": {
            "fqn": [
                "discursus_dw",
                "warehouse",
                "core",
                "event_count"
            ],
            "unique_id": "metric.discursus_dw.event_count",
            "package_name": "discursus_dw",
            "root_path": "/Users/olivierdupuis/Git/discursus/platform/discursus_data_platform/dw",
            "path": "warehouse/core/core_metrics.yml",
            "original_file_path": "models/warehouse/core/core_metrics.yml",
            "name": "event_count",
            "description": "The count of events",
            "label": "Events count",
            "calculation_method": "count",
            "timestamp": "event_date",
            "expression": "event_pk",
            "filters": [],
            "time_grains": [
                "day",
                "week",
                "month",
                "quarter",
                "year"
            ],
            "dimensions": [
                "action_geo_country_name"
            ],
            "window": null,
            "model": "ref('events_fct')",
            "model_unique_id": null,
            "resource_type": "metric",
            "meta": {},
            "tags": [],
            "config": {
                "enabled": true
            },
            "unrendered_config": {},
            "sources": [],
            "depends_on": {
                "macros": [],
                "nodes": [
                    "model.discursus_dw.events_fct"
                ]
            },
            "refs": [
                [
                    "events_fct"
                ]
            ],
            "metrics": [],
            "created_at": 1668096680.095598
        }
    },

Of course, that only covers the parsing of metric definitions and not how that can then be translated to either Looker, Cube or dbml definitions. But maybe a first step could be to have this as an experimental feature that would be used to convert dbt metrics to Cube, as it might be an appealing feature for dbt and Cube users who are looking for a way to integrate both tools without having to use dbt Cloud.

Try/Except is hiding underlying errors

try:
    for value in list_rows(wrangle_descriptions()):
        print(value)
except:
    pass

As a new user of this package, this try/except block causes the program to fail silently. After manually removing the try/except, I can now see that I have Snowflake connection errors that were being hidden.

It would be useful to catch the errors and log/print them for end users to see, rather than just passing. A sketch follows.
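A minimal fix sketch, assuming the same call sites: log the error and re-raise instead of swallowing it.

import logging

try:
    for value in list_rows(wrangle_descriptions()):
        print(value)
except Exception:
    # Surface the real failure (e.g. a Snowflake connection error) for the user.
    logging.exception("droughty failed while listing description rows")
    raise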

feature: toggle whether to build `tests` for all columns or just `primary keys`

Being able to only generate tests for primary keys would allow large projects to use the dbt build statement without it being prohibitively slow.

This would likely sit in dbt_project.yaml

test_overwrite should still work in this case, as you might want specific tests on crucial non-pk columns even when the newly envisaged pk-only test setting is enabled. A sketch of the toggle follows.
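The toggle itself could be a single flag; the name here is an assumption:

pk_tests_only: true  # generate tests for primary keys only; false keeps today's behaviour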

Option to generate one cube definition per entity

The droughty cube functionality currently generates three files: base, integration, and aggregate. This doesn't translate well to the way I organize my Cube deployments, where I have one cube definition per entity, with the dimensions, measures, and joins all defined in a single file.

Would it be possible to have an option to change how droughty generates the cube definitions, so that instead of the files above I end up with a single file per entity that looks something like the following...

cube('Events', {
  sql: `select * from analytics.events_fct`,

  dimensions: {
    eventPk: {
      primaryKey: true,
      sql: `event_pk`,
      type: `string`,
      shown: true
    },

    movementFk: {
      sql: `movement_fk`,
      type: `string`
    },

    eventDate: {
      sql: `event_date`,
      type: `time`
    },

    actionGeoCountryName: {
      sql: `action_geo_country_name`,
      type: `string`
    },
  },

  measures: {
    eventCount: {
      sql: `event_pk`,
      type: `count`
    }
  },

  joins: {
    Movements: {
      relationship: `belongsTo`,
      sql: `
        ${CUBE}.movement_fk = ${Movements.movementPk}
      `
    },
  },
});

Remove requirement for unused configurations

As a user of droughty, I want to remove unused configurations from my droughty_project.yml file, so that I can keep that file as simple and clean as possible.

For example, if I'm not using the lookml command, I would like to be able to remove the following configs:

explores:
  parent_table: 
    - actors_dim
  dimensions: 
    - events_fct
  facts:
    - narratives_fct

Cube can't find join paths between base and extension cubes

In cube_base.js, I have the following definition:

cube(`events_fct`, {
    sql: `select * from analytics.events_fct`,
    dimensions: {
        action_geo_country_code: {
            sql: `action_geo_country_code`,
            type: `string`,
            description: `not available`,
        },
        action_geo_country_name: {
            sql: `action_geo_country_name`,
            type: `string`,
            description: `not available`,
        },
        action_geo_full_name: {
            sql: `action_geo_full_name`,
            type: `string`,
            description: `not available`,
        },
        event_pk: {
            primaryKey: true,
            type: `string`,
            sql: `event_pk`,
            description: `not available`,
        },
        event_ts: {
            sql: `event_ts`,
            type: `time`,
            description: `not available`,
        },
        protest_fk: {
            sql: `protest_fk`,
            type: `string`,
            description: `not available`,
        },
    }
});

In cube_aggregates, I have the following definition:

cube(`events_fct_extended`, {
    sql: `select * from analytics.events_fct`,
    extends: events_fct,
    measures: {
        sum_of_action_geo_latitude: {
            sql: `action_geo_latitude`,
            type: `sum`,
        },
        sum_of_action_geo_longitude: {
            sql: `action_geo_longitude`,
            type: `sum`,
        },
        count_of_event_pk: {
            sql: `event_pk`,
            type: `count`,
        },
    }
});

When I try to run a query using a metric from events_fct_extended, segmented by a dimension from the events_fct cube, Cube errors that it can't find a join path between the two cubes (error screenshot omitted).

Control names of semantic files generated

As a user of droughty, I want to control the names of the semantic files that are generated, because I always want a specific file name regardless of the sources being used.

For example, when I use droughty dbml, I can run it against my dev or qa schemas. But regardless of which schema it points to, I want it to generate the same file name, as sketched below.
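Concretely, this could be a simple project option; the key name is an assumption:

dbml_file_name: warehouse.dbml  # always write this name, whatever the source schema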

Feature request to be able to omit dimensions/measures created automatically from looker project

Currently droughty creates lookml dimensions and measures by inference, based on information provided within a dbt project. However, there are some columns for which a measure would not make sense, like a sum of screen width.

It would be great if there were a way to specify which measures (and dimensions?) to exclude from the automatically generated base layers of the Looker project. That way users would only go to one place to toggle visibility for those dimensions/measures.

Surface Looker Explore prod/dev datasets in the SQL pane

Would it be possible to generate a customisable line of LookML (specified in the droughty_project.yaml file) beneath each view in the _base.layer.lkml file, in this instance to enable custom prod and development datasets to be surfaced in the Looker Explore SQL pane?

_base.layer.lkml example

view: <data_warehouse_table_name> {
  sql_table_name: `<gcp_project_id>.{{ _user_attributes['<user_attribute_analytics_data_warehouse>'] }}.<data_warehouse_table_name>` ;;
}

droughty_project.yaml

lookml_dataset: sql_table_name: `<gcp_project_id>.{{ _user_attributes['<user_attribute_analytics_data_warehouse>'] }}.<data_warehouse_table_name>` ;;

This would enable users to investigate queries in the data warehouse UI if needed, or check if the user is querying against production or development data.

dbt_utils_expression_is_true test assertion

Bug description

droughty asserts dbt_utils_expression_is_true test on columns application_valid_from_dt and application_valid_to_dt:

- 'dbt_utils.expression_is_true:expression: valid_from < valid_to'

Expected behaviour

In this use case, I would expect droughty to assign only the not_null test to these date columns (flagged by the 'dt' suffix in the column name). The expected output is sketched below.
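In dbt schema terms, the expected generated tests for these columns would be just:

columns:
  - name: application_valid_from_dt
    tests:
      - not_null
  - name: application_valid_to_dt
    tests:
      - not_null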


Workaround

Using test_overwrite: in droughty_schema.yaml to redefine all tests for the model in question.
