
droughty's Introduction

droughty. adjective, drought·i·er, drought·i·est. dry.

Droughty helps keep your workflow, ahem, dry.


What is droughty?

droughty is an analytics engineering toolkit. It takes warehouse metadata and outputs semantic files.

Current tools and supported platforms are:

  • lookml - generates a .lkml file with views, explores, and measures from a warehouse schema
  • dbt - generates a base schema from specified warehouse schemas, including standard testing routines
  • dbml - generates an ERD based on the warehouse layer of your project, including pk and fk relationships
  • cube - generates a cube schema including dimensions, integrations, and measures

The purpose of this project is to automate the repetitive, dull elements of analytics engineering in the modern data stack. It turns out this also leads to cleaner projects, fewer human errors, and a higher likelihood of the basics getting done...

Documentation

Installation, configuration and usage documentation can be found on ReadTheDocs

Installation

droughty is available through pip:

pip install droughty


droughty's Issues

Label for PoP fields

Please can you add group_label: "Period over period" to all PoP measures and dimensions? Just to keep things nice and tidy.
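For illustration, the requested output on a generated PoP measure might look like this (LookML; the measure name and sql are hypothetical):

measure: total_revenue_previous_period {
  type: sum
  sql: ${TABLE}.revenue ;;
  group_label: "Period over period"
}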

The ability to extend the auto generated tests

Although the auto-generated tests are an amazing start, they are not sufficient for proper testing and expectation-change detection.
Every data model is built on a set of assumptions that might change over time; the only way to protect our data set's integrity is to properly test for unexpected cases.

In the current droughty project I don't believe there is any way to extend the auto-generated tests with more advanced tests; adding this feature would add a lot to droughty. A sketch of what that could enable follows.
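For illustration, an extension hook could let users layer standard dbt-style tests on top of the generated ones; the column name and test values here are hypothetical:

models:
  - name: wh_operations__events_fact
    columns:
      - name: event_status
        tests:
          - not_null
          - accepted_values:
              values: ['open', 'closed']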

Droughty LookML does not work with BigQuery Records

When running droughty against BigQuery records, we get the following error:

Traceback (most recent call last):
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/bin/droughty", line 8, in <module>
    sys.exit(start())
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/main.py", line 34, in start
    lookml_base()
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_cli.py", line 22, in lookml_base
    return output()
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_module.py", line 197, in output
    for value in get_all_values(get_base_dict(),get_field_dict()):
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_base_dict.py", line 53, in get_base_dict
    wrangled_dataframe = {n: grp.loc[n].to_dict('index')
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/droughty/droughty_lookml/lookml_base_dict.py", line 53, in <dictcomp>
    wrangled_dataframe = {n: grp.loc[n].to_dict('index')
  File "/Users/lydiablackley/projects/cia_analytics_warehouse/.venv3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 2063, in to_dict
    raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.
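The underlying pandas behaviour is easy to reproduce: to_dict(orient='index') refuses duplicate index values, which is presumably what repeated RECORD field names produce. A minimal sketch, not droughty's actual dataframe:

import pandas as pd

# Duplicate index values, as nested RECORD fields could plausibly produce:
df = pd.DataFrame({"type": ["string", "string"]}, index=["event", "event"])
df.to_dict("index")  # ValueError: DataFrame index must be unique for orient='index'.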

Figure out how to decouple user-specific project variables from project-specific ones

CI/CD and governance are not possible for test_overwrite and test_ignore when the project yaml is located outside the repo.

That file often lives outside the repo because the test_schemas and dbml_schemas values would otherwise change whenever different developers commit.

My proposal: move developer-specific variables such as test_schemas and dbml_schemas to the profile yaml, as sketched below.
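A sketch of the proposed split (values illustrative):

# droughty_project.yaml (stays in the repo, governed via CI/CD)
test_overwrite: ...
test_ignore: ...

# profile yaml (per developer, outside the repo)
test_schemas:
  - 'main_marts'
dbml_schemas:
  - 'main_marts'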

Support Cube's YAML syntax

Great project, kudos! 🙏

As far as I see, Droughty generates JavaScript-based Cube data models.

It might be more friendly to Droughty's end users (e.g., data engineers) if it's also able to generate Cube data models using new YAML syntax (https://cube.dev/blog/introducing-cube-support-for-yaml-data-modeling, https://cube.dev/docs/schema/fundamentals/working-with-yaml).

It would be great to have both options available since, as of today, YAML is still not on par with JavaScript, feature-wise. For instance, dynamic data models are only possible with JavaScript.
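For reference, the YAML equivalent of a generated JavaScript cube might look roughly like this (keys follow Cube's YAML syntax; field names are borrowed from examples elsewhere on this page):

cubes:
  - name: events_fct
    sql: SELECT * FROM analytics.events_fct
    dimensions:
      - name: event_pk
        sql: event_pk
        type: string
        primary_key: true
      - name: event_ts
        sql: event_ts
        type: time
    measures:
      - name: count_of_event_pk
        sql: event_pk
        type: count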

OpenAI model as a parameter

OpenAI allows us to fine-tune models so that completions are more in line with our domain. Fine-tuning creates a new model from a base model (for example, the text-davinci-003 base model).

This means that if we wanted to generate field descriptions using our new OpenAI model, we would need to set it as config. Would it be possible to allow users to pass a value for the model_engine variable in the openai_descriptions.py file? I assume it could be set as config just like ProjectVariables.openai_secret. A sketch follows.
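A sketch of what that could look like in the profile yaml, alongside the existing secret (the openai_model key name is an assumption):

openai_secret: {redacted}
openai_model: text-davinci-003  # would be read into model_engine in openai_descriptions.py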

Add Looker refinement-type functionality to test generation

When fks are generated null_safe, the automatically generated not_null tests fail.


To allow for this, it would be great to be able to define in droughty_project which columns to skip certain tests for. Syntax could be something like this:

ignore_tests:
  - not_null:
      - model: wh_operations__events_fact
        columns:
          - document_fk
          - case_fk
      - model: wh_operations__cases_fact
        columns:
          - employee_case_owner_fk

Suggestion: don't make test_overwrites overwrite all columns for model

The current setup makes the droughty_project.yaml file very long and minimises the automation value: for each table that needs at least one column overwrite, all columns must be filled in. For the warehouse layer of one of our clients, that constitutes the whole warehouse layer, removing the benefit of droughty's automation.

Being able to overwrite just individual columns would make for a much better workflow, as sketched below.
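A sketch of a column-level overwrite (structure hypothetical; model and column names reused from the previous issue):

test_overwrite:
  wh_operations__events_fact:
    document_fk:
      - not_null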

Documentation is not updated

Using version 0.11.0.

I am looking to test out this project, and when running the command droughty docs --project-dir droughty_project.yaml
I see the following error:

Exception: You have defined test_schemas in your project YAML file. As of 0.9.6 this is defined within the profile YAML file

I then moved test_schemas into my profile yaml file and saw the following error:

  File "/Users/.../.pyenv/versions/3.11.2/lib/python3.11/site-packages/droughty/droughty_core/config.py", line 393, in assign_explore_variables
    if ExploresVariables.test_schemas == None:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'ExploresVariables' has no attribute 'test_schemas'

My profile.yaml file has the following format:

bitbucket_cicd:
  test_schemas:
    - 'main_marts'
  host: {redacted}
  key_file: {redacted}
  password: {redacted}
  port: 443
  project_name:
  schema_name: {redacted}
  user: {redacted}
  warehouse_name: snowflake
  openai_secret: {redacted}
  account: {redacted}
  warehouse: {redacted}
  database: {redacted}
  role: {redacted}


Parse and translate dbt metric definitions

As dbt starts deploying their semantic layer, there is a need to parse not only the entities defined in dbt, but also the metrics. The biggest advantage is that Droughty users could control all semantic functions from within one tool. We can currently expose entities and relationships to Looker, Cube, and dbml; it would be great if we could also expose the metrics as well.

There probably could be multiple ways of accomplishing this. A few options would be:

  • When dbt runs on dbt Cloud, we could read directly from their metadata api.
  • We could materialize artifacts to the data warehouse directly (using existing dbt packages) and read the metric definitions from there.
  • Parse the yaml files directly from the dbt project and select metric definitions from there.

There is, however, another way that might be far simpler, and it is currently used by Cube itself. Its implementation can be found in the loadMetricCubesFromDbtProject method of their dbt extension package.

They currently only document how to read dbt metrics when using dbt Cloud, but after digging into the above repo, we can see that they have a few options in place that they seem to have not documented (nor maybe tested) yet.

Regardless, the way they approach the parsing of dbt metric definitions for projects hosted outside of dbt Cloud is by reading the manifest.json file generated after a dbt run job. The bit where they do the parsing starts on line 147:

Object.keys(manifest.metrics).forEach(metric => {
  const regex = /^ref\('(\S+)'\)$/;
  const metricDef = manifest.metrics[metric];
  const match = metricDef.model.match(regex);
  if (!match) {
    throw new UserError(`Expected reference to the model in format ref('model_name') but found '${metricDef.model}'`);
  }
  // eslint-disable-next-line prefer-destructuring
  const modelName = match[1];
  metricDef.model = modelName.indexOf('.') !== -1 ? modelName : `model.${metricDef.package_name}.${modelName}`;
});

If I read the manifest.json file of the discursus project where I have defined a metric, we have an example of what is being parsed:

"metrics": {
        "metric.discursus_dw.event_count": {
            "fqn": [
                "discursus_dw",
                "warehouse",
                "core",
                "event_count"
            ],
            "unique_id": "metric.discursus_dw.event_count",
            "package_name": "discursus_dw",
            "root_path": "/Users/olivierdupuis/Git/discursus/platform/discursus_data_platform/dw",
            "path": "warehouse/core/core_metrics.yml",
            "original_file_path": "models/warehouse/core/core_metrics.yml",
            "name": "event_count",
            "description": "The count of events",
            "label": "Events count",
            "calculation_method": "count",
            "timestamp": "event_date",
            "expression": "event_pk",
            "filters": [],
            "time_grains": [
                "day",
                "week",
                "month",
                "quarter",
                "year"
            ],
            "dimensions": [
                "action_geo_country_name"
            ],
            "window": null,
            "model": "ref('events_fct')",
            "model_unique_id": null,
            "resource_type": "metric",
            "meta": {},
            "tags": [],
            "config": {
                "enabled": true
            },
            "unrendered_config": {},
            "sources": [],
            "depends_on": {
                "macros": [],
                "nodes": [
                    "model.discursus_dw.events_fct"
                ]
            },
            "refs": [
                [
                    "events_fct"
                ]
            ],
            "metrics": [],
            "created_at": 1668096680.095598
        }
    },

Of course, that only covers the parsing of metric definitions and not how that can then be translated to either Looker, Cube or dbml definitions. But maybe a first step could be to have this as an experimental feature that would be used to convert dbt metrics to Cube, as it might be an appealing feature for dbt and Cube users who are looking for a way to integrate both tools without having to use dbt Cloud.

Try/Except is hiding underlying errors

try:
    for value in list_rows(wrangle_descriptions()):
        print(value)
except:
    pass

As a new user of this package, this try/except block causes the program to fail silently. After manually removing the try/except, I can now see that I have Snowflake connection errors that were being hidden.

It would be useful to catch the errors and log/print them for end users to see, rather than just passing. A sketch follows.
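A minimal fix sketch, assuming the same call sites: log the error and re-raise instead of swallowing it.

import logging

try:
    for value in list_rows(wrangle_descriptions()):
        print(value)
except Exception:
    # Surface the real failure (e.g. a Snowflake connection error) for the user.
    logging.exception("droughty failed while listing description rows")
    raise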

feature: toggle whether to build `tests` for all columns or just `primary keys`

Being able to only generate tests for primary keys would allow large projects to use the dbt build statement without it being prohibitively slow.

This would likely sit in dbt_project.yaml

test_overwrite should still work in this case, as you might want specific tests on crucial non-pk columns even when the newly envisaged pk-only test setting is enabled. A sketch of the toggle follows.
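The toggle itself could be a single flag; the name here is an assumption:

pk_tests_only: true  # generate tests for primary keys only; false keeps today's behaviour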

Option to generate one cube definition per entity

The droughty cube functionality currently generates three files: base, integration, and aggregate. This doesn't translate well to the way I organize my Cube deployments, where I have one cube definition per entity, with the dimensions, measures, and joins all defined in a single file.

Would it be possible to have an option to change how droughty generates the cube definitions, so that instead of the files above I end up with a single file per entity that looks something like the following...

cube('Events', {
  sql: `select * from analytics.events_fct`,

  dimensions: {
    eventPk: {
      primaryKey: true,
      sql: `event_pk`,
      type: `string`,
      shown: true
    },

    movementFk: {
      sql: `movement_fk`,
      type: `string`
    },

    eventDate: {
      sql: `event_date`,
      type: `time`
    },

    actionGeoCountryName: {
      sql: `action_geo_country_name`,
      type: `string`
    },
  },

  measures: {
    eventCount: {
      sql: `event_pk`,
      type: `count`
    }
  },

  joins: {
    Movements: {
      relationship: `belongsTo`,
      sql: `
        ${CUBE}.movement_fk = ${Movements.movementPk}
      `
    },
  },
});

Remove requirement for unused configurations

As a user of droughty, I want to remove unused configurations from my droughty_project.yml file, so that I can keep that file as simple and clean as possible.

For example, if I'm not using the lookml command, I would like to be able to remove the following configs:

explores:
  parent_table: 
    - actors_dim
  dimensions: 
    - events_fct
  facts:
    - narratives_fct

Cube can't find join paths between base and extension cubes

In cube_base.js, I have the following definition:

cube(`events_fct`, {
    sql: `select * from analytics.events_fct`,
    dimensions: {
        action_geo_country_code: {
            sql: `action_geo_country_code`,
            type: `string`,
            description: `not available`,
        },
        action_geo_country_name: {
            sql: `action_geo_country_name`,
            type: `string`,
            description: `not available`,
        },
        action_geo_full_name: {
            sql: `action_geo_full_name`,
            type: `string`,
            description: `not available`,
        },
        event_pk: {
            primaryKey: true,
            type: `string`,
            sql: `event_pk`,
            description: `not available`,
        },
        event_ts: {
            sql: `event_ts`,
            type: `time`,
            description: `not available`,
        },
        protest_fk: {
            sql: `protest_fk`,
            type: `string`,
            description: `not available`,
        },
    }
});

In cube_aggregates, I have the following definition:

cube(`events_fct_extended`, {
    sql: `select * from analytics.events_fct`,
    extends: events_fct,
    measures: {
        sum_of_action_geo_latitude: {
            sql: `action_geo_latitude`,
            type: `sum`,
        },
        sum_of_action_geo_longitude: {
            sql: `action_geo_longitude`,
            type: `sum`,
        },
        count_of_event_pk: {
            sql: `event_pk`,
            type: `count`,
        },
    }
});

When I try to run a query using a metric from events_fct_extended, segmented by a dimension from the events_fct cube, Cube errors that it can't find a join path between the two cubes (error screenshot omitted).

Control names of semantic files generated

As a user of droughty, I want to control the names of the semantic files that are generated, because I always want a specific file name regardless of the sources being used.

For example, when I use droughty dbml, I can run it against my dev or qa schemas. But regardless of which schema it points to, I want it to generate the same file name, as sketched below.
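Concretely, this could be a simple project option; the key name is an assumption:

dbml_file_name: warehouse.dbml  # always write this name, whatever the source schema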

Feature request to be able to omit dimensions/measures created automatically from looker project

Currently droughty creates lookml dimensions and measures by inference, based on information provided within a dbt project. However, there are some columns for which a measure would not make sense, like a sum of screen width.

It would be great if there were a way to specify which measures (and dimensions?) to exclude from the automatically generated base layers of the Looker project. That way users would only go to one place to toggle visibility for those dimensions/measures.

Surface Looker Explore prod/dev datasets in the SQL pane

Would it be possible to generate a customisable line of LookML (specified in the droughty_project.yaml file) beneath each view in the _base.layer.lkml file, in this instance to enable custom prod and development datasets to be surfaced in the Looker Explore SQL pane?

_base.layer.lkml example

view: <data_warehouse_table_name> {
  sql_table_name: `<gcp_project_id>.{{ _user_attributes['<user_attribute_analytics_data_warehouse>'] }}.<data_warehouse_table_name>` ;;
}

droughty_project.yaml

lookml_dataset: sql_table_name: `<gcp_project_id>.{{ _user_attributes['<user_attribute_analytics_data_warehouse>'] }}.<data_warehouse_table_name>` ;;

This would enable users to investigate queries in the data warehouse UI if needed, or check if the user is querying against production or development data.

dbt_utils_expression_is_true test assertion

Bug description

droughty asserts dbt_utils_expression_is_true test on columns application_valid_from_dt and application_valid_to_dt:

- 'dbt_utils.expression_is_true:expression: valid_from < valid_to'

Expected behaviour

In this use case, I would expect droughty to assign only the not_null test to these date columns (flagged by the 'dt' suffix in the column name). The expected output is sketched below.
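In dbt schema terms, the expected generated tests for these columns would be just:

columns:
  - name: application_valid_from_dt
    tests:
      - not_null
  - name: application_valid_to_dt
    tests:
      - not_null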


Workaround

Using test_overwrite: in droughty_schema.yaml to redefine all tests for the model in question.
