dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Home Page: https://getdbt.com

License: Apache License 2.0

Python 73.26% Makefile 0.09% Shell 0.11% Dockerfile 0.02% HTML 25.75% Rust 0.75% Batchfile 0.01%
dbt-viewpoint slack pypa data-modeling business-intelligence analytics elt

dbt-core's Introduction

dbt logo

CI Badge

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

architecture

Understanding dbt

Analysts using dbt can transform their data by simply writing select statements, while dbt handles turning these statements into tables and views in a data warehouse.

These select statements, or "models", form a dbt project. Models frequently build on top of one another – dbt makes it easy to manage relationships between models, and visualize these relationships, as well as assure the quality of your transformations through testing.

dbt dag

Getting started

Join the dbt Community

Reporting bugs and contributing code

  • Want to report a bug or request a feature? Let us know and open an issue
  • Want to help us build dbt? Check out the Contributing Guide

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.

dbt-core's People

Contributors

aranke, azhard, bastienboutonnet, beckjake, chenyulinx, cmcarthur, dbeatty10, dependabot[bot], drewbanin, emmyoop, fishtownbuildbot, github-actions[bot], gshank, iknox-fa, ilkinulas, jtcohen6, jthandy, kconvey, leahwicz, max-sixty, mcknight-42, michelleark, mikealfare, nathaniel-may, peterallenwebb, prratek, qmalcolm, raalsky, stu-k, versusfacit


dbt-core's Issues

arbitrarily deep nesting

deep

#50

Summary

This branch primarily addresses code cleanup/refactoring. In addition, it seemed like a good time to add arbitrary model nesting support. Everything can be arbitrarily deeply nested. This includes .csv files, model .sql files, schema.yml files, and analysis files!

New dbt_project.yml model configurations

Suppose the root of your models repository looks like this:

dbt_project.yml
models/
├── other_service
│   ├── accounts.sql
│   ├── people.sql
│   ├── people_accounts.sql
│   └── schema.yml
└── saas_service
    └── summary
        └── domain_counts.sql
analysis/
└── domains
    └── domain_counts.sql

You can structure your models config in dbt_project.yml to selectively configure models at any depth you want. You must specify the full path to a model in the config for its configuration to take effect. That is, if your file is located at models/saas_service/summary/domain_counts.sql, your config should look like:

models:
  'Your Project Name':
    saas_service:
      summary:
        domain_counts:
          enabled: true

Of course, if you just want to enable all models in Your Project Name, you can write:

models:
  'Your Project Name':
    enabled: true

This works at any level of the config, and the most restrictive config that can apply to a model is used. Each level of the config "inherits" its parent's config and can override some or all of it.
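A minimal sketch of how this cascading might be resolved (illustrative only, not dbt's actual implementation): walk from the project root down to the model's path, letting each level override whatever its parent set.

def resolve_model_config(project_config, model_path, defaults=None):
    # model_path is e.g. ['Your Project Name', 'saas_service', 'summary', 'domain_counts']
    config = dict(defaults or {})
    node = project_config.get('models', {})
    for part in model_path:
        node = node.get(part, {})
        for key, value in node.items():
            if not isinstance(value, dict):   # scalar settings: enabled, materialized, ...
                config[key] = value
    return config

project = {'models': {'Your Project Name': {
    'enabled': True,
    'saas_service': {'enabled': False,
                     'summary': {'domain_counts': {'enabled': True}}}}}}

resolve_model_config(project, ['Your Project Name', 'saas_service', 'summary', 'domain_counts'])
# -> {'enabled': True}: disabled at the saas_service level, re-enabled at the model level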

A full example:

name: 'My Project Name'
version:  '1.0'
...
# default parameters that apply to _all_ models (unless overridden below)
model-defaults:
  enabled: true           # enable all models by default
  materialized: false     # If true, create tables. If false, create views

models:
  'My Project Name': # this is the name of your dbt project (listed above)
    enabled: true       # by default, enable models in this repository (including deps)
    saas_service:
      enabled: false     # disable 'My Project Name'.saas_service.*
      materialized: true
      emails:
        enabled: true # specifically enable 'My Project Name'.saas_service.emails

  'Analyst_Collective': # this is a dependency listed below
    enabled: false  # by default, disable all 'Analyst_Collective' models
    emails:
      enabled: true  # only turn on 'email' and 'pardot' models
    pardot:
      enabled: true

repositories:
   - "git@github.com:analyst-collective/analytics.git" # name = 'Analyst_Collective'

Namespacing

I am a bit worried about our namespaces. It seems like we have three things: models, tests, analyses, and they all have their own namespace tree. Seems weird. I wonder if top level shouldn't be pardot and then within that there are tests, models, analysis. There could potentially be other elements as well.

Curious to know whether this has occurred to you / if you see it as a problem. Might be a useful cleanup.

implement hooks

We need to create "hooks" for the core dbt tasks. Hooks would simply offer the opportunity to write sql-based scripts that would execute before and after the various stages in the dbt commands.

One example use for a hook would be setting grants for a schema and all tables in it to certain users after that schema is created. There's no need for dbt to set those grants itself, this is a generic way to allow users to do this or many other tasks associated with dbt processes.

For now, I could imagine pre and post hooks for dbt run. Since the other commands don't actually interface with the database, I'm not sure that the concept makes as much sense there. If anyone can imagine a use, I'd be open to it though.

In the future it might make sense to implement hooks for individual models, but I don't see the need to do that today.
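A rough sketch of what run-level hooks could look like (names and shape are illustrative, not a committed design): each hook is just a SQL statement executed before or after the run on the same connection.

def run_with_hooks(cursor, run_models, pre_hooks=(), post_hooks=()):
    for sql in pre_hooks:
        cursor.execute(sql)
    run_models()                      # the existing dbt run logic
    for sql in post_hooks:
        cursor.execute(sql)

# e.g. the grants use case described above
post_hooks = [
    'grant usage on schema analytics to group reporting',
    'grant select on all tables in schema analytics to group reporting',
]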

dbt should compile an analytics folder to target as well as the models folder

Right now it's super annoying to interpolate the schema names in analytical queries in a dbt project. We've never really considered dbt a tool for analysis, but the way I'm using it is that in a single dbt project you have both models and analyses, and you work on them in parallel in your text editor and sql terminal. Once the results of a model are good, you dbt run it. Once the results of an analysis are good, you paste it into Mode.

The problem is that right now, analyses don't get compiled, which means that schema interpolation is super-annoying. Every analytic query needs to have its schema name changed when moving from test to production (which is exactly what I'm doing right now with 12 queries).

(Test vs. production is managed via schema names.)

I don't think this is really requesting any new functionality, just the addition of an analytics folder within dbt_project.yml and a corresponding folder within target that compiles in exactly the same way that models currently does.

dbt should have the ability to seed tables from csvs

there are a bunch of types of data that are needed around the edges of any specific analysis. mapping an entity to metadata about it, tables for exceptions to ignore, etc. all of this data needs to be in the warehouse somewhere so that it can be incorporated into analytic queries.

currently this process is being facilitated via super-long union all queries that encode the data within the structure of a view. this is not a good solution for performance and maintenance reasons; it's also just weird/awkward. instead, there should be CSVs that are checked into a /data directory in the dbt folder that get loaded by dbt.

potentially the command should be dbt seed.

other thoughts:

  • there needs to be a file that defines the schema for each table; otherwise it would have to be inferred from the csv and that's more work than we need to do right now.
  • it's far easier to truncate and reload these tables than drop and rebuild tables because it doesn't require cascade-dropping all of the dependencies on the table. we probably need a separate dbt seed --create-tables vs. just dbt seed, where simply calling seed truncates and reloads whereas --create-tables drops any existing tables and recreates from the definition files.
  • loading should probably be done via insert statements. while a redshift COPY command would be more performant, it would also force us to go outside of SQLAlchemy and would not be compatible across multiple eventual endpoints. performance is not a concern because the purpose of this function is to load very small numbers of rows (single-digit thousands should be considered high for this use case). see the sketch below.
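A rough sketch of the truncate-and-reload path using plain insert statements (illustrative only; column names come from the separate schema definition file mentioned above, and the %s placeholders assume a DB-API driver such as psycopg2):

import csv

def seed_table(cursor, schema, table, csv_path, columns):
    cursor.execute('truncate table "{}"."{}"'.format(schema, table))
    insert_sql = 'insert into "{}"."{}" ({}) values ({})'.format(
        schema, table, ', '.join(columns), ', '.join(['%s'] * len(columns)))
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            cursor.execute(insert_sql, [row[c] for c in columns])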

run queries in parallel

dbt run should issue multiple commands to the database in parallel. it should use the dependency graph to determine what tasks don't depend on each other and run them as soon as possible (with some configurable max # of concurrent tasks).

this has the potential to _significantly_ speed up dbt run execution time, which is really important for usability.
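A rough sketch of graph-aware parallel execution (illustrative only; assumes the dependency graph is a networkx DiGraph with no cycles): start a model as soon as everything it depends on has finished, bounded by a configurable worker count.

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_in_parallel(graph, run_model, max_workers=4):
    # graph: networkx.DiGraph where an edge (a, b) means b depends on a
    remaining = set(graph.nodes())
    done = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        while remaining or futures:
            ready = [n for n in remaining
                     if all(p in done for p in graph.predecessors(n))]
            for node in ready:
                remaining.remove(node)
                futures[pool.submit(run_model, node)] = node
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for f in finished:
                done.add(futures.pop(f))
                f.result()   # surface any model error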

dbt throws error without additional command

if I just type dbt at command line i get the following error:

Traceback (most recent call last):
  File "/Users/tristan/Development/zendesk/venv/bin/dbt", line 11, in <module>
    sys.exit(main())
  File "/Users/tristan/Development/zendesk/venv/lib/python3.4/site-packages/dbt/main.py", line 43, in main
    proj = proj.with_profiles(parsed.profile)
AttributeError: 'Namespace' object has no attribute 'profile'

This should function like other bash commands and supply a list of possible well-formed commands.
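The traceback suggests the parser never requires a subcommand. A minimal illustration of the suggested behaviour (not dbt's actual main.py): print the usage text and exit when no command is supplied.

import argparse
import sys

p = argparse.ArgumentParser(prog='dbt')
subs = p.add_subparsers(dest='command')
for name in ('run', 'compile', 'seed', 'clean'):
    subs.add_parser(name)

parsed = p.parse_args()
if parsed.command is None:
    p.print_help()       # list the well-formed commands instead of raising AttributeError
    sys.exit(1)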

consider adding "listeners" to trigger dbt run commands

this only makes sense with a hosted solution with some level of persistence, but dbt could listen for things to make it know that it needed to update tables. for instance, when there is a new payment since the most recent processing.

I think this would be implemented as a sql statement that returned 1 if reprocessing was required and 0 if not.

another standard dbt schema validation

accepted values. for instance, this field can have values of "paid" and "free".

accepted-values:
- {field: type, values: ['paid', 'free']}

this is very common. would be a great thing to implement within schema.yml.
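For reference, a sketch of the SQL such a test might generate; like the other schema tests, it passes when the query returns 0 (the helper below is illustrative):

def accepted_values_sql(table, field, values):
    quoted = ', '.join("'{}'".format(v) for v in values)
    return ('select count(*) from {table} '
            'where {field} not in ({quoted}) and {field} is not null').format(
                table=table, field=field, quoted=quoted)

accepted_values_sql('subscriptions', 'type', ['paid', 'free'])
# select count(*) from subscriptions where type not in ('paid', 'free') and type is not null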

logging locations

run logs should be output to a folder or to standard out optionally. in dev, stdout is probably easier; in production, would be good to output to date-versioned files.

`dbt seed` for dependencies

There is a csv that I want to load within the snowplow open source repo. Can't do it, because dbt seed doesn't understand dependencies.

allow users to specify the schema for a model

right now, all models need to be deployed in a common schema. this is messy/confusing, but it allows the user to use schemas to manage test / production environments.

a more mature configuration would be to use separate databases or clusters for development and production, periodically copying data from production to test. this would free up all schemas within that database to be used, rather than needing to cordon them off into a single schema. this means that snowplow models could go into the snowplow schema, right along with the raw data. this is desirable.

we could easily infer schema names for a given model from the filename, as we currently do for table names: {{schemaname}}.{{tablename}}.sql, where {{schemaname}} could be optional and would therefore default back to the default schema name defined in profiles.yml.
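A small sketch of that filename convention (illustrative): strip the .sql extension and treat anything before the first dot as the schema, falling back to the default schema from profiles.yml.

import os

def parse_model_filename(path, default_schema):
    name = os.path.basename(path)
    if name.endswith('.sql'):
        name = name[:-len('.sql')]
    if '.' in name:
        schema, table = name.split('.', 1)
        return schema, table
    return default_schema, name

parse_model_filename('models/snowplow/snowplow.events.sql', 'analytics')   # ('snowplow', 'events')
parse_model_filename('models/accounts.sql', 'analytics')                   # ('analytics', 'accounts')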

test for referential integrity is awkward

right now, dbt is assuming you're specifying referential integrity on the "one" side of the one-to-many relationship. this is awkward; you should ideally specify it on the "many" side of the relationship. what you're really doing in this case is checking "are there any values in this field that don't appear in the parent table?", so it's counterintuitive to specify it on the parent table, because it's actually validating the child table.

PLMK if this doesn't make sense. It should ideally just be a change to the construction of that one SQL statement.

For now, I'm going to construct all of my schema.yml files to work with the way it is currently specified, but I'd prefer to migrate them over to the corrected format asap.

do something else with CSV loading--it sucks

option 1: maybe an ENV function like load_csv_file(...)?

option 2: force the user to write the model as follows (this is terrible pseudocode):

{ for item in dict(whatever)
  return (build a concatenated select statement with a UNION ALL at the end)
}

option 3: other stuff?

option 4: kill the feature?

right now we have a fairly strong assumption that dbt does not load data into your warehouse. this feature in its current form defies that. I'd really prefer to re-implement it in a more "dbt-y" way. option 2 feels the most appropriate to me, although forcing the user to write this code is terrible. if we built a wrapper function that essentially did exactly this, then turned CSVs into an appropriate way to load configuration data, then we're getting to a place that feels more like a good solution to this, however. it would allow the user to materialize this model however they chose to.
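A sketch of what the option 1/2 wrapper could look like (load_csv_file is a hypothetical name): render the CSV as a chain of select ... union all statements, so the result is just another model the user can materialize however they like. Every value is treated as text for simplicity.

import csv

def load_csv_file(path):
    with open(path) as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames
        selects = ['select ' + ', '.join("'{}' as {}".format(row[c], c) for c in columns)
                   for row in reader]
    return '\nunion all\n'.join(selects)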

dbt run is missing output

I'm fairly sure that when I dbt run, the output doesn't contain all the actual actions. I am attaching the output as well as what objects have been created in the database: it tells me that 5 views have been created but there are actually 10 views created.

(screenshots attached: the dbt run output and the objects actually created in the database)

dbt profiles should have environment schemas called out

dbt run should take an environment argument:

dbt run --dev
dbt run --prod

This isn't a massive deal, but at the moment you have to go into profiles.yml and switch the schema. And that's vulnerable to forgetting that you've switched the schema and accidentally deploying to production when you didn't mean to.

Switch run-targets via command line

There is an ability in config to define multiple run-targets. It seems like I should be able to specify a run target as an option in the command line, because otherwise the facility to switch between them is fairly awkward.

In general, the ability to define / switch between run targets should be thought through and documented. It will be common to use dbt in multiple environments at once from within a single machine (I'm trying to do this now and struggling) and we need to support it.
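A sketch of how a command-line switch might pick between run-targets (the profiles.yml keys below are assumptions about the layout, not the current format):

import argparse
import yaml

p = argparse.ArgumentParser(prog='dbt run')
p.add_argument('--target', default='dev', choices=['dev', 'prod'])
args = p.parse_args()

with open('profiles.yml') as f:
    profile = yaml.safe_load(f)

target_cfg = profile['run-targets'][args.target]   # host, schema, user, ... (layout assumed)
print('running against schema', target_cfg['schema'])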

allow users to seed data from production into a test environment

with some basic filters on top of base models, it would be fairly straightforward to limit test data to a subset of production data and then unload/copy it into another database for testing.

this would be particularly elegant to do using dbt because of the hierarchical nature of models--if you filter at the base layer, subsequent layers inherit.

"incremental" deploy

only update views/tables which have changed
for developmental debugging, not for production

log levels

should implement log levels (warn/error/etc). this will be useful in the future if we want to monitor a production deployment via amazon cloudwatch--you could pipe all of the messages of a certain level to a certain queue.
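A minimal sketch, assuming the standard library logging module: debug/info to stdout in development, warnings and above to a date-versioned file in production (names and layout are illustrative).

import logging
import sys
from datetime import date

def setup_logging(env='dev'):
    logger = logging.getLogger('dbt')
    if env == 'dev':
        handler = logging.StreamHandler(sys.stdout)
        logger.setLevel(logging.DEBUG)
    else:
        handler = logging.FileHandler('logs/dbt-{}.log'.format(date.today().isoformat()))
        logger.setLevel(logging.WARNING)   # warn/error only; easy to ship to CloudWatch later
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    return logger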

dbt push

push to "prod" and run on some schedule

would be good to set up cloudwatch, alarms too

self()

implement a self() function that returns schema_name.model_name. this will be helpful in pre- and post-hooks as outlined here: #22
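One wrinkle: Jinja already reserves the name self inside templates for block references, so a sketch of this probably has to expose the helper under a different name (here this(), purely illustrative):

import jinja2

def render_model(sql, schema, model_name):
    def this():
        return '"{}"."{}"'.format(schema, model_name)
    return jinja2.Template(sql).render(this=this)

render_model('grant select on {{ this() }} to group reporting', 'analytics', 'orders')
# grant select on "analytics"."orders" to group reporting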

dbt compile networkx error

We've confirmed that networkx 1.11 was installed at the same time as dbt. Here's the error message:

  File "/usr/local/bin/dbt", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 2603, in <module>
    working_set.require(__requires__)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 666, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 565, in resolve
    raise DistributionNotFound(req)  # XXX put more info here
pkg_resources.DistributionNotFound: networkx==1.11

append instead of recreate for immutable tables

i.e. Snowplow events. There's no need to recreate the model tables, just append the new data

materialize: t/f --> type: view/table/incremental

if not exist:
create
else:
insert select (as template)... add WHERE on id/date field (specified in config)

Need dependency resolution

Running dbt compile and then dbt run causes an error in the current state of the models repo, because there are dependencies. Currently, every time I dbt compile I need to go into target and rename folders, because models are just executed in alphabetical order.

Ideally dependencies are simply inferred from the content of the SQL.
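A rough sketch of inferring the graph by scanning each model's SQL for the names of other models and then topologically sorting (illustrative only; real resolution would need to be smarter than a word match):

import re
import networkx as nx

def build_graph(models):
    # models: dict of model name -> raw SQL
    graph = nx.DiGraph()
    graph.add_nodes_from(models)
    for name, sql in models.items():
        for other in models:
            if other != name and re.search(r'\b{}\b'.format(re.escape(other)), sql):
                graph.add_edge(other, name)   # name depends on other
    return graph

def run_order(models):
    return list(nx.topological_sort(build_graph(models)))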

write integration tests

  • archival
  • csv seeding
  • dependency projects
  • dependency project configurations (enabled, materialization)
  • var injection (+ dependencies)
  • dbt run --models ....
  • dbt run --models .... w/ dependencies
  • dbt run --dry (simple version, make sure it doesn't write anything)

Via #206

fail only within a single dependency chain.

currently, a single failure terminates the entire run. would be great (especially for production deployment) for failure to terminate only a single dependency chain (subsequent models obviously can't run) but not to terminate other dependency chains.

Project Feedback

Thoughts on the project structure:

I would make dbt/task/*.py just be one file: dbt/tasks.py. Generally you shouldn't make Python packages unless things start to get big. And the one-class-per-file is not really ever the way to split it up.


main.py:

You could have each task set up its subparser. I think your structure here came from mario so this one is my bad. But I wish I had done:

# tasks.py

class Clean:
    @classmethod
    def setup_subparser(cls, subs):
        sub = subs.add_parser('clean')
        return sub

tasks = [
    Clean,
    # ...
]

# main.py
import argparse

import tasks

# ...

p = argparse.ArgumentParser(prog='dbt: data build tool')
subs = p.add_subparsers()

for task in tasks.tasks:
    sub = task.setup_subparser(subs)
    sub.set_defaults(cls=task)  # bind the task class so main() can dispatch on parsed.cls

return connection information

dbt's connection information should be accessible to other code. this provides a convenient utility for credential management across all analytic code.

the current use case for this is using the redshift credentials within a jupyter notebook.
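A sketch of the kind of helper this could enable (the profiles.yml layout and function name are assumptions, not an existing API):

import psycopg2
import yaml

def get_connection(profiles_path='profiles.yml', target='dev'):
    with open(profiles_path) as f:
        cfg = yaml.safe_load(f)['run-targets'][target]   # layout assumed
    return psycopg2.connect(host=cfg['host'], port=cfg.get('port', 5439),
                            dbname=cfg['dbname'], user=cfg['user'], password=cfg['pass'])

# in a notebook:
# import pandas as pd
# df = pd.read_sql('select * from analytics.orders limit 10', get_connection())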

dbt should support other database adapters

We should consider how to support other adapters, including BigQuery, Presto, etc. While I don't think it's critical that we actually build support for other adapters, I do believe we should have a clear understanding of how we want those to be supported when someone does in fact want to add support for them.

dbt_project.yml models config not respected within dependencies

I've specified model configurations for the open source quickbooks models that I've created, but when I pull them into a new project and run a compile, all of them are compiled as views. Two models should be created as tables according to the config, but that is not happening. This forces me to copy the configuration into the config of the primary project repo.

dbt fails without create schema permissions

dbt is trying to run the following during dbt run:

cursor.execute('create schema if not exists "{}"'.format(target_cfg['schema']))

this is a convenient way to create the schema if it doesn't exist, but it requires the user to have create schema privileges. it's reasonable to assume the user will not have create schema privileges, and in this case dbt will just need to verify that the requested schema does, in fact, exist--and it will need to do so in a way that does not require admin privileges (aka does not use system tables).
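One possible fallback (illustrative; it leans on the ANSI information_schema rather than vendor system tables, which may or may not count as acceptable here): attempt the create, and if the user lacks the privilege, just confirm the schema already exists.

import psycopg2

def ensure_schema(cursor, schema):
    try:
        cursor.execute('create schema if not exists "{}"'.format(schema))
    except psycopg2.ProgrammingError:          # e.g. insufficient privileges
        cursor.connection.rollback()           # clear the aborted transaction
        cursor.execute('select 1 from information_schema.schemata where schema_name = %s',
                       (schema,))
        if cursor.fetchone() is None:
            raise RuntimeError('schema "{}" does not exist and cannot be created'.format(schema))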

dbt should have testing functionality to ensure common schema rules are valid

Rules like uniqueness and referential integrity are incredibly common when doing analysis on a schema. As a part of the testing process, dbt should be able to be fed a schema configuration that instructs it how to test that a schema is following these rules, and then be able to run these tests automatically.

There are three specific schema constraints that we should test for:

  1. Not null
  2. Uniqueness
  3. Referential integrity

Below I'll provide the most standardized SQL to use to test these constraints. All tests pass when the queries return 0.

not null

Could be declared like table.field is not null.

with t as (

  select [field] as f
  from [table]

)

select count(*) 
from t
where f is null

uniqueness

Could be declared like table.field is unique.

with t as (

  select [field] as f
  from [table]

)

select count(*) from (

  select f
  from t
  group by f
  having count(*) > 1

) dupes

referential integrity

Could be declared like parent table one joins to child table many on one.id = many.one_id.

with one as (

  select [pk field name] as id
  from [table name of parent table]

), many as (

  select [fk field name] as id
  from [table name of child table]

)

select count(*)
from many
where id not in (select id from one)
and id is not null
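A sketch of how these three checks might be rendered from a schema.yml declaration (the helpers below are illustrative, not dbt's actual test code); all three pass when the generated query returns 0.

def not_null_sql(table, field):
    return 'select count(*) from {} where {} is null'.format(table, field)

def unique_sql(table, field):
    return ('select count(*) from (select {f} from {t} '
            'group by {f} having count(*) > 1) dupes').format(t=table, f=field)

def relationship_sql(child, fk, parent, pk):
    return ('select count(*) from {child} '
            'where {fk} not in (select {pk} from {parent}) and {fk} is not null').format(
                child=child, fk=fk, parent=parent, pk=pk)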

dbt init

dbt init should make a shell project directory (w/ .gitignore and such)

this is increasingly important in the package mgmt world
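A minimal sketch of what the scaffold might contain (directory layout and file contents are illustrative):

import os

def init_project(name):
    for d in ('models', 'analysis', 'data', 'macros'):
        os.makedirs(os.path.join(name, d))
    with open(os.path.join(name, 'dbt_project.yml'), 'w') as f:
        f.write("name: '{}'\nversion: '1.0'\n".format(name))
    with open(os.path.join(name, '.gitignore'), 'w') as f:
        f.write('target/\nlogs/\n')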

Issue when using dbt on Windows

Followed all steps, and encountered this stack trace when running dbt compile:

Traceback (most recent call last):
  File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "c:\python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\Scripts\dbt.exe\__main__.py", line 9, in <module>
  File "c:\python27\lib\site-packages\dbt\main.py", line 37, in main
    parsed.cls(args=parsed, project=proj).run()
  File "c:\python27\lib\site-packages\dbt\task\compile.py", line 102, in run
    self.__compile(src_index)
  File "c:\python27\lib\site-packages\dbt\task\compile.py", line 92, in __compile
    template = jinja.get_template(f)
  File "c:\python27\lib\site-packages\jinja2\environment.py", line 812, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "c:\python27\lib\site-packages\jinja2\environment.py", line 774, in _load_template
    cache_key = self.loader.get_source(self, name)[1]
  File "c:\python27\lib\site-packages\jinja2\loaders.py", line 168, in get_source
    pieces = split_template_path(template)
  File "c:\python27\lib\site-packages\jinja2\loaders.py", line 31, in split_template_path
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: email\emails_denormalized.sql
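The template name in the exception uses a backslash (email\emails_denormalized.sql), but Jinja's loader expects forward-slash-separated template names, so one plausible fix is to normalize the separator before the lookup:

import os

def template_name(relative_path):
    return relative_path.replace(os.sep, '/')

# jinja.get_template(template_name(f)) instead of jinja.get_template(f)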

macros

  • macros should be includable in models
  • key issue is with namespacing
  • how do we use built-in, dependency, and top-level macros at the same time?

Document suggested interface here for review (cc @jthandy)
