Giter VIP home page Giter VIP logo

dataform-data-models's People

Contributors

adatzer avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dataform-data-models's Issues

Assertions and anonymous tracking

For use cases with anonymous tracking activated the assertions fail because of the check for cookie ids:

  OR domain_userid IS NULL
  OR domain_sessionid IS NULL
  OR domain_sessionidx IS NULL

Would it maybe make sense to have these seperated from the other data quality assertions? If I get dataform notifications everyday with "Run failed" because of these, I will probably ignore them after a couple of days and won't realize when there are other issues arising.

Solve for multiple 'first sessions' with the same start_tstamp per user

Issue

In users_this_run we join together users_aggregates and users_sessions_this_run on start_tstamp. By joining on start_tstamp we attempt to pull info from the first session per user:

  FROM {{.scratch_schema}}.users_aggregates{{.entropy}} AS b

  INNER JOIN {{.scratch_schema}}.users_sessions_this_run{{.entropy}} AS a
    ON a.domain_userid = b.domain_userid
    AND a.start_tstamp = b.start_tstamp

There are rare cases however where a user can have multiple sessions with the same start_tstamp, which also happens to be their first session. This can result in duplicate domain_userids in the users_this_run table.

Proposed Fix

  • In 03-users-sessions-this-run step, join in start_tstamp from users_userids_this_run giving user_start_tstamp.
  • In 04-users-aggs step, add first_domain_sessionid. Use some arbitary selection if multiple first sessions with same start_tstamp i.e. max(case when start_tstamp = user_start_tstamp then domain_sessionid end) as first_domain_sessionid
  • In 06-users step, join a to b using first_domain_sessionid = domain_session_id, rather than start_tstamp to get first session details.

Documentation and data catalogue

Dataform allows setting table and column descriptions, which are then synced to BigQuery and the Google Data Catalog. This would be a great addition to the models:

  type: "table",
    description: "This table joins orders information from Shopify & payment information from Stripe",
  columns: {
    order_date: "The date when a customer placed their order",
    id: "Order ID as defined by Shopify",
    order_status: "The status of an order e.g. sent, delivered",
    customer_id: "Unique customer ID",
    payment_status: "The status of a payment e.g. pending, paid",
    payment_method: "How the customer chose to pay",
    item_count: "The number of items the customer ordered",
    amount: "The amount the customer paid"
  },
    assertions: {
    uniqueKey: ["id"]
  }
}

https://docs.dataform.co/getting-started-tutorial/testing-and-documentation#documenting-your-dataset

Add source project and target project to tables

Some Dataform projects use a more complex setup with:

  1. Different source and target GCP projects
  2. Serparate dev and prod environments (https://docs.dataform.co/dataform-web/scheduling/environments#example-use-separate-databases-for-development-and-production-data)

As the tables are hard-coded in this model right now, these setups don't work out of the box right now. A simple fix could be to introduce variables for source and target:

const input_project = "snowplow-raw-302115";
const output_project = dataform.projectConfig.defaultDatabase;

Also maybe this could be solved in the SQLX files itself by using the dataform config functions?

config {
  database : "snowplow-raw-302115",
  schema: "rt_pipeline_prod1",
  name: "events",
  description: "Raw Snowplow events table"
}

base_duplicates_this_run does not exist?

In 99_base_cleanup there's a reference to base_duplicates_this_run, but I can't seem to find it.

I can see a reference to it in the BQ Readme, but the only code references to it in redshift.

Is it meant to be in bq?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.