snowplow-incubator / dataform-data-models Goto Github PK
View Code? Open in Web Editor NEWSnowplow Incubator project for Dataform SQL data models for working with Snowplow data. Supports BigQuery only
License: Apache License 2.0
Snowplow Incubator project for Dataform SQL data models for working with Snowplow data. Supports BigQuery only
License: Apache License 2.0
For use cases with anonymous tracking activated the assertions fail because of the check for cookie ids:
OR domain_userid IS NULL
OR domain_sessionid IS NULL
OR domain_sessionidx IS NULL
Would it maybe make sense to have these seperated from the other data quality assertions? If I get dataform notifications everyday with "Run failed" because of these, I will probably ignore them after a couple of days and won't realize when there are other issues arising.
In users_this_run
we join together users_aggregates
and users_sessions_this_run
on start_tstamp
. By joining on start_tstamp
we attempt to pull info from the first session per user:
FROM {{.scratch_schema}}.users_aggregates{{.entropy}} AS b
INNER JOIN {{.scratch_schema}}.users_sessions_this_run{{.entropy}} AS a
ON a.domain_userid = b.domain_userid
AND a.start_tstamp = b.start_tstamp
There are rare cases however where a user can have multiple sessions with the same start_tstamp
, which also happens to be their first session. This can result in duplicate domain_userid
s in the users_this_run
table.
03-users-sessions-this-run
step, join in start_tstamp
from users_userids_this_run
giving user_start_tstamp
.04-users-aggs
step, add first_domain_sessionid
. Use some arbitary selection if multiple first sessions with same start_tstamp
i.e. max(case when start_tstamp = user_start_tstamp then domain_sessionid end) as first_domain_sessionid
06-users
step, join a to b using first_domain_sessionid
= domain_session_id
, rather than start_tstamp
to get first session details.Mirror of snowplow/data-models#75
Dataform allows setting table and column descriptions, which are then synced to BigQuery and the Google Data Catalog. This would be a great addition to the models:
type: "table",
description: "This table joins orders information from Shopify & payment information from Stripe",
columns: {
order_date: "The date when a customer placed their order",
id: "Order ID as defined by Shopify",
order_status: "The status of an order e.g. sent, delivered",
customer_id: "Unique customer ID",
payment_status: "The status of a payment e.g. pending, paid",
payment_method: "How the customer chose to pay",
item_count: "The number of items the customer ordered",
amount: "The amount the customer paid"
},
assertions: {
uniqueKey: ["id"]
}
}
https://docs.dataform.co/getting-started-tutorial/testing-and-documentation#documenting-your-dataset
Some Dataform projects use a more complex setup with:
As the tables are hard-coded in this model right now, these setups don't work out of the box right now. A simple fix could be to introduce variables for source and target:
const input_project = "snowplow-raw-302115";
const output_project = dataform.projectConfig.defaultDatabase;
Also maybe this could be solved in the SQLX files itself by using the dataform config functions?
config {
database : "snowplow-raw-302115",
schema: "rt_pipeline_prod1",
name: "events",
description: "Raw Snowplow events table"
}
In 99_base_cleanup
there's a reference to base_duplicates_this_run
, but I can't seem to find it.
I can see a reference to it in the BQ Readme, but the only code references to it in redshift.
Is it meant to be in bq?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.