
data-models's Introduction



As of January 8, 2024, Snowplow is introducing the Snowplow Limited Use License Agreement, and we will be releasing new versions of our core behavioral data pipeline technology under this license.

Our mission to empower everyone to own their first-party customer behavioral data remains the same. We value all of our users and remain dedicated to helping our community use Snowplow in the optimal capacity that fits their business goals and needs.

We reflect on our Snowplow origins and provide more information about these changes in our blog post here → https://eu1.hubs.ly/H06QJZw0


Overview

Snowplow is a developer-first engine for collecting behavioral data.

Thousands of organizations around the world generate, enhance, and model behavioral data with Snowplow to fuel advanced analytics, AI/ML initiatives, or composable CDPs.


Why Snowplow?

  • 🏔️ Rock solid architecture capable of processing billions of events per day.
  • 🛠️ Over 20 SDKs to collect data from web, mobile, server-side, and other sources.
  • ✅ A unique approach based on schemas and validation ensures your data is as clean as possible.
  • 🪄 Over 15 enrichments to get the most out of your data.
  • 🏭 Send data to popular warehouses and streams — Snowplow fits nicely within the Modern Data Stack.

➡ Where to start? ⬅️

  • Snowplow Community Edition: equips you with everything you need to start creating behavioral data in a high-fidelity, machine-readable way. Head over to the Quick Start Guide to set things up.
  • Snowplow Behavioral Data Platform: looking for an enterprise solution with a console, APIs, data governance, and workflow tooling? The Behavioral Data Platform is our managed service that runs in your AWS, Azure or GCP cloud. Book a demo.

The documentation is a great place to learn more, especially:

  • Tracking design — discover how to approach creating your data the Snowplow way.
  • Pipelines — understand what’s under the hood of Snowplow.

Would you rather dive into the code? Then you are already in the right place!


Snowplow technology 101

Snowplow architecture

The repository structure follows the conceptual architecture of Snowplow, which consists of six loosely-coupled sub-systems connected by five standardized data protocols/formats.

To briefly explain these six sub-systems:

  • Trackers fire Snowplow events. Currently we have 15 trackers, covering web, mobile, desktop, server and IoT
  • Collector receives Snowplow events from trackers. Currently we have one official collector implementation with different sinks: Amazon Kinesis, Google PubSub, Amazon SQS, Apache Kafka and NSQ
  • Enrich cleans up the raw Snowplow events, enriches them and puts them into storage. Currently we have several implementations, built for different environments (GCP, AWS, Apache Kafka) and one core library
  • Storage is where the Snowplow events live. Currently we store the Snowplow events in a flat file structure on S3, and in the Redshift, Postgres, Snowflake and BigQuery databases
  • Data modeling is where event-level data is joined with other data sets and aggregated into smaller data sets, and business logic is applied. This produces a clean set of tables which make it easier to perform analysis on the data. We officially support data models for Redshift, Snowflake and BigQuery (a minimal SQL sketch of this step follows this list).
  • Analytics are performed on the Snowplow events or on the aggregate tables.
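
To make the data modeling step concrete, here is a minimal sketch of the kind of SQL a model might run, assuming an atomic events table with standard Snowplow fields. The derived table name and the aggregation itself are illustrative, not the official web model.

```sql
-- Minimal, illustrative sketch of the data modeling step: aggregate
-- event-level data from atomic.events into a smaller derived table.
-- The derived.page_views name and the chosen aggregation are examples only.
CREATE TABLE derived.page_views AS
SELECT
    domain_sessionid    AS session_id,
    domain_userid       AS user_id,
    page_urlpath        AS page_path,
    MIN(derived_tstamp) AS first_event_tstamp,
    MAX(derived_tstamp) AS last_event_tstamp,
    COUNT(*)            AS event_count
FROM atomic.events
WHERE event_name = 'page_view'
GROUP BY 1, 2, 3;
```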

For more information on the current Snowplow architecture, please see the Technical architecture.


About this repository

This repository is an umbrella repository for all loosely-coupled Snowplow components and is updated on each component release.

Since June 2020, all components have been extracted into their dedicated repositories (more info here) and this repository serves as an entry point for Snowplow users and as a historical artifact.

Components that have been extracted to their own repository are still here as git submodules.

Trackers

A full list of supported trackers can be found on our documentation site. Popular trackers and use cases include:

  • Web: JavaScript, AMP
  • Mobile: Android, iOS, React Native, Flutter
  • Gaming: Unity, C++, Lua
  • TV: Roku, iOS, Android, React Native
  • Desktop & Server: Command line, .NET, Go, Java, Node.js, PHP, Python, Ruby, Scala, C++, Rust, Lua
Loaders

Iglu

Data modeling

Web

Mobile

Media

Retail

Testing

Parsing enriched event


Community

We want to make it super easy for Snowplow users and contributors to talk to us and connect with one another, to share ideas, solve problems and help make Snowplow awesome. Join the conversation:

  • Meetups. Don’t miss your chance to talk to us in person. We are often on the move with meetups in Amsterdam, Berlin, Boston, London, and more.
  • Discourse. Our forum for all Snowplow users: engineers setting up Snowplow, data modelers structuring the data, and data consumers building insights. You can find guides, recipes, questions and answers from Snowplow users and the Snowplow team. All questions and contributions are welcome!
  • Twitter. Follow @Snowplow for official news and @SnowplowLabs for engineering-heavy conversations and release announcements.
  • GitHub. If you spot a bug, please raise an issue in the GitHub repository of the component in question. Likewise, if you have developed a cool new feature or an improvement, please open a pull request; we’ll be glad to integrate it into the codebase! For brainstorming a potential new feature, Discourse is the best place to start.
  • Email. If you want to talk to Snowplow directly, email is the easiest way. Get in touch at [email protected].

Copyright and license

Snowplow is copyright 2012-2023 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

data-models's People

Contributors

adatzer, agnessnowplow, bill-warner, colmsnowplow, rlh1994, stanch


data-models's Issues

Resolve issues in testing flow

Great Expectations' `datasource new` command won't work if the `config_variables.yml` file doesn't exist, since templates are referenced elsewhere. Add a `config_variables.yml` file with defaults to ensure the testing workflow can be set up easily.

Additionally, the .test/share directory should be ignored.

Redshift: Rename metadata tables and columns

The original metadata table design and naming follow a model structure that we have since diverged from.

The table should be renamed to `datamodel_metadata`.

Instead of `module_name` and `step_name` we should have `model` and `module`. Values of `model` should all be `web` for now (this can change when the base module is generalised to other models). Values of `module` should be the module name.
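
A hypothetical sketch of what the renamed table could look like; only the table name and the `model`/`module` columns come from this issue, while the remaining columns and the scratch schema are illustrative.

```sql
-- Hypothetical sketch of the renamed metadata table. Only the table name and
-- the model/module columns come from this issue; the other columns and the
-- scratch schema are illustrative.
CREATE TABLE IF NOT EXISTS scratch.datamodel_metadata (
    run_id    TIMESTAMP,
    model     VARCHAR(64),  -- e.g. 'web' for now
    module    VARCHAR(64),  -- the module name, e.g. 'page_views'
    run_start TIMESTAMP,
    run_end   TIMESTAMP
);
```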

Redshift: Add deduplication

A rare edge case can occur with the current 'exclude all' duplicates strategy: an event is processed, and a subsequent run contains duplicates of that event along with other, legitimate events. For example:

A run contains a page view with event id 123 - this event has page view in session index = 1.

A subsequent run contains a duplicate of that event, along with another, legitimate page view event in the same session. The data from that session in this run will be:

page view event - event ID: 123
page view event - event ID: 123
page view event - event ID: 456

In this second run, the already processed event 123 will be removed by deduplication, and the new one 456 will be assigned page view in session index of 1.

Page view 123 won't be removed from the table, so we will have a session with two page views that both have a page view in session index of 1.


We might solve this by using session_id to update the table, but this feels somewhat fragile.

We can also solve it by implementing better deduplication logic - to keep the first event_id (by collector_tstamp).

The tricky part is that ideally we only keep the first event IF the collector_tstamp is not duplicated as well, and remove both otherwise (to avoid a cartesian join). However, if we remove both we still have a chance of hitting this issue.

One way out of that is to implement a mechanism to apply the incremental logic to all relevant atomic tables (thereby creating deduplicated _staged tables for every join that might be involved in a customisation).
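
A sketch of what the proposed "keep the first event, unless its collector_tstamp is also duplicated" logic could look like; the staged table name is a placeholder and this is not the implemented model.

```sql
-- Illustrative sketch: keep the earliest row per event_id, but drop the event
-- entirely when that earliest collector_tstamp is itself duplicated (to avoid
-- a cartesian join downstream). scratch.events_staged is a placeholder name.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY collector_tstamp) AS event_rank,
        COUNT(*)     OVER (PARTITION BY event_id, collector_tstamp)         AS tstamp_dupes
    FROM scratch.events_staged
)
SELECT *
FROM ranked
WHERE event_rank = 1     -- earliest row for this event_id
  AND tstamp_dupes = 1;  -- and its collector_tstamp is not duplicated
```
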

Add top-level CHANGELOG

Having a changelog per-model makes sense. However, changes to scripts and other non-model-specific resources shouldn't be tied to a model's changelog. So, we should add a global one to track those.

I think the best structure is as follows:

  • Any commits that aren't specific to a particular model are listed with issue reference.
  • If those commits are part of a specific web release that release is referenced with a reference to the changelog.
  • If those commits aren't tied to a release, they are listed under a maintenance 'release':
maintenance (2021-02-04)
---------------------------
Patch broken link in README (#127)

snowflake/web/1.0.0 (2021-02-01)
------------------------------------------
Snowflake web model release ([changelog](web/snowflake/CHANGELOG))
Fix bug in run_config.sh script (#124)

redshift/web/1.1.0 (2020-12-01)
---------------------------------------
Update main readme (#123)
Redshift web model release ([changelog](web/redshift/CHANGELOG))

Documentation loop

In the documentation you say "Documentation for the data models can be found on our documentation site", and this link sends me to a website where there is a link sending me back to the readme. Documentation Inception :D

Data modeling: are the filters applied in the performance timing context too strict?

The filters applied to the dom_interactive, dom_content_loaded_event_start and dom_content_loaded_event_end fields in the web model result in more page views than expected showing NULL for total_time_in_ms.

The question is, are these filters too strict?

  • If they are then the solution is simple, all we need to do is relax the filters.
  • If the results are unexpected, then maybe this needs to be looked into further.
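
For context, a hypothetical illustration of the kind of sanity filter in question; the actual conditions in the web model may differ, and the table name below is a placeholder.

```sql
-- Hypothetical illustration only: if any timing field fails a sanity check,
-- total_time_in_ms comes out as NULL for that page view.
SELECT
    page_view_id,
    CASE
        WHEN dom_interactive                > 0
         AND dom_content_loaded_event_start >= dom_interactive
         AND dom_content_loaded_event_end   >= dom_content_loaded_event_start
        THEN load_event_end - navigation_start
    END AS total_time_in_ms
FROM scratch.web_events_time;  -- placeholder table name
```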

Improve performance with MERGE upserts

Using MERGE instead of the DELETE FROM / INSERT INTO pattern might well prove more performant and cost-efficient, particularly in the case of BigQuery.

We should test this, and implement MERGE if it proves to be the better option (without affecting the logic, obviously).
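
A sketch of the idea in BigQuery syntax; the table and column names are illustrative, `INSERT ROW` assumes the source and target share the same schema, and this is not the implemented model.

```sql
-- Illustrative sketch: replace the DELETE FROM ... / INSERT INTO ... pair
-- with a single MERGE keyed on the table's unique identifier.
MERGE derived.page_views AS t
USING scratch.page_views_this_run AS s
ON t.page_view_id = s.page_view_id
WHEN MATCHED THEN
  UPDATE SET engaged_time_in_s = s.engaged_time_in_s,
             end_tstamp        = s.end_tstamp
WHEN NOT MATCHED THEN
  INSERT ROW;
```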

Redshift: Fix table definitions

Table definitions for the Redshift model define CHAR/VARCHAR columns inconsistently throughout the model. They should be both internally consistent and consistent with external definitions.

Switch off wiki

docs.snowplowanalytics.com should be the home for all the docs...

Mobile: Incorrect screen_id from screen view context

In some instances the screen_id generated by the screen view context during a screen view event does not equal the screen_view_id, which is incorrect.

This is thought to be a state management issue within the mobile trackers rather than within the data modelling. As such, it should be resolved once a fix is applied to the tracker.

Redshift: Add config

BigQuery will introduce a config and script to run multiple playbooks. Redshift should get similar treatment once BQ is released (as should Snowflake when that gets released).

Fix readme formatting

Playbook readmes have mangled formatting for the lists of playbook variables. A newline is needed after each entry, e.g.:

`:input_schema:`       name of atomic schema

`:scratch_schema:`     name of scratch schema  

`:output_schema:`      name of derived schema

`:entropy:`            string to append to all tables, to test without affecting prod tables (eg. `_test` produces tables like `events_staged_test`). Must match entropy value used for all other modules in a given run. Populate with an empty string if no entropy value is needed.

`:stage_next:`         update staging tables - set to true if running the next module. If true, make sure that the next module includes a 'complete' step.

`:start_date:`         Start date, used to seed manifest.

`:lookback_window:`    Defaults to 6. Period of time (in hours) to look before the latest event in manifest - to account for late arriving data, which comes out of order.

`:days_late_allowed:`  Defaults to 3.  Period of time (in days) for which we should include late data. If the difference between collector tstamps for the session start and new event is greater than this value, data for that session will not be processed.

`:update_cadence:`     Defaults to 7. Period of time (in days) in the future (from the latest event in manifest) to look for new events.

`:session_lookback:`   Defaults to 365. Period of time (in days) to limit scan on session manifest. Exists to improve performance of model when we have a lot of sessions. Should be set to as large a number as practical.

Redshift: Patch readmes

Line 17 of web/v1/README.md

More detail on each module can be found in the 'Modules detail' section below.

Should point to the web/v1/sql-runner/README.

BigQuery: Provide handler for schema evolution for custom events and contexts

New schema versions create new columns in BigQuery, which need to be coalesced; this also poses the problem that some versions might not exist in the database.

We solve this issue for the core enrichment contexts in #52 by using a stored procedure to extract the relevant data into a scratch table.

For the sake of solving the problem at hand, the initial implementation only handles top-level fields which aren't arrays or structs, and only uses the first item in the array.

This or a similar pattern could be amended to handle those more complicated cases, and offer a generic means of handling schema evolution for any custom BQ column.

The trickiest part is that a changing datatype in a struct or array of structs makes the column incompatible with its previous form. If we solve this problem, we can solve the single biggest pain point of working with Snowplow data in BigQuery.
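
A sketch of the coalescing pattern for a hypothetical `com.acme` checkout context, using the BigQuery loader's naming convention for versioned context columns and only the first item in each array, as described above. Handling versions that are missing from the table entirely is the part that still needs a dynamic approach such as the stored procedure from #52.

```sql
-- Illustrative sketch: coalesce across versioned context columns, preferring
-- the newer version and taking the first array element only. The com.acme
-- checkout context and its order_total field are hypothetical.
SELECT
    event_id,
    COALESCE(
        contexts_com_acme_checkout_1_0_1[SAFE_OFFSET(0)].order_total,
        contexts_com_acme_checkout_1_0_0[SAFE_OFFSET(0)].order_total
    ) AS order_total
FROM atomic.events;
```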
