
TV MAZE TV SHOWS 📺

Curated TV show datasets from TV Maze

Project Background

Watch TV

TV show data is extracted from the TV Maze API for data analysts to identify and visualise trends in TV history, and for ML use cases such as generating shows with AI or recommending shows based on genres and ratings. Once the data is visualised, it can be used by TV show producers to create better content and by actors to pick shows that are likely to be popular. Network channel data can help determine which shows to air, and at what time, for the best chance of success.


Table of Contents
  1. Project Background
  2. Installation & Setup
  3. Folder Structure
  4. How it's done
  5. Discussion, Lessons Learnt and Future Improvements
  6. Contributors & Contributions

Installation & Setup

This project is built with Python version 3.9.

The following packages and tools are used in this project:

  • Jupyter Notebook
  • Airbyte
  • Snowflake
  • dbt
  • Airflow
  • Jinja
  • Docker
  • AWS Services (S3, ECR, EC2)
  • Diagrams

To run the pipeline locally, activate your virtual environment and install the dependencies:

    conda activate <your venv>
    pip install -r requirement.txt

Deploy Airbyte locally

  1. Clone the airbyte repo locally from https://github.com/airbytehq/airbyte.git

    git clone https://github.com/airbytehq/airbyte.git
    
  2. Make sure you have Docker Desktop running first

  3. Run the docker compose file

    cd airbyte 
    docker-compose up
    
  4. Copy /source-tvmaze into airbyte/airbyte-integrations/connectors

  5. Run docker-compose up again to update the airbyte connector

  6. Set up the custom TVMaze connector as an Airbyte Source

  7. Log in to Snowflake and set up the Snowflake user roles with the SQL files under data-integration/snowflake

  8. Set up Snowflake as the Destination and create a new connection between the Source and Destination in Airbyte

  9. Install dbt Core and its Snowflake adapter: `pip install dbt-snowflake`

  10. Under the dbt project folder, run `dbt init`. This creates the same folder structure as data-transformation/dbt

  11. Copy the models over from the tvmaze folder under data-transformation/dbt and run `dbt build`

Building a Docker image and pushing it to ECR

  1. Build the Docker image: `docker build -t tvmaze-dbt:dev -f docker/Dockerfile .`

  2. Tag the Docker image for AWS: `docker tag tvmaze-dbt:dev *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev`

  3. Log in to ECR with the AWS CLI: `aws ecr get-login-password --region *aws_region* | docker login --username AWS --password-stdin *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr`

  4. Push the image: `docker push *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev`

Exporting Airflow connections to JSON

  1. Open a terminal and run `docker exec -it *airflow container id* /bin/bash` to get a shell inside the Airflow container

  2. Locate the "dags" folder in Airflow with `ls` and change into it with `cd dags`

  3. Run `airflow connections export conn.json` and verify that the connections were successfully exported to "conn.json"

  4. To import the "conn.json" file into Airflow, use `airflow connections import conn.json`

For a detailed guide on how to replicate this project, please refer to this guide


Folder Structure

.
├── data-integration
│   ├── airbyte
│   └── snowflake
├── data-ochestration
│   └── airflow
│       └── dags
├── data-transformation
│   └── dbt
│       ├── docker
│       ├── logs
│       └── tvmaze
├── screenshots
├── solution_diagram.py
└── README.md

Screenshots for AWS ECR image and AWS setup

How it's Done

We start off by setting up Airbyte and Snowflake. Because the TVMaze API does not have a connector in Airbyte, we had to create a custom connector. The custom connector is built with incremental refresh on the backend, with the front-end interface specified in spec.yaml. We have also described the schema in tv_maze.json. This was done by sampling the JSON response to get its structure and data types; for the data types, however, we went with STRINGS ALL THE WAY! We then build the Docker image, upload it to AWS ECR and run the instance on EC2. In Snowflake, we create the database TVSHOW to host all the data we will be ingesting, and we also create roles for dbt and Airbyte. Once access is sorted, we create a connection between the custom TVMaze API as the Source and Snowflake as the Destination. The source data is then ingested into Snowflake's internal (S3-backed) storage, where we transform the data with models via dbt. All of this is orchestrated with Airflow, with notifications sent to our project Slack channel.
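For reference, the sketch below shows what an incremental HTTP stream for a connector like this can look like with the Airbyte CDK. It is a minimal, illustrative sketch only: the class name, endpoint path and cursor field are assumptions, not the exact code in /source-tvmaze.

    # Minimal sketch of an incremental Airbyte CDK HTTP stream.
    # Class name, endpoint and cursor field are illustrative assumptions.
    from typing import Any, Iterable, Mapping, MutableMapping, Optional

    import requests
    from airbyte_cdk.sources.streams.http import HttpStream


    class TvmazeEpisodes(HttpStream):
        url_base = "https://api.tvmaze.com/"
        primary_key = "id"
        cursor_field = "airdate"  # assumed cursor for incremental syncs

        def path(self, **kwargs) -> str:
            return "schedule/web"  # hypothetical endpoint

        def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
            return None  # single request per sync in this sketch

        def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
            # The API already returns a JSON list of objects, so no extra
            # square-bracket indexing is needed (see the lessons learnt below).
            yield from response.json()

        def get_updated_state(
            self,
            current_stream_state: MutableMapping[str, Any],
            latest_record: Mapping[str, Any],
        ) -> Mapping[str, Any]:
            # Keep the highest airdate seen so far as the incremental cursor.
            latest = latest_record.get(self.cursor_field, "")
            current = current_stream_state.get(self.cursor_field, "")
            return {self.cursor_field: max(latest, current)}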


Solution Architecture Diagram

Solution Architecture Diagram

Created using Diagrams
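For a rough idea of how a script like solution_diagram.py can be put together with the Diagrams library, here is a small sketch; the node classes and layout are assumptions (it also requires the diagrams package and Graphviz), not the project's exact script.

    # Rough sketch of a solution diagram built with the Diagrams library
    # (node choices are assumptions, not the exact solution_diagram.py).
    from diagrams import Cluster, Diagram
    from diagrams.aws.compute import EC2, ECS
    from diagrams.aws.general import Users
    from diagrams.onprem.workflow import Airflow
    from diagrams.saas.analytics import Snowflake

    with Diagram("TVMaze Pipeline", filename="solution_diagram", show=False):
        airflow = Airflow("Airflow (local)")

        with Cluster("AWS"):
            airbyte = EC2("Airbyte on EC2")
            dbt_task = ECS("dbt task (image from ECR)")

        warehouse = Snowflake("TVSHOW database")
        analysts = Users("Data analysts")

        airflow >> airbyte >> warehouse
        airflow >> dbt_task >> warehouse >> analysts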

View all tables

SOURCE-MODEL-EXPOSE Diagram

Created using DrawIO

SME detail Diagram

Created using DrawIO


Data Integration

For the data integration, Airbyte was chosen as the tool for extract and load, with Snowflake as our data warehouse. The custom Airbyte connector is a source connector used to get TV episode schedule data from the tvmaze.com API, and it was developed for incremental extracts. It was tested by adding it as a source in the Airbyte UI and using it in a connection that successfully extracted from the TVMaze API and loaded the raw data into a Snowflake destination.
The next step was to create the data warehouse in Snowflake and apply all the corresponding user permissions for Airbyte and dbt. The full SQL query can be viewed here

Snowflake Permissions

TVSHOW Snowflake Permissions

Once that is set up, we test the user permissions by logging into Snowflake and verifying access. Returning to Airbyte, we set up Snowflake as the Destination with the Airbyte user credentials created earlier in Snowflake, then connect the Source (the custom TVMaze connector) and the Destination (Snowflake).
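As a quick sanity check of those credentials outside Airbyte, a short script with the snowflake-connector-python package can confirm which role, database and warehouse a user resolves to (the connection values below are placeholders):

    # Verify that a service user (e.g. the Airbyte or dbt user) can reach the
    # TVSHOW database with its assigned role. Placeholder credentials only;
    # assumes the snowflake-connector-python package is installed.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<airbyte_or_dbt_user>",
        password="<password>",
        role="<assigned_role>",
        warehouse="<warehouse>",
        database="TVSHOW",
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_ROLE(), CURRENT_DATABASE(), CURRENT_WAREHOUSE()")
        print(cur.fetchone())
    finally:
        conn.close()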

Airbyte Interface

Airbyte Successful Sync

Upon a successful sync, the responses from the API calls are ingested into Snowflake's storage on the backend.

Snowflake TVSHOW

TVSHOW database structure


Data Transformation

All transformations are done using dbt. We have created 3 staging tables and 5 serving tables. Models are written in SQL and can be found in the dbt/tvmaze/models folder. We built the dbt Docker image, uploaded it to ECR and ran the instance on EC2. All credentials are hosted on S3 in an .env file.

Successful dbt run

Successful dbt run

We have used the following transformation techniques:

  • Renaming - renaming certain column headers to avoid confusion
  • Data type casting - date strings to date types, IDs to integers
  • Joins - joins across multiple tables for the staging and serving tables
  • Aggregation functions - avg() and rank() for ratings, count() for the number of shows
  • Data cleaning - nested replace() for genres and days
  • Filtering - where clauses used to get the show types with the best ratings
  • Grouping - group by show_type, network_name
  • Sorting - order by show runtime


Data Orchestration

Our ELTL process is orchestrated with a local instance of Airflow. Our DAGs include a SlackWebhookOperator positioned at the start and end of the pipeline. For the AirbyteTriggerSyncOperator, we had to set up two connection ids for the task - `airflow_airbyte_conn_id` and `airbyte_tvmazeapisf_conn_id` - which handle the ingestion of the data from the TVMaze API into our data warehouse on Snowflake. We then set up the ECSOperator to trigger the dbt image hosted on ECR; for this, we needed the `aws-login-for-ecs-task` connection.
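The sketch below shows one way these operators can be wired together in a DAG. Import paths assume Airflow 2.x provider packages; the Slack connection id, ECS cluster and task-definition names are placeholders, not the project's exact DAG.

    # Condensed DAG sketch: Slack notification -> Airbyte sync -> dbt on ECS -> Slack.
    # Cluster/task-definition names and the Slack connection id are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
    from airflow.providers.amazon.aws.operators.ecs import ECSOperator
    from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

    with DAG(
        dag_id="tvmaze_elt",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        notify_start = SlackWebhookOperator(
            task_id="notify_start",
            slack_webhook_conn_id="slack_webhook",  # older Slack providers use http_conn_id
            message="TVMaze pipeline started",
        )

        airbyte_sync = AirbyteTriggerSyncOperator(
            task_id="airbyte_tvmaze_to_snowflake",
            airbyte_conn_id="airflow_airbyte_conn_id",
            connection_id="airbyte_tvmazeapisf_conn_id",  # the Airbyte connection id
        )

        dbt_run = ECSOperator(
            task_id="dbt_run_on_ecs",
            aws_conn_id="aws-login-for-ecs-task",
            cluster="tvmaze-dbt-cluster",        # placeholder
            task_definition="tvmaze-dbt-task",   # placeholder
            launch_type="EC2",
            overrides={"containerOverrides": []},
        )

        notify_end = SlackWebhookOperator(
            task_id="notify_end",
            slack_webhook_conn_id="slack_webhook",
            message="TVMaze pipeline finished",
        )

        notify_start >> airbyte_sync >> dbt_run >> notify_end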

aws-login-for-ecs-task

conn.json for ECSOperator

Slack notification

Successfully Integrated Slack Notification on Project channel

Discussion, Lessons Learnt and Future Improvements

In the current iteration, the Airbyte and dbt Docker images are built and pushed to ECR, and a task instance is created to run the pipeline on EC2, with Airflow running locally on Luke's device.
Key Learnings and Room for Improvements:

  • How to pronounce the word Genre - sounds like "Zhon-Ruh"
  • When implementing the Airbyte connector, we learned that in the "parse_response" function the original code from class referenced the response with square brackets; ours did not need that, as the response is already a JSON object.
  • When implementing schemas for a custom Airbyte connector, each object requires a type (object), and each object has its own properties, which contain the "columns" with their own types and properties (see the sketch after this list).
  • When starting an Airbyte EC2 instance after a non-graceful shutdown, use `sudo service docker status` to check whether the Docker daemon is running before running `docker-compose up -d`
  • To avoid the above, run `sudo systemctl enable docker` to have the Docker daemon start automatically on boot, and while your container is running, run `docker update --restart unless-stopped $(docker ps -q)` to ensure your Airbyte containers restart automatically on the next reboot.
  • "When it's curly braces it's an Object, if it's a square bracket it's a List" - Luke, 2022
  • For Snowflake permissions, being the owner of the database does not automatically grant access rights (see the Snowflake permission screenshot).
  • With Snowflake, we do not need to push data onto S3 first to get it into Snowflake; data can be loaded directly into Snowflake.
  • When creating the custom connector, we discovered that a Python class name cannot be too long. Originally we had TV_MAZE_API as the class and that did not work; we had to change it to just TVMAZE.
  • When experimenting and working with Airflow, definitely export all connections first and delete everything that is not the DAG folder to save setup time.
  • Create tests for the Airbyte custom connector and make use of dbt macros.
  • Improve the ETL notification messages: make them more distinctive and cover failed pipeline runs.
  • As new Airbyte connectors are regularly being added, we did not realise that there was already a TVMaze connector in Alpha - we built one anyway!
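To illustrate the schema nesting mentioned above, here is a hypothetical fragment in the shape of the connector schema (not the actual tv_maze.json), written as a Python dict: every nested object declares its type as object plus its own properties, and each leaf "column" carries its own type, which in our case is string across the board.

    # Hypothetical schema fragment (not the actual tv_maze.json): nested objects
    # each declare "type": "object" and their own "properties"; leaves are strings.
    episode_schema = {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "name": {"type": "string"},
            "airdate": {"type": "string"},
            "rating": {
                "type": "object",
                "properties": {
                    "average": {"type": "string"},
                },
            },
            "network": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "country": {
                        "type": "object",
                        "properties": {"code": {"type": "string"}},
                    },
                },
            },
        },
    }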


Contributors & Contributions

All participants in this project are professionals enrolled in the Data Engineering Camp.

Name | GitHub | Contributions
Luke Huntley | LuckyLukeAtGitHub | Airbyte, Docker image build & upload, AWS (IAM, S3, EC2, ECR), Airflow
Helen Amin | Helenamin | dbt transformation models & tests, Airflow
Fang Xuan Foo | foofx88 | Snowflake, documentation, Airflow

All team members took part in the development and cross-checking, and supplied content for the project documentation. This was the second project, for the ETL part of the course in the Data Engineering Camp.

GO ⬆
