Curated TV Show Datasets from TV Maze
TV show data extracted from the TV Maze API, for data analysts to identify and visualize trends in TV history, and for ML use cases such as generating shows with AI or suggesting shows based on genres and ratings. Once the data is visualized, it can be used by TV show producers to produce better content, and by prospective actors to star in shows that are likely to be popular. Network channel data can help determine which shows to air, and at what time, for a successful broadcast.
This project is built with Python version 3.9.
The following packages and tools are used in this project:
- Jupyter Notebook
- Airbyte
- Snowflake
- dbt
- Airflow
- Jinja
- Docker
- AWS Services (S3, ECR, EC2)
- Diagrams
To run the pipeline locally:

1. Activate your Conda environment and install the Python requirements:

   ```shell
   conda activate <your-venv>
   pip install -r requirements.txt
   ```

2. Clone the Airbyte repo locally:

   ```shell
   git clone https://github.com/airbytehq/airbyte.git
   ```

3. Make sure you have Docker Desktop running first.

4. Run the Docker Compose file:

   ```shell
   cd airbyte
   docker-compose up
   ```

5. Copy `/source-tvmaze` into `airbyte/airbyte-integrations/connectors`.

6. Run `docker-compose up` again to update the Airbyte connector.

7. Set up the custom TVMaze connector as an Airbyte Source.

8. Log in to Snowflake and set up the Snowflake user roles with the SQL files within `data-integration/snowflake`.

9. Set up Snowflake as the Destination and create a new connection between Source and Destination in Airbyte.

10. Install dbt-core and its Snowflake-specific package:

    ```shell
    pip install dbt-snowflake
    ```

11. Under the dbt project folder, run `dbt init`. That will create the same folder structure as `data-transformation/dbt`.

12. Copy the models over from the `tvmaze` folder under `data-transformation/dbt` and run:

    ```shell
    dbt build
    ```

13. Build the Docker image:

    ```shell
    docker build -t tvmaze-dbt:dev -f docker/Dockerfile .
    ```

14. Tag the Docker image for AWS:

    ```shell
    docker tag tvmaze-dbt:dev *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev
    ```

15. Log in to AWS with the CLI:

    ```shell
    aws ecr get-login-password --region *aws_region* | docker login --username AWS --password-stdin *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr
    ```

16. Push the image:

    ```shell
    docker push *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev
    ```

17. Open up a terminal and exec into the Airflow container:

    ```shell
    docker exec -it *replace this with the container id* /bin/sh
    ```

18. Identify where the `dags` folder is in Airflow with `ls`, and change directory into the folder with `cd dags`.

19. Export the Airflow connections, and verify that they are successfully exported to `conn.json`:

    ```shell
    airflow connections export conn.json
    ```

20. To import the `conn.json` file into Airflow, use:

    ```shell
    airflow connections import conn.json
    ```
For a detailed guide on how to replicate this project, please refer to this guide.
```
.
├── data-integration
│   ├── airbyte
│   └── snowflake
├── data-ochestration
│   └── airflow
│       └── dags
├── data-transformation
│   └── dbt
│       ├── docker
│       ├── logs
│       └── tvmaze
├── screenshots
├── solution_diagram.py
└── README.md
```
Screenshots for AWS ECR image and AWS setup
We start off by setting up Airbyte and Snowflake. As we chose the TVMaze API, which does not have a connector in Airbyte, we had to create a custom connector. The custom connector is built with incremental refresh on the backend, and the front-end interface is specified in `spec.yaml`. We have also described the schema in `tv_maze.json`. This is done by sampling the JSON response to get the response structure and datatypes; for the datatypes, however, we went with STRINGS ALL THE WAY! We then build the Docker image, upload it to AWS ECR, and run the instance with EC2. In Snowflake, we create the database TVSHOW to host all the data we will be ingesting. We've also created roles for dbt and Airbyte. Once access is sorted, we created a connection between the custom TVMaze API as the Source and Snowflake as the Destination. The source data is then ingested into Snowflake's S3-backed storage, where we transform the data with models via dbt. These steps are all orchestrated with Airflow, with notifications sent to our project Slack channel.
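The schema-sampling step described above can be sketched roughly as follows. This is a minimal illustration only: the actual `tv_maze.json` was written by hand from a sampled response, and the field names in the sample record here are assumptions.

```python
import json

def infer_string_schema(record: dict) -> dict:
    """Build a JSON-schema-style properties map, typing every leaf as string."""
    props = {}
    for key, value in record.items():
        if isinstance(value, dict):      # curly braces -> nested object
            props[key] = {"type": "object", "properties": infer_string_schema(value)}
        elif isinstance(value, list):    # square brackets -> array
            props[key] = {"type": "array"}
        else:                            # everything else: strings all the way
            props[key] = {"type": "string"}
    return props

# Abridged, hypothetical TVMaze episode response
sample = json.loads('{"id": 1, "name": "Pilot", "show": {"genres": ["Drama"]}}')
schema = {"type": "object", "properties": infer_string_schema(sample)}
print(schema["properties"]["id"])  # {'type': 'string'}
```

Typing everything as a string trades type safety at ingestion for robustness against surprises in the API payload; the casting is deferred to the dbt models downstream.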
Created using Diagrams
Created using DrawIO
For the data integration, Airbyte was chosen as the tool for performing extract and load, with Snowflake as our data warehouse. The custom Airbyte connector is a source connector used to get TV episode schedule data from the tvmaze.com API. The TVMaze connector was developed for incremental extracts. Testing was done by adding the connector as a source in the Airbyte UI and using it in a connection, successfully extracting from the TVMaze API and loading the raw data into a Snowflake destination.
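In general terms, an incremental extract means remembering a cursor (e.g. the latest airdate already synced) and only emitting newer records on the next run. The sketch below shows that idea with hypothetical state handling; the real connector implements this through Airbyte's incremental stream interface, not this function.

```python
def read_incremental(records, state):
    """Yield only records newer than the saved cursor, then advance the cursor."""
    cursor = state.get("airdate", "")          # hypothetical cursor field
    for rec in sorted(records, key=lambda r: r["airdate"]):
        if rec["airdate"] > cursor:
            yield rec
            state["airdate"] = rec["airdate"]  # advance cursor as we emit

state = {"airdate": "2022-10-01"}
new = list(read_incremental(
    [{"id": 1, "airdate": "2022-10-01"}, {"id": 2, "airdate": "2022-10-02"}],
    state,
))
print(new)    # [{'id': 2, 'airdate': '2022-10-02'}]
print(state)  # {'airdate': '2022-10-02'}
```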
The next step was to create the data warehouse in Snowflake and apply all corresponding user permissions for Airbyte and dbt. The full SQL query can be viewed here.
Once that is set up, we test the user permissions by logging into Snowflake and verifying access. Returning to Airbyte, we set up Snowflake as the Destination with the Airbyte user credentials created earlier in Snowflake, then connect the Source (TVMaze) and Destination (Snowflake).
Airbyte Successful Sync

Upon a successful sync, responses from the API call are ingested into Snowflake's storage on the backend.
TVSHOW database structure

All transformations are done using dbt. We have created 3 staging tables and 5 serving tables. Models are written in SQL and can be found in the `dbt/tvmaze/models` folder. We've built the dbt Docker image, uploaded it to ECR, and run the instance with EC2. All credentials are hosted on S3 in an `.env` file.
Successful dbt run

We have used the following transformation techniques:
- Renaming - renaming certain column headers to avoid confusion
- Data type casting - date string to date type, IDs into integers
- Joins - Joins on multiple tables for the staging and serving tables
- Aggregation function - avg() for ratings, rank() for ratings, count() number of shows
- Data Cleaning - Nested replace() for Genres and Days
- Filtering - where clause used to get show types with the best rating
- Grouping - group by show_type, network_name
- Sorting - Order By show runtime
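The actual transformations live in the SQL dbt models linked above; the Python sketch below only illustrates a few of the same techniques (casting, nested replace, aggregation) on toy rows, and all column names and values here are made up.

```python
from datetime import date
from statistics import mean

# Toy rows mimicking the raw all-string ingested data (hypothetical columns)
raw = [
    {"show_id": "1", "show_type": "Scripted", "airdate": "2022-10-01",
     "rating": "8.1", "genres": "['Drama','Crime']"},
    {"show_id": "2", "show_type": "Scripted", "airdate": "2022-10-02",
     "rating": "6.3", "genres": "['Comedy']"},
]

def clean(row):
    return {
        "show_id": int(row["show_id"]),                 # casting: ID string -> integer
        "show_type": row["show_type"],
        "airdate": date.fromisoformat(row["airdate"]),  # casting: date string -> date
        "rating": float(row["rating"]),
        # data cleaning: nested replace() to strip list punctuation from genres
        "genres": row["genres"].replace("[", "").replace("]", "").replace("'", ""),
    }

staged = [clean(r) for r in raw]

# aggregation: avg() rating grouped by show_type
by_type = {}
for r in staged:
    by_type.setdefault(r["show_type"], []).append(r["rating"])
avg_rating = {t: round(mean(v), 2) for t, v in by_type.items()}
print(avg_rating)  # {'Scripted': 7.2}
```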
Our ELT process is orchestrated with a local version of Airflow. Our DAGs include a SlackWebhookOperator positioned at the start and end of the pipeline. For the AirbyteTriggerSyncOperator, we had to set up two connection IDs for the task: `airflow_airbyte_conn_id` and `airbyte_tvmazeapisf_conn_id`. These handle the ingestion of the data from the TVMaze API into our data warehouse on Snowflake. We then set up the ECSOperator to trigger the dbt image hosted on ECR; for this, we needed the `aws-login-for-ecs-task` connection.
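A rough sketch of how such a DAG might be wired together, assuming Airflow 2.x with the Airbyte, Amazon, and Slack provider packages installed. The DAG name, schedule, Slack connection ID, and ECS cluster/task-definition values are placeholders, not the project's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.amazon.aws.operators.ecs import ECSOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

with DAG(
    dag_id="tvmaze_elt",             # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    notify_start = SlackWebhookOperator(
        task_id="notify_start",
        http_conn_id="slack_conn",   # placeholder Slack webhook connection
        message="TVMaze pipeline started",
    )
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airflow_airbyte_conn_id",    # Airflow -> Airbyte server
        connection_id="airbyte_tvmazeapisf_conn_id",  # Airbyte source -> destination sync
    )
    dbt_run = ECSOperator(
        task_id="dbt_run",
        aws_conn_id="aws-login-for-ecs-task",
        cluster="tvmaze-cluster",        # placeholder ECS cluster
        task_definition="tvmaze-dbt",    # placeholder task definition for the dbt image
        launch_type="EC2",
        overrides={},
    )
    notify_end = SlackWebhookOperator(
        task_id="notify_end",
        http_conn_id="slack_conn",
        message="TVMaze pipeline finished",
    )

    notify_start >> airbyte_sync >> dbt_run >> notify_end
```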
conn.json for ECSOperator

Successfully integrated Slack notification on the project channel
In the current iteration, the Airbyte and dbt Docker images are built and pushed onto ECR, and a task instance is created to run the pipeline on EC2, with Airflow running locally on Luke's device.
Key Learnings and Room for Improvement:
- How to pronounce the word Genre
- Learned that when implementing the Airbyte connector, in the `parse_response` function, the original code from class referenced the response in square brackets; ours didn't need that, as the response is already a JSON object.
- When implementing schemas for a custom connector in Airbyte, each object requires a type (`object`), and each type has its own `properties`, which can contain the "columns" with their own type and properties.
- When starting an Airbyte EC2 instance following a non-graceful shutdown, use `sudo service docker status` to check whether the Docker daemon is running before running `docker-compose up -d`.
- To avoid the above, run `sudo systemctl enable docker` to enable the docker daemon to auto start on boot and when your container is running, run `docker update --restart unless-stopped $(docker ps -q)` to ensure your airbyte container will auto start on the next reboot.
- "When it's curly braces it's an Object, if it's a square bracket it's a List" - Luke, 2022
- For Snowflake permission, being the owner of the Database does not automatically grant access rights.
- Just by using Snowflake, we do not need to push data onto S3 ourselves to get it into Snowflake; data can be loaded directly into Snowflake.
- When creating the custom connector, we discovered that a Python class name cannot be too long. Originally we had `TV_MAZE_API` as the class name and that did not work; we had to change it to just `TVMAZE`.
- When experimenting with Airflow, definitely export all connections first, and delete everything that is not the DAG folder to save setup time.
- Create tests for the Airbyte custom connector and make use of dbt macros.
- Improve the ETL notification messages: make them more distinctive and cater for failed pipelines.
- As new Airbyte connectors are regularly added, we did not realise that there was already a TVMaze connector in Alpha; we built one anyway!
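The `parse_response` and "curly braces vs square brackets" learnings above come down to how JSON deserializes the two shapes. A quick illustration (the trimmed payloads and the `parse_response` signature here are simplified stand-ins, not the connector's actual code):

```python
import json

# A JSON object ({...}) deserializes to a dict; a JSON array ([...]) to a list
obj = json.loads('{"id": 1, "name": "Pilot"}')
arr = json.loads('[{"id": 1}, {"id": 2}]')
assert isinstance(obj, dict) and isinstance(arr, list)

# When the API already returns a top-level array, records can be yielded
# directly instead of indexing into a wrapping key with square brackets
def parse_response(payload: str):
    data = json.loads(payload)
    yield from (data if isinstance(data, list) else [data])

print(list(parse_response('[{"id": 1}, {"id": 2}]')))  # [{'id': 1}, {'id': 2}]
```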
Sounds like "Zhon-Ruh"
All participants in this project are working professionals enrolled in the Data Engineering Camp.
| Name | GitHub | Contributions |
|---|---|---|
| Luke Huntley | LuckyLukeAtGitHub | Airbyte, Docker Image Build & Upload, AWS (IAM, S3, EC2, ECR), Airflow |
| Helen Amin | Helenamin | dbt Transformation Models & Tests, Airflow |
| Fang Xuan Foo | foofx88 | Snowflake, Documentation, Airflow |
All team members partook in the development and cross-checking, and supplied content for the project documentation.
This was the second project, covering the ETL part of the course, in the Data Engineering Camp.