Curated TV Show Datasets from TV Maze
TV show data extracted from the TV Maze API, for data analysts to identify and visualize trends in TV history, and for ML use cases such as generating shows with AI or suggesting shows based on genres and ratings. Once the data is visualized, it can be used by TV show producers to produce better content, and by prospective actors to star in shows that are likely to be popular. Network channel data can help determine which shows to air, and at what time, for a successful broadcast.
This project is built with Python version 3.9.
The following packages and tools are used in this project:
- Jupyter Notebook
- Airbyte
- Snowflake
- dbt
- Airflow
- Jinja
- Docker
- AWS Services (S3, ECR, EC2)
- Diagrams
To run the pipeline locally:

1. Activate your Conda environment and install the Python requirements:

   ```shell
   conda activate <your-venv>
   pip install -r requirements.txt
   ```

2. Clone the Airbyte repo locally:

   ```shell
   git clone https://github.com/airbytehq/airbyte.git
   ```

3. Make sure you have Docker Desktop running first.

4. Run the Docker Compose file:

   ```shell
   cd airbyte
   docker-compose up
   ```

5. Copy `/source-tvmaze` into `airbyte/airbyte-integrations/connectors`.

6. Run `docker-compose up` again to update the Airbyte connector.

7. Set up the custom TVMaze connector as an Airbyte Source.

8. Log in to Snowflake and set up the Snowflake user roles with the SQL files within `data-integration/snowflake`.

9. Set up Snowflake as the Destination and create a new connection between Source and Destination in Airbyte.

10. Install dbt-core and its Snowflake-specific package:

    ```shell
    pip install dbt-snowflake
    ```

11. Under the dbt project folder, run `dbt init`. That will create the same folder structure as `data-transformation/dbt`.

12. Copy the models over from the `tvmaze` folder under `data-transformation/dbt` and run:

    ```shell
    dbt build
    ```

13. Build the Docker image:

    ```shell
    docker build -t tvmaze-dbt:dev -f docker/Dockerfile .
    ```

14. Tag the Docker image for AWS:

    ```shell
    docker tag tvmaze-dbt:dev *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev
    ```

15. Log in to AWS with the CLI:

    ```shell
    aws ecr get-login-password --region *aws_region* | docker login --username AWS --password-stdin *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr
    ```

16. Push the image:

    ```shell
    docker push *aws_account_id*.dkr.ecr.*aws_region*.amazonaws.com/tvmaze-dbt-ecr:dev
    ```

17. Open up a terminal and exec into the Airflow container:

    ```shell
    docker exec -it *replace this with the container id* /bin/sh
    ```

18. Identify where the `dags` folder is in Airflow with `ls`, and change directory into the folder with `cd dags`.

19. Export the Airflow connections, and verify that they are successfully exported to `conn.json`:

    ```shell
    airflow connections export conn.json
    ```

20. To import the `conn.json` file into Airflow, use:

    ```shell
    airflow connections import conn.json
    ```
For a detailed guide on how to replicate this project, please refer to this guide.
```
.
├── data-integration
│   ├── airbyte
│   └── snowflake
├── data-ochestration
│   └── airflow
│       └── dags
├── data-transformation
│   └── dbt
│       ├── docker
│       ├── logs
│       └── tvmaze
├── screenshots
├── solution_diagram.py
└── README.md
```
Screenshots for AWS ECR image and AWS setup
We start off by setting up Airbyte and Snowflake. As we chose the TVMaze API, which does not have a connector in Airbyte, we had to create a custom connector. The custom connector is built with incremental refresh on the backend, and the front-end interface is specified in `spec.yaml`. We have also described the schema in `tv_maze.json`. This is done by sampling the JSON response to get the response structure and datatypes; for the datatypes, however, we went with STRINGS ALL THE WAY! We then build the Docker image, upload it to AWS ECR, and run the instance with EC2. In Snowflake, we create the database TVSHOW to host all the data we will be ingesting. We've also created roles for dbt and Airbyte. Once access is sorted, we created a connection between the custom TVMaze API as the Source and Snowflake as the Destination. The source data is then ingested into Snowflake's S3-backed storage, where we transform the data with models via dbt. These steps are all orchestrated with Airflow, with notifications sent to our project Slack channel.
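The schema-sampling step described above can be sketched roughly as follows. This is a minimal illustration only: the actual `tv_maze.json` was written by hand from a sampled response, and the field names in the sample record here are assumptions.

```python
import json

def infer_string_schema(record: dict) -> dict:
    """Build a JSON-schema-style properties map, typing every leaf as string."""
    props = {}
    for key, value in record.items():
        if isinstance(value, dict):      # curly braces -> nested object
            props[key] = {"type": "object", "properties": infer_string_schema(value)}
        elif isinstance(value, list):    # square brackets -> array
            props[key] = {"type": "array"}
        else:                            # everything else: strings all the way
            props[key] = {"type": "string"}
    return props

# Abridged, hypothetical TVMaze episode response
sample = json.loads('{"id": 1, "name": "Pilot", "show": {"genres": ["Drama"]}}')
schema = {"type": "object", "properties": infer_string_schema(sample)}
print(schema["properties"]["id"])  # {'type': 'string'}
```

Typing everything as a string trades type safety at ingestion for robustness against surprises in the API payload; the casting is deferred to the dbt models downstream.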
Created using Diagrams
Created using DrawIO
For the data integration, Airbyte was chosen as the tool for performing extract and load, with Snowflake as our data warehouse. The custom Airbyte connector is a source connector used to get TV episode schedule data from the tvmaze.com API. The TVMaze connector was developed for incremental extracts. Testing was done by adding the connector as a source in the Airbyte UI and using it in a connection, successfully extracting from the TVMaze API and loading the raw data into a Snowflake destination.
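In general terms, an incremental extract means remembering a cursor (e.g. the latest airdate already synced) and only emitting newer records on the next run. The sketch below shows that idea with hypothetical state handling; the real connector implements this through Airbyte's incremental stream interface, not this function.

```python
def read_incremental(records, state):
    """Yield only records newer than the saved cursor, then advance the cursor."""
    cursor = state.get("airdate", "")          # hypothetical cursor field
    for rec in sorted(records, key=lambda r: r["airdate"]):
        if rec["airdate"] > cursor:
            yield rec
            state["airdate"] = rec["airdate"]  # advance cursor as we emit

state = {"airdate": "2022-10-01"}
new = list(read_incremental(
    [{"id": 1, "airdate": "2022-10-01"}, {"id": 2, "airdate": "2022-10-02"}],
    state,
))
print(new)    # [{'id': 2, 'airdate': '2022-10-02'}]
print(state)  # {'airdate': '2022-10-02'}
```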
The next step was to create the data warehouse in Snowflake and apply all corresponding user permissions for Airbyte and dbt. The full SQL query can be viewed here.
Once that is set up, we test the user permissions by logging into Snowflake and verifying access. Returning to Airbyte, we set up Snowflake as the Destination with the Airbyte user credentials created earlier in Snowflake, then connect the Source (TVMaze) and Destination (Snowflake).
Airbyte Successful Sync

Upon a successful sync, responses from the API call are ingested into Snowflake's storage on the backend.
TVSHOW database structure

All transformations are done using dbt. We have created 3 staging tables and 5 serving tables. Models are written in SQL and can be found in the `dbt/tvmaze/models` folder. We've built the dbt Docker image, uploaded it to ECR, and run the instance with EC2. All credentials are hosted on S3 in an `.env` file.
Successful dbt run

We have used the following transformation techniques:
- Renaming - renaming certain column headers to avoid confusion
- Data type casting - date string to date type, IDs into integers
- Joins - Joins on multiple tables for the staging and serving tables
- Aggregation function - avg() for ratings, rank() for ratings, count() number of shows
- Data Cleaning - Nested replace() for Genres and Days
- Filtering - where clause used to get show types with the best rating
- Grouping - group by show_type, network_name
- Sorting - Order By show runtime
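The actual transformations live in the SQL dbt models linked above; the Python sketch below only illustrates a few of the same techniques (casting, nested replace, aggregation) on toy rows, and all column names and values here are made up.

```python
from datetime import date
from statistics import mean

# Toy rows mimicking the raw all-string ingested data (hypothetical columns)
raw = [
    {"show_id": "1", "show_type": "Scripted", "airdate": "2022-10-01",
     "rating": "8.1", "genres": "['Drama','Crime']"},
    {"show_id": "2", "show_type": "Scripted", "airdate": "2022-10-02",
     "rating": "6.3", "genres": "['Comedy']"},
]

def clean(row):
    return {
        "show_id": int(row["show_id"]),                 # casting: ID string -> integer
        "show_type": row["show_type"],
        "airdate": date.fromisoformat(row["airdate"]),  # casting: date string -> date
        "rating": float(row["rating"]),
        # data cleaning: nested replace() to strip list punctuation from genres
        "genres": row["genres"].replace("[", "").replace("]", "").replace("'", ""),
    }

staged = [clean(r) for r in raw]

# aggregation: avg() rating grouped by show_type
by_type = {}
for r in staged:
    by_type.setdefault(r["show_type"], []).append(r["rating"])
avg_rating = {t: round(mean(v), 2) for t, v in by_type.items()}
print(avg_rating)  # {'Scripted': 7.2}
```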
Our ELT process is orchestrated with a local version of Airflow. Our DAGs include a SlackWebhookOperator positioned at the start and end of the pipeline. For the AirbyteTriggerSyncOperator, we had to set up two connection IDs for the task: `airflow_airbyte_conn_id` and `airbyte_tvmazeapisf_conn_id`. These handle the ingestion of the data from the TVMaze API into our data warehouse on Snowflake. We then set up the ECSOperator to trigger the dbt image hosted on ECR; for this, we needed the `aws-login-for-ecs-task` connection.
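A rough sketch of how such a DAG might be wired together, assuming Airflow 2.x with the Airbyte, Amazon, and Slack provider packages installed. The DAG name, schedule, Slack connection ID, and ECS cluster/task-definition values are placeholders, not the project's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.amazon.aws.operators.ecs import ECSOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

with DAG(
    dag_id="tvmaze_elt",             # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    notify_start = SlackWebhookOperator(
        task_id="notify_start",
        http_conn_id="slack_conn",   # placeholder Slack webhook connection
        message="TVMaze pipeline started",
    )
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airflow_airbyte_conn_id",    # Airflow -> Airbyte server
        connection_id="airbyte_tvmazeapisf_conn_id",  # Airbyte source -> destination sync
    )
    dbt_run = ECSOperator(
        task_id="dbt_run",
        aws_conn_id="aws-login-for-ecs-task",
        cluster="tvmaze-cluster",        # placeholder ECS cluster
        task_definition="tvmaze-dbt",    # placeholder task definition for the dbt image
        launch_type="EC2",
        overrides={},
    )
    notify_end = SlackWebhookOperator(
        task_id="notify_end",
        http_conn_id="slack_conn",
        message="TVMaze pipeline finished",
    )

    notify_start >> airbyte_sync >> dbt_run >> notify_end
```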
conn.json for ECSOperator

Successfully integrated Slack notification on the project channel
In the current iteration, the Airbyte and dbt Docker images are built and pushed onto ECR, and a task instance is created to run the pipeline on EC2, with Airflow running locally on Luke's device.
Key Learnings and Room for Improvement:
- How to pronounce the word Genre
- Learned that when implementing the Airbyte connector, in the `parse_response` function, the original code from class referenced the response in square brackets; ours didn't need that, as the response is already a JSON object.
- When implementing schemas for a custom connector in Airbyte, each object requires a type (`object`), and each type has its own `properties`, which can contain the "columns" with their own type and properties.
- When starting an Airbyte EC2 instance following a non-graceful shutdown, use `sudo service docker status` to check whether the Docker daemon is running before running `docker-compose up -d`.
- To avoid the above, run `sudo systemctl enable docker` to enable the docker daemon to auto start on boot and when your container is running, run `docker update --restart unless-stopped $(docker ps -q)` to ensure your airbyte container will auto start on the next reboot.
- "When it's curly braces it's an Object, if it's a square bracket it's a List" - Luke, 2022
- For Snowflake permission, being the owner of the Database does not automatically grant access rights.
- Just by using Snowflake, we do not need to push data onto S3 ourselves to get it into Snowflake; data can be loaded directly into Snowflake.
- When creating the custom connector, we discovered that a Python class name cannot be too long. Originally we had `TV_MAZE_API` as the class name and that did not work; we had to change it to just `TVMAZE`.
- When experimenting with Airflow, definitely export all connections first, and delete everything that is not the DAG folder to save setup time.
- Create tests for the Airbyte custom connector and make use of dbt macros.
- Improve the ETL notification messages: make them more distinctive and cater for failed pipelines.
- As new Airbyte connectors are regularly added, we did not realise that there was already a TVMaze connector in Alpha; we built one anyway!
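The `parse_response` and "curly braces vs square brackets" learnings above come down to how JSON deserializes the two shapes. A quick illustration (the trimmed payloads and the `parse_response` signature here are simplified stand-ins, not the connector's actual code):

```python
import json

# A JSON object ({...}) deserializes to a dict; a JSON array ([...]) to a list
obj = json.loads('{"id": 1, "name": "Pilot"}')
arr = json.loads('[{"id": 1}, {"id": 2}]')
assert isinstance(obj, dict) and isinstance(arr, list)

# When the API already returns a top-level array, records can be yielded
# directly instead of indexing into a wrapping key with square brackets
def parse_response(payload: str):
    data = json.loads(payload)
    yield from (data if isinstance(data, list) else [data])

print(list(parse_response('[{"id": 1}, {"id": 2}]')))  # [{'id': 1}, {'id': 2}]
```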
Sounds like "Zhon-Ruh"
All participants in this project are working professionals enrolled in the Data Engineering Camp.
| Name | GitHub | Contributions |
|---|---|---|
| Luke Huntley | LuckyLukeAtGitHub | Airbyte, Docker Image Build & Upload, AWS (IAM, S3, EC2, ECR), Airflow |
| Helen Amin | Helenamin | dbt Transformation Models & Tests, Airflow |
| Fang Xuan Foo | foofx88 | Snowflake, Documentation, Airflow |
All team members partook in the development and cross-checking, and supplied content for the project documentation.
This was the second project, covering the ETL part of the course, in the Data Engineering Camp.