
Data Engineering Zoomcamp Project

This is my project for the Data Engineering Zoomcamp by DataTalks.Club

Check my personal repository here

Index

  • Problem Statement
  • About the Dataset
  • Architecture
  • Technologies/Tools
  • About the Project
  • Dashboard
  • Reproducibility
  • Notable Notes
  • Acknowledgements

Problem Statement

  • DevTrack, a developer-productivity company, wants to create a new product for the developer community.
  • You have been hired to provide insights on Github developer activity for April 2022.
  • Here are some of your proposed end-goal questions:
    • On which day of the month was Github most active?
    • On which weekday is Github most active?
    • What are the most active day and weekday when filtered by event type?

About the Dataset

Github Archive is a project to record the public Github timeline, archive it, and make it accessible for further analysis.
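The archives are published as one gzipped, newline-delimited JSON file per hour at data.gharchive.org. Purely as an illustration (not part of the pipeline itself), the sketch below fetches a single hourly file for April 1, 2022 and prints the type and created_at fields of the first event; the URL pattern and field names come from the Github Archive data itself.

    import gzip
    import json
    import urllib.request

    # One gzipped, newline-delimited JSON file is published per hour,
    # e.g. https://data.gharchive.org/2022-04-01-0.json.gz (hours run 0-23).
    url = "https://data.gharchive.org/2022-04-01-0.json.gz"
    local_file = "2022-04-01-0.json.gz"
    urllib.request.urlretrieve(url, local_file)

    # Peek at the first event to see, e.g., its event type and timestamp.
    with gzip.open(local_file, "rt", encoding="utf-8") as f:
        first_event = json.loads(f.readline())
        print(first_event["type"], first_event["created_at"])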

Architecture

(Architecture diagram)

back to index

Technologies/Tools

  • Google Cloud Platform (Cloud Storage, BigQuery, DataProc, Compute Engine)
  • Terraform
  • Docker & Docker-Compose
  • Apache Airflow
  • Apache Spark (PySpark)

About the Project

  • Starting from April 1, 2022, Github Archive data is ingested daily into Google Cloud Storage.
  • A PySpark job (sketched after this list) is run on the data in GCS using Google DataProc.
  • The results are written to 2 pre-defined tables in Google BigQuery.
  • A dashboard is created from the BigQuery tables.
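The exact logic lives in the project's spark_job.py. Purely as a hedged sketch (the bucket path, dataset, table, and column names below are placeholders/assumptions), a PySpark job answering the end-goal questions could look roughly like this, writing to BigQuery through the spark-bigquery connector:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("gharchive").getOrCreate()

    # Placeholder path -- the real job reads the Github Archive files that were
    # ingested into the GCS bucket created with Terraform.
    events = spark.read.json("gs://<gcs-bucket-name>/raw/2022-04-*.json.gz")

    # Count events per day of month, weekday and event type.
    daily = (
        events
        .withColumn("created_at", F.to_timestamp("created_at"))
        .withColumn("day_of_month", F.dayofmonth("created_at"))
        .withColumn("weekday", F.date_format("created_at", "EEEE"))
        .groupBy("day_of_month", "weekday", "type")
        .count()
    )

    # The spark-bigquery connector stages data through a GCS bucket, which is
    # why a temporary bucket is referenced in the Dataproc section below.
    (daily.write.format("bigquery")
        .option("table", "<dataset>.<table>")
        .option("temporaryGcsBucket", "<spark-temp-bucket>")
        .mode("overwrite")
        .save())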

Dashboard

(Dashboard screenshot)

back to index

Reproducibility

Pre-Requisites

Google Cloud Platform Account

  1. Create a GCP account if you do not have one. Note that GCP offers $300 in free credits for 90 days.
  2. Create a new project from the GCP dashboard and note your project ID.

Create a Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Click Create Service Account. More information here
  3. Add the following roles to the service account:
    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
    • DataProc Administrator
  4. Download the private JSON keyfile. Rename it to google_credentials.json and store it in ${HOME}/.google/credentials/
  5. You will need to enable the required APIs if you have not done so already

back to index

Pre-Infrastructure Setup

Terraform is used to set up most of the infrastructure, but the Virtual Machine and DataProc cluster used for this project were created on the cloud console. This section contains the steps to set up those parts of the project.

You can also reproduce this project on your local machine, but it is much better to use a VM. If you still choose to use your local machine, install the necessary packages there.

Setting up a Virtual Machine on GCP
  1. On the project dashboard, go to Compute Engine > VM Instances
  2. Create a new instance
    • Use any name of your choosing
    • Choose a region that suits you most

      All your GCP resources should be in the same region

    • For machine configuration, choose the E2 series. An e2-standard-2 (2 vCPU, 8 GB memory) is sufficient for this project
    • In the Boot disk section, change the image to Ubuntu, preferably Ubuntu 20.04 LTS. A disk size of 30 GB is also enough.
    • Leave all other settings at their default values and click Create

You will need to enable the Compute Engine API if you have not already.

Setting up a DataProc Cluster on GCP
  1. On the project dashboard, go to DataProc > Clusters
  2. Create a new cluster
    • Use any name of your choosing

    I used gharchive-cluster for my project

    • For Cluster Type, use Standard (1 master, N workers)
    • Leave the other options at their default values and click Create

You will need to enable the Cloud Dataproc API if you have not already.
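If you would rather script this step than click through the console, a rough equivalent with the google-cloud-dataproc Python client (not something this project itself uses; the project ID, region, and machine types below are assumptions) could look like this:

    from google.cloud import dataproc_v1

    project_id = "<your-project-id>"
    region = "<your-region>"              # same region as your other resources
    cluster_name = "gharchive-cluster"    # the cluster name used in this project

    # The Dataproc client must target a regional endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Standard cluster type: 1 master, N workers (here 2); machine types are illustrative.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)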

back to index

Installing Required Packages on the VM

Before installing packages on the VM, you need to create an SSH key to connect to it.

SSH Key Connection
  1. To create the SSH key, check this guide
  2. Copy the contents of the public key file in the ~/.ssh folder
  3. On the GCP dashboard, navigate to Compute Engine > Metadata > SSH KEYS
  4. Click Edit. Then click Add Item. Paste the public key and click Save
  5. Go to the VM instance you created and copy the External IP
  6. Go back to your terminal and type this command in your home directory
    ssh -i <path-to-private-key> <USER>@<External IP>
    • This should connect you to the VM
  7. You can also create a config file in your local ~/.ssh/ directory. This would make it easier to log in to the VM. Here is an example below:
    Host dezp  # Can be any name of your choosing
        HostName <External IP>
        User <username>
        IdentityFile <absolute-path-to-private-key>
    • You can now connect to your VM from your home directory by running
      ssh dezp
  8. When you're through with using the VM, you should always shut it down. You can do this either on the GCP dashboard or on your terminal
    sudo shutdown now
Google Cloud SDK

Google Cloud SDK is already pre-installed on a GCP VM. You can confirm by running gcloud --version.
If you are not using a VM, check this link to install it on your local machine

Docker
  1. Connect to your VM
  2. Install Docker
    sudo apt-get update
    sudo apt-get install docker.io
  3. Docker needs to be configured so that it can run without sudo
    sudo groupadd docker
    sudo gpasswd -a $USER docker
    sudo service docker restart
    • Logout of your SSH session and log back in
    • Test that docker works successfully by running docker run hello-world
Docker-Compose
  1. Check and copy the latest release for Linux from the official Github repository
  2. Create a folder called bin/ in the home directory. Navigate into the ~/bin directory and download the binary file there
    wget <copied-link> -O docker-compose
  3. Make the file executable
    chmod +x docker-compose
  4. Add the ~/bin directory to PATH permanently
    • Open the .bashrc file in the HOME directory
    nano .bashrc
    • Go to the end of the file and paste this there
    export PATH="${HOME}/bin:${PATH}"
    • Save the file (CTRL-O) and exit nano (CTRL-X)
    • Reload the PATH variable
    source .bashrc
  5. You should be able to run docker-compose from anywhere now. Test this with docker-compose --version
Terraform
  1. Navigate to the bin/ directory that you created and run this
    wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip
  2. Unzip the file
    unzip terraform_1.1.7_linux_amd64.zip

    You might have to install unzip first: sudo apt-get install unzip

  3. Remove the zip file
    rm terraform_1.1.7_linux_amd64.zip
  4. Terraform is now installed. Test it with terraform -v
Google Application Credentials

The JSON credentials file you downloaded is on your local machine. We will transfer it to the VM with an SFTP client.

  1. On your local machine, navigate to the location of the credentials file, ${HOME}/.google/credentials/google_credentials.json
  2. Connect to your VM with SFTP using the host name you created in your config file
    sftp dezp
  3. Once connected to your VM through sftp, create the same folder on your VM ${HOME}/.google/credentials/
  4. Navigate to this folder and run
    put google_credentials.json
  5. Log out of sftp and log in to your VM. Confirm that the file is there
  6. For convenience, add this line to the end of the .bashrc file
    export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/google_credentials.json
    • Refresh with source .bashrc
  7. Use the service account credentials file for authentication
    gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
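To double-check that the key and the environment variable are picked up correctly, here is a quick optional check with the google-cloud-storage Python client (it assumes the package is installed, e.g. via pip install google-cloud-storage):

    import os
    from google.cloud import storage

    # Should print the path set in .bashrc above.
    print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])

    # Listing buckets exercises both the credentials and the Storage roles
    # granted to the service account.
    client = storage.Client()
    for bucket in client.list_buckets():
        print(bucket.name)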
Remote-SSH

To work with folders on a remote machine in Visual Studio Code, you need this extension. The extension also simplifies port forwarding.

  1. Install the Remote-SSH extension from the Extensions Marketplace
  2. At the bottom left-hand corner, click the Open a Remote Window icon
  3. Click Connect to Host, then select the host you defined in your SSH config file.
  4. In the Explorer tab, open any folder on your Virtual Machine. You can now use VS Code to run this project entirely on the VM.

back to index

Main

Clone the repository

    git clone https://github.com/Isaac-Tolu/dezoomcamp-project.git

Create remaining infrastructure with Terraform

We use Terraform to create a GCS bucket and 2 BQ tables

  1. Navigate to the terraform folder
  2. Initialise terraform
    terraform init
  3. Check infrastructure plan
    terraform plan
  4. Create new infrastructure
    terraform apply
  5. Confirm that the infrastructure has been created on the GCP dashboard

Copy PySpark file to Google Cloud Storage

  1. When the DataProc cluster was created, a temporary GCS bucket was created for that cluster. The PySpark file makes use of that temporary bucket.
    • Copy the name of this temporary bucket from the cloud console
    • Replace the bucket name in the PySpark file (spark_job.py)
  2. Copy file to GCS with gsutil
    • On the terminal, navigate to the dataproc directory
    • Then run this command (an equivalent Python sketch follows after this list):
      gsutil cp spark_job.py gs://<gcs-bucket-name>/dataproc/spark_job.py

      gcs-bucket-name is the name of the bucket you created with terraform

  3. Go to the cloud console and confirm that the folder is there
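As mentioned above, the same upload can be done from Python with the google-cloud-storage client instead of gsutil. This is purely an alternative sketch; the bucket name is a placeholder:

    from google.cloud import storage

    bucket_name = "<gcs-bucket-name>"  # the bucket created with Terraform

    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Mirrors: gsutil cp spark_job.py gs://<gcs-bucket-name>/dataproc/spark_job.py
    blob = bucket.blob("dataproc/spark_job.py")
    blob.upload_from_filename("spark_job.py")
    print(f"Uploaded to gs://{bucket_name}/{blob.name}")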

Initialise Airflow

Airflow is run in a Docker container. This section contains the steps for initialising the Airflow resources; a minimal sketch of the DAG itself follows at the end of this section.

  1. Navigate to the airflow folder
  2. Create a logs folder airflow/logs/
    mkdir logs/
  3. Build the docker image
    docker-compose build
  4. The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case.
  5. Initialise Airflow resources
    docker-compose up airflow-init
  6. Kick up all other services
    docker-compose up
  7. Open another terminal instance and check docker running services
    docker ps
    • Check if all the services are healthy
  8. Forward port 8080 from VS Code. Open localhost:8080 in your browser and sign in to Airflow

    Both the username and the password are airflow
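The actual gharchive_dag is defined in the repository. Purely as a hedged sketch of what a daily "Github Archive → GCS → DataProc" DAG can look like (the operator choices, task IDs, paths, and exact cron expression below are assumptions, not the project's code):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    # Illustrative values -- in this project some of these are hardcoded in
    # docker-compose.yaml (see step 4 above).
    PROJECT_ID = "<your-project-id>"
    REGION = "<your-region>"
    BUCKET = "<gcs-bucket-name>"
    CLUSTER = "gharchive-cluster"

    with DAG(
        dag_id="gharchive_dag",
        start_date=datetime(2022, 4, 1),
        schedule_interval="0 8 * * *",   # daily at 08:00 UTC, backfilled from April 1
        catchup=True,
    ) as dag:

        # Download the day's 24 hourly archive files and push them to GCS.
        download_and_upload = BashOperator(
            task_id="download_and_upload",
            bash_command=(
                "curl -sSfL 'https://data.gharchive.org/{{ ds }}-[0-23].json.gz' "
                "-o '/tmp/{{ ds }}-#1.json.gz' && "
                "gsutil -m cp /tmp/{{ ds }}-*.json.gz gs://" + BUCKET + "/raw/"
            ),
        )

        # Run the PySpark job that was copied to gs://<bucket>/dataproc/spark_job.py.
        submit_spark_job = DataprocSubmitJobOperator(
            task_id="submit_spark_job",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "placement": {"cluster_name": CLUSTER},
                "pyspark_job": {
                    "main_python_file_uri": f"gs://{BUCKET}/dataproc/spark_job.py"
                },
            },
        )

        download_and_upload >> submit_spark_job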

Run the pipeline

You are already signed in to Airflow. Now it's time to run the pipeline.

  1. Click on the DAG gharchive_dag that you see there
  2. You should see a tree-like structure of the DAG you're about to run
  3. You can also check the graph structure of the DAG
  4. At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this

    The DAG will run (backfill) from April 1 at 8:00 am UTC up to 8:00 am UTC of the present day
    This should take a while

  5. While this is going on, check the cloud console to confirm that everything is working accordingly

    If you face any problem or error, confirm that you have followed all the above instructions religiously. If the problems still persist, raise an issue.

  6. When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with docker-compose down
  7. Take a well-deserved break to rest. This has been a long ride.

back to index

Notable Notes

  • Partitioning and clustering are pre-defined on the tables in the data warehouse. You can check the definitions in the main Terraform file (an illustrative equivalent in Python is sketched below).
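The real definitions live in Terraform. Purely for illustration, the equivalent with the BigQuery Python client, assuming (hypothetically) partitioning on created_at and clustering on type, would be roughly:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table ID and schema -- check the main Terraform file for the
    # real partitioning/clustering definitions.
    table_id = "<your-project-id>.<dataset>.<table>"
    schema = [
        bigquery.SchemaField("type", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ]

    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="created_at"
    )
    table.clustering_fields = ["type"]
    client.create_table(table)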

Acknowledgements

I'd like to thank the organisers of this wonderful course. It has given me valuable insights into the field of Data Engineering. Also, all fellow students who took time to answer my questions on the Slack channel, thank you very much.

back to index
