
Data Engineering Zoomcamp Project

This is my project for the Data Engineering Zoomcamp by DataTalks.Club

Check my personal repository here

Index

  • Problem Statement
  • About the Dataset
  • Architecture
  • Technologies/Tools
  • About the Project
  • Dashboard
  • Reproducibility
  • Notable Notes
  • Acknowledgements

Problem Statement

  • DevTrack, a developer-productivity company, wants to create a new product for the developer community.
  • You have been hired to provide insights on Github developer activity for April 2022.
  • Here are some of your proposed end-goal questions:
    • On which day of the month was Github most active?
    • On which weekday is Github most active?
    • What are the most active day and weekday when filtered by event type?

About the Dataset

Github Archive is a project to record the public Github timeline, archive it, and make it accessible for further analysis.
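The archives are published as one gzipped, newline-delimited JSON file per hour at data.gharchive.org. Purely as an illustration (not part of the pipeline itself), the sketch below fetches a single hourly file for April 1, 2022 and prints the type and created_at fields of the first event; the URL pattern and field names come from the Github Archive data itself.

    import gzip
    import json
    import urllib.request

    # One gzipped, newline-delimited JSON file is published per hour,
    # e.g. https://data.gharchive.org/2022-04-01-0.json.gz (hours run 0-23).
    url = "https://data.gharchive.org/2022-04-01-0.json.gz"
    local_file = "2022-04-01-0.json.gz"
    urllib.request.urlretrieve(url, local_file)

    # Peek at the first event to see, e.g., its event type and timestamp.
    with gzip.open(local_file, "rt", encoding="utf-8") as f:
        first_event = json.loads(f.readline())
        print(first_event["type"], first_event["created_at"])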

Architecture

(Architecture diagram)

back to index

Technologies/Tools

  • Google Cloud Platform (Cloud Storage, BigQuery, DataProc, Compute Engine)
  • Terraform
  • Docker & Docker-Compose
  • Apache Airflow
  • Apache Spark (PySpark)

About the Project

  • Starting from April 1, 2022, Github Archive data is ingested daily into Google Cloud Storage.
  • A PySpark job (sketched after this list) is run on the data in GCS using Google DataProc.
  • The results are written to 2 pre-defined tables in Google BigQuery.
  • A dashboard is created from the BigQuery tables.
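The exact logic lives in the project's spark_job.py. Purely as a hedged sketch (the bucket path, dataset, table, and column names below are placeholders/assumptions), a PySpark job answering the end-goal questions could look roughly like this, writing to BigQuery through the spark-bigquery connector:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("gharchive").getOrCreate()

    # Placeholder path -- the real job reads the Github Archive files that were
    # ingested into the GCS bucket created with Terraform.
    events = spark.read.json("gs://<gcs-bucket-name>/raw/2022-04-*.json.gz")

    # Count events per day of month, weekday and event type.
    daily = (
        events
        .withColumn("created_at", F.to_timestamp("created_at"))
        .withColumn("day_of_month", F.dayofmonth("created_at"))
        .withColumn("weekday", F.date_format("created_at", "EEEE"))
        .groupBy("day_of_month", "weekday", "type")
        .count()
    )

    # The spark-bigquery connector stages data through a GCS bucket, which is
    # why a temporary bucket is referenced in the Dataproc section below.
    (daily.write.format("bigquery")
        .option("table", "<dataset>.<table>")
        .option("temporaryGcsBucket", "<spark-temp-bucket>")
        .mode("overwrite")
        .save())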

Dashboard

(Dashboard screenshot)

back to index

Reproducibility

Pre-Requisites

Google Cloud Platform Account

  1. Create a GCP account if you do not have one. Note that GCP offers $300 in free credits for 90 days.
  2. Create a new project from the GCP dashboard and note your project ID.

Create a Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Click Create Service Account. More information here
  3. Add the following roles to the service account:
    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
    • DataProc Administrator
  4. Download the private JSON keyfile. Rename it to google_credentials.json and store it in ${HOME}/.google/credentials/
  5. You will need to enable the required APIs if you have not done so already

back to index

Pre-Infrastructure Setup

Terraform is used to set up most of the infrastructure, but the Virtual Machine and DataProc cluster used for this project were created on the cloud console. This section contains the steps to set up those parts of the project.

You can also reproduce this project on your local machine, but it is much better to use a VM. If you still choose to use your local machine, install the necessary packages there.

Setting up a Virtual Machine on GCP
  1. On the project dashboard, go to Compute Engine > VM Instances
  2. Create a new instance
    • Use any name of your choosing
    • Choose a region that suits you most

      All your GCP resources should be in the same region

    • For machine configuration, choose the E2 series. An e2-standard-2 (2 vCPU, 8 GB memory) is sufficient for this project
    • In the Boot disk section, change the image to Ubuntu, preferably Ubuntu 20.04 LTS. A disk size of 30 GB is also enough.
    • Leave all other settings at their default values and click Create

You will need to enable the Compute Engine API if you have not already.

Setting up a DataProc Cluster on GCP
  1. On the project dashboard, go to DataProc > Clusters
  2. Create a new cluster
    • Use any name of your choosing

    I used gharchive-cluster for my project

    • For Cluster Type, use Standard (1 master, N workers)
    • Leave the other options at their default values and click Create

You will need to enable the Cloud Dataproc API if you have not already.
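If you would rather script this step than click through the console, a rough equivalent with the google-cloud-dataproc Python client (not something this project itself uses; the project ID, region, and machine types below are assumptions) could look like this:

    from google.cloud import dataproc_v1

    project_id = "<your-project-id>"
    region = "<your-region>"              # same region as your other resources
    cluster_name = "gharchive-cluster"    # the cluster name used in this project

    # The Dataproc client must target a regional endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Standard cluster type: 1 master, N workers (here 2); machine types are illustrative.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)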

back to index

Installing Required Packages on the VM

Before installing packages on the VM, you need to create an SSH key to connect to it.

SSH Key Connection
  1. To create the SSH key, check this guide
  2. Copy the contents of the public key file in the ~/.ssh folder
  3. On the GCP dashboard, navigate to Compute Engine > Metadata > SSH KEYS
  4. Click Edit. Then click Add Item. Paste the public key and click Save
  5. Go to the VM instance you created and copy the External IP
  6. Go back to your terminal and type this command in your home directory
    ssh -i <path-to-private-key> <USER>@<External IP>
    • This should connect you to the VM
  7. You can also create a config file in your local ~/.ssh/ directory. This would make it easier to log in to the VM. Here is an example below:
    Host dezp  # Can be any name of your choosing
        HostName <External IP>
        User <username>
        IdentityFile <absolute-path-to-private-key>
    • You can now connect to your VM from your home directory by running
      ssh dezp
  8. When you're through with using the VM, you should always shut it down. You can do this either on the GCP dashboard or on your terminal
    sudo shutdown now
Google Cloud SDK

Google Cloud SDK is already pre-installed on a GCP VM. You can confirm by running gcloud --version.
If you are not using a VM, check this link to install it on your local machine

Docker
  1. Connect to your VM
  2. Install Docker
    sudo apt-get update
    sudo apt-get install docker.io
  3. Docker needs to be configured so that it can run without sudo
    sudo groupadd docker
    sudo gpasswd -a $USER docker
    sudo service docker restart
    • Logout of your SSH session and log back in
    • Test that docker works successfully by running docker run hello-world
Docker-Compose
  1. Check and copy the latest release for Linux from the official Github repository
  2. Create a folder called bin/ in the home directory. Navigate into the ~/bin directory and download the binary file there
    wget <copied-link> -O docker-compose
  3. Make the file executable
    chmod +x docker-compose
  4. Add the ~/bin directory to PATH permanently
    • Open the .bashrc file in the HOME directory
    nano .bashrc
    • Go to the end of the file and paste this there
    export PATH="${HOME}/bin:${PATH}"
    • Save the file (CTRL-O) and exit nano (CTRL-X)
    • Reload the PATH variable
    source .bashrc
  5. You should be able to run docker-compose from anywhere now. Test this with docker-compose --version
Terraform
  1. Navigate to the bin/ directory that you created and run this
    wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip
  2. Unzip the file
    unzip terraform_1.1.7_linux_amd64.zip

    You might have to install unzip first: sudo apt-get install unzip

  3. Remove the zip file
    rm terraform_1.1.7_linux_amd64.zip
  4. Terraform is now installed. Test it with terraform -v
Google Application Credentials

The JSON credentials file you downloaded is on your local machine. We will transfer it to the VM with an SFTP client.

  1. On your local machine, navigate to the location of the credentials file, ${HOME}/.google/credentials/google_credentials.json
  2. Connect to your VM with SFTP using the host name you created in your config file
    sftp dezp
  3. Once connected to your VM through sftp, create the same folder on your VM ${HOME}/.google/credentials/
  4. Navigate to this folder and run
    put google_credentials.json
  5. Log out of sftp and log in to your VM. Confirm that the file is there
  6. For convenience, add this line to the end of the .bashrc file
    export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/google_credentials.json
    • Refresh with source .bashrc
  7. Use the service account credentials file for authentication
    gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
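To double-check that the key and the environment variable are picked up correctly, here is a quick optional check with the google-cloud-storage Python client (it assumes the package is installed, e.g. via pip install google-cloud-storage):

    import os
    from google.cloud import storage

    # Should print the path set in .bashrc above.
    print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])

    # Listing buckets exercises both the credentials and the Storage roles
    # granted to the service account.
    client = storage.Client()
    for bucket in client.list_buckets():
        print(bucket.name)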
Remote-SSH

To work with folders on a remote machine in Visual Studio Code, you need this extension. The extension also simplifies port forwarding.

  1. Install the Remote-SSH extension from the Extensions Marketplace
  2. At the bottom left-hand corner, click the Open a Remote Window icon
  3. Click Connect to Host, then select the host you defined in your SSH config file.
  4. In the Explorer tab, open any folder on your Virtual Machine. You can now use VS Code to run this project entirely on the VM.

back to index

Main

Clone the repository

    git clone https://github.com/Isaac-Tolu/dezoomcamp-project.git

Create remaining infrastructure with Terraform

We use Terraform to create a GCS bucket and 2 BQ tables

  1. Navigate to the terraform folder
  2. Initialise terraform
    terraform init
  3. Check infrastructure plan
    terraform plan
  4. Create new infrastructure
    terraform apply
  5. Confirm that the infrastructure has been created on the GCP dashboard

Copy PySpark file to Google Cloud Storage

  1. When the DataProc cluster was created, a temporary GCS bucket was created for that cluster. The PySpark file makes use of that temporary bucket.
    • Copy the name of this temporary bucket from the cloud console
    • Replace the bucket name in the PySpark file (spark_job.py)
  2. Copy file to GCS with gsutil
    • On the terminal, navigate to the dataproc directory
    • Then run this command (an equivalent Python sketch follows after this list):
      gsutil cp spark_job.py gs://<gcs-bucket-name>/dataproc/spark_job.py

      gcs-bucket-name is the name of the bucket you created with terraform

  3. Go to the cloud console and confirm that the folder is there
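As mentioned above, the same upload can be done from Python with the google-cloud-storage client instead of gsutil. This is purely an alternative sketch; the bucket name is a placeholder:

    from google.cloud import storage

    bucket_name = "<gcs-bucket-name>"  # the bucket created with Terraform

    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Mirrors: gsutil cp spark_job.py gs://<gcs-bucket-name>/dataproc/spark_job.py
    blob = bucket.blob("dataproc/spark_job.py")
    blob.upload_from_filename("spark_job.py")
    print(f"Uploaded to gs://{bucket_name}/{blob.name}")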

Initialise Airflow

Airflow is run in a Docker container. This section contains the steps for initialising the Airflow resources; a minimal sketch of the DAG itself follows at the end of this section.

  1. Navigate to the airflow folder
  2. Create a logs folder airflow/logs/
    mkdir logs/
  3. Build the docker image
    docker-compose build
  4. The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case.
  5. Initialise Airflow resources
    docker-compose up airflow-init
  6. Kick up all other services
    docker-compose up
  7. Open another terminal instance and check docker running services
    docker ps
    • Check if all the services are healthy
  8. Forward port 8080 from VS Code. Open localhost:8080 in your browser and sign in to Airflow

    Both the username and the password are airflow
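The actual gharchive_dag is defined in the repository. Purely as a hedged sketch of what a daily "Github Archive → GCS → DataProc" DAG can look like (the operator choices, task IDs, paths, and exact cron expression below are assumptions, not the project's code):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    # Illustrative values -- in this project some of these are hardcoded in
    # docker-compose.yaml (see step 4 above).
    PROJECT_ID = "<your-project-id>"
    REGION = "<your-region>"
    BUCKET = "<gcs-bucket-name>"
    CLUSTER = "gharchive-cluster"

    with DAG(
        dag_id="gharchive_dag",
        start_date=datetime(2022, 4, 1),
        schedule_interval="0 8 * * *",   # daily at 08:00 UTC, backfilled from April 1
        catchup=True,
    ) as dag:

        # Download the day's 24 hourly archive files and push them to GCS.
        download_and_upload = BashOperator(
            task_id="download_and_upload",
            bash_command=(
                "curl -sSfL 'https://data.gharchive.org/{{ ds }}-[0-23].json.gz' "
                "-o '/tmp/{{ ds }}-#1.json.gz' && "
                "gsutil -m cp /tmp/{{ ds }}-*.json.gz gs://" + BUCKET + "/raw/"
            ),
        )

        # Run the PySpark job that was copied to gs://<bucket>/dataproc/spark_job.py.
        submit_spark_job = DataprocSubmitJobOperator(
            task_id="submit_spark_job",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "placement": {"cluster_name": CLUSTER},
                "pyspark_job": {
                    "main_python_file_uri": f"gs://{BUCKET}/dataproc/spark_job.py"
                },
            },
        )

        download_and_upload >> submit_spark_job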

Run the pipeline

You are already signed in to Airflow. Now it's time to run the pipeline.

  1. Click on the DAG gharchive_dag that you see there
  2. You should see a tree-like structure of the DAG you're about to run
  3. You can also check the graph structure of the DAG
  4. At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this

    The DAG will run (backfill) from April 1 at 8:00 am UTC up to 8:00 am UTC of the present day
    This should take a while

  5. While this is going on, check the cloud console to confirm that everything is working accordingly

    If you face any problem or error, confirm that you have followed all the above instructions religiously. If the problems still persist, raise an issue.

  6. When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with docker-compose down
  7. Take a well-deserved break to rest. This has been a long ride.

back to index

Notable Notes

  • Partitioning and clustering are pre-defined on the tables in the data warehouse. You can check the definitions in the main Terraform file (an illustrative equivalent in Python is sketched below).
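The real definitions live in Terraform. Purely for illustration, the equivalent with the BigQuery Python client, assuming (hypothetically) partitioning on created_at and clustering on type, would be roughly:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table ID and schema -- check the main Terraform file for the
    # real partitioning/clustering definitions.
    table_id = "<your-project-id>.<dataset>.<table>"
    schema = [
        bigquery.SchemaField("type", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ]

    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="created_at"
    )
    table.clustering_fields = ["type"]
    client.create_table(table)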

Acknowledgements

I'd like to thank the organisers of this wonderful course. It has given me valuable insights into the field of Data Engineering. Also, all fellow students who took time to answer my questions on the Slack channel, thank you very much.

back to index
