
Global-Emissions-Data-2024

Data Engineering Zoomcamp: final project

Global-Emissions-Data-2024/
├── Airflow/
│   ├── dags/
│   │   └── data_ingestion_dag.py
│   └── plugins/
│       ├── __init__.py
│       ├── data_ingestion.py
│       └── setup.py
├── Data/
│   ├── cleaned_data.parquet
│   └── world_air_quality.csv
├── Terraform/
│   ├── main.tf
│   └── bigquery.tf
└── dbt/
    └── final_project_DEz/
        ├── models/
        │   └── my_model.sql
        └── macros/
            └── ELT_Python_Script.py

Introduction

This project was built for learning purposes, as the final project of the Data Engineering Zoomcamp 2024.

The problem this project solves is that working with the Kaggle data requires transformation and storage for future data batches. This project therefore provides an automated ELT pipeline that can be re-run as new data arrives.

To achieve this, data is extracted using Airflow; loaded into a GCP bucket; transformed and loaded into BigQuery via dbt; then visualised using Looker Studio.

The data is global emissions data (2024), imported from Kaggle.

Dataset

The raw CSV file is provided in this repository at Data/world_air_quality.csv. A cleaned Parquet file (Data/cleaned_data.parquet) is also provided.

Tools

  • GCP
  • Terraform
  • Python
  • Airflow
  • dbt
  • Looker Studio

Solution

[Architecture diagram]

Dashboard:

[Dashboard screenshot]

Prerequisites

GCP (Google Cloud Platform)

No specific installation required.

Access GCP services via the web console or the Google Cloud SDK.

  • Configuration: Install the Google Cloud SDK by following the official installation instructions for your operating system.

  • Authenticate Google Cloud SDK: Run gcloud auth login and follow the prompts to authenticate.

  • Set default project: Run gcloud config set project <project_id> to set the default project.

Terraform

Terraform is not installed via pip. Download the appropriate binary from the official Terraform downloads page, or use a package manager (for example, brew install terraform on macOS).

Python

Python itself is not installed via pip; install Python 3 from python.org or via your system's package manager, then install the required packages:

pip install pandas google-cloud-storage google-cloud-bigquery
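As a rough sketch of what the cleaning step that produces Data/cleaned_data.parquet might involve (the column names and cleaning rules here are hypothetical, not taken from the actual dataset), a minimal pandas pass could look like:

```python
import pandas as pd

# Hypothetical sample standing in for Data/world_air_quality.csv;
# the real column names and dtypes may differ.
raw = pd.DataFrame(
    {
        "country": ["GB", "FR", None, "DE"],
        "pollutant": ["pm25", "no2", "pm10", "pm25"],
        "value": [12.4, None, 8.1, 15.0],
    }
)

# Drop rows missing the measurement value or the country code,
# then reset the index so the cleaned frame is contiguous.
cleaned = raw.dropna(subset=["country", "value"]).reset_index(drop=True)

print(len(cleaned))  # rows surviving the cleaning pass
```

In the real pipeline the cleaned frame would then be written out (for example with `DataFrame.to_parquet`, which additionally requires pyarrow or fastparquet to be installed).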

Airflow

pip install apache-airflow

dbt (data build tool)

pip install dbt-bigquery

This installs dbt Core together with the BigQuery adapter used in this project.

Looker Studio

  • Sign up for a Looker Studio account and log in.
  • Configure connections to your data sources within Looker.

Instructions

Further instructions can be found within each module's ReadMe; the instructions below are at a higher level.

1. Set up infrastructure --- Terraform

  • Open Terraform/
  • Deploy main.tf and bigquery.tf files
    • Make changes to names where applicable. Feel free to add further resources or to use variables as a best practice; however, the current Terraform code works fine for what this project needs.
  • Add your GCP credentials to your machine.
  • Run terraform init
  • Run terraform apply

2. Review infrastructure --- GCP

  • Login to GCP
  • Check that your VM instance, bucket, and BigQuery dataset are set up.

3. Extract Data --- Airflow

  • Open Airflow/plugins
  • Run data_ingestion.py
  • Open Airflow/dags
  • Run data_ingestion_dag.py
    • The DAG file calls the plugin; you will need both files for Airflow to work.
  • Set up a cron schedule so the code runs regularly; the current code is scheduled to run monthly.
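The monthly DAG described above might be structured roughly like this sketch (the task id, function name, and import path are illustrative assumptions, not the actual contents of data_ingestion_dag.py):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical import of the ingestion logic from Airflow/plugins/data_ingestion.py;
# the real module may expose a differently named entry point.
from data_ingestion import run_ingestion

with DAG(
    dag_id="data_ingestion_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # run once a month, matching the cadence described above
    catchup=False,        # do not backfill past months on first deployment
) as dag:
    ingest = PythonOperator(
        task_id="ingest_emissions_data",
        python_callable=run_ingestion,
    )
```

With `schedule="@monthly"` Airflow's scheduler triggers the run itself, so no external cron entry is needed beyond keeping the scheduler running.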

4. Transform and Load --- dbt

  • Open dbt/final_project_DEz/
  • Open models/
  • Save my_model.sql
  • Open macros/
  • Save ELT_Python_Script.py
  • Run in the terminal: dbt run
    • A BigQuery dataset and table will be created, named according to what you specified in your configuration.
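To sanity-check that dbt created the table, you could query its metadata with the BigQuery client library (a sketch only: it assumes you are authenticated, and the project/dataset/table identifiers below are placeholders you must replace with your own):

```python
from google.cloud import bigquery

# Hypothetical fully qualified table id; substitute the names you configured.
TABLE = "your-project.your_dataset.my_model"

client = bigquery.Client()
table = client.get_table(TABLE)  # raises NotFound if dbt run did not create it
print(f"{TABLE} has {table.num_rows} rows")
```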

5. Data Visualisation --- Looker

  • Navigate to Looker Studio in your browser
  • Create (top left)
  • Data Source
  • BigQuery
  • Create any visualisation you wish
