NYC Citibike data pipeline

Project introduction

This repository is the final project of the DEZoomcamp 2024 course.
In this project, we explore the NYC Citibike rides dataset, which contains information about bike rides taken in New York City using the Citibike bike-sharing system. The project uses the part of the dataset that covers the second half of 2023 to analyze usage and draw insights.

Problem statement

Our goal is to analyze usage patterns of the Citibike system in NYC during the second half of 2023, and identify trends that could inform future transportation planning. Specifically, we aim to answer the following questions:

  • What are the most popular Citibike stations?
  • On which days of the week are the most trips made?
  • During which part of the day are the most trips made?

Technologies

In this project, we will utilize a suite of modern technologies to build a robust and scalable data pipeline. These technologies include:

  • Data lake - Google Cloud Storage (GCS): our cloud storage solution. We will store our raw and processed data in GCS buckets, allowing us to easily access and analyze the data using other GCP tools.
  • Infrastructure as Code (IaC) - Terraform: an open-source IaC tool that allows us to define and provision our cloud infrastructure using code. We will use Terraform to automate the deployment of our GCP resources, ensuring that our infrastructure is reproducible and scalable.
  • Workflow orchestration - MageAI: a data engineering platform that provides tools for building and orchestrating data pipelines. We will use MageAI to manage our workflows, schedule jobs, and monitor the performance of our pipeline.
  • Data transformation - dbt: a data transformation tool that allows us to define our transformations using SQL. We will use dbt to transform our raw data into a more structured and usable format, and to perform data quality checks.
  • Data warehouse - BigQuery: a fully managed, serverless data warehouse that allows us to store and analyze large volumes of data. We will use BigQuery to store our processed data and to partition and cluster it.
  • Dashboard - Looker Studio: a data visualization tool that allows us to create interactive dashboards and reports. We will use Looker Studio to visualize our data and share insights with stakeholders.

By combining these technologies, we aim to build a data pipeline that can handle large volumes of data and provide insights into the usage patterns of the Citibike system in NYC.

Data pipeline

The pipeline consists of the following main components:

  • Using the MageAI orchestrator, we implement an ETL pipeline that loads data from the API, removes duplicates, and exports the data to a GCS bucket. The pipeline then loads the data from GCS and exports it to the BigQuery data warehouse.
  • Then, using dbt, we perform data transformation tasks such as converting data types and adding two columns: one with the ride duration and one identifying the part of the day when the ride was taken. We also merge the Citibike rides data with data from a file containing information about Citibike stations and add the numeric station ID from this file. Two distinctive dbt features, macros and seeds, were used in the data processing.
  • The transformed data is partitioned and clustered in BigQuery to optimize query performance. The data is partitioned by date and clustered by station ID.
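
To make the transformation steps above more concrete, here is a minimal Python sketch of the same ideas: removing duplicate rides, computing a ride duration, and labelling the part of the day. It is an illustration only, not the project's actual code (in the project, deduplication happens in MageAI and the derived columns are built in dbt with SQL). The field names `ride_id`, `started_at`, and `ended_at`, the output column names, and the part-of-day boundaries are all assumptions.

```python
from datetime import datetime


def part_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse label. Boundaries are assumed, not the project's."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def deduplicate(rides: list[dict]) -> list[dict]:
    """Keep the first occurrence of each ride_id, preserving input order."""
    seen = set()
    unique = []
    for ride in rides:
        if ride["ride_id"] not in seen:
            seen.add(ride["ride_id"])
            unique.append(ride)
    return unique


def enrich(ride: dict) -> dict:
    """Add ride duration in minutes and a part-of-day label (hypothetical column names)."""
    started = datetime.fromisoformat(ride["started_at"])
    ended = datetime.fromisoformat(ride["ended_at"])
    return {
        **ride,
        "ride_duration_min": (ended - started).total_seconds() / 60,
        "part_of_day": part_of_day(started.hour),
    }


# Made-up sample records, including one duplicate ride_id.
sample_rides = [
    {"ride_id": "a1", "started_at": "2023-07-01 08:15:00", "ended_at": "2023-07-01 08:30:00"},
    {"ride_id": "a1", "started_at": "2023-07-01 08:15:00", "ended_at": "2023-07-01 08:30:00"},
    {"ride_id": "b2", "started_at": "2023-07-01 19:00:00", "ended_at": "2023-07-01 19:45:00"},
]

transformed = [enrich(ride) for ride in deduplicate(sample_rides)]
for ride in transformed:
    print(ride["ride_id"], ride["ride_duration_min"], ride["part_of_day"])
```

In dbt, the part-of-day logic would be a natural candidate for a macro, since the same mapping can then be reused across models.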

Dashboard

The project's Looker Studio dashboard provides answers to the questions posed in the problem statement.
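
The aggregations behind a dashboard like this can be sketched in a few lines of Python. This is only a rough illustration on made-up records, assuming hypothetical field names `start_station_name` and `started_at`; the project itself computes its figures in BigQuery and Looker Studio.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up sample rides; station names and timestamps are illustrative only.
rides = [
    {"start_station_name": "Station A", "started_at": "2023-07-03 08:05:00"},
    {"start_station_name": "Station A", "started_at": "2023-07-10 17:40:00"},
    {"start_station_name": "Station B", "started_at": "2023-07-04 09:10:00"},
    {"start_station_name": "Station A", "started_at": "2023-07-05 12:00:00"},
]

# Count trips per start station.
station_counts = Counter(ride["start_station_name"] for ride in rides)

# Count trips per day of the week (datetime.weekday(): Monday is 0).
weekday_counts = Counter(
    WEEKDAYS[datetime.fromisoformat(ride["started_at"]).weekday()] for ride in rides
)

top_station, top_station_trips = station_counts.most_common(1)[0]
busiest_day, busiest_day_trips = weekday_counts.most_common(1)[0]

print(f"Most popular station: {top_station} ({top_station_trips} trips)")
print(f"Busiest day of the week: {busiest_day} ({busiest_day_trips} trips)")
```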

Reproducing

To reproduce the project, you will need to follow the steps below. Note that you will need to have certain tools and credentials to complete the setup.

  1. If you need to set up a virtual machine, account, project or service account in Google Cloud Platform, please refer to the detailed instructions from the DataTalksClub team. They also provide a guide to the execution steps for creating your Terraform infrastructure.
  2. Clone the project's GitHub repository to your local machine:
    git clone https://github.com/Siddha911/Citibike-data-project.git
    
  3. Run the following command to start the Docker containers for MageAI and its underlying Postgres database:
    docker-compose -f mageai/docker-compose.yml up -d
    
  4. Then run this command to move the ready-made pipeline into your MageAI project:
    mv pipeline/citibike_data_pipeline mageai/magic-zoomcamp/pipelines
    
  5. If you use VS Code with a remote SSH connection to your VM and have forwarded port 6789, you can now go to http://localhost:6789 to access the MageAI server instance and run citibike_data_pipeline.
  6. Now you can set up dbt Cloud, once again using a great guide from the incredible DataTalksClub 😊, and apply the transformations and analysis using the files in the dbt/models and dbt/macros folders.

Acknowledgements

I would like to thank DataTalksClub for creating this course and for the opportunity to take it for free. It's really amazing!
