
Airflow ETL Data Pipeline


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Project Structure
  4. Setup
  5. Airflow DAG
  6. Data Transformation
  7. Running the Pipeline
  8. Logging and Monitoring

Introduction

This project demonstrates an ETL (Extract, Transform, Load) data pipeline implemented with Apache Airflow. The pipeline extracts data from an S3 bucket, applies transformations using the Astro framework, and loads the transformed data into a Snowflake database.

Prerequisites

  • An AWS account to access the S3 bucket.
  • A Snowflake account for the target Snowflake data warehouse.
  • Access to an Apache Airflow environment, such as Amazon Managed Workflows for Apache Airflow (MWAA).

Project Structure

The project structure is as follows:

Setup

  1. Clone this repository to your local machine.
  2. Ensure that you have Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA) set up.
  3. Configure your Airflow connections (a connection sketch follows this list):
    • Create an S3 connection (S3_CONN_ID) for accessing the S3 bucket.
    • Create a Snowflake connection (SNOWFLAKE_CONN_ID) for the Snowflake database.
  4. Make necessary modifications to the DAG script to match your specific requirements, including the S3 file path, Snowflake credentials, and data transformations.
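
A minimal sketch of how Pipeline.py might reference these connections, assuming they were created beforehand in the Airflow UI (Admin → Connections) or with the `airflow connections add` CLI. The IDs and path below are placeholders, not values from this repository:

```python
# Illustrative only -- the connection IDs and S3 path are assumptions.
# The connections themselves are created in the Airflow UI or via the CLI,
# e.g. `airflow connections add s3_conn --conn-type aws`.

S3_CONN_ID = "s3_conn"                # hypothetical Airflow connection to AWS/S3
SNOWFLAKE_CONN_ID = "snowflake_conn"  # hypothetical Airflow connection to Snowflake
S3_FILE_PATH = "s3://<your-bucket>/<your-file>.csv"  # replace with your object path
```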

Airflow DAG


The Airflow DAG (Directed Acyclic Graph) is defined in the Python script Pipeline.py. It performs the following tasks (a minimal sketch follows the list):

  • Sets up a DAG with a unique dag_id.
  • Schedules the DAG to run daily.
  • Defines the extract, transform, and load tasks.
  • Uses the Astro framework for data transformation.
  • Specifies conflict resolution when merging data into Snowflake.
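
A minimal sketch of what such a DAG can look like, assuming the Astro Python SDK (astro-sdk-python). The dag_id, table names, column names, and S3 path are placeholders, and the two transform functions are sketched in the Data Transformation section below:

```python
from datetime import datetime

from airflow.decorators import dag
from astro import sql as aql
from astro.files import File
from astro.table import Table

S3_CONN_ID = "s3_conn"                # hypothetical connection IDs (see Setup)
SNOWFLAKE_CONN_ID = "snowflake_conn"


@dag(dag_id="s3_to_snowflake_etl", start_date=datetime(2024, 1, 1),
     schedule="@daily", catchup=False)
def etl_pipeline():
    # Extract: load the raw file from S3 into a temporary Snowflake table.
    orders = aql.load_file(
        input_file=File(path="s3://<your-bucket>/<your-file>.csv", conn_id=S3_CONN_ID),
        output_table=Table(conn_id=SNOWFLAKE_CONN_ID),
    )

    # Transform: filter, then join with the customers table
    # (filter_orders / join_orders_customers are sketched in the next section).
    filtered = filter_orders(orders)
    joined = join_orders_customers(
        filtered, Table(name="customers_table", conn_id=SNOWFLAKE_CONN_ID)
    )

    # Load: merge into the reporting table, updating rows on key conflicts.
    aql.merge(
        target_table=Table(name="reporting_table", conn_id=SNOWFLAKE_CONN_ID),
        source_table=joined,
        target_conflict_columns=["order_id"],      # assumed key column
        columns=["order_id", "customer_id", "amount"],
        if_conflicts="update",
    )


etl_pipeline()
```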

Data Transformation

Data transformations are defined using the Astro framework. Two transformation functions are used (both sketched below):

  1. filter_orders: Filters rows in the input table where the "amount" column is greater than 150.
  2. join_orders_customers: Joins the filtered orders with the customers table using SQL.
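
A minimal sketch of the two functions using the Astro SDK's `@aql.transform` decorator, where each function returns a SQL SELECT and its parameters are referenced as templated table names. Apart from the "amount" column, the column names (order_id, customer_id, customer_name) are assumptions:

```python
from astro import sql as aql
from astro.table import Table


@aql.transform
def filter_orders(input_table: Table):
    # Keep only orders whose amount exceeds 150.
    return "SELECT * FROM {{ input_table }} WHERE amount > 150"


@aql.transform
def join_orders_customers(filtered_orders_table: Table, customers_table: Table):
    # Join the filtered orders with the customers table; column names are assumed.
    return """
        SELECT f.order_id, f.customer_id, f.amount, c.customer_name
        FROM {{ filtered_orders_table }} AS f
        JOIN {{ customers_table }} AS c
          ON f.customer_id = c.customer_id
    """
```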

Running the Pipeline

  1. Ensure your DAG script is uploaded and configured in your Airflow environment.
  2. Trigger the DAG to start the ETL process (see the example after this list).
  3. Monitor the Airflow logs for any issues or failures.
  4. Set up scheduling as per your requirements.
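
The DAG is normally triggered from the Airflow UI or with `airflow dags trigger <dag_id>`. As an illustration only, a self-managed Airflow 2.x webserver can also be triggered through its stable REST API; the host, credentials, and dag_id below are placeholders (MWAA uses its own token-based access instead):

```python
# Sketch: trigger a DAG run via the Airflow 2.x stable REST API.
# Host, credentials, and DAG_ID are assumptions -- adjust to your deployment.
import requests

AIRFLOW_HOST = "http://localhost:8080"   # hypothetical webserver URL
DAG_ID = "s3_to_snowflake_etl"           # must match the dag_id in Pipeline.py

response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),          # hypothetical basic-auth credentials
    json={"conf": {}},                    # optional run-time configuration
)
response.raise_for_status()
print("Started run:", response.json()["dag_run_id"])
```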

Logging and Monitoring

To monitor the pipeline's progress and diagnose issues, you can:

  • Monitor Airflow logs for task execution details.
  • Configure alerts and notifications for failures (a default_args sketch follows).
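
As one common approach (assuming SMTP is configured for the deployment), email alerts on task failure can be enabled through default_args; the owner and recipient below are placeholders:

```python
from datetime import timedelta

# Hypothetical default_args enabling failure e-mails and retries;
# pass them to the DAG, e.g. @dag(..., default_args=default_args).
default_args = {
    "owner": "data-engineering",          # placeholder owner
    "email": ["alerts@example.com"],      # placeholder recipient
    "email_on_failure": True,             # e-mail when a task fails
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
```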

Feel free to connect with me on LinkedIn

