
A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):

Home Page: http://54.169.163.221:8080

python docker postgresql airflow elt etl data-engineering aws aws-s3 aws-emr


AWS Batch Data Processing Pipeline

.
├── dags
│   ├── malware_detection.py
│   ├── scripts
│   │   ├── emr
│   │   │   └── create_emr.json
│   │   ├── spark
│   │   │   └── malware_file_detection.py
│   │   └── sql
│   │       ├── extract_malware_files.sql
│   │       └── from_redshift_to_supersetdb.sql
│   └── utils.py
├── database-setup
│   ├── redshiftdwh.sql
│   ├── sourcedb.sql
│   └── supersetdb.sql
├── source-data
│   ├── temp
│   │   ├── .gitkeep
│   │   └── malware_files_sample.csv
│   ├── .gitkeep
│   ├── dated_sample.csv
│   ├── malware_detection.csv
│   └── sample.csv
├── terraform
│   ├── main.tf
│   ├── superset_ec2_keypair.pem
│   ├── terraform.tfstate
│   ├── terraform.tfvars
│   └── variables.tf
├── .dockerignore
├── .gitattributes
├── .gitignore
├── README.md
├── docker-compose.yml
├── .env
└── requirements.txt

In this project, I built a data pipeline on AWS that performs daily batch processing. The final Terraform task spins up an EC2 instance on which a Postgres database and a Superset dashboard are hosted. Public users can log in with one of the usernames public1, public2, or public3 (the password is the same as the username) and explore the data or create their own dashboards.

AWS batch processing data pipeline

Initially, the dataset I used did not have a time column, so I wrote a Python function that generates timestamps for the following week, every week, on Mondays. The file contains N = 165053 rows, which I wanted to divide across 7 days. However, I didn't want each day to have the same number of rows, so the timestamps were generated randomly, with the daily row counts drawn from a Gaussian distribution with mean N/7 and standard deviation N/70. In this way, some days have more rows than others, e.g. 21st January had 16069 rows while 20th January had 22735 rows, and the split is different for every week as well. The other columns in the dataset represent the attributes of each file that was "downloaded" and "scanned" to determine whether or not the file contains malicious software. Information about each column is detailed in the metadata.csv file in the source-data folder.
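
A minimal sketch of how such a timestamp-generating function could look (the function name, column name, and week used here are illustrative assumptions, not the exact code in the repository):

```python
import numpy as np
import pandas as pd


def add_weekly_timestamps(df: pd.DataFrame, week_start: str = "2022-01-17") -> pd.DataFrame:
    """Assign each row a random timestamp within a 7-day window starting on a Monday.

    Daily row counts are drawn from a Gaussian with mean N/7 and std N/70,
    so some days end up with more rows than others.
    """
    n = len(df)
    # Draw 7 daily counts from N(n/7, n/70) and rescale so they sum to n
    counts = np.maximum(np.random.normal(loc=n / 7, scale=n / 70, size=7), 0)
    counts = np.round(counts / counts.sum() * n).astype(int)
    counts[-1] += n - counts.sum()  # absorb any rounding drift into the last day

    start = pd.Timestamp(week_start)
    timestamps = []
    for day, count in enumerate(counts):
        day_start = start + pd.Timedelta(days=day)
        # Spread that day's rows uniformly over its 24 hours
        offsets = np.random.randint(0, 24 * 3600, size=count)
        timestamps.extend(day_start + pd.to_timedelta(offsets, unit="s"))

    df = df.copy()
    df["scan_time"] = sorted(timestamps)  # column name is hypothetical
    return df
```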

Next, I want to pretend that the data source for the AWS pipeline is a SQL database, so the CSV file is loaded into a local Postgres instance hosted in a Docker container. The Airflow framework that orchestrates all of this is also hosted on its own set of containers. After this, I use Terraform to provision each piece of AWS infrastructure I need, according to the data pipeline figure above. Terraform is an Infrastructure as Code tool that uses declarative programming: I tell Terraform exactly what I want my cloud infrastructure to look like, and it creates it for me that way. From here onwards, I test the data pipeline by triggering runs in Airflow, and once everything works, I schedule it to run daily at midnight UTC (see the DAG sketch below). In fact, the pipeline has been running properly every day, and only breaks when my Internet Service Provider (ISP) changes my local IP address (which seems to happen on Thursdays).
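
As a rough illustration of that schedule, a stripped-down DAG definition might look something like the following (the task IDs, default arguments, and placeholder operators are assumptions; the real DAG lives in dags/malware_detection.py):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="malware_detection",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 0 * * *",  # daily at midnight UTC
    catchup=False,
) as dag:
    # Placeholder tasks standing in for the real extract/transform/load steps
    extract = DummyOperator(task_id="extract_from_postgres")
    transform = DummyOperator(task_id="process_on_emr")
    load = DummyOperator(task_id="load_to_redshift")

    extract >> transform >> load
```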

Airflow DAG and full runs from the past week

Full DAG runs from the past week

It takes at least 5-8 minutes to start up and stop an EMR cluster (a "cold start" in cloud computing terms), even though it only takes about 30-60 seconds to process the data. Similarly, the Redshift cluster requires several minutes to resume from or pause to a paused state, but only several seconds to load the file from S3 and copy it over to supersetdb on the EC2 instance. In addition, EMR and Redshift are massive, heavy-duty services that were made for tackling terabytes and petabytes of big data; using them to process and store a small 25 MB dataset is complete overkill.
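
To make that cold-start cost concrete, here is a hedged boto3 sketch of starting and tearing down an EMR cluster, assuming dags/scripts/emr/create_emr.json holds the cluster spec as keyword arguments in the shape run_job_flow expects (the region and file path are assumptions):

```python
import json

import boto3

emr = boto3.client("emr", region_name="ap-southeast-1")  # assumed region

# Assumes create_emr.json contains run_job_flow keyword arguments
with open("dags/scripts/emr/create_emr.json") as f:
    job_flow_spec = json.load(f)

cluster_id = emr.run_job_flow(**job_flow_spec)["JobFlowId"]

# Waiting for the cluster to reach a running state is where most of the
# 5-8 minute cold start goes; the Spark job itself finishes in under a minute.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)

# ... submit and wait for the Spark step(s) here ...

# Tear the cluster down so it does not keep accruing costs
emr.terminate_job_flows(JobFlowIds=[cluster_id])
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
```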

Hence, an alternative architecture could involve replacing EMR with a Lambda function and removing the Redshift component altogether. This way, the entire pipeline can stay online and can easily process small data (perhaps up to 1 GB?) within a couple of minutes instead of the ~20 minutes for a full DAG run. The only cold start would be the Lambda function's, which is practically negligible. The cost would also be very low, consisting only of S3 storage and hosting the Docker containers on EC2 (even that can be free for 12 months when using a free-tier t2.micro instance).
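
As a sketch of that alternative, a Lambda handler along these lines could do the small-scale processing, assuming it is triggered by an S3 PUT event (the column name and the processed/ output prefix are hypothetical):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by an S3 PUT event; filters malware rows and writes the result back to S3."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the raw CSV that was just uploaded
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))

    # Keep only rows flagged as malware (the "legitimate" column is illustrative)
    rows = [row for row in reader if row.get("legitimate") == "0"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    # Write the filtered file to a hypothetical processed/ prefix
    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key.split('/')[-1]}",
        Body=out.getvalue().encode("utf-8"),
    )
    return {"rows_written": len(rows)}
```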

Finally, here is a simple dashboard I built using the data from supersetdb (not the data from Redshift). Public users can access this dashboard, but they cannot alter the visuals or perform DML (Data Manipulation Language) operations on the Postgres database. Public users are also invited to create their own charts/dashboards and explore the data using the SQL editor.

Dashboard about fictional malware files detection data

Git commit labels:

  • 🚀 Initial commits.
  • ⚙️ Maintenance, rewriting some parts of code, small code changes.
  • 🪲 Fixing bugs in the code that cause errors or unintended behavior.
  • ⬆️ Enhancements to the functionality of existing code.
  • ⭐ New features, such as adding a new AWS service or a new container.
  • 📚 Code documentation, README edits and adding images/media.


aws-data-pipeline's Issues

How to run this pipeline?

Hi,

How do I run this pipeline? Do I have to run Terraform first, or docker compose?
Also, in the terraform-eg.tfvars file, I don't understand how to replace the values. Is it necessary to already have the information before filling them in?
