
A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):

Home Page: http://54.169.163.221:8080

python docker postgresql airflow elt etl data-engineering aws aws-s3 aws-emr


AWS Batch Data Processing Pipeline

.
├── dags
│   ├── malware_detection.py
│   ├── scripts
│   │   ├── emr
│   │   │   └── create_emr.json
│   │   ├── spark
│   │   │   └── malware_file_detection.py
│   │   └── sql
│   │       ├── extract_malware_files.sql
│   │       └── from_redshift_to_supersetdb.sql
│   └── utils.py
├── database-setup
│   ├── redshiftdwh.sql
│   ├── sourcedb.sql
│   └── supersetdb.sql
├── source-data
│   ├── temp
│   │   ├── .gitkeep
│   │   └── malware_files_sample.csv
│   ├── .gitkeep
│   ├── dated_sample.csv
│   ├── malware_detection.csv
│   └── sample.csv
├── terraform
│   ├── main.tf
│   ├── superset_ec2_keypair.pem
│   ├── terraform.tfstate
│   ├── terraform.tfvars
│   └── variables.tf
├── .dockerignore
├── .gitattributes
├── .gitignore
├── README.md
├── docker-compose.yml
├── .env
└── requirements.txt

In this project, I built a data pipeline on AWS that performs daily batch processing. The final Terraform task spins up an EC2 instance on which a Postgres database and a Superset dashboard are hosted. Public users can log in with one of the usernames public1, public2, or public3 (the password is the same as the username) and explore the data or create their own dashboards.

AWS batch processing data pipeline

Initially, the dataset I used did not have a time column, so I wrote a Python function that generates timestamps for the following week, every week, on Mondays. The file contains N = 165053 rows, which I wanted to divide across 7 days. However, I didn't want each day to have the same number of rows, so the timestamps were generated randomly, with the daily row counts drawn from a Gaussian distribution with mean N/7 and standard deviation N/70. In this way, some days have more rows than others, e.g. 21st January had 16069 rows while 20th January had 22735 rows, and the split is different for every week as well. The other columns in the dataset represent the attributes of each file that was "downloaded" and "scanned" to determine whether or not the file contains malicious software. Information about each column is detailed in the metadata.csv file in the source-data folder.
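
A minimal sketch of how such a timestamp-generating function could look (the function name, column name, and week used here are illustrative assumptions, not the exact code in the repository):

```python
import numpy as np
import pandas as pd


def add_weekly_timestamps(df: pd.DataFrame, week_start: str = "2022-01-17") -> pd.DataFrame:
    """Assign each row a random timestamp within a 7-day window starting on a Monday.

    Daily row counts are drawn from a Gaussian with mean N/7 and std N/70,
    so some days end up with more rows than others.
    """
    n = len(df)
    # Draw 7 daily counts from N(n/7, n/70) and rescale so they sum to n
    counts = np.maximum(np.random.normal(loc=n / 7, scale=n / 70, size=7), 0)
    counts = np.round(counts / counts.sum() * n).astype(int)
    counts[-1] += n - counts.sum()  # absorb any rounding drift into the last day

    start = pd.Timestamp(week_start)
    timestamps = []
    for day, count in enumerate(counts):
        day_start = start + pd.Timedelta(days=day)
        # Spread that day's rows uniformly over its 24 hours
        offsets = np.random.randint(0, 24 * 3600, size=count)
        timestamps.extend(day_start + pd.to_timedelta(offsets, unit="s"))

    df = df.copy()
    df["scan_time"] = sorted(timestamps)  # column name is hypothetical
    return df
```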

Next, I want to pretend that the data source for the AWS pipeline is a SQL database, so the CSV file is loaded into a local Postgres instance hosted in a Docker container. The Airflow framework that orchestrates all of this is also hosted on its own set of containers. After this, I use Terraform to provision each piece of AWS infrastructure I need, according to the data pipeline figure above. Terraform is an Infrastructure as Code tool that uses declarative programming: I tell Terraform exactly what I want my cloud infrastructure to look like, and it creates it for me that way. From here onwards, I test the data pipeline by triggering runs in Airflow, and once everything works, I schedule it to run daily at midnight UTC (see the DAG sketch below). In fact, the pipeline has been running properly every day, and only breaks when my Internet Service Provider (ISP) changes my local IP address (which seems to happen on Thursdays).
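
As a rough illustration of that schedule, a stripped-down DAG definition might look something like the following (the task IDs, default arguments, and placeholder operators are assumptions; the real DAG lives in dags/malware_detection.py):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="malware_detection",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 0 * * *",  # daily at midnight UTC
    catchup=False,
) as dag:
    # Placeholder tasks standing in for the real extract/transform/load steps
    extract = DummyOperator(task_id="extract_from_postgres")
    transform = DummyOperator(task_id="process_on_emr")
    load = DummyOperator(task_id="load_to_redshift")

    extract >> transform >> load
```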

Airflow DAG and full runs from the past week

Full DAG runs from the past week

It takes at least 5-8 minutes to start up and stop an EMR cluster (a "cold start" in cloud computing terms), even though it only takes about 30-60 seconds to process the data. Similarly, the Redshift cluster requires several minutes to resume from or pause to a paused state, but only several seconds to load the file from S3 and copy it over to supersetdb on the EC2 instance. In addition, EMR and Redshift are massive, heavy-duty services that were made for tackling terabytes and petabytes of big data; using them to process and store a small 25 MB dataset is complete overkill.
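
To make that cold-start cost concrete, here is a hedged boto3 sketch of starting and tearing down an EMR cluster, assuming dags/scripts/emr/create_emr.json holds the cluster spec as keyword arguments in the shape run_job_flow expects (the region and file path are assumptions):

```python
import json

import boto3

emr = boto3.client("emr", region_name="ap-southeast-1")  # assumed region

# Assumes create_emr.json contains run_job_flow keyword arguments
with open("dags/scripts/emr/create_emr.json") as f:
    job_flow_spec = json.load(f)

cluster_id = emr.run_job_flow(**job_flow_spec)["JobFlowId"]

# Waiting for the cluster to reach a running state is where most of the
# 5-8 minute cold start goes; the Spark job itself finishes in under a minute.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)

# ... submit and wait for the Spark step(s) here ...

# Tear the cluster down so it does not keep accruing costs
emr.terminate_job_flows(JobFlowIds=[cluster_id])
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
```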

Hence, an alternative architecture could involve replacing EMR with a Lambda function and removing the Redshift component altogether. This way, the entire pipeline can stay online and can easily process small data (perhaps up to 1 GB?) within a couple of minutes instead of the ~20 minutes for a full DAG run. The only cold start would be the Lambda function's, which is practically negligible. The cost would also be very low, consisting only of S3 storage and hosting the Docker containers on EC2 (even that can be free for 12 months when using a free-tier t2.micro instance).
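
As a sketch of that alternative, a Lambda handler along these lines could do the small-scale processing, assuming it is triggered by an S3 PUT event (the column name and the processed/ output prefix are hypothetical):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by an S3 PUT event; filters malware rows and writes the result back to S3."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the raw CSV that was just uploaded
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))

    # Keep only rows flagged as malware (the "legitimate" column is illustrative)
    rows = [row for row in reader if row.get("legitimate") == "0"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    # Write the filtered file to a hypothetical processed/ prefix
    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key.split('/')[-1]}",
        Body=out.getvalue().encode("utf-8"),
    )
    return {"rows_written": len(rows)}
```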

Finally, here is a simple dashboard I built using the data from supersetdb (not the data from Redshift). Public users can access this dashboard, but they cannot alter the visuals or perform DML (Data Manipulation Language) operations on the Postgres database. Public users are also invited to create their own charts/dashboards and explore the data using the SQL editor.

Dashboard about fictional malware files detection data

Git commit labels:

  • 🚀 Initial commits.
  • ⚙️ Maintenance, rewriting some parts of code, small code changes.
  • 🪲 Fixing bugs in the code that cause errors or unintended behavior.
  • ⬆️ Enhancements to the functionality of existing code.
  • ⭐ New features, such as adding a new AWS service or a new container.
  • 📚 Code documentation, README edits and adding images/media.


aws-data-pipeline's Issues

How to run this pipeline?

Hi,

How do I run this pipeline? Do I have to run Terraform first, or docker compose?
Also, in the terraform-eg.tfvars file, I don't understand how to replace the values. Is it necessary to already have the information before filling them in?
