ajbd2106 / etlproject-batch

An ETL pipeline that captures data from REST APIs (Remotive, Adzuna, and GitHub) and RSS feeds (Stack Overflow). The collected data is stored on local disk. The files are pre-processed, and the ETL jobs are written in Spark and scheduled with Prefect to run every week. The transformed data is loaded into PostgreSQL.

Python 100.00%

ETL Project (Batch/Local Edition)

PS: This is a work in progress

Architecture diagram

Extraction

Raw data is extracted from the sources and saved as JSON files before any further processing. This ensures that the original data remains available if we want to perform additional analysis later or need to recover from data loss. Two categories of data are collected: job postings and GitHub trending repositories.
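As a minimal sketch of the "save raw first" step, the snippet below writes an unmodified response to a timestamped JSON file. The function name `save_raw`, the `data/raw` directory, and the file naming scheme are assumptions for illustration, not the project's actual layout.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_raw(payload, source, raw_dir="data/raw"):
    """Write an unmodified payload to <raw_dir>/<source>/<source>_<UTC timestamp>.json.

    `raw_dir` and the naming scheme are hypothetical; adjust to the real layout.
    """
    out_dir = Path(raw_dir) / source
    out_dir.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the file name keeps each weekly run in its own file.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{source}_{stamp}.json"

    with path.open("w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
    return path
```

Keeping one file per source and per run means a later transformation bug can be fixed and replayed against the stored files without calling the sources again.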

Extracting Data from Job Postings

I extracted data from job listing websites (Adzuna, Remotive) using their respective REST API endpoints, and from Stack Overflow Jobs using its RSS feed.
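For the RSS side, a sketch along these lines could turn a feed into a list of records using only the standard library. The sample feed is invented for illustration, and the selected fields (`title`, `link`, `pubDate`) are the standard RSS 2.0 item elements rather than anything specific to the Stack Overflow feed.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; a real run would pass the downloaded feed text instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example jobs feed</title>
    <item>
      <title>Data Engineer</title>
      <link>https://example.com/jobs/1</link>
      <pubDate>Mon, 01 Mar 2021 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Backend Developer</title>
      <link>https://example.com/jobs/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>; missing elements become None."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "published": item.findtext("pubDate"),
            }
        )
    return items


jobs = parse_rss_items(SAMPLE_FEED)
```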

Extracting Data from GitHub Trends

GitHub trending repository data is scraped using the Python requests library together with BeautifulSoup.
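The parsing half of the scrape can be sketched as follows. To keep the example free of third-party installs it uses the standard library's `html.parser` in place of BeautifulSoup, and the HTML snippet and the "repository link inside an `h2`" rule are assumptions for illustration, not a verified description of the trending page's markup.

```python
from html.parser import HTMLParser

# Invented markup; the real page structure may differ and can change over time.
SAMPLE_HTML = (
    '<article><h2><a href="/octocat/hello-world">octocat / hello-world</a></h2>'
    '<p>Description <a href="/ignored">link</a></p></article>'
)


class RepoLinkParser(HTMLParser):
    """Collect hrefs of links that appear inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.repos = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            href = dict(attrs).get("href")
            if href:
                # "/owner/repo" -> "owner/repo"
                self.repos.append(href.strip("/"))

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False


def parse_trending(html):
    parser = RepoLinkParser()
    parser.feed(html)
    return parser.repos


repos = parse_trending(SAMPLE_HTML)
```

With BeautifulSoup the same idea would be a selector query over the parsed page; the point of the sketch is only that links outside the heading are ignored.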

Transformation

Pre-processing

The extracted data is first pre-processed. Since the websites use different field names in their API responses, I ensure the data follows a common format, with the same field names and types across sources.
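One way to express this normalization is a per-source field map, sketched below. The target schema (`title`, `company`, `url`) and the source field names in `FIELD_MAP` are illustrative assumptions and should be checked against the actual API responses.

```python
# Hypothetical mapping: target field -> (possibly nested) source field.
FIELD_MAP = {
    "remotive": {"title": "title", "company": "company_name", "url": "url"},
    "adzuna": {
        "title": "title",
        "company": "company.display_name",
        "url": "redirect_url",
    },
}


def _get(record, dotted_key):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def normalize(record, source):
    """Map one raw record to the common schema and tag it with its source."""
    mapping = FIELD_MAP[source]
    row = {target: _get(record, src) for target, src in mapping.items()}
    row["source"] = source
    return row
```

For example, `normalize({"title": "DE", "company": {"display_name": "Acme"}, "redirect_url": "u"}, "adzuna")` yields the same keys as a Remotive record, so later Spark jobs can treat both sources uniformly.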

Final Transformation

Loading

Data Model

Tech Stack

Python3
Prefect
