PS: This is a work in progress
Raw data is extracted from sources and saved as JSON files before any further processing. This ensures we still have access to the original data if we want to perform additional analysis later, or if data is lost downstream. There are two categories of data collected.
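As a minimal sketch of this raw-persistence step (the directory layout and function name are my own assumptions, not the project's actual code), each extraction can be dumped to a timestamped JSON file before anything else touches it:

```python
import json
import os
from datetime import datetime, timezone

def save_raw(records, source_name, out_dir="raw_data"):
    """Persist an extraction result to a timestamped JSON file
    before any processing, so the raw payload is never lost."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = os.path.join(out_dir, f"{source_name}_{stamp}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path

# Example: persist one (fake) Adzuna page to a temp directory.
path = save_raw([{"id": 1, "title": "Data Engineer"}], "adzuna", out_dir="/tmp/raw_demo")
```

Keeping one file per source per run makes re-processing a single bad run trivial.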
I extracted data from job listing websites using their REST API endpoints (Adzuna, Remotive) and RSS feeds (Stack Overflow Jobs).
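A rough sketch of the API extraction step. The query parameters and response shape below are illustrative assumptions, not the exact Adzuna/Remotive contracts (each API's own documentation defines those); the demo runs against an embedded sample payload so no network call is needed:

```python
import requests

def fetch_page(url, params=None, timeout=10):
    """GET a job-board API endpoint and return the decoded JSON body."""
    resp = requests.get(url, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json()

def extract_listings(payload, results_key="results"):
    """Pull the list of raw listings out of a decoded response body.
    The 'results' key is an assumption about the payload shape."""
    return payload.get(results_key, [])

# Demonstrated on a hypothetical response body instead of a live request:
sample = {"count": 2, "results": [{"title": "Data Engineer"}, {"title": "ML Engineer"}]}
listings = extract_listings(sample)
```

In the real pipeline, the output of `fetch_page` would be saved verbatim to JSON first, with `extract_listings` applied only afterwards.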
GitHub trending repositories data is scraped using the Python requests library with BeautifulSoup.
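A sketch of the scraping step, demonstrated here against a small embedded HTML snippet rather than the live page. The tag/class selectors are assumptions about GitHub's trending-page markup, which changes over time and should be re-checked against the real page:

```python
import requests
from bs4 import BeautifulSoup

TRENDING_URL = "https://github.com/trending"

def fetch_trending(url=TRENDING_URL):
    """Download the trending page's HTML."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return resp.text

def parse_trending(html):
    """Extract repo names and descriptions from the page HTML.
    Assumes each repo sits in an <article class="Box-row"> element."""
    soup = BeautifulSoup(html, "html.parser")
    repos = []
    for row in soup.select("article.Box-row"):
        link = row.select_one("h2 a")
        name = link["href"].strip("/") if link else None
        desc_tag = row.select_one("p")
        desc = desc_tag.get_text(strip=True) if desc_tag else ""
        repos.append({"name": name, "description": desc})
    return repos

# Demonstrated on an embedded snippet instead of the live page:
SAMPLE_HTML = """
<article class="Box-row">
  <h2><a href="/example/awesome-project">example / awesome-project</a></h2>
  <p>A sample repository description.</p>
</article>
"""
repos = parse_trending(SAMPLE_HTML)
```

Separating fetching from parsing keeps the parser unit-testable without hitting GitHub.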
The extracted data is then pre-processed. Since the websites use different field names in their API responses, I map each source onto a common schema so that every record has the same fields and types.
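One way to sketch this normalisation step. The per-source field mappings below are illustrative guesses at what the raw payloads contain, not the project's actual mappings:

```python
# Map each common field name to the field name the source uses.
# These mappings are assumptions about the raw payloads, for illustration only.
FIELD_MAPS = {
    "adzuna":   {"title": "title", "company": "company_name", "url": "redirect_url"},
    "remotive": {"title": "title", "company": "company_name", "url": "url"},
}

def normalise(record, source):
    """Return a record with the common field names, regardless of source.
    Missing source fields come through as None rather than raising."""
    mapping = FIELD_MAPS[source]
    return {common: record.get(src) for common, src in mapping.items()}

# Example: an (invented) raw Adzuna record reshaped to the common schema.
job = normalise(
    {"title": "Data Engineer", "company_name": "Acme", "redirect_url": "https://example.com/job/1"},
    "adzuna",
)
```

With every source funneled through the same schema, the downstream loading and analysis code only ever sees one set of field names.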
Data Model
Python3
Prefect