
undergraduate-study-programs-etl's Introduction

FCSE-Skopje 2023 Undergraduate Study Programs ETL

The ETL application transforms the data on the study programs and related courses of the Faculty of Computer Science and Engineering at the Ss. Cyril and Methodius University in Skopje, which can be found at the following URL.

Prerequisites

Overview

Pipeline:

Study Programs:

Cleaning Stage
  • Read the study programs data from the study_programs.csv file
  • Clean the study_program_name column by removing any leading or trailing whitespaces, as well as occurrences of multiple whitespaces, and converting the text to sentence case
Generation Stage
  • Generate the study_program_id column by indexing the study programs
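
The cleaning and generation steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the helper names `clean_text` and `add_index_column` are hypothetical, "sentence case" is assumed to mean an upper-case first letter with the rest in lower case, and the index is assumed to start at 0.

```python
import re


def clean_text(value: str, sentence_case: bool = False) -> str:
    """Trim the value and collapse runs of whitespace into a single space."""
    cleaned = re.sub(r"\s+", " ", value).strip()
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    return cleaned.capitalize() if sentence_case else cleaned


def add_index_column(rows: list, id_column: str) -> list:
    """Return copies of the rows with a 0-based index stored under id_column."""
    return [{id_column: index, **row} for index, row in enumerate(rows)]


# Made-up sample row, for illustration only.
raw_rows = [{"study_program_name": "  Software   ENGINEERING and information systems "}]
cleaned_rows = [
    {"study_program_name": clean_text(row["study_program_name"], sentence_case=True)}
    for row in raw_rows
]
study_programs = add_index_column(cleaned_rows, "study_program_id")
```

The same `clean_text` helper, with `sentence_case=False`, would cover the course_code cleaning described for the other two data sets.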

Curriculum:

Cleaning Stage
  • Read the curriculum data from the curriculum.csv file
  • Clean the course_code column by removing any leading or trailing whitespaces, as well as occurrences of multiple whitespaces
  • Clean the study_program_name and course_name_mk columns by removing any leading or trailing whitespaces, as well as occurrences of multiple whitespaces, and converting the text to sentence case
Handling Invalid Data
  • Handle invalid course_code values by extracting the course code from the course_name_mk column, as well as removing the valid course_code from course_name_mk
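
One way to handle a row whose course_code is invalid is sketched below. The README does not state the course code format, so `COURSE_CODE_PATTERN` is a purely hypothetical pattern chosen for illustration, and `repair_course_code` is a hypothetical helper name.

```python
import re

# Hypothetical code format (for example "F23L1W004"); the real pattern may differ.
COURSE_CODE_PATTERN = re.compile(r"F\d{2}L\d[WS]\d{3}")


def repair_course_code(course_code: str, course_name_mk: str) -> tuple:
    """Return (course_code, course_name_mk), recovering the code from the name if needed."""
    if COURSE_CODE_PATTERN.fullmatch(course_code):
        return course_code, course_name_mk  # the code is already valid
    match = COURSE_CODE_PATTERN.search(course_name_mk)
    if match is None:
        return course_code, course_name_mk  # nothing to recover, leave the row as-is
    # Remove the code from the name and tidy up the whitespace left behind.
    name = course_name_mk.replace(match.group(), "")
    name = re.sub(r"\s+", " ", name).strip()
    return match.group(), name
```

For example, under the assumed pattern, `repair_course_code("", "F23L1W004 Бизнис статистика")` returns the code and the name as separate values.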

Courses:

Cleaning Stage
  • Read the courses data from the courses.csv file
  • Clean the course_code column by removing any leading or trailing whitespaces, as well as occurrences of multiple whitespaces
  • Clean the course_name_en and course_name_mk columns by removing any leading or trailing whitespaces, as well as occurrences of multiple whitespaces, and converting the text to sentence case
  • Clean the course_professors and course_prerequisite columns by replacing newline characters with commas, removing any leading or trailing whitespace, collapsing repeated whitespace, and replacing nulls with нема (Macedonian for "none")
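
The cleaning of the multi-valued columns could look roughly like this. The sketch assumes that values are joined with a comma followed by a space and that an empty string counts as null; neither detail is confirmed by the README, and `clean_multi_value` is a hypothetical helper name.

```python
import re
from typing import Optional


def clean_multi_value(value: Optional[str], placeholder: str = "нема") -> str:
    """Turn a newline-separated cell into a comma-separated one, or a placeholder."""
    if value is None or not value.strip():
        return placeholder  # assumption: empty cells are treated like nulls
    # Clean each line separately, then join the non-empty ones with commas.
    parts = [re.sub(r"\s+", " ", part).strip() for part in value.splitlines()]
    return ", ".join(part for part in parts if part)
```
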
Handling Invalid Data
  • Handle invalid course_code values by extracting the course code from the course_name_mk column, as well as removing the valid course_code from course_name_mk and course_name_en
Extraction Stage
  • Extract the course_level column from the course_code column
  • Extract the course_semester column from the columns course_season and course_academic_year
  • Extract the course_prerequisite_type column from the course_prerequisite column
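
The first two extraction steps can be illustrated under some loudly stated assumptions: that the level is the digit following the letter `L` in a code such as the hypothetical "F23L1W004", that course_season takes the values "winter" and "summer", and that semesters are numbered consecutively with two per academic year. None of these details come from the README, and the function names are hypothetical.

```python
import re
from typing import Optional


def extract_course_level(course_code: str) -> Optional[int]:
    """Return the digit after 'L' in the code, or None if there is none (assumed format)."""
    match = re.search(r"L(\d)", course_code)
    return int(match.group(1)) if match else None


def extract_course_semester(course_season: str, course_academic_year: int) -> int:
    """Number semesters consecutively: year 1 winter is 1, year 1 summer is 2, and so on."""
    offset = {"winter": 1, "summer": 2}[course_season.lower()]  # assumed season labels
    return (course_academic_year - 1) * 2 + offset
```
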
Generation Stage
  • Generate the course_id column by indexing the courses
Transformation Stage
  • Transform the course_professors column by splitting the values and removing the academic titles
  • Transform the course_prerequisite column by splitting the values, validating the course names, and calculating the minimum number of subjects that must be passed to enroll in the course
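
A rough sketch of the course_professors transformation is shown below. The list of academic titles is an assumption made for illustration, as is the comma separator, and `split_professors` is a hypothetical helper name. The prerequisite transformation is not covered here because its rules are not described in enough detail.

```python
# Hypothetical list of academic titles; the real list may be longer or different.
ACADEMIC_TITLES = ("проф.", "доц.", "д-р", "м-р")


def split_professors(value: str) -> list:
    """Split a comma-separated cell and drop academic titles from each name."""
    names = []
    for part in value.split(","):
        tokens = [token for token in part.split() if token.lower() not in ACADEMIC_TITLES]
        if tokens:
            names.append(" ".join(tokens))
    return names
```

With the made-up input `"проф. д-р Ана Анева, доц. д-р Борис Борисов"`, the function yields one title-free name per professor.
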
Mapping Stage
  • Map course_prerequisites to course_prerequisite_ids using the singletons for course_name_mk
  • Map course_professors to course_professor_ids using the singletons for course_professors
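
The mapping stage can be pictured as a lookup from each distinct name to an identifier. The sketch below assumes that the "singletons" are the distinct values of a column and that identifiers are assigned in order of first appearance, starting at 0; names without a match are assumed to be skipped. `build_lookup` and `map_to_ids` are hypothetical helper names.

```python
def build_lookup(values: list) -> dict:
    """Assign a 0-based id to each distinct value, in order of first appearance."""
    lookup = {}
    for value in values:
        lookup.setdefault(value, len(lookup))
    return lookup


def map_to_ids(names: list, lookup: dict) -> list:
    """Translate names to ids, skipping names that are not in the lookup."""
    return [lookup[name] for name in names if name in lookup]
```
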

Results:

This ETL application will save the transformed data in four different files:

  • study_programs.csv: contains the details of the study programs
  • curriculum.csv: contains the details of the study programs and related courses
  • courses.csv: contains the details of the courses
  • merged.csv: contains the merged data from curriculum.csv and courses.csv
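
As an illustration of how merged.csv might be produced, the sketch below joins the two data sets on course_code. The join key and the inner-join behaviour are assumptions, since the README does not specify them, and `merge_on_course_code` is a hypothetical helper name.

```python
def merge_on_course_code(curriculum: list, courses: list) -> list:
    """Inner-join curriculum rows with course rows that share the same course_code."""
    courses_by_code = {course["course_code"]: course for course in courses}
    merged = []
    for row in curriculum:
        course = courses_by_code.get(row["course_code"])
        if course is not None:
            # Curriculum values win if both rows define the same column.
            merged.append({**course, **row})
    return merged
```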

Requirements

  • Python 3.9 or later

Environment Variables

Before running the ETL application, make sure to set the following environment variables:

  • OUTPUT_DIRECTORY_PATH: the path to the directory where the output files will be saved
  • STUDY_PROGRAMS_INPUT_DATA_FILE_PATH: the path to the study programs data file
  • CURRICULA_INPUT_DATA_FILE_PATH: the path to the curricula data file
  • COURSE_INPUT_DATA_FILE_PATH: the path to the courses data file
  • STUDY_PROGRAMS_DATA_OUTPUT_FILE_NAME: the name of the study programs output file
  • CURRICULA_DATA_OUTPUT_FILE_NAME: the name of the curricula output file
  • COURSE_DATA_OUTPUT_FILE_NAME: the name of the courses output file
  • MERGED_DATA_OUTPUT_FILE_NAME: the name of the merged output file
  • EXECUTOR_TYPE: the type of executor that will be used to run the ETL pipeline
  • MAX_WORKERS: the number of threads that will be used to read and write the data
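
The following sketch shows one way such variables could be read and used to choose an executor from the standard `concurrent.futures` module. The accepted EXECUTOR_TYPE values ("THREAD" and "PROCESS") and the default values are assumptions, not taken from the project, and `load_settings` and `make_executor` are hypothetical helper names.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path


def load_settings(env=os.environ) -> dict:
    """Read a few of the variables listed above; defaults are illustrative only."""
    return {
        "output_dir": Path(env["OUTPUT_DIRECTORY_PATH"]),
        "executor_type": env.get("EXECUTOR_TYPE", "THREAD").upper(),
        "max_workers": int(env.get("MAX_WORKERS", "4")),
    }


def make_executor(settings: dict) -> Executor:
    """Pick a process pool for "PROCESS" and a thread pool for anything else."""
    if settings["executor_type"] == "PROCESS":
        return ProcessPoolExecutor(max_workers=settings["max_workers"])
    return ThreadPoolExecutor(max_workers=settings["max_workers"])
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real process environment.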

Installation

  1. Clone the repository

    git clone <repository_url>
  2. Install the required packages

    pip install -r requirements.txt
  3. Run the ETL application

    python main.py

Make sure to replace <repository_url> with the actual URL of the repository.

undergraduate-study-programs-etl's People

Contributors

username-gigo-is-not-available
