This repo contains the capstone project for the University of Toronto School of Continuing Studies (SCS) machine learning course (3253-067), which runs from Sept 21, 2023 to Dec 7, 2023. This README documents the project and will be updated as the project progresses.
- Will Huang
- Anmole Bajwa
Purpose / Objective Statement
- What is the main problem to address? Why does it matter?
- Clearly define the population / phenomenon that this project is trying to model / predict / forecast.
- What kind of modeling problem is it (regression, classification, clustering, etc.)?
- How does ML address this problem and add value that other solutions cannot?
Project Scope
Here is a list of activities to be accomplished during this project.
- Problem identification and statement
- Initial planning and identifying potential solutions
- Literature research
- Relevant regulations (if applicable)
- Data collection, exploration and analysis
- Data selection, transformation and feature engineering
- Model assumptions
- Model training, tuning and selection
- Measure and analyze model performance
- Benchmarking
- Identify model limitations
- State the final model(s) developed
- Design a model monitoring plan
- Model deployment
- Discussions on future enhancements
- Model risk assessment
- Briefly re-state the problem here.
- Conduct a brief literature review to identify the most prominent solutions to the problem that are currently available in academia and / or industry.
- If applicable and time allows, identify the regulatory requirements that place constraints on how the model can be developed and used.
Describe the technological tools that are used to develop and implement the model(s). For example, describe:
- The computer hardware and / or cloud platform(s) used.
- The programming languages (such as Python) used.
- The databases for hosting the datasets used to train the model.
- The model's infrastructure and / or pipeline, such as how the model is connected to the database.
Clearly describe the following about the dataset.
- Provide a short sentence describing what this dataset represents.
- State the source(s) of this dataset.
- Is it sourced from one place or multiple places?
- How reliable / reputable is this source?
- How was this data collected? Is there any potential for bias in the data collection process?
- Data representativeness: Given the intended target population defined above, how representative is the dataset of that population?
- What is the dataset format (e.g., tabular, unstructured, etc.)? If it is composed of multiple datasets, how are these datasets related to each other (e.g., by primary and foreign keys)?
- Data composition
- How many records are in the dataset?
- What are the variables available? What are their data types (e.g., integer, float, string, date, etc.)?
- For each variable, how many records are available, and how many are missing?
- Are there outliers in the dataset? Do they warrant removal?
Given the information above, how reliable will the model be?
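The data composition questions above (record counts, data types, missing values, outliers) can be answered with a small profiling pass. The following is a minimal sketch using only the Python standard library. The `records` list, its column names, and the `profile_column` helper are hypothetical placeholders rather than part of this project, and the 1.5 × IQR rule is just one common convention for flagging outliers, assumed here for illustration.

```python
import statistics


def profile_column(values):
    """Summarize one column: record count, missing count, types, IQR outliers."""
    present = [v for v in values if v is not None]
    summary = {
        "n_records": len(values),
        "n_missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "outliers": [],
    }
    # Only numeric columns are checked for outliers (bool is excluded on purpose).
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) >= 4:
        q1, _, q3 = statistics.quantiles(numeric, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        summary["outliers"] = [v for v in numeric if v < low or v > high]
    return summary


# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "city": "Toronto"},
    {"age": 29, "city": None},
    {"age": None, "city": "Ottawa"},
    {"age": 31, "city": "Toronto"},
    {"age": 36, "city": "Hamilton"},
    {"age": 33, "city": "Toronto"},
    {"age": 120, "city": "Ottawa"},
]

# Pivot the row-oriented records into columns, then profile each column.
columns = {name: [row[name] for row in records] for name in records[0]}
profile = {name: profile_column(values) for name, values in columns.items()}

for name, summary in profile.items():
    print(name, summary)
```

A flagged value is only a candidate for review. Whether it warrants removal depends on whether it is a data error or a legitimate extreme observation in the target population.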
Inspect the data both numerically and visually. Refer to chapter 2 of the textbook for examples.
Be sure to set aside a test set from the full dataset before going too far into the data analysis; working only with the remaining training set helps to prevent data snooping.
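One way to keep the split stable across runs and dataset refreshes is to assign each record to the training or test set based on a hash of a unique identifier. The sketch below assumes every record has such an identifier. The `record_id` field, the 20% test ratio, and the helper names are illustrative assumptions, not project decisions.

```python
from zlib import crc32


def is_in_test_set(identifier, test_ratio):
    """Return True if the identifier's CRC32 hash falls in the lowest test_ratio share."""
    return crc32(str(identifier).encode("utf-8")) < test_ratio * 2**32


def split_by_id(records, id_key, test_ratio=0.2):
    """Partition records into (train, test) using a stable hash of row[id_key]."""
    train, test = [], []
    for row in records:
        if is_in_test_set(row[id_key], test_ratio):
            test.append(row)
        else:
            train.append(row)
    return train, test


# Hypothetical dataset with a unique identifier per record.
records = [{"record_id": i, "value": i * 1.5} for i in range(1000)]
train_set, test_set = split_by_id(records, "record_id", test_ratio=0.2)

print(len(train_set), len(test_set))
```

Because the assignment depends only on the identifier, a record never moves between the two sets when new data is appended. If no stable identifier exists, a seeded random split (for example scikit-learn's `train_test_split` with a fixed `random_state`) is a common alternative.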
Develop a simple baseline model to serve as the reference point against which the developed model is compared.
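As an illustration of what a baseline might look like, the sketch below assumes a regression problem and always predicts the mean of the training targets, scoring it with RMSE. The `MeanBaseline` class, the `rmse` helper, and the toy target values are hypothetical. For a classification problem, a predictor that always returns the majority class plays the same role.

```python
import math
import statistics


class MeanBaseline:
    """Baseline regressor that ignores the features and predicts the training mean."""

    def fit(self, targets):
        self.mean_ = statistics.fmean(targets)
        return self

    def predict(self, n_samples):
        return [self.mean_] * n_samples


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(statistics.fmean(squared_errors))


# Hypothetical target values, for illustration only.
y_train = [10.0, 12.0, 14.0]
y_test = [11.0, 13.0]

baseline = MeanBaseline().fit(y_train)
baseline_rmse = rmse(y_test, baseline.predict(len(y_test)))
print(f"Baseline RMSE: {baseline_rmse:.3f}")
```

Any candidate model should beat this score on the same test set by a meaningful margin to justify its added complexity. scikit-learn offers equivalent ready-made baselines in `sklearn.dummy` (`DummyRegressor`, `DummyClassifier`).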