Heritage Housing Issues
- A data analytics project to clean and engineer data for an ML Model that predicts the value of a house in Ames, Iowa and to help visualize the most important features considered when predicting that value.
Live Link to the Project Dashboard
Business Requirements
You are requested by your friend, who has received an inheritance from a deceased great-grandfather located in Ames, Iowa, to help in maximizing the sales price for the inherited properties.
Although your friend has an excellent understanding of property prices in their own state and residential area, they fear that basing their estimates for property worth on their current knowledge might lead to inaccurate appraisals. What makes a house desirable and valuable where they come from might not be the same in Ames, Iowa. They found a public dataset with house prices for Ames, Iowa, and will provide you with that
- 1 - The client is interested in discovering how house attributes correlate with the typical house Sale Price. Therefore, the client expects data visualizations of the correlated variables against Sale Price to show that.
- For this we will need to present the data in a way that is easy to understand, showcasing the major variables that affect a house's sale price.
- 2 - The client is interested in predicting the house sales price from their 4 inherited houses, and any other house in Ames, Iowa.
- We will need to create a dashboard where the user can enter the key variables of their houses in order to give them a price estimate.
Dataset Content
- The dataset is sourced from Kaggle. We created a fictitious user story where predictive analytics could be applied to a real project in the workplace.
- The dataset has almost 1.5 thousand rows and represents housing records from Ames, Iowa; indicating house profile (Floor Area, Basement, Garage, Kitchen, Lot, Porch, Wood Deck, Year Built) and its respective sale price for houses built between 1872 and 2010.
Variable | Meaning | Units |
---|---|---|
1stFlrSF | First Floor square feet | 334 - 4692 |
2ndFlrSF | Second floor square feet | 0 - 2065 |
BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement |
BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinshed; None: No Basement |
BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
GarageArea | Size of garage in square feet | 0 - 1418 |
GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
GarageYrBlt | Year garage was built | 1900 - 2010 |
GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
LotArea | Lot size in square feet | 1300 - 215245 |
LotFrontage | Linear feet of street connected to property | 21 - 313 |
MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
OpenPorchSF | Open porch area in square feet | 0 - 547 |
OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
WoodDeckSF | Wood deck area in square feet | 0 - 736 |
YearBuilt | Original construction date | 1872 - 2010 |
YearRemodAdd | Remodel date (same as construction date if no remodeling or additions) | 1950 - 2010 |
SalePrice | Sale Price | 34900 - 755000 |
Epics and User Stories
User | User wants to be able to.. | So they can.. |
---|---|---|
Client | Navigate dashboard pages easily | Easily absorb all information presented by the dashboard |
View features that correlate with house sale price | Gain an understanding of what factors increase a houses sale price | |
Input house key variables into a sales price prediction app | Estimate the value of their houses | |
Data Scientist | View a breakdown of the ML pipelines | Have an understanding of how the data is prepared and used in predicting sale price |
Most user stories all fall into the epic category of dashboard planning, design and development.
Hypothesis and how to validate?
- 1 - We suspect houses with larger square footing may have had a higher sales price.
- A Correlation study can help in this investigation
- 2 - We suspect that between houses with similar square footing, those with a more recent Year Built date may have had a higher sales price.
- A Correlation study can help in this investigation
- 3 - We suspect that between houses with similar square footing and year built date, those with a more recent Remodel date may have had a higher sales price.
- A Correlation study can help in this investigation
- 4 - We suspect that between houses with similar square footing, those with higher quality and condition scores may have had a higher sales price.
- A Correlation study can help in this investigation
Rationale to map the business requirements to the Data Visualizations and ML tasks
-
Business Requirement 1: Data Visualization and Correlation study
- We will inspect the data related to the houses.
- We will conduct a correlation study (Pearson and Spearman) to understand better how the variables are correlated to Sale Price.
- We will plot the main variables against Sale Price to visualize insights.
-
Business Requirement 2: Regression, Data Analysis
- We want to predict the value of a house. We want to build a regression model to predict the dependent variable.
- We want to make plots to visualize the train and test sets predictions vs the actual.
- We want to run regression evaluation to demonstrate the R2 Score and Mean Absolute Error.
ML Business Case
Predict Sale Price
Regression Model
- We want an ML model to predict the sale price of a house. A target variable is a continuous number. We consider a regression model, which is supervised and uni-dimensional.
- Our ideal outcome is to provide our client with reliable insight into what sale price they should expect for their inherited houses.
- The model success metrics are
- At least 0.7 for R2 score, on train and test set
- The ML model is considered a failure if:
- after 12 months of usage, the model's predictions are 50% off more than 30% of the time. Say, a prediction is >50% off if predicted 10 months and the actual value was 2 months.
- The output should be a continuous value for sale price.
Data Decision Making
-
I chose to use an ArbitraryNumberImputer for filling the missing data of
2ndFlrSf, EnclosedPorch, MasVnrArea and WoodDeckSF
instead of MeanMedianImputer. My reasoning for this was that in all of these variables theMode
value is 0.EnclosedPorch
for example has 90% of it's data missing and of the 10% remaining, 8% of it is zeroes. This is similar for almost all variables relating to area or size of features, which suggests that in the data collection these fields were often left blank instead of filling them with 0 when a house simply did not have a 2nd floor, wood deck or other feature. -
I wanted to fill the missing values for
GarageYrBlt
with the same values asYearBuilt
due the two almost always sharing the same date or the Garage Built date being just one or two years after. However, I ran into an error that I wasn't able to move past, the method I used tofillna
the empty values in my pipeline was at some point causing a ValueError when put though the hyperparameter optimization search. I wasn't able to fix this and so opted to use the MeanMedianImputer instead. This would lead to some odd data, like houses having garages built before the house itself, but ultimately wouldn't cause a negative impact to the model. -
Lastly, I wanted to use the OutlierTrimmer instead of the Windsorizer on
SalePrice, GrLivArea and TotalArea
. To do so however would require me to make a number of changes to the order of the pipeline. Currently the train and test sets are split and then have the cleaning and engineering steps applied to them in the pipeline. The problem with this is that by splitting the sets and then trimming them of outliers, the sets always end up with imbalanced amounts of data that the pipeline expects and causes an error.
Dashboard Design
Page 1: Quick project summary
- Quick project summary
- Project Terms & Jargon
- Describe Project Dataset
- State Business Requirements
Page 2: House Sale Price Study
- Before the analysis, we knew we wanted this page to answer business requirement 1, but we couldn't know in advance which plots would need to be displayed.
- After data analysis, we agreed with stakeholders that the page will:
- State business requirement 1
- Checkbox: data inspection on house sale data (display the number of rows and columns in the data, and display the first ten rows of the data)
- Display the most correlated variables to house price and the conclusions
- Checkbox: Individual plots showing the house price levels for each correlated variable
- Checkbox: Plot that shows house prices and correlated variables with a hue of OverallQuality
Page 3: House Value Estimator
- State business requirement 2
- Set of widgets inputs, which relate to the house variables. Each set of inputs is related to a given ML task to predict a houses SalePrice.
- "Run predictive analysis" button that serves the input data to our ML pipelines, and predicts the value of the house.
Page 4: Project Hypothesis and Validation
Before the analysis, we knew we wanted this page to describe each project hypothesis, the conclusions, and how we validated each. After the data analysis, we can report that:
-
1 - We suspect houses with larger square footing may have had a higher sales price.
- Correct. There is strong correlation between the two. The correlation study on the House Sale Price Study supports this hypothesis.
-
2 - We suspect that between houses with similar square footing, those with a more recent Year Built date may have had a higher sales price.
- Correct. There is moderate correlation between the two. The correlation study on the House Sale Price Study supports this. It's worth noting that houses with a more recent Year Built date are typically higher in Overall Quality which has much stronger correlationto Sale Price.
-
3 - We suspect that between houses with similar square footing and year built date, those with a more recent Remodel date may have had a higher sales price.
- Correct. There is weak to moderate correlation between the two. Again, it is worth noting that there is a relationship between houses with a more recent Remodel date being higher in Overall Quality
-
4 - We suspect that between houses with similar square footing, those with higher quality and condition scores may have had a higher sales price.
- Correct. There is strong correlation between the two variables.The correlation study on the House Sale Price Study supports this hypothesis.
Page 5: ML: Predict House Value
- Considerations and conclusions after the pipeline is trained
- Present ML pipeline steps
- Feature importance
- Pipeline performance
Unfixed Bugs
- There is one unsolved bug in page 5 of the app, related to the evaluate_pipeline.py file, where for some reason in the app on the last page, the function always outputs an extra line of 'None'. I've spent a great deal of my tutors time and my own searching for why this might be but haven't come across a reason for it.
Deployment
Heroku
- The App live link is: https://YOUR_APP_NAME.herokuapp.com/
- The project was deployed to Heroku using the following steps.
- Log in to Heroku and create an App
- At the Deploy tab, select GitHub as the deployment method.
- Select your repository name and click Search. Once it is found, click Connect.
- Select the branch you want to deploy, then click Deploy Branch.
- The deployment process should happen smoothly in case all deployment files are fully functional. Click now the button Open App on the top of the page to access your App.
Main Data Analysis and Machine Learning Libraries
- Matplotlib - Creates various graphs and plots to visualize the data.
- Seaborn - For visualizing the data in the Streamlit app with plots, graphs and more.
- ppscore - Used to study the power predictive score of variables against one another.
- Streamlit - Creating the app to present the study.
- Feature-Engine - Major library for engineering the data for the pipeline.
- Scikit-Learn - Creating the pipeline and applying various algorithms, feature engineering steps and more to it.
- Numpy - To process arrays that store values, aka data. It facilitates math operations and their vectorization.
- Pandas and Pandas-Profiling - For data analysis, data exploration, data manipulation, data visualization.
Credits
Content
- The template for this project was created by Code Institute
- A number of functions were built for this project by Code Institute and are credited throughout the notebooks.
- The dataset is provided by Kaggle
- All learning material was sourced through either the Code Institute program or the documentation of the various libraries used.
Acknowledgements
- A big thank you to Niel McEwen of Code Institute who is quick and eager to help with the many questions I had during my studies.