
House Price Prediction

Introduction

Project Description

This practical final project is part of the Introduction to Data Science course in the Data Science program at VNUHCM - University of Science.

The house price prediction problem is the task of estimating the value of real estate from various features. It involves building a model or algorithm that can predict prices based on property attributes.

The goal of house price prediction is to provide a reliable estimate of the buying or selling price so that buyers, sellers, real estate agents, or financial institutions can make informed decisions. By analyzing past and current transactions in the market and considering factors such as location, size, number of rooms, neighborhood amenities, and market trends, prediction models are built and trained to estimate the price of a house accurately.

House price prediction is a regression problem because the target variable (house price) is a continuous variable. Regression models such as linear regression, decision trees, random forest, support vector machines, or gradient boosting algorithms are commonly used for this task.
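As a minimal illustration of this regression framing, here is a toy example with made-up features and prices (the real project uses the scraped dataset described below):

from sklearn.linear_model import LinearRegression

# Toy example: predict price (in billion VND) from area (m2) and number of bedrooms.
# The feature values and prices here are made up purely for illustration.
X = [[60, 2], [80, 3], [120, 4]]
y = [2.1, 3.0, 5.2]
model = LinearRegression().fit(X, y)
print(model.predict([[100, 3]]))  # estimated price for a new listing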

Team members

Name                   Student ID   Contribution   Note
Võ Thị Khánh Linh      21280070     100%           Group leader
Nguyễn Nhật Minh Thư   21280112     100%
Nguyễn Đặng Anh Thư    21280111     100%
Trần Ngọc Tuấn         21280058     100%
Phạm Duy Sơn           21280107     100%

Project Workflow

[Project workflow diagram]

Data Collection

To efficiently and automatically gather data, we will utilize the technique of web scraping. Our objective is to extract information related to real estate transactions from this website. Specifically, we aim to collect data on various property attributes such as the number of floors, bedrooms, bathrooms, the area in square meters, location, road frontage, legal status, as well as the selling prices of the properties.

To crawl information from the real estate website, in this project we use the function scrape_this:

vn = scrape_this(province, num_page, district_list, province_wards)
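The actual implementation of scrape_this lives in the project code; the following is only a rough sketch of the idea, assuming requests and BeautifulSoup. The URL pattern, CSS selectors, and extracted fields are illustrative, and the real function also uses district_list and province_wards to build and filter the pages it visits.

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_this_sketch(province, num_page, district_list, province_wards):
    # Illustrative only: the listing URL and CSS classes depend on the target website.
    records = []
    for page in range(1, num_page + 1):
        url = f"https://example-real-estate-site.vn/{province}?page={page}"
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        for card in soup.select("div.listing-card"):
            records.append({
                "Price": card.select_one(".price").get_text(strip=True),
                "Area": card.select_one(".area").get_text(strip=True),
                "Address": card.select_one(".address").get_text(strip=True),
            })
    return pd.DataFrame(records)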

After collecting the data, the final dataset will consist of 7 features.

[Overview of the 7 collected features]

Data Preprocessing

In this phase, we handle various problems in the raw collected data.

Drop Duplicates

During the data collection process, we noticed that some houses had been listed for sale multiple times. Therefore, the first step in our data processing is to drop the duplicate data points from the dataset.

data_VN = data_VN.drop_duplicates().reset_index(drop = True)

Handle Irrational Input Values

During data collection, we observed that some sellers did not pay attention to the unit of currency when listing their properties for sale. Specifically, some may have mistakenly entered prices in millions, thousands, or even Vietnamese Dong (VND) instead of billions of VND. Instead of discarding data points with such errors, we can convert them to the correct values with simple multiplication or division. Note that the following step assumes that no property is listed for more than 1,000 billion VND or less than 100 million VND.

data_VN = handling_price(data_VN, 'Price')
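handling_price is the project's helper; the sketch below only illustrates the unit-correction idea under the assumption above, treating prices as billions of VND and any value outside [0.1, 1000] as a unit mistake.

def handling_price_sketch(df, col):
    # Assumption: valid prices lie between 0.1 (100 million VND) and 1000 (1,000 billion VND).
    def fix(price):
        while price > 1000:        # entered in millions, thousands, or VND by mistake
            price /= 1000
        while 0 < price < 0.1:     # entered in a unit larger than billions by mistake
            price *= 1000
        return price
    df[col] = df[col].apply(fix)
    return df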

Encode Categorical Variables

The "Roadfrontage" and "Legal" features take only two distinct values, so we can convert them to numeric by mapping them to 1 and 0 in a meaningful way.

# Binary variables: cast the two-valued Roadfrontage feature to 0/1
data_VN["Roadfrontage"] = list(map(int, data_VN["Roadfrontage"]))

# Legal: 1 if the legal status is 'Good', 0 if it is missing
data_VN.loc[data_VN['Legal'] == 'Good', 'Legal'] = int(1)
data_VN.loc[data_VN['Legal'].isna(), 'Legal'] = int(0)
data_VN['Legal'] = pd.to_numeric(data_VN['Legal'])

For the "Address" feature, our team has chosen to geocode it, converting each address into coordinates and producing two new features: Lat (latitude) and Long (longitude).

data_VN_encoding = createLatLong(data_VN)
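createLatLong is the project's helper; below is a minimal sketch of the geocoding idea, assuming geopy's Nominatim service (the actual implementation in the notebook may use a different geocoder and handle rate limits and failures differently).

from geopy.geocoders import Nominatim

def createLatLong_sketch(df):
    # Geocode each address into latitude/longitude (illustrative; rate limits not handled).
    geolocator = Nominatim(user_agent="house-price-prediction")
    df["Lat"] = None
    df["Long"] = None
    for i, address in df["Address"].items():
        location = geolocator.geocode(address)
        if location is not None:
            df.loc[i, "Lat"] = location.latitude
            df.loc[i, "Long"] = location.longitude
    return df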

Further Preprocessing

You can find more details in the notebook.

Modeling

We build and train models on the collected data using several algorithms and then compare their performance.

Machine learning algorithms that are used to train the model:

  • eXtreme Gradient Boosting (XGBoost)
  • Histogram-based Gradient Boosting
  • CatBoost
  • ...
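The training code below uses X_train_scaled, X_test_scaled, y_train, and y_test. Here is a minimal sketch of how such a scaled train/test split could be produced with scikit-learn; the target column name and split settings are illustrative and may differ from the notebook.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative split and scaling of the encoded dataset.
X = data_VN_encoding.drop(columns=["Price"])
y = data_VN_encoding["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)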
import xgboost as xgb
from sklearn.ensemble import HistGradientBoostingRegressor
from catboost import CatBoostRegressor

# Initialize models
model1 = xgb.XGBRegressor(max_depth=9,
                          learning_rate=0.1216060123693681,
                          n_estimators=632,
                          min_child_weight=52,
                          gamma=0.5109582511543662,
                          subsample=0.9976570598100385,
                          colsample_bytree=0.9719365038898948,
                          reg_alpha=0.911025347022059,
                          reg_lambda=0.8850435124855656)

model2 = HistGradientBoostingRegressor(l2_regularization = 5.390165270582185,
                                       learning_rate = 0.1518554287972511,
                                       max_iter = 201,
                                       max_depth = 17,
                                       max_bins = 251,
                                       max_leaf_nodes = 37,
                                       min_samples_leaf = 5)

model3 = CatBoostRegressor(l2_leaf_reg = 4.3173827907470885,
                           max_bin = 91,
                           subsample = 0.9085139050093797,
                           learning_rate = 0.14374193354140258,
                           n_estimators = 423,
                           max_depth = 7,
                           min_data_in_leaf = 46,
                           verbose = False)

# Training
model1.fit(X_train_scaled, y_train)
y_pred1 = model1.predict(X_test_scaled)

model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)

model3.fit(X_train_scaled, y_train)
y_pred3 = model3.predict(X_test_scaled)

Basic ensemble

The idea behind ensemble learning is that by combining the predictions of multiple models, the overall performance and accuracy can be improved compared to using a single model.

# Weighted Average
y_pred = y_pred1*0.25 + y_pred2*0.54 + y_pred3*0.21
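The metrics reported below come from the project; as a sketch, they could be computed with scikit-learn as follows, assuming y_test holds the true prices of the test set.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Evaluate the weighted-average ensemble on the held-out test set
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 Score: {r2:.3f}, RMSE: {rmse:.2f}")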

We can see that applying the Weighted Average technique improves both the R2 score and the RMSE compared to using any of the three models individually. In the best case, the final results are:

  • R2 Score: 0.711
  • RMSE: 1.60

Conclusion

Overall, this study demonstrates the successful application of machine learning techniques to the house price prediction problem, providing insights into the factors influencing house prices and enabling more informed decision-making in the real estate market.

Contribution

This project is a collaborative effort, and we would like to acknowledge the contributions of the following individuals:

  • Vo Thi Khanh Linh
  • Nguyen Nhat Minh Thu
  • Nguyen Dang Anh Thu
  • Tran Ngoc Tuan
  • Pham Duy Son

We would like to express our gratitude to all the contributors for their hard work and dedication to this project. Without their efforts, this project would not have been possible.
