Giter VIP home page Giter VIP logo

yelp_reviews_and_housing_markets_etl's Introduction

ETL Project

This project is designed to conduct extraction, transformations, and loading of two sets of data, namely housing data, and Yelp review data.

The purpose of this project was to migrate the data to a production database. The project stores, transforms, and loads both the datasets successfully into the database.

The project consists of two parts:

  1. ETL on housing data of 2 US cities, New York and Austin to answer some of the questions such as-

    • which city has higher housing price for given number of bedrooms
    • which city has more number of listings
    • the mean and median housing prices in both cities
    • the maximum and minimum house price for similar homes in these 2 cities, etc.
  2. ETL on 2 million+ Yelp user reviews to answer some of the questions such as -

    • who are the top 20 users based on the review count
    • who are the first 50 Yelp users, etc.

The project includes three parts:

1. Data Extraction (Finding Data)

Housing Data:

  • CSV formats of New York housing dataset and Austin housing dataset – Sourced from Kaggle.

Yelp User Data

  • The JSON format for Yelp users’ data - Sourced from Kaggle.

2. Data Transformation (Data Cleanup and Analysis)

The transformation includes following steps –

  • Using Pandas functions in Jupyter Notebook to transform all CSV files into dataframes. It includes

    • ‘.groupby()’ to combine house listings by number of bedrooms,
    • ‘.merge()’ to merge dataframes for New York and Austin housing listings.
  • Using Python transformation functions for data cleaning, joining, filtering, and aggregating such as-

    • ‘.copy()’ to select only desired columns,
    • ‘.dropna()’ to remove empty cells (NaN values) and duplicate rows,
    • ‘.mean()’ to find averages, and
    • ‘.median()’ to find median prices.
  • It also includes converting datatypes for certain columns. I used ‘. astype(float)’ to convert price column from string to float.

  • Dataframe formatting was performed by using ‘.map("${:,.2f}".format)’ to display price column with proper readability.

  • Analysis was done by comparing average and median house prices for both cities. It was done by merging two dataframes that had statistics for each city’s housing prices.

3. Load

I used a relational database called Postgresql as the final production database to load the data into. The following tables and columns were created -

  • newyork_housing with the columns - num_bedrooms, price, street_address, city.

  • austin_housing with the columns - num_bedrooms, price, street_address, city.

  • yelp_users with the columns - user_id, user_name, yelping_since, review_count, average_stars, us_state.

The data was loaded successfully in these tables.

  • New York housing table -

  • Austin housing table -

  • Yelp users table -

Analysis:

Comparision of housing prices in New York and Austin -

Top Yelp users by review counts -

First 50 Yelp users

yelp_reviews_and_housing_markets_etl's People

Contributors

poonam-ux avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.