I used my knowledge of SparkSQL to determine key metrics about home sales data. Then I used Spark to create temporary views, partition the data, cache and uncache a temporary table, and verify that the table was uncached.
Using SparkSQL, I completed the following steps:
- Cached my temporary table `home_sales`.
- Checked whether the temporary table was cached.
- Using the uncached data, ran the query that keeps only the "view" ratings with an average price of at least $350,000.
- Using the cached data, ran the same query, measured its runtime, and compared it to the uncached runtime.
- Partitioned the formatted parquet home sales data by the "date_built" field.
- Created a temporary table for the parquet data.
- Ran the same query against the parquet data, measured its runtime, and compared it to the uncached runtime.
- Uncached the `home_sales` temporary table.
- Verified with PySpark that the `home_sales` temporary table was uncached.
- Uncached runtime: 1.368062973022461 seconds
- Cached runtime: 0.4902327060699463 seconds
- Runtime with the parquet DataFrame: 0.9032614231109619 seconds
- Based on these runtimes, executing the query against the cached table was the most efficient option.