I used my knowledge of SparkSQL to determine key metrics about home sales data. Then I used Spark to create temporary views, partition the data, cache and uncache a temporary table, and verify that the table was uncached.
Using SparkSQL, I completed the following steps:
- Cached my temporary table `home_sales`.
- Checked whether the temporary table was cached.
- Using the uncached data, ran the query that keeps only the "view" ratings with an average price of at least $350,000.
- Using the cached data, ran the same query, measured its runtime, and compared it to the uncached runtime.
- Partitioned the formatted parquet home sales data by the "date_built" field.
- Created a temporary table for the parquet data.
- Ran the same query against the parquet data, measured its runtime, and compared it to the uncached runtime.
- Uncached the `home_sales` temporary table.
- Verified with PySpark that the `home_sales` temporary table was uncached.
- Uncached runtime: 1.368062973022461 seconds
- Cached runtime: 0.4902327060699463 seconds
- Runtime with the parquet DataFrame: 0.9032614231109619 seconds
- Based on these runtimes, executing the query against the cached table was the most efficient option.