Data Mining for average movie/tag ratings
Following these instructions will familiarize you with data mining on large datasets. The open-source datasets are available from MovieLens | GroupLens. This repository uses the MovieLens 20M Dataset and the MovieLens Latest Datasets. Both datasets provide a movieID and the rating records for each movie. Each movie can also be categorized by tags. The goal is to compute average ratings with Spark (PySpark), once per movieID and once per tag.
- Task1 - find average movie ratings
- Task2 - find average tag ratings
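As an illustration of Task1, the following is a minimal sketch and not necessarily what the repository's script does. It assumes the standard MovieLens ratings.csv layout (userId,movieId,rating,timestamp, with a header row). The function names are hypothetical. The local version exists only to check the logic on a few lines without a cluster; the Spark version expects an already created SparkContext.

```python
from collections import defaultdict

# Assumption: ratings.csv starts with the header "userId,movieId,rating,timestamp".
HEADER_PREFIX = "userId"


def parse_rating(line):
    """Turn one ratings.csv line into (movieId, (rating, 1))."""
    fields = line.split(",")
    return int(fields[1]), (float(fields[2]), 1)


def add_pairs(a, b):
    """Merge two (sum, count) pairs; associative and commutative, so usable in reduceByKey."""
    return a[0] + b[0], a[1] + b[1]


def average_by_movie_local(lines):
    """Pure-Python reference for small inputs, mirroring the Spark pipeline below."""
    totals = defaultdict(lambda: (0.0, 0))
    for line in lines:
        if line.startswith(HEADER_PREFIX):
            continue  # skip the header row
        movie_id, pair = parse_rating(line)
        totals[movie_id] = add_pairs(totals[movie_id], pair)
    return {movie_id: s / c for movie_id, (s, c) in sorted(totals.items())}


def average_by_movie_spark(sc, path="ml-latest-small/ratings.csv"):
    """Same computation on an existing SparkContext; returns an RDD of (movieId, average)."""
    return (
        sc.textFile(path)
        .filter(lambda line: not line.startswith(HEADER_PREFIX))
        .map(parse_rating)
        .reduceByKey(add_pairs)
        .mapValues(lambda pair: pair[0] / pair[1])
        .sortByKey()
    )
```

Carrying a (sum, count) pair per movie and dividing only at the end keeps the reduction correct across partitions; averaging partial averages would not be.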
Put the MovieLens datasets and the two Python scripts inside the Spark folder. The scripts use a relative path (for example, "ml-latest-small/ratings.csv"), so the program reads the file from there when it calls "sc.textFile". To test a different dataset, simply change the path to "ml-20m/ratings.csv".
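An alternative to editing the path in the source each time is to read it from the command line. This is a hypothetical change, not something the provided scripts are known to do; the helper name and default below are assumptions made for illustration.

```python
import sys

# Assumption: the small dataset is the default when no argument is given.
DEFAULT_RATINGS_PATH = "ml-latest-small/ratings.csv"


def ratings_path(argv, default=DEFAULT_RATINGS_PATH):
    """Return the dataset path passed as the first script argument, else the default."""
    return argv[1] if len(argv) > 1 else default


# Inside a script, the result would be handed to sc.textFile, e.g.:
#   lines = sc.textFile(ratings_path(sys.argv))
```

With such a change, arguments placed after the script name on the spark-submit command line reach the script through sys.argv, for example: ./bin/spark-submit Po-Chuan_Tseng_task1.py ml-20m/ratings.csv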
- Put the source code (.py) and both datasets (ml-20m / ml-latest-small) inside the Spark folder
- Start testing with the steps below
1. Open the Task1 source code (Po-Chuan_Tseng_task1.py).
2. Change the sc.textFile path to match the dataset you want to test.
3. Save the file and open Terminal (on a Mac).
4. cd into the Spark folder and run the following command:
./bin/spark-submit Po-Chuan_Tseng_task1.py
5. After the program finishes, a txt file is generated inside the Spark folder.
6. Open the file and check the values.
1. Open the Task2 source code (Po-Chuan_Tseng_task2.py).
2. Change the sc.textFile path to match the dataset you want to test.
3. Save the file and open Terminal (on a Mac).
4. cd into the Spark folder and run the following command:
./bin/spark-submit Po-Chuan_Tseng_task2.py
5. After the program finishes, a csv file is generated inside the Spark folder.
6. Open the file with TextEdit.app and check the values.
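For Task2, one plausible reading of "average tag rating" is: join tags to ratings on movieId, then average every rating of every movie that carries a given tag. The sketch below follows that assumption and may differ from what Po-Chuan_Tseng_task2.py actually computes. It also assumes the standard MovieLens layouts (ratings.csv: userId,movieId,rating,timestamp; tags.csv: userId,movieId,tag,timestamp), and all function names are hypothetical.

```python
import csv
from collections import defaultdict


def parse_tag(line):
    """Parse one tags.csv line into (movieId, tag); csv handles quoted tags with commas."""
    fields = next(csv.reader([line]))
    return int(fields[1]), fields[2]


def parse_movie_rating(line):
    """Parse one ratings.csv line into (movieId, rating)."""
    fields = line.split(",")
    return int(fields[1]), float(fields[2])


def is_data(line):
    """True for data rows; both files start with a header beginning with 'userId'."""
    return not line.startswith("userId")


def average_by_tag_local(rating_lines, tag_lines):
    """Pure-Python reference for small inputs, mirroring the Spark join below."""
    ratings_by_movie = defaultdict(list)
    for line in filter(is_data, rating_lines):
        movie_id, rating = parse_movie_rating(line)
        ratings_by_movie[movie_id].append(rating)
    totals = defaultdict(lambda: [0.0, 0])  # tag -> [sum, count]
    for line in filter(is_data, tag_lines):
        movie_id, tag = parse_tag(line)
        for rating in ratings_by_movie.get(movie_id, []):
            totals[tag][0] += rating
            totals[tag][1] += 1
    return {tag: s / c for tag, (s, c) in sorted(totals.items())}


def average_by_tag_spark(sc, ratings_path, tags_path):
    """Same computation on an existing SparkContext; returns an RDD of (tag, average)."""
    ratings = sc.textFile(ratings_path).filter(is_data).map(parse_movie_rating)
    tags = sc.textFile(tags_path).filter(is_data).map(parse_tag)
    return (
        tags.join(ratings)  # (movieId, (tag, rating))
        .map(lambda kv: (kv[1][0], (kv[1][1], 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .mapValues(lambda pair: pair[0] / pair[1])
        .sortByKey()
    )
```

Under this reading, a tag applied to the same movie by several users counts that movie's ratings once per application; deduplicating the (movieId, tag) pairs before the join would be the alternative if that weighting is not wanted.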
This repository was created for the INF553 course project at USC.