For NCCU 1052 Data Science and Big Data Analytics Final Project
Complete a machine learning challenge hosted by Kaggle in conjunction with Expedia: Expedia-personalized-sort
- visitor_hist_starrating
- prop_starrating
- prop_review_score
- prop_brand_bool
- prop_location_score1
- prop_location_score2
- promotion_flag
- orig_destination_distance
- booking_bool
After downloading the data, I separated the booked rows (booking_bool = 1) from the non-booked rows (booking_bool = 0) with awk, then kept the same number of rows from each class:
awk -F "," '{if(substr($54,1,1)=="1") {print}} ' data.csv > book.csv
awk -F "," '{if(substr($54,1,1)=="0") {print}} ' data.csv > no_book.csv
tail -n ? no_book.csv > no_book_tmp.csv
cat book.csv no_book_tmp.csv > data.csv
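The balancing step above can also be sketched in Python. This is only an illustration, not part of the project's scripts: `balance_rows` and `balance_file` are hypothetical helpers, the label is assumed to sit in the 54th column (index 53, matching `$54` in the awk commands), and the non-booked rows are drawn by random sampling rather than taken from the end of the file as `tail` does.

```python
import csv
import random


def balance_rows(rows, label_idx=53, seed=0):
    """Return an equal number of booked and non-booked rows.

    Rows whose label field starts with neither "1" nor "0"
    (for example a header row) are dropped, as in the awk filter.
    """
    booked = [r for r in rows if r[label_idx].startswith("1")]
    not_booked = [r for r in rows if r[label_idx].startswith("0")]
    k = min(len(booked), len(not_booked))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(booked, k) + rng.sample(not_booked, k)


def balance_file(src, dst, label_idx=53, seed=0):
    """Read src, balance the two classes, and write the result to dst."""
    with open(src, newline="") as f:
        # Skip rows that are too short to contain the label column.
        rows = [r for r in csv.reader(f) if len(r) > label_idx]
    with open(dst, "w", newline="") as f:
        csv.writer(f).writerows(balance_rows(rows, label_idx, seed))
```

Writing to a new file (for example `balance_file("data.csv", "balanced.csv")`) keeps the original download intact, which makes the step easy to repeat.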
Use Spark to run the Python script:
spark-submit --master local ./Main.py
I use RandomForest.trainClassifier. Because the two classes are balanced, the null model (random guessing) has an accuracy of 1/2.
- Accuracy: 77.89%
- Area under Precision/Recall (PR) curve: 87%
- Area under ROC curve: 78.251%
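A minimal sketch of how metrics like these can be produced with the `pyspark.mllib` API is shown below. It is not the project's `Main.py`: the helper names, the 70/30 split, the number of trees, the feature column indices passed in as `feature_idx`, and the choice to replace non-numeric fields with `0.0` are all assumptions made for illustration.

```python
def parse_row(line, feature_idx, label_idx=53, missing=0.0):
    """Split one CSV line into (label, [features]).

    Fields that cannot be parsed as numbers (e.g. "NULL" or "")
    are replaced by `missing`, an assumed placeholder value.
    """
    fields = line.rstrip("\n").split(",")

    def to_float(s):
        try:
            return float(s)
        except ValueError:
            return missing

    label = 1.0 if fields[label_idx].startswith("1") else 0.0
    return label, [to_float(fields[i]) for i in feature_idx]


def train_and_evaluate(sc, path, feature_idx, num_trees=10, seed=42):
    """Train a random forest on the balanced CSV and return
    (accuracy, area under PR, area under ROC) on a held-out split."""
    # Imported here so that parse_row can be used without Spark installed.
    from pyspark.mllib.evaluation import BinaryClassificationMetrics
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    points = (sc.textFile(path)
              .map(lambda line: parse_row(line, feature_idx))
              .map(lambda lf: LabeledPoint(lf[0], lf[1])))
    train, test = points.randomSplit([0.7, 0.3], seed=seed)  # assumed split

    model = RandomForest.trainClassifier(
        train, numClasses=2, categoricalFeaturesInfo={},
        numTrees=num_trees, seed=seed)

    # Predict on the whole RDD, then pair each prediction with its label.
    predictions = model.predict(test.map(lambda p: p.features))
    score_and_labels = predictions.zip(test.map(lambda p: p.label))

    correct = score_and_labels.filter(lambda sl: sl[0] == sl[1]).count()
    accuracy = correct / float(test.count())
    metrics = BinaryClassificationMetrics(score_and_labels)
    return accuracy, metrics.areaUnderPR, metrics.areaUnderROC
```

With a `SparkContext` named `sc`, a call would look like `train_and_evaluate(sc, "data.csv", feature_idx=[...])`, where the list holds the zero-based positions of the feature columns listed at the top of this README.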