dsci_522_payla_rzitomer's Introduction

Parameters for Predicting WineEnthusiast [1] Ratings

When buying a new bottle of wine, you always wonder: will this wine be any good? Should I grab the one from France, Spain, or Australia? With so many to choose from, how do you know which bottle will be excellent? This analysis aims to help people make an informed choice when choosing a wine. WineEnthusiast [1] magazine reviews wines and scores them based on a blind taste test. This analysis used machine learning to classify wines based on price, country, province, and variety, using the WineEnthusiast [1] data. We looked at the top predictors from these categories to determine the most important features in the classification. If you want to know how to pick an excellent bottle of wine before ever tasting it, read on.

The question this project aims to answer: What are the strongest three predictors that a consumer has access to that will indicate if a wine will receive a WineEnthusiast rating of 90 or greater?
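The question above turns the rating into a binary target: 90 or greater versus below 90. The following is a minimal sketch of that labelling step, not the project's own code (the actual cleaning lives in src/load_data.py and may differ). The column names, including `points` for the rating, and the sample rows are assumptions made up for illustration.

```python
import csv
import io

RATING_THRESHOLD = 90  # from the project question: "90 or greater"

# Made-up sample rows for illustration only; column names are assumptions.
SAMPLE_CSV = """country,province,variety,price,points
France,Burgundy,Pinot Noir,45,92
Spain,Rioja,Tempranillo,18,88
Australia,Victoria,Shiraz,30,90
"""

def is_great(points, threshold=RATING_THRESHOLD):
    """Return 1 if the rating meets the threshold, else 0 (hypothetical helper)."""
    return 1 if int(points) >= threshold else 0

def label_rows(csv_text):
    """Parse CSV text and add a binary 'great' column to each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["great"] = is_great(row["points"])
        rows.append(row)
    return rows

labelled = label_rows(SAMPLE_CSV)
for row in labelled:
    print(row["country"], row["points"], row["great"])
```

A classifier would then be trained on features such as price, country, province, and variety to predict the `great` column.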

Dataset

The dataset [2] used in this analysis is from Kaggle and is under an Attribution-ShareAlike License. The data was web scraped from WineEnthusiast [1] on November 22nd, 2017.

Reproducibility

This analysis has both a Dockerfile and a Makefile and can be reproduced with either.

To run this analysis using docker, pull our docker image here: https://hub.docker.com/r/rzitomer/top_predictors_for_great_wine/, clone/download the repository, and then run the following command (filling in PATH_ON_YOUR_COMPUTER with the absolute path of the root of this project on your computer) from the root directory of this project:

docker run --rm  -v PATH_ON_YOUR_COMPUTER:/home/top_predictors_for_great_wine/ --memory=3g rzitomer/top_predictors_for_great_wine make -C '/home/top_predictors_for_great_wine/'

Note that you might have to increase the maximum memory allocated to Docker to 3.0 GiB (it is set to 2.0 GiB by default) to run this analysis. So much memory is required because the cleaned data file is quite large, which makes the pd.read_csv call memory-intensive.

To clean up the analysis outputs, run:

docker run --rm  -v PATH_ON_YOUR_COMPUTER:/home/top_predictors_for_great_wine/ --memory=3g rzitomer/top_predictors_for_great_wine make -C '/home/top_predictors_for_great_wine/' clean
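Because PATH_ON_YOUR_COMPUTER must be an absolute path, it can be convenient to build the two commands above programmatically. The sketch below only assembles and prints the argument lists. The `docker_command` helper is hypothetical and not part of the repository; the image name, container directory, and flags are copied from the commands above.

```python
import os
import shlex

IMAGE = "rzitomer/top_predictors_for_great_wine"
CONTAINER_DIR = "/home/top_predictors_for_great_wine/"

def docker_command(project_root, target=None, memory="3g"):
    """Build the docker run argument list for `make` (or `make <target>`).

    Hypothetical helper for illustration; it does not run anything.
    """
    host_dir = os.path.abspath(project_root)  # the -v mount needs an absolute host path
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{CONTAINER_DIR}",
        f"--memory={memory}",
        IMAGE,
        "make", "-C", CONTAINER_DIR,
    ]
    if target:
        cmd.append(target)
    return cmd

# Print the run and clean commands for the current directory.
print(shlex.join(docker_command(".")))
print(shlex.join(docker_command(".", target="clean")))
```

The returned list could be passed to `subprocess.run(cmd, check=True)` from the project root to execute it.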

To reproduce this analysis with our makefile, clone or download the repository and then run the following from the root directory of this project:

make all

make all runs the scripts in the required order to reproduce the results and analysis as presented. The dependencies are listed below and in each script in the src directory.

To remove all the outputs so that the analysis can be rerun from scratch, run this from the root directory of this project:

make clean

make clean removes all of the files and figures produced by the analysis, so the next run starts again from unzipping and reading in the original file.

Data Analysis Pipeline

To reproduce the analysis manually, run the scripts in the order shown:

python src/load_data.py input_file output_file viz_file       # The input file is the raw data from Kaggle, the output_file is our cleaned data. The viz_file has a visualization of rating frequencies used in our report.
python src/explore_data.py input_file output_folder  # The input file is the cleaned data (the output_file you got from running load_data.py). The output_folder is where to put the resulting viz files.
python src/decision_tree.py input_file output_folder   # The input file is the cleaned data (the output_file you got from running load_data.py). The output_folder is where to put the results of the model.  
python src/result_plots.py input_file output_folder  # The input file is the results of the model (the output_file you got from running decision_tree.py). The output_folder is where to put the files that visualize the model.
Rscript -e "rmarkdown::render('output_file')"        # This line renders our final report, which relies on the outputs of explore_data.py, decision_tree.py, and result_plots.py.

Example of the scripts with the file names from the repo:

python src/load_data.py data/winemag-data-130k-v2.csv.zip data/wine_data_cleaned.csv results/viz_class_frequencies.png
python src/explore_data.py data/wine_data_cleaned.csv results/viz_    
python src/decision_tree.py data/wine_data_cleaned.csv results/model_    
python src/result_plots.py results/model_rank.csv results/results_     
Rscript -e "rmarkdown::render('docs/results.Rmd')"     
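The manual steps can also be scripted so that they always run in the required order and stop at the first failure. This is a rough sketch rather than a replacement for the Makefile. The commands are copied from the example above, while the `run_pipeline` function and its `dry_run` flag are hypothetical names introduced here.

```python
import subprocess

# Commands copied from the example above, in the required order.
PIPELINE = [
    ["python", "src/load_data.py", "data/winemag-data-130k-v2.csv.zip",
     "data/wine_data_cleaned.csv", "results/viz_class_frequencies.png"],
    ["python", "src/explore_data.py", "data/wine_data_cleaned.csv", "results/viz_"],
    ["python", "src/decision_tree.py", "data/wine_data_cleaned.csv", "results/model_"],
    ["python", "src/result_plots.py", "results/model_rank.csv", "results/results_"],
    ["Rscript", "-e", "rmarkdown::render('docs/results.Rmd')"],
]

def run_pipeline(steps=PIPELINE, dry_run=True):
    """Run each step in order; with dry_run=True the commands are only printed."""
    executed = []
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing step
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed

# Dry run: list the steps without executing them.
steps = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the root directory of the project would execute the steps, assuming Python, R, and the dependencies listed below are installed.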

Dependencies

Programs/Languages   Version
Python               3.7.0
R                    1.2.1114
Make                 4.2.1

Packages             Version
pandas               0.23.4
matplotlib           2.2.3
argparse             1.1
numpy                1.15.1
graphviz             0.8.4
scikit-learn         0.19.2
rmarkdown            1.10
tidyverse            1.2.1
knitr                1.20

References

  1. WineEnthusiast https://www.winemag.com/. Accessed November 2018

  2. Thoutt, Z. (2017, Nov). Wine Reviews, Version 2. Retrieved November 15, 2018 from https://www.kaggle.com/zynicide/wine-reviews/home.
