prettyma / feature-engineering-multi-class-classification

Data Analytics pipeline using Apache Spark | Build multi-class classification models | Test the model using test data and compute accuracy of each method

Python 100.00%
data-pipeline apache-spark word-frequency-count co-occurence mlib python logistic-regression naive-bayes-classification hadoop-mapreduce linux

feature-engineering-multi-class-classification's Introduction

REPORT

Environment Details:

The following environment was chosen for this lab:

  1. Apache Spark running on the virtual machine (provided for Lab2 - Hadoop)
  2. Scripts written in Python

Steps to run the program:

Note: The Python script is located at /spark/lab3/src/script.py and the data at /spark/lab3/data

  1. From the terminal, navigate to the folder spark/
  2. Run the script with spark-submit lab3/src/script.py
  3. The accuracy returned by Logistic Regression and Naive Bayes is printed to the console

Documentation

  1. Collect data
    Data was collected using NYTimesArticles-API (as used in lab2) for the following categories:
  • Business : Stock, Economy, Finance
  • Sports : NBA, BFL, Golf
  • Politics : President, Trump, Election
  • Entertainment : Met Gala, TMZ, SNL
  2. Dataframe
    All articles in each category were read and appended to a Spark DataFrame, along with a column specifying the category name.
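
A minimal sketch of the row-building step, assuming a hypothetical layout in which each category has its own sub-folder of plain-text articles (`<data_dir>/<category>/*.txt`). The layout and the function name `load_labelled_rows` are illustrative assumptions, not taken from the original script.

```python
from pathlib import Path


def load_labelled_rows(data_dir):
    """Return (text, category) rows from <data_dir>/<category>/*.txt.

    The folder layout is an assumption made for this sketch.
    """
    rows = []
    for category_dir in sorted(Path(data_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for article in sorted(category_dir.glob("*.txt")):
            # The folder name doubles as the category label.
            text = article.read_text(encoding="utf-8")
            rows.append((text, category_dir.name))
    return rows
```

With an active SparkSession named `spark`, `spark.createDataFrame(rows, ["text", "category"])` would turn these rows into a DataFrame with a text column and a category column.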
  3. Feature Engineering
  • Tokenizer : Split each article into words using a space delimiter, with pyspark's RegexTokenizer API
  • Stop Words : Removed commonly used words that carry no information about the categories, with pyspark's StopWordsRemover API
  • Count Vectors : Used CountVectorizer to count how often each word occurs in an article (similar to term frequency)
  • String Indexer : Converted every category to a numeric label using pyspark's StringIndexer
  • IDF : Used the IDF API to rescale the count vectors by inverse document frequency, which down-weights words that appear in many articles
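
The stages above can be sketched in two ways. The plain-Python helpers show what each stage computes on a toy corpus; `build_spark_pipeline` shows how the same stages might be chained with `pyspark.ml`. The column names (`text`, `category`, `words`, and so on), the tiny stop-word list, and the alphabetical vocabulary order are assumptions made for illustration; the original script may differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); StopWordsRemover has its own default.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def tokenize(text):
    """Lower-case the text and split it on runs of whitespace."""
    return [tok for tok in re.split(r"\s+", text.lower()) if tok]


def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]


def count_vectors(docs):
    """Return (vocabulary, per-document count vectors).

    Simplification: the vocabulary is sorted alphabetically here.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


def idf_weights(vectors):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    doc_freq = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def index_labels(labels):
    """Map categories to numeric labels, most frequent category first."""
    ordered = sorted(Counter(labels).items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {label: float(i) for i, (label, _) in enumerate(ordered)}
    return [mapping[label] for label in labels]


def build_spark_pipeline(text_col="text", label_col="category"):
    """Chain the same stages with pyspark.ml (needs an active SparkSession)."""
    # Imported lazily so the helpers above run without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (CountVectorizer, IDF, RegexTokenizer,
                                    StopWordsRemover, StringIndexer)
    return Pipeline(stages=[
        RegexTokenizer(inputCol=text_col, outputCol="words", pattern="\\s+"),
        StopWordsRemover(inputCol="words", outputCol="filtered"),
        CountVectorizer(inputCol="filtered", outputCol="rawFeatures"),
        IDF(inputCol="rawFeatures", outputCol="features"),
        StringIndexer(inputCol=label_col, outputCol="label"),
    ])
```

A word that occurs in every article gets an IDF weight of log(1) = 0, so it contributes nothing to the final feature vector, while a word confined to a few articles keeps a positive weight.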
  4. Splitting of Dataframe
    Used pyspark's randomSplit API to split the data into training data (80%) and test data (20%)
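
A plain-Python sketch of a random 80/20 split, in which each row is assigned to one side independently. The function name and the seed value are assumptions for illustration; the original script's seed, if any, is not known.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row independently to the training or the test set."""
    rng = random.Random(seed)  # fixed seed (assumed value) for repeatability
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test
```

Because rows are sampled independently, the split sizes are only approximately 80% and 20%. On a Spark DataFrame `df`, the corresponding call would be `train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)`.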
  5. Multi Class Classification
  • Logistic Regression : Used pyspark's LogisticRegression API to fit an LR model on the training data. The labels of the test data were then predicted using the trained LR model.
[Figure: Logistic Regression classification pipeline]

  • Naive Bayes Classification : Used pyspark's NaiveBayes API to create a Naive Bayes model using the training data. The labels of the test data were predicted using the trained Naive Bayes model.
[Figure: Naive Bayes classification pipeline]

  6. Accuracy
    The accuracy of each classification model was determined using pyspark's MulticlassClassificationEvaluator API, which compares the labels predicted by the model with the test data's labels.
  7. Testing
    A previously unseen set of labelled data was collected and classified using the same steps, and its accuracy was determined.
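
The training and scoring steps can be sketched as follows. `accuracy` is a plain-Python statement of the metric (the fraction of predictions that match the true labels), and `fit_and_score` shows how both models might be fitted and evaluated with `pyspark.ml`. The function names and the column names `features`, `label` and `prediction` are assumptions for illustration, not taken from the original script.

```python
def accuracy(predictions, labels):
    """Fraction of predicted labels that equal the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def fit_and_score(train_df, test_df):
    """Fit both classifiers on train_df and return their accuracy on test_df.

    Expects DataFrames that already carry 'features' and 'label' columns.
    """
    # Imported lazily so accuracy() above runs without Spark installed.
    from pyspark.ml.classification import LogisticRegression, NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    estimators = {
        "Logistic Regression": LogisticRegression(
            featuresCol="features", labelCol="label"),
        "Naive Bayes": NaiveBayes(featuresCol="features", labelCol="label"),
    }
    scores = {}
    for name, estimator in estimators.items():
        model = estimator.fit(train_df)          # train on the 80% split
        predictions = model.transform(test_df)   # adds a 'prediction' column
        scores[name] = evaluator.evaluate(predictions)
    return scores
```

The same `fit_and_score` call could be reused for the testing step by passing the previously unseen labelled data, run through the same feature stages, as `test_df`.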

Output:

  1. Test Data Accuracy
    • Logistic Regression : 81.92%
    • Naive Bayes : 84.7%
  2. Unknown data Accuracy
    • Logistic Regression : 95.44%
    • Naive Bayes : 92.36%

[Figure: screenshot]
