Data Mining

Data recovery and classification techniques using Pandas, scikit-learn, Keras and PyTorch.

This repo contains two projects implemented for the Data Mining course:

  • The first project uses the Red Wine Quality dataset to classify the quality of red wines. Each wine has a set of attributes that can be used to estimate its quality. Quality is rated on a scale from 0 (bad) to 10 (excellent).
  • The second project uses the Onion or not dataset to classify news headlines as fake or legitimate. The goal is to find patterns in each headline, using the words (or tokens) of all the headlines.
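
As a rough illustration of the "words (or tokens)" idea in the second project, the sketch below tokenizes a couple of headlines and builds bag-of-words count vectors over a shared vocabulary. The example headlines, the `tokenize` and `bag_of_words` helpers, and the regular expression are all assumptions made for illustration; they are not taken from the notebooks.

```python
import re
from collections import Counter

# Hypothetical headlines; the real data lives in onion-or-not.csv.
headlines = [
    "Local man wins lottery twice",
    "Man says lottery is rigged",
]

def tokenize(text):
    """Lowercase a headline and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of headlines."""
    tokenized = [tokenize(d) for d in docs]
    # Shared, sorted vocabulary across all headlines.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary word, 0 when the word is absent.
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(headlines)
```

Each headline becomes a fixed-length numeric vector, which is the kind of input a classifier can consume. The actual notebooks may encode the tokens differently.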

The repo is organised as follows:

  • Red Wine Quality Project

    • data-mining-part-a-svm-simple.ipynb: Jupyter notebook that uses an SVM to classify the quality of the wines in the Red Wine dataset
    • data-mining-part-a-svm-with-preprocessing.ipynb: Jupyter notebook that, in addition to classification, preprocesses the data first to help the SVM classifier
    • data-mining-part-a-svm-without-pH.ipynb: Jupyter notebook where a column is dropped first, and then an SVM classifies the quality column as before
    • data-mining-part-a-svm-mean-completion.ipynb: Jupyter notebook where the pH column is distorted. The corrupt cells are then filled in with the column mean in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
    • data-mining-part-a-logistic.ipynb: Jupyter notebook where the pH column is distorted. Logistic regression is then used to predict the values of the corrupt cells in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
    • data-mining-part-a-kmeans.ipynb: Jupyter notebook where the pH column is distorted. The k-means clustering algorithm is then used to fill in the corrupt cells in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
    • winequality-red.csv: The dataset of the project
  • Onion or not Project

    • data-mining-part-b-preprocess-data.ipynb: Jupyter notebook where the data is preprocessed. Techniques such as stemming, stopword removal and tokenization prepare the data and make it compatible with the classifier afterwards
    • Classifiers implemented in two different frameworks for academic purposes:
      • data-mining-part-b-nn-keras.ipynb: Jupyter notebook where a neural network is implemented in Keras to classify the preprocessed data
      • data-mining-part-b-nn-pytorch.ipynb: Jupyter notebook where a neural network is implemented in PyTorch to classify the preprocessed data
    • Due to RAM limitations, the classifiers were re-implemented and merged with the data preprocessing. The file exported after preprocessing was 4 GB, and monitoring memory usage showed that merging the two parts into one made the overall code lighter. The merged files are:
      • combined-v.01-heavy.ipynb: Jupyter notebook where a neural network is implemented in PyTorch to classify the preprocessed data. This network was heavier than needed, which prompted an attempt at a lighter classifier
      • combined-v.01-light.ipynb: Jupyter notebook with a lighter implementation of the neural network
    • onion-or-not.csv: The dataset of the project
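
The mean-completion step described for the Red Wine Quality project can be sketched roughly as follows. This is a minimal illustration, assuming the corrupt pH cells are represented as NaN; the tiny DataFrame and its values are made up, and the way the notebook actually distorts the column is not shown in this README.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for winequality-red.csv (values are made up).
df = pd.DataFrame({
    "pH": [3.0, 3.2, 3.4, 3.6],
    "alcohol": [9.4, 9.8, 10.0, 11.2],
    "quality": [5, 5, 6, 7],
})

# Simulate the distortion by blanking out some pH cells (assumption).
corrupted = df.copy()
corrupted.loc[[1, 3], "pH"] = np.nan

# Mean completion: fill the missing cells with the mean of the surviving values.
ph_mean = corrupted["pH"].mean()  # NaN values are skipped by default
restored = corrupted.copy()
restored["pH"] = restored["pH"].fillna(ph_mean)
```

The restored frame can then be passed to a classifier as usual. Mean filling keeps the column average unchanged but shrinks its variance, which is one reason to compare it against model-based alternatives such as the logistic regression and k-means notebooks.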

The remaining files in the repo mostly serve the developer.

Contributors

  • andreaskaratzas
