
# News-Classification

The dataset is taken from http://mlg.ucd.ie/datasets/bbc.html

Publication - D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

Dataset: BBC. All rights, including copyright, in the content of the original articles are owned by the BBC.

The dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. Class labels: 5 (business, entertainment, politics, sport, tech).

We use Natural Language Processing to read each story and identify the label to which it belongs. Given a news article, the predictive model we build classifies it into a single class label with 97% accuracy.

The 2225 text documents are separated into 5 labelled folders, one per class label:

  • Business – 510 files
  • Entertainment – 386 files
  • Politics – 417 files
  • Sport – 511 files
  • Tech – 401 files

We use the os module in Python, which provides a portable way of using operating-system-dependent functionality. Using a for loop and the os module, we iterate through each text file, read it, and append its text to a list variable X. At the same time, we append the corresponding folder name to a list variable Y.
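The loop described above could look roughly like the sketch below. The function name `load_corpus`, the `.txt` filter, and the encoding settings are assumptions for illustration, not details taken from the original code.

```python
import os

def load_corpus(root_dir):
    """Read every .txt file under root_dir and return (texts, labels)."""
    X, Y = [], []
    for label in sorted(os.listdir(root_dir)):
        folder = os.path.join(root_dir, label)
        if not os.path.isdir(folder):
            continue  # skip stray files that sit next to the label folders
        for name in sorted(os.listdir(folder)):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(folder, name)
            # errors="replace" is an assumption: it avoids crashing on odd bytes
            with open(path, encoding="utf-8", errors="replace") as fh:
                X.append(fh.read())
            Y.append(label)  # the folder name doubles as the class label
    return X, Y
```

Calling `load_corpus("bbc")` (a hypothetical path to the extracted dataset) would return two lists of equal length, one holding article text and one holding folder names.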

Using pandas, we create a data frame from the lists X and Y and write the data to a file, which is then used to build our model.
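A minimal sketch of this step follows. The column names `news` and `type`, the helper `save_frame`, and the choice of CSV as the file format are assumptions; the source only says the data is written to a file.

```python
import pandas as pd

# Toy stand-ins for the lists built while reading the folders.
X = ["Shares rose sharply today.", "The team won the cup final."]
Y = ["business", "sport"]

# Column names "news" and "type" are assumptions for illustration.
df = pd.DataFrame({"news": X, "type": Y})

def save_frame(frame, path):
    """Write the frame to CSV without the numeric index column."""
    frame.to_csv(path, index=False)
```

For example, `save_frame(df, "bbc_news.csv")` would write the two toy rows to a hypothetically named file that `pd.read_csv` can load again later.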

Upon exploring the data, we see that there are duplicate entries. We drop the duplicates to get 2127 unique rows. Upon examining the article length, we see that most articles have about 2275 characters. The longest article has 25,600 characters; it is labelled as politics and is about terrorism.

Plotting the lengths on a histogram suggests that article length differs by type of news.
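The deduplication and length checks can be sketched with pandas as below. The toy rows and the column names `news`, `type`, and `length` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "news": ["Shares rose sharply.", "Shares rose sharply.", "The team won."],
    "type": ["business", "business", "sport"],
})

# Drop exact duplicate rows and renumber the index.
unique = df.drop_duplicates().reset_index(drop=True)

# Character count of each article.
unique["length"] = unique["news"].str.len()

# Row with the longest article, and the mean length per news type.
longest = unique.loc[unique["length"].idxmax()]
mean_by_type = unique.groupby("type")["length"].mean()
```

With matplotlib installed, `unique.hist(column="length", by="type")` would draw one length histogram per news type.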

We first clean each string by passing it through a function that removes numbers and special characters and returns the string in lower case.
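One possible shape for such a cleaning function is sketched below. The name `clean_text` and the exact regular expressions are assumptions; note that this version also drops non-ASCII letters, which may or may not match the original function.

```python
import re

def clean_text(text):
    """Lower-case the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    # Replace digits, punctuation and other symbols with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Sales rose 12% in 2004!"))  # prints "sales rose in"
```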

We also remove stop words such as ‘is’, ‘are’ and ‘of’ using the stop word list from the nltk module.
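Stop word removal can be sketched as follows. To keep the example runnable without any downloads, it uses a tiny hand-written set as a stand-in; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` would supply the full list. The function name `remove_stop_words` is hypothetical.

```python
def remove_stop_words(text, stop_words):
    """Return text with every whitespace-separated stop word removed."""
    return " ".join(word for word in text.split() if word not in stop_words)

# Tiny illustrative stand-in for the NLTK English stop word list.
STOP_WORDS = {"is", "are", "of", "the", "a", "in"}

print(remove_stop_words("the price of oil is rising", STOP_WORDS))
# prints "price oil rising"
```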

We use TfidfVectorizer from scikit-learn to extract features from the text data.

TF-IDF is an abbreviation for Term Frequency - Inverse Document Frequency. In simple terms, it gives each word a numerical value (weight) that rises with how often the word appears in a document and falls with how many documents in the collection contain that word.

We are able to extract 14,788 features from the dataset.
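To make the weighting concrete, here is a hand-rolled sketch of the textbook formula tf × ln(N / df). It is not what scikit-learn computes exactly: TfidfVectorizer by default uses a smoothed IDF and L2-normalises each document vector. The function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

docs = ["market shares rise", "team wins match", "market match report"]
w = tf_idf(docs)
```

In the first toy document, "shares" appears in only one document while "market" appears in two, so `w[0]["shares"]` is larger than `w[0]["market"]`.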

We split the data into the following sets:

  1. Training set – 1,780 news articles
  2. Test set – 445 news articles
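The sizes above correspond to an 80/20 split of 2,225 documents. A shuffled split of that shape can be sketched in plain Python as below; the function name, the seed, and the absence of stratification are assumptions, and scikit-learn's `train_test_split` offers the same idea with more options.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(2225)
print(len(train_idx), len(test_idx))  # prints "1780 445"
```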

We train the following classification models:

  1. Naïve Bayes Classifier
  2. Decision Tree Classifier
  3. Random Forest Classifier
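Training the three model families on TF-IDF features might look like the scikit-learn sketch below. The toy corpus, the choice of `MultinomialNB` as the Naïve Bayes variant, and all hyperparameters are assumptions; the source does not state which settings were used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data standing in for the cleaned BBC articles.
train_texts = [
    "shares profit market bank",
    "market growth profit economy",
    "match goal team league",
    "team coach match win",
]
train_labels = ["business", "business", "sport", "sport"]

# Each pipeline vectorises the text with TF-IDF, then fits a classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "decision_tree": make_pipeline(
        TfidfVectorizer(), DecisionTreeClassifier(random_state=0)),
    "random_forest": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=50, random_state=0)),
}

new_texts = ["bank profit growth", "league goal win"]
predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(new_texts)
```

Wrapping the vectoriser and the classifier in one pipeline keeps the vocabulary learned from the training texts tied to the model, so new articles are transformed the same way at prediction time.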

We evaluate the models based on the following metrics:

  1. Confusion matrix
  2. Classification report
  3. Kappa
  4. Accuracy
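To show what these metrics measure, the sketch below computes accuracy, a confusion matrix, and kappa by hand on toy labels. It assumes "Kappa" means Cohen's kappa; the function names are hypothetical, and `sklearn.metrics` provides `accuracy_score`, `confusion_matrix`, `classification_report`, and `cohen_kappa_score` for real use.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

y_true = ["sport", "sport", "tech", "tech"]
y_pred = ["sport", "tech", "tech", "tech"]
print(accuracy(y_true, y_pred))                             # prints 0.75
print(confusion_matrix(y_true, y_pred, ["sport", "tech"]))  # prints [[1, 1], [0, 2]]
print(cohen_kappa(y_true, y_pred))                          # prints 0.5
```

Kappa is lower than accuracy here because half of the agreement would be expected by chance given how often each label occurs.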

Of the 3 models trained, Naïve Bayes performed the best, with an accuracy and F1 score of 0.97, marginally better than the Random Forest Classifier.

The Naïve Bayes Classifier is our selection for the final model.

The model is able to identify the type of news based on its content with 97% accuracy.

The model could be retrained using K-fold cross-validation, which ensures that the training data is selected at random and reduces the scope for overfitting the model. However, with accuracy and precision as high as 97%, this is not necessary for this model.
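For reference, K-fold cross-validation partitions the shuffled data into K folds and uses each fold once as the test set while training on the rest. The generator below is a minimal sketch of that index bookkeeping; its name, the default of 5 folds, and the seed are assumptions, and scikit-learn's `KFold` and `cross_val_score` cover the same ground.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, k=5))
```

Each sample lands in exactly one test fold, so averaging a metric over the K folds uses every document for evaluation once.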
