Gender Classification of Blog Authors

This repository contains the complete source code for our paper, Gender Classification of Blog Authors: With Feature Engineering and Deep Learning using LSTM Networks [1].

Prerequisites:

  • NLTK 3.2.2
  • scikit-learn 0.18.1
  • Keras 2.0.6 (TensorFlow backend, version 1.0.1)

We report our results on the data set originally used by Mukherjee and Liu [2], as well as on The Blog Authorship Corpus.

Feature Extraction:

The input to each of the modules listed below is the blog post after minimal preprocessing (removal of stopwords and HTML tags):

Newly Added Feature Classes:

  • mineCharPats.py mines the Character Sequence Pattern Features.
  • wordClassFeatures.py mines the word class factors along with the 13 new word classes that we propose.
  • baseFeatures.py computes all the surface features we use: normalized counts of sentences, words, characters, alphabetic characters, digits, special characters, and punctuation marks; the count of short words (< 4 characters); and the average word length.
  • sentiWordNet.py measures the average sentiment score based on the SentiWordNet 3.0 lexical resource.
  • yuleK.py measures the lexical richness of the blog based on Yule's K index.
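
As an illustration of the lexical richness feature, here is a minimal sketch of Yule's K using its standard definition, K = 10^4 · (Σ i²·V_i − N) / N², where N is the number of tokens and V_i is the number of distinct words that occur exactly i times. This is not the code in yuleK.py: the function name `yules_k` and the regex tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def yules_k(text):
    """Return Yule's K for `text`; larger values indicate a more repetitive vocabulary."""
    # Assumed tokenizer: runs of lowercase letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        return 0.0  # convention chosen for this sketch: empty text scores 0

    word_freqs = Counter(tokens)                   # word -> number of occurrences
    freq_spectrum = Counter(word_freqs.values())   # i -> V_i (words seen exactly i times)

    # Second moment of the frequency spectrum: sum over i of i^2 * V_i.
    s2 = sum(i * i * v_i for i, v_i in freq_spectrum.items())
    return 1e4 * (s2 - n) / (n * n)
```

A text in which every word is distinct scores 0, and the score grows as words are repeated, so a lower K corresponds to a richer vocabulary.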

Re-implemented Feature Classes:

  • minePOSPats.py mines variable-length POS sequence patterns based on user-specified minimum support and minimum adherence thresholds. Before running this file, compute the POS probabilities of the words with probOfPOS.py.
  • FMeasure.py measures the text’s relative contextuality (implicitness), as opposed to the formality (explicitness).
  • genderPreferentialFeatures.py gives a measure of 10 distinguishing word endings.
  • get_CBOW_features.py extracts Continuous Bag-of-Words (CBOW) features from the text. However, these did not substantially improve the accuracy of the model.
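
To make the contextuality/formality score concrete, the sketch below assumes that FMeasure.py follows the widely used Heylighen and Dewaele formulation, F = (noun% + adjective% + preposition% + article% − pronoun% − verb% − adverb% − interjection% + 100) / 2. The function name `f_measure`, the mapping from Penn Treebank tag prefixes to these categories, and the pre-tagged input format are all assumptions made for the example, not taken from the repository.

```python
# Assumed Penn Treebank tag prefixes for each side of the formula.
FORMAL_PREFIXES = ("NN", "JJ", "IN", "DT")        # nouns, adjectives, prepositions, articles
CONTEXTUAL_PREFIXES = ("PRP", "VB", "RB", "UH")   # pronouns, verbs, adverbs, interjections


def f_measure(tagged_tokens):
    """Return the F-measure (0 to 100) for a list of (word, pos_tag) pairs."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("need at least one tagged token")

    formal = sum(1 for _, tag in tagged_tokens if tag.startswith(FORMAL_PREFIXES))
    contextual = sum(1 for _, tag in tagged_tokens if tag.startswith(CONTEXTUAL_PREFIXES))

    # Category frequencies are percentages of all tokens in the text.
    return (100.0 * (formal - contextual) / total + 100.0) / 2.0


# Usage: tags would normally come from a POS tagger such as nltk.pos_tag.
sample = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
score = f_measure(sample)
```

Higher scores indicate more formal (explicit) text, and lower scores indicate more contextual (implicit) text.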

Classification Algorithms:

  • first_approach.py implements a voting ensemble of machine learning classifiers trained on the features extracted above.
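
The general idea of a voting ensemble can be sketched with scikit-learn's `VotingClassifier`. This is a hedged illustration only: the choice of base classifiers, the hard-voting setting, and the synthetic data standing in for the extracted feature matrix are assumptions, not the configuration used in first_approach.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the feature matrix (one row per blog post)
# and the binary gender labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed base classifiers; each one casts a single vote per post.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # majority vote over the predicted class labels
)
voter.fit(X_train, y_train)

predictions = voter.predict(X_test)
accuracy = voter.score(X_test, y_test)
```

With `voting="soft"`, the ensemble would instead average the predicted class probabilities, which requires every base classifier to implement `predict_proba`.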

References

[1] S. Jha, V. P. Dwivedi, D. K. Singh, and Ranvijay, “Gender classification of blog authors: With feature engineering and deep learning using LSTM networks,” in Proceedings of the Ninth International Conference on Advanced Computing (ICoAC-2018).

[2] A. Mukherjee and B. Liu, “Improving gender classification of blog authors,” in EMNLP, 2010.
