Text Preprocessing in Neural Text Classification

Jose Camacho Collados and Mohammad Taher Pilehvar

This repository contains the pre-trained word embeddings and preprocessed text classification datasets for the paper "On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis".

Pre-trained word embeddings

We release the 300-dimensional word embeddings used in our experiments as binary (.bin) files. Each set of embeddings was trained on the UMBC corpus after applying one of the following preprocessing techniques:

  • Vanilla (simple tokenization): Download here [~1.8 GB]
  • Lowercased: Download here [~1.6 GB]
  • Lemmatized: Download here [~1.7 GB]
  • Multiword-grouped: Download here [~2.1 GB]
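As a rough sketch of how one of these files might be read, the snippet below parses the word2vec binary layout using only the Python standard library. This rests on an assumption: the README only says the files are binary .bin files, so the word2vec layout (a text header with vocabulary size and dimension, then each word followed by little-endian float32 values) is a guess that should be checked against the downloaded file. The function name `load_word2vec_bin` and its `limit` parameter are hypothetical, introduced here for illustration.

```python
import struct
from typing import BinaryIO, Dict, List, Optional


def _read_until(f: BinaryIO, stop: bytes) -> bytes:
    """Read bytes up to (not including) the single-byte delimiter `stop`."""
    out = bytearray()
    while True:
        ch = f.read(1)
        if not ch:
            raise EOFError("unexpected end of file")
        if ch == stop:
            return bytes(out)
        out += ch


def load_word2vec_bin(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Load vectors from a file ASSUMED to be in word2vec binary format.

    `limit` (hypothetical convenience parameter) caps the number of words
    read, which helps when inspecting a multi-gigabyte file.
    """
    vectors: Dict[str, List[float]] = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dimension>\n"
        vocab_size, dim = map(int, _read_until(f, b"\n").split())
        n = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(n):
            # Each entry is "<word> " followed by `dim` float32 values; the
            # previous entry may end with a newline, so strip it from the word.
            raw_word = _read_until(f, b" ").lstrip(b"\n")
            word = raw_word.decode("utf-8", errors="replace")
            vectors[word] = list(struct.unpack(f"<{dim}f", f.read(4 * dim)))
    return vectors
```

Calling `load_word2vec_bin(path, limit=1000)` on a downloaded file and checking that every vector has 300 values is a quick way to confirm whether the format assumption holds. For full-size files, a dedicated library reader is likely to be much faster than this pure-Python loop.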

Preprocessed datasets

We also release the text categorization and sentiment analysis datasets already preprocessed:

  • Text categorization: Available here
  • Sentiment analysis: Available here

Note 1: If you use any of these datasets, please acknowledge the original sources (you can find them in the reference paper).
Note 2: In each class file in the dataset directories, each line corresponds to one instance of the corpus: a phrase, sentence or document, depending on the dataset.
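The one-instance-per-line layout described in Note 2 could be read along these lines. This is a minimal sketch under several assumptions not stated above: the directory holds one file per class directly (no nested train/test folders), the file name without its extension serves as the label, and the text is UTF-8. The function name `iter_instances` is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_instances(dataset_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs from a directory of class files.

    ASSUMPTIONS: one file per class directly inside `dataset_dir`, the file
    stem is the class label, and files are UTF-8 encoded.
    """
    for class_file in sorted(Path(dataset_dir).iterdir()):
        if not class_file.is_file():
            continue  # skip subdirectories such as possible train/test splits
        with class_file.open(encoding="utf-8", errors="replace") as f:
            for line in f:
                text = line.rstrip("\n")
                if text:  # ignore blank lines
                    yield class_file.stem, text
```

Collecting the pairs with `list(iter_instances(some_dir))` gives a simple labelled corpus; if a dataset ships separate training and test directories, the function would be called once per directory.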

Code

The code to run our experiments is available in the following complementary repository: https://github.com/pilehvar/sensecnn

Reference paper

If you use any of these resources, please cite the following paper:

@InProceedings{camacho:preprocessing2018,
  author = 	"Camacho-Collados, Jose and Pilehvar, Mohammad Taher",
  title = 	"On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis",
  booktitle = 	"Proceedings of the EMNLP Workshop on Analyzing and interpreting neural networks for NLP",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  location = 	"Brussels, Belgium"
}
