LABR: A Large-Scale Arabic Book Reviews Dataset

This dataset contains over 63,000 book reviews in Arabic. It is the largest sentiment analysis dataset for Arabic to date. The book reviews were harvested from the website Goodreads during the month of March 2013. Each book review comes with the Goodreads review id, the user id, the book id, the rating (1 to 5) and the text of the review.

Contents:

  • README.txt: this file

  • data/labr_data

    - reviews.tsv: a tab-separated file containing the "cleaned up" reviews. It contains over 63,000 reviews. The format is:
                   
                   rating<TAB>review id<TAB>user id<TAB>book id<TAB>review
                   
      where:
    
                   rating: the user rating on a scale of 1 to 5
                   review id: the goodreads.com review id
                   user id: the goodreads.com user id
                   book id: the goodreads.com book id
                   review: the text of the review
    
    - 2class-balanced-train/test.txt: text files containing the indices of the
                   reviews (from the reviews.tsv file) that are in the
                   training/test sets. Balanced means the number of reviews in
                   the positive and negative classes is equal. The ratings are
                   converted into positive (ratings 4 and 5) and negative
                   (ratings 1 and 2); rating 3 is ignored.
                   
    - 2class-unbalanced-train/test.txt: the same, but the sizes of the classes
                   are not equal.
                   
    
    - 3class-balanced/unbalanced-train/test/validation.txt: the same, but for 3 classes 
                   instead of just 2.
    
    - 5class-balanced/unbalanced-train/test.txt: the same, but for 5 classes 
                   instead of just 2.
    
  • data/labr_lexicon

    • POS.txt: a file containing positive phrases generated from LABR.
    • NEGATIVE.txt: a file containing negative phrases generated from LABR.
    • Neg.txt: a file containing some negation operators generated from LABR.
  • data/dr_samha_lex

    • POS.txt: a file containing positive phrases generated by El-Beltagy, Samhaa R and Ali, Ahmed, "Open issues in the sentiment analysis of Arabic social media: A case study" (2013), 215--220.
    • NEGATIVE.txt: a file containing negative phrases generated by El-Beltagy, Samhaa R and Ali, Ahmed, "Open issues in the sentiment analysis of Arabic social media: A case study" (2013), 215--220.
  • python/

    • labr.py: the main interface to the dataset. Contains functions that can read/write training and test sets.

    • experiments.py: a Python script containing the code used to generate the experiments of http://arxiv.org/abs/1411.6718

    • Definations.py: a Python file containing the definitions of the classifiers and feature generators used.

    • Utilities.py: a Python file containing some reading functions and classifier performance measure functions.
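
As a rough illustration of the reviews.tsv layout and the 2-class rating conversion described above, the following Python sketch parses one line in the rating<TAB>review id<TAB>user id<TAB>book id<TAB>review format and maps the rating to a class. It is independent of labr.py: the names Review, parse_review_line and to_two_class are hypothetical, and the sample line is made up rather than taken from the dataset.

```python
from typing import NamedTuple, Optional


class Review(NamedTuple):
    rating: int
    review_id: str
    user_id: str
    book_id: str
    text: str


def parse_review_line(line: str) -> Review:
    """Split one reviews.tsv line into its five tab-separated fields."""
    # maxsplit=4 keeps any stray tabs inside the review text intact
    rating, review_id, user_id, book_id, text = line.rstrip("\n").split("\t", 4)
    return Review(int(rating), review_id, user_id, book_id, text)


def to_two_class(rating: int) -> Optional[str]:
    """Map a 1-5 rating to 'positive', 'negative', or None (rating 3 is ignored)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


# Made-up sample line (hypothetical ids), not taken from the dataset
sample = "4\t111\t222\t333\tكتاب رائع"
review = parse_review_line(sample)
print(review.rating, to_two_class(review.rating))
```

When reading the real file, the text encoding has to be chosen by the caller; UTF-8 is a reasonable first guess for Arabic text but is an assumption here.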

Demo

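To replicate the splits with different test/train/validation percentages:

# assumes the script is run from the python/ directory, where labr.py lives
from labr import LABR

l = LABR()

(rating, a, b, c, body) = l.read_clean_reviews()

# percent_test and percent_valid are chosen by the caller
l.split_train_validation_test_3class(rating, percent_test, percent_valid, balanced="unbalanced")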

To try a new classifier, add it to the "classifiers" list in Definations.py, then run experiments.py.
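
The balanced argument of the split call refers to splits in which every class has the same number of reviews. One common way to get there is to downsample each class to the size of the smallest one. The sketch below illustrates that idea in plain Python; it is not the code in labr.py, and balance_indices and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def balance_indices(labels, seed=0):
    """Return sorted indices such that every label occurs equally often."""
    # Group the positions of the reviews by their class label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Every class is cut down to the size of the smallest class
    smallest = min(len(indices) for indices in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    kept = []
    for indices in by_label.values():
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)


# Toy labels (made up): 1 = negative, 2 = neutral, 3 = positive
labels = [3, 3, 3, 3, 1, 2, 2, 1, 3]
balanced = balance_indices(labels)
print(len(balanced))
```

Downsampling discards reviews from the larger classes, which is the trade-off against the unbalanced splits that keep every review.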

Reference

Please cite this paper for any usage of the dataset:

Mohamed Aly and Amir Atiya. LABR: Large-scale Arabic Book Reviews Dataset. Association for Computational Linguistics (ACL), Bulgaria, August 2013.
