DBRD: Dutch Book Reviews Dataset

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly influenced by the Large Movie Review Dataset and intended as a benchmark for sentiment classification in Dutch. The scripts that were used to scrape the reviews from Hebban can be found in the DBRD GitHub repository.

Dataset

Downloads

The dataset is ~79MB compressed and can be downloaded from here:

Dutch Book Reviews Dataset

A language model trained with FastAI on Dutch Wikipedia can be downloaded from here:

Dutch language model trained on Wikipedia

Overview

Directory structure

The dataset includes three folders with data: test (test split), train (train split) and unsup (remaining reviews). Each review has a unique identifier, and both the identifier and the rating can be read from the filename, which follows the pattern [ID]_[RATING].txt. This differs from the Large Movie Review Dataset, where each file in a directory has a unique ID, but IDs are reused between folders.

Line N of urls.txt contains the Hebban URL of the book review with ID N. For example, the URL of the book review in 48091_5.txt can be found on line 48091 of urls.txt. It cannot be guaranteed that these pages still exist.
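The naming scheme above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the repository: the names parse_review_filename and url_for_review are hypothetical, and it assumes that line numbers in urls.txt are counted from 1.

```python
from pathlib import Path
from typing import NamedTuple, Sequence


class ReviewFile(NamedTuple):
    review_id: int
    rating: int


def parse_review_filename(path: str) -> ReviewFile:
    """Split a name such as '48091_5.txt' into its ID and rating."""
    stem = Path(path).stem  # e.g. '48091_5'
    review_id, rating = stem.split("_")
    return ReviewFile(int(review_id), int(rating))


def url_for_review(review_id: int, urls: Sequence[str]) -> str:
    """Return the URL for a review ID.

    Assumption: line N of urls.txt belongs to review N, with lines
    counted from 1, so the list index is review_id - 1.
    """
    return urls[review_id - 1]
```

With the lines of urls.txt read into a list, url_for_review(parse_review_filename("48091_5.txt").review_id, urls) would return the entry on line 48091.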

.
├── README.md     // the file you're reading
├── test          // balanced 10% test split
│   ├── neg
│   └── pos
├── train         // balanced 90% train split
│   ├── neg
│   └── pos
├── unsup         // unbalanced positive and neutral
└── urls.txt      // urls to reviews on Hebban
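Given this layout, the labeled splits can be read by walking the neg and pos folders. The sketch below is one possible way to do that, not code from the repository: iter_labeled_reviews is a hypothetical helper, the 0/1 label encoding is an arbitrary choice, and UTF-8 is an assumed file encoding.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_labeled_reviews(root: str, split: str) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs for 'train' or 'test'.

    Labels are an illustrative choice: 0 for neg, 1 for pos.
    """
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort for a reproducible order across runs and platforms.
        for path in sorted(folder.glob("*.txt")):
            # Assumption: the review files are UTF-8 encoded.
            yield path.read_text(encoding="utf-8"), label
```

The unsup folder has no neg/pos subfolders in the tree above, so it would need a separate loop over its files.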

Size

  #all:           118516 (= #supervised + #unsupervised)
  #supervised:     22252 (= #training + #testing)
  #unsupervised:   96264
  #training:       20028
  #testing:         2224
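The counts above are internally consistent, which a short check confirms. The snippet only restates the numbers from the table; the dictionary keys are illustrative names.

```python
# Counts copied from the size table above.
sizes = {
    "all": 118516,
    "supervised": 22252,
    "unsupervised": 96264,
    "training": 20028,
    "testing": 2224,
}

# The two identities stated in the table.
assert sizes["all"] == sizes["supervised"] + sizes["unsupervised"]
assert sizes["supervised"] == sizes["training"] + sizes["testing"]

# Share of the supervised reviews that ends up in the test split.
test_fraction = sizes["testing"] / sizes["supervised"]
print(f"test fraction: {test_fraction:.3f}")
```

The test fraction comes out just under 0.1, matching the 10% test split described in the directory structure.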

Labels

Distribution of labels positive/negative/neutral in rounded percentages.

  training: 50/50/ 0
  test:     50/50/ 0
  unsup:    72/ 0/28

Train and test sets are balanced and contain no neutral reviews (for which rating==3).
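A rating can be mapped to a sentiment label along these lines. Only the neutral case (rating==3) is stated above; treating ratings above 3 as positive and below 3 as negative is an assumption, and sentiment_from_rating is a hypothetical name.

```python
def sentiment_from_rating(rating: int) -> str:
    """Map a review rating to 'pos', 'neg' or 'neutral'.

    rating == 3 is neutral, as stated in the README. The thresholds
    for positive (> 3) and negative (< 3) are an assumption.
    """
    if rating == 3:
        return "neutral"
    return "pos" if rating > 3 else "neg"
```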

Reproduce data

Since scraping Hebban puts load on their servers, it's best to download the prepared dataset instead. This also ensures that your results can be compared to those of others. The scripts and instructions are mostly intended as a starting point for building a scraper for another website.

Install dependencies

ChromeDriver

I'm making use of Selenium to automate user actions such as clicks. This library requires a browser driver that provides the rendering backend. I've made use of ChromeDriver.

macOS

If you're on macOS and you have Homebrew installed, you can install ChromeDriver by running:

brew install chromedriver

Other OSes

You can download ChromeDriver from the official download page.

Python

The scripts are written for Python 3. To install the Python dependencies, run:

pip3 install -r ./requirements.txt

Run

Three scripts are provided that can be run in sequence. You can also run run.sh to run all scripts with defaults.

Gather URLs

The first step is to gather all review URLs from Hebban. Run gather_urls.py to fetch them and save them to a text file.

Usage: gather_urls.py [OPTIONS] OUTFILE

  This script gathers review urls from Hebban and writes them to OUTFILE.

Options:
  --offset INTEGER  Review offset.
  --step INTEGER    Number of review urls to fetch per request.
  --help            Show this message and exit.

Scrape URLs

The second step is to scrape the URLs for review data. Run scrape_reviews.py to iterate over the review URLs and save the scraped data to a JSON file.

Usage: scrape_reviews.py [OPTIONS] INFILE OUTFILE

  Iterate over review urls in INFILE text file, scrape review data and
  output to OUTFILE.

Options:
  --encoding TEXT   Output file encoding.
  --indent INTEGER  Output JSON file with scraped data.
  --help            Show this message and exit.

Post-process

The third and final step is to prepare the dataset using the scraped reviews. By default, we limit the number of reviews to 110k, filter out some reviews, and prepare train and test sets containing 90% and 10% of the total, respectively.

Usage: post_process.py [OPTIONS] INFILE OUTDIR

Options:
  --encoding TEXT              Input file encoding
  --keep-incorrect-date TEXT   Whether to keep reviews with invalid dates.
  --sort TEXT                  Whether to sort reviews by date.
  --maximum INTEGER            Maximum number of reviews in output
  --valid-size-fraction FLOAT  Fraction of total to set aside as validation.
  --shuffle TEXT               Shuffle data before saving.
  --help                       Show this message and exit.

Changelog

v3: Changed name of the dataset from 110kDBRD to DBRD. The dataset itself remains unchanged.

v2: Removed advertisements from reviews and increased dataset size to 118,516.

v1: Initial release

Citation

Please use the following citation when making use of this dataset in your work.

@article{DBLP:journals/corr/abs-1910-00896,
  author    = {Benjamin van der Burgh and
               Suzan Verberne},
  title     = {The merits of Universal Language Model Fine-tuning for Small Datasets
               - a case with Dutch book reviews},
  journal   = {CoRR},
  volume    = {abs/1910.00896},
  year      = {2019},
  url       = {http://arxiv.org/abs/1910.00896},
  archivePrefix = {arXiv},
  eprint    = {1910.00896},
  timestamp = {Fri, 04 Oct 2019 12:28:06 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1910-00896.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Acknowledgements

This dataset was created for testing out the ULMFiT (by Jeremy Howard and Sebastian Ruder) deep learning algorithm for text classification. It is implemented in the FastAI Python library, which has taught me a lot. I'd also like to thank Timo Block for making his 10kGNAD dataset publicly available and giving me a starting point for this dataset. The dataset structure is based on the Large Movie Review Dataset by Andrew L. Maas et al. Thanks to Andreas van Cranenburg for pointing out a problem with the dataset.

And of course I'd like to thank all the reviewers on Hebban for having taken the time to write all these reviews. You've made both book enthusiasts and NLP researchers very happy :)

License

All code in this repository is licensed under an MIT License.

The dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

dbrd's Issues

English reviews and advertisements in reviews

Hi there,
I used your dataset in a class I teach for a topic modeling assignment. Thanks for making it available! Students discovered the above mentioned issues in the data. If you make another version, you could take this into account.

The advertisements contain fixed phrases, but most of it is different each time. There are a lot of these in the dataset, so it causes a lot of noise. The English reviews are less frequent and should be easy to filter out with a language identification model, such as from fastText.

Example of advertisement:
jannie52-over-de-man-die-de-draak-doodde
Waarschijnlijk heb ik een teer zieltje. Ik heb me geërgerd aan het taalgebruik. Ik begrijp niet wat de toegevoegde waarde is van de seksistische en discriminerende taal. Het plot vond ik niet sterk. Eigenlijk een vervelend boek om te lezen. 'Het is autobiografisch, helemaal waargebeurd maar toch zie je elementen van fictie in de stijl en vooral de opbouw, dat maakt het des te sterker.' - Win boeken voor je hele leesclub! We gaan Wil van Jeroen Olyslaegers luisteren via de gratis Hebban Luisterboeken-app. Doe je mee? 'Wat beweegt de jonge zwarte deelpachter Tucker Caliban om huis, vee en akkers te vernietigen en met vrouw en kind naar het Noorden te vertrekken?'- Win Uit de maat voor je hele leesgroep!

Example of English review:
anca-over-house-of-leaves
WARNING: REVIEW CONTAINS SPOILERS ABOUT THE ENDING! So, technically I’m not completely finished with this book. I still have the exhibits, annexes and appendixes to go, but I’m finished with the main story of this book and to be honest, just want to be done, get this book of my currently reading shelve [...]

Duplicate example in train and test set

Hi

I was doing some sanity checking and found a duplicate item in the train and test set:

  • DBRD/train/neg/2074_2.txt
  • DBRD/test/neg/20602_2.txt

Content-wise they are identical, with the only difference being that the file in the train set has more newlines. But we filter out these new lines anyway during the training of our models (or at least I do and replace them with single spaces).

This seems important enough to justify a revised version 3.1 in which the duplicate is removed, as it impacts model training. Together with language filtering (#2), this might even warrant a v4. Alternatively, I can make a fork and rework the whole thing, with acknowledgments to this repo of course.
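A check of the kind described in this issue could look like the sketch below. It is not the reporter's actual script: normalize and find_duplicates are hypothetical names, and comparing SHA-1 digests of whitespace-normalized text is just one way to find reviews that differ only in newlines.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalize(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(reviews: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group review names whose normalized text is identical.

    `reviews` yields (name, text) pairs; the name can be a file path.
    Only groups with more than one member are returned.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, text in reviews:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Feeding it the (path, text) pairs of both the train and test folders would surface any review that appears in both splits.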
