WikiPlots

The WikiPlots corpus is a collection of 112,936 story plots extracted from English-language Wikipedia. A story is extracted from any English-language article that has a sub-header containing the word "plot" (e.g., "Plot", "Plot Summary", etc.).

This repository contains code and instructions for recreating the WikiPlots corpus.

The dataset itself can be downloaded from here: plots.zip. The zip file contains two files:

  • plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by <EOS> on a line by itself.
  • titles: a text file containing a list of titles for each article in which a story plot was found and extracted.
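
The layout described above can be read back with a few lines of Python. This is a minimal sketch, not part of the repository: the function names `iter_stories` and `load_corpus` are hypothetical, UTF-8 encoding is assumed, and it is assumed that the i-th line of titles corresponds to the i-th story in plots.

```python
from typing import Iterable, Iterator, List, Tuple

EOS = "<EOS>"  # marker line that follows each story in the plots file


def iter_stories(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each story as a list of sentences, splitting on <EOS> lines."""
    story: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence == EOS:
            yield story
            story = []
        elif sentence:
            story.append(sentence)
    if story:  # tolerate a final story with no trailing <EOS>
        yield story


def load_corpus(plots_path: str, titles_path: str) -> List[Tuple[str, List[str]]]:
    """Pair titles with stories (assumes both files are in the same order)."""
    with open(titles_path, encoding="utf-8") as f:
        titles = [line.strip() for line in f if line.strip()]
    with open(plots_path, encoding="utf-8") as f:
        stories = list(iter_stories(f))
    return list(zip(titles, stories))
```

Calling `load_corpus("plots", "titles")` on the two unzipped files would return a list of (title, sentences) pairs. If the two files turn out not to be aligned one-to-one, `zip` silently truncates to the shorter list, so comparing the two lengths first is a reasonable check.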

Using the code to recreate the corpus

I have also included the Python script used to extract the story plots.

wikiPlots.py requires:

  • wikiextractor
  • the BeautifulSoup4 Python package

To use wikiPlots.py:

  1. Download an English Wikipedia dump. From this link you will find a file named something like "enwiki-20170401-pages-articles-multistream.xml.bz2". Make sure you download the .bz2 file that is not the index file.
  2. Decompress the .bz2 file to extract the .xml file.
  3. Download wikiextractor. You do not need to set it up. Run it as follows:

python wikiextractor.py -o output_directory --json --html -s enwiki-...xml

You must run wikiextractor.py with these parameters. wikiPlots.py requires JSON files with nested HTML and with section header information preserved. Wikiextractor will produce a number of subfolders named "AA", "AB", "AC"... Within each folder will be files named wiki_xx, each containing a number of JSON records, one per article.
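
To make the output layout concrete, the sketch below walks the "AA", "AB", ... folders and reports which articles have a header mentioning "plot". It is an illustration only and is not how wikiPlots.py works (the script uses BeautifulSoup4; a regular expression is swapped in here to stay within the standard library). The function names are hypothetical, and it assumes that each record sits on its own line with "title" and "text" fields and that section headers appear in the text as `<h2>` to `<h6>` tags.

```python
import json
import re
from pathlib import Path
from typing import Dict, Iterator, List

# Assumption: headers survive in the "text" field as <h2>...</h2> up to <h6>.
HEADER_RE = re.compile(r"<h([2-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def iter_records(dump_dir: str) -> Iterator[Dict]:
    """Yield one dict per article from every wiki_xx file under dump_dir."""
    for path in sorted(Path(dump_dir).glob("*/wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # assumed: one JSON record per non-empty line
                    yield json.loads(line)


def plot_headers(html: str) -> List[str]:
    """Return the text of every header that contains 'plot' (case-insensitive)."""
    headers = []
    for _level, inner in HEADER_RE.findall(html):
        text = TAG_RE.sub("", inner).strip()  # drop any tags nested in the header
        if "plot" in text.lower():
            headers.append(text)
    return headers


def articles_with_plots(dump_dir: str) -> Iterator[str]:
    """Yield titles of articles that have at least one plot-like header."""
    for record in iter_records(dump_dir):
        if plot_headers(record.get("text", "")):
            yield record.get("title", "")
```

For example, `list(articles_with_plots("output_directory"))` would list candidate titles from the wikiextractor output directory. A plain substring test also matches words such as "Plotting", so a stricter word-boundary match may be preferable.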

  4. Install the BeautifulSoup4 Python package.
  5. Download and run wikiPlots.py from this repository:

python wikiPlots.py wiki_dump_directory plot_file_name title_file_name

wiki_dump_directory should be the path to the directory containing the "AA", "AB", etc. folders. plot_file_name will be the name of the file that will contain the story plots. title_file_name will be the name of the file that will contain the list of story titles.
