

WD_Textual_References_Dataset

This repository generates datasets of Wikidata triples and their references, focusing on references that consist mostly of textual information.

Each notebook should begin with a header that explains its usage. The notebooks are listed in the order in which they should be executed.

    1. Processing Parsed Wikidata Dumps.ipynb: This notebook takes the parsed Wikidata dumps and transforms them into a dataset of unique references and their URLs.
    2. Extract and Process HTML From Sampled URLs.ipynb: This notebook takes the references and their URLs and extracts the HTML of the referenced pages.
    3. Obtaining Claim Data.ipynb: This notebook retrieves the claims supported by each unique reference.
    4. Turn HTML to Text.ipynb: This notebook extracts the text from the HTML pages. To prepare for the creation of WTR, it also applies sliding concatenation windows during text extraction.
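
The sliding concatenation windows mentioned in the last step can be pictured with a minimal sketch. It is not the notebook's actual code: the function name `sliding_windows`, the `max_size` parameter, and the choice to join sentences with a single space are all assumptions made for illustration. The idea is that a fact may span several adjacent sentences, so each run of consecutive sentences up to a given length is emitted as one candidate passage.

```python
from typing import List


def sliding_windows(sentences: List[str], max_size: int = 2) -> List[str]:
    """Return every run of 1..max_size consecutive sentences, joined by spaces.

    Hypothetical helper for illustration; the notebook may window differently.
    """
    windows = []
    for size in range(1, max_size + 1):
        # A window of `size` sentences can start at any index that leaves
        # room for `size` items; the range is empty if the list is too short.
        for start in range(len(sentences) - size + 1):
            windows.append(" ".join(sentences[start:start + size]))
    return windows


# Placeholder sentences standing in for text extracted from one HTML page.
page_sentences = ["First sentence.", "Second sentence.", "Third sentence."]
for window in sliding_windows(page_sentences, max_size=2):
    print(window)
```

With `max_size=2`, the three placeholder sentences yield the three single sentences followed by the two adjacent pairs.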

This process creates the following main data products:

  • reference_html_as_sentences_df.csv.csv: The unique references and their HTML/text.
  • text_reference_claims_df.csv: The unique claims backed by the references and their components.

To begin this process, a parsed Wikidata dump should be available. It can be generated by feeding a JSON Wikidata dump to the wikidata_parser.py script.
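
For orientation, here is a rough sketch of what reading a JSON Wikidata dump and collecting unique references with URLs could look like. It is not wikidata_parser.py, whose interface and output format are not described here. It assumes the standard dump layout (a JSON array with one entity per line) and the standard Wikibase statement structure, in which each reference carries a `hash` and the property P854 ("reference URL") holds the URL. The function names and the sample data are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List

REFERENCE_URL = "P854"  # Wikidata property "reference URL"


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma; the first
        # and last lines hold only the enclosing array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def unique_reference_urls(entities: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each reference hash to its URLs, keeping one entry per hash."""
    refs: Dict[str, List[str]] = {}
    for entity in entities:
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                for ref in statement.get("references", []):
                    urls = [
                        snak["datavalue"]["value"]
                        for snak in ref.get("snaks", {}).get(REFERENCE_URL, [])
                        if snak.get("snaktype") == "value"
                    ]
                    if urls:
                        # The same reference can back many claims; keep it once.
                        refs.setdefault(ref["hash"], urls)
    return refs


def make_entity(entity_id: str) -> dict:
    """Build a tiny made-up entity whose only claim cites one URL."""
    snak = {"snaktype": "value", "datavalue": {"value": "https://example.org/a"}}
    reference = {"hash": "abc", "snaks": {REFERENCE_URL: [snak]}}
    return {"id": entity_id, "claims": {"P31": [{"references": [reference]}]}}


# Two made-up entities that share the same reference.
sample_dump = [
    "[",
    json.dumps(make_entity("Q1")) + ",",
    json.dumps(make_entity("Q2")),
    "]",
]
print(unique_reference_urls(iter_entities(sample_dump)))
```

For a real dump, the lines would come from the (usually compressed) dump file read line by line, so the full file never has to be loaded into memory.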

Note: The data used for WTR is a further filtered version of the data produced here, since it needed to be smaller to keep crowdsourced annotation affordable.

Note: For the conda environment under which this repository was developed, see environment.yml.

Contributors

gabrielmaia7
