This repository generates datasets of Wikidata triples and references that consist mostly of textual information.
Each notebook starts with a header that explains its usage. The notebooks are listed here in the order in which they should be executed:
- Processing Parsed Wikidata Dumps.ipynb: This notebook takes the parsed Wikidata dumps and transforms them into a dataset of unique references and URLs.
- Extract and Process HTML From Sampled URLs.ipynb: This notebook takes the references and their URLs and retrieves the HTML code of each URL.
- Obtaining Claim Data.ipynb: This notebook retrieves claims that are supported by each unique reference.
- Turn HTML to Text.ipynb: This notebook extracts the text from the HTML pages. To prepare for the creation of WTR, it also applies sliding concatenation windows during text extraction.
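
As a rough illustration of the sliding concatenation windows mentioned in the last notebook, the sketch below joins each run of consecutive sentences into one text span. It is only a minimal sketch under assumptions: the function name `sliding_concat_windows`, the `window_size` and `sep` parameters, and the example sentences are all hypothetical and are not taken from the notebook itself.

```python
from typing import List


def sliding_concat_windows(
    sentences: List[str], window_size: int = 2, sep: str = " "
) -> List[str]:
    """Concatenate every run of `window_size` consecutive sentences.

    Hypothetical helper: the notebook may implement its windows differently.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not sentences:
        return []
    # Fewer sentences than the window size: return them as a single span.
    if len(sentences) <= window_size:
        return [sep.join(sentences)]
    # Slide the window one sentence at a time over the list.
    return [
        sep.join(sentences[start:start + window_size])
        for start in range(len(sentences) - window_size + 1)
    ]


# Made-up sentences standing in for text extracted from one HTML page.
sentences = ["A was born in B.", "She studied C.", "She later moved to D."]
windows = sliding_concat_windows(sentences, window_size=2)
```

Overlapping windows of this kind keep information that is split across adjacent sentences together in at least one span, at the cost of producing more (and partly redundant) text.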
This process creates the following main data products:
- reference_html_as_sentences_df.csv.csv: The unique references and their HTML/text.
- text_reference_claims_df.csv: The unique claims backed by the references and their components.
To begin this process, a parsed Wikidata dump should be available. It can be generated by feeding a JSON Wikidata dump to the wikidata_parser.py script.
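
To give an idea of what reading a JSON Wikidata dump involves, the sketch below streams entities from dump-formatted lines and collects the URLs attached to their references. It is not the implementation of wikidata_parser.py, whose internals are not described here. It assumes the standard dump layout (one entity per line inside a JSON array) and the `P854` ("reference URL") property; the function names and the sample data are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple

REFERENCE_URL = "P854"  # Wikidata property "reference URL"


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a JSON Wikidata dump."""
    for line in lines:
        # Each entity line ends with a comma; the array brackets sit on their own lines.
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)


def iter_reference_urls(entity: dict) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (entity id, claim property, reference hash, URL) tuples."""
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            for ref in statement.get("references", []):
                for snak in ref.get("snaks", {}).get(REFERENCE_URL, []):
                    # Skip "novalue" / "somevalue" snaks, which carry no datavalue.
                    if snak.get("snaktype") == "value":
                        yield entity["id"], prop, ref["hash"], snak["datavalue"]["value"]


# Tiny made-up sample in dump format; a real dump would be read from a file.
sample_dump = [
    "[",
    '{"id": "Q1", "claims": {"P31": [{"references": [{"hash": "abc", '
    '"snaks": {"P854": [{"snaktype": "value", "datavalue": '
    '{"value": "https://example.org/a", "type": "string"}}]}}]}]}},',
    '{"id": "Q2", "claims": {}}',
    "]",
]

rows = [
    row
    for entity in iter_entities(sample_dump)
    for row in iter_reference_urls(entity)
]
```

Reading line by line avoids loading the whole dump into memory, which matters because full Wikidata dumps are very large.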
Note: The data used for WTR is a further filtered version of the data produced here, since it had to be smaller to make crowdsourced annotation affordable.
Note: For the conda environment under which this repository was developed, see environment.yml.