wikidatatools

Tools for preprocessing Wikidata for knowledge graph embeddings.

Wikidata tools is a set of tools for efficiently preprocessing Wikidata N-Triples dumps into formats that can be consumed by knowledge graph embedding (KGE) libraries such as PyKEEN.

We provide two files:

  • processTruthy.py - Set of tools for converting a Wikidata N-Triples dump from its .nt.bz2 format into .tsv files. To run it, navigate to the directory where it is stored and run the command python processTruthy.py -f path/to/ntriplesdump.nt.bz2. You may also add the -o flag to specify an output directory and the -c flag to specify the chunk size, in lines, passed to each worker. This creates four .tsv files, in the current directory if -o is not specified, or otherwise in the directory you chose:
    • edges.tsv - Entity-Property-Entity triples of the form head/relation/tail detailing relationships between items
    • meta.tsv - Entity-Label-Value triples containing labels of each Wikidata item
    • data.tsv - Entity-Property-Value triples containing data about items
    • errors.tsv - Output file for any errors encountered during decoding
  • subsetTools.py - Set of tools for splitting the edges.tsv file created by processTruthy.py into smaller subsets. The script first queries Wikidata's SPARQL endpoint to retrieve a list $C$ of all subclasses of a user-defined class. This list is then used to filter the Dask dataframe down to $I = \{(h, r_{instance}, c) : c \in C\}$, where $r_{instance}$ is a special property, P31, indicating that the head entity $h$ is an instance of class $c$. The instance list $I$ is then used to create the subset $S = \{(h, r, t) : (h, t) \in I\}$. To run this script, call the create_subset(targets, input_path, output_path) function, specifying the target Wikidata class or classes in the form 'Q123' as a string or an iterable of strings.
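
The filtering step described for subsetTools.py can be illustrated with a small in-memory sketch. This is not the script's actual implementation: it uses plain Python sets instead of a Dask dataframe, takes the subclass list $C$ as an already-computed argument rather than querying the SPARQL endpoint, and reads the definition of $S$ as "keep triples whose head and tail both appear as instances in $I$", which is one interpretation of the formula above. The function name create_subset_sketch and all entity and property IDs other than P31 are hypothetical placeholders.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)
INSTANCE_OF = "P31"  # Wikidata's "instance of" property


def create_subset_sketch(edges: Iterable[Triple], subclasses: Set[str]) -> List[Triple]:
    """Hypothetical stand-in for the filtering step, not the real create_subset."""
    edges = list(edges)
    # I: head entities that are an instance (P31) of some class c in C.
    instances = {h for h, r, t in edges if r == INSTANCE_OF and t in subclasses}
    # S: triples whose head and tail are both in the instance list
    # (assumed reading of the definition of S).
    return [(h, r, t) for h, r, t in edges if h in instances and t in instances]


# Toy edge list with placeholder IDs; a real edges.tsv would be far larger.
edges = [
    ("Q1", "P31", "Q100"),  # Q1 is an instance of a target subclass
    ("Q2", "P31", "Q101"),  # Q2 is an instance of a target subclass
    ("Q3", "P31", "Q999"),  # Q3 belongs to an unrelated class
    ("Q1", "P50", "Q2"),    # both ends are instances, so this is kept
    ("Q1", "P50", "Q3"),    # tail is not an instance, so this is dropped
]
subset = create_subset_sketch(edges, {"Q100", "Q101"})
print(subset)
```

Under this reading, the P31 triples themselves drop out of the subset because their tails are classes rather than instances; if the class-membership edges are wanted, the filter would need to keep them explicitly.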

As this set of tools was created as a pipeline to supply data to a PyKEEN model as part of our capstone project at the University of Virginia, we include an example model training script in the examples folder. We also include the datasets from our paper, Review of Knowledge Graph Embedding Models for Link Prediction on Wikidata Subsets, as examples.

Contributors

aaedelman, q-maze
