pyRDF2Vec

Python implementation and extension of RDF2Vec to create a 2D feature matrix from a Knowledge Graph for downstream ML tasks.


What is RDF2Vec?

RDF2Vec is an unsupervised technique that builds on Word2Vec, which learns an embedding per word in one of two ways:

  1. predicting the word from its context: Continuous Bag-of-Words (CBOW);
  2. predicting the context from a word: Skip-Gram (SG).
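
The difference between the two modes is easiest to see in the training pairs each one produces. The following is a minimal sketch in plain Python, not Word2Vec's actual implementation; the helper names `cbow_pairs` and `skipgram_pairs`, the window size, and the example sentence are all assumptions made for illustration.

```python
def cbow_pairs(sentence, window=1):
    """Yield (context_words, target_word): CBOW predicts a word from its context."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield (left + right, target)


def skipgram_pairs(sentence, window=1):
    """Yield (input_word, context_word): Skip-Gram predicts the context from a word."""
    for context, target in cbow_pairs(sentence, window):
        for word in context:
            yield (target, word)


# Hypothetical three-token "sentence".
sentence = ["alice", "knows", "bob"]
cbow = list(cbow_pairs(sentence))
sg = list(skipgram_pairs(sentence))

for pair in cbow:
    print("CBOW     ", pair)
for pair in sg:
    print("Skip-Gram", pair)
```

CBOW yields one training example per position (many context words, one target), whereas Skip-Gram yields one example per (word, neighbour) pair.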

To create this embedding, RDF2Vec first creates "sentences" which can be fed to Word2Vec by extracting walks of a certain depth from a Knowledge Graph.
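
As a rough illustration of how walks become "sentences", the sketch below extracts all walks up to a given depth from a tiny graph. It is a simplified stand-in, not pyRDF2Vec's own walker: the triples, the `ex:` names, and the `extract_walks` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy Knowledge Graph as (subject, predicate, object) triples.
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Acme", "ex:locatedIn", "ex:Ghent"),
]

# Index outgoing edges per subject for fast neighbour lookup.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))


def extract_walks(root, depth):
    """Return all walks of at most `depth` hops that start at `root`."""
    walks = [(root,)]
    for _ in range(depth):
        extended = []
        for walk in walks:
            hops = outgoing.get(walk[-1], [])
            if not hops:
                extended.append(walk)  # dead end: keep the shorter walk
            for p, o in hops:
                extended.append(walk + (p, o))
        walks = extended
    return walks


# Each walk is a token sequence that could be handed to Word2Vec as a sentence.
sentences = [list(w) for w in extract_walks("ex:Alice", depth=2)]
for sentence in sentences:
    print(" -> ".join(sentence))
```

Every entity and predicate along a walk is treated as a token, so entities that occur in similar walks end up with similar embeddings.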

This repository contains an implementation of the algorithm in "RDF2Vec: RDF Graph Embeddings and Their Applications" by Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim ([paper] [original code]).

Getting Started

We provide a blog post with a tutorial on how to use pyRDF2Vec here. Below is a short overview of the different functionalities.

Installation

pyRDF2Vec can be installed in two ways:

  1. from PyPI using pip:
pip install pyRDF2vec
  2. from any compatible Python dependency manager (e.g., poetry):
poetry add pyRDF2vec

Introduction

To create embeddings for a list of entities, there are two steps to do beforehand:

  1. create a Knowledge Graph object;
  2. define a walking strategy.

For a more elaborate example, see the example.py file:

PYTHONHASHSEED=42 python example.py

NOTE: setting PYTHONHASHSEED (e.g., to 42) is needed to ensure determinism.

Create a Knowledge Graph Object

To create a Knowledge Graph object, you can initialize it in two ways.

  1. from a file using RDFlib:
from pyrdf2vec.graphs import KG

# Define the label predicates; all triples with these predicates
# are excluded from the graph.
label_predicates = ["http://dl-learner.org/carcinogenesis#isMutagenic"]
kg = KG("samples/mutag/mutag.owl", label_predicates=label_predicates)
  2. from a server using SPARQL:
from pyrdf2vec.graphs import KG

kg = KG("https://dbpedia.org/sparql", is_remote=True)

Define Walking Strategies With Their Sampling Strategy

All supported walking strategies can be found on the Wiki page.

As the number of walks grows exponentially with the depth, exhaustively extracting all walks quickly becomes infeasible for larger Knowledge Graphs. To circumvent this issue, sampling strategies can be applied. These extract a fixed maximum number of walks per entity, sampled according to a certain metric.
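
To make the trade-off concrete, here is a small sketch under stated assumptions: a synthetic tree in which every node has 3 outgoing edges, and a sampler that picks the next hop uniformly at random. The helpers `build_tree`, `random_walk`, and `sample_walks` are hypothetical and are not pyRDF2Vec's API; its own samplers may weight hops differently.

```python
import random


def build_tree(branching, depth):
    """Build a synthetic graph where each node has `branching` outgoing edges."""
    outgoing = {}
    frontier = ["n"]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = [f"{node}{i}" for i in range(branching)]
            outgoing[node] = [("ex:p", child) for child in children]
            next_frontier.extend(children)
        frontier = next_frontier
    # One full-depth walk per leaf, so len(frontier) == branching ** depth.
    return outgoing, len(frontier)


def random_walk(outgoing, root, depth, rng):
    """Follow one uniformly chosen outgoing edge per hop."""
    walk = [root]
    for _ in range(depth):
        hops = outgoing.get(walk[-1], [])
        if not hops:
            break
        predicate, obj = rng.choice(hops)
        walk += [predicate, obj]
    return tuple(walk)


def sample_walks(outgoing, root, depth, max_walks, seed=42):
    """Draw at most `max_walks` distinct walks instead of enumerating them all."""
    rng = random.Random(seed)
    walks = {random_walk(outgoing, root, depth, rng) for _ in range(max_walks)}
    return sorted(walks)


outgoing, total_walks = build_tree(branching=3, depth=4)
sampled = sample_walks(outgoing, "n", depth=4, max_walks=5)
print(f"{total_walks} possible walks, {len(sampled)} sampled")
```

With a branching factor of 3 the walk count is already 3^4 = 81 at depth 4; sampling keeps the cost per entity bounded by the chosen maximum regardless of depth.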

For example, if one wants to extract a maximum of 5 walks of depth 4 for each entity using the Random walking strategy and Uniform sampling strategy (SEE: the Wiki page for other sampling strategies), the following code snippet can be used:

from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# Walks of depth 4, at most 5 walks per entity, sampled uniformly
walkers = [RandomWalker(4, 5, UniformSampler())]

Create Embeddings

Finally, create the embeddings for a list of entities as follows:

from pyrdf2vec import RDF2VecTransformer

transformer = RDF2VecTransformer(walkers=walkers)
# Entities should be a list of URIs that can be found in the Knowledge Graph
embeddings = transformer.fit_transform(kg, entities)

Documentation

For more information on how to use pyRDF2Vec, visit our online documentation, which is automatically updated with the latest version of the master branch.

There you can learn more about the modules and the functions available to you.

Contributions

Your help in the development of pyRDF2Vec is more than welcome. To better understand how you can help, through pull requests or issues, please take a look at the CONTRIBUTING file.

FAQ

How can I load my large KG in memory and avoid the slowness of the SPARQL endpoint server?

Loading large RDF files into memory will cause memory issues, as the code is not optimized for larger files. We welcome any PRs that optimize the memory usage! Remote KGs serve as a solution for larger KGs, but using a public endpoint will be very slow due to the overhead caused by HTTP requests. For that reason, it is better to set up your own local server and use that for your "Remote" KG. Please find a guide on our wiki.

How can I ensure that the generated embeddings are reproducible?

pyRDF2Vec's walking strategies and sampling strategies work with randomness. To get reproducible embeddings, you have to use a seed to ensure determinism:

PYTHONHASHSEED=42 python foo.py

However, you must also fix the randomness of the sampler after importing numpy, by adding the following code:

import numpy as np
np.random.seed(42)

This will ensure the np.random calls in pyRDF2Vec are seeded.
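
For background on the first step: Python randomizes string hashes per process by default, and PYTHONHASHSEED pins that randomization. The sketch below only demonstrates this interpreter behaviour and does not call pyRDF2Vec; the helper `string_hash` and the hashed string are hypothetical.

```python
import os
import subprocess
import sys


def string_hash(seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed and return hash('rdf2vec')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('rdf2vec'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate processes with the same seed agree on the hash value.
first = string_hash(42)
second = string_hash(42)
print(first == second)
```

The variable has to be set before the interpreter starts, which is why it is passed on the command line rather than set from within the script.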

Referencing

If you use pyRDF2Vec in a scholarly article, we would appreciate a citation:

@inproceedings{pyrdf2vec,
  author       = {Gilles Vandewiele and Bram Steenwinckel and Terencio Agozzino
                  and Michael Weyns and Pieter Bonte and Femke Ongenae
                  and Filip De Turck},
  title        = {{pyRDF2Vec: Python Implementation and Extension of RDF2Vec}},
  organization = {IDLab},
  year         = {2020},
  url          = {https://github.com/IBCNServices/pyRDF2Vec}
}
