
hercules-sync's Introduction

hercules-sync

License · Travis build status · Coverage

Tools to synchronise data between ontology files and a Wikibase instance for the Hercules project at the University of Murcia.

Directory structure

  • docs: Development documentation of this module.
  • hercules_sync: Source code of the application.
  • tests: Test suite used to validate the project.

Defining a webhook in the source repo

In order to perform the synchronization automatically, a webhook must be created in the original repository where the ontology is stored. This webhook will be triggered whenever a new push event occurs in the repo, and the synchronization service will be called to sync the changes with the Wikibase instance. When creating the webhook, the payload URL must point to the URL where this server will be available. It is also important to define a secret key, so that only requests coming from the source repo are accepted.
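As a reference, this is roughly how the secret can be checked on the server side. A minimal sketch using only the standard library, assuming GitHub's X-Hub-Signature-256 header and an environment variable (name assumed) holding the shared secret:

# Hedged sketch (not the project's actual implementation): verify that a
# payload was signed with the shared webhook secret before processing it.
import hashlib
import hmac
import os

def is_valid_signature(payload: bytes, signature_header: str) -> bool:
    """signature_header is the value of the X-Hub-Signature-256 header."""
    secret = os.environ['WEBHOOK_SECRET'].encode()  # env var name is an assumption
    expected = 'sha256=' + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or '')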

Launching the app with Docker

In order to execute the app with Docker, you need to set the required configuration (the environment variables used by the service) in the docker-compose.yml file.

Launching the app directly with Python

This application is compatible with Python 3.6 onwards, but Python 3.7 or later is recommended for performance reasons. After you have installed Python, you can run the following command to install the dependencies:

pip install -r requirements.txt

After the requirements have been installed, you need to set the environment variables described in the previous section directly. After that, the following command can be executed to launch the app:

python wsgi.py

Alternatively, there is a shell script available that installs the dependencies and runs the server automatically. In order to execute this script, set the environment variables and then run:

sh start_server.sh

hercules-sync's People

Contributors

alejgh, mistermboy, thewillyhuman

Forkers

mistermboy

hercules-sync's Issues

Mapping of external entities to their original URI

If we have to insert a triple like this into Wikibase:

asio:ResearchGroup rdfs:subClassOf foaf:Organization .

Each item will be created if it doesn't exist in the Wikibase. However, for elements that do not belong directly to the asio Ontology (e.g. foaf:Organization) we should automatically create a mapping statement to the original resource.

Maybe this should be done for every resource and not just for the external ones, since it allows traversing from each item to its original URI in the ontology. We need to decide this before we can start with the implementation of this issue.
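For illustration, a mapping statement could be built with wikidataintegrator's URL datatype. A minimal sketch, where the id of the "mapped URI" property is a placeholder, not an existing property:

# Hedged sketch: attach the original ontology URI to a Wikibase entity as a
# URL statement. MAPPED_URI_PROP is a placeholder for the mapping property.
from wikidataintegrator import wdi_core

MAPPED_URI_PROP = 'P1'  # hypothetical "mapped URI" property in the local wikibase

def mapping_statement(original_uri: str) -> wdi_core.WDUrl:
    return wdi_core.WDUrl(value=original_uri, prop_nr=MAPPED_URI_PROP)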

Better handling of labels and descriptions in wikibase

Right now we are setting labels and descriptions in wikibase when the rdfs:label or rdfs:description URIs are found in the triple, respectively.

From the Wikibase RDF Dump Format page we can see this:

Entity labels - the main name of the entity. Labels are defined as schema:name, rdfs:label and skos:prefLabel predicates with objects being language-tagged string literals.
Entity aliases - the secondary names of the entity. Aliases are defined as skos:altLabel predicates with objects being language-tagged string literals.
Entity description - the longer description of the entity. Defined as schema:description predicates with objects being language-tagged string literals.

So we should also consider setting labels when a schema:name or skos:prefLabel predicate is found, and setting the description when schema:description is found.
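One way to organise this, sketched below with rdflib namespaces (the constant names are assumptions; the predicate sets follow the dump format quoted above):

# Hedged sketch: map the RDF predicates listed above to the Wikibase field
# they should update (labels, aliases or descriptions).
from rdflib import URIRef
from rdflib.namespace import RDFS, SKOS

SCHEMA_NAME = URIRef('http://schema.org/name')
SCHEMA_DESCRIPTION = URIRef('http://schema.org/description')

LABEL_PREDICATES = {RDFS.label, SKOS.prefLabel, SCHEMA_NAME}
ALIAS_PREDICATES = {SKOS.altLabel}
DESCRIPTION_PREDICATES = {SCHEMA_DESCRIPTION}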

Set Python 3.7 as preferred version

The following page from the official Python site lists the additions to Python 3.7. The following point is especially important for this project:

As a result of PEP 560 work, the import time of typing has been reduced by a factor of 7, and many typing operations are now faster. (Contributed by Ivan Levkivskyi in bpo-32226.)

Since we are making heavy use of the typing package in our project, I think we should use Python 3.7 in the Travis build and indicate in the docs that it is the preferred version. Backwards compatibility with Python 3.6 and 3.5 should still be maintained.

Deployment of documentation to Github Pages

Some remarks:

  • Documentation should follow the arc42 markdown template.
  • Markdown files should be converted to HTML and a custom style should be added to the final page.
  • This deployment should be added to CI.

I think Sphinx should be used for this task, since it can work with Markdown and we could easily generate automatic documentation from our Python docstrings in the future.
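A minimal docs/conf.py sketch of that idea, assuming the myst_parser extension for Markdown support (the actual extension and theme choices are open):

# Hedged sketch of a Sphinx configuration that reads the existing Markdown
# (arc42) files and keeps autodoc available for Python docstrings.
project = 'hercules-sync'
extensions = [
    'myst_parser',         # Markdown support (assumption; recommonmark also works)
    'sphinx.ext.autodoc',  # automatic documentation from docstrings
]
source_suffix = {'.rst': 'restructuredtext', '.md': 'markdown'}
html_theme = 'alabaster'   # placeholder; a custom style would replace this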

Parsing of property datatype

When we create a new property in wikibase we need to provide the type of the property. The following datatypes are allowed:

  • wikibase-item
  • wikibase-property
  • string
  • monolingualtext
  • external-id
  • quantity
  • time
  • url
  • math
  • geo-shape

More information about the datatypes can be found here.

We need to implement a system that parses the type of each property based on its range in the ontology file. If the property operates on URIs, its type will be 'wikibase-item' if the URI is an item, or 'wikibase-property' if the URI is a property (we need to implement #14 to infer this). On the other hand, if a triple has a literal as its object, we need to parse which specific type of literal it is.
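A possible starting point for this parsing, with an assumed (non-exhaustive) XSD-to-Wikibase mapping:

# Hedged sketch: infer the Wikibase datatype of a property from its rdfs:range.
from rdflib.namespace import XSD

RANGE_TO_WB_DATATYPE = {
    XSD.string: 'string',
    XSD.anyURI: 'url',
    XSD.dateTime: 'time',
    XSD.integer: 'quantity',
    XSD.decimal: 'quantity',
    XSD.boolean: 'wikibase-item',  # see the boolean datatypes issue below
}

def infer_datatype(range_uri, is_property) -> str:
    """is_property is the reasoner check from #14, passed in as a callable."""
    if range_uri in RANGE_TO_WB_DATATYPE:
        return RANGE_TO_WB_DATATYPE[range_uri]
    return 'wikibase-property' if is_property(range_uri) else 'wikibase-item'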

Problems to solve

There are some cases where knowing the type of the property from a single triple might be hard or even impossible. Let's illustrate this with an example.

Suppose that we have the following triple:

ex:myProperty rdfs:domain ex:Person .

We can infer that ex:myProperty is a property, since it is the subject of the rdfs:domain predicate. We can also know that a subject with this property belongs to the class ex:Person. However, we don't know the datatype of this property yet, so we can't create it in Wikibase.

Add information about each implemented test to the docs

In #55 we have added a small description of the tests regarding each module.

It could be interesting to add a small description and expected outcome of each one of the tests that have been implemented in the library. If this content ends up being too long, we could maybe move it to an annex if needed.

Parsing of information received from Webhook

Right now we are receiving the following information from a Webhook when a push event is received:

data = {
    "before": "before_push_sha",
    "after": "after_push_sha",
    "repository": {
        "full_name": "user_name/repo_name"
    }
}

We need to process this information to obtain the lines that were added and removed, for each file that was modified in the push.

We will use the output of this process later on to collect the items that need to be changed in the wikibase instance.
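For illustration, the added and removed lines could be obtained from the before/after SHAs with the public GitHub compare API. A minimal sketch (the project may instead clone the repository and diff locally):

# Hedged sketch: turn a push payload into {filename: (added_lines, removed_lines)}
# using the GitHub compare endpoint.
import requests

def changed_lines(data: dict) -> dict:
    url = 'https://api.github.com/repos/{}/compare/{}...{}'.format(
        data['repository']['full_name'], data['before'], data['after'])
    files = requests.get(url).json().get('files', [])
    result = {}
    for f in files:
        patch = f.get('patch', '')
        added = [line[1:] for line in patch.splitlines()
                 if line.startswith('+') and not line.startswith('+++')]
        removed = [line[1:] for line in patch.splitlines()
                   if line.startswith('-') and not line.startswith('---')]
        result[f['filename']] = (added, removed)
    return result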

Return ModificationResult objects from wikibase adapter

In the triplestore_manager module, the following class was implemented to represent and manage the synchronisation results:

class ModificationResult():
    def __init__(self, successful: bool, message: str):
        self.successful = successful
        self.message = message

We should return this object from the wikibase adapter methods so we can handle the results.
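A sketch of what an adapter method could look like once this is done (the method name and its internals are assumptions, not the current adapter API):

# Hedged sketch: wrap the outcome of a wikibase write into a ModificationResult.
def set_label(self, entity_id: str, label: str, lang: str) -> ModificationResult:
    try:
        self._write_label(entity_id, label, lang)  # hypothetical internal helper
        return ModificationResult(successful=True, message='')
    except Exception as err:
        return ModificationResult(successful=False, message=str(err))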

Deployment of initial version of the system

The synchronization system is going to be ready for an initial deployment after the implementation of #2.

We should connect all the pieces and deploy the system to respond to changes in the gh-pages branch of the hercules-ontology repo, modifying our local wikibase instance (http://156.35.94.149:8181/wiki/Main_Page) with these changes.

After the system is deployed, there will be a period to detect potential bugs that were not identified in the initial implementation of the system, and correct them before the end of our first milestone.

Create benchmarks notebook

After issue #41 is finished, we should add a new notebook where we measure the performance of the system with and without performance improvements activated.

Allow boolean datatypes

Boolean datatypes are not supported by wikibase (see https://phabricator.wikimedia.org/T145528). The alternative is to represent the boolean with a wikibase-item datatype which points to a True or False item in wikibase.

We need to implement the create_wditemid_from_bool function from the mappings module to provide this functionality. This method will parse the content (True or False) and reference the qid of the corresponding boolean item. The qid will be created if it doesn't exist yet (we have to set a custom label and description to look for it, or create it if the search doesn't return any value).
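A minimal sketch of that function, with the label search and item creation passed in as callables because the actual helpers are not defined here:

# Hedged sketch of create_wditemid_from_bool: map True/False to the qid of a
# dedicated boolean item, creating it if the search returns nothing.
from wikidataintegrator import wdi_core

def create_wditemid_from_bool(content: bool, search_fn, create_fn, **kwargs):
    label = 'true' if content else 'false'
    qid = search_fn(label) or create_fn(label)  # search by label, create if missing
    return wdi_core.WDItemID(value=qid, **kwargs)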

Implementation of naive synchronization algorithm

For now, the first synchronisation algorithm should be a naive one that updates the triple store by overwriting its contents with the new files from the ontology repository.

Additional synchronisation algorithms will be developed later on and compared with this one.

Decouple TripleElement from wdi

The current implementation of the LiteralElement and URIElement classes directly provides methods to translate to wdi datatypes and classes. For example:

class URIElement(TripleElement):

    ...

    @property
    def wdi_class(self) -> Union[Type[WDItemID], Type[WDProperty]]:
        """ Returns the wikidataintegrator class of the element.
        """
        assert self.etype in self.VALID_ETYPES
        return WDItemID if self.etype == 'item' else WDProperty

    @property
    def wdi_dtype(self) -> str:
        """ Returns the wikidataintegrator DTYPE of this element.
        """
        return self.wdi_class.DTYPE

    def to_wdi_datatype(self, **kwargs) -> Union[WDItemID, WDProperty]:
        return self.wdi_class(value=self.id, **kwargs)

This code should be refactored so that the wikidataintegrator objects are completely decoupled from the TripleElements. For example, the mappings module could provide functions that convert from one type to the other.
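Such a conversion function in the mappings module could look roughly like this (a sketch, not the final API):

# Hedged sketch: move the wdi translation out of URIElement into the mappings module.
from typing import Union
from wikidataintegrator.wdi_core import WDItemID, WDProperty

def uri_element_to_wdi(element, **kwargs) -> Union[WDItemID, WDProperty]:
    """element is a URIElement; it no longer needs to know about wdi classes."""
    wdi_class = WDItemID if element.etype == 'item' else WDProperty
    return wdi_class(value=element.id, **kwargs)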

Base implementation of ontology reasoner

We need to implement a basic reasoner that determines whether a URI is an rdf:Property or an rdfs:Class. This information is used later on by the wikibase adapter to know whether we are inserting an item (Q) or a property (P). We know that a URI represents a property at least in the following cases:

  • When it is a subject or object of the rdfs:subPropertyOf predicate.
  • When it is a subject of the rdfs:domain predicate.
  • When it is a subject of the rdfs:range predicate.

More cases and support for OWL will be discussed in additional issues.
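A minimal sketch of these rules with rdflib (the function name is an assumption):

# Hedged sketch: decide whether a URI should be treated as a property,
# based only on the three rules listed above.
from rdflib import Graph
from rdflib.namespace import RDFS

def is_property(uri, graph: Graph) -> bool:
    for subj, pred, obj in graph:
        if pred == RDFS.subPropertyOf and uri in (subj, obj):
            return True
        if pred in (RDFS.domain, RDFS.range) and subj == uri:
            return True
    return False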

Possible improvements in performance

There is currently a bottleneck with the execution of operations to the triple store (addition, removal...). These are some possible performance improvements that could be done:

  • Add a new BatchOperation that executes all the operations of a given subject at once (create it if needed and write all its related statements in a single call). We also need to provide a way to transform multiple simple operations into BatchOperations (see the sketch after this list).
  • Use the fast run mode from WikidataIntegrator.
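A rough sketch of the grouping step for the BatchOperation idea (class and attribute names are assumptions about the operations module):

# Hedged sketch: group per-triple operations by subject so that each subject
# is created and written to the triple store only once.
from collections import defaultdict
from typing import Dict, Iterable, List

def group_operations_by_subject(operations: Iterable) -> Dict[str, List]:
    """Each operation is assumed to expose the subject of its triple."""
    batches = defaultdict(list)
    for op in operations:
        batches[str(op.triple_info.subject)].append(op)
    return dict(batches)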

Add dockerfile

Create a Dockerfile to run this web service. The Dockerfile should have a lightweight image as a base and install Python 3.7 and the library dependencies from the requirements.txt file.
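A minimal sketch of such a Dockerfile, assuming a python:3.7-slim base image and the wsgi.py entry point from the README (the exposed port is an assumption):

# Hedged sketch, not the final Dockerfile: lightweight Python 3.7 base image
# plus the dependencies from requirements.txt.
FROM python:3.7-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 5000
CMD ["python", "wsgi.py"]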

Fix problems with multiple values for the same property

If we have the following statement:

ex:Example owl:disjointWith ex:A ,
                            ex:B .

A new statement will be created for the ex:Example entity with the property owl:disjointWith and the value ex:A. However, after that the value of this statement will be replaced by ex:B.
The expected behaviour is that a new value is added to the statement, so that the owl:disjointWith property of ex:Example has both ex:A and ex:B as values.
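One way to avoid the overwrite, sketched below, is to group the objects of each (subject, predicate) pair before any write is issued, so the adapter receives all values of a statement at once (a sketch, not the current implementation):

# Hedged sketch: collect every object of the same (subject, predicate) pair,
# so the resulting statement keeps all of its values instead of the last one.
from collections import defaultdict

def group_objects(triples):
    """triples is an iterable of (subject, predicate, object) tuples."""
    grouped = defaultdict(list)
    for subj, pred, obj in triples:
        grouped[(subj, pred)].append(obj)
    return grouped  # e.g. {(ex:Example, owl:disjointWith): [ex:A, ex:B]}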

Fix autodoc warnings

When using Sphinx to generate the documentation of the module the following warnings are raised by Travis:

332. /home/travis/build/weso/hercules-sync/hercules_sync/git.py:docstring of hercules_sync.git.GitFile.added_lines:7: WARNING: Definition list ends without a blank line; unexpected unindent.
333. /home/travis/build/weso/hercules-sync/hercules_sync/git.py:docstring of hercules_sync.git.GitFile.removed_lines:7: WARNING: Definition list ends without a blank line; unexpected unindent.
334. WARNING: autodoc: failed to import module 'listener' from module 'hercules_sync'; the following exception was raised:
...
361. /home/travis/build/weso/hercules-sync/hercules_sync/webhook.py:docstring of hercules_sync.webhook.WebHook.hook:4: WARNING: Unexpected section title.

The last one will eventually be fixed when the module is finished, but the others need to be addressed.

Better handling of unrecognised languages

Suppose that we have the following triple:

ex:Person rdfs:label "Persona"@es-ES .

Since the es-ES language code is not valid in Wikibase, an error is thrown. For the moment we should catch this error and return it in the ModificationResult object.

Due to that, #20 should be implemented before this issue is fixed.

Move operations to execute to a Kafka queue

Thank you to @TheWilly for the original idea.

The main point of this issue is to propose the addition of a Kafka queue (or a similar technology, we still have to discuss the details) where the operations to execute in Wikibase will be pushed and then handled by the Wikibase adapter. Although this will add a new layer of complexity (bigger or smaller depending on the final solution details), we will also get several benefits. Some of them are:

  • Better decoupling between the generation of operations and their consumption.
  • Ability to let the execution of operations running in the background once they have been created.
  • If there are multiple requests made in a short span of time, we would avoid potential problems where operations could be executed in an invalid order and result in an inconsistent triple-store state.

In general, I really like the idea and I think it could benefit the system considerably.
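For illustration, with kafka-python the producer side could look roughly like this (topic name, broker address and serialisation are assumptions; the consumer in the wikibase adapter would mirror it):

# Hedged sketch: push serialised operations to a Kafka topic instead of
# executing them synchronously in the request handler.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # broker address is an assumption
    value_serializer=lambda op: json.dumps(op).encode('utf-8'),
)

def enqueue_operation(operation: dict) -> None:
    """operation is a JSON-serialisable description of a wikibase edit."""
    producer.send('hercules-sync-operations', operation)  # topic name is an assumption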

Set and remove aliases

We need to set and remove aliases of entities from wikibase using the wikidataintegrator package. An alias is going to be changed when the skos:altLabel URI is found in a triple.
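A sketch of how this could be handled on top of wikidataintegrator (the WDItemEngine alias methods and their signatures are stated from memory and should be checked against the wdi docs):

# Hedged sketch: set or remove an alias on an item when a skos:altLabel
# triple is added or removed.
from wikidataintegrator import wdi_core

def update_alias(item: wdi_core.WDItemEngine, alias: str, lang: str, remove: bool) -> None:
    current = item.get_aliases(lang=lang)
    new = [a for a in current if a != alias] if remove else current + [alias]
    item.set_aliases(new, lang=lang, append=False)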

Add behaviour tests

We should add behavioural tests to the current test suite. The pytest-bdd library should be used to implement them, since it integrates seamlessly with pytest, so no changes need to be made in Travis to execute all the tests.
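A minimal pytest-bdd sketch (the feature text, file path and step names are invented for illustration):

# Hedged sketch: a behavioural test bound to a hypothetical feature file.
# features/sync.feature would contain, e.g.:
#   Scenario: Push event triggers a synchronisation
#     Given a push event payload
#     When the webhook endpoint receives it
#     Then the synchronisation is scheduled
from pytest_bdd import scenario, given, when, then

@scenario('features/sync.feature', 'Push event triggers a synchronisation')
def test_push_triggers_sync():
    pass

@given('a push event payload')
def push_payload():
    return {'before': 'a' * 40, 'after': 'b' * 40,
            'repository': {'full_name': 'user_name/repo_name'}}

@when('the webhook endpoint receives it')
def receive_payload():
    pass  # call the listener with the payload here

@then('the synchronisation is scheduled')
def sync_is_scheduled():
    pass  # assert on the expected outcome here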

Creation of listener module

The listener module will be connected to the webhook module (see #1) and will create the route where the payload from the hercules-ontologies repository will be sent.

Internal logic to process the request will be implemented by other modules.

After the request is processed, a response will be returned.
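A minimal sketch of such a route, assuming a Flask application (the framework choice and route name are assumptions; the project only documents a wsgi.py entry point):

# Hedged sketch: the route where the webhook payload is received.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/postreceive', methods=['POST'])  # route name is an assumption
def on_push():
    payload = request.get_json()
    # Signature validation and the actual processing are delegated to the
    # webhook module and the synchronisation logic.
    return jsonify({'status': 'received'}), 200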

Proposal: Create a new docker file with the synchronisation module and the wikibase instance

The main idea would be to "link" automatically the synchronisation module with the other container where the wikibase instance would be launched.

Advantages:

  • Less configuration to be made (we don't need to write the API and SPARQL endpoints of our target wikibase).
  • Everything is contained in a single docker-compose.yml file.

Disadvantages:

  • We lose a lot of flexibility. We can't use the synchronisation module with a wikibase that is not launched together with it.

I need to think about whether there is a way to connect to the wikibase defined in the docker-compose file directly and, if there is no wikibase to connect to, fall back to the WBAPI and WBSPARQL environment variables and try the connection with them. With something like this we would get the benefits of both approaches.

If that is not possible, in my opinion the flexible approach which can connect to any wikibase instance is preferable.

Creation of Webhook module

Create a module that receives webhook updates from the hercules-ontologies repository. The module will handle authentication features and can be used by listeners to receive the updates and implement their own logic.

Allow ontologies with blank nodes

We need to support the synchronisation of changes regarding blank/anonymous nodes. For example:

:played_by a owl:ObjectProperty ;
     rdfs:label "played by"@en ;
     rdfs:domain [ a owl:Restriction ;
             owl:onProperty :played_by ;
             owl:someValuesFrom :Song ] ;
     rdfs:range [ a owl:Restriction ;
             owl:onProperty :played_by ;
             owl:someValuesFrom :Artist ] .

In order to implement this we will need to add a way to parse BNodes from RDFLib in our TripleInfo.from_rdflib method.

I don't know if this will be straightforward or not.

I think that maybe the predicates inside these blank nodes should be treated as qualifiers in wikibase, but we need to discuss this (e.g. what predicate could be used to define the value and not the qualifiers...).

We should also consider if this issue includes the functionality of parsing 'QualifiedValues' from the hercules-ontology.
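A possible starting point for the parsing part (AnonymousElement is a hypothetical new TripleElement subtype; LiteralElement and URIElement come from this repository, so their imports are omitted here):

# Hedged sketch: map each rdflib term to the corresponding TripleElement type,
# including blank nodes.
from rdflib.term import BNode, Literal

def element_from_rdflib(node):
    if isinstance(node, BNode):
        return AnonymousElement(str(node))  # hypothetical blank-node element
    if isinstance(node, Literal):
        return LiteralElement(node)
    return URIElement(str(node))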

Update project documentation

There have been some changes in the project architecture for the last month. We should add these changes to the project documentation and write the remaining chapters.

Create basic directory structure of the application

It will include:

  • Config directory with config files for the different environments (production, development and testing).
  • requirements.txt file with the necessary dependencies to run the app.
  • setup.sh file to run the app from a new working environment.
  • Entry point of the application.
  • Directory where the main code will be organised.
