diachron / quality
Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
License: MIT License
This is a Reputation Dimension.
For this metric we need to check whether the ontologies used are part of the OBO Foundry (therefore we only need to check the type of each instance).
For more information check D5.1
In the class comment, mention that this metric is specific to the EBI use-case
Implement a metric EmptyAnnotationValue
(in the category of Representational dimensions; Understandability dimension) that identifies triples whose property is from a pre-configured list of annotation properties, and whose object is an empty string.
We consider the following widely used annotation properties (labels, comments, notes, etc.):
http://www.w3.org/2004/02/skos/core#altLabel
http://www.w3.org/2004/02/skos/core#hiddenLabel
http://www.w3.org/2004/02/skos/core#prefLabel
http://www.w3.org/2004/02/skos/core#changeNote
http://www.w3.org/2004/02/skos/core#definition
http://www.w3.org/2004/02/skos/core#editorialNote
http://www.w3.org/2004/02/skos/core#example
http://www.w3.org/2004/02/skos/core#historyNote
http://www.w3.org/2004/02/skos/core#note
http://www.w3.org/2004/02/skos/core#scopeNote
http://purl.org/dc/terms/description
http://purl.org/dc/elements/1.1/description
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2000/01/rdf-schema#comment
For now, this list of properties can be hard-coded; we might think about a more extensible implementation later.
E.g. a triple like the following should be matched:
<http://...> <http://www.w3.org/2000/01/rdf-schema#comment> "" .
The metric value is defined as the ratio of annotations with empty objects to all annotations (i.e. all triples having such properties).
(Background: D3.1 Table 20 on page 91)
Cc: @nfriesen
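A minimal counting sketch of this metric, assuming a streaming, per-triple interface (class, method and field names are illustrative, not the project's actual API; the property list is abbreviated):

```java
import java.util.Set;

// Minimal sketch of the EmptyAnnotationValue metric. Names are
// illustrative, not the project's actual API.
class EmptyAnnotationValueSketch {

    // Hard-coded annotation properties, as suggested above (abbreviated).
    private static final Set<String> ANNOTATION_PROPERTIES = Set.of(
        "http://www.w3.org/2004/02/skos/core#prefLabel",
        "http://www.w3.org/2004/02/skos/core#altLabel",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "http://www.w3.org/2000/01/rdf-schema#comment");

    private long annotations = 0;
    private long emptyAnnotations = 0;

    // Called once per triple whose object is a literal.
    void assess(String predicateUri, String objectLexicalForm) {
        if (!ANNOTATION_PROPERTIES.contains(predicateUri)) return;
        annotations++;
        if (objectLexicalForm.isEmpty()) emptyAnnotations++;
    }

    // Ratio of annotations with empty objects to all annotations.
    double metricValue() {
        return annotations == 0 ? 0.0 : (double) emptyAnnotations / annotations;
    }
}
```

The final ratio would be reported once the whole dataset has been streamed through assess().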
Remove required JARs and add them as maven dependencies
The predicate of a quad can be an undefined property, and the object of a quad can be an undefined class or an undefined property when the quad's predicate is one out of the list given below.
The subject of a quad never references classes or properties in external vocabularies, so we don't have to analyse the subject for this metric.
This is the list of predicates that indicate that the object must be a defined class:
rdf:type
(FYI, just supporting this one is sufficient for most LOD datasets. The following are only relevant when LOD datasets define their own vocabulary, or in the case that a vocabulary/ontology happens to be implemented as a LOD dataset.)
rdfs:domain
rdfs:range
rdfs:subClassOf
owl:allValuesFrom
owl:someValuesFrom
owl:equivalentClass
owl:complementOf
owl:onClass
owl:disjointWith
This is the list of predicates that indicate that the object must be a defined property:
rdfs:subPropertyOf
owl:onProperty
owl:assertionProperty
owl:equivalentProperty
owl:propertyDisjointWith
In all of the cases above, "being defined" may also mean "defined in the current LOD dataset" (but we can assume that a class/property is defined at an earlier position in the current dataset, i.e. at a position that we have processed already). I.e. "being defined" does not only mean "defined in some external ontology".
FYI there are some more predicates for which we don't know whether the object is expected to be a class or property, but we'll ignore these predicates for now.
BTW, the current implementation for predicate and object looks a bit redundant to me; maybe we can shorten it by factoring out some of the common source code lines into a shared method.
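The factoring-out suggested above could take the shape of a shared lookup method, sketched here (all names are illustrative; prefixed names abbreviate the full URIs):

```java
import java.util.Set;

// Shared classification of what kind of resource a quad's object must
// be, given the quad's predicate (per the two lists above). Sketch only.
class ExpectedObjectKind {
    // Predicates whose object must be a defined class.
    static final Set<String> CLASS_EXPECTING = Set.of(
        "rdf:type", "rdfs:domain", "rdfs:range", "rdfs:subClassOf",
        "owl:allValuesFrom", "owl:someValuesFrom", "owl:equivalentClass",
        "owl:complementOf", "owl:onClass", "owl:disjointWith");

    // Predicates whose object must be a defined property.
    static final Set<String> PROPERTY_EXPECTING = Set.of(
        "rdfs:subPropertyOf", "owl:onProperty", "owl:assertionProperty",
        "owl:equivalentProperty", "owl:propertyDisjointWith");

    enum Kind { CLASS, PROPERTY, NONE }

    static Kind expectedObjectKind(String predicate) {
        if (CLASS_EXPECTING.contains(predicate)) return Kind.CLASS;
        if (PROPERTY_EXPECTING.contains(predicate)) return Kind.PROPERTY;
        return Kind.NONE;
    }
}
```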
Create an interface to access the available datasets - mainly SPARQL endpoints, RDF dumps.
These metrics might have partly overlapping implementations. This is just a superficial impression I got and should only be reviewed when we do a complete review of the implementation.
UnstructuredData speaks of “dead URIs”, whereas a dead link is something that's not dereferenceable. My intuitive understanding of UnstructuredData is rather that the link works, but provides, say, HTML instead of RDF.
by selecting ?s ?p ?o with a LIMIT/OFFSET (and ORDER BY if necessary)
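The paging idea can be sketched as a query-string builder (a hypothetical helper, not existing project code):

```java
// Hypothetical helper for paging through all triples of a SPARQL
// endpoint; not existing project code.
class PagedQuery {
    // ORDER BY gives a deterministic order, so that successive
    // LIMIT/OFFSET windows neither overlap nor skip triples.
    static String pagedTripleQuery(long limit, long offset) {
        return "SELECT ?s ?p ?o WHERE { ?s ?p ?o } "
             + "ORDER BY ?s ?p ?o "
             + "LIMIT " + limit + " OFFSET " + offset;
    }
}
```

The caller would increase the offset by the page size until a page comes back with fewer than `limit` results.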
The deadline for the first draft is 22 July.
Please create tests for the availability metrics
Would it be possible to use existing libraries rather than reimplementing (or reusing) the classes? For graph algorithms there is the JUNG library (http://jung.sourceforge.net), which I have used and find very useful. If you require it, I can add the pom dependency for you.
This is a Reputation Dimension and a ComplexQualityMetric.
For this metric we need to check if the dataset resources are hosted in a reputable source.
The list of reputable sources should be "loaded" in the before method.
The reputable sources are (links to owl files):
Ontology for Biomedical Investigations http://purl.obolibrary.org/obo/obi.owl
Cell Type Ontology http://purl.obolibrary.org/obo/cto.owl
Gene Ontology http://purl.obolibrary.org/obo/go.owl
PATO http://purl.obolibrary.org/obo/pato.owl
ChEBI http://purl.obolibrary.org/obo/chebi.owl
ORDO http://www.orphadata.org/data/ORDO/ordo_orphanet.owl.zip (note this one is zipped)
IAO http://purl.obolibrary.org/obo/iao.owl
NCBI Taxon http://purl.obolibrary.org/obo/ncbitaxon.owl (warning, this is a very big file!)
Uberon http://purl.obolibrary.org/obo/uberon.owl
Unit Ontology http://purl.obolibrary.org/obo/uo.owl
Software Ontology http://sourceforge.net/projects/theswo/files/SWO%20ontology%20release (sorry zipped again)
@nfriesen summarising what we discussed: at least for cleaning it will make sense to split UndefinedClassesOrProperties into UndefinedClasses and UndefinedProperties, as otherwise, if a triple <s> _:undefinedProperty _:undefinedClass is reported to be problematic w.r.t. UndefinedClassesOrProperties, it will be impossible to find out whether the predicate or the object is the culprit.
If you think you'll need this for cleaning soon, could you please discuss with @jerdeb, and then rephrase this issue into a more concrete instruction for implementation?
I realised that the OntologyHijacking metric is not completely correct.
@clange I would like to discuss with you the implementation of the metric
The HighThroughput and LowLatency metrics are taking a lot of time to compute on EBI datasets
Check out datacube and LODStats. The subset used in LODStats might be sufficient for us to use to improve the vocabulary
The current implementation of UndefinedClassesOrProperties finds triples where a class or property is expected in the object position and then looks whether that “object resource” is accessible for the VocabularyReader. If a resource was found, it does not check whether the resource actually is a class or property. (Example below.)
So we need to check whether the data we found for that resource (usually: the data we downloaded from the object URI) contains something that convinces us that it is an rdfs:Class or an owl:Class, or an rdf:Property. (Note that if something is an owl:Class it is also an rdfs:Class, and that OWL defines a lot of special cases of rdf:Property, such as owl:ObjectProperty or owl:TransitiveProperty. I can write down the full list here once we are starting to implement this; please let me know.)
Let <o> be the URI of the object. From just looking at the data, without doing OWL reasoning, we can look for, e.g., <o> rdf:type owl:Class and will know that the triple <...> rdf:type <o> is a “good” triple w.r.t. this metric. We can even look for <o> ?p ?o and will know that:
If ?p is rdfs:subClassOf, owl:unionOf, owl:intersectionOf, owl:equivalentClass or owl:oneOf, then <o> is an rdfs:Class.
If ?p is rdfs:domain, rdfs:subPropertyOf, rdfs:range, owl:propertyDisjointWith or owl:equivalentProperty, then <o> is an rdf:Property.
If ?p is owl:disjointUnionOf, owl:complementOf, owl:disjointWith or owl:hasKey, then <o> is an rdfs:Class.
If ?p is owl:inverseOf or owl:propertyChainAxiom, then <o> is an rdf:Property.
Example: imagine a triple <...> rdf:type socialnetwork:Alice where socialnetwork:Alice rdf:type foaf:Person, i.e. socialnetwork:Alice is actually not an owl:Class but an owl:Individual (which is declared to be disjoint with owl:Class). This is a “bad triple” even if socialnetwork:Alice is defined.
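The inference rules for triples where <o> appears in the subject position could be encoded as two lookup tables, sketched here (illustrative names; prefixed names abbreviate the full URIs):

```java
import java.util.Set;

// Sketch: infer whether <o> is a class or a property from a predicate
// that <o> appears as the subject of. Not existing project code.
class SubjectKindInference {
    // If <o> is the subject of one of these, <o> is an rdfs:Class.
    static final Set<String> CLASS_INDICATORS = Set.of(
        "rdfs:subClassOf", "owl:unionOf", "owl:intersectionOf",
        "owl:equivalentClass", "owl:oneOf", "owl:disjointUnionOf",
        "owl:complementOf", "owl:disjointWith", "owl:hasKey");

    // If <o> is the subject of one of these, <o> is an rdf:Property.
    static final Set<String> PROPERTY_INDICATORS = Set.of(
        "rdfs:domain", "rdfs:subPropertyOf", "rdfs:range",
        "owl:propertyDisjointWith", "owl:equivalentProperty",
        "owl:inverseOf", "owl:propertyChainAxiom");

    static String inferKind(String predicateOfO) {
        if (CLASS_INDICATORS.contains(predicateOfO)) return "class";
        if (PROPERTY_INDICATORS.contains(predicateOfO)) return "property";
        return "unknown";
    }
}
```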
Implement a metric LabelsUsingCapitals
that identifies triples whose property is from a pre-configured list of label properties (a subset of the annotation properties from #32), and whose object uses a bad style of capitalisation.
We consider the following widely used label properties:
http://www.w3.org/2004/02/skos/core#altLabel
http://www.w3.org/2004/02/skos/core#hiddenLabel
http://www.w3.org/2004/02/skos/core#prefLabel
http://www.w3.org/2000/01/rdf-schema#label
For now, this list of properties can be hard-coded (maybe somehow shared with #32); we might think about a more extensible implementation later.
For now we define "bad" capitalisation as "camel case", for which we should design a regular expression to match such strings. Consider, e.g., a label "InterestingThing": this is a suitable name for a class/resource, but the label should rather be "interesting thing" or "Interesting Thing".
E.g. a triple like the following should be matched:
<http://...> <http://www.w3.org/2000/01/rdf-schema#label> "InterestingThing" .
The metric value is defined as the ratio of labels with "bad capitalisation" to all labels (i.e. all triples having such properties).
Note: in the cleaning UI, triples that match this metric should be reported as non-critical errors.
(Background: D3.1 Table 20 on page 91)
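One possible regular expression for the camel-case check (a first-attempt sketch with illustrative names; note it also flags all-caps acronyms such as "ABC", which we may want to treat differently):

```java
import java.util.regex.Pattern;

// Sketch of the camel-case detector for LabelsUsingCapitals.
// Names are illustrative, not the project's actual API.
class LabelsUsingCapitalsSketch {
    // Two or more capitalised word parts run together without spaces.
    private static final Pattern CAMEL_CASE =
        Pattern.compile("\\p{Lu}\\p{Ll}*(\\p{Lu}\\p{Ll}*)+");

    static boolean isBadCapitalisation(String label) {
        return CAMEL_CASE.matcher(label).matches();
    }
}
```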
Let's think about a declarative language for quality metrics.
I.e. that large parts of the implementation of a new metric would be implemented in the form of a dataset that's an instance of the daQ vocabulary.
In pseudo code e.g. a declarative representation of the UndefinedClassesOrProperties metric could look like this:
IF TRIPLE MATCHES ?s rdf:type|rdfs:subClassOf|rdfs:domain|rdfs:range ?c
# ^^^ This would be a SPARQL graph pattern
THEN CHECK
# Here we could use a SPARQL FILTER expression:
(dqf:DereferenceableAsLOD(?c)
|| dqf:ExistsLocallyInThisDataset(?c)
|| dqf:OtherwiseKnownToUs(?c))
&& dqf:QuerySucceeds(?c a owl:Class)
# ^^^ once more a SPARQL graph pattern
# Actually this check is more complex
# but I'll leave it like this for now for the example
Complex operators like DereferenceableAsLOD or ExistsLocallyInThisDataset or QuerySucceeds would be realised as custom SPARQL functions with a Java implementation, reusing code from methods we already have. (I used dqf
for our custom namespace of “data quality functions”.)
Compare page 7 of http://svn.aksw.org/papers/2013/ISWC_LODStats/public.pdf. They get by without complex operators, but their task is simpler than ours.
This language could include elements for generating problem reports, which we need for cleaning. (@jerdeb @nfriesen please edit this into "quality report" if that's the correct term)
We need to decide on a license under which we are releasing our code. To figure out a reasonable one I suggest the following process:
Once decided, there should be a LICENSE file in the top-level directory, and a short reference to the license in all source files. Compare https://github.com/formare/auctions/blob/master/isabelle/Auction/Partitions.thy from my former project (just forget about the dual licensing, which is specific to the requirements of that project).
Implement a metric ValidOWL (in the category of Intrinsic dimensions; Consistency dimension) that determines whether the given RDF dataset is a valid OWL ontology.
At the very least this metric should return a value of true or false.
In Jena it should be possible to try having an RDF graph parsed as OWL (which means that additional consistency rules are checked), and to obtain error messages if the RDF graph is not valid OWL.
After this basic step we might be able to go a step further and determine the ratio of triples that are invalid w.r.t. the OWL semantics. E.g. owl:Class owl:Class owl:Class . is a valid RDF triple, but doesn't make sense in OWL. Jena might be able to give us a list of such invalid triples for free. If Jena doesn't do it, maybe the OWL API does. (Not sure it supports streaming; let's find out.)
@nfriesen: Before we invest a lot of effort into using the OWL API, let's talk to the Repairing partners.
@muhammadaliqasmi: a note about the second step: If we manage to identify all individual triples that are not valid OWL, this also covers the job of MisusedOwlDatatypeOrObjectProperties, i.e. MisusedOwlDatatypeOrObjectProperties is a special case of "finding all triples that are not valid OWL", and thus we could refactor it to make it reuse some of the implementation of the ValidOWL metric, so that we only need to run the OWL parser once.
(Background: D3.1 Table 20 on page 90)
For, e.g., the DuplicateInstance metric we are currently keeping a complete record of all instances found so far in memory. For huge datasets we might have to do some fuzzy approximation, similar to LODStats, i.e. we throw away part of the full details we have in memory and replace them by fuzzier approximations that consume less memory.
Documentation for D5.2
https://docs.google.com/document/d/1mZBXm7G33xWWf1zLz_aAuL-HDn4nAU5RZUMCpcEyL38/edit
… where "master" in practice means http://datahub.io.
Maybe such an extension exists already. This feature is helpful towards achieving our goals but not really related to quality, so it has a low priority.
Of course any such pulled datasets should be fed into the quality metrics computation as said in #7.
getQualityProblems looks the same for all metrics (and is currently copied). Can't we centralise it into some common superclass instead of only having a Metric super-interface?
Some metrics do their initialisation in a before method, others don't. If this work were done in a constructor instead, it would happen automatically when initialising the metric.
CurrencyDocumentStatements and TimeSinceModification are taking a lot of time to compute on large datasets.
Whenever a dataset is added or updated in the local CKAN installation, it should automatically be fed into the quality metrics computation machinery.
Design Rest API which will act as an interface between the UI/Repository and the metrics computation
API design and messages: 6.1.1 has a RESTful API with JSON I/O. Our API will be heterogeneous I guess. I could imagine:
Task 3.1 deals with repairing and cleaning data sets according to their quality. While repairing is about purely logical constraints, cleaning intends to fix problems related to the data itself. One possible example of a cleaning task is to check whether a literal's type corresponds to the type defined by the schema. The first step towards cleanup concerns the detection of quality problems. Since the estimation of quality will be done in task 5.2, only the detailed quality problem descriptions are missing and should be implemented.
Specification and implementation of monitoring, synchronization and Repairing services
I'm not following the literature here but rather just my own intuition. @jerdeb, could you please compare this metric's implementation with the literature, and post comments to this issue as appropriate?
The current implementation looks into the quad's subject, of which I'm not sure it's necessary, as when you reuse an ontology (and don't hijack namespaces, for which we have a separate metric) you usually don't redefine its classes/properties.
The current implementation also assumes that for a property to be defined the property must have a domain and a range. However in OWL ontologies it's common that properties are declared subproperties of other properties, or instances of "object property", or "transitive property", etc., and that's perfectly sufficient for a property to "be defined".
Also I think that checking whether the object is a defined class is only of interest when the predicate is rdf:type. If the predicate is, say, foaf:knows, the object could be anything, e.g. any other instance from our dataset, and we don't care. At least not for this metric.
If datasets do not only consist of instance data but also define some of their local vocabulary, we have a special case. In this case we might also inspect the objects of triples whose predicate is, e.g., rdfs:subClassOf, to see whether the object is a class defined in some ontology. @jerdeb we should discuss whether we want to support this case.
Implement outProblematicInstancesToStream() method for the metrics SPARQLAccessibility and RDFAccessibility.
The corresponding quality problems are already created in the QR vocabulary.
The collection of problematic triples is already implemented.
@jeremy: please check
Explore the possibility of finding the type of api interface (SPARQL endpoint, rdf dump etc...) of a dataset in datahub.io. What is required here is that (if possible) at the end we can create some SPARQL query which returns for example all datasets with a SPARQL endpoint interface (this should include those which are labeled api/sparql, void/sparql, and sparql) and those which contain an rdf dump as well.
We also need to find out if there are any formats which can be used as streamable data (such as RDF dumps).
Consider "format SPARQL" (http://datahub.io/dataset?res_format=api/sparql&_tags_limit=0) vs. "VoID SPARQL endpoint" (http://datahub.io/dataset?tags=void-sparql-endpoint&_tags_limit=0), and consider datasets that do have a SPARQL endpoint (http://datahub.io/dataset/l3s-dblp) but are not correctly tagged.
As discussed with @jerdeb: several rdfs:labels in the dqm vocabulary (e.g. [this one](https://github.com/diachron/quality/blob/master/src/main/resources/vocabularies/dqm/dqm.trig#L235)) are rather machine- than human-friendly. Please rephrase them. Often this seems as easy as RemovingTheCamelCase.
The metric looks somehow not in line with what Hogan et al. (http://aidanhogan.com/docs/pedantic_ldow10.pdf) suggested. Please take a look at the paper. Also, make sure to reflect the new changes in the CommonDataStructure class (similar to the one I fixed myself in the Dereferencability metric).
For content types, please look at the WebContent class provided by jena (https://jena.apache.org/documentation/javadoc/arq/org/apache/jena/riot/WebContent.html) and (http://pedantic-web.org/fops.html#contenttype).
Check out http://validator.linkeddata.org/vapour ; they have an API which we can use directly in Java via maven. It might be useful for us for this metric and the Dereferencability metric.
This requires an urgent fix.
Identify annotations (using the same properties as in #32) whose objects have leading or trailing whitespace (use the regular expression \s), e.g.
<http://...> <http://www.w3.org/2000/01/rdf-schema#comment> " this is new " .
WhitespaceInAnnotation is a metric in the category of Representational dimensions; Understandability dimension.
The metric value is defined as the ratio of annotations with whitespace to all annotations (i.e. all triples having such properties).
Some of the implementation may be shared with #32.
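A possible shape for the whitespace check, using \s as suggested above (names are illustrative, not the project's actual API):

```java
import java.util.regex.Pattern;

// Sketch: flag annotation values with leading or trailing whitespace.
class WhitespaceInAnnotationSketch {
    // DOTALL so that multi-line annotation values are handled too.
    private static final Pattern EDGE_WHITESPACE =
        Pattern.compile("^\\s.*|.*\\s$", Pattern.DOTALL);

    static boolean hasEdgeWhitespace(String value) {
        return EDGE_WHITESPACE.matcher(value).matches();
    }
}
```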
after we can compute some basic metrics
A more advanced UI should only be implemented once we know our relation to https://github.com/eccenca/ckanext-diachron, as probably some features (e.g. filtering datasets by quality) would best be added as extensions to the CKAN software
In https://github.com/diachron/quality/blob/master/src/main/java/de/unibonn/iai/eis/diachron/qualitymetrics/intrinsic/consistency/MisuseOwlDatatypeOrObjectProperties.java please fix the following:
the metric should be called MisusedOwlDatatypeOrObjectProperties
the OWL namespace is http://www.w3.org/2002/07/owl# (FYI, http://prefix.cc is a helpful service for finding out such well-known URIs)
names like DatatypeProperty are case sensitive.
Please refer to the Amrapali survey paper for more information.
The HomogeneousDatatypesMetric is giving values such as 5.933908131234312E-6 (this might be correct, but it is strange that it is the only metric giving us such a result).
The Dereferencibility metric is taking a lot of time to compute on EBI datasets
This is mainly for @jerdeb.
We will proceed by studying the data model specification in D1.3, and then figuring out how algorithms (such as metrics) can be implemented on top of the Diachron data model.
The MisplacedClassesOrProperties metric is taking a lot of time to compute on EBI datasets
In https://jena.apache.org/documentation/notes/typed-literals.html (@nfriesen, thanks for pointing out this helpful guide!), I think the following sections will help us to get beyond built-in XSD data types such as numbers or dates:
I'm initially assigning this Issue to you. Later on, you may want to split into more specific per-datatype Issues assigned to Ali.
Once we know how to handle built-in XSD data types, such as numbers or dates, we are planning to proceed to things like percentages, ISBNs, email addresses, etc. These can be defined in XML Schema as restrictions of base types, such as numbers within a range (integer percentage = integer between 0 and 100), or strings that match regular expressions (an easy way to handle email addresses or credit card numbers or to approximate ISBNs).
For a more thorough check, things like ISBNs or possibly email addresses require further work. The last digit of an ISBN is a checksum, which needs to be computed from the other digits. If we wanted to validate email addresses by checking whether the mail server responds, that would also require a custom implementation.
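For illustration, the ISBN-10 checksum mentioned above can be computed like this (a self-contained sketch, not project code):

```java
// Illustrative ISBN-10 validator: the last digit is a checksum; the
// weighted digit sum must be divisible by 11, where a final 'X'
// stands for the value 10.
class IsbnCheck {
    static boolean isValidIsbn10(String isbn) {
        if (isbn.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = isbn.charAt(i);
            int digit;
            if (c == 'X' && i == 9) digit = 10;         // 'X' only allowed as check digit
            else if (Character.isDigit(c)) digit = c - '0';
            else return false;
            sum += digit * (10 - i);                    // weights 10, 9, ..., 1
        }
        return sum % 11 == 0;
    }
}
```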
Uni Bonn is responsible for task 3.2, which aims to identify temporally related data sets. This is a very challenging task, and there exist many different approaches to check whether one data set is an older version of another. The easiest ones are:
We need to parse ntrig (data dumps) and SPARQL endpoints in the most efficient way. Possible readings: LODStats (http://jens-lehmann.org/files/2012/ekaw_lodstats.pdf)
i.e. an extension of the existing faceted browsing facilities (e.g. filtering by license).
Please implement the outProblematicInstancesToStream() method for the EmptyAnnotationValue, LabelsUsingCapitals and WhitespaceInAnnotation metrics. The corresponding quality problems already exist in the QR vocabulary.
What is the … that's needed for computing one metric?
Maybe figure out by implementing the concrete computation of one metric, and then abstracting it into an interface. Then, further metrics could instantiate this interface.
Reuse idea of https://github.com/AKSW/LODStats/blob/master/lodstats/stats/RDFStatInterface.py but implement in Java (as most concrete metrics have already been implemented in Java)
Add latest extension of Data Cube for generating daq triples
Implement the following metrics and their subsequent test classes:
SPARQL Accessibility
RDF Accessibility
Dereferencibility
Unstructured Data
(Implementation details can be found http://goo.gl/pTvUW4)