diachron / quality
Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
License: MIT License
This is a Reputation Dimension.
For this metric we need to check whether the ontologies used are part of the OBO Foundry (therefore we only need to check the type of each instance).
For more information check D5.1
In the class comment, mention that this metric is specific to the EBI use-case
Implement a metric EmptyAnnotationValue
(in the category of Representational dimensions; Understandability dimension) that identifies triples whose property is from a pre-configured list of annotation properties, and whose object is an empty string.
We consider the following widely used annotation properties (labels, comments, notes, etc.):
http://www.w3.org/2004/02/skos/core#altLabel
http://www.w3.org/2004/02/skos/core#hiddenLabel
http://www.w3.org/2004/02/skos/core#prefLabel
http://www.w3.org/2004/02/skos/core#changeNote
http://www.w3.org/2004/02/skos/core#definition
http://www.w3.org/2004/02/skos/core#editorialNote
http://www.w3.org/2004/02/skos/core#example
http://www.w3.org/2004/02/skos/core#historyNote
http://www.w3.org/2004/02/skos/core#note
http://www.w3.org/2004/02/skos/core#scopeNote
http://purl.org/dc/terms/description
http://purl.org/dc/elements/1.1/description
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2000/01/rdf-schema#comment
For now, this list of properties can be hard-coded; we might think about a more extensible implementation later.
E.g. a triple like the following should be matched:
<http://...> <http://www.w3.org/2000/01/rdf-schema#comment> "" .
The metric value is defined as the ratio of annotations with empty objects to all annotations (i.e. all triples having such properties).
(Background: D3.1 Table 20 on page 91)
Cc: @nfriesen
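A minimal counting sketch of this metric, assuming a streaming, per-triple interface (class, method and field names are illustrative, not the project's actual API; the property list is abbreviated):

```java
import java.util.Set;

// Minimal sketch of the EmptyAnnotationValue metric. Names are
// illustrative, not the project's actual API.
class EmptyAnnotationValueSketch {

    // Hard-coded annotation properties, as suggested above (abbreviated).
    private static final Set<String> ANNOTATION_PROPERTIES = Set.of(
        "http://www.w3.org/2004/02/skos/core#prefLabel",
        "http://www.w3.org/2004/02/skos/core#altLabel",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "http://www.w3.org/2000/01/rdf-schema#comment");

    private long annotations = 0;
    private long emptyAnnotations = 0;

    // Called once per triple whose object is a literal.
    void assess(String predicateUri, String objectLexicalForm) {
        if (!ANNOTATION_PROPERTIES.contains(predicateUri)) return;
        annotations++;
        if (objectLexicalForm.isEmpty()) emptyAnnotations++;
    }

    // Ratio of annotations with empty objects to all annotations.
    double metricValue() {
        return annotations == 0 ? 0.0 : (double) emptyAnnotations / annotations;
    }
}
```

The final ratio would be reported once the whole dataset has been streamed through assess().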
Remove required JARs and add them as maven dependencies
The predicate of a quad can be an undefined property, and the object of a quad can be an undefined class or an undefined property when the quad's predicate is one out of the list given below.
The subject of a quad never references classes or properties in external vocabularies, so we don't have to analyse the subject for this metric.
This is the list of predicates that indicate that the object must be a defined class:
rdf:type
(FYI, just supporting this one is sufficient for most LOD datasets. The following are only relevant when LOD datasets define their own vocabulary, or in the case that a vocabulary/ontology happens to be implemented as a LOD dataset.)
rdfs:domain
rdfs:range
rdfs:subClassOf
owl:allValuesFrom
owl:someValuesFrom
owl:equivalentClass
owl:complementOf
owl:onClass
owl:disjointWith
This is the list of predicates that indicate that the object must be a defined property:
rdfs:subPropertyOf
owl:onProperty
owl:assertionProperty
owl:equivalentProperty
owl:propertyDisjointWith
In all of the cases above, "being defined" may also mean "defined in the current LOD dataset" (but we can assume that a class/property is defined at an earlier position in the current dataset, i.e. at a position that we have processed already). I.e. "being defined" does not only mean "defined in some external ontology".
FYI there are some more predicates for which we don't know whether the object is expected to be a class or property, but we'll ignore these predicates for now.
BTW, the current implementation for predicate and object looks a bit redundant to me; maybe we can shorten it by factoring out some of the common source code lines into a shared method.
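The factoring-out suggested above could take the shape of a shared lookup method, sketched here (all names are illustrative; prefixed names abbreviate the full URIs):

```java
import java.util.Set;

// Shared classification of what kind of resource a quad's object must
// be, given the quad's predicate (per the two lists above). Sketch only.
class ExpectedObjectKind {
    // Predicates whose object must be a defined class.
    static final Set<String> CLASS_EXPECTING = Set.of(
        "rdf:type", "rdfs:domain", "rdfs:range", "rdfs:subClassOf",
        "owl:allValuesFrom", "owl:someValuesFrom", "owl:equivalentClass",
        "owl:complementOf", "owl:onClass", "owl:disjointWith");

    // Predicates whose object must be a defined property.
    static final Set<String> PROPERTY_EXPECTING = Set.of(
        "rdfs:subPropertyOf", "owl:onProperty", "owl:assertionProperty",
        "owl:equivalentProperty", "owl:propertyDisjointWith");

    enum Kind { CLASS, PROPERTY, NONE }

    static Kind expectedObjectKind(String predicate) {
        if (CLASS_EXPECTING.contains(predicate)) return Kind.CLASS;
        if (PROPERTY_EXPECTING.contains(predicate)) return Kind.PROPERTY;
        return Kind.NONE;
    }
}
```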
Create an interface to access the available datasets - mainly SPARQL endpoints, RDF dumps.
These metrics might have partly overlapping implementations. This is just a superficial impression I got and should only be reviewed when we do a complete review of the implementation.
UnstructuredData speaks of “dead URIs”, whereas a dead link is something that's not dereferenceable. My intuitive understanding of UnstructuredData is rather that the link works, but provides, say, HTML instead of RDF.
by selecting ?s ?p ?o with a LIMIT/OFFSET (and ORDER BY if necessary)
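The paging idea can be sketched as a query-string builder (a hypothetical helper, not existing project code):

```java
// Hypothetical helper for paging through all triples of a SPARQL
// endpoint; not existing project code.
class PagedQuery {
    // ORDER BY gives a deterministic order, so that successive
    // LIMIT/OFFSET windows neither overlap nor skip triples.
    static String pagedTripleQuery(long limit, long offset) {
        return "SELECT ?s ?p ?o WHERE { ?s ?p ?o } "
             + "ORDER BY ?s ?p ?o "
             + "LIMIT " + limit + " OFFSET " + offset;
    }
}
```

The caller would increase the offset by the page size until a page comes back with fewer than `limit` results.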
The deadline for the first draft is 22 July.
Please create tests for the availability metrics
Would it be possible to use existing libraries rather than reimplementing (or reusing) the classes? For graph algorithms there is the JUNG library (http://jung.sourceforge.net), which I have used and find very useful. If you require it, I can add the pom dependency for you.
This is a Reputation Dimension and a ComplexQualityMetric.
For this metric we need to check if the dataset resources are hosted in a reputable source.
The list of reputable sources should be "loaded" in the before method.
The reputable sources are (links to owl files):
Ontology for Biomedical Investigations http://purl.obolibrary.org/obo/obi.owl
Cell Type Ontology http://purl.obolibrary.org/obo/cto.owl
Gene Ontology http://purl.obolibrary.org/obo/go.owl
PATO http://purl.obolibrary.org/obo/pato.owl
ChEBI http://purl.obolibrary.org/obo/chebi.owl
ORDO http://www.orphadata.org/data/ORDO/ordo_orphanet.owl.zip (note this one is zipped)
IAO http://purl.obolibrary.org/obo/iao.owl
NCBI Taxon http://purl.obolibrary.org/obo/ncbitaxon.owl (warning, this is a very big file!)
Uberon http://purl.obolibrary.org/obo/uberon.owl
Unit Ontology http://purl.obolibrary.org/obo/uo.owl
Software Ontology http://sourceforge.net/projects/theswo/files/SWO%20ontology%20release (sorry zipped again)
@nfriesen summarising what we discussed: at least for cleaning it will make sense to split UndefinedClassesOrProperties into UndefinedClasses and UndefinedProperties, as otherwise, if a triple <s> _:undefinedProperty _:undefinedClass is reported to be problematic w.r.t. UndefinedClassesOrProperties, it will be impossible to find out whether the predicate or the object is the culprit.
If you think you'll need this for cleaning soon, could you please discuss with @jerdeb, and then rephrase this issue into a more concrete instruction for implementation?
I realised that the OntologyHijacking metric is not completely correct.
@clange I would like to discuss with you the implementation of the metric
The HighThroughput and LowLatency metrics are taking a lot of time to compute on EBI datasets
Check out datacube and LODStats. The subset used in LODStats might be sufficient for us to use to improve the vocabulary
The current implementation of UndefinedClassesOrProperties finds triples where a class or property is expected in the object position and then looks whether that “object resource” is accessible for the VocabularyReader. If a resource was found, it does not check whether the resource actually is a class or property. (Example below.)
So we need to check whether the data we found for that resource (usually: the data we downloaded from the object URI) contains something that convinces us that it is an rdfs:Class or an owl:Class, or an rdf:Property. (Note that if something is an owl:Class it is also an rdfs:Class, and that OWL defines a lot of special cases of rdf:Property, such as owl:ObjectProperty or owl:TransitiveProperty. I can write down the full list here once we are starting to implement this; please let me know.)
Let <o> be the URI of the object. From just looking at the data, without doing OWL reasoning, we can look for, e.g., <o> rdf:type owl:Class and will know that the triple <...> rdf:type <o> is a “good” triple w.r.t. this metric. We can even look for <o> ?p ?o and will know that:
If ?p is rdfs:subClassOf, owl:unionOf, owl:intersectionOf, owl:equivalentClass or owl:oneOf, then <o> is an rdfs:Class.
If ?p is rdfs:domain, rdfs:subPropertyOf, rdfs:range, owl:propertyDisjointWith or owl:equivalentProperty, then <o> is an rdf:Property.
If ?p is owl:disjointUnionOf, owl:complementOf, owl:disjointWith or owl:hasKey, then <o> is an rdfs:Class.
If ?p is owl:inverseOf or owl:propertyChainAxiom, then <o> is an rdf:Property.
Example: imagine a triple <...> rdf:type socialnetwork:Alice where socialnetwork:Alice rdf:type foaf:Person, i.e. socialnetwork:Alice is actually not an owl:Class but an owl:Individual (which is declared to be disjoint with owl:Class). This is a “bad triple” even if socialnetwork:Alice is defined.
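The inference rules for triples where <o> appears in the subject position could be encoded as two lookup tables, sketched here (illustrative names; prefixed names abbreviate the full URIs):

```java
import java.util.Set;

// Sketch: infer whether <o> is a class or a property from a predicate
// that <o> appears as the subject of. Not existing project code.
class SubjectKindInference {
    // If <o> is the subject of one of these, <o> is an rdfs:Class.
    static final Set<String> CLASS_INDICATORS = Set.of(
        "rdfs:subClassOf", "owl:unionOf", "owl:intersectionOf",
        "owl:equivalentClass", "owl:oneOf", "owl:disjointUnionOf",
        "owl:complementOf", "owl:disjointWith", "owl:hasKey");

    // If <o> is the subject of one of these, <o> is an rdf:Property.
    static final Set<String> PROPERTY_INDICATORS = Set.of(
        "rdfs:domain", "rdfs:subPropertyOf", "rdfs:range",
        "owl:propertyDisjointWith", "owl:equivalentProperty",
        "owl:inverseOf", "owl:propertyChainAxiom");

    static String inferKind(String predicateOfO) {
        if (CLASS_INDICATORS.contains(predicateOfO)) return "class";
        if (PROPERTY_INDICATORS.contains(predicateOfO)) return "property";
        return "unknown";
    }
}
```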
Implement a metric LabelsUsingCapitals
that identifies triples whose property is from a pre-configured list of label properties (a subset of the annotation properties from #32), and whose object uses a bad style of capitalisation.
We consider the following widely used label properties:
http://www.w3.org/2004/02/skos/core#altLabel
http://www.w3.org/2004/02/skos/core#hiddenLabel
http://www.w3.org/2004/02/skos/core#prefLabel
http://www.w3.org/2000/01/rdf-schema#label
For now, this list of properties can be hard-coded (maybe somehow shared with #32); we might think about a more extensible implementation later.
For now we define "bad" capitalisation as "camel case", for which we should design a regular expression to match such strings. Consider, e.g., a label "InterestingThing": this is a suitable name for a class/resource, but the label should rather be "interesting thing" or "Interesting Thing".
E.g. a triple like the following should be matched:
<http://...> <http://www.w3.org/2000/01/rdf-schema#label> "InterestingThing" .
The metric value is defined as the ratio of labels with "bad capitalisation" to all labels (i.e. all triples having such properties).
Note: in the cleaning UI, triples that match this metric should be reported as non-critical errors.
(Background: D3.1 Table 20 on page 91)
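One possible regular expression for the camel-case check (a first-attempt sketch with illustrative names; note it also flags all-caps acronyms such as "ABC", which we may want to treat differently):

```java
import java.util.regex.Pattern;

// Sketch of the camel-case detector for LabelsUsingCapitals.
// Names are illustrative, not the project's actual API.
class LabelsUsingCapitalsSketch {
    // Two or more capitalised word parts run together without spaces.
    private static final Pattern CAMEL_CASE =
        Pattern.compile("\\p{Lu}\\p{Ll}*(\\p{Lu}\\p{Ll}*)+");

    static boolean isBadCapitalisation(String label) {
        return CAMEL_CASE.matcher(label).matches();
    }
}
```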
Let's think about a declarative language for quality metrics.
I.e. that large parts of the implementation of a new metric would be implemented in the form of a dataset that's an instance of the daQ vocabulary.
In pseudo code e.g. a declarative representation of the UndefinedClassesOrProperties metric could look like this:
IF TRIPLE MATCHES ?s rdf:type|rdfs:subClassOf|rdfs:domain|rdfs:range ?c
# ^^^ This would be a SPARQL graph pattern
THEN CHECK
# Here we could use a SPARQL FILTER expression:
(dqf:DereferenceableAsLOD(?c)
|| dqf:ExistsLocallyInThisDataset(?c)
|| dqf:OtherwiseKnownToUs(?c))
&& dqf:QuerySucceeds(?c a owl:Class)
# ^^^ once more a SPARQL graph pattern
# Actually this check is more complex
# but I'll leave it like this for now for the example
Complex operators like DereferenceableAsLOD or ExistsLocallyInThisDataset or QuerySucceeds would be realised as custom SPARQL functions with a Java implementation, reusing code from methods we already have. (I used dqf
for our custom namespace of “data quality functions”.)
Compare page 7 of http://svn.aksw.org/papers/2013/ISWC_LODStats/public.pdf. They get by without complex operators, but their task is simpler than ours.
This language could include elements for generating problem reports, which we need for cleaning. (@jerdeb @nfriesen please edit this into "quality report" if that's the correct term)
We need to decide on a license under which we are releasing our code. To figure out a reasonable one I suggest the following process:
Once decided, there should be a LICENSE file in the top-level directory, and a short reference to the license in all source files. Compare https://github.com/formare/auctions/blob/master/isabelle/Auction/Partitions.thy from my former project (just forget about the dual licensing, which is specific to the requirements of that project).
Implement a metric ValidOWL (in the category of Intrinsic dimensions; Consistency dimension) that determines whether the given RDF dataset is a valid OWL ontology.
At the very least this metric should return a value of true or false.
In Jena it should be possible to try having an RDF graph parsed as OWL (which means that additional consistency rules are checked), and to obtain error messages if the RDF graph is not valid OWL.
After this basic step we might be able to go a step further and determine the ratio of triples that are invalid w.r.t. the OWL semantics. E.g. owl:Class owl:Class owl:Class . is a valid RDF triple, but doesn't make sense in OWL. Jena might be able to give us a list of such invalid triples for free. If Jena doesn't do it, maybe the OWL API does. (Not sure it supports streaming; let's find out.)
@nfriesen: Before we invest a lot of effort into using the OWL API, let's talk to the Repairing partners.
@muhammadaliqasmi: a note about the second step: If we manage to identify all individual triples that are not valid OWL, this also covers the job of MisusedOwlDatatypeOrObjectProperties, i.e. MisusedOwlDatatypeOrObjectProperties is a special case of "finding all triples that are not valid OWL", and thus we could refactor it to make it reuse some of the implementation of the ValidOWL metric, so that we only need to run the OWL parser once.
(Background: D3.1 Table 20 on page 90)
For, e.g., the DuplicateInstance metric we are currently keeping a complete record of all instances found so far in memory. For huge datasets we might have to do some fuzzy approximation, similar to LODStats, i.e. we throw away part of the full details we have in memory and replace them by fuzzier approximations that consume less memory.
Documentation for D5.2
https://docs.google.com/document/d/1mZBXm7G33xWWf1zLz_aAuL-HDn4nAU5RZUMCpcEyL38/edit
… where "master" in practice means http://datahub.io.
Maybe such an extension exists already. This feature is helpful towards achieving our goals but not really related to quality, so it has a low priority.
Of course any such pulled datasets should be fed into the quality metrics computation as said in #7.
getQualityProblems looks the same for all metrics (and is currently copied). Can't we centralise it into some common superclass instead of only having a Metric super-interface?
Some metrics do their initialisation in a before method, others don't. If this work were done in a constructor instead, it would happen automatically when initialising the metric.
CurrencyDocumentStatements and TimeSinceModification are taking a lot of time to compute on large datasets.
Whenever a dataset is added or updated in the local CKAN installation, it should automatically be fed into the quality metrics computation machinery.
Design Rest API which will act as an interface between the UI/Repository and the metrics computation
API design and messages: 6.1.1 has a RESTful API with JSON I/O. Our API will be heterogeneous I guess. I could imagine:
Task 3.1 deals with repairing and cleaning data sets according to their quality. While repairing is about purely logical constraints, cleaning intends to fix problems related to the data itself. One possible example of a cleaning task is to check whether a literal's type corresponds to the type defined by the schema. The first step towards cleanup concerns the detection of quality problems. Since the estimation of quality will be done in task 5.2, only the detailed quality problem descriptions are missing and should be implemented.
Specification and implementation of monitoring, synchronization and Repairing services
I'm not following the literature here but rather just my own intuition. @jerdeb, could you please compare this metric's implementation with the literature, and post comments to this issue as appropriate?
The current implementation looks into the quad's subject, of which I'm not sure it's necessary, as when you reuse an ontology (and don't hijack namespaces, for which we have a separate metric) you usually don't redefine its classes/properties.
The current implementation also assumes that for a property to be defined the property must have a domain and a range. However in OWL ontologies it's common that properties are declared subproperties of other properties, or instances of "object property", or "transitive property", etc., and that's perfectly sufficient for a property to "be defined".
Also I think that checking whether the object is a defined class is only of interest when the predicate is rdf:type. If the predicate is, say, foaf:knows, the object could be anything, e.g. any other instance from our dataset, and we don't care. At least not for this metric.
If datasets do not only consist of instance data but also define some of their local vocabulary, we have a special case. In this case we might also inspect the objects of triples whose predicate is, e.g., rdfs:subClassOf, to see whether the object is a class defined in some ontology. @jerdeb we should discuss whether we want to support this case.
Implement outProblematicInstancesToStream() method for the metrics SPARQLAccessibility and RDFAccessibility.
The corresponding quality problems are already created in the QR vocabulary.
The collection of problematic triples is already implemented.
@jeremy: please check
Explore the possibility of finding the type of api interface (SPARQL endpoint, rdf dump etc...) of a dataset in datahub.io. What is required here is that (if possible) at the end we can create some SPARQL query which returns for example all datasets with a SPARQL endpoint interface (this should include those which are labeled api/sparql, void/sparql, and sparql) and those which contain an rdf dump as well.
We also need to find out if there are any formats which can be used as streamable data (such as RDF dumps).
Consider "format SPARQL" (http://datahub.io/dataset?res_format=api/sparql&_tags_limit=0) vs. "VoID SPARQL endpoint" (http://datahub.io/dataset?tags=void-sparql-endpoint&_tags_limit=0), and consider datasets that do have a SPARQL endpoint (http://datahub.io/dataset/l3s-dblp) but are not correctly tagged.
As discussed with @jerdeb: several rdfs:labels in the dqm vocabulary (e.g. [this one](https://github.com/diachron/quality/blob/master/src/main/resources/vocabularies/dqm/dqm.trig#L235)) are rather machine- than human-friendly. Please rephrase them. Often this seems as easy as RemovingTheCamelCase.
The metric looks somehow not in line with what Hogan et al. (http://aidanhogan.com/docs/pedantic_ldow10.pdf) suggested. Please take a look at the paper. Also, make sure to reflect the new changes in the CommonDataStructure class (similar to the one I fixed myself in the Dereferencability metric).
For content types, please look at the WebContent class provided by jena (https://jena.apache.org/documentation/javadoc/arq/org/apache/jena/riot/WebContent.html) and (http://pedantic-web.org/fops.html#contenttype).
Check out http://validator.linkeddata.org/vapour ; they have an API which we can use directly in Java via maven. It might be useful for us for this metric and the Dereferencability metric.
This requires an urgent fix.
Identify annotations (using the same properties as in #32) whose objects have leading or trailing whitespace (use the regular expression \s), e.g.
<http://...> <http://www.w3.org/2000/01/rdf-schema#comment> " this is new " .
WhitespaceInAnnotation is a metric in the category of Representational dimensions; Understandability dimension.
The metric value is defined as the ratio of annotations with whitespace to all annotations (i.e. all triples having such properties).
Some of the implementation may be shared with #32.
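A possible shape for the whitespace check, using \s as suggested above (names are illustrative, not the project's actual API):

```java
import java.util.regex.Pattern;

// Sketch: flag annotation values with leading or trailing whitespace.
class WhitespaceInAnnotationSketch {
    // DOTALL so that multi-line annotation values are handled too.
    private static final Pattern EDGE_WHITESPACE =
        Pattern.compile("^\\s.*|.*\\s$", Pattern.DOTALL);

    static boolean hasEdgeWhitespace(String value) {
        return EDGE_WHITESPACE.matcher(value).matches();
    }
}
```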
after we can compute some basic metrics
A more advanced UI should only be implemented once we know our relation to https://github.com/eccenca/ckanext-diachron, as probably some features (e.g. filtering datasets by quality) would best be added as extensions to the CKAN software
In https://github.com/diachron/quality/blob/master/src/main/java/de/unibonn/iai/eis/diachron/qualitymetrics/intrinsic/consistency/MisuseOwlDatatypeOrObjectProperties.java please fix the following:
the metric should be called MisusedOwlDatatypeOrObjectProperties
the OWL namespace is http://www.w3.org/2002/07/owl# (FYI, http://prefix.cc is a helpful service for finding out such well-known URIs)
names like DatatypeProperty are case sensitive.
Please refer to the Amrapali survey paper for more information.
The HomogeneousDatatypesMetric is giving values such as 5.933908131234312E-6 (this might be correct, but it is strange that it is the only metric giving us such a result).
The Dereferencibility metric is taking a lot of time to compute on EBI datasets
This is mainly for @jerdeb.
We will proceed by studying the data model specification in D1.3, and then figuring out how algorithms (such as metrics) can be implemented on top of the Diachron data model.
The MisplacedClassesOrProperties metric is taking a lot of time to compute on EBI datasets
In https://jena.apache.org/documentation/notes/typed-literals.html (@nfriesen, thanks for pointing out this helpful guide!), I think the following sections will help us to get beyond built-in XSD data types such as numbers or dates:
I'm initially assigning this Issue to you. Later on, you may want to split into more specific per-datatype Issues assigned to Ali.
Once we know how to handle built-in XSD data types, such as numbers or dates, we are planning to proceed to things like percentages, ISBNs, email addresses, etc. These can be defined in XML Schema as restrictions of base types, such as numbers within a range (integer percentage = integer between 0 and 100), or strings that match regular expressions (an easy way to handle email addresses or credit card numbers or to approximate ISBNs).
For a more thorough check, things like ISBNs or possibly email addresses require further work. The last digit of an ISBN is a checksum, which needs to be computed from the other digits. If we wanted to validate email addresses by checking whether the mail server responds, that would also require a custom implementation.
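For illustration, the ISBN-10 checksum mentioned above can be computed like this (a self-contained sketch, not project code):

```java
// Illustrative ISBN-10 validator: the last digit is a checksum; the
// weighted digit sum must be divisible by 11, where a final 'X'
// stands for the value 10.
class IsbnCheck {
    static boolean isValidIsbn10(String isbn) {
        if (isbn.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = isbn.charAt(i);
            int digit;
            if (c == 'X' && i == 9) digit = 10;         // 'X' only allowed as check digit
            else if (Character.isDigit(c)) digit = c - '0';
            else return false;
            sum += digit * (10 - i);                    // weights 10, 9, ..., 1
        }
        return sum % 11 == 0;
    }
}
```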
Uni Bonn is responsible for task 3.2, which aims to identify temporally related data sets. This is a very challenging task, and there exist many different approaches to check whether one data set is an older version of another. The easiest ones are:
We need to parse ntrig (data dumps) and SPARQL endpoints in the most efficient way. Possible readings: LODStats (http://jens-lehmann.org/files/2012/ekaw_lodstats.pdf)
i.e. an extension of the existing faceted browsing facilities (e.g. filtering by license).
Please implement the outProblematicInstancesToStream() method for the EmptyAnnotationValue, LabelsUsingCapitals and WhitespaceInAnnotation metrics. The corresponding quality problems already exist in the QR vocabulary.
What is the … that's needed for computing one metric?
Maybe figure out by implementing the concrete computation of one metric, and then abstracting it into an interface. Then, further metrics could instantiate this interface.
Reuse idea of https://github.com/AKSW/LODStats/blob/master/lodstats/stats/RDFStatInterface.py but implement in Java (as most concrete metrics have already been implemented in Java)
Add latest extension of Data Cube for generating daq triples
Implement the following metrics and their subsequent test classes:
SPARQL Accessibility
RDF Accessibility
Dereferencibility
Unstructured Data
(Implementation details can be found http://goo.gl/pTvUW4)