wikidata / wikidata-toolkit Goto Github PK

View Code? Open in Web Editor NEW

367.0 40.0 100.0 17.94 MB

Java library to interact with Wikibase

Home Page: https://www.mediawiki.org/wiki/Wikidata_Toolkit

License: Apache License 2.0

Java 98.39% HTML 1.61%

java wikidata wikibase rdf wikidata-toolkit

wikidata-toolkit's Introduction

Wikidata Toolkit

Wikidata Toolkit is a Java library for accessing Wikidata and other Wikibase installations. It can be used to create bots, to perform data extraction tasks (e.g., convert all data in Wikidata to a new format), and to do large-scale analyses that are too complex for using a simple SPARQL query service.

Documentation

Wikidata Toolkit homepage: project homepage with basic user documentation, including guidelines on how to setup your Java IDE for using Maven and git.
Wikidata Toolkit examples: stand-alone Java project that shows how to use Wikidata Toolkit as a library for your own code.
Wikidata Toolkit Javadocs: API documentation

License and Credits

Authors: Markus Kroetzsch, Julian Mendez, Fredo Erxleben, Michael Guenther, Markus Damm, Antonin Delpeuch, Thomas Pellissier Tanon and other contributors

License: Apache 2.0

The development of Wikidata Toolkit has been partially funded by the Wikimedia Foundation under the Wikibase Toolkit Individual Engagement Grant, and by the German Research Foundation (DFG) under Emmy Noether grant KR 4381/1-1 "DIAMOND".

How to make a release

During development, the version number in the pom.xml files should be the next version number assuming that the next version is a patch release, followed by -SNAPSHOT. For instance, if the last version to have been released was 1.2.3, then the pom.xml files should contain <version>1.2.4-SNAPSHOT</version>.

Pick the version number for the new release you want to publish, following SemVer. If this is going to be a patch release, it should be the version currently in pom.xml without the -SNAPSHOT suffix. In the following steps, we will assume this new version is 1.2.4.
Write the new version number in the pom.xml files with mvn versions:set -DnewVersion=1.2.4
Add some release notes in the RELEASE-NOTES.md file at the root of the repository
Commit the changes: git commit -am "Set version to 1.2.4"
Add a tag for the version: git tag -a v1.2.4 -m "Version 1.2.4"
Write the next version number in the pom.xml file, by incrementing the patch release number: mvn versions:set -DnewVersion=1.2.5-SNAPSHOT
Commit the changes: git commit -am "Set version to 1.2.5-SNAPSHOT"
Push commits and tags: git push --tags && git push
In GitHub's UI, create a release by going to https://github.com/Wikidata/Wikidata-Toolkit/releases/new. Pick the tag you just created, give a title to the release and quickly describe the changes since the previous release (see existing releases for examples).
Update the version number mentioned in https://www.mediawiki.org/wiki/Wikidata_Toolkit
Update the examples in https://github.com/Wikidata/Wikidata-Toolkit-Examples (generally just bumping WDTK's version in the pom.xml file works. Make sure it still compiles afterwards.)

The library is automatically packaged and uploaded to Maven Central by the continuous deployment (with GitHub Actions). So is the HTML version of the javadoc (to GitHub Pages).

wikidata-toolkit's People

Contributors

Stargazers

Watchers

Forkers

chinmay26 eldur qwaider gladwig sysoev-a dlzhangxg notconfusing aissatech imclab shilad chagge ardock danangcode seralf congwang-ai brianified egonw alansaid nathan929 danielhz motazsaad datnt-bkhn p4rus shubhamsaxena02 antjpar steven-ho mashirokinji noa xinke0802 dswarm 0607060123 monkey2000 hyqleonardo sbrunk rub3nlh miselico judgegoodwin yuejiyang elshimone mparaz jgwill pebsconsulting bnmin jmorra jon-morra-zefr xianshu1 taoer1996 zhanglbjames jsubercaze psh karl7 isspek aymansalama livvi tool-recommender-bot cass-green flaskytest1 bennofs yidanhu33 orai-nlp ordtesters chetansharmaatgithub afkbrb francescoz93 usc-isi-i2 darecoder karlwettin hims92 kanikasaini snehasi tanish-gupta rpatil524 62mkv boost-entropy-repos-org brett-matson robertvazan yukhnevich wetneb benjaminaaron remhardt thejacksonthomas lacinoire lhaaits martinjdksn forcing-com henryworkingson javacentric wsmliubin jmformenti axxter99 nicocanela g46y wikidonne talank mavenprojects mggnzu001 testboost theeaterr s-rb

wikidata-toolkit's Issues

Processor for page revisions in XML dumps

A component for processing the page revisions found in one or more dump files in MediaWiki XML format should be provided. The component should process a file given by a Java file handler (no path!) and iterate over the revisions in that file. Other components that process revisions should be able to register as listeners, which will be called back for every new revision. The iterator should also be able to parse multiple files (last dump + incremental daily dumps), obtained from the dump file management component of Issue #8.

Bug with date value IDs

I am a researcher from KAIST SWRC(Semantic Web Research Center).
Our team is working for making RDF repository referencing to Wikidata.
We have been looking at data from simple statement and full statement.
We found one bug in RDF data, where the simple statement file and full statement file have two different dates from the same generated ID. This could cause a problem when incorporating data from simple statement and ones from full statement.

For example, date from full statement, http://www.wikidata.org/entitiy/VT74cee5440e7c65414d6c62820efa3dc2
http://www.wikidata.org/ontology#time
"1991-11-25""^^http://www.w3.org/2001/XMLSchema#date .

date from simple statement
http://www.wikidata.org/entity/VT74cee5440e7c65414d6c62820efa3dc2 http://www.wikidata.org/ontology#time
"1942-11-26"^^http://www.w3.org/2001/XMLSchema#date .

The problem may be in the digest function in Vocabulary.java in getTimeValueUri function.

Currently, it's not an urgent problem for us.

Require Java 7

There does not seem a good reason to provide support for Java 6 in a new project today. The maven and travis configurations should be updated to use Java 7 only. Java 7 has several improved libraries and a number of features that lead to more readable code.

Where to place additional files for JavaDoc

Additional documentation files like images are to be handled like desctibed here.

Miscellaneous Unprocessed Files

You can also include in your source any miscellaneous files that you want the Javadoc tool to copy to the destination directory. These typically includes graphic files, example Java source (.java) and class (.class) files, and self-standing HTML files whose content would overwhelm the documentation comment of a normal Java source file.

To include unprocessed files, put them in a directory called doc-files which can be a subdirectory of any package directory that contains source files. You can have one such subdirectory for each package. You might include images, example code, source files, .class files, applets and HTML files. For example, if you want to include the image of a button button.gif in the java.awt.Button class documentation, you place that file in the /home/user/src/java/awt/doc-files/ directory. Notice the doc-files directory should not be located at /home/user/src/java/doc-files because java is not a package -- that is, it does not directly contain any source files.

All links to these unprocessed files must be hard-coded, because the Javadoc tool does not look at the files -- it simply copies the directory and all its contents to the destination. For example, the link in the Button.java doc comment might look like:

    /**
     * This button looks like this: 
     * <img src="doc-files/Button.gif">
     */

Rewrite JsonConverter tests

The current tests are not satisfying enough and highly redundant. Smaller, more precise tests are needed.

Also @eldur pointed out, that one could use fasterxml.jackson for this purpose. This issue is for discussing this aspect also.

Wikibase interface to storage backend

A component for exchanging Wikibase objects, especially EntityDocuments, with a storage backend should be provided. The mapping of ids will use a dictionary #19 and data will be stored in tuples using one or more "tables" #18. The interface should also provide some basic query features to retrieve Wikibase objects based on partially specified features.

Exception: Unrecognized field "claims"

I just tried to run the example ClassPropertyUsageAnalyzer out of the box and I'm getting this exception. I'm not sure if this is a bug or just a problem of how I'm using it!?

com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "claims" (class org.wikidata.wdtk.datamodel.json.jackson.JacksonPropertyDocument), not marked as ignorable (5 known properties: "datatype", "descriptions", "id", "aliases", "labels"])
 at [Source: java.util.zip.GZIPInputStream@53a41860; line: 5, column: 15511] (through reference chain: org.wikidata.wdtk.datamodel.json.jackson.JacksonPropertyDocument["claims"])
    at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
    at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:671)
    at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:773)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1297)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1275)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:247)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:155)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:126)
    at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:118)
    at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:87)
    at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:132)
    at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:41)
    at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:189)
    at org.wikidata.wdtk.dumpfiles.JsonDumpFileProcessor.processDumpFileContents(JsonDumpFileProcessor.java:66)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processDumpFile(DumpProcessingController.java:471)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processMostRecentDump(DumpProcessingController.java:456)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processMostRecentJsonDump(DumpProcessingController.java:426)
    at org.wikidata.wdtk.examples.ExampleHelpers.processEntitiesFromWikidataDump(ExampleHelpers.java:157)
    at org.wikidata.wdtk.examples.ClassPropertyUsageAnalyzer.main(ClassPropertyUsageAnalyzer.java:276)

Space-efficient bitvector implementation with fast rank and select

The basic bitvector implementation #3 should be extended to support fast select and rank queries. This functionality can be useful for space-efficient linearized memory organization.

Statement MainSnak type has no .getValueID interface.

In version 0.1
One can get the property string of a claim with

String PID = si.getClaim().getMainSnak().getPropertyId().getId().toString();

but there is not equivalent method to get the valueID() if one is of type Wikidata item.

Property parsing might fail

Property parsing fails due to the property string passed to the JSON converter might be of the form "Property:P21".

Refactor JsonConverter

JsonConverter is a class in the wdtk-dumpfiles package which consists of almost 1000 lines and thus should be splitted up a bit. Furthermore, this class is not strongly connected to the dump files but may also be used to parse other JSON data, eg. collected from Special:EntityData. Maybe it would even be worth it to create another package which handles serialization/deserialization of JSON to the datamodel. This way we'd have a much cleaner and more reusable component.

Change the attribute "year" in TimeValue to long

One might want to change the "year"-attribute to long. While in JSON this is already represented as 12-digits, the internal integer representation overflows above certain values.

Such large time values are for example used in astrophysics, geology, sience-fiction. Also see Q1: the universe.

Basic tuple storage implementation

A basic implementation for managing a list of n-tuples (one might say: a table, but maybe with additional constraints) in memory should be provided. The implementation should support basic (non-compound) queries and iteration operations. This functionality should be defined in an interface.

GlobeCoordinate handling has changed

The GlobeCoordiantes in the Json dumps seem to have changed from using nanodegrees for everything into using full degrees and switching the type from long to double.

This leads to wrong results when parsing GlobeCoordinates with our code.

Error constructing StatementGroups from dump files

When parsing data from export files (from the internal JSON format), the grouping of Statements is broken. Instead of having one group per property, there is one group per main snak, which usually means that every group contains just one statement.

Incorrect serialization of references into RDF format

I analyzed content of http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-statements.nt.gz
and found out that references for some statements are incorrect. For example, for item Q42 (Douglas Adams) and property P26 (spouse) the correct reference is

reference URL: http://www.nndb.com/people/731/000023662/
original language: English
title: Douglas Adams
publisher: NNDB
date retrieved: 7 December 2013

while RDF data encodes something different:

<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/entity/P26s> <http://www.wikidata.org/entity/Q42Sb88670f8-456b-3ecb-cf3d-2bca2cf7371e> .
...
<http://www.wikidata.org/entity/Q42Sb88670f8-456b-3ecb-cf3d-2bca2cf7371e> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> .
...
<http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/ontology#Reference> .
<http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> <http://www.wikidata.org/entity/P364r> <http://www.wikidata.org/entity/Q1860> .
<http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> <http://www.wikidata.org/entity/P854r> <http://www.douglasadams.eu/en_adams_athee.php> .
<http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> <http://www.wikidata.org/entity/P357r> "Douglas Adams and God. Portrait of a radical atheist" .
<http://www.wikidata.org/entity/R801b4ec5d49856ea46b591da3ba8596c> <http://www.wikidata.org/entity/P813r> <http://www.wikidata.org/entity/VT784d3c688173e05b96ab15870c7a36a9> .

The same problem was present when I applied RdfSerializationExample by myself.

The problem seems to be in org.wikidata.wdtk.rdf.ReferenceRdfConverter.writeReferences() method:

public void writeReferences() throws RDFHandlerException {
    Iterator<Reference> referenceIterator = this.referenceQueue.iterator();
    for (Resource resource : this.referenceSubjectQueue) {
        if (!this.declaredReferences.add(resource)) {
            // XXX referenceIterator.next() is not invoked - referenceSubjectQueue and referenceQueue become unsynchronized
            continue;
        }
        Reference reference = referenceIterator.next();
        writeReference(reference, resource);
    }
    this.referenceSubjectQueue.clear();
    this.referenceQueue.clear();

    this.snakRdfConverter.writeAuxiliaryTriples();
}

Mismatch in PropertyIdValueImpl

There is a mismatch in PropertyIdValueImpl between the exception message and the reason why the exception was thrown:

        if (!id.matches("^P[1-9][0-9]*$")) {
            throw new IllegalArgumentException(
                    "Wikibase item ids must have the form \"Q[1-9]+\"");
        }

Is it intentional that no property with the id "P0" is allowed?

EntityDocumentProcessors called twice when using JSON dumps

Entity processors can be registered for specific content models. However, since JSON dumps do not provide content models, they just call all registered entity document processors on each document. The broker component that manages this does not check for duplicates, and as a result, a processor that is registered individually for item and property content will be called twice.

This is a critical bug that causes unnecessary work and leads to inflated outputs, in particular in the RDF dumps.

As a workaround, users should register their processors for one content model only (or simply for null) when using the JSON dumps.

Use correct site links when importing data from dumps

The current code has no way to know what the keys used in sitelinks (such as "enwiki") mean for the site that the data is imported from. Empty base URLs are used, which is wrong.

To fix this, there needs to be a point to inject site information into the parsing process and, possibly, the factory (it would make sense to share this information across sitelinks instead of copying URL strings). An initial workaround could be to inject hard-coded information that is correct for www.wikidata.org. In a later refinement, it should be possible to extract this information from dumps instead.

Command line client for data format conversions

A command line tool for processing dumps to create exports in several formats should be provided. The tool should be able to work offline, using only previously downloaded dumps, and it should provide some basic filtering options to select the data to include in the dump.

Implement updated Wikibase data model

Recent discussions with the Wikidata team in Berlin have led to some updates to the data model. These should be fully reflected in the current implementation, both as interfaces and as default implementation classes.

Exceptions in globe coordinates get propagated up too far

Instead of just ignoring an errorous snak, an exception in the globe coordinates parsing will grind the whole processing chain to a halt.

Dump:

Exception in thread "main" java.lang.IllegalArgumentException: Latitude must be
between 90 degrees and -90 degrees.
        at
org.wikidata.wdtk.datamodel.implementation.GlobeCoordinatesValueImpl.<init>(GlobeCoordinatesValueImpl.java:56)
        at
org.wikidata.wdtk.datamodel.implementation.DataObjectFactoryImpl.getGlobeCoordinatesValue(DataObjectFactoryImpl.java:87)
        at
org.wikidata.wdtk.dumpfiles.JsonConverter.getGlobeCoordinatesValue(JsonConverter.java:831)
        at org.wikidata.wdtk.dumpfiles.JsonConverter.getValueSnak(JsonConverter.java:652)
        at org.wikidata.wdtk.dumpfiles.JsonConverter.getSnak(JsonConverter.java:549)
        at org.wikidata.wdtk.dumpfiles.JsonConverter.getClaim(JsonConverter.java:478)
        at org.wikidata.wdtk.dumpfiles.JsonConverter.getStatement(JsonConverter.java:441)
        at
org.wikidata.wdtk.dumpfiles.JsonConverter.getStatementGroups(JsonConverter.java:403)
        at
org.wikidata.wdtk.dumpfiles.JsonConverter.convertToItemDocument(JsonConverter.java:176)
        at
org.wikidata.wdtk.dumpfiles.WikibaseRevisionProcessor.processItemRevision(WikibaseRevisionProcessor.java:76)
        at
org.wikidata.wdtk.dumpfiles.WikibaseRevisionProcessor.processRevision(WikibaseRevisionProcessor.java:66)
        at
org.wikidata.wdtk.dumpfiles.MwRevisionProcessorBroker.notifyMwRevisionProcessors(MwRevisionProcessorBroker.java:175)
        at
org.wikidata.wdtk.dumpfiles.MwRevisionProcessorBroker.processRevision(MwRevisionProcessorBroker.java:139)
        at
org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processXmlRevision(MwDumpFileProcessorImpl.java:423)
        at
org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processXmlPage(MwDumpFileProcessorImpl.java:341)
        at
org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.tryProcessXmlPage(MwDumpFileProcessorImpl.java:278)
        at
org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processXmlMediawiki(MwDumpFileProcessorImpl.java:198)
        at
org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processDumpFileContents(MwDumpFileProcessorImpl.java:151)
        at
org.wikidata.wdtk.dumpfiles.WmfDumpFileManager.processAllRecentDumps(WmfDumpFileManager.java:124)
        at
de.tudresden.inf.lat.wdtkuser.DumpProcessingExample.main(DumpProcessingExample.java:72)

Should branch externalJsonReader be working with DumpProcessingExample?

I know that the dump format change broke things adn that supposedly the new branch externalJsonReader fixes them, according to https://www.mediawiki.org/wiki/Wikidata_Toolkit#How_to_use_Wikidata_Toolkit .
But even after checking out that branch and trying to run the dumpProcessingExample I get the errors like

2014-09-15 11:20:28 ERROR - Failed to process JSON for item Revision 157416716 of page Q18 (ns 0, id 115). Created at 2014-09-14T10:32:04Z by Robin0van0der0vliet (742954) with comment "/* wbsetdescription-set:1|nl */ werelddeel". Model wikibase-item (application/json). Text length: 45630 (org.json.JSONException: JSONObject["claims"] is not a JSONArray.)

But I was wondering if this is a separate issue?

Complete dump file processing pipeline and give an example

The components for processing dump files (dump file manager, XML revision parser, JSON parser) need to be connected to a processing pipeline that outputs the final EntityDocuments. This still requires some small interface additions and minimal adjustments.

An example should be given on how to utilize this pipeline to access Wikidata dump content in a streaming fashion.

Maven signing fails for javadocs

The signing of jars with Maven -Psign fails for javadoc jars with the current configuration (BAD signature). The apparent reason for this is that the javadoc jar is modified after the signature is created, as can be seen from the time stamps of the files in the target directory. Maven needs to be reconfigured to sign the javadocs only after they are done.

The invalid signatures are the reason why the upload at sonatype does not work yet.

Missing linkage between generated properties

We are trying to use RDF data from WikiDataToolKit, and we noticed the generated properties like those shown below:

original property generated properties
ns : P01230 -> ns: P01230c
ns : P01230 -> ns: P01230s
ns : P01230 -> ns: P01230v
ns : P01230 -> ns: P01230r
ns : P01230 -> ns: P01230q

We have found out that there were no linkages between original properties and generated properties. Actually we would like to use some information such as labels, descriptions from original properties, so we think there could be an addition
to org.wikidata.wdtk.rdf.OwlDeclarationBuffer.writePropertyDeclarations(), which makes some links such as:

http://www.wikidata.org/entity/P1230 ns:simpleClaimProperty http://www.wikidata.org/entity/P1230c

Thank you for making the toolkit.

RDF serialization

A component for serializing data objects as RDF (in several serialization formats) should be provided. The component should receive a stream of objects to serialize and write the result to a file handler provided. The serialization should use some well-tested Java library for RDF.

Component for Wikibase API access

There should be a component for accessing the functions of the Wikibase Web API. Initially, this component should only support some of the basic reading operations; other operations will be added later. It should be configured by providing the base URL of the API endpoint of the Wikibase instance to access. A JSON parser and URL manipulation libraries should be used to cleanly interact with the API.

java.lang.VerifyError thrown when processing dumpfile

A java.lang.VerifyError is thrown when processing a dumpfile.

Running on:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Details:
Exception in thread "main" java.lang.VerifyError: Stack map does not match the one at exception handler 55
Exception Details:
Location:
org/wikidata/wdtk/dumpfiles/JsonConverter.getStatementGroups(Lorg/json/JSONArray;Lorg/wikidata/wdtk/datamodel/interfaces/EntityIdValue;)Ljava/util/List; @55: astore
Reason:
Type 'org/json/JSONException' (current frame, stack[0]) is not assignable to 'java/lang/RuntimeException' (stack map, stack[0])

...

at org.wikidata.wdtk.dumpfiles.WikibaseRevisionProcessor.startRevisionProcessing(WikibaseRevisionProcessor.java:60)
at org.wikidata.wdtk.dumpfiles.MwRevisionProcessorBroker.startRevisionProcessing(MwRevisionProcessorBroker.java:122)
at org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processXmlMediawiki(MwDumpFileProcessorImpl.java:194)
at org.wikidata.wdtk.dumpfiles.MwDumpFileProcessorImpl.processDumpFileContents(MwDumpFileProcessorImpl.java:151)
at org.wikidata.wdtk.dumpfiles.WmfDumpFileManager.processAllRecentDumps(WmfDumpFileManager.java:124)
at de.tudresden.inf.lat.wdtkuser.DumpProcessingExample.main(DumpProcessingExample.java:72)

JsonConverter - JSONException

Bug in JsonConverter. It expects claims to be a JSONArray instead of JSONObject. (line 179)

if (jsonObject.has(KEY_CLAIM)) {
    JSONArray jsonStatements = jsonObject.getJSONArray(KEY_CLAIM);
    statements = this.getStatementGroups(jsonStatements, itemId);
}

JSON serialization

A component for serializing data objects in the external JSON format of the Wikibase API should be provided. If possible, a JSON library should be used for this purpose.

JSON serialization fails for property documents

The JSON serialization does not work for property documents. When processing properties in the dump, the following kind of error appears:

2014-04-29 16:12:04 ERROR - Failed to process JSON for property Revision 123411026 of page Property:P1251 (ns 120, id 18134330). Created at 2014-04-28T11:46:17Z by Palapa (45891) with comment "/* wbsetlabel-add:1|bs */ ABS ASCL kod". Model wikibase-property (application/json). Text length: 340 (org.json.JSONException: Wikibase property ids must have the form "P<positive integer>")

This was unnoticed since the example program did not actually process property documents at all (it registered the serializer only for items), which was fixed in #66.

Process internal JSON to obtain Wikibase data objects

A component for transforming the internal JSON format as found in Wikibase dump files into Wikidata Toolkit data objects should be provided. The component should use a JSON parser to access the data structure and own code to turn this data into Wikibase objects. The main output of the component should be EntityDocuments. The initial implementation may not support all substructures yet, especially not all kinds of data values (which require additional parsing).

Downloading main dumps fails

Hi all,

I'm currently trying out the RDF serialization on the rdf-serializer branch (RdfSerializationExample). I'm running into a problem when the toolkit tries to retrieve a main dump:

2014-05-07 16:26:46 INFO  - Downloading current dump file wikidatawiki-20140420-pages-meta-current.xml.bz2 from http://dumps.wikimedia.org/wikidatawiki/20140420/wikidatawiki-20140420-pages-meta-current.xml.bz2 ...
2014-05-07 16:26:47 ERROR - Dump file wikidatawiki-current-20140420 could not be processed: java.io.IOException: Failed to retrieve maximal revision id. Aborting dump retrieval.

I did some quick debugging and traced the error to the WmfOnlineStandardDumpFile.fetchMaximalRevisionId() method which expects a particular format for the maximal revision id. However, the revision id seems to be missing in the most recent dump:

http://dumps.wikimedia.org/wikidatawiki/20140420/

Compare this to the previous dump:

http://dumps.wikimedia.org/wikidatawiki/20140331/

Cheers,
Günter

Download and manage Wikidata dumpfiles

There should be a component to download and manage dump files in the format provided for Wikidata.org. It should access dumps from a specified location, find out which dumps are available, and fetch dumps as needed. Relevant types of dumps (current revisions, full, daily) should be distinguished and treated suitably. The component should provide access to any of these files transparently (without requiring accessing components to know about their location or compression format).

Don't download dump files that are not done yet

E.g. using wdtk-example, it is right now already downloading the April 20 dump, but the dump is not fully generated yet. Can we first check whether a dump is complete and available before starting to download it?

Support parallel processing in dump file parsing

All stages of the dump file processing should make use of parallel processing to improve performance. To enable this, these objects need to get access to a shared Executor object that manages the threads. The following tasks should be executed in independent threads:

downloading required dump files (in parallel to processing the first ones)
unzipping the dump files
parsing the XML contents
parsing the JSON for a revision

For some of these, separate issues should be created later.

Properties now have statements

See https://www.wikidata.org/wiki/Wikidata:Project_chat#statements_on_properties_are_live_.28.2Bdeletion_propagation.29

Adjust data model to references with somevalue-snaks

The references in the dumpfiles might contain somevalue-snaks additional to value-snaks. The data model should reflect this possibility. The current state is that somevalue-snaks in referenced will be ignored.

JsonConverter - JSONException

I guess there is a bug in JsonConverter class. At line 179 it expects "claims" to be a JSONArray, in fact it is a JSONObject.

if (jsonObject.has(KEY_CLAIM)) {
    JSONArray jsonStatements = jsonObject.getJSONArray(KEY_CLAIM);
    statements = this.getStatementGroups(jsonStatements, itemId);
}

Parsing Wikidata entities missing a "type" tag.

Greetings, I'm trying to replace my custom Wikidata parsing code with WikidataToolkit. I'm using the following bit of code to parse the JSON associated with entities:

if (MwRevision.MODEL_WIKIBASE_ITEM.equals(mwRevision.getModel())) {
    mwDoc = mapper.readValue(mwRevision.getText(), JacksonItemDocument.class);
} else if (MwRevision.MODEL_WIKIBASE_PROPERTY.equals(mwRevision.getModel())) {
    mwDoc = mapper.readValue(mwRevision.getText(), JacksonPropertyDocument.class);
}

The problem I'm encountering is that some entities in the XML dump are missing a "type" attribute. For example, the JSON for Q25 (Jack Bauer) in http://dumps.wikimedia.org/wikidatawiki/20141009/wikidatawiki-20141009-pages-articles.xml.bz2 starts with the following and never has a type attribute.

{"label":{"en":"Jack Bauer",

Is this a known problem? If so, is there a workaround? Since I'm jumping into WikidataToolkit at a relatively low level, I wonder if some of the higher level code I'm not using is working around this in some way.

Thanks for the excellent tool!

Create project homepage and update progress and team pages

Set up the basic web pages to document the project:

Create a project page at mediawiki.org and link it from everywhere
Set up ohloh page
Add all team members to people page on IEG pages at meta.wikimedia.org

Prepare, create and announce release 0.1.0

This involves:

decide which packages to release
packaging (make sure maven is configured correctly to create packages)
upload release at maven central
create tag on github
update version numbers several times
document the release on the web page
send email notifications

Basic dictionary implementation

A basic implementation for a dictionary of String-identified objects should be provided. This component needs to assign integer ids to incoming String ids, and it must manage a bijective mapping that allows to access either type of id when providing the other. Space efficiency is a primary concern, but a workable speed must also be given.

Use correct base IRI when loading from dumps

When loading from dumps, we currently use the "baseUrl" from the MediaWiki dump file as our base IRI for entities. This is not correct, since the baseUrl is the full link to the main page of the wiki.

We need to find out how to get the correct base IRI and use this instead. Meanwhile, one can only rely on the local id of entities (such as "Q42") but not on their full IRI.

Create a conversion from list of statements to list statement groups

There is a need for a conversion functionality that takes an arbitrary ordered list of statements and creates a list of statement groups, whereas the statements in each statement group are ordered by statement rank.

Since this would not only be needed by the JSON converter, it might be a utility method in the data model (e.g. in the DataObjectFactory-class).

@mkroetzsch If you dont want to do this yourself, you may assign this issue to me.

Provide a space-efficient bitvector implementation

A space-efficient implementation of a BitVector should be provided. It only needs to support the basic read and write operations. This will be needed initially to keep track of flags during dump parsing (Issue #12).

Find some way to quickly respond in maven to breaking changes in export format

I suspect that many of Wikidata Toolkit's customers are (like me) maven users. When a breaking change happens in Wikidata's export format (which seems to occur regularly), it would be excellent if there is some way for me to quickly access a new maven release that adapts to the change.

Two ideas for this are:

Maintain a "stable" release branch that includes quickly released compatibility fixes from the dev branch.
Publish maven snapshot releases.

Right now, there's no good way for me to make my downstream library work during release gaps in Wikidata Toolkit without releasing a custom version of Wikidata-Toolkit to maven (yuck!)

Improve logging output of the JsonConverter

The current output allows not to determine what went wrong and how the broken JSON looks like, which might be helpful for debugging or manually fixing the data. For improvement, the logging output should be more verbose and helpful.

Make datamodel objects serializable

Would it mind if all of the datamodel classes implemented the Serializable interface so that they can be serialized in several ways? This would make lots of things like storing single objects much easier.