
Comments (8)

mkroetzsch commented on August 15, 2024

Thanks for the feedback. We are considering refactoring this code to use another JSON library for parsing (which might also make processing faster). However, the code will not work for Special:EntityData, since that returns a different kind of JSON. The JsonConverter is for the internal JSON found only in dump files; this differs from the JSON you get through the API.
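The distinction between the internal dump JSON and the external API JSON can be sketched with two simplified documents. These snippets are illustrative approximations, not exact Wikibase output, and `looks_external` is a hypothetical helper, not toolkit code (Python is used here for brevity, although the toolkit itself is Java).

```python
import json

# Simplified, hypothetical snippets of the two formats. The old internal
# dump format stored labels as plain language->string maps and the entity
# id as a typed array; the external (API) format uses structured label
# records and a string id. Details here are approximations.
internal = json.loads('{"entity": ["item", 64], "label": {"en": "Berlin"}}')
external = json.loads(
    '{"id": "Q64", "type": "item",'
    ' "labels": {"en": {"language": "en", "value": "Berlin"}}}'
)

def looks_external(doc):
    """Heuristic sketch: external documents carry a string "id" and a
    "labels" map of structured records; internal ones do not."""
    return isinstance(doc.get("id"), str) and "labels" in doc
```

As the comment notes, in practice one should not have to sniff the format like this: the dump code knows it is reading internal JSON, and the API always returns external JSON.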

from wikidata-toolkit.

Benestar commented on August 15, 2024

Hmm, this is weird and I think the dumps should also use the external JSON format. However, are you planning to support the external JSON format, too, or would it help you if I created a new component for that purpose?

mkroetzsch commented on August 15, 2024

We plan to have a new component for the external format, since we will need it for interpreting API results. The external format is much less messy and more uniform (and it is even documented somewhere, which is not the case for the internal JSON). The internal JSON format is what is actually stored in the database; even if it changed now, the old revisions would remain the same. This is why we have to support several versions of the format -- for eternity ...

It has been discussed whether all the internal JSON should be rewritten (even in old revisions) into the new format, but this has not happened yet.

We already have support for serializing to the external JSON format. In principle, one could start with the parsing code from scratch, without the burden of the other JSON parser that has to cover all those special cases. We are thinking about using fasterxml.jackson for the new implementation, which leads to much nicer code (see also the discussion at #47).
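The databind idea behind fasterxml.jackson, mapping JSON directly onto typed classes instead of walking a generic tree, can be sketched as follows. Class and field names here are illustrative assumptions, not the toolkit's actual data model, and the sketch is in Python for brevity although Jackson itself is a Java library.

```python
import json
from dataclasses import dataclass

# Sketch of the databind style: instead of navigating a generic JSON tree
# full of special cases, bind documents straight onto typed records.
# "MonolingualText" and "ItemDocument" are hypothetical names.

@dataclass
class MonolingualText:
    language: str
    value: str

@dataclass
class ItemDocument:
    id: str
    labels: dict  # language code -> MonolingualText

def parse_item(text):
    """Parse an external-format item document (simplified)."""
    raw = json.loads(text)
    labels = {code: MonolingualText(**rec)
              for code, rec in raw.get("labels", {}).items()}
    return ItemDocument(id=raw["id"], labels=labels)

item = parse_item(
    '{"id": "Q64", "labels": {"en": {"language": "en", "value": "Berlin"}}}'
)
```

With Jackson, the equivalent would be annotated Java classes handed to an `ObjectMapper`, which keeps the parsing declarative rather than a hand-written tree walk.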

fer-rum commented on August 15, 2024

On 01.06.2014 20:05, Markus Krötzsch wrote:

> We plan to have a new component for the external format, since we will need this for interpreting API result. The external format is much less messy and more uniform (and it is even documented somewhere, which is not the case for the internal JSON). The internal JSON format is what is actually stored in the database; even if it would change now, the old revisions would remain the same. This is why we have to support several versions of the format -- for eternity ...
>
> It has been discussed if all the internal JSON should be rewritten (even in old revisions) to be the new format, but this has not happened yet.
If we could agree with the people responsible for dumpfile generation that we could use the external JSON in the dumpfiles, I would strongly favour it. Converting the old dumps into external JSON would be a one-time effort, but it becomes bigger the longer one waits.

Having to support legacy dumpfile formats creates a lot of legacy-code burden.

> We already have support for serializing to the external JSON format. In principle, one could start with the parsing code from scratch, without the burden of the other JSON parser that has to cover all those special cases. We are thinking about using fasterxml.jackson for the new implementation, which leads to much nicer code (see also the discussion at #47).


Reply to this email directly or view it on GitHub:
#73 (comment)

mkroetzsch commented on August 15, 2024

I just got news that the Wikidata team is now starting to plan the gradual conversion of text blobs to the external JSON format. This means we should give high priority to implementing parsing for this format, or we will soon start to miss data. The parsing can be independent of the current code, since the two types of JSON will be distinguished (I guess by some content model), so we don't have to decide from the JSON itself which format we have.
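Dispatching on the content model, rather than sniffing the JSON itself, could look roughly like this. The content-model strings and parser functions below are hypothetical placeholders; the real identifiers are whatever Wikibase records for each revision.

```python
import json

def parse_internal(text):
    # Placeholder for the legacy internal-format parser.
    return ("internal", json.loads(text))

def parse_external(text):
    # Placeholder for the new external-format parser.
    return ("external", json.loads(text))

# Hypothetical content-model names; the real strings would come from the
# revision metadata, so no format detection on the JSON is needed.
PARSERS = {
    "wikibase-item-legacy": parse_internal,
    "wikibase-item": parse_external,
}

def parser_for(content_model):
    """Select a parser from the revision's content model."""
    return PARSERS[content_model]

kind, doc = parser_for("wikibase-item")('{"id": "Q64"}')
```

The point of the sketch is that the two parsers never need to coexist in one code path: each revision announces its format up front.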

@fer-rum It would be good if you could start working on this next, ideally using the ideas you got for jackson-based parsing. This can go into a package under data model, since it is a very central functionality that will be used for API result parsing as well as for future dump processing.

mkroetzsch commented on August 15, 2024

As @JeroenDeDauw points out, there is already a PHP reference implementation for parsing the datamodel from external JSON now: https://github.com/wmde/WikibaseDataModelSerialization/tree/master/src
The PHP object model may have some minor differences from the Java implementation, but it should still help resolve questions about how to interpret the structure in each case.

fer-rum commented on August 15, 2024

Acknowledged, I will start implementing this ASAP.

mkroetzsch commented on August 15, 2024

This problem has now been fixed by the new JSON parsing code merged in #91. The JSON conversion is now split across dozens of smaller files, and it is also much faster. Since the Wikidata dump export code regenerates the format even for old revisions, we should soon be able to retire the old code completely.
