Comments (6)

maltelueken avatar maltelueken commented on August 23, 2024

The Overleaf document covers XML, YAML, JSON, and tabular formats (e.g., TSV, CSV) so far. Is this sufficient, or should I look for more formats?

from notebooks.

maltelueken avatar maltelueken commented on August 23, 2024

I also found two standards (XML) that might be useful for this project:

  • TEI: This could apply to data storage in XML
  • REFI: This applies more to data that has been processed by qualitative analysis software

kevinpijpers avatar kevinpijpers commented on August 23, 2024

Thanks Malte, I think this sums up the basic formats we should consider.

TEI looks interesting, very mature, maintained, and with incredibly detailed documentation. I am interested in learning more about TEI.

REFI also looks interesting, but (as you say) it might only be useful if we really decide to continue with a QDAS application. Furthermore, it is already possible to export a project in Atlas.ti in a specific QDA-XML format (.qdpx) for use between applications, and I'm curious how this relates to REFI. So I'd table that for now.

Another one of interest might be Resource Description Framework (RDF/XML). An example of this is the Lemon model, which also employs LMF ("LMF is the ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)"). Lemon splits the 'ontological entity' (the thing) and the 'lexical sense' of that thing, and builds on that, which has some advantages. However, Lemon is a model built on existing standards (and a Java API), and in that sense we might need to open another task if we want to investigate this further.

kevinpijpers avatar kevinpijpers commented on August 23, 2024

@maltelueken Could you also look at FoLiA, developed at Radboud University? It is a standard for encoding NLP annotations of text, and a kind of alternative to TEI.

maltelueken avatar maltelueken commented on August 23, 2024

Yes I will look at it!

maltelueken avatar maltelueken commented on August 23, 2024

I found this nice paper which compares FoLiA to other linguistic annotation formats. They are mostly XML-based. FoLiA seems to be more of a framework for how to store annotated text (or other) data in XML files, whereas TEI for instance is more specific about the different annotation tags and metadata entries. Taken together, these formats are for annotated data and not necessarily for raw text data as described in the document. Given that they are all based on XML, we should probably choose XML as our raw data storage format for consistency.
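To make the raw-data option a bit more concrete, here is a minimal sketch of storing one raw text document as XML with Python's standard library. The element and attribute names (`document`, `title`, `body`, `p`, `id`, `n`) and both helper functions are made up for illustration; they are not taken from TEI, FoLiA, or the Overleaf document.

```python
import xml.etree.ElementTree as ET


def document_to_xml(doc_id: str, title: str, paragraphs: list[str]) -> str:
    """Serialize one raw text document to an XML string (hypothetical schema)."""
    root = ET.Element("document", attrib={"id": doc_id})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for number, paragraph in enumerate(paragraphs, start=1):
        # One <p> element per paragraph; special characters are escaped on output.
        p = ET.SubElement(body, "p", attrib={"n": str(number)})
        p.text = paragraph
    return ET.tostring(root, encoding="unicode")


def xml_to_paragraphs(xml_text: str) -> list[str]:
    """Read the paragraphs back from the XML string, in document order."""
    root = ET.fromstring(xml_text)
    return [p.text or "" for p in root.iterfind("./body/p")]


if __name__ == "__main__":
    xml_text = document_to_xml(
        "doc-001", "Example interview", ["First paragraph.", "Second & last paragraph."]
    )
    print(xml_text)
    print(xml_to_paragraphs(xml_text))
```

A plain structure like this would keep the raw text in the same family of formats as the annotation standards, so a later conversion step could probably stay within XML tooling.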

For the storage of annotated text (as the output of our NLP pipelines), FoLiA seems to be a very promising candidate (see paper, section 2). The main points are that it is flexible, somewhat human-readable, and explicit (can be validated). Moreover, there are many Dutch tools that interoperate with FoLiA, like BlackLab, Brat, and Frog, and there is even a Python package for the format. It also has an extension for spaCy.

The problem with most of these formats is that they are very complex, and it might take users a considerable amount of time to learn them.
