In what format will we store the data (text) files of the project? Including metadata.


Determine storage format for text files about notebooks HOT 6 CLOSED

maltelueken commented on August 23, 2024

Determine storage format for text files

from notebooks.

Comments (6)

maltelueken commented on August 23, 2024

The Overleaf document covers XML, YAML, JSON, and tabular formats (e.g., TSV, CSV) so far. Is this sufficient, or should I look for more formats?
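
To make the comparison concrete, here is a minimal sketch of one text record with metadata written to JSON, TSV, and XML using only the Python standard library. YAML is left out because it needs a third-party package. The record fields (`id`, `language`, `source`, `text`) and the XML element names are hypothetical and are not the project's schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record: field names are illustrative only.
record = {
    "id": "doc-001",
    "language": "nl",
    "source": "interview",
    "text": "Dit is een voorbeeld.\nTweede regel.",
}


def to_json(rec):
    """Serialize the record as a JSON object (metadata and text side by side)."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def to_tsv(rec):
    """Serialize the record as a header row plus one tab-separated data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)  # fields containing newlines are quoted automatically
    return buf.getvalue()


def to_xml(rec):
    """Serialize the record as XML with metadata and text in separate elements."""
    root = ET.Element("document", id=rec["id"])
    meta = ET.SubElement(root, "metadata")
    for key in ("language", "source"):
        ET.SubElement(meta, key).text = rec[key]
    ET.SubElement(root, "text").text = rec["text"]
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_tsv(record))
print(to_xml(record))
```

One visible difference: JSON and XML keep multi-line text in a single nested field, while the tabular format has to quote it to fit in one row.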

from notebooks.

maltelueken commented on August 23, 2024

I also found two XML-based standards that might be useful for this project:

TEI: This could apply to data storage in XML
REFI: This applies more to data that has been processed by qualitative analysis software
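
To give an impression of what TEI storage could look like, here is a rough sketch that builds a bare-bones TEI document with Python's standard library. The element names (`teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`, `text`, `body`) and the namespace come from TEI P5; the helper names and the example content are assumptions, and the output is not validated against the TEI schema here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, source_note, paragraphs):
    """Build a small TEI tree: a header with metadata, then the text body."""
    root = ET.Element(tei("TEI"))

    # Metadata lives in the header's file description.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished project data."
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = source_note

    # The text itself goes into text/body as paragraphs.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph
    return root


# Hypothetical example content.
doc = build_tei(
    "Example interview",
    "Transcribed from audio.",
    ["First paragraph.", "Second paragraph."],
)
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)
```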

from notebooks.

kevinpijpers commented on August 23, 2024

Thanks Malte, I think this sums up the basic formats we should consider.

TEI looks interesting, very mature, maintained, and with incredibly detailed documentation. I am interested in learning more about TEI.

REFI also looks interesting, but (as you say) it might only be useful if we really decide to continue with a QDAS application. Furthermore, Atlas.ti can already export a project in a specific QDA-XML format (.qdpx) for exchange between applications, and I'm curious how this relates to REFI. So I'd table that for now.

Another one of interest might be the Resource Description Framework (RDF/XML). An example of this is the Lemon model, which also employs LMF ("LMF is the ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)"). Lemon separates the 'ontological entity' (the thing) from the 'lexical sense' of that thing and builds on that split, which has some advantages. However, Lemon is a model built on existing standards (and a Java API), so we might need to open another task if we want to investigate it further.

from notebooks.

kevinpijpers commented on August 23, 2024

@maltelueken Could you also look at FoLiA, developed at Radboud University? It is a format for encoding NLP annotations of text, and a kind of alternative to TEI.

from notebooks.

maltelueken commented on August 23, 2024

Yes, I will look at it!

from notebooks.

maltelueken commented on August 23, 2024

I found this nice paper (https://clinjournal.org/clinj/article/view/26/22), which compares FoLiA to other linguistic annotation formats. They are mostly XML-based. FoLiA seems to be more of a framework for how to store annotated text (or other) data in XML files, whereas TEI, for instance, is more specific about the different annotation tags and metadata entries. Taken together, these formats are meant for annotated data and not necessarily for raw text data as described in the document. Given that they are all based on XML, we should probably choose XML as our raw data storage format for consistency.

For the storage of annotated text (as the output of our NLP pipelines), FoLiA seems to be a very promising candidate (see paper, section 2). The main points are that it is flexible, somewhat human-readable, and explicit (it can be validated). Moreover, there are many Dutch tools that interoperate with FoLiA, like BlackLab, Brat, and Frog, and there is even a Python package for the format. It also has an extension for spaCy.
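
To illustrate what "explicit" can buy us, here is a small sketch of a check that every annotation in a file refers to an annotation layer declared up front. The XML layout below (`declarations`, `layer`, `token`, `ann`) is made up for this example and is NOT the FoLiA schema; it only mimics the general idea of declaring annotation types before using them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout for illustration only (not FoLiA).
SAMPLE = """
<document id="doc-001">
  <declarations>
    <layer name="pos"/>
    <layer name="lemma"/>
  </declarations>
  <sentence>
    <token text="Dit">
      <ann layer="pos" value="PRON"/>
      <ann layer="lemma" value="dit"/>
    </token>
    <token text="werkt">
      <ann layer="pos" value="VERB"/>
      <ann layer="sentiment" value="positive"/>
    </token>
  </sentence>
</document>
"""


def undeclared_layers(xml_text):
    """Return the sorted names of layers that are used but never declared."""
    root = ET.fromstring(xml_text)
    declared = {layer.get("name") for layer in root.iterfind("declarations/layer")}
    used = {ann.get("layer") for ann in root.iter("ann")}
    return sorted(used - declared)


problems = undeclared_layers(SAMPLE)
if problems:
    print("Undeclared annotation layers:", ", ".join(problems))
else:
    print("All annotation layers are declared.")
```

In the sample, the `sentiment` annotation is flagged because no such layer was declared. A real setup would rely on the format's own schema and validator rather than a hand-written check like this.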

The problem with most of these formats is that they are very complex, and it might take users a considerable amount of time to learn them.

from notebooks.
