Comments (15)

JessedeDoes commented on June 2, 2024

Unfortunately, things may get even messier. Metadata on structural elements is not always a possible solution: metadata may cross the boundaries of structural elements.

This is why, in TEI, we assign metadata to arbitrary text ranges.

Example:

<p>
We were attacked by a giant <milestone xml:id="m0"/>dog
</p>
<p>with enormous <milestone xml:id="m1"/> teeth</p>

Here, the text from milestone m0 to milestone m1 may for instance be supplied by another author.

Katrien may be able to come up with a more realistic example.

from folia.

kosloot commented on June 2, 2024

I would suggest a third approach:
Allow references in structure elements that refer back to meta-data either internal or external to the document.
Document-internal meta-data is stored once in the meta-data section (which might need some extensions). External data can be anywhere, using a URL.
FoLiA itself shouldn't have knowledge about what is in the meta-data, just provide you the link or the XML fragment.
Several links can of course refer to the same meta-data fragment.

Maybe it is better to have external links indirectly: store those in the meta-data section too, and use internal links to refer to these links. This keeps all meta-data in one place.

One obvious drawback: finding all structure nodes that 'belong' to the same meta-data.
But that is easily solved by creating an index on the fly when parsing the document.
A simple API extension could then deliver all nodes connected to a certain meta-data part.
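A minimal sketch of such an on-the-fly index, using only the Python standard library. The sample document and the name of the reference attribute (`metadata`) are assumptions made for illustration; this is not FoLiA's actual API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified document; the "metadata" attribute name is assumed.
SAMPLE = """<doc>
  <text>
    <p metadata="m1"><s>one</s></p>
    <p metadata="m2"><s metadata="m1">two</s></p>
  </text>
</doc>"""


def build_metadata_index(root):
    """Map each referenced meta-data ID to the elements that refer to it."""
    index = defaultdict(list)
    for elem in root.iter():  # visits elements in document order
        ref = elem.get("metadata")
        if ref:
            index[ref].append(elem)
    return index


root = ET.fromstring(SAMPLE)
index = build_metadata_index(root)
print([e.tag for e in index["m1"]])  # ['p', 's']
```

The index is built in a single pass over the tree, so looking up all nodes connected to one meta-data part afterwards is a plain dictionary access.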

JanOdijk commented on June 2, 2024

Maarten/Ko, can we also discuss what the CHILDES CHAT format allows in this respect and check that FoLiA covers these cases (or explicitly decide that it does not cover all of them, or even none of them)? There are some (I think) simple things, such as the speaker changing with every utterance (and hence also the associated speaker characteristics), but there are also annotations on separate tiers and inside the transcription tier (speaking errors, phonetic representation of the pronunciation, hesitations, false starts and retracing, etc.). I consider these mostly annotations, not metadata, but this distinction is not sharp and certainly not generally accepted.

For the short term I would like to know how to deal with the speaker (and speaker characteristics) changing with every new utterance in FoLiA.

Jan

proycon commented on June 2, 2024

@JanOdijk: I indeed also consider speaker information annotation and not metadata. It is already catered for in FoLiA (there is a generic speaker attribute which can be used on structural elements in a speech context), so it should not pose a problem, although it has not been extensively tested in practice yet. Hesitations, false starts, and retracing might need a new FoLiA element; there already is a distortion element in a speech context, but that might not be appropriate for those. We should discuss it in a new issue when the need arises; I don't consider it metadata in the sense of this issue.

hennie commented on June 2, 2024

I like Ko's suggestion. One of the (Nederlab) cases that gave rise to this issue is the need for author identification at the level of text segments. Nederlab authors are complex entities with their own metadata and associations with titles, external to the FoLiA texts. And they have a unique identifier that can be used to refer to them.

kdepuydt commented on June 2, 2024

The starting point for a diachronic corpus is that every word in the text receives the correct metadata: correct author, correct dating... We try to do this as neatly as possible for the texts we deliver for Nederlab. However, the facilities for this are still lacking, both in FoLiA and in the database.

Maarten has put my question well, and of course I would like a mechanism in which I maintain metadata in one place and refer to the parts of the text to which that "deviating" metadata applies. However, those "deviations" can manifest themselves in different ways.

Easiest variant
Pieter van Dam writes a history and, in the appendices to the chapters, he includes documents (texts) that are not by his hand. These appendices have their own metadata. This is a simple case, because there are corresponding structural elements.

Variant 2
In the small subproject (15th- and 16th-century corpus) it already becomes more complex.
A court book (hofboek) with notes can suddenly give a different dating for the next sentence, which then holds for a while until a new date comes along. You do not want to turn these into separate texts, but to mark up the text with metadata in such a way that you can say: from here up to and including there, this metadata applies. Here it is a matter of date, but for other documents you also have an indication of the hand (i.e. a change of author).

pag. 108- Ahoff Aº XX (=1520)
Wessel then Horne betalt peper xvi pond und lersen
Aº XXI op dach Divisionis Aplorum
It Herbert to Holthuijs maegt to Hijginck oir hoffrecht nijet verwaert gewijset Bernt Mijrt up genaden des herrn
It Egbert ten Kreijll wonende up dessen guijt sal in XIIII daeghen komen und betalen sijn gerechticheijt beij de
Joncher und Drosten und sich dan ingeliveren lathen nae haves rechte hort in den hoff to Mijste. lijse sijn
huijsfrouwe.
It Nale ten culve gehijlickt bijnnen Bocholt staet tot bewijsnisse und vragen off sije in echte staat sijnt.
Hoffgerichte Tegeder Tegeders hoff
It Gebele wijll in XIIII scheijden van den Drosten, sijn wijff peper ende wass. Solvit peper und wass
It Aº XXII Nale ten Wijnckell gehillickt ongescheiden.
It Aº XXIII. herbert to Holthusen betalt 1 £ pepers
Henricus Vaget, Henricus, Willem Portener, Kerstgen Wijbbolt, Bernt van Mijste, Dexx ten hurne, Egbert Elkijnck,
Tebbe Smorckens, Gerbelt Smijt, Johann Gelijnck, Hermen Weert, Herman Stoteler, Bernt Bolijnck, Dirck
Meerden, Schulte ten Ahave, Roert, Schulte Buckel, Essel Snaben Smedijnck
It Roerdinck den schadeloiss brieff 't maeken
It dije Gijldemesters uthn Wolde hebben benompt
It Kaeten benompt
It Raetman benompt
It Huppel Henxsell und Raetman

Variant 3: an additional problem.
In a text edition, words or sentences may be additions by an editor. TEI has encoding for this (resp=editor, attached to a structural element; or add resp=editor, del resp=editor...). Think also of the footnote section of texts, which is usually by the editor.

In the conversion to FoLiA this information is currently lost, also in the files that have already been delivered. Strictly speaking, that is not right.

When we at the INT started working with a selection of the DBNL a number of years ago (and we use TEI), we chose a single mechanism for variants 1 and 2, namely placing milestones (tags that can occur anywhere in the structure, with which we can mark the beginning and end of a passage); in the header with metadata this looked as follows:

The example comes from Bredero's Liedboek, which has poems by other authors at the front.
First come the metadata that apply to the whole, followed by the metadata specific to passages of text:

</p><listBibl id="inlMetadata"><bibl id="dbnl-bred001groo01_01">
<interpGrp type="title.level1"><interp value="Groot lied-boeck" type="main"/></interpGrp>
<interpGrp type="author.level1"><interp value="G.A. Bredero"/></interpGrp>
<interpGrp type="editor"><interp value="editie G. Stuiveling e.a."/></interpGrp>
<interpGrp type="date.publication"><interp value="1975"/><interp value="1983"/><interp value="1979"/></interpGrp>
<interpGrp type="dbnl-datumcontrole"><interp value="G.A. Bredero, Groot lied-boeck, 3 delen, editie G. Stuiveling, A. Keersmaekers, C.F.P. Stutterheim, F. Veenstra en C.A. Zaalberg (deel I); G. Stuiveling, A. Keersmaekers, C.F.P. Stutterheim, F. Veenstra, C.A. Zaalberg en P.J.J. van Thiel (deel II) en F.H. Matter (deel III). Tjeenk Willink-Noorduijn, Culemborg 1975 (deel I) / Martinus Nijhoff, Leiden 1983 (deel II) / Tjeenk Willink-Noorduijn, Den Haag 1979 (deel III)  "/></interpGrp>
<interpGrp type="idno"><interp value="dbnl-bred001groo01_01"/></interpGrp>
</bibl></listBibl>
<listBibl id="dbnl-specific-metadata" default="NO"><bibl id="interp_bred001groo01_1" default="NO">
<interpGrp type="textYear_from"><interp value="1616"/></interpGrp>
<interpGrp type="textYear_to"><interp value="1616"/></interpGrp>
<interpGrp type="witnessYear_from"><interp value="1622"/></interpGrp>
<interpGrp type="witnessYear_to"><interp value="1622"/></interpGrp>
<interpGrp type="authors"><interp value="G.A. Bredero"/></interpGrp>
<biblScope>
<xref from="milestone_bred001groo01_bo_1" to="milestone_bred001groo01_eo_1" targOrder="U"/>
</biblScope>
</bibl><bibl id="interp_bred001groo01_2" default="NO">
<interpGrp type="textYear_from"><interp value="1616"/></interpGrp>
<interpGrp type="textYear_to"><interp value="1616"/></interpGrp>
<interpGrp type="witnessYear_from"><interp value="1622"/></interpGrp>
<interpGrp type="witnessYear_to"><interp value="1622"/></interpGrp>
<interpGrp type="authors"><interp value="C. Aerssens"/></interpGrp>
<biblScope>
<xref from="milestone_bred001groo01_bo_2" to="milestone_bred001groo01_eo_2" targOrder="U"/>
</biblScope>

We did not want to work with a mix of IDs on structural elements in one case and milestones in the case where the metadata did not coincide with a structural element.

I do understand the metadata / annotation discussion: in principle, any information that says something about a word in the text is an annotation. In terms of content, however, I think a distinction can certainly be made between the types of information supplied as enrichment, and I would therefore still separate metadata from other types of annotation.

proycon commented on June 2, 2024

Proposal for submetadata

Thanks for all the feedback. Here is a proposal mostly in line with Ko's suggestion, and hopefully accommodating everybody's needs:

  • The document contains one <metadata> block in the header (no change)
  • That <metadata> block may contain further <submetadata> blocks that define metadata for arbitrary parts of the document.
    • Each <submetadata> block carries an xml:id attribute uniquely identifying it.
    • The <submetadata> element mostly carries the same attributes and behaves the same as the <metadata> element, that is:
      • It takes a type attribute defining the type of metadata (e.g. the metadata scheme)
      • It may refer to an external resource using the src attribute.
      • It may hold one or more foreign-data elements, allowing it to be used with any metadata schema (e.g. Dublin Core, CMDI). If native FoLiA metadata is used (type="native"), then it takes meta elements instead.
      • Unlike <metadata>, <submetadata> does not allow for an <annotations> block with declarations.
  • References are made from within the document; any element may take a metadata attribute that refers to the ID of a submetadata block, e.g.: <s metadata="some.metadata"><t>This is a sentence</t></s>. Metadata is inherited, so it automatically applies to all elements within its scope, unless embedded elements refer to new metadata, which then counts as a replacement. Multiple metadata blocks may be referenced at the same time by space delimited IDs in the metadata attribute, though it is up to the user to ensure this does not lead to conflicts (clashing metadata fields). An empty value for the metadata attribute is also allowed to explicitly cancel any inheritance of a higher element.
  • The metadata attribute is not restricted to structural elements but is allowed on all FoLiA structure and annotation elements, allowing for maximum flexibility.
  • The root-level <metadata> (i.e. not submetadata) always applies to the entire document.
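
The inheritance rules in this list (inherit from the enclosing element, replace on a new reference, cancel on an empty value, several space-delimited IDs at once) can be sketched as a small tree walk. This is only an illustration with the Python standard library on a simplified, namespace-free document; it is not the FoLiA library's implementation, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Simplified sample (no namespace, no header); IDs are hypothetical.
SAMPLE = """<text>
  <div metadata="metadata.1">
    <p><s>inherits metadata.1</s></p>
    <p metadata="metadata.2"><s>replaced by metadata.2</s></p>
    <p metadata=""><s>inheritance cancelled</s></p>
    <p metadata="metadata.1 metadata.2"><s>two blocks</s></p>
  </div>
</text>"""


def effective_metadata(elem, inherited=()):
    """Yield (element, tuple of submetadata IDs in effect) in document order."""
    if "metadata" in elem.attrib:
        # An explicit attribute replaces whatever was inherited; an empty
        # value splits into an empty tuple, which cancels the inheritance.
        inherited = tuple(elem.attrib["metadata"].split())
    yield elem, inherited
    for child in elem:
        yield from effective_metadata(child, inherited)


root = ET.fromstring(SAMPLE)
for elem, ids in effective_metadata(root):
    if elem.tag == "s":
        print(elem.text, "->", ids)
```

The root-level metadata is left out of the sketch, since it always applies to the entire document regardless of these references.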

So this implies that all metadata is together in the document header; if there are references to external metadata sources, then these are also explicit in the header. The references, however, flow from the document to the header rather than vice versa. This is in line with the FoLiA principle to keep things as local as possible, allowing people to readily identify whether a particular section they are looking at has particular submetadata associated with it. It also facilitates the job of simple parsers, which can quickly obtain all elements a submetadata block applies to with an XPath expression.
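
As a rough illustration of the XPath lookup mentioned above, the sketch below uses the limited XPath support in Python's `xml.etree.ElementTree` on a simplified, namespace-free sample (the sample itself is made up). The expression only matches elements that carry the attribute directly with exactly one ID; elements that merely inherit the metadata, or that list several space-delimited IDs, would need extra handling.

```python
import xml.etree.ElementTree as ET

# Simplified sample without namespace; not a complete FoLiA document.
SAMPLE = """<text>
  <p metadata="metadata.1"><s>Het volgende vers komt uit Hamlet:</s></p>
  <p metadata="metadata.2"><s>To be, or not to be</s></p>
  <s metadata="metadata.2">that is the question</s>
</text>"""

root = ET.fromstring(SAMPLE)

# All elements, at any depth, that refer directly to metadata.2.
hits = root.findall(".//*[@metadata='metadata.2']")
print([e.tag for e in hits])  # ['p', 's']
```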

In anticipation of certain questions: the milestone approach is interesting but would have some problems for FoLiA. Milestones would occur either INSIDE the text (inside <t>) or between structural elements. The latter renders the need for milestones obsolete, since it would imply there are structural elements which cover the content anyway. The former is problematic because there can be multiple text layers (think of e.g. historical layer vs. modernized), no text layers at all (think of speech), or redundancy in text layers (expressed at multiple levels). Moreover, this current proposal allows (sub)metadata to be associated with anything, not just text, hopefully preventing any future situation where we find that we can't sufficiently express metadata.

An example excerpt (details omitted) of how this would look:

<FoLiA>
<metadata>
 <annotations>...</annotations>
 <submetadata xml:id="metadata.1" type="native">
   <meta id="author">proycon</meta>
   <meta id="language">nld</meta>
 </submetadata>
 <submetadata xml:id="metadata.2" type="native">
   <meta id="author">Shakespeare</meta>
   <meta id="language">eng</meta>
 </submetadata>
</metadata>
<text>
 <p metadata="metadata.1">
   <t>Het volgende vers komt uit Hamlet:</t>
 </p>
 <p metadata="metadata.2">
  <s><t>To be, or not to be, that is the question:</t></s>
  <s><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>
 </p>
</text>
</FoLiA>

Since metadata can be associated with anything, any arbitrary sub-parts of untokenised text can be selected with the existing facilities <str> or <t-str> and associated with metadata. Some redundancy takes place only when structural boundaries are crossed (the metadata attribute might have to be repeated on multiple structural elements if there is no catch-all structure).

What do you think of this proposal? Does this cover all use-cases?

kdepuydt commented on June 2, 2024

Dear Maarten,
I think we are almost there. Could you please explain how we should add metadata in variant 2 of my previous comment? There, there is no structural element I could attach the reference to.

proycon commented on June 2, 2024

I don't know what kind of structural elements you have in that particular example, but all cases can be made to work. If you have, say, sentence structure but no overarching paragraph, division or other element that would be the most appropriate level to associate the metadata with, you can simply refer to the metadata from each sentence. So building on my previous example, instead of:

 <p metadata="metadata.2">
  <s><t>To be, or not to be, that is the question:</t></s>
  <s><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>
 </p>

You can also do, for instance if there's no <p> or other structure to attach it to:

  <s metadata="metadata.2"><t>To be, or not to be, that is the question:</t></s>
  <s metadata="metadata.2"><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>

If you don't have sentences but words/tokens, then you can do it at that level. But it's most efficient to group things in bigger yet sensible structural units of course, whatever they may be.

If the whole text is one big untokenised chunk for which no further structure has been determined yet, then you can use the <str> or <t-str> elements to mark any arbitrary parts of it (see section 2.10.13 of the FoLiA documentation). But the use of proper structure elements is always preferred if possible, and a requirement for deeper linguistic annotation! An example of this scenario:

<text>
   <t><t-str metadata="metadata.1">Het volgende vers komt uit Hamlet</t-str><br/>
   <t-str metadata="metadata.2">To be, or not to be, that is the question:<br/>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t-str></t>
</text>

Does this answer your question?

hennie commented on June 2, 2024

"If you don't have sentences but words/tokens, then you can do it at that level. But it's most efficient to group things in bigger yet sensible structural units of course, whatever they may be."

If I understand Katrien correctly, she wants to associate metadata with segments that cross the boundaries of the XML structure and that are potentially very large. Milestones allow you to identify such segments. Annotating a long sequence of tokens using your proposal would require replicating the metadata attribute on each token of the sequence. Is it feasible and compliant with FoLiA design principles to identify a sequence of tokens using the IDs of the begin and end tokens, and in that way associate a metadata attribute with the sequence only once?

proycon commented on June 2, 2024

I see the issue yes, but I think it can be remedied in other more FoLiA-like ways, although a small amount of duplication may indeed occur in certain cases where structural boundaries are crossed.

Taking your example of a large number of word tokens, of which a large subset gets different metadata: the range can be marked by simply introducing a new structural element. If there is no proper semantic choice such as paragraph, sentence, division (e.g. chapter/section/subsection) or event, then one can always fall back to the <part> structural element, which is a kind of catch-all solution. Assume we start with text that has mere tokens, possibly some linebreaks, but no further structure at all:

<text>
  <w><t>Het</t></w>
  <w><t>volgende</t></w>
  <w><t>vers</t></w>
  <w><t>komt</t></w>
  <w><t>uit</t></w>
  <w><t>Hamlet:</t></w>
  <br />
  <w><t>To</t></w>
  <w><t>be</t></w>
  <w><t>or</t></w>
  <w><t>not</t></w>
  <w><t>to</t></w>
  <w><t>be</t></w>
</text>

We can then simply introduce a <part> structure element (but preferably a semantically more informed choice if possible!) to group structure and assign it metadata once:

<text>
  <part metadata="metadata.1">
   <w><t>Het</t></w>
   <w><t>volgende</t></w>
   <w><t>vers</t></w>
   <w><t>komt</t></w>
   <w><t>uit</t></w>
   <w><t>Hamlet:</t></w>
  </part>
  <br />
  <part metadata="metadata.2">
   <w><t>To</t></w>
   <w><t>be</t></w>
   <w><t>or</t></w>
   <w><t>not</t></w>
   <w><t>to</t></w>
   <w><t>be</t></w>
  </part>
</text>

FoLiA does not currently use any referencing system that refers to a begin and an end, but opts for explicit references to the entire range (also in e.g. span annotation; consider that we also support discontinuous spans). FoLiA tries to make use of the XML hierarchy wherever possible. Hence my reluctance to opt for ranges, where the burden is shifted to the client to resolve the references and then iterate over the range (which is not even as trivial as it might seem at first), complicating retrieval.
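
To make that client-side burden concrete, here is a rough sketch of what resolving a begin/end reference would involve. It is purely hypothetical: FoLiA does not offer such ranges, and the sample markup, the IDs and the function name `tokens_in_range` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample in which the range w3..w5 crosses a paragraph boundary.
SAMPLE = """<text>
  <p><w xml:id="w1">a</w><w xml:id="w2">giant</w><w xml:id="w3">dog</w></p>
  <p><w xml:id="w4">with</w><w xml:id="w5">enormous</w><w xml:id="w6">teeth</w></p>
</text>"""


def tokens_in_range(root, begin_id, end_id, tag="w"):
    """Collect `tag` elements from begin_id to end_id (inclusive) in document order.

    No validation is done: a missing or misordered ID yields a partial
    or empty result, which a real client would have to detect.
    """
    selected, inside = [], False
    for elem in root.iter(tag):
        if elem.get(XML_ID) == begin_id:
            inside = True
        if inside:
            selected.append(elem)
        if elem.get(XML_ID) == end_id:
            break
    return selected


root = ET.fromstring(SAMPLE)
print([w.text for w in tokens_in_range(root, "w3", "w5")])
```

Even this small version has to walk the whole document in order and ignores nesting, multiple text layers and discontinuous ranges, which is where the real complexity would sit.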

kosloot commented on June 2, 2024

kdepuydt commented on June 2, 2024

Hi Maarten,
I think I understand the reason for the solutions you suggest.
From a data-producing perspective, this is quite terrible. You have to know that the processing of the texts for Nederlab involves manual correction of the XML encoding.
The reason why we chose milestones in TEI was to avoid having to introduce a structural element like seg, which would have to be repeated within different structural elements to indicate a specific section with metadata that does not properly nest with the rest of the XML.
We are now using TEI for the processing of the texts, and the FoLiA format is reached by automatic conversion. So this would only bother Jesse, who does the TEI-to-FoLiA conversion.
But having this solution for people who would like to start with FoLiA straight away is not ideal :-(.

There are several layers of information one wants to add to a text, and these are indicated separately in FoLiA.
Why was a solution not chosen in which we have:
layer 1: original text (complete); this is the core text, containing structural encoding and milestones (metadata) (or layer 1b with structural encoding and layer 1c with metadata annotations)
layer 2: Ticcl
layer 3: linguistic annotation, etc.?

proycon commented on June 2, 2024

(Proposal accepted after skype call)

proycon commented on June 2, 2024

Implementation in Python library and FoLiA tools is ready and available in master branch (pending release).
