It is often desirable to know exactly what tools (and what versions thereof and even

Comments (10)

kosloot commented on June 2, 2024

So you are thinking of a kind of 'history' field in the metadata?

<history>
  <created date="" tool="frog" version="1.3"/>
  <modified date="" tool="ucto" version="0.17"/>
  <modified date="" tool="FoLiA-correct" version="3.14"/>
</history>

Or is this too simple?
And then still:

- We need a very clean and simple API to add items to the history, otherwise nobody will use it.
- Not all TICCL and FoLiA tools have their own version information yet. That might be needed?
- Do we need to add some checksum (MD5 hash or so) to ensure a FoLiA document is untouched since the last history change?

Lots of things to think about.

proycon commented on June 2, 2024

Yes, indeed, but it also has to be tied to the specific annotation types+sets so you know what has been modified by what tool, so it goes further indeed.

I was also thinking along the lines of checksums to ensure document integrity, but full-document checksums may be too expensive to compute and slow things down too much, so I'm not sure yet whether that's something FoLiA should do itself or should be left external.

We will at least need some kind of document version hashing too (probably hashes analogous to how git commit hashes work, hashing in the parent and all), so the provenance trail can be established.

A clean and simple API is indeed an important requirement.

I'll ponder about all this for a bit :)

proycon commented on June 2, 2024

After some pondering, here is a rough first proposal, still very much open for debate of course:

(THIS IS SUPERSEDED BY THE SECOND PROPOSAL, SCROLL DOWN)

Example

<FoLiA document_version="<HASH>" version="<FOLIA_VERSION>" generator="libfolia...">
<metadata>
 <annotations>
    <pos-annotation set="https://....">
        <annotator name="mbpos" type="auto" processor="p1" />
        <annotator name="proycon" type="manual" processor="p2" />
    </pos-annotation>
    <lemma-annotation set="https://....">
        <annotator name="mblem" type="auto" processor="p1" />
    </lemma-annotation>
 </annotations>
 <provenance>
    <processor xml:id="p1" name="frog" version="0.12" document_version="<HASH>" folia_version="1.6" generator="libfolia-..." command="frog --skip=pn" host="mlp04.science.ru.nl" user="proycon" begindatetime=".." enddatetime=".." />
    <processor xml:id="p2" name="flat" version="0.7.12" host="flat.science.ru.nl" begindatetime=".." enddatetime="..." /> 
 </provenance>
</metadata>

Explanation

The annotator and annotatortype attributes on declarations become an annotator element inside the declaration (with name and type attributes respectively).
- This would allow for multiple/all annotators to be explicitly listed, so the default would need to be marked.
- The processor attribute provides an ID that links to the provenance data:
The provenance layer lists processors in chronological order
- Each processor has an ID
- The name identifies the actual tool
- version is the version of the processor aka tool
- document_version is the version hash (more on this later) generated by that processor. The last processor never has a document_version because it corresponds to the current version, which is already stated in the root FoLiA tag.
- command (optional) - The exact command that was run
- host (optional) - The host on which the processor ran
- user (optional) - The user/executor which ran the processor
- folia_version (optional, not on last processor) - The folia version that was written
- generator (optional, not on last processor) - The generator (i.e. FoLiA library used)
- begindatetime/enddatetime

With processors and annotators decoupled to an extent, we'd allow for one processor to support multiple annotators. An example could be the case of frog and its submodules, or more strikingly, FLAT and the human annotators that make use of it. In many simple cases, processor and annotator will likely be the same. The current proposal would also allow for one processor (like Frog) to do multiple layers at once, generally implying they are somehow related (pos and lemma layers both refer to p1 here).

To really tie a particular annotation to a specific processor (an annotator name might be reused by multiple processors), we might either need to introduce a processor attribute on annotations, or rely on deducing it from timestamps (assuming those are present).

Version hash

The document version hash is a checksum on the whole metadata block (or possibly on the whole document if we want a very strict version; we could implement an optional 'secure' mode or something).
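To make the git-like idea mentioned earlier in the thread concrete, here is a minimal sketch of chaining a version hash over the metadata block. It is only an illustration under stated assumptions: the function name `version_hash`, the choice of SHA-256 and the toy metadata strings are all hypothetical and not part of any FoLiA library, and a real implementation would first have to serialise the XML in a canonical form so that equivalent documents hash identically.

```python
import hashlib


def version_hash(metadata_xml: str, parent_hash: str = "") -> str:
    """Hash a serialised metadata block together with the parent version hash.

    Hypothetical helper: hashing the parent in makes each version depend on
    the whole chain before it, analogous to git commit hashes.
    """
    h = hashlib.sha256()
    h.update(parent_hash.encode("utf-8"))
    h.update(b"\n")  # separator between the parent hash and the content
    h.update(metadata_xml.encode("utf-8"))
    return h.hexdigest()


# Toy metadata snapshots (assumed, heavily simplified) after each processor ran.
after_frog = '<metadata><processor name="frog" version="0.12"/></metadata>'
after_flat = ('<metadata><processor name="frog" version="0.12"/>'
              '<processor name="flat" version="0.7.12"/></metadata>')

v1 = version_hash(after_frog)                  # first version, no parent
v2 = version_hash(after_flat, parent_hash=v1)  # second version chains to v1
```

Because `v2` depends on `v1`, altering the earlier metadata changes every later hash, which is what would let a provenance trail be verified.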

kosloot commented on June 2, 2024

An alternative would be to use some external tool which stores this history:
I could imagine a file containing a filename plus MD5 hash and the history of how that file was created.
When another tool modifies this file, the history must be updated and a new MD5 calculated.

The advantage is that this does not interfere with the FoLiA document itself.
The disadvantage is that another tool is needed to guard the provenance. But it could be used for all kinds of files.
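A rough sketch of what such an external record could look like, purely as an illustration: the sidecar file name (`<document>.provenance.json`), the JSON layout and the function names below are assumptions made up for this example, not an existing tool or format.

```python
import hashlib
import json
from pathlib import Path


def _md5(path: Path) -> str:
    """MD5 checksum of the file contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def _sidecar(doc: Path) -> Path:
    """Hypothetical naming scheme: the history lives next to the document."""
    return doc.with_name(doc.name + ".provenance.json")


def record_step(doc: Path, tool: str, version: str) -> dict:
    """Append a history entry for `doc` and store its current checksum."""
    sidecar = _sidecar(doc)
    if sidecar.exists():
        log = json.loads(sidecar.read_text(encoding="utf-8"))
    else:
        log = {"file": doc.name, "history": []}
    log["history"].append({"tool": tool, "version": version})
    log["md5"] = _md5(doc)
    sidecar.write_text(json.dumps(log, indent=2), encoding="utf-8")
    return log


def is_untouched(doc: Path) -> bool:
    """True if `doc` still matches the checksum of the last recorded step."""
    log = json.loads(_sidecar(doc).read_text(encoding="utf-8"))
    return log["md5"] == _md5(doc)
```

A tool would call `record_step(path, "frog", "1.3")` after writing the document, and anything downstream could call `is_untouched(path)` before trusting the recorded history.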

Such tools DO exist: Taverna and Kepler

A good read might be: Provenance

proycon commented on June 2, 2024

Tools like Taverna are fully fledged workflow systems indeed; keeping track of provenance is one of their features. Similar provenance tracking features are planned in the WP3 VRE (@menzowindhouwer).

This idea operates at a lower level and is data/FoLiA-centric only, of course; if external tools can handle provenance, that is perfect. But what prompted my thinking in this direction in the first place was this: what if the VRE invokes a 'blackbox' workflow system like PICCL? The calling system (VRE) would only be able to do rough provenance tracking and can't see what happens inside the black box. For at least the FoLiA parts, we can make this explicit very precisely with this idea. The calling system can optionally read this provenance data and integrate it into its own tracing.

Also, if FoLiA capable tools support this (and we have a lot of control over that since we produce most), then proper provenance data would be available even in the absence of any over-arching pipeline system.

proycon commented on June 2, 2024

I'm going to start an initial implementation of this in the Python library, leaving the version hashing for later as that needs some more careful thought.

proycon commented on June 2, 2024

Second proposal

This is a second proposal for provenance tracking in FoLiA, differing from the first in various regards after some more deliberation. I'll first give an example and then elaborate:

Example

<FoLiA document_version="3" version="<FOLIA_VERSION>" generator="libfolia...">
<metadata>
 <annotations>
    <pos-annotation set="https://posset">
        <annotator processor="p1.1" />
        <annotator processor="p2.1" />
    </pos-annotation>
    <lemma-annotation set="https://....">
        <annotator processor="p1.2" />
    </lemma-annotation>
 </annotations>
 <provenance>
    <processor xml:id="p0" name="ucto" version="0.14" folia_version="1.6" command="ucto -Lnld" host="mlp04.science.ru.nl" user="proycon" begindatetime=".." enddatetime=".." document_version="1">
        <meta id="config">tokconfig-nld</meta>
        <meta id="language">nld</meta>
        <processor xml:id="p0.1" name="libfolia" version="1.15" folia_version="1.6" type="generator" />
    </processor>
    <processor xml:id="p1" name="frog" version="0.12" folia_version="1.6" command="frog --skip=pn" host="mlp04.science.ru.nl" user="proycon" begindatetime=".." enddatetime=".." document_version="2">
        <meta id="config">nld</meta>
        <processor xml:id="p1.0" name="libfolia" version="1.15" folia_version="1.6" type="generator" />
        <processor xml:id="p1.1" name="mbpos" />
        <processor xml:id="p1.2" name="mblem" />
    </processor>
    <processor xml:id="p2" name="flat" version="0.7.12" host="flat.science.ru.nl" begindatetime=".." enddatetime="..." document_version="3">
         <processor xml:id="p2.0" name="pynlpl.formats.folia" version="1.3.0" folia_version="1.6" type="generator" />
         <processor xml:id="p2.1" name="proycon" type="manual" />
    </processor>
 </provenance>
</metadata>
...
</FoLiA>

Inside a FoLiA document, a new generic attribute processor is introduced for all annotations, allowing us to refer directly to processors, which will be the preferred behaviour from 1.6 onwards:

<pos processor="p2.1" class="noun" />

The old-style annotator/annotatortype attributes on annotations also remain supported of course. In this case FoLiA libraries must implicitly resolve them to a processor (provided that provenance is used at all); if there is ambiguity, an error should be raised!

<pos annotator="proycon" annotatortype="manual" class="noun" />

In the absence of any annotator/processor information as in the following example, a default will be sought:

<pos class="noun" />

If no provenance is provided, old-style annotator and annotatortype attributes are used as the default. If provenance is provided, i.e. the declaration contains <annotator> elements referring to processors, then a single annotator is the default; if there are multiple annotators, no default is possible and an error should be raised.
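The resolution rules above can be sketched as a small function. This is only an illustration of the proposal, not the behaviour of any FoLiA library: the function name, its parameters and the exception class are hypothetical, and raising an error when an old-style annotator name matches no declared processor is my own assumption (the proposal only speaks of ambiguity).

```python
from typing import Mapping, Optional, Sequence


class ProcessorResolutionError(Exception):
    """Hypothetical error for an annotation that cannot be tied to one processor."""


def resolve_processor(
    declared: Sequence[str],         # processor IDs listed as <annotator> in the declaration
    names: Mapping[str, str],        # processor ID -> processor name
    processor: Optional[str] = None,  # new-style processor attribute on the annotation
    annotator: Optional[str] = None,  # old-style annotator attribute on the annotation
) -> Optional[str]:
    if processor is not None:
        return processor             # explicit reference, nothing to resolve
    if not declared:
        return None                  # no provenance: old-style defaults apply
    if annotator is not None:
        matches = [pid for pid in declared if names.get(pid) == annotator]
        if len(matches) == 1:
            return matches[0]
        raise ProcessorResolutionError(
            f"annotator {annotator!r} matches {len(matches)} processors")
    if len(declared) == 1:
        return declared[0]           # a single declared annotator is the default
    raise ProcessorResolutionError("multiple annotators declared, no default possible")


# IDs and names taken from the second proposal's example.
names = {"p1.1": "mbpos", "p2.1": "proycon", "p1.2": "mblem"}
pos_declared = ["p1.1", "p2.1"]
lemma_declared = ["p1.2"]

explicit = resolve_processor(pos_declared, names, processor="p2.1")
old_style = resolve_processor(pos_declared, names, annotator="proycon")
default = resolve_processor(lemma_declared, names)
```

With the pos declaration, an annotation carrying no annotator or processor information at all would raise, since two annotators are declared and no default can be chosen.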

Explanation

Provenance is optional; we don't want to force the added complexity on people. In the absence of provenance, things behave as they always have, which also guarantees full backward compatibility with earlier FoLiA versions.
The various declarations in the <annotations> block specify one or more annotators; each annotator is linked to a processor in the <provenance> block. The order of the annotators here is irrelevant.
The <provenance> block lists processors in order. Processors may also be nested (in contrast to the first proposal); this allows capturing situations where various tools are wrapped (consider a web-based VRE that invokes a PICCL pipeline which in turn invokes Frog, which consists of various modules). The processors support various attributes, most of which are optional:
- Each processor has an ID
- The name identifies the actual tool
- version (optional but strongly recommended) is the version of the processor aka tool
- document_version (optional but strongly recommended) - A label for the version of the document as written by that processor (see the note on document_version below)
- command (optional) - The exact command that was run
- host (optional) - The host on which the processor ran
- user (optional) - The user/executor which ran the processor
- folia_version (optional) - The folia version that was written
- begindatetime/enddatetime (optional)
- resourcelink (optional) - The URI of any RDF resource describing this processor
- Additional custom metadata is allowed in the form of <meta> elements (just like with folia native metadata) inside the scope of a processor, FoLiA does not define the semantics of any such metadata, i.e. they are tool/application-specific and could for instance be used to specify tool parameters used.
Each processor has a type, which subsumes the concept of annotatortype:
- auto - (default) - The processor is an automated tool that provided annotations
- manual - The processor refers to a manual annotator
- generator - The processor indicates the FoLiA library used by the parent and sibling processors (unless sibling processors specify another generator in their scope)
The idea of document version hashing is abandoned in this proposal; document_version (optional) simply refers to any label the user desires to indicate a version of the document. The version order is implicit in the order of the processors. Integrity checks and version control are left to external tools, e.g. a certain git commit hash or checksum may map to a certain document_version. The root FoLiA tag also carries a document_version attribute that corresponds with the document_version of the latest processor if the provenance chain is complete.
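As an illustration of how nested processors and the generator type could be read back, here is a minimal sketch using only Python's standard library. The lookup rule is my reading of the description above (look in the processor's own scope first, then walk up through its ancestors); the function name `find_generator` and the trimmed-down provenance fragment are assumptions for the example, not the API of an actual FoLiA library.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed-down fragment of the example provenance block.
PROVENANCE = """
<provenance>
  <processor xml:id="p1" name="frog" version="0.12">
    <processor xml:id="p1.0" name="libfolia" version="1.15" type="generator" />
    <processor xml:id="p1.1" name="mbpos" />
  </processor>
  <processor xml:id="p2" name="flat" version="0.7.12">
    <processor xml:id="p2.0" name="pynlpl.formats.folia" version="1.3.0" type="generator" />
    <processor xml:id="p2.1" name="proycon" type="manual" />
  </processor>
</provenance>
"""


def find_generator(root, processor_id):
    """Return the generator processor that applies to `processor_id`, or None."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    node = next((p for p in root.iter("processor")
                 if p.get(XML_ID) == processor_id), None)
    while node is not None:
        # A generator among this node's direct children covers the node and
        # its other children (the generator's siblings).
        for child in node.findall("processor"):
            if child.get("type") == "generator":
                return child
        node = parent.get(node)  # otherwise try the enclosing scope
    return None


root = ET.fromstring(PROVENANCE)
mbpos_generator = find_generator(root, "p1.1").get("name")    # "libfolia"
manual_generator = find_generator(root, "p2.1").get("name")   # "pynlpl.formats.folia"
```

Walking upwards means a sub-processor without its own generator inherits the one declared alongside it, while a nested scope can still override it.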

Internal vs external provenance

I'm aware that work is being done to log provenance 'externally', especially by pipeline systems such as the CLARIAH WP3 VRE. I see the proposed FoLiA-internal provenance logging more as complementary to such efforts. The obvious limitation is that this captures only FoLiA, but from that arises its main strength as well: it allows logging provenance of linguistic annotations on a very deep level. The provenance of each single annotation in a FoLiA document can be traced this way, which external format-agnostic solutions could never accomplish.

If support for provenance is built into the FoLiA libraries, it becomes available to the large number of FoLiA-capable tools (and we, as developers of most of them, have a lot of control over that). This means provenance can then be tracked nicely even in the absence of any overarching pipeline system.

Feedback welcome! (@dgbroeder, @menzowindhouwer, @vicding-mi, @BasLee, @JessedeDoes, possibly also interesting for @JanOdijk?)

dgbroeder commented on June 2, 2024

Well, it is not purely complementary; with respect to the 'keeping track' functionality it's largely overlapping. But of course it's excellent if you also keep the history of the FoLiA file within the FoLiA file itself: there is safety in numbers, and an advantage is that if you move data to or from a provenance-tracking VRE system in FoLiA format it will keep the information. A disadvantage, as you say, is that it only covers FoLiA data and the FoLiA ecosystem of processors.
I think there have been quite a few systems in the past where the provenance track was also kept in the file header. Notably Entropic's ESPS comes to mind, which also allowed complex information to be added next to the processing parameters that were already stored by default. If you limited yourself to ESPS (and you were an expert) it worked fine.

Important is that there is a way to convert between the different domains of provenance information handling.

proycon commented on June 2, 2024

Amendment to Second Proposal

During the CLARIAH Provenance Workshop last Monday it became clear that there is also a need to register in the data provenance chain what data resources are used/consumed/queried by a tool. For instance, what data is the tool trained on?

As my proposal allows for nesting processors, I think we can accommodate this idea (to a certain limit) by introducing a new subprocessor type datasource. A processor can have zero or more datasource subprocessors which describe relevant data the tool used.

Example (look inside frog/mbpos):

    <processor xml:id="p1" name="frog" version="0.12" folia_version="1.6" command="frog --skip=pn" host="mlp04.science.ru.nl" user="proycon" begindatetime=".." enddatetime=".." document_version="2">
        <meta id="config">nld</meta>
        <processor xml:id="p1.0" name="libfolia" version="1.15" folia_version="1.6" type="generator" />
        <processor xml:id="p1.1" name="mbpos">
              <processor xml:id="p1.1.1" type="datasource" name="CGN Corpus" version="unknown" />
              <processor xml:id="p1.1.2" type="datasource" name="WOTAN Corpus" version="unknown" />
              <processor xml:id="p1.1.3" type="datasource" name="DCOI Corpus" version="unknown" />
              <processor xml:id="p1.1.4" type="datasource" name="Lassy Klein Corpus" version="unknown" />
        </processor>
        <processor xml:id="p1.2" name="mblem" />
    </processor>

Of course, all this is not mandatory and requires the specific tool to actually supply this data. The specific annotations (or declarations thereof) may refer either to these datasources, which implies that all parent processors were invoked and that this particular datasource was the source, or to the parent processor (mbpos in this case), which implies that any (zero or more) of its subprocessors might have played a role.
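The two ways of referring to datasources can be illustrated with a short sketch. The function name `possible_datasources` and the shortened provenance fragment are assumptions for this example only, not part of a FoLiA library.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened version of the frog/mbpos example.
PROVENANCE = """
<provenance>
  <processor xml:id="p1" name="frog" version="0.12">
    <processor xml:id="p1.1" name="mbpos">
      <processor xml:id="p1.1.1" type="datasource" name="CGN Corpus" version="unknown" />
      <processor xml:id="p1.1.2" type="datasource" name="WOTAN Corpus" version="unknown" />
    </processor>
    <processor xml:id="p1.2" name="mblem" />
  </processor>
</provenance>
"""


def possible_datasources(root, processor_id):
    """Names of the datasources that may have played a role for a reference.

    A reference to a datasource yields just that datasource; a reference to a
    parent processor yields every datasource nested anywhere below it.
    """
    for node in root.iter("processor"):
        if node.get(XML_ID) == processor_id:
            # iter() visits the node itself and all of its descendants.
            return [p.get("name") for p in node.iter("processor")
                    if p.get("type") == "datasource"]
    raise KeyError(processor_id)


root = ET.fromstring(PROVENANCE)
via_parent = possible_datasources(root, "p1.1")        # ['CGN Corpus', 'WOTAN Corpus']
via_datasource = possible_datasources(root, "p1.1.2")  # ['WOTAN Corpus']
without_sources = possible_datasources(root, "p1.2")   # []
```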

Resourcelink

Another amendment I already edited into the proposal is that all processors may take a resourcelink attribute pointing to an RDF resource (of any kind) further describing the processor (be it a datasource, tool or human annotator). This allows linking to the external world of linked open data from the provenance chain in FoLiA.

proycon commented on June 2, 2024

This is now implemented as proposed, and documented here: https://folia.readthedocs.io/en/latest/metadata.html#provenance-data

Add proper support for provenance logging in FoLiA (closed)
