annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.

Home Page: https://annotation.github.io/stam/

License: Creative Commons Attribution Share Alike 4.0 International

Makefile 0.74% HTML 99.26%

annotation linguistics webannotation text stand-off text-annotation


stam's Issues

consider adding remarks or descriptions

Especially with annotationdataset and annotationdata it may be a good idea to add a clarifying remark.

Write STAM paper

When the library and tooling around STAM are mature enough, I'd like to write
and publish a paper on it, in which we present STAM and evaluate various aspects of it.
This was also suggested by @roelandordelman during the CLARIAH Tech Day recently.

This would happen at the earliest in Q4 2023 (but probably even later).

Improve STAM Query Language documentation

Could be a bit nicer and clearer for end-users.

Formulate a STAM Query Language

A query language should be formulated to effectively query a STAM model. The
query language will be formulated as an extension to STAM and effectively
provides a higher-level interface that can be directly exposed to end-users as
the primary means of interacting with a STAM model.

The query language should be able to express (non-exhaustive):

Querying by text
Querying by text relations (overlap, embedding, adjacency, etc) (as implemented via TextSelectionOperator)
Querying by annotation data (with various operators, equality, inequality, greater than, less than, etc)
Querying by resource or annotation dataset
All common logical operators
Adding, editing, and deletion of annotations/annotationdata. That is, the queries are not just used to retrieve data, but also to add/update/delete data.

The query language should be accessible enough for (technical) researchers.
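
As a rough illustration of the text relations named in the list above, the sketch below tests overlap, embedding and adjacency on half-open character ranges. The `Span` type and the function names are hypothetical; this is not the TextSelectionOperator API of any STAM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character range [begin, end) in a text resource (hypothetical type)."""
    begin: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    # The two ranges share at least one character.
    return a.begin < b.end and b.begin < a.end


def embeds(a: Span, b: Span) -> bool:
    # b lies entirely within a.
    return a.begin <= b.begin and b.end <= a.end


def precedes_adjacent(a: Span, b: Span) -> bool:
    # a ends exactly where b begins, with nothing in between.
    return a.end == b.begin


sentence = Span(0, 20)
word = Span(4, 9)
next_word = Span(9, 12)
```

A query language could expose such predicates as operators over text selections and combine them with the logical operators mentioned above.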

Improve the space-efficiency of complex selectors

I'd like to improve the space-efficiency of the complex selectors
(MultiSelector/CompositeSelector/DirectionalSelector). In earlier discussions,
we already established that the MultiSelector is a valid tool to annotate
multiple targets using only a single annotation. However, in the current
implementation, the selector is still implemented in a way that explicitly
enumerates all the offsets in the text. So if you annotate 100,000 targets with
a single annotation via a MultiSelector (saving yourself 99,999 annotations in
the process), you still have 100,000 subselectors in memory.

This can be done more space-efficiently. In Text Fabric, @dirkroorda efficiently maps entire
ranges of nodes to annotation content (features):

1-426590 word
426591-426629 book
426630-427558 chapter
427559-515689 clause
515690-606393 clause_atom
606394-651572 half_verse
651573-904775 phrase
904776-1172307 phrase_atom
1172308-1236024 sentence
1236025-1300538 sentence_atom
1300539-1414388 subphrase
1414389-1437601 verse
1437602-1446831 lex

I think we need a similar way to express large ranges in STAM. We too have
'nodes' that are expressed by an internal integer ID (TextSelections,
Annotations, TextResources, AnnotationDataSets), and if there's a large
contiguous range of them we can refer to them by a simple begin intID and end intID
(or multiple such pairs if there are non-contiguous parts).

In ideal circumstances, we can then express a complex selector with 100,000
subselectors using just one (new) ranged subselector instead.
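
A minimal sketch of the idea, assuming targets are identified by plain integer IDs: contiguous runs collapse into inclusive (begin, end) pairs. The function names are hypothetical and do not reflect how stam-rust stores selectors.

```python
from typing import Iterable, List, Tuple


def compress_ids(ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse integer IDs into inclusive (begin, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for i in sorted(set(ids)):
        if ranges and i == ranges[-1][1] + 1:
            # Extends the current contiguous run.
            ranges[-1] = (ranges[-1][0], i)
        else:
            # Starts a new run.
            ranges.append((i, i))
    return ranges


def expand_ranges(ranges: Iterable[Tuple[int, int]]) -> List[int]:
    """Inverse of compress_ids: enumerate every ID in the ranges."""
    return [i for begin, end in ranges for i in range(begin, end + 1)]


# 100,000 contiguous targets plus one small non-contiguous part.
targets = list(range(1, 100_001)) + [200_000, 200_001]
compact = compress_ids(targets)
```

Here `compact` holds two pairs instead of 100,002 entries. Note that this sketch sorts the IDs, so it discards target order; a DirectionalSelector would need an order-preserving variant.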

Such a ranged subselector may be best kept as a part of STAM's 'extended
model', i.e. parts of its internals and not expressed in canonical
serialisation. This keeps the model simple and easier to interpret for the outside world, but uses
the necessary optimisations internally.

There's one limitation in this approach: When targeting text, using such a
ranged subselector would only work for 'simple' offsets, that is, offsets that
refer directly to the resource using begin-aligned cursors. If the offset is
relative (goes through another annotation) or uses end-aligned cursors, then we
need to store a copy of that offset.
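
The distinction can be sketched as follows. This assumes that end-aligned cursors are expressed as zero or negative values counted back from the end of the text; the `Cursor` and `Offset` types and the `is_simple` check are hypothetical illustrations, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cursor:
    value: int
    end_aligned: bool = False  # hypothetical flag; False means begin-aligned


@dataclass(frozen=True)
class Offset:
    begin: Cursor
    end: Cursor
    via_annotation: Optional[str] = None  # set when the offset is relative to an annotation


def resolve(cursor: Cursor, text_length: int) -> int:
    """Turn a cursor into an absolute begin-aligned position."""
    if cursor.end_aligned:
        # Assumption: end-aligned values are zero or negative and count back from the end.
        return text_length + cursor.value
    return cursor.value


def is_simple(offset: Offset) -> bool:
    """True if the offset refers directly to the resource with begin-aligned cursors."""
    return (
        offset.via_annotation is None
        and not offset.begin.end_aligned
        and not offset.end.end_aligned
    )
```

Only offsets for which `is_simple` holds could be folded into a ranged subselector; the others would keep their own stored copy.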

Add examples other than "explicit_containment"

Could you add .txt and .json files for all the examples listed in https://github.com/annotation/stam/blob/master/examples/README.md

Thanks for the great work!

High-level API design

I want to take the next step towards designing a good high-level API for STAM. In the current implementation, things have grown somewhat organically, but we've reached a stage where things are becoming cluttered or confusing if not well designed, and where some expected high-level methods are still clearly missing.

Please read my API proposal and comment here in this issue. The document is not normative for STAM itself (any implementation may decide to do things differently); STAM as such prescribes only a data model and expected functionality for implementations, but not an API.

I also want to more clearly separate the internal API in stam-rust from the higher-level API that is exposed; right now too many internals are exposed publicly in the library. This means I want to close off parts of the low-level API. Such a decoupling layer allows for easier internal changes without affecting the outside world.

It does imply there's going to be a fairly big API breakage for the next stam-rust and stam-python releases, but that was coming anyway because of other changes, and at this stage that is still manageable. I hope to cover most breaking changes in a single release.

The high-level API design also relates to our aim to formulate a query language (#12) and implementation thereof (annotation/stam-rust#14), because most of the methods are related to searching. The proposed API sits at one level below a full query implementation (which was already underway), but if done right, the query implementation itself becomes less urgent and can delegate a lot to the new high-level API methods.

How to deal with resource changes?

Perhaps think about support for dealing with resource changes that possibly break existing Cursors.

timestamp?
checksum?
notification via pub/sub (resource notifies stam)?
stam stores initially selected text and validates if that changed?
rely on persistent identifiers?
....
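
One of the options above, storing a checksum of the initially selected text and validating it later, could look roughly like this. The `StoredSelection` type and helper names are hypothetical, and SHA-256 is just one possible choice of checksum.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StoredSelection:
    begin: int
    end: int
    text_checksum: str  # checksum of the text selected at annotation time


def make_selection(resource: str, begin: int, end: int) -> StoredSelection:
    # Python string slicing counts Unicode code points.
    return StoredSelection(begin, end, sha256_hex(resource[begin:end]))


def is_still_valid(resource: str, selection: StoredSelection) -> bool:
    """Check that the offsets still select the same text in the current resource."""
    current = resource[selection.begin:selection.end]
    return sha256_hex(current) == selection.text_checksum


original = "Hello world"
selection = make_selection(original, 6, 11)  # selects "world"
edited = "Hello, world"  # an inserted comma shifts all later offsets
```

This only detects that a selection broke; repairing the offsets after an edit is a separate problem.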

Support external annotations files to allow selective loading and avoid memory issues

We're working on PechaData, a multilingual Buddhist corpus project in collaboration with bdrc.io and pecha.org. As a format, Stam is a dream for our project, and we're starting to build our project on top of it with a mechanism to update annotation coordinates when the base text is updated.

However, our dataset includes many large texts (>10 MB .txt) with multiple annotation layers that are often larger than the text file itself, and we are concerned about performance if all annotations have to be loaded in memory even when we only need a couple of annotation sets. (For example, we have a file with 15 annotation sets including POS tags and dependencies, but we only need the text and the annotations for the table of contents.)

Have you considered externalizing annotations in separate files like the .ann files of BrAT or do you have another solution to load annotations selectively? We thought about patching Stam to find a solution but we would much prefer a solution coming from the creators.

Thanks a lot for your work!

why must a private identifier start with _?

The second is a private identifier, an internal numeric identifier (starting with an underscore)....

I know this is sometimes used as coding convention (i.e. in environments that do not support scoping), but I am not a fan.

Expand STAM Query language with the ability to ADD and DELETE items

Right now STAMQL is read-only. Add ADD and DELETE statements so that the query language can also manipulate data.

Note: an EDIT statement is less likely to be implemented due to the immutable nature of annotations in STAM.

the importance of having a coordinate system independent of what the source files offer

I was thinking about "the importance of having a coordinate system independent
of what the source files offer" which @dirkroorda mentioned the other day, and
which was also described in the Unlocking Digital Texts Position
Paper:

From these formats, it is definitely possible to introduce a “glyph-level” fragment addressing
scheme, comprising an offset from the start of the file. This effectively reduces all text
formats to plain-text by stripping away any additional tagging and non-textual components.

This is not an entirely trivial exercise, since some additional complexities around Unicode
normalisation rules and white-space handling will need to be dealt with, in order to ensure
that plain-text conversions are carried out in a consistent manner.

However, at this stage, it appears that it would be advantageous to also have a
higher level scheme that operates in a more “human-friendly” way, with word (or token)
granularity and some sense of semantic structure at a level similar to Markdown or a
light-TEI schema.

How does this relate to STAM? We have higher-order annotations that allow modelling higher-level schemes.
We can annotate a sentence and then annotate a word in that sentence using relative offsets:

The offsets are still expressed in Unicode code points, but they are no longer
relative to the resource as a whole; they are relative to the annotation that
is being pointed at (the sentence in this case).
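
A minimal sketch of how such a relative offset resolves to an absolute one. The `Annotation` type and `resolve_relative` function are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    begin: int  # absolute offsets into the resource, in Unicode code points
    end: int


def resolve_relative(parent: Annotation, rel_begin: int, rel_end: int) -> Tuple[int, int]:
    """Translate offsets relative to a parent annotation into absolute offsets."""
    if not (0 <= rel_begin <= rel_end <= parent.end - parent.begin):
        raise ValueError("relative offset falls outside the parent annotation")
    return parent.begin + rel_begin, parent.begin + rel_end


resource = "Intro. The quick brown fox jumps. Outro."
sentence = Annotation(7, 33)  # "The quick brown fox jumps."
# The word "quick" sits at offsets 4 to 9 within the sentence.
word_begin, word_end = resolve_relative(sentence, 4, 9)
```

If the sentence annotation moves, only its own offsets change; the word's relative offsets stay the same.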

The recent proposal for the STAM Baseoffset
extension is also relevant here
because it allows us to use a start/base offset that deviates from the actual
text (a simple decoupling from the actual coordinate system, though the units
are still the same).

Next we also have our CompositeSelector (and MultiSelector) that would let us
model things the other way round: we can have the sentence be the higher-order
annotation and have it point to annotations that are words, and those in turn
point to the resource (using offsets).

At this point a question arises about something we can't model in STAM yet. Our
offsets are always Unicode code points (as that's our most atomic unit). If you want
to address things at a higher level as described in the previous paragraph,
then that requires explicitly enumerating all the targets in a
CompositeSelector/MultiSelector/DirectionalSelector. But what if we want to use
offsets in another coordinate system here? Say a selector that selects the
second up to the ninth word? Do we want a selector that can express this
whilst automatically interpolating the points in between?
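
To make the question concrete, here is a sketch of what such a selector would have to compute: a range in word coordinates is mapped back to a single character range. It assumes naive whitespace tokenisation, and `word_spans` and `select_words` are hypothetical names, not part of STAM.

```python
import re
from typing import List, Tuple


def word_spans(text: str) -> List[Tuple[int, int]]:
    """Character offsets of each whitespace-delimited word (naive tokenisation)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def select_words(text: str, first: int, last: int) -> Tuple[int, int]:
    """Map an inclusive, zero-based word range to one character range."""
    spans = word_spans(text)
    if not (0 <= first <= last < len(spans)):
        raise IndexError("word range out of bounds")
    # Everything between the first and last word is interpolated implicitly.
    return spans[first][0], spans[last][1]


text = "one two three four five six seven eight nine ten"
begin, end = select_words(text, 1, 8)  # second up to and including the ninth word
```

In a real model the word boundaries would come from existing word annotations rather than from a regular expression.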

Adding something like that should be possible and adds more flexibility to how
people can use STAM for modelling, but it comes at the cost of adding further
complexity to STAM. So probably it should be an extension.

Eventually we could even go as far as have a universal Selector that points to
something (resource/annotation) that is the result of a whole query. That might
subsume the above use-case as well, but would rely on several extensions (most
notably the query system which will be upcoming anyway) that are not trivial.

Text-Fabric and FoLiA both rely at their core on a coordinate system
that is more detached from the text; in both, a text is merely an annotation like anything else.

The situation in STAM is a bit different, almost everything is an annotation
but the text itself is the primary thing an annotation points to (a slice
thereof), either directly or indirectly. I do think that's the proper method
for a standoff text annotation model.

Last, a word about complex selectors like those in the Web Annotation model
which can reference XPath and other complex file formats. I do consider these
explicitly out of scope for STAM. We want to untangle text and annotations
completely, so text is in its barest form (plain text, UTF-8) and all
annotations reference that, rather than some hybrid.

I just wanted to throw all this out here to voice and hear some thoughts and,
if needed, have some discussion. I'm especially interested in what @dirkroorda
thinks.

Disallow nesting complex selectors

In the current specification complex selections (multi selectors, composite
selectors and directional selectors) can be nested at will, including multiple
nested layers. This allows the user to build a whole tree of selectors but
creates the problem that the semantic interpretation of such a tree is not
clearly defined.

I want to prevent this issue from arising by simply forbidding nesting of
complex selectors. If users want to build a tree-like structure then the proper
way to do so in STAM is to create annotations that refer to other
annotations, not through selectors (annotations carry labels, ids, etc.;
selectors do not). This fits better with our 'everything is an annotation'
principle. It also simplifies implementing selectors.
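
A sketch of how an implementation might enforce this rule when a selector is built. The classes below are simplified, hypothetical stand-ins; `CompositeSelector` represents all three complex selector types here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextSelector:
    resource: str
    begin: int
    end: int


@dataclass
class AnnotationSelector:
    annotation: str  # ID of the annotation being pointed at


@dataclass
class CompositeSelector:
    subselectors: List["Selector"] = field(default_factory=list)


Selector = Union[TextSelector, AnnotationSelector, CompositeSelector]


def validate(selector: Selector) -> None:
    """Reject a complex selector that directly contains another complex selector."""
    if isinstance(selector, CompositeSelector):
        for sub in selector.subselectors:
            if isinstance(sub, CompositeSelector):
                raise ValueError("complex selectors may not be nested")
```

A tree-like structure would instead use an `AnnotationSelector` as a subselector, pointing at an annotation that has its own complex selector.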

Initial STAM presentation

The time has come for some outreach. A presentation is planned for an internal CLARIAH WP3 meeting, which serves as a nice test bed. I hope to later record a STAM presentation and disseminate it as a video.

The slides can be found at https://github.com/annotation/stam/tree/master/docs/presentation

Annotate existing xml resources?

STAM might be very useful for existing XML resources, of which there are many. This could be left to extenders, or of course not be considered at all, leaving STAM purely text-based.

Instead of converting XML to text first (which I expect must often be tailor-made, with whitespace handling and tags to convert to text) and using that as the basis for annotation, you could consider using XPath for pointing. Perhaps analogous to Cursor, using XpathBegin and XpathEnd.

At the moment, XML extenders of STAM must provide their own model and implementation for this part (except datasetselector):

I think making Cursor abstract simplifies adding support for xml/xpath (and more?).