
stam-rust's Introduction

stam logo

Crate Docs GitHub build GitHub release Project Status: Active – The project has reached a stable, usable state and is being actively developed. Technology Readiness Level 7/9 - Release Candidate - Technology ready enough and in initial use by end-users in intended scholarly environments. Further validation in progress.

STAM Library

STAM is a standalone data model for stand-off text annotation. This is a software library to work with the model from Rust, and is the primary library/reference implementation for STAM. It aims to implement the full model as per the STAM specification and most of the extensions.

What can you do with this library?

  • Keep, build and manipulate an efficient in-memory store of texts and annotations on texts
  • Search in annotations, data and text, either programmatically or via the STAM Query Language:
    • Search annotations by data, textual content, or relations between text fragments (overlap, embedding, adjacency, etc.).
    • Search in text (incl. via regular expressions) and find annotations targeting found text selections.
    • Search in data (set, key, value) and find annotations that use the data.
    • Perform elementary text operations with regard for text offsets (splitting text on a delimiter, stripping text).
    • Convert between different kinds of offsets (absolute, relative to other structures, UTF-8 bytes vs unicode codepoints, etc.).
  • Read and write resources and annotations from/to STAM JSON, STAM CSV, or an optimised binary (CBOR) representation

The underlying STAM model aims to be clear and simple. It is flexible and does not commit to any vocabulary or annotation paradigm other than stand-off annotation.

This STAM library is intended as a foundation upon which further applications can be built that deal with stand-off annotations on text. We implement all the low-level logic of dealing with stand-off annotation so you no longer have to, and you can focus on your actual application. The library is written with performance in mind.

Installation

Add stam to your project's Cargo.toml:

$ cargo add stam

Usage

Import the library

use stam;

Or, if you prefer to do without the namespace prefix:

use stam::*;

Loading a STAM JSON file containing an annotation store:

fn your_function() -> Result<(),stam::StamError> {
    let store = stam::AnnotationStore::from_file("example.stam.json", stam::Config::default())?;
    ...
}

All examples in this section assume they are inside some function returning Result<_, stam::StamError>.

The annotation store is your workspace: it holds all resources, annotation sets (i.e. keys and annotation data) and of course the actual annotations. It is a memory-based store; you can load as much into it as you like (as long as it fits in memory).

When instantiating an annotation store, you can pass a configuration (stam::Config) that specifies various parameters, such as which indices to generate. Use the various with_*() methods (a builder pattern) to set the configuration options.
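
A minimal sketch of this (the with_debug() option shown here is purely illustrative; consult the API reference for the actual with_*() options that are available):

// Build a configuration with the builder pattern and pass it to a new store.
// with_debug() is an assumed example option; see the API reference for real ones.
let config = stam::Config::default().with_debug(true);
let store = stam::AnnotationStore::new(config);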

Retrieving items

You can retrieve items via methods named after the desired return type:

let annotation =  store.annotation("my-annotation").or_fail()?;
let resource = store.resource("my-resource").or_fail()?;
let annotationset = store.annotationset("my-annotationset").or_fail()?;
let key = annotationset.key("my-key").or_fail()?;
let data = annotationset.annotationdata("my-data").or_fail()?;

All of these methods return an Option<ResultItem<T>>, where T is a type in the STAM model like Annotation, TextResource, AnnotationDataSet, DataKey or TextSelection. The or_fail() method transforms the Option<ResultItem<T>> into a Result<ResultItem<T>, StamError>, and the ? then either unwraps it safely into a ResultItem<T> or propagates the error.

The ResultItem<T> type holds a reference to T with a lifetime equal to the store, and it also holds a reference to the store itself. You can call as_ref() on any ResultItem<T> instance to obtain a direct reference (again with a lifetime equal to the store); this exposes a lower-level API. ResultItem<T> itself always exposes the high-level API, which is what you want in most cases.
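
For example, a minimal sketch of dropping down to the lower-level API (continuing from the retrieval example above; identifiers are illustrative):

// as_ref() yields a plain &stam::Annotation whose lifetime equals the store's,
// bypassing the high-level API of ResultItem<Annotation>.
let annotation = store.annotation("my-annotation").or_fail()?;
let lowlevel: &stam::Annotation = annotation.as_ref();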

The wrapping of TextSelection is a bit special: instead of ResultItem<TextSelection>, we typically use a more specialised type, ResultTextSelection.

Adding items

Add a resource to an existing store:

let resource_handle = store.add( stam::TextResource::from_file("my-text.txt", store.config()) )?;

A similar pattern works for AnnotationDataSet:

let annotationset_handle = store.add( stam::AnnotationDataSet::from_file("myset.json", store.config()) )?;

The add() method adds items directly, which means they must already have been constructed. Many STAM data structures, however, have an associated builder type and are not instantiated directly. We use annotate() rather than add() to add annotations to an existing store:

let annotation_handle = store.annotate( stam::AnnotationBuilder::new()
           .with_target( stam::SelectorBuilder::TextSelector("testres", stam::Offset::simple(6,11))) 
           .with_data("testdataset", "pos", "noun") 
)?;

Here we see a builder type that follows the builder pattern to construct instances of its associated type. The actual instances are built by the underlying store.

Structures like AnnotationDataSet and TextResource also have builders; you can use them with add() by invoking the build() method on the builder to produce the final type:

let annotationset_handle = store.add(
                   stam::AnnotationDataSetBuilder::new()
                       .with_id("testdataset")
                       .with_data_with_id("pos", "noun", "D1")
                       .build()?)?;

Let's now create a store and annotations from scratch, with an explicitly filled AnnotationDataSet:

let store = stam::AnnotationStore::new(stam::Config::default())
    .with_id("test")
    .add( stam::TextResource::from_string("testres", "Hello world"))?
    .add( stam::AnnotationDataSet::new().with_id("testdataset")
           .add( stam::DataKey::new("pos"))?
           .with_data_with_id("pos", "noun", "D1")?
    )?
    .with_annotation( stam::Annotation::builder() 
            .with_id("A1")
            .with_target( stam::SelectorBuilder::textselector("testres", stam::Offset::simple(6,11))) 
            .with_existing_data("testdataset", "D1") )?;

And here is the very same thing, but with the AnnotationDataSet filled implicitly:

let store = stam::AnnotationStore::default().with_id("test")
    .add( stam::TextResource::from_string("testres".to_string(),"Hello world"))?
    .add( stam::AnnotationDataSet::new().with_id("testdataset"))?
    .with_annotation( stam::AnnotationBuilder::new()
            .with_id("A1")
            .with_target( stam::SelectorBuilder::textselector("testres", stam::Offset::simple(6,11))) 
            .with_data_with_id("testdataset","pos","noun","D1")
    )?;

The implementation will reuse any already existing AnnotationData where possible, as not duplicating data is one of the core characteristics of the STAM model.

There is also an AnnotationStoreBuilder, which implements the builder pattern for the annotation store as a whole.

Serialisation to file

You can serialise the entire annotation store (including all sets and annotations) to a STAM JSON file:

store.to_file("example.stam.json")?;

Or to a STAM CSV file (this will actually create separate derived CSV files for sets and annotations):

store.to_file("example.stam.csv")?;

Iterators & Searching

Iterating over all annotations in the store and outputting a simple tab-separated format with, per annotation, its data and the text it covers:

for annotation in store.annotations() {
    let id = annotation.id().unwrap_or("");
    // get the text to which this annotation refers (if any)
    let text: Vec<&str> = annotation.text().collect();
    for data in annotation.data() {
        println!("{}\t{}\t{}\t{}", id, data.key().id().unwrap_or(""), data.value(), text.join(" "));
    }
}

Here is an overview of the most important methods that return an iterator; these iterators all yield ResultItem<T> instances (or ResultTextSelection). The overview is divided into two tables: the first lists simple methods that follow STAM's ownership model, the second lists methods that leverage the various reverse indices that are computed:

| Method                        | T                 | Description                            |
| ----------------------------- | ----------------- | -------------------------------------- |
| AnnotationStore.annotations() | Annotation        | all annotations in the store           |
| AnnotationStore.resources()   | TextResource      | all resources in the store             |
| AnnotationStore.datasets()    | AnnotationDataSet | all annotation sets in the store       |
| AnnotationDataSet.keys()      | DataKey           | all keys in the set                    |
| AnnotationDataSet.data()      | AnnotationData    | all data in the set                    |
| Annotation.data()             | AnnotationData    | the data pertaining to the annotation  |

| Method                                      | T              | Description                                                                 |
| ------------------------------------------- | -------------- | --------------------------------------------------------------------------- |
| TextResource.textselections()               | TextSelection  | all known text selections in the resource (1)                                |
| TextResource.annotations()                  | Annotation     | annotations referencing this text via a TextSelector or AnnotationSelector   |
| TextResource.annotations_as_metadata()      | Annotation     | annotations referencing the resource via a ResourceSelector                  |
| AnnotationDataSet.annotations()             | Annotation     | all annotations making use of this set                                       |
| AnnotationDataSet.annotations_as_metadata() | Annotation     | annotations referencing the set via a DataSetSelector                        |
| Annotation.annotations()                    | Annotation     | annotations that reference the current one via an AnnotationSelector         |
| Annotation.annotations_in_targets()         | Annotation     | annotations referenced by the current one via an AnnotationSelector          |
| Annotation.textselections()                 | TextSelection  | targeted text selections (via TextSelector or AnnotationSelector)            |
| AnnotationData.annotations()                | Annotation     | all annotations that use this data                                           |
| DataKey.data()                              | AnnotationData | all annotation data that uses this key                                       |
| TextSelection.annotations()                 | Annotation     | all annotations that target this text selection                              |

Notes:

  • (1) With known text selections, we refer to portions of the texts that have been referenced by an annotation.
  • Most of the methods in the second table are implemented only for ResultItem<T>, not for &T.
  • This library consistently uses iterators and therefore lazy evaluation. This is more efficient and less memory intensive because you don't need to wait for all results to be collected (and heap allocated) before you can do computation.

The main named iterator traits in STAM are:

| Iterator trait        | T              | Methods that produce the iterator         |
| --------------------- | -------------- | ----------------------------------------- |
| AnnotationIterator    | Annotation     | annotations() / annotations_in_targets()  |
| DataIterator          | AnnotationData | data() / find_data()                      |
| TextSelectionIterator | TextSelection  | textselections() / related_text()         |
| ResourcesIterator     | TextResource   | resources()                               |
| KeyIterator           | DataKey        | keys()                                    |

The iterators expose an API allowing various transformations and filter operations: you can typically transform one type of iterator into another using the methods in the third column. Similarly, you can obtain an iterator from ResultItem instances through equally named methods.

All of these iterators have an owned collection counterpart (Handles<T>) that holds an entire collection in memory; the items are held by reference to a store, so the space overhead is limited. You can go from the former to the latter with .to_handles() and from the latter to the former with .items().

| Iterator trait        | Collection     |
| --------------------- | -------------- |
| AnnotationIterator    | Annotations    |
| DataIterator          | Data           |
| ResourcesIterator     | Resources      |
| TextSelectionsIter    | TextSelections |
| KeyIterator           | Keys           |

The iterators can be extended with filters; these are applied in a builder pattern and return an iterator that still implements the same trait, but with the filter applied:

| Filter method                                      | Description                                                      |
| -------------------------------------------------- | ---------------------------------------------------------------- |
| filter_annotation(&ResultItem<Annotation>)         | Filters on a single annotation                                   |
| filter_annotations(Annotations)                    | Filters on multiple annotations                                  |
| filter_annotationdata(&ResultItem<AnnotationData>) | Filters on a single data item                                    |
| filter_data(Data)                                  | Filters on multiple data items                                   |
| filter_key(&ResultItem<DataKey>)                   | Filters on a data key                                            |
| filter_value(value)                                | Filters on a data value; the parameter can be of various types   |

All these iterators are lazy, that is to say, they don't do anything until consumed. Once they are being iterated over, internal buffers may be allocated.

When you are not interested in the actual items but merely want to test whether there are results at all, then use the test() method that is available on these iterators.

For improved performance, you can add .parallel() to an iterator; any subsequent iterator methods (generic ones like map() and filter(), not the STAM-specific ones) will then run in parallel over multiple cores.
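
For example, a minimal sketch (the predicate is purely illustrative):

// Count annotations that carry a public ID; the generic filter() and count()
// calls after .parallel() are distributed over multiple cores.
let count = store.annotations()
    .parallel()
    .filter(|annotation| annotation.id().is_some())
    .count();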

Examples

Example: retrieving all annotations that have part-of-speech noun (fictitious model):

let dataset = store.dataset("linguistic-features").or_fail()?;
let key = dataset.key("part-of-speech").or_fail()?;
let annotationsiter = key.data().filter_value("noun".into()).annotations();

Alternatively, this can also be done as follows, following a slightly different path to get to the same results. Sometimes one version is more performant than the other, depending on how your data is modelled:

let annotationsiter = key.annotations().filter_value("noun".into());

Example testing whether a word is annotated with part-of-speech noun (fictitious model):

let dataset = store.dataset("linguistic-features").or_fail()?;
let key = dataset.key("part-of-speech").or_fail()?;
if word.annotations().filter_key(&key).filter_value("noun".into()).test() {
   ...    
}

Searching data

The above methods already allow you to find data, but the find_data() method on AnnotationStore and AnnotationDataSet provides a shortcut to quickly get data instances (via a DataIter).

Example:

let data = store.find_data("linguistic-features", "part-of-speech", "noun".into()).next();

Here, and in earlier examples, we used the into() method to coerce a &str into a DataOperator::Equals(&str). Other data operators are available as well, allowing for various types and various kinds of comparison (equality, inequality, greater than, less than, logical and/or, etc.).
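
For example, the call above is equivalent to spelling out the operator explicitly (only the Equals variant is shown here; consult the API reference for the other operator variants):

// Same query as above, with the DataOperator passed explicitly rather than via into().
let data = store.find_data("linguistic-features", "part-of-speech",
                           stam::DataOperator::Equals("noun")).next();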

Searching text

The following methods are available to search for text; they return iterators producing ResultItem<T> items.

| Method                          | T             | Description                                                        |
| ------------------------------- | ------------- | ------------------------------------------------------------------ |
| TextResource.find_text()        | TextSelection | Finds a particular substring in the resource's text.               |
| TextSelection.find_text()       | TextSelection | Finds a particular substring within the specified text selection.  |
| TextResource.find_text_regex()  | TextSelection | Idem, but with more powerful regular-expression-based search.      |
| TextSelection.find_text_regex() | TextSelection | Idem, but with more powerful regular-expression-based search.      |

Searching related text

The related_text() method allows for finding text selections that are in a certain relation with the current one(s). It takes a TextSelectionOperator as a parameter, which comes in various variants:

  • Equals - Both sets cover the exact same TextSelections, and all are covered (cf. textfabric's ==), commutative, transitive
  • Overlaps - Each TextSelection in A overlaps with a TextSelection in B (cf. textfabric's &&), commutative
  • Embeds - All TextSelections in B are embedded by a TextSelection in A (cf. textfabric's [[)
  • Embedded - All TextSelections in A are embedded by a TextSelection in B (cf. textfabric's ]])
  • Before - Each TextSelection in A comes before a TextSelection in B (cf. textfabric's <<)
  • After - Each TextSelection in A comes after a TextSelection in B (cf. textfabric's >>)
  • Precedes - Each TextSelection in A precedes B; it ends where at least one TextSelection in B begins.
  • Succeeds - Each TextSelection in A succeeds B; it begins where at least one TextSelection in B ends.
  • SameBegin - Each TextSelection in A starts where a TextSelection in B starts
  • SameEnd - Each TextSelection in A ends where a TextSelection in B ends

The variants are typically constructed via a helper function on TextSelectionOperator (simply the name of the variant in lowercase), e.g. TextSelectionOperator::equals().

Example: select all words in a sentence (sentence may be either an Annotation or a TextSelection in this case):

let dataset = store.dataset("structure-type").or_fail()?;
let key_word = dataset.key("word").or_fail()?;
for word in sentence.related_text(stam::TextSelectionOperator::embeds()).annotations().filter_key(&key_word) {
    ...
}

Querying

Rather than searching programmatically, you can also express queries via the STAM Query Language (STAMQL). Do note that this incurs a performance penalty due to extra overhead:

let query: Query = "SELECT ANNOTATION ?a WHERE DATA myset type = phrase;".try_into()?;
let iter = store.query(query);
let names = iter.names();
for results in iter {
    if let Ok(result) = results.get_by_name(&names, "a") {
       if let QueryResultItem::Annotation(annotation) = result {
          ...
        }
    }
}

API Reference Documentation

Please consult the API reference documentation for in-depth explanation on all structures, traits and methods, along with some examples.

Extensions

This library implements the following STAM extensions:

  • STAM-CSV - Defines an alternative serialisation format using CSV.
  • STAM-Query - Defines the STAM Query Language.
  • STAM-Transpose - Defines linking identical textual parts across resources.
  • STAM-Textvalidation - Defines a mechanism to ensure annotation targets can be checked for changes.

Python binding

This library comes with a binding for Python, see here.

Acknowledgements

This work is conducted at the KNAW Humanities Cluster's Digital Infrastructure department, and funded by the CLARIAH project (CLARIAH-PLUS, NWO grant 184.034.023) as part of the FAIR Annotations track of the Shared Development Roadmap.

stam-rust's People

Contributors

hayco, proycon


stam-rust's Issues

Implement full text index

Searching for text currently iterates through the whole text. To make quicker lookups possible, a full text index could be implemented (e.g. using suffix arrays), at the cost of (significant) extra space. This would be an opt-in feature, via a Config parameter.
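
To illustrate the idea, here is a from-scratch sketch (not a proposed implementation for this library; a real index would use a proper suffix-array construction algorithm and sit behind the Config flag mentioned above):

// Build a naive suffix array: one entry per character position, sorted lexicographically.
fn build_suffix_array(text: &str) -> Vec<usize> {
    let mut sa: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    sa.sort_by_key(|&i| &text[i..]); // naive sort; fine for a sketch, too slow for large texts
    sa
}

// All suffixes starting with `query` form one contiguous run in the sorted array.
fn find_all(text: &str, sa: &[usize], query: &str) -> Vec<usize> {
    let start = sa.partition_point(|&i| &text[i..] < query);
    sa[start..]
        .iter()
        .copied()
        .take_while(|&i| text[i..].starts_with(query))
        .collect()
}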

Implement query language

Implement a higher-level query language, effectively parsing and translating high-level queries into calls to lower-level search methods. One major challenge is to determine which parts of a query to execute before others, in such a way that the search space is kept as small as possible (= quickest search results and best performance).

Depends on:

This will take significant time to implement (wild guess: 150 hours).

Implement python binding

In order to reach a wider audience of researchers and developers, a Python binding needs to be implemented.
This combines the performance of the Rust code with the accessibility of Python.

Implement deletion from stores

The current implementation does not support deletion yet.

Implement deletion (very easy), but also implement a mechanism to add
subsequent new items at places that have been freed (rather than at the end,
which keeps increasing the store size).
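
As an illustration of the free-slot idea (a sketch only, not the store's actual internal representation):

// Removed slots are remembered and reused before the underlying Vec is grown.
struct Slots<T> {
    items: Vec<Option<T>>,
    free: Vec<usize>, // indices of previously freed slots
}

impl<T> Slots<T> {
    fn new() -> Self {
        Slots { items: Vec::new(), free: Vec::new() }
    }

    fn insert(&mut self, item: T) -> usize {
        if let Some(index) = self.free.pop() {
            self.items[index] = Some(item); // reuse a freed slot
            index
        } else {
            self.items.push(Some(item)); // grow only when no free slot exists
            self.items.len() - 1
        }
    }

    fn remove(&mut self, index: usize) -> Option<T> {
        let item = self.items.get_mut(index)?.take();
        if item.is_some() {
            self.free.push(index); // remember the slot for reuse
        }
        item
    }
}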

Implement binary (de)serialisation

A binary serialisation format should be implemented that facilitates more
high-performance deserialisation/serialisation.

The format would be implementation-specific and not really intended for
information exchange. Rather than implementing one from scratch, I just want to
use one that happens to have a good and performant implementation that works
with the serde library. Candidates are
MessagePack, or possibly BSON or
ProtoBuf.

Unlike the STAM JSON format, it should also hold the reverse indices, so the
indexing step can be skipped and deserialisation is as quick as we can get it.

Flexible ownership of text strings

Currently STAM forcibly takes memory ownership of texts (strings), both in TextResource and in DataValue. For more flexible use in situations where the caller wants to retain ownership, these Strings could be rewritten to Cow<'a,str>. This would be a fairly big refactoring, though, and it would make those structs, and the whole AnnotationStore, subject to a lifetime.

(not a priority, just an idea)

Extend API with methods to edit resource text

The API needs some methods that allow editing a resource's text
(insertion, deletion, substitution). Editing text has consequences for
potentially all annotations on it, as they refer to it by offset. When
text is edited, offsets of other annotations need to be adjusted automatically. As
this is a fairly expensive operation, I want to implement the ability to
commit a batch of edits at once, so the computation can be done more efficiently.

When text changes and annotations are updated accordingly, the user should have the options of:

  1. Just reusing existing resource and annotation IDs: annotations are forcibly edited. This is fine as long as the resource hasn't been published yet, otherwise discouraged.
  2. Create new resources and new annotations with new IDs (support a version component in the ID): strictly adheres to the idea of annotations being immutable.
    • The old ones may be either retained or deleted
    • If retained, there may be an extra annotation linking the old ones to the new ones

Pass user parameters to AnnotationStore

Users may want to pass parameters to the annotation store to configure what indices they want to build (by default all are built) and set some other run-time parameters.

Implement gzip support

JSON files are pretty verbose and can get huge. Gzipping them is an easy way to
compress them at a good rate (I have a 71MB STAM JSON file which compresses
to 3.2MB when gzipped, 2.8MB with bzip2).

Implement support in the library for reading and writing gzipped JSON files,
simply by detecting the .json.gz extension.
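
A sketch of how that detection could look (hypothetical, not part of the current API; it assumes the flate2 crate for gzip decompression):

use std::fs::File;
use std::io::{BufReader, Read};
use flate2::read::GzDecoder;

// Pick a plain or gzip-decoding reader based on the file extension,
// before handing the stream to the JSON (de)serialiser.
fn open_possibly_gzipped(path: &str) -> std::io::Result<Box<dyn Read>> {
    let file = BufReader::new(File::open(path)?);
    if path.ends_with(".json.gz") {
        Ok(Box::new(GzDecoder::new(file)))
    } else {
        Ok(Box::new(file))
    }
}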

Test and improve performance

Good performance is one of the design goals of this STAM implementation. I
consider this to include both efficient run-time execution (CPU time) and resource
consumption (memory), and we often find a trade-off between the two.

The library implements a fair number of benchmarks (run cargo bench) to quantify this.
Most insightful, however, are comparisons with other systems. The comparison with TextFabric is the
most notable and most informative here, especially with regard to searching/querying.

Performance is constrained by the way the STAM model itself is designed. Its
aim to be flexible in supporting multiple annotation paradigms, and possibly
to act as a pivot model between them, implies that the implementation cannot
always be optimised as much as others. I'm again comparing with TextFabric here,
where for instance certain queries may map more directly onto internal
structures.

Further tests and possible refactoring rounds are needed to ensure performance is the best we can do.

Implement datavalue index

Build an index for data keys with lots of values, so they can be retrieved more efficiently.
This could be done automatically if there are lots of values, or by setting an index flag on DataKey (this would be a STAM extension).

Implement parameters for TextSelectionOperators

Implement parameters for TextSelectionOperators:

  • for Precedes and Succeeds I want to introduce skip_space and skip_punct parameters (an int value representing distance in unicode points) that allow two text selections to count as adjacent even if there is a gap consisting of spacing/punctuation. Useful for, for example, words.
  • for Embedded, Overlaps and most others, I want to introduce a limit_distance parameter that limits the distance considered when finding a match (an int value representing distance in unicode points).

Implement methods for common data/paradigm transformations

Implement some API methods for common data/paradigm transformations such as:

  • Merging annotations with the same data into a single annotation with a multiselector
  • Splitting annotations with a multiselector into multiple separate annotations

Improve unicode point to utf-8 offset conversion

Offsets and cursors in STAM are specified in unicode points. String slices
internally, however, use UTF-8 byte offsets. This requires a conversion, which
is implemented in resolve_cursor() in resources.rs
(https://github.com/annotation/stam-rust/blob/master/src/resources.rs#L188).

The initial implementation is not efficient enough, as it scans through the
entire text (O(n)) to compute the byte offset, which becomes expensive as texts
grow, and large texts are deliberately in scope for STAM.

A possible improved implementation is to precompute and store so-called milestones spread
over a regular distance interval. Alternatively, even ALL conversions could be stored (interval 1),
but this may have an undesirably large memory impact.
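
To illustrate the milestone approach (a sketch only, not the library's implementation):

// Precompute a byte offset for every STEP-th character ("milestones"), then
// resolve a unicode-point cursor by jumping to the nearest preceding milestone
// and scanning at most STEP characters from there.
const STEP: usize = 100;

fn build_milestones(text: &str) -> Vec<usize> {
    text.char_indices()
        .enumerate()
        .filter(|(charpos, _)| charpos % STEP == 0)
        .map(|(_, (bytepos, _))| bytepos)
        .collect()
}

fn charpos_to_bytepos(text: &str, milestones: &[usize], charpos: usize) -> Option<usize> {
    let start_byte = *milestones.get(charpos / STEP)?;
    text[start_byte..]
        .char_indices()
        .nth(charpos % STEP)
        .map(|(offset, _)| start_byte + offset)
}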
