korpling / graphannis

This is a new backend implementation of the ANNIS linguistic search and visualization system.

Home Page: http://corpus-tools.org/annis/

License: Apache License 2.0

Shell 0.13% Python 0.25% Rust 60.16% C 2.06% HTML 37.29% Handlebars 0.10%

linguistic-corpora linguistic-analysis search-engine hacktoberfest

graphannis's Introduction

graphANNIS

This is a new backend implementation of the ANNIS linguistic search and visualization system (http://corpus-tools.org/annis/).

Only a subset of the ANNIS Query Language (AQL) from ANNIS version 3 (based on PostgreSQL) is supported so far. More operators can be added in the future; the missing ones are those that have been used less frequently. There is a tutorial in the Developer Guide on how to embed graphANNIS in your own application.

The basic design ideas and data models are described in detail in the PhD-thesis "ANNIS: A graph-based query system for deeply annotated text corpora". The thesis describes a prototype implementation in C++ and not Rust, but the design ideas are the same. Notable differences/enhancements compared to the thesis are:

Graph storages implement querying inverse edges and finding reachable nodes based on them: this makes it possible to implement inverse operators (e.g. for precedence) and to switch operands in situations where this was not possible before.
The data model has been simplified: the inverse coverage component and inverse edges in the left-/right-most token component have been removed.
Additional query language features are now supported.

Documentation

Developer Guide (including descriptions of the data model and tutorials for the API)
API documentation

Developing graphANNIS

You need to install Rust to compile the project. We recommend installing the following Cargo subcommands for developing graphANNIS:

cargo-release for creating releases
cargo-about for re-generating the third party license file
cargo-llvm-cov for determining the code coverage
cargo-dist for configuring the GitHub actions that create the release binaries.

Execute tests

You can run the tests with the default cargo test command. To calculate the code coverage, you can use cargo-llvm-cov:

cargo llvm-cov --open --all-features --ignore-filename-regex '(tests?\.rs)|(capi/.*)'

Performing a release

You need to have cargo-release installed to perform a release. Execute the following cargo command once to install it.

cargo install cargo-release

To perform a release, switch to the main branch and execute:

cargo release [LEVEL] --execute

The level should be patch, minor or major depending on the changes made in the release. Running the release command will also trigger a CI workflow to create release binaries on GitHub.

3rd party dependencies

This software depends on several 3rd party libraries. These are documented in the "third-party-licenses.html" file in this folder.

Language bindings

Java: https://github.com/korpling/graphANNIS-java
Python 3: https://github.com/korpling/graphANNIS-python
Rust (this repository)
C (this repository)

Author(s)

Thomas Krause

graphannis's Issues

graphAnnis commands for deletion and renaming of a corpus

I use the interactive command line, and I am wondering which commands allow deleting and renaming a corpus (I could not find them here: https://korpling.github.io/graphANNIS/docs/v3/cli.html).

export to graphml should trim visualizer configuration

Consider the following fragment of a resolver_vis_map.annis:

[...] htmldoc	edition	hidden	4	hide_tok:true;annos:[...]; config: edition

When this corpus is imported with graphANNIS and then exported to graphml, the following fragment is created:

[...]
[visualizers.mappings]
" config" = ' edition'
[...]

As a consequence, the visualizer does not work when re-importing the graphml corpus.
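A minimal sketch of the trimming this issue asks for, assuming the mapping column is a `;`-separated list of `key:value` pairs. The function name `parse_vis_mappings` is hypothetical and not part of the graphANNIS API; it only illustrates stripping whitespace around keys and values before they are written out.

```rust
use std::collections::BTreeMap;

/// Parse a visualizer mapping string such as "hide_tok:true; config: edition"
/// into key/value pairs with surrounding whitespace removed.
/// Hypothetical helper, not the actual graphANNIS import code.
fn parse_vis_mappings(raw: &str) -> BTreeMap<String, String> {
    let mut mappings = BTreeMap::new();
    for entry in raw.split(';') {
        // Split only at the first ':' so that values may contain colons.
        if let Some((key, value)) = entry.split_once(':') {
            let key = key.trim();
            // Skip entries whose key is empty after trimming.
            if !key.is_empty() {
                mappings.insert(key.to_string(), value.trim().to_string());
            }
        }
    }
    mappings
}
```

With this kind of normalization, the entry ` config: edition` would be stored under the key `config` with the value `edition`, so a re-import would find the configuration under the expected name.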

Outsource and cleanup the malloc_size crate to an external crate

relANNIS-Import: Subgraph query does not work if there is no coverage component.

Make all queries from the public benchmark set work with the Rust AQL parser

Problematic queries so far:

OR-queries do not handle the automatic index properly:
(pos=/(APPRART)|(APPR)/&pos=/(PPOSS)|(PPOSAT)|(ART)|(PDAT)|(WDAT)/ & pos=/ADJA/ & pos=/N./ & MA#tok & #1.#2 & #2.#3 & #3.#4 & #MA_=_#1)| (pos=/(APPRART)|(APPR)/ & pos=/ADJA/ & pos=/N./ & OA#norm & #6.#7 & #7.#8 & #OA_=_#6)
queries with unrestricted precedence .*: the old AQL restricted this to maximum 50, there should be a quirks mode for this
instructor_pos=/NN/ & (meta::l1="deu"|meta::l2="eng, fra, spa") --> This is actually a problem with OR in meta-data constraints of relANNIS: instead of using OR, it uses AND for all the meta-data fields. This behavior should be caught by the Quirks Mode
a#instructor_dipl = /äh.?/ & (#a . instructor_pos = /.*/ | #a . break)& (#a . instructor_pos = /.*/ | #a . break| #a . instructee_pos)
Meta-Queries with spaces between meta and ::, e.g. meta :: l1_1="cmn" & ZH1lemma="gehen" _=_ ZH1 & ZH1 _=_ ZH1lemma & #2 ->dep #3
Regular expressions with empty alternatives (||) are not accepted
Regular expressions with unrecognized escape sequences like ZH1=/\,/

Subgraph query does not work if there is no coverage component.

Versioning of serialized node annotation storages

Docs.rs does not build because "allocator_api" is not enabled on their rustc

Fix automatic creation of binaries using CI for releases

"any segmentation" for precedence and near operator

Currently the precedence/near operator has either no named argument (and thus is defined on the token precedence) or has the specific name of the segmentation chain. In cases where you search e.g. for "the" . "house" and there are segmentations in the corpus, the segmentations will also be searched for the annotation values "the" and "house". Unfortunately there is no "any segmentation" counterpart for the operator itself. My suggestion is to use a character that is not allowed as ID to mark this. In SQL there would be only a check that both segmentation names are equal.

My suggestions for the character are:

"the" .~ "house"
"the" .? "house"
"the" .+ "house"
"the" .@ "house"
"the" .= "house"

All of them have advantages and disadvantages: some have a semantically similar meaning in regular expressions (like "+"), some are used in AQL already, and some would be completely new and therefore possibly confusing. My current favourite is ".=" since it would express that both segmentations need to be the same (as a kind of binding).

@amir-zeldes, @CarolinOdebrecht Do you have any ideas what syntax would be the best?

Check codebase with the clippy tool

Problem with visualizations triggered by 'edge'

In ANNIS 3.x, specifying 'edge' as the triggering condition in resolver_vis_map renders the visualizer even if only one end point of the edge is in the search result. In 4.0.0-beta.2, it seems the visualizer is only rendered if both source and target of the edge are in context. This is problematic for visualizations like coref, in which both end points of the edge are often far.

To see the difference in behavior, compare results for entity in GUM for ANNIS 3 (show coref for all hits) vs. 4 (only if antecedent happens to be in context). The resolver entry is:

GUM	NULL	ref	edge	discourse	coreference (discourse)	hidden	7	NULL

Operator negation in AQL - part 1: negation with existence assumption

There are two types of operator negation that are desirable in AQL: with and without implied existence of one of the nodes. Negation with implied existence is probably easier and will be specified in this issue. Negation without existence could be accomplished in a second step and is described in issue #187.

The negation of operators is done by placing ! before the operator. Negated conditions do not participate in variable binding. Variables participating in the query must be bound by some other operator(s) in the query. Here is an example:

cat="NP" & cat="PP" & cat="ADVP" & #1 _i_ #2 & #1 _i_ #3 & #1 >* #2 & #1 !>* #3

This query is stating that an NP covers a PP and an ADVP, and it dominates the PP indirectly, but it does not dominate ADVP indirectly (there is no d edge from NP to ADVP, even though it is covered discontinuously). Note that all three nodes are already bound by the query without negation. The negation simply rules out some of the search results, since they violate the negative constraint.

Another example using precedence:

sent & "ne" & "pas" & #1 _i_ #2 & #1 _i_ #3 & #2 .* #3 & #2 !. #3

Finds cases of sentences containing "ne" and later on "pas", except if they are consecutive (not #2 . #3).
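The semantics described above (all nodes are bound by the positive part, and the negated operator only removes results) can be sketched as a plain filter over already computed matches. This is an illustration under assumptions, not the graphANNIS implementation: `apply_negated_operator` is a hypothetical name, and the operator is modelled as a simple set of node pairs for which it holds.

```rust
use std::collections::HashSet;

type NodeId = u64;

/// Keep only those matches for which the negated operator does NOT hold
/// between the nodes at positions `lhs` and `rhs` of each match.
/// `relation` lists all node pairs for which the (non-negated) operator holds.
fn apply_negated_operator(
    matches: Vec<Vec<NodeId>>,
    lhs: usize,
    rhs: usize,
    relation: &HashSet<(NodeId, NodeId)>,
) -> Vec<Vec<NodeId>> {
    matches
        .into_iter()
        // The negation never binds new nodes, it only rules out results.
        .filter(|m| !relation.contains(&(m[lhs], m[rhs])))
        .collect()
}
```

For the "ne" ... "pas" query, `matches` would hold the results of the positive part and `relation` the pairs in direct precedence, so only non-consecutive pairs survive the filter.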

Possible deadlocks on complex corpus loading/unloading patterns

The production service sometimes freezes and needs a hard restart. This issue is difficult to replicate, but in most instances

queries run on multiple corpora (like all the DDD subcorpora) at once, and
multiple users are involved.

We should add load tests that simulate this behavior and add more traces to catch a reproducing sequence. In addition, we might want to use a library like Moka to implement the corpus cache instead of our own implementation.

Nodes are not deleted from graph storages via the "applyUpdate" API

Optional nodes with negated operator do not work as first node in query

I first noticed this bug in ANNIS when trying to query corpora featuring dominance edges with structural nodes for their roots using the following query:

node? !> node

In ANNIS this ends in a timeout, which turns out to be mainly caused by a longer runtime of such a query. Doing the same on a smaller corpus in graphANNIS with no timeout halts, but fails to provide the node ids of the matches (instead a list of empty strings is provided, which I suspect matches the number of matches). I attach an example corpus, with which I generated the following output:

ptb> find node? !> node
13:10:10 [INFO] Total cache size is 0.53 MB / 2598.64 MB and loaded corpora are: import (0.02 MB), ptb (0.51 MB).
13:10:10 [INFO] Executed query in 10 ms

ptb> count node? !> node
13:10:14 [INFO] Executed query in 8 ms
result: 1 matches in 1 documents

Note the empty line after the find command's execution report.

Improve `find` performance for large corpora and `tok` query.

For large corpora like the Opera Graeca Adnotata (https://zenodo.org/records/8158675), executing a find query for tok can take around 10 seconds, even when the corpus is already loaded and fits into memory. This could be due to sorting issues, but it might also be that the check for tokens is too expensive.

New unary operator in AQL: level by namespace

Currently #1:root only finds nodes that are the root of a graph in all components. Often we want to know what the root of a subgraph is, e.g. just the dependency annotation, even though the same node is not a root if all context components are considered. Since we have information in the rank table on the depth level of each node in each component, it is proposed to allow level queries for any level (not just root), and limited to any specific namespace. The root operator should not be abolished despite this. Some examples:

node & #1:level=1  (any node at level 1 in any context component)
node & #1:level=1,3 (any node at level 1, 2 or 3 in any context component)
node & #1:dep:level=0 (any node which is the root of a context component in the namespace dep)
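The proposal relies on a depth level per node and component. As a rough sketch of what such a level filter computes, assuming a component is given as a plain list of (source, target) edges: the names `levels` and `nodes_at_level` are hypothetical, and the levels are derived here by breadth-first search instead of being read from a rank table.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type NodeId = u64;

/// Compute the depth level of every node in one component.
/// Roots (nodes without ingoing edges) get level 0.
fn levels(edges: &[(NodeId, NodeId)]) -> HashMap<NodeId, usize> {
    let targets: HashSet<NodeId> = edges.iter().map(|&(_, t)| t).collect();
    let mut children: HashMap<NodeId, Vec<NodeId>> = HashMap::new();
    for &(source, target) in edges {
        children.entry(source).or_default().push(target);
    }
    // Start the breadth-first search at all roots of the component.
    let mut queue: VecDeque<(NodeId, usize)> = edges
        .iter()
        .map(|&(source, _)| source)
        .filter(|source| !targets.contains(source))
        .map(|source| (source, 0))
        .collect();
    let mut result = HashMap::new();
    while let Some((node, level)) = queue.pop_front() {
        // The first visit has the minimal level, later visits are ignored.
        if result.contains_key(&node) {
            continue;
        }
        result.insert(node, level);
        if let Some(cs) = children.get(&node) {
            for &child in cs {
                queue.push_back((child, level + 1));
            }
        }
    }
    result
}

/// Return the sorted IDs of all nodes whose level lies in `min..=max`.
fn nodes_at_level(levels: &HashMap<NodeId, usize>, min: usize, max: usize) -> Vec<NodeId> {
    let mut nodes: Vec<NodeId> = levels
        .iter()
        .filter(|(_, level)| (min..=max).contains(*level))
        .map(|(node, _)| *node)
        .collect();
    nodes.sort_unstable();
    nodes
}
```

In this sketch, `level=0` restricted to the edges of one namespace corresponds to the root of that component only, which is the case the current root operator cannot express.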

Operator negation in AQL - part 3: complex non-existing structures

In addition to search for single non-existing operands for negated operators (described in #187), we might want to allow more complex structures to not exist.
E.g. in the query from #187, cat="NP" & cat="PP"? & #2 !> #1, it would additionally be valid to bind the PP to other non-existent elements (with ?) as follows:

cat="NP" & cat="PP"? & #2 !> #1 & cat="S"? & #3 > #2

In this case, all NPs (=positive, existing elements without ?) are recovered, then we consider whether there exists the following potentially non-existent structure: cat="S"? > cat="PP"? (both with question marks). If such a structure dominates an NP, it is not a desired result.

Finally, note that queries binding non-existing and existing elements are invalid:

invalid:

cat="NP" & cat="PP"? & #2 !> #1 & cat="S" & #3 > #2

Here we are apparently saying that an NP is possibly not dominated by a PP, but that the PP is definitely dominated by an S, which we also want to have. This is impossible to evaluate, since in cases where the PP doesn't exist the S is ill-defined.
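One possible reading of this rule is that a non-negated operator may only connect two nodes that are either both optional or both non-optional. A small validity check under that assumption could look as follows; the `Constraint` type and `is_valid` function are hypothetical and only illustrate the rule, they are not taken from the AQL parser.

```rust
/// One binary operator constraint between two query nodes (0-based indices).
struct Constraint {
    lhs: usize,
    rhs: usize,
    negated: bool,
}

/// Returns true if no non-negated operator binds an optional node
/// (marked with `?`) to a non-optional one.
/// `optional[i]` states whether query node `i` carries the `?` suffix.
fn is_valid(optional: &[bool], constraints: &[Constraint]) -> bool {
    constraints
        .iter()
        // Negated operators may connect optional and existing nodes.
        .all(|c| c.negated || optional[c.lhs] == optional[c.rhs])
}
```

Under this reading, the query with cat="S"? passes the check because #3 > #2 connects two optional nodes, while the invalid query fails because the same operator connects an existing S with an optional PP.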

get_all_components() returns all components with matching name if none with the same type exist

Filter not applied for negated annotation search

If a search for an annotation value is negated and it is on the RHS of a join, the incorrect filter is applied.

Document public API

Importing PCC 2.1 corpus hangs at "calculating statistics for component LeftToken/annis/"

Computing inherited coverage component recursively crashes for long dominance paths

graphANNIS/graphannis/src/annis/db/aql/model.rs

Line 124 in 6176428

indirectly_covered_token.extend(calculate_inherited_coverage_edges(

This results in stack overflow for very long dominance paths.
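A common way around this kind of overflow is to replace the recursion with an explicit stack on the heap. The following is a simplified sketch of that technique, not the actual fix in graphANNIS: `inherited_coverage` is a hypothetical function, and the graph is modelled as a plain map from a node to its dominated children.

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u64;

/// Collect all tokens reachable from `start` via dominance edges.
/// Uses an explicit stack, so the path length is limited by heap memory
/// and not by the call stack.
fn inherited_coverage(
    start: NodeId,
    dominance: &HashMap<NodeId, Vec<NodeId>>,
    tokens: &HashSet<NodeId>,
) -> HashSet<NodeId> {
    let mut covered = HashSet::new();
    let mut visited = HashSet::new();
    let mut stack = vec![start];
    while let Some(node) = stack.pop() {
        // Skip nodes that were already handled (also guards against cycles).
        if !visited.insert(node) {
            continue;
        }
        if tokens.contains(&node) {
            covered.insert(node);
        }
        if let Some(children) = dominance.get(&node) {
            stack.extend(children.iter().copied());
        }
    }
    covered
}
```

The result is the same set a recursive depth-first traversal would produce, but a dominance path of several hundred thousand nodes no longer consumes one stack frame per node.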

Travis configuration used wrong repository and could not deploy release binaries

REST service and CLI tool should remove db.lock on exit

I typically run the two as different users (the REST service as a special system user annis, the CLI tool as myself), but both need to act on the same data directory. However, when the db.lock file is owned by a different user, running either of the binaries fails with a permission denied error while creating the lock, and it has to be removed manually. So I'm thinking it might be a good idea for both binaries to remove the lock on exit?
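One way to get the requested behavior is a guard object that removes the lock file when it goes out of scope. The sketch below assumes the lock is a plain file named db.lock inside the data directory whose mere existence marks the lock; `DbLock` is a hypothetical type, and the real binaries may lock the file differently.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical RAII guard: creates the lock file and removes it on drop.
struct DbLock {
    path: PathBuf,
}

impl DbLock {
    /// Create `db.lock` in `data_dir`. Fails if the file already exists.
    fn acquire(data_dir: &Path) -> io::Result<DbLock> {
        let path = data_dir.join("db.lock");
        // `create_new` reports an error instead of reusing an existing file.
        OpenOptions::new().write(true).create_new(true).open(&path)?;
        Ok(DbLock { path })
    }
}

impl Drop for DbLock {
    fn drop(&mut self) {
        // Best effort: a failure to remove the file is ignored here.
        let _ = fs::remove_file(&self.path);
    }
}
```

Note that `Drop` only runs on a regular exit path or an unwinding panic. It does not run after `std::process::exit` or when the process is killed by a signal, so a service would still need a signal handler that lets the guard go out of scope.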

Use winapi crate instead of kernel32-sys in graphannis-core

kernel32-sys is no longer used with winapi 0.3: https://github.com/retep998/winapi-rs#should-i-still-use-those--sys-crates-such-as-kernel32-sys

Search context left/right dropdown menu: 0

Within the "Search Options" Interface of the most recent release (Beta6), the only options available for left/right search context are 1/2/5/10. Contrary to previous releases, 0 does not appear in the list of the dropdown menu. It is possible though to set the context to zero manually, even though it does not appear in the list, but it did take me some time to find that out, so it would be great if it could be added again to the dropdown list.

OS: Ubuntu 20.04
Java: OpenJDK 14.0.2

Docs: Suggest tweaking ulimit -s before relANNIS import

In spite of #205 and #206, I was still getting stack overflows on largish corpora, even though I had more than enough free RAM. In the end, a colleague tipped me off that ulimit -s might be set too low. And indeed, setting ulimit -s unlimited prior to running the import command did the trick. Maybe worth adding to the docs?
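Besides raising the limit in the shell, a program can also request a larger stack itself by running the deep computation on a dedicated thread. This is a different technique than the ulimit setting described above and is shown only as a sketch: `run_with_stack` and the recursive stand-in `depth` are hypothetical and not part of graphANNIS.

```rust
use std::thread;

/// Run `job` on a new thread with an explicitly sized stack and
/// return its result (or the panic payload if the job panicked).
fn run_with_stack<T, F>(stack_bytes: usize, job: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(job)
        .expect("failed to spawn thread")
        .join()
}

/// Deliberately recursive stand-in for a deep import routine.
fn depth(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + depth(n - 1)
    }
}
```

A call such as `run_with_stack(256 * 1024 * 1024, || depth(100_000))` gives the recursion a 256 MB stack regardless of the limit that applies to the main thread.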

Importing in parallel with CLI

Is it possible to use the graphANNIS CLI to import corpora in parallel?

Corpora accessible without login in ANNIS v4

Is it possible to have some (or all) corpora accessible without requiring users to log in in ANNIS v4?

Non-reflexive operator join on "any token search" leads to non-empty result

While using a match all regex value (tok=/.*/ _=_ tok=/.*/) gives the correct result, not specifying any value (e.g. tok _=_ tok) does not.

Implement regular expression search for edge annotations.

GraphStorage::find_connected shows no results (unlike CycleSafeDFS)

Consider the following code snippet:

let c = AnnotationComponent::new(AnnotationComponentType::Ordering, smartstring::alias::String::from(ANNIS_NS), smartstring::alias::String::from(order_name));
if let Some(ordering) = graph.get_graphstorage(&c) {                
	let start_node = ordering.source_nodes().find(|n| ordering.get_ingoing_edges(*n.as_ref().unwrap()).count() == 0).unwrap()?;                
	let mut nodes: Vec<u64> = Vec::new();
	for entry in ordering.find_connected(start_node, 1, Bound::Excluded(usize::MAX)) {
		nodes.push(entry?);
	}
	if nodes.is_empty() {
		let dfs = CycleSafeDFS::new(ordering.as_edgecontainer(), start_node, 1, usize::MAX);
		let more_nodes = dfs.collect_vec();
		panic!("No nodes for ordering `{}` could be retrieved. (dfs: {})", order_name, more_nodes.len());
	}
}

This results in the following:

thread '<unnamed>' panicked at 'No nodes for ordering `norm` could be retrieved. (dfs: 142)'

While the dfs gathers 142 nodes, which is the expected result, find_connected() finds 0.

Operator negation in AQL - part 2: negation without existence assumption

This enhancement extends #186, negation with existence assumption, and should probably be implemented after that issue is closed.

Often, we want a negative search to confirm that some element does not interact with our data in some way. This element doesn't have to exist: we just want to make sure it is not there (i.e. nothing, the empty set is fine as fulfilling the place of #n). This probably requires a "where not exists..." clause in the SQL generation.

It will be necessary to mark potentially non-existent elements in AQL. This will be done by suffixing a ? to the search expression. Here are some examples: suppose we want to find all NPs not dominated by a PP. It is not the case that a PP needs to be found and then checked for dominance of the NP. We simply want to go over everyone who dominates the NP (if at all) and make sure that none of them are PPs. We therefore explicitly state that we are not really searching for the PP. Note that the PP cannot be bound to the positive nodes in the query (those without ?) - otherwise its existence is implied and the solution in #186 applies. Here is the query:

cat="NP" & cat="PP"? & #2 !> #1

This finds all NPs, then checks whether a cat="PP" exists which dominates them. If so, they are removed from the result set.
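The "where not exists" idea can be sketched outside of SQL as well. The following illustration assumes node categories in a map and direct dominance as a set of (parent, child) pairs; `not_dominated_by` is a hypothetical function, not the graphANNIS query engine.

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u64;

/// Return all nodes with category `wanted` that are not directly dominated
/// by any node with category `excluded`. The excluded node does not have
/// to exist at all for a wanted node to be part of the result.
fn not_dominated_by(
    cats: &HashMap<NodeId, &str>,
    dominance: &HashSet<(NodeId, NodeId)>, // (parent, child) pairs
    wanted: &str,
    excluded: &str,
) -> Vec<NodeId> {
    let mut result: Vec<NodeId> = cats
        .iter()
        .filter(|(_, cat)| **cat == wanted)
        .map(|(node, _)| *node)
        // "not exists": drop the node if any parent has the excluded category.
        .filter(|node| {
            !dominance.iter().any(|(parent, child)| {
                child == node && cats.get(parent).copied() == Some(excluded)
            })
        })
        .collect();
    result.sort_unstable();
    result
}
```

Calling it with `wanted = "NP"` and `excluded = "PP"` mirrors the query above: an NP without any parent, or with parents of other categories only, stays in the result.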

In the context of this issue, only simple annotation searches are valid as operands of the negated operator. It is not possible to connect two optional elements with negated or non-negated operators. More complex structures would be introduced by #199.

Regex in anno names and namespaces?

I got a question from Jena about Regex in annotation names and namespaces. Not sure if we want to realize this, we thought about it once for Falko - the DB can probably handle it but the AQL grammar would need an adjustment. I could imagine this syntax:

/namesp.ce/:/an{2}o/="value"

So the slashes tell the parser that the anno name or namespace is a regular expression. Let's discuss this before deciding if we should implement this request.

Imported from Launchpad using lp2gh.

date created: 2011-10-28T08:10:00Z
owner: amir-zeldes
the launchpad url was https://bugs.launchpad.net/bugs/882966

^ operator does not work with layer

Describe the bug

Queries with .layer work but with ^layer do not.

The query Gloss=/GLEICH.*/ .Gloss Gloss works as expected.

The query Gloss=/GLEICH.*/ ^Gloss Gloss does not return any matches.

Desktop:

Operating System: iOS
Browser: chrome 101.0.4951.54
ANNIS Version 4.6.6 and 4.6.7

Additional context
The corpus was converted using pepper and imported in the relAnnis format.

stack overflow on importing (possibly large) corpus

Describe the bug
annis import aborts on importing a large corpus.

08:12:34 [INFO] loaded 7400000 lines from relANNIS_1.1/relAnnis_anno/node.annis
08:12:41 [INFO] loaded 7500000 lines from relANNIS_1.1/relAnnis_anno/node.annis
08:12:49 [INFO] loaded 7600000 lines from relANNIS_1.1/relAnnis_anno/node.annis

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

To Reproduce

start ANNIS in the data directory ~/annis /mnt/annis/v4/
import relAnnis-Corpus (either with >> import relANNIS_1.1/relAnnis_anno/ which is the unpacked directory or with >> import relANNIS_1.1_packed.zip which is the compressed version, which can be found here)
ANNIS starts importing and exits after loading 7600000 lines

Expected behavior
The corpus import should run through.

Desktop:

OS: Ubuntu Linux 20.04 LTS

Java Version:

openjdk version "17" 2021-09-14
OpenJDK Runtime Environment (build 17+35-Ubuntu-120.04)
OpenJDK 64-Bit Server VM (build 17+35-Ubuntu-120.04, mixed mode, sharing)

ANNIS-Version: graphANNIS CLI 1.3.0

Additional context

To rule out file storage limitations, I moved the data folder to a separate partition (hence the /mnt/annis/v4/) - that did not do the trick.

How to get first/last token in span

I'm testing the latest Kickstarter, which is great, but I have a scenario I can't get to work:

I have a span annotation for entities with a pointing relation to the syntactic head token within it. I want to export all entity type annotations with head and first token; however, sometimes the head token is itself the first token. Things I've tried:

Just query span ->head tok and CSV export, parse the first token out of covered text - this times out at 1000/30000 hits
Query span ->head tok & tok & #1 _l_ #3. Now I can use frequencies, which is lightning fast (yay!), but if the head is span initial I don't get those hits due to the reflexivity constraint

Any ideas are appreciated. Ideally I hope csvexport will be faster, but I could see users wanting to carry out something like query 2, which AQL doesn't allow as of now.