natlibfi / annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.

Home Page: https://annif.org

License: Other

Languages: Python 96.15%, HTML 2.46%, CSS 0.76%, Dockerfile 0.44%, Shell 0.20%

Topics: subject-indexing, python, machine-learning, code4lib, classification, rest-api, flask-application, connexion, multilabel-classification, text-classification

annif's People

Contributors

dependabot[bot], juhoinkinen, kinow, lgtm-migrator, mo-fu, monalehtinen, mvsjober, osma, rneatherway, step-security-bot, tvirolai, unnikohonen, zuphilip


annif's Issues

Optimize analysis parameters

Document and implement an optimize CLI command (no REST API necessary) that works similarly to the optimize.py script in the prototype: given a document corpus with gold standard subjects, search for the analysis parameters (within a given range?) that give the best results. Optionally store the parameters in the project configuration.

CLI command pattern would be something like this:

annif optimize <projectid> <dir> [--threads N] [--maxdocs 10] [--rounds 10] [--store=true/false] [--paramN=5:35:2]...
  • threads: number of threads to run in parallel, to make best use of CPU/ES
  • maxdocs: take a random sample of this many documents from the directory instead of all
  • rounds: how many times to start over from a random set of parameters
  • store: whether to store the best parameters in the project configuration, or just show them
  • paramN: this would define the search space for each analysis parameter (with reasonable defaults)
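A minimal sketch of the random-restart search loop, assuming a hypothetical evaluate() helper that scores a parameter set against the gold standard (e.g. returning an F-measure) and a search space parsed from the --paramN options:

import random

# Search space parsed from --paramN=5:35:2 style options:
# each entry maps a parameter name to (start, stop, step).
SEARCH_SPACE = {"chunk_size": (1, 10, 1), "limit": (50, 200, 10)}

def random_params(space):
    """Pick a random value for each parameter from its start:stop:step range."""
    return {name: random.randrange(start, stop + 1, step)
            for name, (start, stop, step) in space.items()}

def optimize(project, docs, rounds=10):
    """Random-restart search: keep the parameter set with the best score."""
    best_score, best_params = -1.0, None
    for _ in range(rounds):
        params = random_params(SEARCH_SPACE)
        score = evaluate(project, docs, params)  # hypothetical helper
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score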

HTTP/REST client backend

We need a backend that can make a call to a REST service, e.g. api.annif.org or the upcoming MauiService.
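A minimal sketch of such a backend using the requests library; the form parameters and response shape here are assumptions modeled on a MauiService-style analyze call, not a confirmed API:

import requests

def analyze_remote(text, service_url, limit=10):
    """Call a remote analyze endpoint and return (id, label, score) tuples."""
    response = requests.post(service_url, data={"text": text, "limit": limit})
    response.raise_for_status()
    # assumed response shape: {"topics": [{"id": ..., "label": ..., "probability": ...}]}
    return [(hit["id"], hit["label"], hit["probability"])
            for hit in response.json()["topics"]]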

Explain why a subject was matched

When Annif returns bad subjects, it can be difficult to understand why they were suggested. An explain parameter for the analyze operation could enable explanations: for each suggested subject, return the text of all blocks in the document that contributed to the subject assignment, sorted by score (highest first). This would give at least some idea of which parts of the document caused the match.

Administration UI

We could have a UI for administering most/all aspects of an Annif installation. It would most likely be a single-page dynamic web app that uses the REST API to perform admin functions.

Support ES credentials

Currently we assume that Elasticsearch runs on localhost in the default port and requires no credentials. We should make this configurable, so that the configuration file can specify the ES hostname, port and/or required credentials.

Analyzers with parameters

Currently we have three analyzers (fi, sv, en) all based on NLTK SnowballStemmer. This leads to duplicated code. Adding more languages would imply adding more analyzers.

We could instead have just one SnowballAnalyzer that takes a language parameter. Then it could be configured like this:

analyzer=snowball(finnish)

or why not

analyzer=snowball(french)

A similar approach would work for libvoikko (#37) which supports several languages:

analyzer=voikko(fi)

The parameter would be passed directly to the algorithm so all languages supported by the stemmers/lemmatizers would be available.
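A minimal sketch of such a parameterized analyzer and the configuration parsing; NLTK's SnowballStemmer does accept full language names like "finnish" and "french", while the class and method names here are just illustrative:

import re
import nltk.stem.snowball

class SnowballAnalyzer:
    """One analyzer class parameterized by language, instead of one class per language."""

    def __init__(self, language):
        self.stemmer = nltk.stem.snowball.SnowballStemmer(language)

    def normalize_word(self, word):
        return self.stemmer.stem(word.lower())

def get_analyzer(spec):
    """Parse a spec like "snowball(finnish)" and instantiate the right analyzer."""
    name, param = re.match(r"(\w+)\((\w+)\)", spec).groups()
    if name == "snowball":
        return SnowballAnalyzer(param)
    raise ValueError("unknown analyzer type {}".format(name))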

Gensim TF-IDF backend

Make a tf-idf backend using gensim. The load operation needs to be implemented for this.

We probably need some toy test data for unit tests.
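A rough sketch of the core gensim machinery with toy data; the names and parameters are illustrative, not a finished backend:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# Toy corpus: one token list per subject, standing in for real subject corpus data
subject_texts = [["cat", "feline", "pet"], ["dog", "canine", "pet"]]

dictionary = Dictionary(subject_texts)
bow_corpus = [dictionary.doc2bow(text) for text in subject_texts]
tfidf = TfidfModel(bow_corpus)
index = SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Analyze: score each subject against an input document's tokens
query = tfidf[dictionary.doc2bow(["my", "pet", "cat"])]
scores = index[query]  # one similarity score per subject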

New architecture; stop using Elasticsearch

This is quite a drastic change from before, but I think it's the right thing to do to get things going.

Drop the use of Elasticsearch (currently used only to manage projects) and switch to a simpler architecture where projects are defined in a configuration file (projects.cfg).

This also means that we can drop the project management CLI and REST operations. Projects can be read-only.

The index functionality can be later implemented better using gensim and/or fastText. These will become backends and there can eventually be several backends per project, as defined in another configuration file (backends.cfg).

Initial framework with Elasticsearch

The first version of the framework should have

  • a Flask app that can be started using "flask run"
  • a command "flask init" that creates the repository for projects in Elasticsearch
  • tests for the above, running within Travis CI

Contextual learning

We could enhance the simple learning (#17) by trying to adjust the term vectors of subjects. A contextual approach to learning could be:

Given a document and a gold standard set of subjects, analyze the document.

  1. For each block of the document, take the top N matched subjects.
  2. For the matched subjects that are in the gold standard set, add the block to the term vector of that subject.
  3. For the matched subjects that are not in the gold standard set, subtract the block from the term vector of that subject.

This would require a more advanced representation of subjects as ES documents, with at least four fields:

  • original text (from the subject corpus)
  • delta with additional text (term vector to add)
  • delta with text to subtract (term vector to subtract)
  • effective text / term vector, formed from the above

Command-wise, this could be implemented as an extra parameter (--context) to the learn and learndir operations.
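A rough sketch of steps 1-3 above, representing term vectors simply as token counts (the ES delta fields would store something along these lines); analyze() is a hypothetical helper returning the top matched subject ids for a block:

from collections import Counter

def contextual_learn(blocks, gold_subjects, subject_vectors, analyze, topn=10):
    """Adjust per-subject term vectors based on one document and its gold subjects."""
    for block in blocks:
        block_terms = Counter(block.split())
        for subject in analyze(block, limit=topn):
            vector = subject_vectors.setdefault(subject, Counter())
            if subject in gold_subjects:
                vector.update(block_terms)    # add the block: reinforce a correct match
            else:
                vector.subtract(block_terms)  # subtract the block: penalize a wrong match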

Filter words / tokens

Analyzers should support a method for filtering words (see the sketch below). The basic rules should be:

  • words should have a minimum length (3 is OK as a starting value)
  • the word should have at least one letter character, based on Unicode character properties
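A minimal sketch of such a filter, using Unicode character categories (all letter categories start with "L"):

import unicodedata

def is_valid_token(word, minlen=3):
    """Keep tokens that are long enough and contain at least one letter character."""
    return len(word) >= minlen and any(
        unicodedata.category(char).startswith("L") for char in word)

filtered = [t for t in ["a", "b2b", "käsi", "123"] if is_valid_token(t)]  # ["b2b", "käsi"]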

Decide and add license

Check which license to use and add license text to repository. CC0 is probably not applicable.

Better Finnish analyzer based on libvoikko

In #34 I resorted to a Snowball stemmer for Finnish because of difficulties installing libvoikko in a virtual environment (and Travis might be problematic too).

But it would be worth at least trying whether libvoikko gives better results than the stupid Snowball stemmer.

Subject corpus operations

Support for these CLI commands:

  • load
  • create-subject
  • show-subject
  • list-subjects
  • drop-subjects

(Do we also need an operation for clearing all subjects from a project, without dropping the project? Or maybe the load operation could do that before actually loading anything to the index?)

Plus unit tests and the corresponding REST operations.

Integration with static analyzers

We should set up static analyzers that monitor code quality. These tools can be integrated directly with GitHub so that they check each committed version.

We could use one or more of these tools (even all of them, if it's not too much work):

  • Codebeat
  • Scrutinizer
  • Code Climate
  • SourceClear

Dummy backend

Create a dummy backend type that always delivers empty results. Then we can wire up and test the analysis commands.

Connect REST API to operations

Currently the swagger spec defines mappings between REST methods and Annif operations, but they don't seem to work right. Make sure that at least the project and analyze methods work via REST.

Backend package and registry

Similar to analyzers and projects.

The annif.backend package contains a registry of backend types. Modules within the package implement different types of backends (tf-idf, Maui, fastText...)

A configuration file (e.g. backends.cfg) defines the available backends, like this:

# Backend using a gensim tf-idf vector space model with a sparse matrix index
[fi-tfidf]
type=tf-idf
# These correspond to Dictionary.filter_extremes() parameters
dict_no_below=1
dict_no_above=0.2
dict_keep_n=1000000
# How many sentences to include in each chunk
chunk_size=5
limit=100

# Backend using a MauiService REST API
[fi-maui]
type=mauiservice
service=http://localhost:8080/maui/yso-fi/analyze
limit=50

The projects.cfg configuration file lists the backends to use for each project (with optional weights):

[myproject-fi]
language=fi
analyzer=finnish
backends=fi-tfidf:0.4,fi-maui:0.6

The results from each backend are combined using a weighted average of scores.

For this issue, only the backend infrastructure is necessary. Actual backends are implemented separately.
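A minimal sketch of reading backends.cfg with configparser and combining backend results by weighted average; the data shapes here are assumptions:

import configparser

def load_backend_configs(path="backends.cfg"):
    """Read backend definitions; each section name is a backend id."""
    config = configparser.ConfigParser()
    config.read(path)
    return {section: dict(config[section]) for section in config.sections()}

def combine_suggestions(backend_results, weights):
    """Weighted average of per-subject scores across backends.

    backend_results maps backend id -> {subject: score}; weights maps
    backend id -> weight, as parsed from "fi-tfidf:0.4,fi-maui:0.6".
    """
    combined = {}
    total_weight = sum(weights.values())
    for backend_id, results in backend_results.items():
        for subject, score in results.items():
            combined[subject] = (combined.get(subject, 0.0)
                                 + score * weights[backend_id] / total_weight)
    return combined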

Learn per-backend concept trustworthiness

This could be done using the sklearn IsotonicRegression model. The idea is to store concept hit scores from each backend and whether the matches were correct or not. Then we can turn those scores into probabilities.
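A minimal sketch of the calibration step with scikit-learn; the example scores are made up:

from sklearn.isotonic import IsotonicRegression

# Stored per-backend history: raw hit scores and whether each match was correct
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
correct = [0, 0, 1, 0, 1, 1]

# Fit a monotonic mapping from raw score to probability of a correct match
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(scores, correct)

# Calibrated probability for a new hit from this backend
probability = calibrator.predict([0.7])[0]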

Plugin architecture

We could support plugins for pre- and/or post-processing the document analysis functionality.

A plugin could be a subclass of a class like this:

class AnnifPlugin:
    """A plugin that tweaks Annif queries before and/or after they are executed"""

    def process_analyze_query(self, query):
        """Preprocess an analyze query, tweaking the parameters before the query is executed"""
        # default implementation is a no-op
        return query

    def postprocess_analyze_query(self, query, result):
        """Postprocess an analyze query and result, tweaking the result before responding to the client"""
        # default implementation is a no-op
        return result

For registering plugins, we could perhaps make use of PluginBase. Each plugin would be a separate Python project that registers itself to the Annif plugin system. Each Annif project could define a set of plugins to use. The plugins could be stacked/chained, so that the result of one plugin would be fed to the next one in the chain.

Plugins would be fed the raw result of Annif queries (with many candidate subjects), before the results are cut down to the requested size and/or score thresholds are applied. This way the plugins have more candidate subjects to work with.
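The chaining could look something like this sketch, where execute_query() is a hypothetical stand-in for the actual analysis:

def apply_plugins(plugins, query):
    """Run a query through a chain of plugins, pre- and post-processing in order."""
    for plugin in plugins:
        query = plugin.process_analyze_query(query)
    result = execute_query(query)  # hypothetical: performs the actual analysis
    for plugin in plugins:
        result = plugin.postprocess_analyze_query(query, result)
    return result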

Ideas for plugins:

  • Neural plugin that makes use of neural networks to determine whether a set of subjects is "good" (i.e. looks somewhat like existing subject sets it has been trained on) and tweak the set so that it becomes "better"
  • Co-occurrence plugin that checks whether a set of subjects contains pairs of subjects that frequently occur together, and increases their scores

Security levels for REST API methods

The CLI commands can be run by any user with read access to the configuration file,
but the REST API should have more protection. The levels could be something like this:

  1. Superuser: can do anything
  2. Project configuration: can administer (e.g. using PUT) a specific existing project
  3. Subject administration: can administer the subjects of a specific project
  4. Learning: can perform learning operations on existing subjects of a specific project
  5. Analysis: can perform document analysis, evaluation etc. - read only, no need for protection

How to implement this is left open for now. The Connexion toolkit seems to support OAuth2 access control, which might be used here in some way.

Simple learning of concept weights

We should have functionality to adjust concept weight/boost values based on feedback from the user or from existing gold standard subjects.

Example CLI command:

annif learn <projectid> <subjectfile> --weight 0.01 <document.txt

The subjectfile would contain gold standard subjects in the document corpus format.
The weight parameter would specify the amount of learning (factor for weight adjustment). Larger values could be used in interactive use, smaller ones when training on gold standard subjects.

A version for directories of documents with gold standard subjects:

annif learndir <projectid> <directory> --weight 0.01 --maxdocs 10

With the maxdocs parameter, learn from a random sample of N documents from the directory instead of all of them.

The corresponding REST operation (for a single document) could be:

 POST /projects/<projectid>/learn

with parameters corresponding to the CLI command.
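The exact update rule is left open here; one plausible interpretation of the weight parameter is a multiplicative nudge, sketched below with illustrative names:

def learn_weights(suggested, gold_subjects, boosts, weight=0.01):
    """Nudge per-subject boost factors toward the gold standard subjects.

    suggested is a list of (subject, score) pairs from analysis; boosts maps
    subject id -> multiplicative boost factor used in later analyses.
    """
    for subject, score in suggested:
        factor = (1 + weight) if subject in gold_subjects else (1 - weight)
        boosts[subject] = boosts.get(subject, 1.0) * factor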

CLI specification

We need a specification of the CLI commands (with REST equivalents when applicable), for example as a Markdown file or a Wiki page. This should be written in a form that will serve as user documentation later on.

Configuration file for projects

Using the configparser module.

The file could look like this:

[yso-finna-fi]
vocab=yso-skos.ttl
language=fi
analyzer=finnish

Also which backends for each project would be specified in this file, but that will come later.
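Reading the file is straightforward with configparser:

import configparser

config = configparser.ConfigParser()
config.read("projects.cfg")

project = config["yso-finna-fi"]
print(project["vocab"])     # yso-skos.ttl
print(project["language"])  # fi
print(project["analyzer"])  # finnish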

Evaluation against a gold standard

We could support a command for evaluating how well Annif performs when compared against a gold standard (one or more manually created subject sets per document).

The CLI command for a single document could be:

annif eval <projectid> <subjectfile> <document.txt

and the corresponding REST API operation would be

POST /projects/<projectid>/evaluate

with the subjects and document text passed in the body.

The result should be metrics including precision, recall, F-measure and Rolling similarity.

A batch operation for directories could be

annif evaldir <projectid> <directory> --maxdocs 10

The maxdocs parameter would limit the evaluation to a random sample of the given number of documents.

The reported metrics would be averaged over the sampled documents.
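A minimal sketch of the per-document metrics (Rolling similarity omitted); averaging over the sampled documents is then straightforward:

def evaluate_document(suggested, gold):
    """Precision, recall and F-measure for one document's subject sets."""
    suggested, gold = set(suggested), set(gold)
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure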
