natlibfi / annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.

Home Page: https://annif.org

License: Other

Languages: Python 96.15%, HTML 2.46%, CSS 0.76%, Dockerfile 0.44%, Shell 0.20%

Topics: subject-indexing, python, machine-learning, code4lib, classification, rest-api, flask-application, connexion, multilabel-classification, text-classification

annif's People

Contributors

dependabot[bot], juhoinkinen, kinow, lgtm-migrator, mo-fu, monalehtinen, mvsjober, osma, rneatherway, step-security-bot, tvirolai, unnikohonen, zuphilip


annif's Issues

Optimize analysis parameters

Document and implement an optimize CLI command (no REST API necessary) that works similarly to the optimize.py script in the prototype: given a document corpus with gold standard subjects, search for the analysis parameters (within a given range?) that give the best results. Optionally store the parameters in the project configuration.

CLI command pattern would be something like this:

annif optimize <projectid> <dir> [--threads N] [--maxdocs 10] [--rounds 10] [--store=true/false] [--paramN=5:35:2]...
  • threads: number of threads to run in parallel, to make best use of CPU/ES
  • maxdocs: take a random sample of this many documents from the directory instead of all
  • rounds: how many times to start over from a random set of parameters
  • store: whether to store the best parameters in the project configuration, or just show them
  • paramN: this would define the search space for each analysis parameter (with reasonable defaults)
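A minimal sketch of the random-restart search loop, assuming a hypothetical evaluate() helper that scores a parameter set against the gold standard (e.g. returning an F-measure) and a search space parsed from the --paramN options:

import random

# Search space parsed from --paramN=5:35:2 style options:
# each entry maps a parameter name to (start, stop, step).
SEARCH_SPACE = {"chunk_size": (1, 10, 1), "limit": (50, 200, 10)}

def random_params(space):
    """Pick a random value for each parameter from its start:stop:step range."""
    return {name: random.randrange(start, stop + 1, step)
            for name, (start, stop, step) in space.items()}

def optimize(project, docs, rounds=10):
    """Random-restart search: keep the parameter set with the best score."""
    best_score, best_params = -1.0, None
    for _ in range(rounds):
        params = random_params(SEARCH_SPACE)
        score = evaluate(project, docs, params)  # hypothetical helper
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score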

HTTP/REST client backend

We need a backend that can make a call to a REST service, e.g. api.annif.org or the upcoming MauiService.
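A minimal sketch of such a backend using the requests library; the form parameters and response shape here are assumptions modeled on a MauiService-style analyze call, not a confirmed API:

import requests

def analyze_remote(text, service_url, limit=10):
    """Call a remote analyze endpoint and return (id, label, score) tuples."""
    response = requests.post(service_url, data={"text": text, "limit": limit})
    response.raise_for_status()
    # assumed response shape: {"topics": [{"id": ..., "label": ..., "probability": ...}]}
    return [(hit["id"], hit["label"], hit["probability"])
            for hit in response.json()["topics"]]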

Explain why a subject was matched

When Annif returns bad subjects, it can be difficult to understand why they were suggested. An explain parameter for the analyze operation could enable explanations: for each suggested subject, return the text of all blocks in the document that contributed to the subject assignment, sorted by score (highest first). This would give at least some idea of which parts of the document caused the match.

Administration UI

We could have a UI for administering most/all aspects of an Annif installation. It would most likely be a single-page dynamic web app that uses the REST API to perform admin functions.

Support ES credentials

Currently we assume that Elasticsearch runs on localhost in the default port and requires no credentials. We should make this configurable, so that the configuration file can specify the ES hostname, port and/or required credentials.

Analyzers with parameters

Currently we have three analyzers (fi, sv, en) all based on NLTK SnowballStemmer. This leads to duplicated code. Adding more languages would imply adding more analyzers.

We could instead have just one SnowballAnalyzer that takes a language parameter. Then it could be configured like this:

analyzer=snowball(finnish)

or why not

analyzer=snowball(french)

A similar approach would work for libvoikko (#37) which supports several languages:

analyzer=voikko(fi)

The parameter would be passed directly to the algorithm so all languages supported by the stemmers/lemmatizers would be available.
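A minimal sketch of such a parameterized analyzer and the configuration parsing; NLTK's SnowballStemmer does accept full language names like "finnish" and "french", while the class and method names here are just illustrative:

import re
import nltk.stem.snowball

class SnowballAnalyzer:
    """One analyzer class parameterized by language, instead of one class per language."""

    def __init__(self, language):
        self.stemmer = nltk.stem.snowball.SnowballStemmer(language)

    def normalize_word(self, word):
        return self.stemmer.stem(word.lower())

def get_analyzer(spec):
    """Parse a spec like "snowball(finnish)" and instantiate the right analyzer."""
    name, param = re.match(r"(\w+)\((\w+)\)", spec).groups()
    if name == "snowball":
        return SnowballAnalyzer(param)
    raise ValueError("unknown analyzer type {}".format(name))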

Gensim TF-IDF backend

Make a tf-idf backend using gensim. The load operation needs to be implemented for this.

We probably need some toy test data for unit tests.
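A rough sketch of the core gensim machinery with toy data; the names and parameters are illustrative, not a finished backend:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# Toy corpus: one token list per subject, standing in for real subject corpus data
subject_texts = [["cat", "feline", "pet"], ["dog", "canine", "pet"]]

dictionary = Dictionary(subject_texts)
bow_corpus = [dictionary.doc2bow(text) for text in subject_texts]
tfidf = TfidfModel(bow_corpus)
index = SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Analyze: score each subject against an input document's tokens
query = tfidf[dictionary.doc2bow(["my", "pet", "cat"])]
scores = index[query]  # one similarity score per subject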

New architecture; stop using Elasticsearch

This is quite a drastic change from before, but I think it's the right thing to do to get things going.

Drop the use of Elasticsearch (currently used only to manage projects) and switch to a simpler architecture where projects are defined in a configuration file (projects.cfg).

This also means that we can drop the project management CLI and REST operations. Projects can be read-only.

The index functionality can be later implemented better using gensim and/or fastText. These will become backends and there can eventually be several backends per project, as defined in another configuration file (backends.cfg).

Initial framework with Elasticsearch

The first version of the framework should have

  • a Flask app that can be started using "flask run"
  • a command "flask init" that creates the repository for projects in Elasticsearch
  • tests for the above, running within Travis CI

Contextual learning

We could enhance the simple learning (#17) by trying to adjust the term vectors of subjects. A contextual approach to learning could be:

Given a document and a gold standard set of subjects, analyze the document.

  1. For each block of the document, take the top N matched subjects.
  2. For the matched subjects that are in the gold standard set, add the block to the term vector of that subject.
  3. For the matched subjects that are not in the gold standard set, subtract the block from the term vector of that subject.

This would require a more advanced representation of subjects as ES documents, with at least four fields:

  • original text (from the subject corpus)
  • delta with additional text (term vector to add)
  • delta with text to subtract (term vector to subtract)
  • effective text / term vector, formed from the above

Command-wise, this could be implemented as an extra parameter (--context) to the learn and learndir operations.
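A rough sketch of steps 1-3 above, representing term vectors simply as token counts (the ES delta fields would store something along these lines); analyze() is a hypothetical helper returning the top matched subject ids for a block:

from collections import Counter

def contextual_learn(blocks, gold_subjects, subject_vectors, analyze, topn=10):
    """Adjust per-subject term vectors based on one document and its gold subjects."""
    for block in blocks:
        block_terms = Counter(block.split())
        for subject in analyze(block, limit=topn):
            vector = subject_vectors.setdefault(subject, Counter())
            if subject in gold_subjects:
                vector.update(block_terms)    # add the block: reinforce a correct match
            else:
                vector.subtract(block_terms)  # subtract the block: penalize a wrong match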

Filter words / tokens

Analyzers should support a method for filtering words (see the sketch below). The basic rules should be:

  • words should have a minimum length (3 is OK as a starting value)
  • the word should have at least one letter character, based on Unicode character properties
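A minimal sketch of such a filter, using Unicode character categories (all letter categories start with "L"):

import unicodedata

def is_valid_token(word, minlen=3):
    """Keep tokens that are long enough and contain at least one letter character."""
    return len(word) >= minlen and any(
        unicodedata.category(char).startswith("L") for char in word)

filtered = [t for t in ["a", "b2b", "käsi", "123"] if is_valid_token(t)]  # ["b2b", "käsi"]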

Decide and add license

Check which license to use and add license text to repository. CC0 is probably not applicable.

Better Finnish analyzer based on libvoikko

In #34 I resorted to a Snowball stemmer for Finnish because of difficulties installing libvoikko in a virtual environment (and Travis might be problematic too).

But it would be worth at least trying whether libvoikko gives better results than the stupid Snowball stemmer.

Subject corpus operations

Support for these CLI commands:

  • load
  • create-subject
  • show-subject
  • list-subjects
  • drop-subjects

(Do we also need an operation for clearing all subjects from a project, without dropping the project? Or maybe the load operation could do that before actually loading anything to the index?)

Plus unit tests and the corresponding REST operations.

Integration with static analyzers

We should set up static analyzers that monitor code quality. These tools can be integrated directly with GitHub so that they check each committed version.

We could use one or more of these tools (even all of them, if it's not too much work):

  • Codebeat
  • Scrutinizer
  • Code Climate
  • SourceClear

Dummy backend

Create a dummy backend type that always delivers empty results. Then we can wire up and test the analysis commands.

Connect REST API to operations

Currently the swagger spec defines mappings between REST methods and Annif operations, but they don't seem to work right. Make sure that at least the project and analyze methods work via REST.

Backend package and registry

Similar to analyzers and projects.

The annif.backend package contains a registry of backend types. Modules within the package implement different types of backends (tf-idf, Maui, fastText...)

A configuration file (e.g. backends.cfg) defines the available backends, like this:

# Backend using a gensim tf-idf vector space model with a sparse matrix index
[fi-tfidf]
type=tf-idf
# These correspond to Dictionary.filter_extremes() parameters
dict_no_below=1
dict_no_above=0.2
dict_keep_n=1000000
# How many sentences to include in each chunk
chunk_size=5
limit=100

# Backend using a MauiService REST API
[fi-maui]
type=mauiservice
service=http://localhost:8080/maui/yso-fi/analyze
limit=50

The projects.cfg configuration file lists the backends to use for each project (with optional weights):

[myproject-fi]
language=fi
analyzer=finnish
backends=fi-tfidf:0.4,fi-maui:0.6

The results from each backend are combined using a weighted average of scores.

For this issue, only the backend infrastructure is necessary. Actual backends are implemented separately.
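A minimal sketch of reading backends.cfg with configparser and combining backend results by weighted average; the data shapes here are assumptions:

import configparser

def load_backend_configs(path="backends.cfg"):
    """Read backend definitions; each section name is a backend id."""
    config = configparser.ConfigParser()
    config.read(path)
    return {section: dict(config[section]) for section in config.sections()}

def combine_suggestions(backend_results, weights):
    """Weighted average of per-subject scores across backends.

    backend_results maps backend id -> {subject: score}; weights maps
    backend id -> weight, as parsed from "fi-tfidf:0.4,fi-maui:0.6".
    """
    combined = {}
    total_weight = sum(weights.values())
    for backend_id, results in backend_results.items():
        for subject, score in results.items():
            combined[subject] = (combined.get(subject, 0.0)
                                 + score * weights[backend_id] / total_weight)
    return combined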

Learn per-backend concept trustworthiness

This could be done using the sklearn IsotonicRegression model. The idea is to store concept hit scores from each backend and whether the matches were correct or not. Then we can turn those scores into probabilities.
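A minimal sketch of the calibration step with scikit-learn; the example scores are made up:

from sklearn.isotonic import IsotonicRegression

# Stored per-backend history: raw hit scores and whether each match was correct
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
correct = [0, 0, 1, 0, 1, 1]

# Fit a monotonic mapping from raw score to probability of a correct match
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(scores, correct)

# Calibrated probability for a new hit from this backend
probability = calibrator.predict([0.7])[0]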

Plugin architecture

We could support plugins for pre- and/or post-processing the document analysis functionality.

A plugin could be a subclass of a class like this:

class AnnifPlugin:
    """A plugin that tweaks Annif queries before and/or after they are executed"""

    def process_analyze_query(self, query):
        """Preprocess an analyze query, tweaking the parameters before the query is executed"""
        # default implementation is a no-op
        return query

    def postprocess_analyze_query(self, query, result):
        """Postprocess an analyze query and result, tweaking the result before responding to the client"""
        # default implementation is a no-op
        return result

For registering plugins, we could perhaps make use of PluginBase. Each plugin would be a separate Python project that registers itself to the Annif plugin system. Each Annif project could define a set of plugins to use. The plugins could be stacked/chained, so that the result of one plugin would be fed to the next one in the chain.

Plugins would be fed the raw result of Annif queries (with many candidate subjects), before the results are cut down to the requested size and/or score thresholds are applied. This way the plugins have more candidate subjects to work with.
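The chaining could look something like this sketch, where execute_query() is a hypothetical stand-in for the actual analysis:

def apply_plugins(plugins, query):
    """Run a query through a chain of plugins, pre- and post-processing in order."""
    for plugin in plugins:
        query = plugin.process_analyze_query(query)
    result = execute_query(query)  # hypothetical: performs the actual analysis
    for plugin in plugins:
        result = plugin.postprocess_analyze_query(query, result)
    return result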

Ideas for plugins:

  • Neural plugin that makes use of neural networks to determine whether a set of subjects is "good" (i.e. looks somewhat like existing subject sets it has been trained on) and tweak the set so that it becomes "better"
  • Co-occurrence plugin that checks whether a set of subjects contains pairs of subjects that frequently occur together, and increases their scores

Security levels for REST API methods

The CLI commands can be run by any user with read access to the configuration file,
but the REST API should have more protection. The levels could be something like this:

  1. Superuser: can do anything
  2. Project configuration: can administer (e.g. using PUT) a specific existing project
  3. Subject administration: can administer the subjects of a specific project
  4. Learning: can perform learning operations on existing subjects of a specific project
  5. Analysis: can perform document analysis, evaluation etc. - read only, no need for protection

How to implement this is left open for now. The Connexion toolkit seems to support OAuth2 access control, which might be used here in some way.

Simple learning of concept weights

We should have functionality to adjust concept weight/boost values based on feedback from the user or from existing gold standard subjects.

Example CLI command:

annif learn <projectid> <subjectfile> --weight 0.01 <document.txt

The subjectfile would contain gold standard subjects in the document corpus format.
The weight parameter would specify the amount of learning (factor for weight adjustment). Larger values could be used in interactive use, smaller ones when training on gold standard subjects.

A version for directories of documents with gold standard subjects:

annif learndir <projectid> <directory> --weight 0.01 --maxdocs 10

With the maxdocs parameter, learn from a random sample of N documents from the directory instead of all of them.

The corresponding REST operation (for a single document) could be:

 POST /projects/<projectid>/learn

with parameters corresponding to the CLI command.
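The exact update rule is left open here; one plausible interpretation of the weight parameter is a multiplicative nudge, sketched below with illustrative names:

def learn_weights(suggested, gold_subjects, boosts, weight=0.01):
    """Nudge per-subject boost factors toward the gold standard subjects.

    suggested is a list of (subject, score) pairs from analysis; boosts maps
    subject id -> multiplicative boost factor used in later analyses.
    """
    for subject, score in suggested:
        factor = (1 + weight) if subject in gold_subjects else (1 - weight)
        boosts[subject] = boosts.get(subject, 1.0) * factor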

CLI specification

We need a specification of the CLI commands (with REST equivalents when applicable), for example as a Markdown file or a Wiki page. This should be written in a form that will serve as user documentation later on.

Configuration file for projects

Using the configparser module.

The file could look like this:

[yso-finna-fi]
vocab=yso-skos.ttl
language=fi
analyzer=finnish

Also which backends for each project would be specified in this file, but that will come later.
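Reading the file is straightforward with configparser:

import configparser

config = configparser.ConfigParser()
config.read("projects.cfg")

project = config["yso-finna-fi"]
print(project["vocab"])     # yso-skos.ttl
print(project["language"])  # fi
print(project["analyzer"])  # finnish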

Evaluation against a gold standard

We could support a command for evaluating how well Annif performs when compared against a gold standard (one or more manually created subject sets per document).

The CLI command for a single document could be:

annif eval <projectid> <subjectfile> <document.txt

and the corresponding REST API operation would be

POST /projects/<projectid>/evaluate

with the subjects and document text passed in the body.

The result should be metrics including precision, recall, F-measure and Rolling similarity.

A batch operation for directories could be

annif evaldir <projectid> <directory> --maxdocs 10

The maxdocs parameter would limit the evaluation to a random sample of the given number of documents.

The reported metrics would be averaged over the sampled documents.
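A minimal sketch of the per-document metrics (Rolling similarity omitted); averaging over the sampled documents is then straightforward:

def evaluate_document(suggested, gold):
    """Precision, recall and F-measure for one document's subject sets."""
    suggested, gold = set(suggested), set(gold)
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure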
