natlibfi / annif
Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
Home Page: https://annif.org
License: Other
Document and implement an optimize CLI command (no REST API necessary) that works similarly to the optimize.py script in the prototype, i.e., given a document corpus with gold standard subjects, look for analysis parameters (in a given range?) that give the best results. Optionally store the parameters in the project configuration.
The CLI command pattern would be something like this:
annif optimize <projectid> <dir> [--threads N] [--maxdocs 10] [--rounds 10] [--store=true/false] [--paramN=5:35:2]...
We need a backend that can make a call to a REST service, e.g. api.annif.org or the upcoming MauiService.
Implement the analysis commands so that results come from backends. Part of #38
When Annif returns bad subjects, it can be difficult to understand why they were suggested. An explain parameter for the analyze functionality could be used to enable explanation functionality, which would return, for each suggested subject, the text from all of the blocks in the document that contributed to the subject assignment, sorted by score (highest first). This would give at least some idea of which parts of the document caused the match.
We could have a UI for administering most/all aspects of an Annif installation. It would most likely be a single-page dynamic web app that uses the REST API to perform admin functions.
Currently we assume that Elasticsearch runs on localhost in the default port and requires no credentials. We should make this configurable, so that the configuration file can specify the ES hostname, port and/or required credentials.
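A hypothetical configuration section might look like this (the section and key names here are purely illustrative, not an existing convention):

[elasticsearch]
host=es.example.org
port=9200
username=annif
password=secret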
Currently we only discover subject files with the .key extension; we should also support .tsv files as documented in Corpus formats.
Make it possible to configure backends for a project. Part of #38
See #30
Currently we have three analyzers (fi, sv, en) all based on NLTK SnowballStemmer. This leads to duplicated code. Adding more languages would imply adding more analyzers.
We could instead have just one SnowballAnalyzer that takes a language parameter. Then it could be configured like this:
analyzer=snowball(finnish)
or why not
analyzer=snowball(french)
A similar approach would work for libvoikko (#37) which supports several languages:
analyzer=voikko(fi)
The parameter would be passed directly to the algorithm so all languages supported by the stemmers/lemmatizers would be available.
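A minimal sketch of such an analyzer, assuming the analyzer interface exposes a word normalization method (the class and method names are illustrative, not an existing API):

import nltk.stem.snowball

class SnowballAnalyzer:
    """Analyzer that stems words with NLTK SnowballStemmer for the
    language given as a configuration parameter, e.g. snowball(finnish)"""

    def __init__(self, language):
        # "finnish", "english", "french" etc., as supported by NLTK
        self.stemmer = nltk.stem.snowball.SnowballStemmer(language)

    def normalize_word(self, word):
        return self.stemmer.stem(word.lower())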
Make a tf-idf backend using gensim. The load operation needs to be implemented for this.
We probably need some toy test data for unit tests.
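A rough sketch of how the gensim pieces could fit together, using a toy corpus of the kind mentioned above (the data and wiring are illustrative, not the final backend design):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# toy corpus: one list of tokens per subject
subject_tokens = [
    ["cat", "feline", "pet"],
    ["dog", "canine", "pet"],
]

dictionary = Dictionary(subject_tokens)
corpus = [dictionary.doc2bow(tokens) for tokens in subject_tokens]
tfidf = TfidfModel(corpus)
index = SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# analyzing a document gives a similarity score per subject
doc_bow = dictionary.doc2bow(["my", "pet", "cat"])
scores = index[tfidf[doc_bow]]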
This is quite a drastic change from before, but I think it's the right thing to do to get things going.
Drop the use of Elasticsearch (currently used only to manage projects) and switch to a simpler architecture where projects are defined in a configuration file (projects.cfg).
This also means that we can drop the project management CLI and REST operations. Projects can be read-only.
The index functionality can later be reimplemented using gensim and/or fastText. These will become backends, and there can eventually be several backends per project, as defined in another configuration file (backends.cfg).
The first version of the framework should have
We could enhance the simple learning (#17) by trying to adjust the term vectors of subjects. A contextual approach to learning could be:
Given a document and a gold standard set of subjects, analyze the document.
This would require a more advanced representation of subjects as ES documents, with at least four fields:
Command-wise, this could be implemented as an extra parameter (--context) to the learn and learndir operations.
The same format as used by the Annif prototype
using nltk.stem.snowball.SwedishStemmer
Analyzers should support a method for filtering words. The basic rules should be:
Defines the operations backends should implement. Part of #38
using nltk.stem.snowball.SnowballStemmer("english")
Check which license to use and add license text to repository. CC0 is probably not applicable.
In #34 I resorted to a Snowball stemmer for Finnish because of difficulties installing libvoikko in a virtual environment (and Travis might be problematic too).
But it would be worth at least trying whether libvoikko gives better results than the stupid Snowball stemmer.
Part of #38. We just need an empty registry and unit test.
Our REST responses in error situations should follow RFC 7807. Some concrete guidance can be found in the Zalando RESTful API guidelines.
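For example, a request for an unknown project could return Content-Type: application/problem+json with a body like this (the field values are illustrative):

{
  "type": "about:blank",
  "title": "Not Found",
  "status": 404,
  "detail": "No project found with id 'myproject-fi'"
}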
We should set up integration with either Coveralls or Code Climate (or both) to monitor test coverage.
Use Connexion for REST API implementation, using the Swagger API spec to define the functionality
Support for these CLI commands:
(Do we also need an operation for clearing all subjects from a project, without dropping the project? Or maybe the load operation could do that before actually loading anything to the index?)
Plus unit tests and the corresponding REST operations.
Using libvoikko
We should set up static analyzers that monitor code quality. These tools can be integrated directly with GitHub so that they check each committed version.
We could use one or more of these tools (even all of them, if it's not too much work):
Create a dummy backend type that always delivers empty results. Then we can wire up and test the analysis commands.
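A sketch of what the dummy backend could look like (the base class is omitted and the method signature is a placeholder, since the backend interface is only being defined in #38):

class DummyBackend:
    """Backend type that always returns an empty list of suggestions"""

    name = "dummy"

    def analyze(self, text, params):
        # no actual analysis; always suggest nothing
        return []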
Currently the swagger spec defines mappings between REST methods and Annif operations, but they don't seem to work right. Make sure that at least the project and analyze methods work via REST.
We need a web UI for testing Annif, e.g. a simple HTML form similar to the one in the prototype (http://annif.org)
CLI implementation of:
With unit tests, naturally
This is the core operation that Annif needs to support: the analyze CLI command and the corresponding REST API command, plus unit tests.
Similar to analyzers and projects.
The annif.backend package contains a registry of backend types. Modules within the package implement different types of backends (tf-idf, Maui, fastText...)
A configuration file (e.g. backends.cfg) defines the available backends, like this:
# Backend using a gensim tf-idf vector space model with a sparse matrix index
[fi-tfidf]
type=tf-idf
# These correspond to Dictionary.filter_extremes() parameters
dict_no_below=1
dict_no_above=0.2
dict_keep_n=1000000
# How many sentences to include in each chunk
chunk_size=5
limit=100
# Backend using a MauiService REST API
[fi-maui]
type=mauiservice
service=http://localhost:8080/maui/yso-fi/analyze
limit=50
The projects.cfg configuration file lists the backends to use for each project (with optional weights):
[myproject-fi]
language=fi
analyzer=finnish
backends=fi-tfidf:0.4,fi-maui:0.6
The results from each backend are combined using a weighted average of scores.
For this issue, only the backend infrastructure is necessary. Actual backends are implemented separately.
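A minimal sketch of the weighted merge, assuming each backend returns its results as a dict mapping subjects to scores (the data shapes are illustrative):

def merge_results(backend_results, weights):
    """Combine per-backend {subject: score} dicts into one dict using a
    weighted average; subjects missing from a backend count as score 0"""
    total_weight = sum(weights)
    merged = {}
    for result, weight in zip(backend_results, weights):
        for subject, score in result.items():
            merged[subject] = merged.get(subject, 0.0) + score * weight
    return {subj: score / total_weight for subj, score in merged.items()}

# e.g. fi-tfidf with weight 0.4 and fi-maui with weight 0.6
merged = merge_results(
    [{"yso:p1": 0.9, "yso:p2": 0.3}, {"yso:p1": 0.5}],
    [0.4, 0.6])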
This could be done using the sklearn IsotonicRegression model. The idea is to store concept hit scores from each backend and whether the matches were correct or not. Then we can turn those scores into probabilities.
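A sketch with sklearn (the training data here is made up for illustration):

from sklearn.isotonic import IsotonicRegression

# raw hit scores from a backend, and whether each match was correct
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
correct = [0, 0, 1, 1, 1]

model = IsotonicRegression(out_of_bounds="clip")
model.fit(scores, correct)

# map a new raw score to an estimated probability of correctness
probability = model.predict([0.6])[0]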
We could support plugins for pre- and/or post-processing the document analysis functionality.
A plugin could be a subclass of a class like this:
class AnnifPlugin:
    """A plugin that tweaks Annif queries before and/or after they are executed"""

    def process_analyze_query(self, query):
        """Preprocess an analyze query, tweaking the parameters before the
        query is executed"""
        # default implementation is a no-op
        return query

    def postprocess_analyze_query(self, query, result):
        """Postprocess an analyze query and result, tweaking the result
        before responding to the client"""
        # default implementation is a no-op
        return result
For registering plugins, we could perhaps make use of PluginBase. Each plugin would be a separate Python project that registers itself with the Annif plugin system. Each Annif project could define a set of plugins to use. The plugins could be stacked/chained, so that the result of one plugin would be fed to the next one in the chain.
Plugins would be fed the raw result of Annif queries (with lots of candidate subjects), before cutting them down to the requested size and/or applying score thresholds. This way the plugins have more candidate subjects to work with.
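A sketch of the stacking idea (plugin discovery via PluginBase is left out; the function here is purely illustrative):

def run_analyze(query, plugins, analyze_fn):
    # each plugin preprocesses the query in turn
    for plugin in plugins:
        query = plugin.process_analyze_query(query)
    result = analyze_fn(query)
    # each plugin postprocesses the raw, uncut result in turn
    for plugin in plugins:
        result = plugin.postprocess_analyze_query(query, result)
    return result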
Ideas for plugins:
Based on YSO archaeology group, Finna metadata and selected questions from the Ask a librarian corpus
There should be a package annif.analyzer that contains general infrastructure for analyzers as well as language-specific analyzers in separate modules.
The SubjectDirectory class needs a method that returns an iterator over the subjects, where each item is a list of tokens.
Needed for Gensim compatibility (#46)
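A sketch of such an iterator, assuming one text file per subject and an analyzer with a hypothetical tokenize_words method:

import glob
import os.path

class SubjectTokens:
    """Iterate over a subject directory, yielding each subject as a
    list of tokens, as expected by gensim corpus tools"""

    def __init__(self, path, analyzer):
        self.path = path
        self.analyzer = analyzer

    def __iter__(self):
        for filename in sorted(glob.glob(os.path.join(self.path, '*.txt'))):
            with open(filename) as subject_file:
                yield self.analyzer.tokenize_words(subject_file.read())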
We need a document specifying the
The CLI commands can be run by any user with read access to the configuration file.
But the REST API should have more protection. The levels could be something like this:
How to implement this is left open for now. The Connexion toolkit seems to support OAuth2 access control, which might be used here in some way.
To be done after #8
We should have functionality to adjust concept weight/boost values based on feedback from the user or from existing gold standard subjects.
Example CLI command:
annif learn <projectid> <subjectfile> --weight 0.01 <document.txt
The subjectfile would contain gold standard subjects in the document corpus format.
The weight parameter would specify the amount of learning (a factor for weight adjustment). Larger values could be used in interactive use, smaller ones when training on gold standard subjects.
A version for directories of documents with gold standard subjects:
annif learndir <projectid> <directory> --weight 0.01 --maxdocs 10
With the maxdocs parameter, learn from a random sample of N documents from the directory instead of all of them.
The corresponding REST operation (for a single document) could be:
POST /projects/<projectid>/learn
with parameters corresponding to the CLI command.
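One possible shape for the adjustment itself, as a strawman (nothing here is specified yet; the rule and the use of the weight factor are illustrative):

def adjust_boost(current_boost, was_suggested, is_gold, weight):
    """Nudge one subject's boost value based on feedback"""
    if is_gold:
        # reinforce subjects that appear in the gold standard
        return current_boost * (1 + weight)
    if was_suggested:
        # penalize suggestions that were not in the gold standard
        return current_boost * (1 - weight)
    return current_boost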
As specified in #38
We need a class like DocumentDirectory that allows iterating through a subject corpus one subject at a time.
Prerequisite for #14
Currently the Swagger spec doesn't have any data model information. We should define the data model for projects.
We need a specification of the CLI commands (with REST equivalents when applicable), for example as a Markdown file or a Wiki page. This should be written in a form that will serve as user documentation later on.
Using the configparser module.
The file could look like this:
[yso-finna-fi]
vocab=yso-skos.ttl
language=fi
analyzer=finnish
The backends to use for each project would also be specified in this file, but that will come later.
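Reading the file with configparser would look something like this:

import configparser

config = configparser.ConfigParser()
config.read('projects.cfg')

# one section per project
for project_id in config.sections():
    language = config[project_id]['language']
    analyzer = config[project_id]['analyzer']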
We could support a command for evaluating how well Annif performs when compared against a gold standard (one or more manually created subject sets per document).
The CLI command for a single document could be:
annif eval <projectid> <subjectfile> <document.txt
and the corresponding REST API operation would be
POST /projects/<projectid>/evaluate
with the subjects and document text passed in the body.
The result should be metrics including precision, recall, F-measure and Rolling similarity.
A batch operation for directories could be
annif evaldir <projectid> <directory> --maxdocs 10
The maxdocs parameter would limit the evaluation to a random sample of the given number of documents.
The reported metrics would be averaged over the sampled documents.
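For a single document, precision, recall and F-measure reduce to set arithmetic over the suggested and gold standard subject sets; a minimal sketch:

def evaluate(suggested, gold):
    """Precision, recall and F1 for one document"""
    hits = len(suggested & gold)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 2 of 3 suggestions correct; 2 of 4 gold subjects found
evaluate({"cats", "dogs", "fish"}, {"cats", "dogs", "birds", "mice"})
# -> approximately (0.667, 0.5, 0.571)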
We need a specification of the REST API as an OpenAPI YAML file.