
benchmarkstt's People

Contributors

amessina71, aro-max, cjrosen, danielthepope, dependabot[bot], eyallavi, ioannisnoukakis, mikesmitheu, pietrop

benchmarkstt's Issues

Contacts and Contributors

  1. A number of contacts need to be defined, for example in the code of conduct we have a [TODO: add email]. Which email can we assign as the guardian of our code of conduct?

  2. Do we add a "CONTRIBUTORS" file, as is often done for open source projects? To what extent do we put forward the individual developers and active maintainers of the project code, and to what extent do we instead unite under the umbrella of "EBU"?

Stating even some loosely defined roles would add clarity for everyone involved in the project.

Version numbering

I propose following PEP 0440 for this:

https://www.python.org/dev/peps/pep-0440/

In short this would mean version numbers like this:

X.YaN   # Alpha release
X.YbN   # Beta release
X.YrcN  # Release Candidate
X.Y     # Final release

Our current pre-release version, as tagged in GitHub, is 0.0.2; this should be 0.0a2 to indicate the pre-release status. The next pull request (another pre-release) I propose tagging as 0.0a3.
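
As a quick sanity check of the ordering (using the third-party packaging library here purely for illustration; it implements PEP 440 comparisons), the proposed alpha tags sort before the final releases as intended:

# Sketch: PEP 440 ordering check, assuming the `packaging` library is available.
from packaging.version import Version

assert Version("0.0a2") < Version("0.0a3") < Version("0.0") < Version("0.0.2")
print("0.0a2 correctly sorts as a pre-release of 0.0")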

Project toolkit name

We have to decide how to name the toolkit. An easily identifiable name stands out from the pack and seems to encourage usage.

At the moment I've just chosen a name for #16 that seemed perfectly adequate: "conferatur". I chose it because the Latin meaning seems to apply decently to this project, and it can easily be search-and-replaced if we go with another name.

See: https://en.wiktionary.org/wiki/confero#Latin

On the one hand it means "to bring together", "to unite", "to consult", as is easily recognisable from the verb "to confer", which derives from it. This is what we are attempting to do with this project: to create something, through cooperation of interested parties, that will be useful to all.
On the other hand it means "to set in opposition, oppose, match", which is what the toolkit itself is trying to do, namely to compare different solutions by measuring them against common metrics.

The name is not set in stone and may be changed at any time, but once we've decided on it I propose renaming the git repository to match, so as to keep things consistent.

Ground truth materials

Two freely available datasets have already been identified; more are welcome:

  1. Mozilla Common Voice https://voice.mozilla.org/en
    CC-0 license
  2. Openslr resources http://openslr.org/resources.php
    Each resource has its own license, ranging from "unrestricted" to "CC-BY-NC-ND 3.0"
    Remark: some of the Openslr data is likely to have been used for training various STT systems, so it may not always be the fairest indicator

Open questions:

  • Which ground truth materials might we use for evaluating vendors' solutions? Will we build our own dataset? Or both?
  • How will we include these resources into our product?

Licence

Options:

  • MIT
  • Apache
  • GPL
  • BSD

Position of positional arguments

benchmarkstt ref.txt hyp.txt --wer works, but benchmarkstt --wer ref.txt hyp.txt returns error: the following arguments are required: reference, hypothesis.

The help and documentation make it look like the positional arguments should come at the end, but they seem to need to be at the start of the command.

My preferred solution would be to go back to naming all arguments (e.g. --reference) as this removes ambiguity, especially when arguments with optional values are also used.
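
As a rough sketch of what that could look like (a hypothetical parser, not the current benchmarkstt CLI), with named arguments both orderings parse identically:

import argparse

# Hypothetical parser illustrating the named-argument proposal; not the real CLI.
parser = argparse.ArgumentParser(prog="benchmarkstt")
parser.add_argument("--reference", required=True, help="reference transcript file")
parser.add_argument("--hypothesis", required=True, help="hypothesis transcript file")
parser.add_argument("--wer", action="store_true", help="calculate word error rate")

# Both invocations now produce the same namespace, regardless of argument order:
print(parser.parse_args("--wer --reference ref.txt --hypothesis hyp.txt".split()))
print(parser.parse_args("--reference ref.txt --hypothesis hyp.txt --wer".split()))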

Show normalization rules in results

Normalization rules are applied from top to bottom and may conflict. It's a good idea to surface this to the user. Logging is already implemented in #19 so this issue covers presentation only.

Publishing on pypi

To make the code easily available and installable, we should probably publish the package on PyPI (pypi.org).

Everything is already set up to support this once the name is decided (see #17). We should probably publish under an EBU account; does one already exist?

PEP 8

For now Python 3 PEP 8 will be followed. No decision has been made yet on enforcing this (we should, imho); dissent can be expressed here.

(issue part of #14)

Special normalisation rules

How to deal with:

  1. numbers
  2. acronyms / symbols
  3. website / email spellings (e.g. use "dot", "at" )

I would aim, as far as possible, for letter-based normalised representations of all of the above, such as:

  1. 100 -> one hundred (cento, cent)
  2. Hz -> (hertz), WHO -> double u aitch o (less sure about this one ...)
  3. www.rai.it -> vu vu vu punto rai punto it (Italian), pippo@pluto.com -> pippo at pluto dot com

Of course this would be purely for the sake of comparison; no one would really want such transcripts as a final product. We don't even need to output the normalised text except for debugging sessions.
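
A deliberately naive sketch of what such spell-outs could look like in code (the replacement pairs here are made up for illustration; real rules would live in per-language tables):

# Naive illustration only: plain string replacements, not a real normaliser.
REPLACEMENTS = [
    ("100", "one hundred"),
    ("Hz", "hertz"),
    ("www.", "w w w dot "),
    ("@", " at "),
    (".com", " dot com"),
    (".it", " dot it"),
]

def spell_out(text: str) -> str:
    for search, replace in REPLACEMENTS:
        text = text.replace(search, replace)
    return text

print(spell_out("pippo@pluto.com reports 100 Hz on www.rai.it"))
# -> "pippo at pluto dot com reports one hundred hertz on w w w dot rai dot it"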

Define and document WER algorithm

This algorithm uses the results of the diff algorithm (#30) to calculate Word Error Rate.

In the documentation, provide pseudo-code and discuss the approach (linking to an issue could be a good alternative).
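
For reference, a sketch of the usual WER formula, assuming the counts come out of the diff step (#30):

# Standard WER definition; the documentation for this issue should spell out
# how edge cases (e.g. an empty reference) are handled.
def word_error_rate(substitutions: int, deletions: int, insertions: int, correct: int) -> float:
    reference_words = substitutions + deletions + correct  # total words in the reference
    return (substitutions + deletions + insertions) / reference_words

print(word_error_rate(substitutions=5, deletions=10, insertions=1, correct=37))  # ~0.308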

Output structured diff results

In addition to a single WER metric (#32), return the results of the diff metrics (s/d/i/c) in a structured format.

Something like:

{
  "hypothesis-file": "file1.txt",
  "diff-results": {"substitutions": 5, "deletions": 10, "insertions": 1, "correct": 37}
}

Include vendor name? See also #33

Group WER results by vendor

Should we return WER per vendor in addition to WER per transcript (#32)?

Consider:

  • The vendor information may not persist in the transcript
  • How to handle multiple hypotheses per vendor - do we simply average the WER?

Make language code(s) mandatory?

For:

  • without a language code we can't determine which normalization rules to apply

Against:

  • some vendors support automatic language identification
  • future support for mixed-language benchmarking

Command line help output

After installing the tool, running benchmarkstt --help gives incomplete help text (see below). At the very least, information on how to build a basic command line for the available modules should be added.

dlvideo@dlstation1:~/benchmarkstt$ benchmarkstt --help
usage: benchmarkstt [--help]
                    [--log-level {warn,error,warning,debug,critical,info,notset}]
                    [--version]
                    {normalization,metrics} ...

BenchmarkSTT main command line script

positional arguments:
  {normalization,metrics}

optional arguments:
  --help                show this help message and exit
  --log-level {warn,error,warning,debug,critical,info,notset}
                        set the logging output level
  --version             output benchmarkstt version number

Normalize transcript using replacement tables

Before WER can be calculated, the reference and hypothesis transcript should have the same normalization rules applied. This issue is for adding simple text replacements using tables. Each CSV file will be named with a language code (see #26).

In v1, replacements will include words only, without punctuation. It may still be a good idea to use field qualifiers in addition to delimiters, for example "back end"[tab]"backend".

What is the delimiter?

Character encoding?
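
A sketch of how such a table could be loaded and applied, assuming for illustration only (since both are open questions above) a tab delimiter, double-quote field qualifiers and UTF-8 encoding:

import csv

# Assumed format: tab-delimited, '"' as field qualifier, UTF-8 encoded.
def load_replacements(path: str) -> list:
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t", quotechar='"')]

def apply_replacements(text: str, replacements: list) -> str:
    for search, replace in replacements:
        text = text.replace(search, replace)
    return text

# e.g. a row "back end"<tab>"backend" turns "back end" into "backend" in the transcript.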

Hosting of documentation and website (currently just the JSON-RPC api "explorer")

Currently the documentation is hosted from my own repository on GitHub Pages, available at https://conferatur.mikesmith.eu.
Do we want to keep this on GitHub Pages but under the EBU account, or do we want to move it to readthedocs.org (again, using an EBU account)?

Similarly, can EBU provide us with a location to host e.g. the JSON-RPC API explorer? (It is currently available at https://conferatur.viaa.be/api, hosted by VIAA.)

Clarify status of JSON-RPC api explorer

This is a very useful tool, but since it is not in scope for v1 we should make sure it doesn't take up too much time. It was decided in today's meeting that it will be provided as-is, without any tests or code reviews. This issue is to add a note to this effect to the documentation.

Convert reference and hypotheses to native schema

In v1, the native schema includes words only:

[
  {"text": "hello"},
  {"text": "world"}
]

The transcripts returned from each vendor, and the reference, need to be converted to this native schema before further processing.
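
For a plain-text input the conversion is trivial; vendor-specific formats would each need their own converter. A minimal sketch:

# Sketch: plain text to the word-only native schema described above.
def to_native_schema(transcript: str) -> list:
    return [{"text": word} for word in transcript.split()]

print(to_native_schema("hello world"))  # [{'text': 'hello'}, {'text': 'world'}]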

Regex normalisation rules

We agreed that for v1, normalisation will be based on simple replacement pairs, one table per language. Many of these pairs will follow a pattern, for example well-liked:well liked, well-known:well known. These can be replaced with a regex pattern: ([a-zA-Z]+)\-([a-zA-Z]+):$1\s$2.

The proposal is for the normaliser to consume a regex pattern file per language in addition to the simple replacement table.

To be decided if this is in scope for v1.
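
Note that the replacement syntax depends on the regex engine; with Python's re module the rule above would use \1 and \2 and a literal space:

import re

# The hyphenated-words rule from above, written in Python's replacement syntax.
pattern, replacement = r"([a-zA-Z]+)-([a-zA-Z]+)", r"\1 \2"
print(re.sub(pattern, replacement, "a well-known and well-liked speaker"))
# -> "a well known and well liked speaker"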

Identify common normalisation rules

For v1, we agreed that instead of normalisation rules we will use simple 1-to-1 replacement pairs, organised by language. Once we have a few of those tables, we can automatically identify common replacements and create rules that apply to all languages, for example Hertz:Hz.

Things to consider:

  • How to maintain this list
  • How to guarantee that the rule applies to all languages?

Behaviour driven development

(related to #14 and #21 )

It seems like it would be a good idea, but it needs enthusiasm from all parties involved, developers and analysts/stakeholders.
If developers are willing to help analysts write correctly formatted specifications in the chosen BDD dialect and to implement them, and the analysts in turn are willing to learn and stick to that dialect, it could increase our overall efficiency as well as bring clarity to the proceedings.

Handling multiple files for normalization rules

There are three types of normalization rules, per language:

  • Default. This is a set of built-in rules if none are specified
  • Simple pair replacements
  • Regex replacements

Each one of these types can include more than one physical file, for example:

/en-GB/regex_file_1
/en-GB/regex_file_2
/en-GB/pairs_file_1
/en-GB/pairs_file_2

Each one of these files can contain multiple rules. The order in which the rules are applied is important, so we need to clarify -

  • In what order files are processed
  • In what order rules within each file are processed.

@MikeSmithEU's suggestion is to pass the normalization rule files in a config file, like --normalization configfile myownconfigfile.conf

Then, inside myownconfigfile.conf, the user lists the files to use. The rules files will be processed in the same order as they appear in the config file. The rules within each file will be processed in their order in the file.

If no --normalization parameter is given it would be equivalent to something like --normalization configfile ./resources/normalization/default/en_GB.
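
A sketch of the intended ordering semantics (hypothetical config format: one rules-file path per line):

# Hypothetical sketch only: files are read in the order listed in the config file,
# and the rules inside each file keep their order of appearance.
def load_rules(config_path: str) -> list:
    rules = []
    with open(config_path, encoding="utf-8") as config:
        for rules_file in (line.strip() for line in config if line.strip()):
            with open(rules_file, encoding="utf-8") as f:
                rules.extend(line.rstrip("\n") for line in f)
    return rules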

Questions:

  • What happens when the user specifies a language but no default file exists for that language - do we still use the English rules?
  • Would it be clearer to the user (and easier to maintain) if, instead of creating default files, we ship the code with some basic files in the language folder? This means that if the user doesn't use the config, no normalization takes place at all. I like this because it removes 'magic' and makes the behaviour very transparent.

Explain WER options

It's quite difficult to understand what --wer [mode] [differ_class] means, even when going into the package sections of the documentation. A short explanatory line would suffice.

Language standards for code (US spelling vs. UK spelling)

Up until now I've mostly been using UK spelling in code examples. What is the preference here: do we want to use UK spelling since this is mostly a European project (i.e. "normalisation", "colour"), or do we want to use US spelling, as is more common in open source projects (i.e. "normalization", "color")?

CLI parameters

To get the conversation going, something like:
benchmarkstt --ref reference.txt --hyp hypothesis.txt --metric wer --lan en-GB

Demo of R2

Keep it very simple:

  • 1 plain text reference file
  • 1 plain text hypothesis file from AWS
  • 1 plain text hypothesis file from Kaldi
  • 1 replacement rule in 1 file
  • 1 regex rule in 1 file

Possibly duplicate this with Dutch.

Explain worddiffs options

It's not clear what --worddiffs [dialect] [differ_class] means. Add a short line explaining the options.

CSV output generates error

CSV is an output option (-o) in the documentation (https://ebu.github.io/benchmarkstt/cli.html) but it generates an error. Personally I'm happy to remove CSV from the documentation rather than fix this.

benchmarkstt  Item2.ref.txt Item2.hyp.txt --wer -o csv
usage: benchmarkstt [-rt REFERENCE_TYPE] [-ht HYPOTHESIS_TYPE]
                    [-o {json,markdown,restructuredtext}]
                    [--diffcounts [differ_class]] [--wer [mode]
                    [differ_class]] [--worddiffs [dialect] [differ_class]]
                    [--config file [section] [encoding]] [--lowercase]
                    [--regex search [replace]] [--replace search [replace]]
                    [--replacewords search replace] [--unidecode] [--log]
                    [--version]
                    [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                    [--help]
                    reference hypothesis
benchmarkstt: error: argument -o/--output-format: invalid choice: 'csv' (choose from 'json', 'markdown', 'restructuredtext')
(venv) lavie01@mc-n354732:~/code/git/benchmarkstt/venv$ 

Output WER per hypothesis transcript

In v1, the hypothesis transcripts are provided manually, without integrating with the vendors' systems. The transcript file might not include the vendor information, and there may be several files per vendor. To keep things simple, in the first instance return WER per transcript and let the user match the transcripts to the vendors.

Error when building docs locally

When running make docs locally on the master branch, the main command line section is missing from the docs, and the build log shows this:

/Users/lavie01/code/git/benchmarkstt/docs/cli.rst:4: WARNING: Module "benchmarkstt.cli" has no attribute "main_parser"
Incorrect argparse :module: or :func: values?

Rename repository

Refs: #17

Should we rename the repository from ai-benchmarking-stt to benchmarkstt to be more in line with the actual toolkit name? If so, it is best to do it as soon as possible.
I propose doing this immediately after the merge of #48 PR2; the following Pull Request will then update the repository documentation to reflect the change.

My vote: 👍

Code coverage reports and goals

(related to #14 )

What reporting strategy do we want to use for this? Looking for more input, and for volunteers to take this on.

  • What percentage of code coverage would we find acceptable (assuming 100% is the gold standard that we will strive for, yet may not achieve)?
  • Do we only merge Pull Requests if we have e.g. 90%+ code coverage?
  • Should this be our first next goal for the existing codebase before we continue with extra development?

Post-processing normalisation

An idea for a different approach to normalisation rules: rather than define them in advance, build an interface where the user can mark replacements that are effectively the same. For example, if the ASR result has we're and the reference has we are, the user can click this word pair. When the user does this, two things happen:

  • The WER score is updated
  • A new replacement pair is added to the normalisation table.

In this way we 'crowd source' the normalisation and avoid building complex normalisation rules.

Presentation of diff results

Currently the diffs are presented as red and green in the command line:
[screenshot: red/green coloured diff output in the terminal]

This can be confusing since red can be interpreted as a deletion. Is there a better way?

Define and document diff algorithm

The algorithm identifies substitutions, deletions, insertions and correct words. In v1, whole words only.

In the documentation, include pseudo-code and a discussion of the chosen algo.
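
A sketch of one possible whole-word approach, using Python's difflib as the differ (an assumption for illustration; documenting the actually chosen differ is the point of this issue):

from difflib import SequenceMatcher

# Sketch: classify whole words as substitutions / deletions / insertions / correct.
def diff_counts(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0, "correct": 0}
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            counts["correct"] += i2 - i1
        elif tag == "delete":
            counts["deletions"] += i2 - i1
        elif tag == "insert":
            counts["insertions"] += j2 - j1
        elif tag == "replace":
            subs = min(i2 - i1, j2 - j1)
            counts["substitutions"] += subs
            counts["deletions"] += (i2 - i1) - subs
            counts["insertions"] += (j2 - j1) - subs
    return counts

print(diff_counts("the quick brown fox", "the quack fox jumps"))
# -> {'substitutions': 1, 'deletions': 1, 'insertions': 1, 'correct': 2}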

Logging error

Multiple errors when using the logging option. Only first instance pasted below.

benchmarkstt  Item2.ref.txt Item2.hyp.txt --wer --lowercase  --log
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1037, in emit
    stream.write(msg + self.terminator)
TypeError: can only concatenate tuple (not "str") to tuple
Call stack:
  File "/Users/lavie01/code/git/benchmarkstt/venv/bin/benchmarkstt", line 10, in <module>
    sys.exit(main())
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/cli.py", line 164, in main
    benchmark_cli.main(parser, args)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/benchmark/cli.py", line 22, in main
    metrics_cli.main(parser, args, normalizer)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/metrics/cli.py", line 48, in main
    ref = list(ref)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/input/core.py", line 58, in __iter__
    return iter(self._input_class(text, normalizer=self._normalizer))
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/input/core.py", line 19, in __iter__
    return iter(self._segmenter(self._text, normalizer=self._normalizer))
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/segmentation/core.py", line 20, in __init__
    self._text = self._normalizer.normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/logger.py", line 47, in _
    result = func(cls, text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/__init__.py", line 23, in normalize
    return self._normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/__init__.py", line 62, in _normalize
    text = normalizer.normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/logger.py", line 51, in _
    logger_.info('%s: %s -> %s', list(normalize_stack), text, result)
Message: '%s: %s -> %s'
Arguments: (['NormalizationComposite', 'Lowercase'], 'Здравствуйте. На канале Россия большие вечерние вести в субботу. Будут новости дня, но и будут те, кто у нас сегодня в фокусе. Сегодня мы уже точно начинаем обратный отсчет к саммиту Путин-Трамп, и в этой связи наши собеседники — приехавшие в Москву американские сенаторы Шелби и Тюн [Тун, Thune], в фокусе посол Америке в uh России Хантсман и перепроверимся у российского сенатора Константина Косачева. Но также наши собеседники сразу Евгений Касперский, Наталья Касперская, глава Сбербанка Греф и глава, um, подразделения Visa по безопасности платежей кредитными карточками. Такой, почему такой подбор? А вот главная тема этого выпуска.\n\n\nКак еще в середине семидесятых в федеральную политику США ворвался теперь уже восьмидесятичетырехлетний сенатор Ричард Шелби. Это он на неделе привез в Москву группу конгрессменов США, первую за четыре года. Наше интервью с Шелби, а еще наше на него досье. Как он вручал картину Президенту Рейгану, еще будучи демократом, и почему перешел в республиканцы. С кем он приносил присягу в Сенате, кто как раз отвечает за кибербезопасность. Почему этот сенатор, Тюн, приехал в Москву без смартфона.\n\n\nКакая роль в этом всем посла США в Москве Хантсмана, и почему на приеме у него не было принимающего американцев сенатора Косачева. Наше интервью с ним. Так что это было, и чего ждать. До саммита Путин-Трамп уже только чуть больше недели.\n\n\nПервая съемка на самом новом секретном объекте России, в новеньком центре кибербезопасности Сбербанка. Как предотвратить взлом хакерами наших банковских счетов, возможно ли сотрудничество России и Запада хотя бы в этом?\n\n- Вот что если D_DoS-атака приходит откуда-нибудь отсюда?\n\nНа одном мероприятии Касперская Наталья и Касперский Евгений. Задаем непростые вопросы и им, и руководству самого Сбербанка. И представителю системы карточек Визa [English company name: Visa]. Какие есть варианты?\n', 'здравствуйте. на канале россия большие вечерние вести в субботу. будут новости дня, но и будут те, кто у нас сегодня в фокусе. сегодня мы уже точно начинаем обратный отсчет к саммиту путин-трамп, и в этой связи наши собеседники — приехавшие в москву американские сенаторы шелби и тюн [тун, thune], в фокусе посол америке в uh россии хантсман и перепроверимся у российского сенатора константина косачева. но также наши собеседники сразу евгений касперский, наталья касперская, глава сбербанка греф и глава, um, подразделения visa по безопасности платежей кредитными карточками. такой, почему такой подбор? а вот главная тема этого выпуска.\n\n\nкак еще в середине семидесятых в федеральную политику сша ворвался теперь уже восьмидесятичетырехлетний сенатор ричард шелби. это он на неделе привез в москву группу конгрессменов сша, первую за четыре года. наше интервью с шелби, а еще наше на него досье. как он вручал картину президенту рейгану, еще будучи демократом, и почему перешел в республиканцы. с кем он приносил присягу в сенате, кто как раз отвечает за кибербезопасность. почему этот сенатор, тюн, приехал в москву без смартфона.\n\n\nкакая роль в этом всем посла сша в москве хантсмана, и почему на приеме у него не было принимающего американцев сенатора косачева. наше интервью с ним. так что это было, и чего ждать. до саммита путин-трамп уже только чуть больше недели.\n\n\nпервая съемка на самом новом секретном объекте россии, в новеньком центре кибербезопасности сбербанка. 
как предотвратить взлом хакерами наших банковских счетов, возможно ли сотрудничество россии и запада хотя бы в этом?\n\n- вот что если d_dos-атака приходит откуда-нибудь отсюда?\n\nна одном мероприятии касперская наталья и касперский евгений. задаем непростые вопросы и им, и руководству самого сбербанка. и представителю системы карточек визa [english company name: visa]. какие есть варианты?\n')

Language code

Normalization rules and other resources will be organised by language codes, including country variants.

ISO 639-3 defines 3-letter codes: eng
AWS and GCP use 2-letter codes (ISO 639-1?) followed by country codes: en-US | es-US | en-AU | fr-CA | en-GB

Do we want to conform to existing vendor practice or use a combination of both, for example eng-GB?
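
Whichever convention we pick, a small lookup can keep the variants in sync. A hedged sketch (the table below is partial and purely illustrative):

# Hypothetical helper, not part of the codebase: accept 2- or 3-letter language
# codes and normalise them to the vendor-style "xx-CC" form.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "spa": "es", "nld": "nl"}  # partial table

def normalise_language_code(code: str) -> str:
    lang, _, country = code.partition("-")
    lang = ISO639_3_TO_1.get(lang.lower(), lang.lower())
    return f"{lang}-{country.upper()}" if country else lang

print(normalise_language_code("eng-GB"))  # -> "en-GB"
print(normalise_language_code("en-us"))   # -> "en-US"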

Code reviewers

Call to arms for all willing and able code reviewers. We need each other to double-check the proposed work in every Pull Request, as well as to make meaningful suggestions and ask the right questions.

Those who feel the calling, please make it known here, ideally together with your main area of expertise (some may be excellent at reviewing the frontend, while others are more attuned to backend or testing work).

Also see #14 #15 #21
