
benchmarkstt's People

Contributors

amessina71, aro-max, cjrosen, danielthepope, dependabot[bot], eyallavi, ioannisnoukakis, mikesmitheu, pietrop

benchmarkstt's Issues

Contacts and Contributors

  1. A number of contacts need to be defined, for example in the code of conduct we have a [TODO: add email]. Which email can we assign as the guardian of our code of conduct?

  2. Do we add a "CONTRIBUTORS" file, as is often done for open source projects? To what extent do we put forward the individual developers and active maintainers of the project code, and to what extent do we instead unite under the umbrella of "EBU"?

Stating even some loosely defined roles would add clarity for everyone involved in the project.

Version numbering

I propose following PEP 0440 for this:

https://www.python.org/dev/peps/pep-0440/

In short this would mean version numbers like this:

X.YaN   # Alpha release
X.YbN   # Beta release
X.YrcN  # Release Candidate
X.Y     # Final release

Our current pre-release version, as tagged in GitHub, is 0.0.2; this should be 0.0a2 to indicate the pre-release status. The next pull request (another pre-release) I propose tagging as 0.0a3.
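
As a quick sanity check of the ordering (using the third-party packaging library here purely for illustration; it implements PEP 440 comparisons), the proposed alpha tags sort before the final releases as intended:

# Sketch: PEP 440 ordering check, assuming the `packaging` library is available.
from packaging.version import Version

assert Version("0.0a2") < Version("0.0a3") < Version("0.0") < Version("0.0.2")
print("0.0a2 correctly sorts as a pre-release of 0.0")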

Project toolkit name

We have to decide how to name the toolkit. An easily identifiable name stands out from the pack and seems to encourage usage.

At the moment I've just chosen a name for #16 that seemed perfectly adequate: "conferatur". I chose it because the Latin meaning seems to apply decently to this project, and it can easily be search-and-replaced if we go with another name.

See: https://en.wiktionary.org/wiki/confero#Latin

On the one hand it means "to bring together", "to unite", "to consult", as is easily recognisable from the verb "to confer", which derives from it. This is what we are attempting to do with this project: to create something, through cooperation of interested parties, that will be useful to all.
On the other hand it means "to set in opposition, oppose, match", which is what the toolkit itself is trying to do, namely to compare different solutions by measuring them against common metrics.

The name is not set in stone and may be changed at any time, but once we've decided on it I propose renaming the git repository to match, so as to keep things consistent.

Ground truth materials

Two freely available datasets have already been identified; more are welcome:

  1. Mozilla Common Voice https://voice.mozilla.org/en
    CC-0 license
  2. Openslr resources http://openslr.org/resources.php
    Each resource has its own license, ranging from "unrestricted" to "CC-BY-NC-ND 3.0"
    Remark: some of the Openslr data is likely to have been used for training various STT systems, so it may not always be the fairest indicator

Open questions:

  • Which ground truth materials might we use for evaluating vendors' solutions? Will we build our own dataset? Or both?
  • How will we include these resources into our product?

Licence

Options:

  • MIT
  • Apache
  • GPL
  • BSD

Position of positional arguments

benchmarkstt ref.txt hyp.txt --wer works, but benchmarkstt --wer ref.txt hyp.txt returns error: the following arguments are required: reference, hypothesis.

The help and documentation make it look like the positional arguments should come at the end, but they seem to need to be at the start of the command.

My preferred solution would be to go back to naming all arguments (e.g. --reference) as this removes ambiguity, especially when arguments with optional values are also used.
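
As a rough sketch of what that could look like (a hypothetical parser, not the current benchmarkstt CLI), with named arguments both orderings parse identically:

import argparse

# Hypothetical parser illustrating the named-argument proposal; not the real CLI.
parser = argparse.ArgumentParser(prog="benchmarkstt")
parser.add_argument("--reference", required=True, help="reference transcript file")
parser.add_argument("--hypothesis", required=True, help="hypothesis transcript file")
parser.add_argument("--wer", action="store_true", help="calculate word error rate")

# Both invocations now produce the same namespace, regardless of argument order:
print(parser.parse_args("--wer --reference ref.txt --hypothesis hyp.txt".split()))
print(parser.parse_args("--reference ref.txt --hypothesis hyp.txt --wer".split()))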

Show normalization rules in results

Normalization rules are applied from top to bottom and may conflict. It's a good idea to surface this to the user. Logging is already implemented in #19 so this issue covers presentation only.

Publishing on pypi

To make the code easily available and installable, we should probably publish the package on PyPI (pypi.org).

Everything is already set up to support this once the name is decided (see #17). We should probably publish under an EBU account; does one already exist?

PEP 8

For now Python 3 PEP 8 will be followed. No decision has been made yet on enforcing this (we should, imho); dissent can be expressed here.

(issue part of #14)

Special normalisation rules

How to deal with:

  1. numbers
  2. acronyms / symbols
  3. website / email spellings (e.g. use "dot", "at" )

I would aim, as far as possible, for letter-based normalised representations of all of the above, such as:

  1. 100 -> one hundred (cento, cent)
  2. Hz -> (hertz), WHO -> double u aitch o (less sure about this one ...)
  3. www.rai.it -> vu vu vu punto rai punto it (Italian), pippo@pluto.com -> pippo at pluto dot com

Of course this would be purely for the sake of comparison; no one would really want such transcripts as a final product. We don't even need to output the normalised text except for debugging sessions.
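
A deliberately naive sketch of what such spell-outs could look like in code (the replacement pairs here are made up for illustration; real rules would live in per-language tables):

# Naive illustration only: plain string replacements, not a real normaliser.
REPLACEMENTS = [
    ("100", "one hundred"),
    ("Hz", "hertz"),
    ("www.", "w w w dot "),
    ("@", " at "),
    (".com", " dot com"),
    (".it", " dot it"),
]

def spell_out(text: str) -> str:
    for search, replace in REPLACEMENTS:
        text = text.replace(search, replace)
    return text

print(spell_out("pippo@pluto.com reports 100 Hz on www.rai.it"))
# -> "pippo at pluto dot com reports one hundred hertz on w w w dot rai dot it"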

Define and document WER algorithm

This algorithm uses the results of the diff algorithm (#30) to calculate Word Error Rate.

In the documentation, provide pseudo-code and discuss the approach (linking to an issue could be a good alternative).
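
For reference, a sketch of the usual WER formula, assuming the counts come out of the diff step (#30):

# Standard WER definition; the documentation for this issue should spell out
# how edge cases (e.g. an empty reference) are handled.
def word_error_rate(substitutions: int, deletions: int, insertions: int, correct: int) -> float:
    reference_words = substitutions + deletions + correct  # total words in the reference
    return (substitutions + deletions + insertions) / reference_words

print(word_error_rate(substitutions=5, deletions=10, insertions=1, correct=37))  # ~0.308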

Output structured diff results

In addition to a single WER metric (#32), return the results of the diff metrics (s/d/i/c) in a structured format.

Something like:

{
  "hypothesis-file": "file1.txt",
  "diff-results": {"substitutions": 5, "deletions": 10, "insertions": 1, "correct": 37}
}

Include vendor name? See also #33

Group WER results by vendor

Should we return WER per vendor in addition to WER per transcript (#32)?

Consider:

  • The vendor information may not persist in the transcript
  • How to handle multiple hypotheses per vendor - do we simply average the WER?

Make language code(s) mandatory?

For:

  • without a language code we can't determine which normalization rules to apply

Against:

  • some vendors support automatic language identification
  • future support for mixed-language benchmarking

Command line help output

After installing the tool, running benchmarkstt --help gives incomplete help text (see below). At the very least, information on how to build a basic command line for the available modules should be added.

dlvideo@dlstation1:~/benchmarkstt$ benchmarkstt --help
usage: benchmarkstt [--help]
                    [--log-level {warn,error,warning,debug,critical,info,notset}]
                    [--version]
                    {normalization,metrics} ...

BenchmarkSTT main command line script

positional arguments:
  {normalization,metrics}

optional arguments:
  --help                show this help message and exit
  --log-level {warn,error,warning,debug,critical,info,notset}
                        set the logging output level
  --version             output benchmarkstt version number

Normalize transcript using replacement tables

Before WER can be calculated, the reference and hypothesis transcript should have the same normalization rules applied. This issue is for adding simple text replacements using tables. Each CSV file will be named with a language code (see #26).

In v1, replacements will include words only, without punctuation. It may still be a good idea to use field qualifiers in addition to delimiters, for example "back end"[tab]"backend".

What is the delimiter?

Character encoding?
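
A sketch of how such a table could be loaded and applied, assuming for illustration only (since both are open questions above) a tab delimiter, double-quote field qualifiers and UTF-8 encoding:

import csv

# Assumed format: tab-delimited, '"' as field qualifier, UTF-8 encoded.
def load_replacements(path: str) -> list:
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t", quotechar='"')]

def apply_replacements(text: str, replacements: list) -> str:
    for search, replace in replacements:
        text = text.replace(search, replace)
    return text

# e.g. a row "back end"<tab>"backend" turns "back end" into "backend" in the transcript.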

Hosting of documentation and website (currently just the JSON-RPC api "explorer")

Currently the documentation is hosted from my own repository on GitHub Pages, available at https://conferatur.mikesmith.eu.
Do we want to keep this on GitHub Pages but under the EBU account, or do we want to move it to readthedocs.org (again, using an EBU account)?

Similarly, can EBU provide us with a location to host e.g. the JSON-RPC API explorer? (It is currently available at https://conferatur.viaa.be/api, hosted by VIAA.)

Clarify status of JSON-RPC api explorer

This is a very useful tool, but since it is not in scope for v1 we should make sure it doesn't take up too much time. It was decided in today's meeting that it will be provided as-is, without any tests or code reviews. This issue is to add a note to this effect to the documentation.

Convert reference and hypotheses to native schema

In v1, the native schema includes words only:

[
  {"text": "hello"},
  {"text": "world"}
]

The transcripts returned from each vendor, and the reference, need to be converted to this native schema before further processing.
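
For a plain-text input the conversion is trivial; vendor-specific formats would each need their own converter. A minimal sketch:

# Sketch: plain text to the word-only native schema described above.
def to_native_schema(transcript: str) -> list:
    return [{"text": word} for word in transcript.split()]

print(to_native_schema("hello world"))  # [{'text': 'hello'}, {'text': 'world'}]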

Regex normalisation rules

We agreed that for v1, normalisation will be based on simple replacement pairs, one table per language. Many of these pairs will follow a pattern, for example well-liked:well liked, well-known:well known. These can be replaced with a regex pattern: ([a-zA-Z]+)\-([a-zA-Z]+):$1\s$2.

The proposal is for the normaliser to consume a regex pattern file per language in addition to the simple replacement table.

To be decided if this is in scope for v1.
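
Note that the replacement syntax depends on the regex engine; with Python's re module the rule above would use \1 and \2 and a literal space:

import re

# The hyphenated-words rule from above, written in Python's replacement syntax.
pattern, replacement = r"([a-zA-Z]+)-([a-zA-Z]+)", r"\1 \2"
print(re.sub(pattern, replacement, "a well-known and well-liked speaker"))
# -> "a well known and well liked speaker"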

Identify common normalisation rules

For v1, we agreed that instead of normalisation rules we will use simple 1-to-1 replacement pairs, organised by language. Once we have a few of those tables, we can automatically identify common replacements and create rules that apply to all languages, for example Hertz:Hz.

Things to consider:

  • How to maintain this list
  • How to guarantee that the rule applies to all languages?

Behaviour driven development

(related to #14 and #21 )

It seems like it would be a good idea, but it needs enthusiasm from all parties involved, developers and analysts/stakeholders.
If developers are willing to help analysts write correctly formatted specifications in the chosen BDD dialect and to implement them, and the analysts in turn are willing to learn and stick to that dialect, it could increase our overall efficiency as well as bring clarity to the proceedings.

Handling multiple files for normalization rules

There are three types of normalization rules, per language:

  • Default. This is a set of built-in rules if none are specified
  • Simple pair replacements
  • Regex replacements

Each one of these types can include more than one physical file, for example:

/en-GB/regex_file_1
/en-GB/regex_file_2
/en-GB/pairs_file_1
/en-GB/pairs_file_2

Each one of these files can contain multiple rules. The order in which the rules are applied is important, so we need to clarify -

  • In what order files are processed
  • In what order rules within each file are processed.

@MikeSmithEU's suggestion is to pass the normalization rule files in a config file, like --normalization configfile myownconfigfile.conf

Then, inside myownconfigfile.conf, the user lists the files to use. The rules files will be processed in the same order as they appear in the config file. The rules within each file will be processed in their order in the file.

If no --normalization parameter is given it would be equivalent to something like --normalization configfile ./resources/normalization/default/en_GB.
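
A sketch of the intended ordering semantics (hypothetical config format: one rules-file path per line):

# Hypothetical sketch only: files are read in the order listed in the config file,
# and the rules inside each file keep their order of appearance.
def load_rules(config_path: str) -> list:
    rules = []
    with open(config_path, encoding="utf-8") as config:
        for rules_file in (line.strip() for line in config if line.strip()):
            with open(rules_file, encoding="utf-8") as f:
                rules.extend(line.rstrip("\n") for line in f)
    return rules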

Questions:

  • What happens when the user specifies a language but no default file exists for that language - do we still use the English rules?
  • Would it be clearer to the user (and easier to maintain) if, instead of creating default files, we ship the code with some basic files in the language folder? This means that if the user doesn't use the config, no normalization takes place at all. I like this because it removes 'magic' and makes the behaviour very transparent.

Explain WER options

It's quite difficult to understand what --wer [mode] [differ_class] means, even when going into the package sections of the documentation. A short explanatory line would suffice.

Language standards for code (US spelling vs. UK spelling)

Up until now I've mostly been using UK spelling in code examples. What is the preference here: do we want to use UK spelling since this is mostly a European project (i.e. "normalisation", "colour"), or do we want to use US spelling, as is more common in open source projects (i.e. "normalization", "color")?

CLI parameters

To get the conversation going, something like:
benchmarkstt --ref reference.txt --hyp hypothesis.txt --metric wer --lan en-GB

Demo of R2

Keep it very simple:

  • 1 plain text reference file
  • 1 plain text hypothesis file from AWS
  • 1 plain text hypothesis file from Kaldi
  • 1 replacement rule in 1 file
  • 1 regex rule in 1 file

Possibly duplicate this with Dutch.

Explain worddiffs options

It's not clear what --worddiffs [dialect] [differ_class] means. Add a short line explaining the options.

CSV output generates error

CSV is an output option (-o) in the documentation (https://ebu.github.io/benchmarkstt/cli.html) but it generates an error. Personally I'm happy to remove CSV from the documentation rather than fix this.

benchmarkstt  Item2.ref.txt Item2.hyp.txt --wer -o csv
usage: benchmarkstt [-rt REFERENCE_TYPE] [-ht HYPOTHESIS_TYPE]
                    [-o {json,markdown,restructuredtext}]
                    [--diffcounts [differ_class]] [--wer [mode]
                    [differ_class]] [--worddiffs [dialect] [differ_class]]
                    [--config file [section] [encoding]] [--lowercase]
                    [--regex search [replace]] [--replace search [replace]]
                    [--replacewords search replace] [--unidecode] [--log]
                    [--version]
                    [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                    [--help]
                    reference hypothesis
benchmarkstt: error: argument -o/--output-format: invalid choice: 'csv' (choose from 'json', 'markdown', 'restructuredtext')
(venv) lavie01@mc-n354732:~/code/git/benchmarkstt/venv$ 

Output WER per hypothesis transcript

In v1, the hypothesis transcripts are provided manually, without integrating with the vendors' systems. The transcript file might not include the vendor information, and there may be several files per vendor. To keep things simple, in the first instance return WER per transcript and let the user match the transcripts to the vendors.

Error when building docs locally

When running make docs locally on the master branch, the main command line section is missing from the docs, and the build log shows this:

/Users/lavie01/code/git/benchmarkstt/docs/cli.rst:4: WARNING: Module "benchmarkstt.cli" has no attribute "main_parser"
Incorrect argparse :module: or :func: values?

Rename repository

Refs: #17

Should we rename the repository from ai-benchmarking-stt to benchmarkstt to be more in line with the actual toolkit name? If so, it is best to do it as soon as possible.
I propose doing this immediately after the merge of #48 PR2; the following Pull Request will then update the repository documentation to reflect the change.

My vote: 👍

Code coverage reports and goals

(related to #14 )

What reporting strategy do we want to use for this? Looking for more input, and for volunteers to take this on.

  • What percentage of code coverage would we find acceptable (assuming 100% is the gold standard that we will strive for, yet may not achieve)?
  • Do we only merge Pull Requests if we have e.g. 90%+ code coverage?
  • Should this be our first next goal for the existing codebase before we continue with extra development?

Post-processing normalisation

An idea for a different approach to normalisation rules: rather than define them in advance, build an interface where the user can mark replacements that are effectively the same. For example, if the ASR result has we're and the reference has we are, the user can click this word pair. When the user does this, two things happen:

  • The WER score is updated
  • A new replacement pair is added to the normalisation table.

In this way we 'crowd source' the normalisation and avoid building complex normalisation rules.

Presentation of diff results

Currently the diffs are presented as red and green in the command line:
[screenshot: red/green coloured diff output in the terminal]

This can be confusing since red can be interpreted as a deletion. Is there a better way?

Define and document diff algorithm

The algorithm identifies substitutions, deletions, insertions and correct words. In v1, whole words only.

In the documentation, include pseudo-code and a discussion of the chosen algo.
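
A sketch of one possible whole-word approach, using Python's difflib as the differ (an assumption for illustration; documenting the actually chosen differ is the point of this issue):

from difflib import SequenceMatcher

# Sketch: classify whole words as substitutions / deletions / insertions / correct.
def diff_counts(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0, "correct": 0}
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            counts["correct"] += i2 - i1
        elif tag == "delete":
            counts["deletions"] += i2 - i1
        elif tag == "insert":
            counts["insertions"] += j2 - j1
        elif tag == "replace":
            subs = min(i2 - i1, j2 - j1)
            counts["substitutions"] += subs
            counts["deletions"] += (i2 - i1) - subs
            counts["insertions"] += (j2 - j1) - subs
    return counts

print(diff_counts("the quick brown fox", "the quack fox jumps"))
# -> {'substitutions': 1, 'deletions': 1, 'insertions': 1, 'correct': 2}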

Logging error

Multiple errors when using the logging option. Only first instance pasted below.

benchmarkstt  Item2.ref.txt Item2.hyp.txt --wer --lowercase  --log
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1037, in emit
    stream.write(msg + self.terminator)
TypeError: can only concatenate tuple (not "str") to tuple
Call stack:
  File "/Users/lavie01/code/git/benchmarkstt/venv/bin/benchmarkstt", line 10, in <module>
    sys.exit(main())
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/cli.py", line 164, in main
    benchmark_cli.main(parser, args)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/benchmark/cli.py", line 22, in main
    metrics_cli.main(parser, args, normalizer)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/metrics/cli.py", line 48, in main
    ref = list(ref)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/input/core.py", line 58, in __iter__
    return iter(self._input_class(text, normalizer=self._normalizer))
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/input/core.py", line 19, in __iter__
    return iter(self._segmenter(self._text, normalizer=self._normalizer))
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/segmentation/core.py", line 20, in __init__
    self._text = self._normalizer.normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/logger.py", line 47, in _
    result = func(cls, text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/__init__.py", line 23, in normalize
    return self._normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/__init__.py", line 62, in _normalize
    text = normalizer.normalize(text)
  File "/Users/lavie01/code/git/benchmarkstt/venv/lib/python3.7/site-packages/benchmarkstt/normalization/logger.py", line 51, in _
    logger_.info('%s: %s -> %s', list(normalize_stack), text, result)
Message: '%s: %s -> %s'
Arguments: (['NormalizationComposite', 'Lowercase'], 'Здравствуйте. На канале Россия большие вечерние вести в субботу. Будут новости дня, но и будут те, кто у нас сегодня в фокусе. Сегодня мы уже точно начинаем обратный отсчет к саммиту Путин-Трамп, и в этой связи наши собеседники — приехавшие в Москву американские сенаторы Шелби и Тюн [Тун, Thune], в фокусе посол Америке в uh России Хантсман и перепроверимся у российского сенатора Константина Косачева. Но также наши собеседники сразу Евгений Касперский, Наталья Касперская, глава Сбербанка Греф и глава, um, подразделения Visa по безопасности платежей кредитными карточками. Такой, почему такой подбор? А вот главная тема этого выпуска.\n\n\nКак еще в середине семидесятых в федеральную политику США ворвался теперь уже восьмидесятичетырехлетний сенатор Ричард Шелби. Это он на неделе привез в Москву группу конгрессменов США, первую за четыре года. Наше интервью с Шелби, а еще наше на него досье. Как он вручал картину Президенту Рейгану, еще будучи демократом, и почему перешел в республиканцы. С кем он приносил присягу в Сенате, кто как раз отвечает за кибербезопасность. Почему этот сенатор, Тюн, приехал в Москву без смартфона.\n\n\nКакая роль в этом всем посла США в Москве Хантсмана, и почему на приеме у него не было принимающего американцев сенатора Косачева. Наше интервью с ним. Так что это было, и чего ждать. До саммита Путин-Трамп уже только чуть больше недели.\n\n\nПервая съемка на самом новом секретном объекте России, в новеньком центре кибербезопасности Сбербанка. Как предотвратить взлом хакерами наших банковских счетов, возможно ли сотрудничество России и Запада хотя бы в этом?\n\n- Вот что если D_DoS-атака приходит откуда-нибудь отсюда?\n\nНа одном мероприятии Касперская Наталья и Касперский Евгений. Задаем непростые вопросы и им, и руководству самого Сбербанка. И представителю системы карточек Визa [English company name: Visa]. Какие есть варианты?\n', 'здравствуйте. на канале россия большие вечерние вести в субботу. будут новости дня, но и будут те, кто у нас сегодня в фокусе. сегодня мы уже точно начинаем обратный отсчет к саммиту путин-трамп, и в этой связи наши собеседники — приехавшие в москву американские сенаторы шелби и тюн [тун, thune], в фокусе посол америке в uh россии хантсман и перепроверимся у российского сенатора константина косачева. но также наши собеседники сразу евгений касперский, наталья касперская, глава сбербанка греф и глава, um, подразделения visa по безопасности платежей кредитными карточками. такой, почему такой подбор? а вот главная тема этого выпуска.\n\n\nкак еще в середине семидесятых в федеральную политику сша ворвался теперь уже восьмидесятичетырехлетний сенатор ричард шелби. это он на неделе привез в москву группу конгрессменов сша, первую за четыре года. наше интервью с шелби, а еще наше на него досье. как он вручал картину президенту рейгану, еще будучи демократом, и почему перешел в республиканцы. с кем он приносил присягу в сенате, кто как раз отвечает за кибербезопасность. почему этот сенатор, тюн, приехал в москву без смартфона.\n\n\nкакая роль в этом всем посла сша в москве хантсмана, и почему на приеме у него не было принимающего американцев сенатора косачева. наше интервью с ним. так что это было, и чего ждать. до саммита путин-трамп уже только чуть больше недели.\n\n\nпервая съемка на самом новом секретном объекте россии, в новеньком центре кибербезопасности сбербанка. 
как предотвратить взлом хакерами наших банковских счетов, возможно ли сотрудничество россии и запада хотя бы в этом?\n\n- вот что если d_dos-атака приходит откуда-нибудь отсюда?\n\nна одном мероприятии касперская наталья и касперский евгений. задаем непростые вопросы и им, и руководству самого сбербанка. и представителю системы карточек визa [english company name: visa]. какие есть варианты?\n')

Language code

Normalization rules and other resources will be organised by language codes, including country variants.

ISO 639-3 defines 3-letter codes: eng
AWS and GCP use 2-letter codes (ISO 639-1?) followed by country codes: en-US | es-US | en-AU | fr-CA | en-GB

Do we want to conform to existing vendor practice or use a combination of both, for example eng-GB?
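
Whichever convention we pick, a small lookup can keep the variants in sync. A hedged sketch (the table below is partial and purely illustrative):

# Hypothetical helper, not part of the codebase: accept 2- or 3-letter language
# codes and normalise them to the vendor-style "xx-CC" form.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "spa": "es", "nld": "nl"}  # partial table

def normalise_language_code(code: str) -> str:
    lang, _, country = code.partition("-")
    lang = ISO639_3_TO_1.get(lang.lower(), lang.lower())
    return f"{lang}-{country.upper()}" if country else lang

print(normalise_language_code("eng-GB"))  # -> "en-GB"
print(normalise_language_code("en-us"))   # -> "en-US"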

Code reviewers

Call to arms for all willing and able code reviewers. We need each other to double-check the proposed work in every Pull Request, as well as to make meaningful suggestions and ask the right questions.

Those who feel the calling, please make it known here, ideally together with your main area of expertise (some may be excellent at reviewing the frontend, while others are more attuned to backend or testing work).

Also see #14 #15 #21
