
Comments (11)

mjpost commented on June 8, 2024

Yes, there are slides like that at WMT every year :) BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it to do model selection among high-performing neural systems.

However, the point isn't whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They can only be compared if you use the same reference tokenization (similar to how you can't compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT's reference tokenization (meaning your system has to first remove its own tokenization) so that you could just compare across papers. This also prevents scores from being gamed.
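As a minimal sketch of why this matters (toy sentences invented for illustration; exact numbers will vary by sacrebleu version), scoring the same detokenized output under different `tokenize` settings yields different, mutually incomparable BLEU scores:

```python
from sacrebleu.metrics import BLEU

# Detokenized system output and raw (untokenized) references -- toy data.
hyps = ["The cat sat on the mat.", "He didn't go home."]
refs = [["The cat is on the mat.", "He did not go home."]]  # one reference stream

# Same outputs, different reference tokenizations -> different, incomparable scores.
for tok in ("13a", "intl", "none"):
    bleu = BLEU(tokenize=tok)
    score = bleu.corpus_score(hyps, refs)
    print(f"tok={tok:5s} BLEU={score.score:6.2f}  {bleu.get_signature()}")
```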


kpu commented on June 8, 2024

Use sacrebleu on detokenized output and raw unmodified references.
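In practice that looks something like this (a sketch using the sacrebleu Python API; on the command line the equivalent is roughly `cat output.detok.txt | sacrebleu refs.txt`):

```python
import sacrebleu

# System output: detokenized, exactly as a user would see it.
hypotheses = ["Der Hund biss den Mann.", "Es war nicht unerwartet."]

# References: raw and unmodified -- sacrebleu applies its own tokenization internally.
references = [["Der Hund biss den Mann.", "Es war keine Überraschung."]]

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # corpus-level BLEU under sacrebleu's default (13a) tokenization
```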


mjpost commented on June 8, 2024

I second this request. The bottom line is that scores produced with different reference tokenizations are not comparable. To discourage (even inadvertent) cheating, the user should never touch the reference. The v13a tokenization standard is not ideal, but at least it has been consistently used at matrix.statmt.org, facilitating comparisons.

Sacrebleu exposes all its data sources and additionally provides an API for accessing the references, which seems to fit within the spirit of your codebase.
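For example, the command-line tool can echo the reference side of a test set (`sacrebleu -t wmt19 -l en-de --echo ref`). A rough sketch of the Python side, assuming the helpers in sacrebleu.utils (names may differ across versions):

```python
# Sketch only: download a WMT test set and locate its untokenized reference files.
# download_test_set / get_reference_files are assumed from sacrebleu.utils and
# may differ between sacrebleu versions.
from sacrebleu.utils import download_test_set, get_reference_files

download_test_set("wmt19", "en-de")                # fetch and unpack the test set
ref_paths = get_reference_files("wmt19", "en-de")  # paths to the raw reference file(s)
print(ref_paths)
```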


kpu commented on June 8, 2024

I think @bittlingmayer is referring to Figure 6 in http://statmt.org/wmt19/pdf/53/WMT02.pdf. When you look at Appendix A, there are some cases where metrics fall apart at the high end and some where they correlate well. en-zh is arguably production-quality.

This could evolve into a metrics Bazaar where the value added is really the packaging and consistency: it installs/compiles the metrics for me, gives a reproducible name to use in publication (involve the authors; you don't want a hash system different from sacrebleu's), a version number, and an evaluation of the metrics like http://ufallab.ms.mff.cuni.cz/~bojar/wmt19-metrics-task-package.tgz, but run whenever the code changes rather than once a year.


kaivu1999 commented on June 8, 2024

Very important discussion. I am trying to understand the effects of tokenization and wanted to ask which is good practice: should sacrebleu be used on tokenized output, or on detokenized (raw) text?


bittlingmayer commented on June 8, 2024

Didn't we have a slide and discussion at WMT admitting that, for production-quality models, BLEU doesn't correlate with human eval anyway?


tholiao commented on June 8, 2024

I do not consider switching this library's default metric from BLEU to the wrapper around SacreBLEU to be a sufficient solution.

As currently implemented, the wrapper lets end users toggle SacreBLEU options but doesn't pass along the SacreBLEU signature. As @mjpost showed in Post (2018), it's simply not credible to assume that people will stick to the defaults; the signature is therefore necessary to make explicit which options were used.

In addition to the choice between v13a and intl for SacreBLEU's tokenize argument, which was pointed out earlier, papers frequently differ on whether they lowercase text before scoring (lowercase) and on the smoothing method used (smooth_method). BLEU scores can differ substantially (by more than 1 BLEU point) just from changing these options.

Losing the SacreBLEU signature is a regression in reproducibility and clarity.
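To make that concrete, here is a small sketch (assuming the sacrebleu 2.x Python API and toy data) showing how the signature records exactly these choices:

```python
from sacrebleu.metrics import BLEU

hyps = ["The output of my system."]
refs = [["The output of my system."]]

default = BLEU()                                       # case-sensitive, 13a tokenizer, exp smoothing
tweaked = BLEU(lowercase=True, smooth_method="floor")  # a configuration some papers use instead

for bleu in (default, tweaked):
    score = bleu.corpus_score(hyps, refs)
    # The signature encodes case handling, tokenizer, smoothing, number of refs, version, ...
    print(f"{score.score:6.2f}  {bleu.get_signature()}")
```

Reporting that signature string alongside the score is what makes the number reproducible.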

(Perhaps this should belong in a separate issue?)


thomwolf commented on June 8, 2024

Thanks for sharing your thoughts. This is a very important discussion.

Also, one of the first items on our mid-term roadmap (which we will try to clean up and share soon) is to introduce mechanisms for high-quality traceability and reproducibility across all the processes related to the library.

So having the signature for sacrebleu is really important!

Regarding BLEU, I guess we can just remove it from the canonical metrics included in the repo itself (it won't prevent people from adding it as a "user metric", but at least we won't be promoting it).

On a more general note (definitely too large for the scope of this issue) we are wondering, with @srush in particular, how we could handle the selection of metrics/datasets with the most community-based and bottom-up approach possible. If you have opinions on this, please share!


srush commented on June 8, 2024

Yeah, I would love to have discussions about ways this project can have a community-based, transparent process for arriving at strong default metrics. @kpu / @mjpost, do you have any suggestions for how that might work, or pointers to places where this is done right? Perhaps this question can be a template for what is likely to be repeated for other datasets.


srush commented on June 8, 2024

While a Bazaar setup works for models/datasets, I am not sure it is ideal for metrics. Ideal from my perspective would be to have tasks with metrics moderated by experts who document, cite, and codify known pitfalls (as above^) and make it non-trivial for beginners to mess things up.


bittlingmayer commented on June 8, 2024

@srush @thomwolf

ModelFront could provide (automated, "QE-based") evaluation for all the pretrained translation models you host. Not bottom-up and not valid for claiming SoTA, but independent, practical for builders and not top-down.

For that I would also suggest some diverse benchmarks (split out into datasets with only user-generated data, only constants, only UI strings, or only READMEs) that tease out known trade-offs. Even a hypothetical magic eval is limited if we always reduce it to a single number.

Realistically people want to know how a model compares to an API like Google Translate, Microsoft Translator, DeepL or Yandex (especially for a language pair like EN:RU, or for the many languages that only Yandex supports), and that could be done too.

