
Welcome to OCELoT!


OCELoT stands for Open, Competitive Evaluation Leaderboard of Translations. This project started as part of the Fifth Machine Translation Marathon in the Americas, hosted at UMD, College Park, MD, from May 28–June 1, 2019. Project OCELoT aims to create an open platform for competitive evaluation of machine translation output, based on both automatic metrics and human evaluation. Code is available from GitHub and shared under an open license.

Getting started

License

OCELoT is released under the BSD 3-Clause License.

Contributors

We are grateful to the following contributors for their support:

Acknowledgments

We acknowledge the following licenses and thank the respective authors for making their work available:

OCELoT's People

Contributors: cfedermann, kocmitom, snukky


OCELoT's Issues

Create basic usage examples

Basic examples for creating a campaign, adding a test set, exporting the results, etc. These could follow the very simple pattern we have in Appraise examples (see Examples/).

Related issues: #6

A test set file with multiple domains, but automatic scores reported only on one of them

Is your feature request related to a problem? Please describe.
WMT22 will have a test set with multiple domains from several shared tasks. We want participants to submit translations of all domains, but the scores in OCELoT should be reported for the individual task only.

Describe the solution you'd like
An additional, optional node in the XML file that can restrict the set of documents/segments on which the automatic scores are reported in OCELoT.

Describe alternatives you've considered
We considered a few others, but this one seems best.

Additional context
None

Allowing users to mark submission removed

Is your feature request related to a problem? Please describe.
Some participants wanted to withdraw their submissions (before the deadline), but there is currently no way to do so. Allowing participants to withdraw submissions/mark them as removed would solve this (it could also serve as a sanity check for bad submissions).

Describe the solution you'd like
Add an option that allows participants to mark their submissions as deleted.

Remove static files and use collectstatic instead

Describe the bug
Suggestion from @snukky in #4:

"Running python manage.py collectstatic overwrites some static files. I think ideally we should add static to .gitignore and should not keep static files in the repository.

To Reproduce
Steps to reproduce the behavior:

  1. Check out the repository

  2. Run python manage.py collectstatic

  3. See an error message similar to the one below:

     (OCELoT) PS C:\Users\chrife\Documents\GitHub\OCELoT> python .\manage.py collectstatic
     WARNING:root:Could not import signal.SIGPIPE (this is expected on Windows machines)
     
     You have requested to collect static files at the destination
     location as specified in your settings:
     
         C:\Users\chrife\Documents\GitHub\OCELoT\static
     
     This will overwrite existing files!
     Are you sure you want to do this?
     
     Type 'yes' to continue, or 'no' to cancel:
    

De-anonymise the primary systems for WMT21 news task

Hi

Now that we have a settled set of primary systems for WMT21, can we de-anonymise the primary systems (and only the primary systems)? The leaderboard showing all primary systems should be publicly visible.

best
Barry

A command-line tool/script for making submissions

Is your feature request related to a problem? Please describe.
We would like to have a tool/script for making a new submission from the command line to a specific OCELoT instance.

Describe the solution you'd like

  • A Python script that can be used to make a submission in the same way as it is done via https://ocelot.mteval.org/submit
  • The script could simply make a JSON request to some API exposed by OCELoT
  • The script should not be a Django management command so that it could be installable via setup.py
  • If the submission is successful, a message with BLEU and ChrF scores is displayed; otherwise, an error message is shown

Usage examples (it's only a draft and the script options/usage need a proper design):

# List available test sets from open competitions
./ocelot_submission.py https://ocelot.mteval.org

# Make a submission
./ocelot_submission.py https://ocelot.mteval.org --user team-name --token abcd1234 \
    --submission submission.xml --format xml --test-set 'newstest2020.en-de'
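
A minimal sketch of what such a script could look like, assuming a hypothetical /api/v1/submit endpoint that accepts a multipart POST; the endpoint, field names, and response format are assumptions and would need to match whatever API OCELoT eventually exposes:

    import argparse
    import sys

    import requests  # third-party; pip install requests


    def main():
        parser = argparse.ArgumentParser(description='Submit a system output to an OCELoT instance.')
        parser.add_argument('url', help='Base URL of the OCELoT instance')
        parser.add_argument('--user', required=True, help='Team name')
        parser.add_argument('--token', required=True, help='Submission token')
        parser.add_argument('--submission', required=True, help='Path to the submission file')
        parser.add_argument('--format', default='xml', choices=['xml', 'text'])
        parser.add_argument('--test-set', required=True, help="Test set name, e.g. 'newstest2020.en-de'")
        args = parser.parse_args()

        # NOTE: endpoint and field names below are hypothetical placeholders.
        with open(args.submission, 'rb') as handle:
            response = requests.post(
                f'{args.url}/api/v1/submit',
                data={'user': args.user, 'token': args.token,
                      'test_set': args.test_set, 'file_format': args.format},
                files={'submission': handle},
            )

        if response.ok:
            scores = response.json()
            print('BLEU: {0}, ChrF: {1}'.format(scores.get('bleu'), scores.get('chrf')))
        else:
            print('Submission failed: {0}'.format(response.text), file=sys.stderr)
            sys.exit(1)


    if __name__ == '__main__':
        main()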

The sign-up token is not treated as a password

Changing <input type="text" name="token" required="" id="id_token" maxlength="10"> into type="password" would allow password managers (in browsers) to store the token as a password.
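
On the Django side this could be a one-line widget change on the submission form; a sketch, assuming a form class along these lines (the actual form and field names in OCELoT may differ):

    from django import forms


    class SubmissionForm(forms.Form):  # hypothetical form name
        # Render the token field as <input type="password" ...> so browsers
        # offer to store it in their password manager.
        token = forms.CharField(max_length=10, widget=forms.PasswordInput())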

Submitting compressed submissions

Is your feature request related to a problem? Please describe.
At WMT22 we had a problem with DoS; a likely contributing factor was large test sets that needed to be uploaded. If they were compressed, it could decrease the upload time.

Describe the solution you'd like
Allow submitting compressed files.
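
A sketch of how uploaded files could be transparently decompressed before validation, assuming gzip compression (the set of accepted compression formats is still open):

    import gzip


    def read_submission(uploaded_file):
        """Return the decoded submission text, decompressing gzip uploads if needed."""
        data = uploaded_file.read()
        if data[:2] == b'\x1f\x8b':  # gzip magic number
            data = gzip.decompress(data)
        return data.decode('utf-8')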

Set up system logging

We should set up proper logging in the OCELoT apps and clean up the print calls currently used for logging debug messages.

Logging in development could write to both ocelot.log and the stdout of runserver.
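
A minimal LOGGING configuration for development that writes to both ocelot.log and the console; a sketch only, and the 'ocelot' logger name is an assumption:

    # settings.py (development)
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'handlers': {
            'console': {'class': 'logging.StreamHandler'},
            'file': {'class': 'logging.FileHandler', 'filename': 'ocelot.log'},
        },
        'loggers': {
            'ocelot': {  # assumed logger name used by the OCELoT apps
                'handlers': ['console', 'file'],
                'level': 'DEBUG',
            },
        },
    }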

A tool/script for batch creating test sets

Is your feature request related to a problem? Please describe.
Creating new test sets via the admin panel is time-consuming and potentially error-prone. We could provide a tool for adding multiple new test sets from the command line, given a path to a folder with test sets and probably a manifest file specifying metadata (languages, file format, competition ID/name, whether it is active, whether it is public, etc.).

Describe the solution you'd like
A command-line script for creating new test sets. The details are yet to be designed.
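
One possible shape for such a script, as a Django management command reading a JSON manifest; all paths, field names, and the command name are hypothetical, since the actual design is still open:

    # ocelot_app/management/commands/import_testsets.py  (hypothetical location and name)
    import json
    from pathlib import Path

    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = 'Batch-create test sets from a folder containing a manifest.json file.'

        def add_arguments(self, parser):
            parser.add_argument('folder', help='Folder with test set files and manifest.json')

        def handle(self, *args, **options):
            folder = Path(options['folder'])
            manifest = json.loads((folder / 'manifest.json').read_text())
            for entry in manifest['test_sets']:
                # Each entry is expected to specify name, languages, file format,
                # competition, and active/public flags; creating the actual model
                # instances is left out because the model fields are not final.
                self.stdout.write('Would create test set: {0}'.format(entry['name']))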

Allowing bulk submission of multiple language pairs at once

Is your feature request related to a problem? Please describe.
At WMT22, many teams submitted outputs for several language pairs, which required a manual submission for each language pair. This likely contributed to the DoS problem we faced at WMT22.

Describe the solution you'd like
Allow submitting one ZIP file containing submissions for multiple systems.

Allow submitting before verifying teams

Is your feature request related to a problem? Please describe.
Non-verified teams must wait for verification before they can start submitting, which means they need to visit the page twice instead of doing everything at once.

Describe the solution you'd like
Allow submitting right after registration, but do not show the scores for teams that are not verified yet.
Problem: non-verified teams must not see the scores, to avoid hill-climbing on scores with multiple non-verified accounts.

Secondary/contrastive submissions

Is your feature request related to a problem? Please describe.
Some shared tasks would like to be able to collect not only primary submissions, but also secondary/contrastive ones.

Describe the solution you'd like
A team gets an option to select a secondary submission for each test set, similarly as it's now with primary submissions. An administrator can download a package with all primary and secondary submissions from the admin panel.

Describe alternatives you've considered
None

Additional context
N/A

Unverified team shouldn't be able to submit

The interface makes it look like an unverified team can submit translations; however, the submission is blocked. Maybe they shouldn't see the submission page at all until they are verified.

Pre-verify accounts from last year

Is your feature request related to a problem? Please describe.
Every year, teams must be verified before they can submit. We could use the list of teams from last year and pre-verify them automatically.

Set up default primary system to avoid missing participants

Is your feature request related to a problem? Please describe.
When a participant forgets to set a primary system, they are removed from the evaluation.

Describe the solution you'd like
One suggestion: set the first submission as the primary by default.

Admins should be able to see team names

As an administrator I would like to be able to see the team names when I am browsing the list of submissions.

At the moment, when I go to "Leaderboard -> submissions" I can see all submissions, but they are labelled as "Anonymous #XXX". Being able to see the team names would make it easier to monitor the campaign.

Export results from challenges/campaigns

Add an admin management command for:

  1. exporting all results from a specific campaign,
  2. exporting results for a specific challenge/test set,
  3. exporting primary submissions only.

In addition, support exporting test set outputs by creating an archive in memory and making it available for download.

This is a basic feature for release v1.0.0.
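
A sketch of the in-memory archive part, operating on a hypothetical iterable of submissions (the model and field names 'name' and 'sgml_file' are assumptions):

    import io
    import zipfile


    def build_archive(submissions):
        """Pack submission files into an in-memory ZIP archive for download."""
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
            for submission in submissions:
                # 'name' and 'sgml_file' are hypothetical attribute names.
                archive.writestr('{0}.xml'.format(submission.name), submission.sgml_file.read())
        buffer.seek(0)
        return buffer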

Updating the admin panel to improve collection of information

Is your feature request related to a problem? Please describe.
During WMT22 we ran into several issues that were difficult to tackle:

  • Getting a list of which submissions are constrained and which are not.
  • Having a mapping between submission filenames and team names.
  • Getting a list of abstracts and publication names for teams.
  • Getting a list of teams that submitted to a particular competition (filtering out those who did not submit to General MT, since we share teams).

Describe the solution you'd like
Ideally, have a button that takes everything into consideration (whether a submission is primary, whether it was removed, ...) and exports the data with all metadata. This would also help avoid human errors.

Update requirements to Django 3

Update requirements.txt to Django 3. It currently installs Django 2, but we want to keep Python and libraries as up to date as possible.

Import human evaluation results

Add a management command for importing human rankings for a given test set. Human scores should be displayed as another metric/column in the table with results.

In general, we would like to be able to import results from any metric.
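
A sketch of such an import command, assuming a simple TSV input with system name and score; the command name, input format, and location are all hypothetical:

    # ocelot_app/management/commands/import_human_scores.py  (hypothetical location and name)
    import csv

    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = 'Import human evaluation scores for a given test set from a TSV file.'

        def add_arguments(self, parser):
            parser.add_argument('test_set', help='Test set name')
            parser.add_argument('tsv_file', help='TSV file with columns: system, score')

        def handle(self, *args, **options):
            with open(options['tsv_file'], newline='') as handle:
                for row in csv.DictReader(handle, delimiter='\t'):
                    # Storing the score on the matching submission is left out;
                    # it depends on how metrics are modelled in the database.
                    self.stdout.write('{0}: {1}'.format(row['system'], row['score']))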

Unify two team names for a single account

Is your feature request related to a problem? Please describe.
I am not sure how it happens, but a team can have one name in the OCELoT leaderboard and a different name on the files we use for further evaluation. Could these two team names be unified?

Option to disallow TXT submissions

Is your feature request related to a problem? Please describe.
At WMT22, we ran into issues because the pipeline downstream of submission processing is not well prepared for TXT submissions. It would be great to have an option to forbid everything except XML, or to integrate a converter that exports all submissions from OCELoT in the same format.

Delay in scores calculation

Is your feature request related to a problem? Please describe.
At WMT22 we had a problem with DoS; a likely contributing factor was large test sets that needed to be scored. A solution could be to calculate scores with a delay, when the system has enough resources.

Describe the solution you'd like
Calculate scores with a delay. Possibly calculate only ChrF scores first.

Languages should be sorted alphabetically in the admin panel

Describe the bug
When creating new test sets in the admin panel, the select box with languages appears to sort them by their IDs. This makes finding languages cumbersome; alphabetical sorting would be much more convenient.

To Reproduce
Start adding a new test set through the admin panel.

Expected behavior
Languages in the select box are sorted alphabetically by their names.
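
The fix is likely a one-line ordering change; a sketch, assuming the language names live on a Language model with a name field (Django builds admin select boxes from the related model's default ordering):

    from django.db import models


    class Language(models.Model):  # assumed model; OCELoT's actual model may differ
        name = models.CharField(max_length=100)

        class Meta:
            # Select boxes built from this model follow its default ordering,
            # so languages appear alphabetically by name instead of by ID.
            ordering = ['name']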

When exporting primary systems, I need the team's publication name

Hi

To export the primary submissions, I am using the export function on the submissions page in the admin console. The filename seems to include the run name, but I need to identify the team's publication name. Could it be included in the filename (or in the XML, but the filename is probably easier)?

best
Barry

Allow participants to NOT select system as primary

Is your feature request related to a problem? Please describe.
Currently, participants have to pick a primary system for all language pairs they participated in, but one participant this year wanted to have only some language pairs evaluated (they submitted the others just out of curiosity).

Describe the solution you'd like
Add an option in the drop-down list to NOT select a primary system.

Allow to compare any two systems in a leaderboard table

Describe the solution you'd like
Adopt MTMA19 plan and add support for pairwise comparison of any two systems in a leaderboard table. Based on compare-mt, which may need to be updated to the latest version (ping CMU folks). We may have to implement some precomputation method to speed things up.

Describe alternatives you've considered
MT-CompareEval was considered, but dropped due to the non-availability of an (up-to-date and complete) Python fork.

Additional context
This will make it easier to compare systems, and would allow qualitative analysis of leaderboard systems.

Support various file extensions for test sets and submissions

Currently the only extension supported for text files is .txt and for SGML files it is .sgm, both hard-coded, possibly in several places. This does not make much sense given that there is a separate file_format field in the models working with uploaded files. Do you plan to support various extensions for submissions/test sets in a specific format? If yes, I think relying on file_format would be better.

Edit: Clean this up and allow uploading files with different extensions, relying on the already existing file_format field in the models.

Choosing automatic metrics shown in a leaderboard

Is your feature request related to a problem? Please describe.
WMT22 will feature the sign LT task with links/paths to videos as references and translations. In the current setup, BLEU and ChrF are computed on the URLs/paths, which may not be what we want.

Describe the solution you'd like
Show results of submission validation (successful/unsuccessful) instead of BLEU and ChrF.

Describe alternatives you've considered
In order of increasing implementation effort:

  1. An option for disabling automatic metrics for a competition, or better, a test set.
  2. If an automatic metric is disabled, show in the leaderboard if the submission validation was successful or not.
  3. Allow choosing one or more automatic metrics used for a test set from the predefined set of available metrics (BLEU, ChrF, validated).

Additional context
N/A

Collect metadata about systems for WMT21+

Is your feature request related to a problem? Please describe.

I'm always frustrated when I am aligning the system submissions and the text descriptions ("metadata") that people have submitted through the form like https://forms.gle/4MEq8pG9zn6GTTncA .

I would like OCELoT to collect not only system outputs but also these details, in a way which absolutely clearly records the correspondence between system runs and the metadata.

Describe the solution you'd like

My view of WMT news translation task is this (your view may be slightly different):

  • Each participating institution can form one or more teams.
  • Each team can submit many systems, but one has to be marked as primary.

To report clearly on this in the paper:

  • Each team should report its institution name.
    • The names of institutions are usually clear, but we should still ask for them, esp. when startups and other companies participate.
  • Each team has to suggest its name (e.g. "Donald's Nephews") for the main text and also a short name (e.g. "Donalds") for the result tables.
    • In the result tables, the team short name refers to the primary submission of that team.
  • Each submitted system needs to have a unique name.
    • The name can be implied, e.g. "Donald's Nephews Primary Submission", "Donald's Nephews Contrastive 1", "Donald's Nephews Contrastive 2".
  • Each submitted system needs to receive metadata, i.e. the paragraph for the Findings and indication of the data and system features used.
    • The metadata are most important for the primary submissions. For contrastive submissions, they can be abridged.

When submitting any system run (which is presumably assigned a system ID in OCELoT), the submitter should be redirected to a web form to fill all the metadata. The system ID should be pre-filled in the form, and so should be the team and institution name, if known.

Ideally, the system submission would not be deemed completed if this form was not filled. This could be achieved by a simple export of form results to a URL where OCELoT would check and highlight the submission green if the system ID had any entry there.

I suggest using Google Forms to collect the metadata, because they offer the flexibility we need across years. Also, they easily offer pre-filled URLs, e.g.:

https://docs.google.com/forms/d/e/1FAIpQLSeQ-R9DAiY56WGmzInggusVQXQDx8YvZnETWn26DEh5RkqQ0w/viewform?usp=pp_url&entry.1616043926=Ondrej+prefilled+his+name&entry.63843270=My+prefilled+system+name

Describe alternatives you've considered

We do it as before: circulate the form around, then frantically chase people to answer, and fully make up the few entries for which the metadata never arrives.

Update sacreBLEU and improve how it is called

sacreBLEU versions newer than 1.4 do not work with OCELoT because the sacreBLEU API has changed. We should update it and improve the way it is called in OCELoT to make future updates easier. For example, we could write a simple wrapper for computing BLEU and ChrF outside the Django models.
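
A minimal sketch of such a wrapper using the sacreBLEU 2.x metrics API, kept outside the Django models; the function name and return format are placeholders:

    from sacrebleu.metrics import BLEU, CHRF


    def compute_scores(hypotheses, references):
        """Compute corpus-level BLEU and ChrF for a single reference set.

        hypotheses: list of translated segments
        references: list of reference segments (single reference)
        """
        bleu = BLEU()
        chrf = CHRF()
        return {
            'bleu': bleu.corpus_score(hypotheses, [references]).score,
            'chrf': chrf.corpus_score(hypotheses, [references]).score,
        }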

Competition specific rules and updates

Is your feature request related to a problem? Please describe.
Currently, all rules and updates are global and can only be modified by changing an HTML file. We would like to have rules and updates specific to competitions, ideally configurable via the admin panel.

Describe the solution you'd like

  • A new model referring to the competition object, with fields accepting text with HTML (Markdown?)
  • Placeholders in the admin panel should provide an example with the recommended HTML format.
  • Updates and rules are shown per competition, possibly on a new page specific to the competition (what URL?)

This should be a separate model with a foreign key to the competition object, because this provides selective access for users who are only moderators and should not have access to other DB tables in the admin panel (for free, thanks to the Django admin extension).
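
A sketch of what such a model could look like, assuming a Competition model already exists; the model and field names below are hypothetical and open for discussion:

    from django.db import models


    class CompetitionUpdate(models.Model):  # hypothetical model name
        competition = models.ForeignKey(
            'Competition', on_delete=models.CASCADE, related_name='updates'
        )
        title = models.CharField(max_length=200)
        text = models.TextField(help_text='HTML is allowed, e.g. <p>...</p>')
        created = models.DateTimeField(auto_now_add=True)

        class Meta:
            ordering = ['-created']  # newest updates first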

Additional context
We would also like to have post-competition surveys that are configurable per competition. This needs more thought, as the current need for WMT21 was having surveys per test set (i.e. per primary submission), not per competition, which is more complicated.

Remove/rename the abstract field in OCELoT

Is your feature request related to a problem? Please describe.
There is an abstract field in OCELoT that is easily confused with the paper abstract we ask participants to submit. This field could be removed or at least renamed to "description" to avoid confusion (last year one participant was removed due to this).
Is this field used at all? Maybe for a short system description in the findings paper?
