
ARQMath-eval

This repository contains code for evaluating your system runs from the ARQMath competitions.

Description

Tasks

This repository evaluates the performance of your information retrieval system on the following tasks:

  • task1 – Use this task to evaluate your ARQMath Task 1 (Answer Retrieval) system, and
  • task2 – Use this task to evaluate your ARQMath Task 2 (Formula Retrieval) system.

Subsets

Each task comes with three subsets:

  • train – The training set, which you can use for supervised training of your system.
  • validation – The validation set, which you can use to compare the performance of your system with different parameters. The validation set is used to compute the leaderboards in this repository.
  • test – The test set, which you currently should not use at all. It will be used at the end to compare the systems that performed best on the validation set.

The task1 and task2 tasks also come with the all subset, which contains all relevance judgements. Use it to evaluate a system that has not been trained on any subset of these tasks.

The task1 and task2 tasks also come with a different subset split, which was used by the MIRMU and MSM teams in their ARQMath-2 competition submissions and which is also used in the pv211-utils library:

  • train-pv211-utils – The training set, which you can use for supervised training of your system.
  • validation-pv211-utils – The validation set, which you can use for hyperparameter optimization or model selection.

The training set is further split into the smaller-train-pv211-utils and smaller-validation-pv211-utils subsets in case you need two validation sets, e.g. one for hyperparameter optimization and one for model selection. If you use neither hyperparameter optimization nor model selection, you can use the bigger-train-pv211-utils subset, which combines the train-pv211-utils and validation-pv211-utils subsets.

  • test-pv211-utils – The test set, which you currently should only use for the final performance estimation of your system.
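As a concrete illustration of how these splits nest, the following sketch builds them from a list of topic IDs. The topic IDs and split sizes here are invented for illustration only; they are not the real ones.

```python
# Hypothetical illustration of how the pv211-utils splits nest.
# Topic IDs and split sizes are invented for this sketch.
topics = ['A.{}'.format(i) for i in range(1, 11)]  # ten imaginary topic IDs

splits = {
    'smaller-train-pv211-utils': topics[:4],
    'smaller-validation-pv211-utils': topics[4:6],
    'validation-pv211-utils': topics[6:8],
    'test-pv211-utils': topics[8:],
}
# The training set is the union of the two smaller sets:
splits['train-pv211-utils'] = (
    splits['smaller-train-pv211-utils'] + splits['smaller-validation-pv211-utils']
)
# The bigger training set folds the validation set back into training:
splits['bigger-train-pv211-utils'] = (
    splits['train-pv211-utils'] + splits['validation-pv211-utils']
)

# The test set never overlaps any training data:
assert set(splits['bigger-train-pv211-utils']).isdisjoint(splits['test-pv211-utils'])
```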

Examples

Using the train subset to train your supervised system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval.git
$ python
>>> from arqmath_eval import get_topics, get_judged_documents, get_ndcg
>>>
>>> task = 'task1'
>>> subset = 'train'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)  # your own scoring function
...         results[topic][document] = similarity_score
...
>>> get_ndcg(results, task=task, subset=subset, topn=1000)
0.5876
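For intuition, NDCG compares the discounted cumulative gain of your ranking against that of the ideal ranking of the same documents. The following is a rough self-contained sketch of plain NDCG for a single topic; it is not the library's exact NDCG' implementation, and the relevance grades are invented.

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# An imaginary ranking whose five documents have relevance grades 3, 2, 3, 0, 1:
print(round(ndcg([3, 2, 3, 0, 1]), 4))
```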

Using the validation subset to compare various parameters of your system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval.git
$ python
>>> from arqmath_eval import get_topics, get_judged_documents
>>>
>>> task = 'task1'
>>> subset = 'validation'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)  # your own scoring function
...         results[topic][document] = similarity_score
...
>>> user = 'xnovot32'
>>> description = 'parameter1=value_parameter2=value'
>>> filename = '{}/{}/{}.tsv'.format(task, user, description)
>>> with open(filename, 'wt') as f:
...     for topic, documents in results.items():
...         top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
...         for rank, (document, score) in enumerate(top_documents):
...             line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, score)
...             print(line, file=f)
$ git add task1/xnovot32/parameter1=value_parameter2=value.tsv  # track your new result with Git
$ python -m arqmath_eval.evaluate          # run the evaluation
$ git add -u                               # add the updated leaderboard to Git
$ git push                                 # publish your new result and the updated leaderboard
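Each line written above follows a TREC-style run format: topic ID, a constant placeholder, document ID, rank, score, and a run tag, separated by tabs. Before committing a result file, you could sanity-check its lines with a small helper such as the following; the helper name and the sample line are illustrative, not part of the library.

```python
def check_run_line(line):
    """Check that a run-file line has the six tab-separated TREC-style fields."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 6:
        raise ValueError('expected 6 fields, got {}'.format(len(fields)))
    topic, _, document, rank, score, _ = fields
    return topic, document, int(rank), float(score)  # rank and score must be numeric

# An illustrative line in the format produced by the example above:
print(check_run_line('A.1\txxx\tdoc123\t1\t0.87\txxx'))
```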

Using the all subset to compute the NDCG' score of an ARQMath submission

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval.git
$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all 2020
0.238, 95% CI: [0.198; 0.278]
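The 95% confidence interval is computed over per-topic scores. One common way to obtain such an interval is bootstrap resampling of the per-topic scores; the following sketch illustrates the idea, though it is not necessarily the library's exact method, and the per-topic scores are invented.

```python
import random

def bootstrap_ci(scores, n_resamples=10000, alpha=0.05, seed=42):
    """95% bootstrap confidence interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Imaginary per-topic NDCG' scores:
scores = [0.1, 0.3, 0.2, 0.4, 0.25, 0.15, 0.35, 0.2]
print(bootstrap_ci(scores))
```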

Citing ARQMath-eval

Text

NOVOTNÝ, Vít, Petr SOJKA, Michal ŠTEFÁNIK and Dávid LUPTÁK. Three is Better than One: Ensembling Math Information Retrieval Systems. CEUR Workshop Proceedings. Thessaloniki, Greece: M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen., 2020, vol. 2020, No 2696, p. 1-30. ISSN 1613-0073.

BibTeX

@inproceedings{mir:mirmuARQMath2020,
  title = {{Three is Better than One}},
  author = {V\'{i}t Novotn\'{y} and Petr Sojka and Michal \v{S}tef\'{a}nik and D\'{a}vid Lupt\'{a}k},
  booktitle = {CEUR Workshop Proceedings: ARQMath task at CLEF conference},
  publisher = {CEUR-WS},
  address = {Thessaloniki, Greece},
  date = {22--25 September, 2020},
  year = 2020,
  volume = 2696,
  pages = {1--30},
  url = {http://ceur-ws.org/Vol-2696/paper_235.pdf},
}


