Giter VIP home page Giter VIP logo

evaluation's Introduction

BigScience Evaluation

Code and data for the BigScience Evaluation WG.

Upcoming Milestones for Contributors

  • September 1, 2021: Eval Engineering Subgroup release toy tasks/dummy code to define API
  • September 1, 2021: New task-based subgroups established and begin work
  • October 1, 2021: Finalize GitHub with all data and scripts for generating raw evaluation results
  • October 15, 2021: General meeting to discuss longer research project proposals for fall/spring
  • October 15, 2021: Form subgroup on data presentation/visualization to create final report card

Quickstart

To benchmark a baseline GPT-2 model with WMT and TyDiQA datasets on GPU, run

python3 -m evaluation.eval \
    --model_name_or_path gpt2 \
    --eval_tasks wmt tydiqa_secondary \
    --device cuda \
    --output_dir outputs

Note: For toxicity dataset, you have to download the dataset manually from Kaggle here and also pass the data_dir argument to the folder.

Setup

  1. Create virtual environment (one-time).

    python3 -m venv venv # create a virtual environment called 'venv'
  2. Activate the virtual environment.

    source venv/bin/activate
  3. Install package requirements.

    python3 -m pip install -r requirements.txt
    python3 -m pip install -r requirements-dev.txt

Tasks

This project plans to support all datasets listed under docs/datasets.md. The sections below detail task-independent inner-workings of this repository.

AutoTask

Every task/dataset lives as a submodule within evaluation.tasks. The core of these submodules inherit from evaluation.tasks.auto_task.AutoTask, which is a base class that houses all abstract functions, as well has holds model, tokenizer, and task_config as its attributes.

AutoTask makes it incredibly easy to load any dataset for a benchmark. The basic signature is

task = AutoTask.from_task_name(
    "task_name", model, tokenizer, device, english_only
)

Alternatively, if the model has to be recreated for each task, a task object can be created from string specifications.

task = AutoTask.from_spec(
    "task_name", 
    "model_name_or_path", 
    "tokenizer_name",
    device,
    english_only,
    data_dir: Optional
)

Evaluation

Every AutoTask subclass has a .evaluate() function wherein all evaluation logic resides, i.e. loading the dataset (and the dataloader, if necessary), and computing reporting metrics. At the end of the evaluation, metrics are saved as a class attribute in task.metrics. For more details on the full pipeline, refer to the main evaluation script, evaluation/eval.py.

Contributing

Refer to CONTRIBUTING.md.

evaluation's People

Contributors

debajyotidatta avatar epavlick avatar jaketae avatar jankalo avatar jordiclive avatar meg-huggingface avatar ryanzhumich avatar sebastiangehrmann avatar tianjianjiang avatar trishalaneeraj avatar tttyuntian avatar wilsonyhlee avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.