BigScience Evaluation

Code and data for the BigScience Evaluation WG.

Upcoming Milestones for Contributors

  • September 1, 2021: Eval Engineering Subgroup release toy tasks/dummy code to define API
  • September 1, 2021: New task-based subgroups established and begin work
  • October 1, 2021: Finalize GitHub with all data and scripts for generating raw evaluation results
  • October 15, 2021: General meeting to discuss longer research project proposals for fall/spring
  • October 15, 2021: Form subgroup on data presentation/visualization to create final report card

Quickstart

To benchmark a baseline GPT-2 model on the WMT and TyDiQA datasets on a GPU, run

python3 -m evaluation.eval \
    --model_name_or_path gpt2 \
    --eval_tasks wmt tydiqa_secondary \
    --device cuda \
    --output_dir outputs

Note: For the toxicity task, you must download the dataset manually from Kaggle and pass the path to the downloaded folder via the data_dir argument.
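
For example, a hypothetical invocation might look like the following, assuming the Kaggle archive was extracted to data/toxicity, that the task is selected as toxicity, and that eval.py exposes data_dir as a --data_dir flag (none of these names are confirmed by this README):

# hypothetical invocation; the task name and the --data_dir flag are assumptions
python3 -m evaluation.eval \
    --model_name_or_path gpt2 \
    --eval_tasks toxicity \
    --device cuda \
    --output_dir outputs \
    --data_dir data/toxicity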

Setup

  1. Create a virtual environment (one-time).

    python3 -m venv venv # create a virtual environment called 'venv'
  2. Activate the virtual environment.

    source venv/bin/activate
  3. Install package requirements.

    python3 -m pip install -r requirements.txt
    python3 -m pip install -r requirements-dev.txt

Tasks

This project plans to support all datasets listed in docs/datasets.md. The sections below detail the task-independent inner workings of this repository.

AutoTask

Every task/dataset lives as a submodule within evaluation.tasks. Each of these submodules inherits from evaluation.tasks.auto_task.AutoTask, a base class that declares the abstract functions and holds the model, tokenizer, and task_config as attributes.

AutoTask makes it straightforward to load any dataset for a benchmark. The basic signature is

task = AutoTask.from_task_name(
    "task_name", model, tokenizer, device, english_only
)

Alternatively, if the model has to be recreated for each task, a task object can be created from string specifications.

task = AutoTask.from_spec(
    "task_name",
    "model_name_or_path",
    "tokenizer_name",
    device,
    english_only,
    data_dir=data_dir,  # optional
)

Evaluation

Every AutoTask subclass has an .evaluate() method in which all evaluation logic resides, i.e., loading the dataset (and building a dataloader, if necessary) and computing the reported metrics. At the end of evaluation, metrics are saved as a class attribute in task.metrics. For more details on the full pipeline, refer to the main evaluation script, evaluation/eval.py.
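
As a rough illustration of how the pieces fit together, a new task submodule might look like the sketch below. Only evaluate() and the model/tokenizer/metrics attributes come from the description above; the class name, toy inputs, device handling, and the metric itself are illustrative assumptions, not code from this repository.

# Hypothetical task sketch; everything dataset-specific here is an assumption.
import torch

from evaluation.tasks.auto_task import AutoTask


class ToyLengthTask(AutoTask):
    def evaluate(self) -> None:
        prompts = ["The quick brown fox", "BigScience is"]  # toy inputs
        lengths = []
        for prompt in prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                output = self.model.generate(
                    **inputs.to(self.model.device), max_new_tokens=5
                )
            lengths.append(output.shape[-1])
        # Metrics are stored on the instance, as described above.
        self.metrics = {"mean_output_length": sum(lengths) / len(lengths)}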

Contributing

Refer to CONTRIBUTING.md.

Contributors

debajyotidatta, epavlick, jaketae, jankalo, jordiclive, meg-huggingface, mmitchellai, ryanzhumich, sebastiangehrmann, tianjianjiang, trishalaneeraj, tttyuntian, wilsonyhlee

Issues

Wrap evaluation benchmark using HF-trainer

This might sound like a bit of restructuring, but for the sake of future compatibility, I propose the following:

  1. Move to the Hugging Face Trainer: this will let the repo automatically adapt to DeepSpeed and the other exclusive features of the transformers library.
  2. We don't have to reinvent the wheel. Given that we are using the Hugging Face Trainer, we only need to implement the following for each task (a rough sketch is included at the end of this issue):
    -- data_loader
    -- DataCollator
    -- compute_metrics
    -- predictions (if needed)
  3. If we want to fine-tune our full model, we won't have to change much at the surface level.

I would love to take some responsibility if needed. Let me know. @jaketae @tianjianjiang @wilsonyhlee
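
A minimal sketch of what this could look like for one classification-style task is below; the model and tokenizer variables and the build_eval_dataset helper are assumptions, not existing repo code, and the real wiring would differ per task.

# Hypothetical sketch: evaluating one task via the Hugging Face Trainer.
# `model`, `tokenizer`, and `build_eval_dataset` are assumed to exist elsewhere.
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


trainer = Trainer(
    model=model,  # e.g. a sequence classification model
    args=TrainingArguments(output_dir="outputs", per_device_eval_batch_size=8),
    eval_dataset=build_eval_dataset(),  # task-specific dataset (assumed helper)
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
metrics = trainer.evaluate()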

Add MNLI to Full Benchmark

Coordinate with whoever is working on SuperGLUE; we only need to include MNLI once. But NLI will be held out from model training (whereas the other SuperGLUE tasks will not be), so interpreting MNLI results differs from interpreting the other SuperGLUE tasks.

Use it to test generalization to an unseen task; maybe use FLEX?

Setup testing

#56 set up a basic unit test, but we have to consider what kinds of tests we want to run. This is especially important given that GitHub workflows do not have GPU support and will thus take a non-trivial amount of time to complete even a simple benchmark run. The proposal is to ideate ways in which we could make tests modular and reasonably fast.
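
One direction is to run a single cheap task on CPU with a very small checkpoint, sketched below. The tiny model id, the chosen task name, and passing the from_spec arguments positionally are all illustrative assumptions, not settled choices.

# Hypothetical fast, CPU-only test; the tiny checkpoint and task name are assumptions.
import pytest

from evaluation.tasks.auto_task import AutoTask

TINY_MODEL = "sshleifer/tiny-gpt2"  # assumed small checkpoint to keep CI fast


@pytest.mark.parametrize("task_name", ["tydiqa_secondary"])
def test_task_runs_on_cpu(task_name):
    task = AutoTask.from_spec(
        task_name,
        TINY_MODEL,  # model_name_or_path
        TINY_MODEL,  # tokenizer_name
        "cpu",       # device
        True,        # english_only
    )
    task.evaluate()
    assert isinstance(task.metrics, dict) and len(task.metrics) > 0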

Set code conventions

As the repository gets larger, we will eventually need to decide on a code convention. Though preferences may vary, it's probably safe to stick to the setup in transformers, which is black + isort.

We could create a simplified Makefile and do something like

.PHONY: style

style:
	black .
	isort --profile black .

Alternatively, create a .pre-commit-config.yaml.

repos:
  - repo: https://github.com/psf/black
    rev: 21.7b0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.9.3
    hooks:
      - id: isort
        args: ["--profile", "black"]
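
With either approach, contributors would run the formatter once before pushing (make style), or, for the pre-commit route, run pre-commit install once so the hooks fire automatically on every commit.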
