FIB

This repository contains the code for "Evaluating the Factual Consistency of Large Language Models Through Summarization"

FIB Benchmark

The dataset is now on HuggingFace 🤗. Note that the multiple-choice accuracy is computed in a slightly different way in our work. See below for more details.

Evaluating Models

Setup

  1. Create a virtual environment and activate it.
python3 -m venv env
source env/bin/activate
  2. Install dependencies.
python -m pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
  3. Set environment variables. (This step has to be done every session.)
source bin/setup.sh

Running Models

The following command is used to evaluate models:

python src/evaluate_mulChoice.py -f {multiple_choice-dataset_filepath} -m {model}

For example,

python src/evaluate_mulChoice.py -f multiple_choice-dataset/xsum/fib/binary_choice-using_bart-base_distractors.jsonl -m facebook/opt-1.3b
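To evaluate several models on the same file, the command above can be generated in a loop. The following is a minimal sketch, not part of the repository: `build_commands` is a hypothetical helper, and the second model name is only an illustrative choice from the OPT family.

```python
import shlex


def build_commands(dataset_filepath, models):
    """Return one evaluate_mulChoice.py argument list per model (hypothetical helper)."""
    return [
        ["python", "src/evaluate_mulChoice.py", "-f", dataset_filepath, "-m", model]
        for model in models
    ]


commands = build_commands(
    "multiple_choice-dataset/xsum/fib/binary_choice-using_bart-base_distractors.jsonl",
    ["facebook/opt-1.3b", "facebook/opt-2.7b"],  # second model is an illustrative assumption
)

# Print the commands so they can be inspected before anything is run.
for command in commands:
    print(shlex.join(command))
```

Each argument list can then be passed to `subprocess.run(command, check=True)` from the repository root, after the setup steps have been completed in the current session.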

Our code has only been tested on evaluating models from the BLOOM, OPT, GPT, and T5 families.

Note that although DeepSpeed support is implemented, we did not use it in our experiments, so the DeepSpeed code path is untested and might have bugs.

Get Results

The following command is used to gather multiple results and get the median score:

python src/scripts/get_results.py -e {all_experiment_directories_of_datasets} -m {list_models}

For example,

python src/scripts/get_results.py -e exp_out/multiple_choice/xsum/fib/* -m bigscience-T0_3B

Evaluating Models on FIB

The FIB dataset released above differs from the evaluation here in two ways:

  • Here, we take the median accuracy of the model across 3 prompts for each distractor model used. Then, we take a weighted average of the median accuracies across the different distractor models.
  • In the FIB dataset, we combine all the examples from every distractor model, across both XSum and CNN/DM, into one file for simplicity. Users can use any prompt they want.

The following commands run the evaluation used here:

python src/evaluate_mulChoice.py -f multiple_choice-dataset/{dataset}/fib/binary_choice-* -m {model}
python src/compute_fib_results.py -m {model} -d {dataset}
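The aggregation described above (median over prompts, then a weighted average over distractor models) can be sketched as follows. This is an illustration only, not the code in `src/compute_fib_results.py`. It assumes the weight for each distractor model is its number of examples, which the text above does not specify. The function name, distractor names, and all numbers are made up.

```python
from statistics import median


def fib_score(accuracies, num_examples):
    """Median accuracy over prompts per distractor model, then a weighted average.

    accuracies:   {distractor_model: [accuracy for each prompt]}
    num_examples: {distractor_model: example count}, used as weights (an assumption)
    """
    medians = {name: median(accs) for name, accs in accuracies.items()}
    total = sum(num_examples[name] for name in medians)
    return sum(medians[name] * num_examples[name] for name in medians) / total


# Made-up accuracies for 3 prompts and two hypothetical distractor models.
accuracies = {
    "distractor_a": [0.60, 0.70, 0.65],
    "distractor_b": [0.50, 0.55, 0.45],
}
num_examples = {"distractor_a": 100, "distractor_b": 300}

score = fib_score(accuracies, num_examples)
print(round(score, 4))
```

Taking the median over prompts first means a single unusually good or bad prompt does not dominate the score for a distractor model.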

Other Binary Multiple-Choice Datasets

The datasets are under multiple_choice-dataset/xsum and multiple_choice-dataset/cnn_dm for XSum and CNN/DM, respectively.

The alternative choices include:

  1. FIB - our benchmark of factually inconsistent model-generated summaries
  2. FactCC
  3. MFMA
  4. FIR - factually inconsistent reference summaries (i.e., reference summaries from XSum or CNN/DM that were annotated as factually inconsistent)
  5. Factually consistent model-generated summaries

Each example is a JSON object with the following keys: {id, input, correct_choice, list_choices, lbl}
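A minimal sketch for reading such a file and checking that every example has the keys listed above. `load_examples` is a hypothetical helper, not part of the repository, and the sample values (including their types) are placeholders, since only the key names are documented here.

```python
import json

EXPECTED_KEYS = {"id", "input", "correct_choice", "list_choices", "lbl"}


def load_examples(lines):
    """Parse JSON Lines text and check that each example has the expected keys."""
    examples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        missing = EXPECTED_KEYS - example.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        examples.append(example)
    return examples


# Placeholder example; the values and their types are assumptions.
sample_line = json.dumps({
    "id": "0",
    "input": "document text",
    "correct_choice": "summary A",
    "list_choices": ["summary A", "summary B"],
    "lbl": 0,
})

examples = load_examples([sample_line])
print(sorted(examples[0].keys()))
```

To read a real file, pass an open file handle, for example `with open(path) as f: examples = load_examples(f)`.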

Citation

If you find this repo helpful, please cite our work:

@article{tam2022fib,
  title={Evaluating the Factual Consistency of Large Language Models Through Summarization},
  author={Tam, Derek and Mascarenhas, Anisha and Zhang, Shiyue and Kwan, Sarah and Bansal, Mohit and Raffel, Colin},
  journal={arXiv preprint arXiv:2211.08412},
  year={2022}
}

We use code from the following works:

@inproceedings{kryscinski-etal-2020-evaluating,
    title = "Evaluating the Factual Consistency of Abstractive Text Summarization",
    author = "Kryscinski, Wojciech  and
      McCann, Bryan  and
      Xiong, Caiming  and
      Socher, Richard",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.750",
    doi = "10.18653/v1/2020.emnlp-main.750",
    pages = "9332--9346",
}

@inproceedings{lee-etal-2022-masked,
    title = "Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking",
    author = "Lee, Hwanhee  and
      Yoo, Kang Min  and
      Park, Joonsuk  and
      Lee, Hwaran  and
      Jung, Kyomin",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.76",
    doi = "10.18653/v1/2022.findings-naacl.76",
    pages = "1019--1030",
}
