FIB

This repository contains the code for "Evaluating the Factual Consistency of Large Language Models Through Summarization"

FIB Benchmark

The dataset is now on HuggingFace 🤗. Note that the multiple-choice accuracy is computed in a slightly different way in our work. See "Evaluating Models on FIB" below for more details.

Evaluating Models

Setup

Create a virtual environment and activate it.

python3 -m venv env
source env/bin/activate

Install dependencies

python -m pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html

Set environment variables. (This step has to be done every session.)

source bin/setup.sh

Running Models

The following command is used to evaluate models:

python src/evaluate_mulChoice.py -f {multiple_choice-dataset_filepath} -m {model}

For example,

python src/evaluate_mulChoice.py -f multiple_choice-dataset/xsum/fib/binary_choice-using_bart-base_distractors.jsonl -m facebook/opt-1.3b

Our code has only been tested with models from the BLOOM, OPT, GPT, and T5 families.

Note that although DeepSpeed support is implemented, we did not use it, so it may contain bugs.

Get Results

The following command is used to gather multiple results and get the median score:

python src/scripts/get_results.py -e {all_experiment_directories_of_datasets} -m {list_models}

For example,

python src/scripts/get_results.py -f exp_out/multiple_choice/xsum/fib/* -m bigscience-T0_3B

Evaluating Models on FIB

The FIB dataset released above and the evaluation here differ as follows:

Here, we take the median accuracy of the model across 3 prompts for each distractor model used. Then, we take a weighted average of the median accuracies across the different distractor models.
In the FIB dataset, we combine all the examples from each distractor model, across both XSum and CNN/DM, into one file for simplicity. Users can use any prompt they want.

The following commands run the evaluation as done in our work:

python src/evaluate_mulChoice.py -f multiple_choice-dataset/{dataset}/fib/binary_choice-* -m {model}
python src/compute_fib_results.py -m {model} -d {dataset}
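
The aggregation described in this section (median over prompts, then a weighted average over distractor models) can be illustrated with a minimal sketch. This is not the code in src/compute_fib_results.py. The function name fib_score, the distractor model name t5-large, all accuracy values, and the choice to weight each distractor model by its number of examples are assumptions made for illustration only.

```python
from statistics import median


def fib_score(accuracies, num_examples):
    """Median over prompts per distractor model, then a weighted average.

    accuracies:   {distractor_model: [accuracy for each prompt]}
    num_examples: {distractor_model: number of examples}
                  (assumed weighting; the README does not state the weights)
    """
    total = sum(num_examples[model] for model in accuracies)
    weighted_sum = 0.0
    for model, per_prompt in accuracies.items():
        # Median accuracy across the prompts for this distractor model.
        weighted_sum += median(per_prompt) * num_examples[model]
    return weighted_sum / total


# Illustrative values only; these are not results from the paper.
accuracies = {
    "bart-base": [0.60, 0.70, 0.80],  # 3 prompts
    "t5-large": [0.50, 0.40, 0.90],   # hypothetical distractor model
}
num_examples = {"bart-base": 300, "t5-large": 100}

score = fib_score(accuracies, num_examples)
print(f"weighted average of median accuracies: {score:.4f}")
```

Using the median rather than the mean keeps a single unusually good or bad prompt from dominating the score for a distractor model.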

Other Binary Multiple-Choice Datasets

The datasets are under multiple_choice-dataset/xsum and multiple_choice-dataset/cnn_dm for XSum and CNN/DM, respectively.

The sources of alternative choices include:

FIB - our benchmark of factually inconsistent model-generated summaries
FactCC
MFMA
FIR - factually inconsistent reference summaries (i.e., reference summaries from XSum or CNN/DM that were annotated as factually inconsistent)
Factually consistent model-generated summaries

Each example is a JSON object with the following keys: {id, input, correct_choice, list_choices, lbl}
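
As a rough sketch of how such a file might be read, the snippet below parses JSON Lines input and checks that every example has the keys listed above. The helper load_examples and the sample values are hypothetical and not part of this repo. The snippet assumes one JSON object per line, as the .jsonl extension suggests, and makes no assumption about what each key contains.

```python
import json

# Keys listed in this README for each example.
REQUIRED_KEYS = {"id", "input", "correct_choice", "list_choices", "lbl"}


def load_examples(lines):
    """Parse an iterable of JSON Lines strings and validate the keys."""
    examples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise ValueError(
                f"line {line_number} is missing keys: {sorted(missing)}"
            )
        examples.append(example)
    return examples


# Placeholder example; the values are invented for illustration.
sample_lines = [
    json.dumps({
        "id": "example-0",
        "input": "A document to summarize.",
        "correct_choice": "A consistent summary.",
        "list_choices": ["A consistent summary.", "An inconsistent summary."],
        "lbl": 0,
    })
]

examples = load_examples(sample_lines)
print(len(examples), sorted(examples[0].keys()))
```

To read a real file, pass an open file handle instead of the sample list, for example `with open(path) as f: examples = load_examples(f)`.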

Citation

If you find this repo helpful, please cite our work:

@article{tam2022fib,
  title={Evaluating the Factual Consistency of Large Language Models Through Summarization},
  author={Tam, Derek and Mascarenhas, Anisha and Zhang, Shiyue and Kwan, Sarah and Bansal, Mohit and Raffel, Colin},
  journal={arXiv preprint arXiv:2211.08412},
  year={2022}
}

We use code from the following works:

@inproceedings{kryscinski-etal-2020-evaluating,
    title = "Evaluating the Factual Consistency of Abstractive Text Summarization",
    author = "Kryscinski, Wojciech  and
      McCann, Bryan  and
      Xiong, Caiming  and
      Socher, Richard",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.750",
    doi = "10.18653/v1/2020.emnlp-main.750",
    pages = "9332--9346",
}

@inproceedings{lee-etal-2022-masked,
    title = "Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking",
    author = "Lee, Hwanhee  and
      Yoo, Kang Min  and
      Park, Joonsuk  and
      Lee, Hwaran  and
      Jung, Kyomin",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.76",
    doi = "10.18653/v1/2022.findings-naacl.76",
    pages = "1019--1030",
}
