MIRAGE Benchmark

Welcome to the GitHub page of the MIRAGE (Medical Information Retrieval-Augmented Generation Evaluation) benchmark! This repository contains the benchmark dataset and the benchmark results for evaluating Retrieval-Augmented Generation (RAG) systems on medical question answering (QA). We use the MedRAG toolkit to evaluate existing options for each component of a RAG system on MIRAGE.

The benchmark data is stored as benchmark.json in this repo, which can also be downloaded from Google Drive.

The IDs of the top 10k snippets retrieved for each task in MIRAGE by every retriever in MedRAG can be downloaded with:

wget -O retrieved_snippets_10k.zip https://virginia.box.com/shared/static/cxq17th6eisl2pn04vp0x723zczlvlzc.zip

Preprint Homepage Leaderboard

Introduction

To make the evaluation realistic, MIRAGE adopts four key evaluation settings:

Zero-Shot Learning (ZSL): Input QA systems are evaluated in a zero-shot setting where in-context few-shot learning is not permitted.

Multi-Choice Evaluation (MCE): Systems are evaluated on multiple-choice questions.

Retrieval-Augmented Generation (RAG): Input systems should perform retrieval-augmented generation, that is, they need to collect external information to generate accurate and reliable answers.

Question-Only Retrieval (QOR): To align with real-world cases of medical QA, answer options should not be provided as input during the retrieval.
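The Question-Only Retrieval setting can be illustrated with a small sketch. The helper names (`build_retrieval_query`, `retrieve`), the toy token-overlap ranking, and the three-snippet corpus are all assumptions made for illustration; they are not part of MIRAGE or MedRAG, whose retrievers are far more capable. The point is only that the retrieval query is built from the question stem and never sees the answer options.

```python
import re


def build_retrieval_query(item: dict) -> str:
    """Question-Only Retrieval: use the question stem, never the options."""
    return item["question"].strip()


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy retriever (hypothetical): rank snippets by tokens shared with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda snippet: len(query_tokens & tokenize(snippet)),
        reverse=True,
    )
    return ranked[:k]


item = {
    "question": "A lesion causing compression of the facial nerve at the "
                "stylomastoid foramen will cause ipsilateral",
    "options": {
        "A": "paralysis of the facial muscles.",
        "B": "paralysis of the facial muscles and loss of taste.",
    },
    "answer": "A",
}

# Illustrative corpus, not taken from any MIRAGE or MedRAG resource.
corpus = [
    "The facial nerve exits the skull through the stylomastoid foramen.",
    "Insulin is secreted by pancreatic beta cells.",
    "Loss of taste can follow damage to the chorda tympani.",
]

query = build_retrieval_query(item)
top_snippets = retrieve(query, corpus, k=1)
print(top_snippets[0])
```

Because the options are only attached later, at generation time, a snippet cannot be ranked highly merely because it echoes the wording of one of the candidate answers.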

Dataset

The following figure presents an overview of MIRAGE. MIRAGE contains five commonly used medical QA datasets for evaluating RAG systems: three medical examination QA datasets and two biomedical research QA datasets:

  • MMLU-Med: A medical examination QA dataset with 1089 questions. A subset of six biomedicine-related tasks is selected from MMLU: anatomy, clinical knowledge, professional medicine, human genetics, college medicine, and college biology.
  • MedQA-US: A medical examination QA dataset. We focus on the real-world English subset in MedQA with questions from the US Medical Licensing Examination (MedQA-US), including 1273 four-option test samples.
  • MedMCQA: A medical examination QA dataset. We chose the dev set of the original MedMCQA, which includes 4183 medical questions from Indian medical entrance exams.
  • PubMedQA*: A biomedical research QA dataset. We build PubMedQA* by removing given contexts in the 500 expert-annotated test samples of PubMedQA. The possible answer to a PubMedQA* question can be yes/no/maybe, reflecting the authenticity of the question statement based on scientific literature.
  • BioASQ-Y/N: A biomedical research QA dataset. We select the Yes/No questions in the ground truth test set of BioASQ Task B from the most recent five years (2019-2023), including 618 questions in total. The ground truth snippets are removed in this benchmark.

[Figure: overview of the MIRAGE benchmark]

Statistics of datasets in MIRAGE are shown below:

Dataset      Size    #O.   Avg. L   Source
MMLU-Med     1,089   4     63       Examination
MedQA-US     1,273   4     177      Examination
MedMCQA      4,183   4     26       Examination
PubMedQA*    500     3     24       Literature
BioASQ-Y/N   618     2     17       Literature

(#O.: number of options; Avg. L: average number of tokens per question.)
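As a quick sanity check on these statistics, the sketch below totals the questions and derives the accuracy a uniform random guesser would be expected to reach on each dataset. The sizes and option counts are copied from the statistics above; the chance-level baseline is an illustrative calculation and is not a number reported in this repository.

```python
# (size, number of options) per dataset, copied from the statistics above.
stats = {
    "MMLU-Med": (1089, 4),
    "MedQA-US": (1273, 4),
    "MedMCQA": (4183, 4),
    "PubMedQA*": (500, 3),
    "BioASQ-Y/N": (618, 2),
}

# Total number of questions across the five datasets.
total_questions = sum(size for size, _ in stats.values())

# Expected accuracy of uniform random guessing is 1 / (number of options).
chance = {name: 1 / n_options for name, (_, n_options) in stats.items()}

# Unweighted mean of the per-dataset chance levels.
macro_chance = sum(chance.values()) / len(chance)

print(total_questions)         # 7663
print(round(macro_chance, 4))  # 0.3167
```

Any system scoring near these chance levels on a dataset is doing no better than guessing, which is useful context when the option counts differ across datasets.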

Benchmark Results

The following table shows the benchmark results of different backbone LLMs.

[Table: benchmark results of different backbone LLMs on MIRAGE]

This table shows the comparison of different corpora and retrievers on MIRAGE.

[Table: comparison of different corpora and retrievers on MIRAGE]

Usage

Load the benchmark:

>>> import json
>>> with open("benchmark.json") as f:  # close the file once loading is done
...     benchmark = json.load(f)
...

Load specific datasets in the benchmark (e.g., mmlu):

>>> from src.utils import QADataset

>>> dataset_name = "mmlu"
>>> dataset = QADataset(dataset_name)

>>> print(len(dataset))
1089

>>> print(dataset[0])
{'question': 'A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral', 'options': {'A': 'paralysis of the facial muscles.', 'B': 'paralysis of the facial muscles and loss of taste.', 'C': 'paralysis of the facial muscles, loss of taste and lacrimation.', 'D': 'paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation.'}, 'answer': 'A'}
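Given an item in the format printed above, a zero-shot multiple-choice prompt could be assembled roughly as follows. This is a minimal sketch: `format_prompt`, the `Document [i]` labelling, and the instruction wording are assumptions for illustration and are not the prompt template used by MedRAG. Only the `question` and `options` fields shown above are relied on, and the `answer` field is deliberately left out of the prompt.

```python
def format_prompt(item: dict, snippets=()) -> str:
    """Render one item as a zero-shot multiple-choice prompt (hypothetical layout)."""
    # Optional retrieved snippets come first, numbered from 1.
    lines = [f"Document [{i}]: {text}" for i, text in enumerate(snippets, start=1)]
    lines.append(f"Question: {item['question']}")
    # List the options in label order; the gold answer is never included.
    lines.extend(f"{label}. {text}" for label, text in sorted(item["options"].items()))
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


item = {
    "question": "A lesion causing compression of the facial nerve at the "
                "stylomastoid foramen will cause ipsilateral",
    "options": {
        "A": "paralysis of the facial muscles.",
        "B": "paralysis of the facial muscles and loss of taste.",
        "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
        "D": "paralysis of the facial muscles, loss of taste, lacrimation "
             "and decreased salivation.",
    },
    "answer": "A",
}

# The snippet text here is illustrative, not taken from a MedRAG corpus.
prompt = format_prompt(
    item,
    snippets=["The facial nerve exits the skull through the stylomastoid foramen."],
)
print(prompt)
```

Calling `format_prompt(item)` without snippets yields the plain question-and-options prompt, which corresponds to the non-RAG case.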

Evaluate prediction results saved in ./prediction for both CoT generation and RAG with 32 snippets:

# CoT with GPT-3.5
python src/evaluate.py --results_dir ./prediction --llm_name OpenAI/gpt-35-turbo-16k

# MedRAG-32 with GPT-3.5
python src/evaluate.py --results_dir ./prediction --llm_name OpenAI/gpt-35-turbo-16k --rag --k 32

# CoT with GPT-4
python src/evaluate.py --results_dir ./prediction --llm_name OpenAI/gpt-4-32k

# MedRAG-32 with GPT-4
python src/evaluate.py --results_dir ./prediction --llm_name OpenAI/gpt-4-32k --rag --k 32
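For readers who want to see what scoring multiple-choice predictions involves, here is a minimal sketch. It does not reproduce `src/evaluate.py`: the free-text output format, the regular expression, and the helper names `extract_choice` and `accuracy` are assumptions, and the sample outputs are made up for the demonstration. It also assumes that gold labels are option letters, as in the `mmlu` example above.

```python
import re

# Matches phrases such as "answer is B" or "Answer: (C)". Only the word
# "answer" is case-insensitive; the option letter must be upper case.
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*(?:is|:)\s*\(?([A-D])\b")


def extract_choice(output: str):
    """Return the option letter stated in a model output, or None if absent."""
    match = ANSWER_PATTERN.search(output)
    return match.group(1) if match else None


def accuracy(outputs: list, golds: list) -> float:
    """Fraction of outputs whose parsed choice equals the gold label.

    Outputs that cannot be parsed count as incorrect.
    """
    if len(outputs) != len(golds):
        raise ValueError("outputs and golds must have the same length")
    correct = sum(extract_choice(out) == gold for out, gold in zip(outputs, golds))
    return correct / len(golds)


# Made-up model outputs and gold labels, for illustration only.
outputs = ["Therefore, the answer is A.", "Answer: (C)", "I am not sure."]
golds = ["A", "B", "D"]

score = accuracy(outputs, golds)
print(f"{score:.3f}")
```

Counting unparsable outputs as wrong is a design choice of this sketch; a real evaluation script may handle them differently.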

Submission

To submit the results of your new system to the Leaderboard, please send an email to Guangzhi Xiong ([email protected]) with:

  • The name of your system and its components
  • Performance of the system on each subtask and its average performance
  • A reference link to your results
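When reporting average performance, it helps to state how the average was computed. The sketch below contrasts an unweighted mean over the five subtasks with a mean weighted by dataset size. This README does not specify which averaging the Leaderboard expects, so treat both as assumptions. The accuracy values are placeholders chosen to make the arithmetic easy to follow; they are not results of any system.

```python
# Dataset sizes as listed in the MIRAGE statistics.
sizes = {
    "MMLU-Med": 1089,
    "MedQA-US": 1273,
    "MedMCQA": 4183,
    "PubMedQA*": 500,
    "BioASQ-Y/N": 618,
}


def macro_average(acc: dict) -> float:
    """Unweighted mean of the per-subtask accuracies."""
    return sum(acc.values()) / len(acc)


def weighted_average(acc: dict, sizes: dict) -> float:
    """Mean of the per-subtask accuracies weighted by number of questions."""
    total = sum(sizes[name] for name in acc)
    return sum(acc[name] * sizes[name] for name in acc) / total


# Placeholder accuracies (NOT real results): 0.5 everywhere except MedMCQA.
placeholder_acc = {name: 0.5 for name in sizes}
placeholder_acc["MedMCQA"] = 1.0

macro = macro_average(placeholder_acc)
weighted = weighted_average(placeholder_acc, sizes)
print(round(macro, 4))     # 0.6
print(round(weighted, 4))  # 0.7729
```

The two averages diverge because MedMCQA holds more than half of all questions, so a size-weighted mean is dominated by that subtask.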

Citation

@article{xiong2024benchmarking,
    title={Benchmarking Retrieval-Augmented Generation for Medicine}, 
    author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
    journal={arXiv preprint arXiv:2402.13178},
    year={2024}
}
