Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard (dataset)

ResearchGate: link

Abstract

Although informal evaluations of modern LLMs can be found on social media, blogs, and news outlets, a formal and comprehensive comparison among them has yet to be conducted. To address this gap, we carried out an extensive benchmark evaluation of LLMs and conversational bots. We collected 1002 questions spanning 27 categories, which we refer to as the “Wordsmiths dataset.” These categories include reasoning, logic, facts, coding, bias, language, humor, and more. Each question in the dataset is accompanied by an accurate and verified answer. Using this dataset, we assessed four leading chatbots: ChatGPT, GPT-4, Bard, and Claude. The key findings are: a) GPT-4 emerged as the top-performing chatbot across all categories, achieving a success rate of 84.1%, while Bard achieved the lowest success rate, 62.4%. b) At least one of the four models responded correctly approximately 93% of the time, but all four models were correct only about 44% of the time. c) Bard is less correlated with the other models, while ChatGPT and GPT-4 are highly correlated in their responses. d) Chatbots demonstrated proficiency in language understanding, facts, and self-awareness, but encountered difficulties in areas such as math, coding, IQ, and reasoning. e) In the bias, discrimination, and ethics categories, models generally performed well, suggesting they are relatively safe to use. To make future model evaluations on our dataset easier, we also provide a multiple-choice version of it (called Wordsmiths-MCQ). Understanding and assessing the capabilities and limitations of modern chatbots has significant societal implications. To foster further research in this field, we have made our dataset publicly available; it can be found at Wordsmiths.
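The aggregate figures in findings (a) and (b) are simple ratios over per-question correctness judgments. The sketch below shows one way they could be computed. The `results` table is hypothetical toy data, not the paper's results, and the function and variable names are illustrative assumptions.

```python
# Hypothetical toy data: one True/False correctness flag per question, per model.
# These values are made up for illustration and are NOT the paper's results.
results = {
    "ChatGPT": [True, True, False, False],
    "GPT-4":   [True, True, True, False],
    "Claude":  [True, False, True, False],
    "Bard":    [True, False, False, False],
}


def success_rate(flags):
    """Fraction of questions a single model answered correctly."""
    return sum(flags) / len(flags)


# Regroup the flags by question: each tuple holds one flag per model.
per_question = list(zip(*results.values()))

# Share of questions that at least one model answered correctly.
any_correct = sum(any(q) for q in per_question) / len(per_question)

# Share of questions that every model answered correctly.
all_correct = sum(all(q) for q in per_question) / len(per_question)

for model, flags in results.items():
    print(f"{model}: {success_rate(flags):.1%}")
print(f"at least one model correct: {any_correct:.1%}")
print(f"all models correct: {all_correct:.1%}")
```

With real judgments in place of the toy flags, the same two ratios correspond to the "at least one correct" and "all correct" figures quoted in the abstract.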

Results

to be announced

About the dataset

In total, our dataset contains 1002 question-answer pairs, organized into 27 categories that can be used to assess the main abilities of large language models. The figure below shows the number of questions per category.

Download

To access the dataset, see the data folder or download it from the release section. All categories are provided in both JSON and CSV formats; use whichever suits your needs. For categories or questions that do not require an answer, "NONE" is used as the answer.
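As a minimal sketch of reading either format and handling the "NONE" placeholder, the snippet below parses JSON or CSV text and maps "NONE" to Python's `None`. The field names `question` and `answer`, the sample records, and the helper `load_records` are assumptions made for illustration; check the actual files for the real schema.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Parse dataset text in 'json' or 'csv' format into a list of dicts.

    Assumes (hypothetically) that each record has an 'answer' field.
    """
    if fmt == "json":
        records = json.loads(text)
    elif fmt == "csv":
        records = list(csv.DictReader(io.StringIO(text)))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    # Map the "NONE" placeholder to None so downstream code can skip
    # questions that have no reference answer.
    for rec in records:
        if rec.get("answer") == "NONE":
            rec["answer"] = None
    return records


# Hypothetical sample content; the real files may use different field names.
sample_json = (
    '[{"question": "What is 2 + 2?", "answer": "4"},'
    ' {"question": "Tell me a joke.", "answer": "NONE"}]'
)
sample_csv = "question,answer\nWhat is 2 + 2?,4\nTell me a joke.,NONE\n"

json_records = load_records(sample_json, "json")
csv_records = load_records(sample_csv, "csv")

# Keep only the questions that come with a reference answer.
gradable = [r for r in json_records if r["answer"] is not None]
print(len(json_records), len(gradable))
```

To use it on the downloaded files, read a file's contents with `open(path, encoding="utf-8").read()` and pass the text along with the matching format name.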

Contribution

If you would like to help expand the dataset, please open an issue or send an email. We encourage you to add question-answer pairs in any category and language.

Citation

SSRN preprint:

@misc{BorjiMohammadianWordsmiths,
author = {Borji, Ali and Mohammadian, Mehrdad},
year = {2023},
month = {06},
pages = {},
title = {Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard},
journal = {SSRN Electronic Journal},
doi = {10.2139/ssrn.4476855}
}

License

GNU General Public License v3.0
