
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Introduction | How to use M3KE | Evaluation Leaderboard | Citation

Introduction

We propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, developed to measure the knowledge acquired by Chinese large language models in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of the Chinese education system, from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art, and religion. All questions are multiple-choice with four options, guaranteeing a standardized and unified assessment process.

We have assessed, and will continue to assess, a number of Chinese large language models on our benchmark. The models evaluated so far are either only pre-trained on massive data, or pre-trained and then fine-tuned with SFT or RLHF. Model sizes range from 335M to 175B parameters.

This is a collaborative research effort between the Natural Language Processing Laboratory at Tianjin University and Huawei Noah’s Ark Lab.

Comparison between M3KE and other relevant benchmarks:

| Benchmark | Language | #Tasks | #Questions |
|-----------|----------|--------|------------|
| MMLU      | En       | 57     | 15,908     |
| AGIEval   | En, Zh   | 20     | 8,062      |
| MMCU      | Zh       | 51     | 11,900     |
| M3KE      | Zh       | 71     | 20,477     |

All 71 tasks, organized by education level and discipline cluster:

- Primary school
  - Arts & Humanities: Chinese
  - Natural Sciences: Math
- Junior high school
  - Arts & Humanities: Chinese, History
  - Social Sciences: Politics
  - Natural Sciences: Math, Physics, Biology, Chemistry, Geography
- High school
  - Arts & Humanities: Chinese, History
  - Social Sciences: Politics
  - Natural Sciences: Math, Physics, Biology, Chemistry, Geography
- College
  - Arts & Humanities: Modern History, History Foundation, Modern World History
  - Social Sciences: Chinese Constitutional Law, History of Chinese Education, History of the Chinese Legal System, Developmental and Educational Psychology, History of Foreign Education, Experimental Psychology, Introduction to Psychology, Moral Cultivation, Psychology of Teaching, Principles of Pedagogy, Educational Research Methods, Current Affairs and Politics, Introduction to Mao Tsetung Thoughts, Civil Law, Jurisprudence, Sociology, Basic Principle of Marxism, Criminal Jurisprudence, Outline of Chinese Modern History
  - Natural Sciences: Humanistic Medicine, Internal Medicine, Animal Physiology, Surgical Sciences, Operating Systems, Data Structures, Probability Theory, Biochemistry, Biochemistry and Pathology, Physiology, Principles of Computer Composition, Computer Networks, Advanced Mathematics, Linear Algebra, Stomatology, Anthropotomy, Pharmacology, Immunology
  - Others: Management, Economics
- Others
  - Arts & Humanities: Film, Music, Dance, Fine Arts
  - Natural Sciences: Computer Grade Exam (Computer Fundamentals, Programming Languages)
  - Others: Chinese Medicine, Ancient Chinese Language, Novels, Religion, Chinese Civil Service Examination

How to use M3KE

There are several ways to load the M3KE dataset. Below, we provide three methods for reference:

Method 1

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using Python's built-in json package as shown below:

import json

# Each line of a .jsonl file is a standalone JSON object (one question).
path = "path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl"
with open(path, mode="r", encoding="utf-8") as fin:
    for json_line in fin:
        json_data = json.loads(json_line)
        print(json_data)
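
If you need to iterate over the whole benchmark rather than a single task, a minimal sketch along the same lines (assuming one .jsonl file per task under the cloned repository's test/ directory, which matches the file shown above) could look like this:

import json
from pathlib import Path

# Sketch: count tasks and questions across the entire test split.
# Assumes one .jsonl file per task under the cloned repo's test/ directory.
test_dir = Path("path/to/M3KE/test")
task_files = sorted(test_dir.glob("*.jsonl"))

total_questions = 0
for task_file in task_files:
    with task_file.open(mode="r", encoding="utf-8") as fin:
        total_questions += sum(1 for _ in fin)

print(f"{len(task_files)} tasks, {total_questions} questions")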

Method 2

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using the datasets package as shown below:

from datasets import load_dataset

ds = load_dataset("json", data_files="path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl", split="train")
print(ds)
"""
Dataset({
    features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
    num_rows: 236
})
"""

print(ds[0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Method 3

The M3KE dataset has been uploaded to the HuggingFace Datasets Hub, allowing for easy loading without manually cloning this repository. Use the following code to load the dataset:

from datasets import load_dataset

ds = load_dataset(
    path="TJUNLP/M3KE", 
    name="Computer Programming Language-Natural Sciences-Other"
)
print(ds)
"""
DatasetDict({
    test: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 236
    })
    dev: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 5
    })
})
"""

print(ds["test"][0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Evaluation Leaderboard (more models to be added)

If you want to have your Chinese large language models assessed and added to the leaderboard, please feel free to contact us via [email protected] or submit a pull request.

Average zero-shot accuracy of each evaluated model on four major discipline clusters:

| Models | Arts & Humanities | Social Sciences | Natural Sciences | Others | Average |
|--------|-------------------|-----------------|------------------|--------|---------|
| GLM-335M | 0.070 | 0.046 | 0.084 | 0.044 | 0.062 |
| Bloom-7B | 0.163 | 0.159 | 0.161 | 0.158 | 0.161 |
| GLM-10B | 0.180 | 0.229 | 0.219 | 0.150 | 0.197 |
| GLM-130B | 0.326 | 0.352 | 0.274 | 0.359 | 0.328 |
| ChatGLM-6B | 0.246 | 0.267 | 0.168 | 0.263 | 0.236 |
| MOSS-SFT-16B | 0.260 | 0.263 | 0.207 | 0.275 | 0.251 |
| BELLE-7B-0.2M | 0.247 | 0.296 | 0.260 | 0.260 | 0.266 |
| BELLE-7B-2M | 0.328 | 0.367 | 0.282 | 0.355 | 0.333 |
| LLaMA-7B-2M | 0.256 | 0.227 | 0.206 | 0.244 | 0.233 |
| LLaMA-13B-2M | 0.294 | 0.316 | 0.246 | 0.279 | 0.284 |
| AquilaChat-7B | 0.256 | 0.253 | 0.229 | 0.246 | 0.246 |
| GPT-3.5-turbo | 0.460 | 0.538 | 0.444 | 0.481 | 0.481 |
| GPT-4 | 0.588 | 0.676 | 0.623 | 0.665 | 0.638 |

Average five-shot accuracy of each evaluated model on four major discipline clusters:

| Models | Arts & Humanities | Social Sciences | Natural Sciences | Others | Average |
|--------|-------------------|-----------------|------------------|--------|---------|
| GLM-335M | 0.220 | 0.247 | 0.193 | 0.126 | 0.196 |
| Bloom-7B | 0.247 | 0.260 | 0.235 | 0.246 | 0.247 |
| GLM-10B | 0.294 | 0.304 | 0.232 | 0.211 | 0.260 |
| GLM-130B | 0.297 | 0.329 | 0.246 | 0.228 | 0.275 |
| ChatGLM-6B | 0.188 | 0.175 | 0.121 | 0.198 | 0.171 |
| MOSS-SFT-16B | 0.266 | 0.264 | 0.258 | 0.284 | 0.268 |
| BELLE-7B-0.2M | 0.292 | 0.327 | 0.273 | 0.307 | 0.299 |
| BELLE-7B-2M | 0.287 | 0.309 | 0.284 | 0.313 | 0.298 |
| LLaMA-7B-2M | 0.273 | 0.257 | 0.222 | 0.250 | 0.251 |
| LLaMA-13B-2M | 0.241 | 0.234 | 0.138 | 0.219 | 0.208 |
| AquilaChat-7B | 0.257 | 0.249 | 0.248 | 0.264 | 0.255 |
| baichuan-7B | 0.266 | 0.264 | 0.175 | 0.241 | 0.237 |
| GPT-3.5-turbo | 0.453 | 0.540 | 0.464 | 0.476 | 0.483 |
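
For the five-shot setting reported above, demonstrations are presumably drawn from each task's five-example dev split. A sketch of one plausible way to assemble such a prompt (the format is our assumption, not the official one):

from datasets import load_dataset

ds = load_dataset(
    path="TJUNLP/M3KE",
    name="Computer Programming Language-Natural Sciences-Other",
)

def format_example(ex: dict, with_answer: bool) -> str:
    # Hypothetical few-shot template; the official format may differ.
    text = (
        f"问题:{ex['question']}\n"
        f"A. {ex['A']}\n"
        f"B. {ex['B']}\n"
        f"C. {ex['C']}\n"
        f"D. {ex['D']}\n"
        "答案:"
    )
    if with_answer:
        text += ex["answer"] + "\n\n"
    return text

# Five dev examples as in-context demonstrations, then the test question.
demos = "".join(format_example(ex, with_answer=True) for ex in ds["dev"])
prompt = demos + format_example(ds["test"][0], with_answer=False)
print(prompt)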

Citation

@misc{liu2023m3ke,
    title={M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models},
    author={Chuang Liu and Renren Jin and Yuqi Ren and Linhao Yu and Tianyu Dong and Xiaohan Peng and Shuting Zhang and Jianxiang Peng and Peiyi Zhang and Qingqing Lyu and Xiaowen Su and Qun Liu and Deyi Xiong},
    year={2023},
    eprint={2305.10263},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
