math401-llm

Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks?

Full evaluation of all size models.

Dataset

MATH 401 = 1 Euler Equation + 16 group * 25 problems

Euler Equation.
Add & Subtract of two integers within 10.
Add & Subtract of two integers within 100.
Add & Subtract of two integers within 1,000.
Add & Subtract of two integers within 1,000,000,000,000.
Add & Subtract of two integers within -10~10.
Add & Subtract of two decimal numbers within -100~100.
Multiply two integers within 100.
Multiply two decimal numbers within 10.
Multiply two integers within 100,000.
Division of two integers within 100.
Exponentiation of with integer base within 10 and integer exponent within 2~4.
Exponentiation of with a decimal number within 10 as the base and a decimal number within 2~4 as the exponent.
Add, Subtract & Multiply with one integer within 10 and a common irrational number (i.e. $e$ or $\pi$).
Long arithmetic expressions with brackets, involved integers are all within 100 and operators contain add, subtract, multiply, and division.
Trigonometry functions including $\sin$, $\cos$, and $\tan$. Inputs can be in the format of degrees and radians ($\pi$ can also appear in the inputs).
Logarithm of integers within 1000 of different bases: $2,e,10$.

Metric

Accuracy

If the difference between the decoded number and the target number is less than $1e-3$, we consider it a correct prediction. Accuracy is calculated based on correct prediction counts.

Relative error

We denote decoded number is $\hat{y}$ and target is $y$. We calculate relative error by:

$RE = \min(10, \frac{|\hat{y}-y|}{\max(|y|, 1)})$

If LLM does not decode any number, we consider $RE=10$. We truncate the relative error to 10 to prevent that one big mistake dominate the average relative error.

Non-number ratio

If decoded content does not contain any numbers, we consider it a failure. We calculate the non-number ratio based on it.

Citation

@misc{yuan2023large,
      title={How well do Large Language Models perform in Arithmetic tasks?}, 
      author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
      year={2023},
      eprint={2304.02015},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

yuanhy1997 / math401-llm Goto Github PK

math401-llm's Introduction

math401-llm

Dataset

Metric

Accuracy

Relative error

Non-number ratio

Citation

math401-llm's People

Contributors

Stargazers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent