math401-llm

Source code and datasets for the paper "How well do Large Language Models perform in Arithmetic tasks?"

[Figure: Main]

Full evaluation of models of all sizes.

[Figure: Full]

Dataset

MATH 401 = 1 Euler equation + 16 groups * 25 problems

  • Euler equation.
  • Addition and subtraction of two integers within 10.
  • Addition and subtraction of two integers within 100.
  • Addition and subtraction of two integers within 1,000.
  • Addition and subtraction of two integers within 1,000,000,000,000.
  • Addition and subtraction of two integers within -10~10.
  • Addition and subtraction of two decimal numbers within -100~100.
  • Multiplication of two integers within 100.
  • Multiplication of two decimal numbers within 10.
  • Multiplication of two integers within 100,000.
  • Division of two integers within 100.
  • Exponentiation with an integer base within 10 and an integer exponent within 2~4.
  • Exponentiation with a decimal base within 10 and a decimal exponent within 2~4.
  • Addition, subtraction, and multiplication of one integer within 10 and a common irrational number (i.e. $e$ or $\pi$).
  • Long arithmetic expressions with brackets; all integers involved are within 100, and the operators include addition, subtraction, multiplication, and division.
  • Trigonometric functions, including $\sin$, $\cos$, and $\tan$. Inputs can be given in degrees or radians ($\pi$ can also appear in the inputs).
  • Logarithms of integers within 1000 with different bases: $2$, $e$, $10$.
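
As a quick sanity check on the dataset size, the composition stated above can be written out in Python. This is only an illustrative sketch; the constant and function names are hypothetical and do not come from the repository.

```python
# Dataset composition as described in this README.
# All names here are illustrative, not taken from the repository.
EULER_PROBLEMS = 1        # the single Euler equation problem
NUM_GROUPS = 16           # the remaining problem groups listed above
PROBLEMS_PER_GROUP = 25   # problems in each of those groups


def total_problems() -> int:
    """Return the total number of problems implied by the composition."""
    return EULER_PROBLEMS + NUM_GROUPS * PROBLEMS_PER_GROUP


if __name__ == "__main__":
    print(total_problems())
```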

Metric

Accuracy

If the absolute difference between the decoded number and the target number is less than $10^{-3}$, we consider the prediction correct. Accuracy is the fraction of correct predictions.
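
The accuracy rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's evaluation script: the function names are hypothetical, a missing decoded number is represented by `None`, and such outputs are assumed to count as incorrect.

```python
from typing import Optional, Sequence

TOLERANCE = 1e-3  # threshold from the accuracy definition


def is_correct(decoded: Optional[float], target: float,
               tol: float = TOLERANCE) -> bool:
    """Return True if a number was decoded and is within tol of the target.

    Assumption: an output with no decoded number (None) is incorrect.
    """
    if decoded is None:
        return False
    return abs(decoded - target) < tol


def accuracy(decoded: Sequence[Optional[float]],
             targets: Sequence[float]) -> float:
    """Fraction of predictions counted as correct."""
    if len(decoded) != len(targets):
        raise ValueError("decoded and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(is_correct(d, t) for d, t in zip(decoded, targets))
    return correct / len(targets)
```

For example, `accuracy([2.0, None, 5.0], [2.0, 3.0, 5.5])` counts only the first prediction as correct.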

Relative error

We denote the decoded number by $\hat{y}$ and the target by $y$. The relative error is calculated as:

$RE = \min(10, \frac{|\hat{y}-y|}{\max(|y|, 1)})$

If the LLM does not decode any number, we set $RE=10$. We cap the relative error at 10 to prevent one large mistake from dominating the average relative error.
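
The capped relative error can be sketched in Python as follows. This is a minimal sketch rather than the repository's implementation: the function names are hypothetical, and a missing decoded number is assumed to be represented by `None`.

```python
from typing import Optional, Sequence

RE_CAP = 10.0  # cap from the relative error definition


def relative_error(decoded: Optional[float], target: float,
                   cap: float = RE_CAP) -> float:
    """RE = min(cap, |y_hat - y| / max(|y|, 1)); cap if nothing was decoded."""
    if decoded is None:
        # No number decoded: assign the maximum error.
        return cap
    return min(cap, abs(decoded - target) / max(abs(target), 1.0))


def mean_relative_error(decoded: Sequence[Optional[float]],
                        targets: Sequence[float]) -> float:
    """Average relative error over all problems."""
    if len(decoded) != len(targets):
        raise ValueError("decoded and targets must have the same length")
    if not targets:
        return 0.0
    errors = [relative_error(d, t) for d, t in zip(decoded, targets)]
    return sum(errors) / len(errors)
```

The `max(|y|, 1)` term in the denominator keeps the error finite when the target is zero or close to it, and the cap bounds the contribution of any single problem to the average.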

Non-number ratio

If the decoded content does not contain any number, we count it as a failure. The non-number ratio is the fraction of outputs that fail in this way.
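
One way to compute this ratio is sketched below. The README does not say how numbers are detected in the decoded text, so the digit-based regular expression here is an assumption, and the function names are hypothetical.

```python
import re
from typing import Sequence

# Assumption: an output "contains a number" if it has at least one digit.
# The actual extraction rule used for the benchmark may differ.
_DIGIT = re.compile(r"\d")


def contains_number(text: str) -> bool:
    """Return True if the decoded text has at least one digit."""
    return _DIGIT.search(text) is not None


def non_number_ratio(outputs: Sequence[str]) -> float:
    """Fraction of decoded outputs that contain no number at all."""
    if not outputs:
        return 0.0
    failures = sum(not contains_number(o) for o in outputs)
    return failures / len(outputs)
```

With this rule, an output such as "I cannot compute that." counts as a failure, while "x = -3.5" does not.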

Citation

@misc{yuan2023large,
      title={How well do Large Language Models perform in Arithmetic tasks?}, 
      author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
      year={2023},
      eprint={2304.02015},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
