๐Ÿง LLM AutoEval

๐Ÿฆ Follow me on X โ€ข ๐Ÿค— Hugging Face โ€ข ๐Ÿ’ป Blog โ€ข ๐Ÿ“™ Hands-on GNN

Simplify LLM evaluation using a convenient Colab notebook.

Open In Colab

๐Ÿ” Overview

LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run!

Key Features

  • Automated setup and execution using RunPod.
  • Customizable evaluation parameters for tailored benchmarking.
  • Summary generation and upload to GitHub Gist for easy sharing and reference.

View a sample summary here.
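
Under the hood, the upload step amounts to a single call to the GitHub Gists REST API. Here is a minimal sketch of what it could look like (this is not the project's actual code; the file name, description, and summary content are placeholders):

```python
import os
import requests

# Placeholder summary; the real notebook builds this table from the benchmark's JSON results.
summary_markdown = "## Example results\n\n| Task | Score |\n|---|---|\n| ... | ... |"

response = requests.post(
    "https://api.github.com/gists",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "description": "LLM AutoEval summary",
        "public": True,  # a private Gist (PRIVATE_GIST) would set this to False
        "files": {"summary.md": {"content": summary_markdown}},
    },
)
response.raise_for_status()
print("Gist created at:", response.json()["html_url"])
```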

Note: This project is in the early stages and primarily designed for personal use. Use it carefully and feel free to contribute.

⚡ Quick Start

Evaluation

  • MODEL_ID: Enter the model id from Hugging Face.
  • BENCHMARK:
    • nous: List of tasks: AGIEval, GPT4ALL, TruthfulQA, and Bigbench (popularized by Teknium and NousResearch). This is recommended.
    • lighteval: This is a new library from Hugging Face. It allows you to specify your tasks as shown in the readme. Check the list of recommended tasks to see what you can use (e.g., HELM, PIQA, GSM8K, MATH, etc.)
    • openllm: List of tasks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA (like the Open LLM Leaderboard). It uses the vllm implementation to enhance speed (note that the results will not be identical to those obtained without using vllm). "mmlu" is currently missing because of a problem with vllm.
  • LIGHTEVAL_TASK: You can select one or several tasks as specified in the readme or in the list of recommended tasks (example values for these fields are sketched below).
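
For illustration, the evaluation fields could be filled in like this (the model id is just an example, and the LIGHTEVAL_TASK string is a placeholder whose exact syntax is defined in the lighteval readme):

```python
# Example values for the evaluation form fields (illustrative only).
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # any Hugging Face model id
BENCHMARK = "nous"                               # "nous", "lighteval", or "openllm"

# Only used when BENCHMARK == "lighteval"; verify the task syntax in the lighteval readme.
LIGHTEVAL_TASK = "lighteval|gsm8k|5|0"
```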

Cloud GPU

  • GPU: Select the GPU you want for evaluation (see prices here). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite.
  • Number of GPUs: Self-explanatory (several smaller GPUs are often more cost-efficient than one bigger GPU if you need more VRAM).
  • CONTAINER_DISK: Size of the disk in GB.
  • CLOUD_TYPE: RunPod offers a community cloud (cheaper) and a secure cloud (more reliable).
  • REPO: If you made a fork of this repo, you can specify its URL here (the image only runs runpod.sh).
  • TRUST_REMOTE_CODE: Models like Phi require this flag to run.
  • PRIVATE_GIST: (W.I.P.) Make the Gist with the results private (true) or public (false).
  • DEBUG: If enabled, the pod will not be destroyed at the end of the run (not recommended). See the sketch after this list for how these settings might map to a pod creation call.
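
These settings parameterize the RunPod pod that runs the evaluation container. Here is a rough sketch of how they might map to a pod creation call with the runpod Python SDK (the image name, GPU id, and environment variable names below are assumptions, not the project's exact values):

```python
import runpod

runpod.api_key = "<RUNPOD_TOKEN>"

# Hypothetical mapping of the notebook fields to a RunPod pod.
pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel",  # placeholder image
    gpu_type_id="NVIDIA GeForce RTX 3090",  # GPU
    gpu_count=1,                            # Number of GPUs
    cloud_type="COMMUNITY",                 # CLOUD_TYPE: "COMMUNITY" or "SECURE"
    container_disk_in_gb=100,               # CONTAINER_DISK
    env={                                   # assumed variable names read by runpod.sh
        "MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",
        "BENCHMARK": "nous",
        "TRUST_REMOTE_CODE": "False",
        "DEBUG": "False",
    },
)
print("Pod id:", pod["id"])
```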

Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets called "runpod" and "github" and add the corresponding tokens, which you can find as follows (a sketch of how the notebook might read them is shown after this list):

  • RUNPOD_TOKEN: Please consider using my referral link if you don't have an account yet. You can create your token here under "API keys" (read & write permission). You'll also need to transfer some money there to start a pod.
  • GITHUB_TOKEN: You can create your token here (read & write, can be restricted to "gist" only).
  • HF_TOKEN: Optional. You can find your Hugging Face token here if you have an account.
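
Inside the notebook, secrets stored in the Secrets tab can be read with google.colab.userdata. A minimal sketch, using the secret names above (the Hugging Face secret name is an assumption, since that token is optional):

```python
from google.colab import userdata

# Secret names must match the entries created in Colab's Secrets tab.
RUNPOD_TOKEN = userdata.get("runpod")
GITHUB_TOKEN = userdata.get("github")

# Optional: only needed for gated or private Hugging Face models.
try:
    HF_TOKEN = userdata.get("huggingface")  # assumed secret name
except Exception:
    HF_TOKEN = None
```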

📊 Benchmark suites

Nous

You can compare your results with those of other models evaluated with the Nous benchmark suite.

Lighteval

You can compare your results on a case-by-case basis, depending on the tasks you have selected.

Open LLM

You can compare your results with those listed on the Open LLM Leaderboard.

๐Ÿ† Leaderboard

I use the summaries produced by LLM AutoEval to create YALL - Yet Another LLM Leaderboard, which visualizes the results with plots.


Let me know if you're interested in creating your own leaderboard from your gists in one click; this could easily be turned into a small notebook that builds such a space.
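
As a sketch of what such a notebook could do, you can list a user's public gists through the GitHub API and collect the AutoEval summaries (the username, file filter, and naming convention below are assumptions):

```python
import requests

USERNAME = "your-github-username"  # hypothetical

# List public gists and download every file that looks like a summary.
gists = requests.get(f"https://api.github.com/users/{USERNAME}/gists").json()

summaries = {}
for gist in gists:
    for filename, meta in gist["files"].items():
        if filename.endswith(".md"):  # assumed naming convention for summaries
            summaries[gist["description"]] = requests.get(meta["raw_url"]).text

print(f"Collected {len(summaries)} summaries")
```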

🛠️ Troubleshooting

  • "Error: File does not exist": This task didn't produce the JSON file that is parsed for the summary. Activate debug mode and rerun the evaluation to inspect the issue in the logs.
  • "700 Killed" Error: The hardware is not powerful enough for the evaluation. This happens when you try to run the Open LLM benchmark suite on an RTX 3070 for example.
  • Outdated CUDA Drivers: That's unlucky. You'll need to start a new pod in this case.

Acknowledgements

Special thanks to burtenshaw for integrating lighteval, EleutherAI for the lm-evaluation-harness, dmahan93 for his fork that adds agieval to the lm-evaluation-harness, Hugging Face for the lighteval library, NousResearch and Teknium for the Nous benchmark suite, and vllm for the additional inference speed.
