
llm-autoeval's Introduction

๐Ÿฆ Follow me on X โ€ข ๐Ÿค— Hugging Face โ€ข ๐Ÿ’ป Blog โ€ข ๐Ÿ“™ Hands-on GNN


Hi, I'm a Machine Learning Scientist, Author, Blogger, and LLM Developer.

💼 Projects

🤗 Models

llm-autoeval's People

Contributors

burtenshaw, cultrix-github, mlabonne, steel-skull


llm-autoeval's Issues

Multiple GPUs work in Colab

I just wanted to let you know that I tested the multiple-GPU feature and it definitely works great. This is by far the easiest way to evaluate 8x7B models. Thanks!

Proposal to Make LLM-AutoEval Repository Multi-Cloud Compatible

Issue Title: Enhancing LLM-AutoEval for Multi-Cloud Compatibility Using SkyPilot

Description:

I am proposing a significant enhancement to the LLM-AutoEval repository to make it multi-cloud compatible. Currently, the repository only supports the RunPod GPU provider; my goal is to extend it to run seamlessly on various cloud platforms using the SkyPilot framework.

Proposed Roadmap:

  1. Dynamic GPU Configuration:

    • Adapt the existing execution script to support different GPU configurations dynamically.
    • Allow users to specify their GPU preferences or let the system auto-detect and utilize available resources efficiently.
  2. Integration with SkyPilot:

    • Modify the execution script to integrate with the SkyPilot framework (a rough sketch follows this list).
    • Leverage SkyPilot's ability to run LLM, AI, and batch jobs on any cloud provider.
  3. Multi-Cloud Testing:

    • Thoroughly test the modified script on multiple cloud providers (e.g., AWS, Azure, GCP) to ensure consistency and efficiency.
    • Validate performance and resource utilization across different cloud environments.
  4. Documentation:

    • Create comprehensive documentation guiding users on how to:
      • Specify cloud provider preferences and credentials.
      • Use the modified script for multi-cloud deployment.
      • Troubleshoot common issues and optimize performance.
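
To make step 2 concrete, here is a rough, untested sketch of what the integration could look like with SkyPilot's Python API (sky.Task, sky.Resources, sky.launch). The environment variable names mirror the ones runpod.sh already uses; the setup and run commands are placeholders, not a working port.

import sky

# Hypothetical sketch only: wrap the existing evaluation script in a SkyPilot task.
task = sky.Task(
    name="llm-autoeval",
    setup="git clone https://github.com/mlabonne/llm-autoeval.git",  # placeholder setup
    run="bash llm-autoeval/runpod.sh",                               # placeholder entry point
    envs={"MODEL": "<model id>", "BENCHMARK": "nous", "GITHUB_API_TOKEN": "<token>"},
)

# Let SkyPilot choose any configured cloud (AWS, Azure, GCP, ...) that offers the GPU.
task.set_resources(sky.Resources(accelerators="A100:1"))

# Provision an available instance on whichever provider SkyPilot picks and run the task.
sky.launch(task, cluster_name="llm-autoeval")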

Expected Outcome:

The enhanced LLM-AutoEval will empower users to choose their preferred cloud provider, customize GPU configurations, and seamlessly run experiments using the SkyPilot framework. This will significantly improve accessibility and usability, fostering collaboration and adoption within the community.

Contributor - @adithya-s-k

Add functionality for evaluating model safety/toxicity

The current evaluation metrics supported by llm-autoeval are robust. However, upon reviewing the documentation, I found that the repo doesn't account for evaluating model toxicity. Assessing LLMs for toxicity is tricky, and there are (surprisingly) few comprehensive, tested open-source solutions for doing so. I've identified a few options that could be added to the llm-autoeval Google Colab notebook.

TrustLLM

  • What is it? A Python package that evaluates trustworthiness by scoring LLM responses to a mixture of well-known evaluation datasets.
  • How does it work? Download the TrustLLM dataset, use TrustLLM and your (supported) model to generate responses for the dataset, then use TrustLLM to evaluate Truthfulness, Safety, Fairness, Robustness, Privacy, and Ethics.
    • Works with models served via APIs, local open models (Hugging Face), and online models via Replicate or DeepInfra.
  • Questions
    • Could we integrate TrustLLM with RunPod so that the pod generates the responses that are eventually evaluated by TrustLLM? (A rough sketch of the generation step follows below.)
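
On the RunPod question: one rough sketch of the generation half, assuming we simply load the model with transformers on the pod and fill a TrustLLM-style prompt file with responses. The file name and the "prompt"/"res" field names are assumptions to check against the TrustLLM docs; the scoring itself would still be done by TrustLLM's evaluators afterwards.

import json
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "<model id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
generate = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

# "safety_prompts.json" and the "prompt"/"res" keys are assumed, not TrustLLM's real schema.
with open("safety_prompts.json") as f:
    samples = json.load(f)

for sample in samples:
    sample["res"] = generate(sample["prompt"])[0]["generated_text"]

with open("safety_responses.json", "w") as f:
    json.dump(samples, f)
# The resulting file would then be scored by TrustLLM's safety/fairness/ethics evaluators.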

I'm more than happy to further discuss and pick this issue up myself!

assert isinstance(pretrained, str)

Hi!

First of all, I want to thank everyone involved in this great project!

I have a specific problem that I haven't been able to solve for hours. I don't have much programming experience, and ChatGPT and the other chatbots couldn't help, so I'm going to try here.

I'm trying to evaluate my model, but I keep running into: "Error: File does not exist". This particular model was already evaluated through the Open LLM Leaderboard without a problem, and I can also run inference with it. I have already enabled DEBUG, and this is what I get in the RunPod logs:

File "/lm-evaluation-harness/main.py", line 89, in <module>
main()
File "/lm-evaluation-harness/main.py", line 57, in main
results = evaluator.simple_evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "/lm-evaluation-harness/lm_eval/evaluator.py", line 69, in simple_evaluate
lm = lm_eval.models.get_model(model).create_from_arg_string(
File "/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
return cls(**args, **args2)
File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 36, in __init__
assert isinstance(pretrained, str)
AssertionError
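
Not a confirmed diagnosis, but this assertion in the old harness usually means the pretrained= value never reached it as a proper string, e.g. because the MODEL field in the notebook was empty or malformed. A small pre-flight check you could run in the Colab cell before creating the pod (MODEL here is whatever you typed into the form; this check is not part of llm-autoeval):

from huggingface_hub import HfApi

MODEL = "<user>/<model>"  # the value from the notebook form
assert isinstance(MODEL, str) and MODEL.count("/") == 1, "MODEL must look like 'org/name'"
HfApi().model_info(MODEL)  # raises if the repo does not exist or is gated/private without a token
print("Model id looks valid:", MODEL)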

Why is the `mmlu` benchmark commented out?

I was going over the runpod.sh script and couldn't help but notice that the mmlu benchmark is commented out. I'm curious why that is.

Thank you for putting this repository together @mlabonne. I learned about this from your X post 🚀!!

llm-autoeval/runpod.sh

Lines 86 to 94 in 75d952e

# benchmark="mmlu"
# lm_eval --model vllm \
# --model_args pretrained=${MODEL},dtype=auto,gpu_memory_utilization=0.8,trust_remote_code=$TRUST_REMOTE_CODE \
# --tasks mmlu \
# --num_fewshot 5 \
# --batch_size auto \
# --verbosity DEBUG \
# --output_path ./${benchmark}.json

Use Colab notebook with private models

It is useful to benchmark private models from the Hub, but the RunPod instance is not authenticated against the user's account.

By passing the HF_TOKEN environment variable (read via colab.userdata) to the pod, it's straightforward to authenticate the RunPod instance. I adapted the notebook here. I'm also happy to submit a PR.
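
For reference, a minimal sketch of the adaptation, assuming Colab secrets and the env parameter of the RunPod SDK's create_pod; the other environment variables the notebook already passes (MODEL, BENCHMARK, ...) are omitted here, and the GPU type is a placeholder.

from google.colab import userdata
import runpod

runpod.api_key = userdata.get("RUNPOD_TOKEN")   # secret names are illustrative
HF_TOKEN = userdata.get("HF_TOKEN")

pod = runpod.create_pod(
    name="Eval private model",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
    gpu_type_id="NVIDIA A40",                   # placeholder GPU type
    env={"HF_TOKEN": HF_TOKEN},                 # lets the pod pull the private repo
)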

Btw. Great tool. Thanks for your work.

Log running time?

I love how simple this is to run.

It would be useful to log how long the test took to run.

Maybe an optional checkbox in the Colab notebook, with the code handling the rest?
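
A minimal sketch of what the optional timing could look like in the notebook (the checkbox/variable name is illustrative, not existing code):

import time
from datetime import timedelta

LOG_RUNNING_TIME = True  # the proposed optional checkbox

start_time = time.time()
# ... create the pod and wait for the evaluation to finish ...
if LOG_RUNNING_TIME:
    elapsed = timedelta(seconds=round(time.time() - start_time))
    print(f"Total evaluation time: {elapsed}")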

Only one GPU is used during the autoeval

Hi! Great tool!

I attempted to use the autoeval feature on a dual RTX 3090 setup on RunPod, but it appeared that only the first GPU was utilized throughout the evaluation process.

I'm uncertain whether the second GPU was genuinely inactive or whether RunPod simply did not display its activity.
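
One way to tell the two cases apart is to print what the evaluation process itself sees, e.g. by adding a small check like the one below to the container's start command. Note that with the harness's hf backend, sharding a single model across both GPUs generally also requires something like parallelize=True in model_args (or a multi-process accelerate launch); I haven't checked which path the image uses.

import torch

# Debugging aid only, not part of llm-autoeval.
print("CUDA devices visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")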

Feature request - Local GPUs

Thanks @mlabonne for sharing this personal repo, it's dead simple! I just wanted to say it would be great if it could support local GPUs, especially via vLLM running in a container.

Thanks again, great job.
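
Not something llm-autoeval supports today, but for anyone wanting a local stopgap, the evaluation itself boils down to lm-evaluation-harness, which can be called directly on local GPUs; a rough sketch (model and tasks are placeholders):

import lm_eval

# Runs lm-evaluation-harness locally instead of on a RunPod instance.
results = lm_eval.simple_evaluate(
    model="hf",                               # or "vllm" if vLLM is installed locally
    model_args="pretrained=gpt2,dtype=auto",  # placeholder model
    tasks=["arc_challenge", "hellaswag"],
    batch_size="auto",
)
print(results["results"])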

Novel errors when running notebook

(Edit: Somehow submitted this before it was ready; give me a moment to pin the bug down.)

I'm getting some unfamiliar errors when running the notebook. I tested lighteval with one task and eq-bench, tried switching the image to the one suggested in the previous image-switching issue, and re-ran a model that worked yesterday.

Container error logs from debug, lighteval:

2024-05-03T22:44:09.217084005Z Traceback (most recent call last):
2024-05-03T22:44:09.217097561Z   File "/lighteval/run_evals_accelerate.py", line 29, in <module>
2024-05-03T22:44:09.217099188Z     from lighteval.main_accelerate import CACHE_DIR, main
2024-05-03T22:44:09.217100293Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/main_accelerate.py", line 31, in <module>
2024-05-03T22:44:09.217112611Z     from lighteval.evaluator import evaluate, make_results_table
2024-05-03T22:44:09.217113371Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/evaluator.py", line 32, in <module>
2024-05-03T22:44:09.217117633Z     from lighteval.logging.evaluation_tracker import EvaluationTracker
2024-05-03T22:44:09.217118437Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/logging/evaluation_tracker.py", line 37, in <module>
2024-05-03T22:44:09.217145402Z     from lighteval.logging.info_loggers import (
2024-05-03T22:44:09.217146831Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/logging/info_loggers.py", line 34, in <module>
2024-05-03T22:44:09.217290883Z     from lighteval.metrics import MetricCategory
2024-05-03T22:44:09.217297198Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/__init__.py", line 25, in <module>
2024-05-03T22:44:09.217298154Z     from lighteval.metrics.metrics import MetricCategory, Metrics
2024-05-03T22:44:09.217299180Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/metrics.py", line 34, in <module>
2024-05-03T22:44:09.217299880Z     from lighteval.metrics.metrics_sample import (
2024-05-03T22:44:09.217301038Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/metrics_sample.py", line 42, in <module>
2024-05-03T22:44:09.217370955Z     from lighteval.metrics.llm_as_judge import JudgeOpenAI
2024-05-03T22:44:09.217374502Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/llm_as_judge.py", line 30, in <module>
2024-05-03T22:44:09.217375365Z     from openai import OpenAI
2024-05-03T22:44:09.217376370Z ModuleNotFoundError: No module named 'openai'
2024-05-03T22:44:09.724293220Z Traceback (most recent call last):
2024-05-03T22:44:09.724322586Z   File "/usr/local/bin/accelerate", line 8, in <module>
2024-05-03T22:44:09.724335712Z     sys.exit(main())
2024-05-03T22:44:09.724337033Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
2024-05-03T22:44:09.724338223Z     args.func(args)
2024-05-03T22:44:09.724339038Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1082, in launch_command
2024-05-03T22:44:09.724457425Z     simple_launcher(args)
2024-05-03T22:44:09.724458705Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 688, in simple_launcher
2024-05-03T22:44:09.724545203Z     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
2024-05-03T22:44:09.724558533Z subprocess.CalledProcessError: Command '['/usr/bin/python', 'run_evals_accelerate.py', '--model_args', 'pretrained=<model name>, '--use_chat_template', '--tasks', 'helm|commonsenseqa|0|0', '--output_dir=./evals/']' returned non-zero exit status 1.
2024-05-03T22:44:10.119077403Z Traceback (most recent call last):
2024-05-03T22:44:10.119092033Z   File "/lighteval/../llm-autoeval/main.py", line 129, in <module>
2024-05-03T22:44:10.119093509Z     raise ValueError(f"The directory {args.directory} does not exist.")
2024-05-03T22:44:10.119094286Z ValueError: The directory ./evals/results does not exist.

Container error logs from debug, eq-bench:

2024-05-04T00:49:11.418739844+02:00 Traceback (most recent call last):
2024-05-04T00:49:11.418762136+02:00   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-05-04T00:49:11.418772005+02:00     return _run_code(code, main_globals, None,
2024-05-04T00:49:11.418774500+02:00   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-05-04T00:49:11.418784579+02:00     exec(code, run_globals)
2024-05-04T00:49:11.418787154+02:00   File "/lm-evaluation-harness/lm_eval/__main__.py", line 401, in <module>
2024-05-04T00:49:11.418867446+02:00     cli_evaluate()
2024-05-04T00:49:11.418877675+02:00   File "/lm-evaluation-harness/lm_eval/__main__.py", line 333, in cli_evaluate
2024-05-04T00:49:11.418913252+02:00     results = evaluator.simple_evaluate(
2024-05-04T00:49:11.418924233+02:00   File "/lm-evaluation-harness/lm_eval/utils.py", line 316, in _wrapper
2024-05-04T00:49:11.418957566+02:00     return fn(*args, **kwargs)
2024-05-04T00:49:11.418964680+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 258, in simple_evaluate
2024-05-04T00:49:11.419002892+02:00     results = evaluate(
2024-05-04T00:49:11.419008102+02:00   File "/lm-evaluation-harness/lm_eval/utils.py", line 316, in _wrapper
2024-05-04T00:49:11.419049490+02:00     return fn(*args, **kwargs)
2024-05-04T00:49:11.419054069+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 592, in evaluate
2024-05-04T00:49:11.419127468+02:00     "n-samples": {
2024-05-04T00:49:11.419131345+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 595, in <dictcomp>
2024-05-04T00:49:11.419181300+02:00     "effective": min(limit, len(task_output.task.eval_docs)),
2024-05-04T00:49:11.419183995+02:00 TypeError: '<' not supported between instances of 'int' and 'NoneType'
2024-05-04T00:49:11.423606509+02:00 Passed argument batch_size = auto. Detecting largest batch size
2024-05-04T00:49:11.423615215+02:00 Determined Largest batch size: 4
2024-05-04T00:49:12.049730263+02:00 Traceback (most recent call last):
2024-05-04T00:49:12.049745021+02:00   File "/usr/local/bin/accelerate", line 8, in <module>
2024-05-04T00:49:12.049811316+02:00     sys.exit(main())
2024-05-04T00:49:12.049814512+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
2024-05-04T00:49:12.049956972+02:00     args.func(args)
2024-05-04T00:49:12.049966009+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1082, in launch_command
2024-05-04T00:49:12.050293249+02:00     simple_launcher(args)
2024-05-04T00:49:12.050303087+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 688, in simple_launcher
2024-05-04T00:49:12.050515349+02:00     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
2024-05-04T00:49:12.050519176+02:00 subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'lm_eval', '--model', 'hf', '--model_args', 'pretrained=<model name>,dtype=auto,trust_remote_code=False', '--tasks', 'eq_bench', '--num_fewshot', '0', '--batch_size', 'auto', '--output_path', './evals/eq-bench.json']' returned non-zero exit status 1.
2024-05-04T00:49:12.484988425+02:00 Traceback (most recent call last):
2024-05-04T00:49:12.485006620+02:00   File "/lm-evaluation-harness/../llm-autoeval/main.py", line 129, in <module>
2024-05-04T00:49:12.485009545+02:00     raise ValueError(f"The directory {args.directory} does not exist.")
2024-05-04T00:49:12.485011008+02:00 ValueError: The directory ./evals does not exist.
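
The eq-bench failure reduces to the harness calling min(limit, n) with limit left as None, while the lighteval one is a missing openai dependency in the image. Just to pin down the first bug in plain Python (this is an illustration of the failure mode, not a patch to the harness):

limit = None   # no --limit requested
n_docs = 171   # arbitrary example size

# min(limit, n_docs) -> TypeError: '<' not supported between instances of 'int' and 'NoneType'
effective = n_docs if limit is None else min(limit, n_docs)
print(effective)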

RunPod says 'no longer any instance available'

I was trying to use the notebook to run an eval on one NVIDIA A40 GPU with a 100 GB disk, and I got the following error:

---------------------------------------------------------------------------
QueryError                                Traceback (most recent call last)
<ipython-input-5-05063fbf8ce4> in <cell line: 41>()
     39 
     40 # Create a pod
---> 41 pod = runpod.create_pod(
     42     name=f"Eval {MODEL.split('/')[-1]} on {BENCHMARK.capitalize()}",
     43     image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",

1 frames
/usr/local/lib/python3.10/dist-packages/runpod/api/graphql.py in run_graphql_query(query)
     28 
     29     if "errors" in response.json():
---> 30         raise error.QueryError(
     31             response.json()["errors"][0]["message"],
     32             query

QueryError: There are no longer any instances available with the requested specifications. Please refresh and try again.

I have funds in my RunPod account, and a pod with an A40 can be created on runpod.io. Any ideas on how to go about debugging this?
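
One thing worth checking from the same notebook before calling create_pod is what the RunPod API itself reports as available, for example with the SDK's get_gpus helper (the exact fields returned may differ, and availability can vary between secure and community cloud):

import runpod

runpod.api_key = "<your RunPod API key>"

# List the GPU types RunPod currently knows about; compare against the requested A40.
for gpu in runpod.get_gpus():
    print(gpu.get("id"), gpu.get("displayName"))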

Issues with BF16 models

Hey... Thanks for the great work!

While trying to evaluate a BF16 model, I encountered an error in my RunPod container:
"triu_tril_cuda_template" not implemented for 'BFloat16' (pytorch/pytorch#101932).

Switching the image from runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04
to runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 fixed the issue.

I'm reporting this for others who may run into the same problem.
I don't know whether it makes sense to update the Colab notebook to a newer image, or whether that might reveal other problems.
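
For context, the error comes from the PyTorch issue linked above: torch.triu/torch.tril on CUDA had no BFloat16 kernel in the 2.0.x builds that the older image ships. A minimal check/reproduction you can run inside a container to see which side you're on:

import torch

print(torch.__version__)
x = torch.ones(4, 4, dtype=torch.bfloat16, device="cuda")
# RuntimeError ("triu_tril_cuda_template" not implemented for 'BFloat16') on torch 2.0.x,
# works on the newer image's torch 2.2.x.
print(torch.triu(x))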

Bug with lighteval when running from the notebook

At the end of the run, I am getting:

2024-03-25T13:57:22.568287661Z Traceback (most recent call last):
2024-03-25T13:57:22.568325169Z   File "/lm-evaluation-harness/../llm-autoeval/main.py", line 7, in <module>
2024-03-25T13:57:22.568330133Z     from lighteval.evaluator import make_results_table
2024-03-25T13:57:22.568333613Z ModuleNotFoundError: No module named 'lighteval'
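
Until the image installs lighteval, one possible workaround is to defer the import in main.py so it only happens when the lighteval benchmark is actually selected. This is only a sketch of the idea, not the repository's actual fix, and the function/flag names are illustrative:

def build_results_table(results, benchmark):
    if benchmark == "lighteval":
        # Imported lazily so non-lighteval runs don't need the package installed.
        from lighteval.evaluator import make_results_table
        return make_results_table(results)
    # ... fall back to the existing summary path for lm-evaluation-harness benchmarks ...
    raise NotImplementedError("non-lighteval summary omitted in this sketch")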
