
llm-autoeval's Introduction

๐Ÿฆ Follow me on X โ€ข ๐Ÿค— Hugging Face โ€ข ๐Ÿ’ป Blog โ€ข ๐Ÿ“™ Hands-on GNN


Hi, I'm a Machine Learning Scientist, Author, Blogger, and LLM Developer.

💼 Projects

🤗 Models

llm-autoeval's People

Contributors

burtenshaw, cultrix-github, mlabonne, steel-skull


llm-autoeval's Issues

Multiple GPUs work in Colab

I just wanted to let you know that I tested the multiple-GPU feature and it definitely works great. This is by far the easiest way to evaluate 8x7B models. Thanks!

Proposal to Make LLM-AutoEval Repository Multi-Cloud Compatible

Issue Title: Enhancing LLM-AutoEval for Multi-Cloud Compatibility Using SkyPilot

Description:

I am proposing a significant enhancement to the LLM-AutoEval repository to make it multi-cloud compatible. Currently, the repository only supports the RunPod GPU provider; my goal is to extend it to run seamlessly on various cloud platforms using the SkyPilot framework.

Proposed Roadmap:

  1. Dynamic GPU Configuration:

    • Adapt the existing execution script to support different GPU configurations dynamically.
    • Allow users to specify their GPU preferences or let the system auto-detect and utilize available resources efficiently.
  2. Integration with SkyPilot:

    • Modify the execution script to integrate with the SkyPilot framework (a rough sketch follows this list).
    • Leverage SkyPilot's ability to run LLM, AI, and batch jobs on any cloud provider.
  3. Multi-Cloud Testing:

    • Thoroughly test the modified script on multiple cloud providers (e.g., AWS, Azure, GCP) to ensure consistency and efficiency.
    • Validate performance and resource utilization across different cloud environments.
  4. Documentation:

    • Create comprehensive documentation guiding users on how to:
      • Specify cloud provider preferences and credentials.
      • Use the modified script for multi-cloud deployment.
      • Troubleshoot common issues and optimize performance.
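
To make step 2 concrete, here is a rough, untested sketch of what the integration could look like with SkyPilot's Python API (sky.Task, sky.Resources, sky.launch). The environment variable names mirror the ones runpod.sh already uses; the setup and run commands are placeholders, not a working port.

import sky

# Hypothetical sketch only: wrap the existing evaluation script in a SkyPilot task.
task = sky.Task(
    name="llm-autoeval",
    setup="git clone https://github.com/mlabonne/llm-autoeval.git",  # placeholder setup
    run="bash llm-autoeval/runpod.sh",                               # placeholder entry point
    envs={"MODEL": "<model id>", "BENCHMARK": "nous", "GITHUB_API_TOKEN": "<token>"},
)

# Let SkyPilot choose any configured cloud (AWS, Azure, GCP, ...) that offers the GPU.
task.set_resources(sky.Resources(accelerators="A100:1"))

# Provision an available instance on whichever provider SkyPilot picks and run the task.
sky.launch(task, cluster_name="llm-autoeval")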

Expected Outcome:

The enhanced LLM-AutoEval will empower users to choose their preferred cloud provider, customize GPU configurations, and seamlessly run experiments using the SkyPilot framework. This will significantly improve accessibility and usability, fostering collaboration and adoption within the community.

Contributor - @adithya-s-k

Add functionality for evaluating model safety/toxicity

The current evaluation metrics supported by llm-autoeval are robust. However, upon reviewing the documentation, I found that the repo doesn't account for evaluating model toxicity. Assessing LLMs for toxicity is tricky, and there are (surprisingly) few comprehensive, tested open-source solutions for doing so. I've identified a few options that could be added to the llm-autoeval Google Colab notebook.

TrustLLM

  • What is it? A Python package that evaluates trustworthiness by scoring LLM responses to a mixture of well-known evaluation datasets.
  • How does it work? Download the TrustLLM dataset, use TrustLLM and your (supported) model to generate responses for the dataset, then use TrustLLM to evaluate Truthfulness, Safety, Fairness, Robustness, Privacy, and Ethics.
    • Works with models served via APIs, local open models (Hugging Face), and online models via Replicate or DeepInfra.
  • Questions
    • Could we integrate TrustLLM with RunPod so that the pod generates the responses that are eventually evaluated by TrustLLM? (A rough sketch of the generation step follows below.)
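
On the RunPod question: one rough sketch of the generation half, assuming we simply load the model with transformers on the pod and fill a TrustLLM-style prompt file with responses. The file name and the "prompt"/"res" field names are assumptions to check against the TrustLLM docs; the scoring itself would still be done by TrustLLM's evaluators afterwards.

import json
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "<model id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
generate = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

# "safety_prompts.json" and the "prompt"/"res" keys are assumed, not TrustLLM's real schema.
with open("safety_prompts.json") as f:
    samples = json.load(f)

for sample in samples:
    sample["res"] = generate(sample["prompt"])[0]["generated_text"]

with open("safety_responses.json", "w") as f:
    json.dump(samples, f)
# The resulting file would then be scored by TrustLLM's safety/fairness/ethics evaluators.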

I'm more than happy to further discuss and pick this issue up myself!

assert isinstance(pretrained, str)

Hi!

First of all, I want to thank everyone involved in this great project!

I have a specific problem that I haven't been able to solve for hours. I don't have much programming experience, and ChatGPT and the other chatbots couldn't help, so I'm going to try here.

I'm trying to evaluate my model, but I keep running into: "Error: File does not exist". This particular model was already evaluated through the Open LLM Leaderboard without a problem, and I can also run inference with it. I have already enabled DEBUG, and this is what I get in the RunPod logs:

File "/lm-evaluation-harness/main.py", line 89, in <module>
main()
File "/lm-evaluation-harness/main.py", line 57, in main
results = evaluator.simple_evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "/lm-evaluation-harness/lm_eval/evaluator.py", line 69, in simple_evaluate
lm = lm_eval.models.get_model(model).create_from_arg_string(
File "/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
return cls(**args, **args2)
File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 36, in __init__
assert isinstance(pretrained, str)
AssertionError
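
Not a confirmed diagnosis, but this assertion in the old harness usually means the pretrained= value never reached it as a proper string, e.g. because the MODEL field in the notebook was empty or malformed. A small pre-flight check you could run in the Colab cell before creating the pod (MODEL here is whatever you typed into the form; this check is not part of llm-autoeval):

from huggingface_hub import HfApi

MODEL = "<user>/<model>"  # the value from the notebook form
assert isinstance(MODEL, str) and MODEL.count("/") == 1, "MODEL must look like 'org/name'"
HfApi().model_info(MODEL)  # raises if the repo does not exist or is gated/private without a token
print("Model id looks valid:", MODEL)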

Why is the `mmlu` benchmark commented out?

I was going over the runpod.sh script and couldn't help but notice that the mmlu benchmark is commented out. I'm curious why that is.

Thank you for putting this repository together @mlabonne. I learned about this from your X post 🚀!!

llm-autoeval/runpod.sh

Lines 86 to 94 in 75d952e

# benchmark="mmlu"
# lm_eval --model vllm \
# --model_args pretrained=${MODEL},dtype=auto,gpu_memory_utilization=0.8,trust_remote_code=$TRUST_REMOTE_CODE \
# --tasks mmlu \
# --num_fewshot 5 \
# --batch_size auto \
# --verbosity DEBUG \
# --output_path ./${benchmark}.json

Use Colab notebook with private models

It is useful to benchmark private models from the Hub, but the RunPod instance is not authenticated against the user's account.

By passing the HF_TOKEN environment variable (read via colab.userdata) to the pod, it's straightforward to authenticate the RunPod instance. I adapted the notebook here. I'm also happy to submit a PR.
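
For reference, a minimal sketch of the adaptation, assuming Colab secrets and the env parameter of the RunPod SDK's create_pod; the other environment variables the notebook already passes (MODEL, BENCHMARK, ...) are omitted here, and the GPU type is a placeholder.

from google.colab import userdata
import runpod

runpod.api_key = userdata.get("RUNPOD_TOKEN")   # secret names are illustrative
HF_TOKEN = userdata.get("HF_TOKEN")

pod = runpod.create_pod(
    name="Eval private model",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
    gpu_type_id="NVIDIA A40",                   # placeholder GPU type
    env={"HF_TOKEN": HF_TOKEN},                 # lets the pod pull the private repo
)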

Btw. Great tool. Thanks for your work.

Log running time?

I love how simple this is to run.

It would be useful to log how long the test took to run.

Maybe an optional checkbox in the Colab notebook, with the code handling the rest?
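
A minimal sketch of what the optional timing could look like in the notebook (the checkbox/variable name is illustrative, not existing code):

import time
from datetime import timedelta

LOG_RUNNING_TIME = True  # the proposed optional checkbox

start_time = time.time()
# ... create the pod and wait for the evaluation to finish ...
if LOG_RUNNING_TIME:
    elapsed = timedelta(seconds=round(time.time() - start_time))
    print(f"Total evaluation time: {elapsed}")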

Only one GPU is used during the autoeval

Hi! Great tool!

I attempted to use the autoeval feature on a dual RTX 3090 setup on RunPod, but it appeared that only the first GPU was utilized throughout the evaluation process.

I'm uncertain whether the second GPU was genuinely inactive or whether RunPod simply did not display its activity.
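
One way to tell the two cases apart is to print what the evaluation process itself sees, e.g. by adding a small check like the one below to the container's start command. Note that with the harness's hf backend, sharding a single model across both GPUs generally also requires something like parallelize=True in model_args (or a multi-process accelerate launch); I haven't checked which path the image uses.

import torch

# Debugging aid only, not part of llm-autoeval.
print("CUDA devices visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")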

Feature request - Local GPUs

Thanks @mlabonne for sharing this personal repo, it's dead simple! I just wanted to say it would be great if it could support local GPUs, especially via vLLM running in a container.

Thanks again, great job.
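
Not something llm-autoeval supports today, but for anyone wanting a local stopgap, the evaluation itself boils down to lm-evaluation-harness, which can be called directly on local GPUs; a rough sketch (model and tasks are placeholders):

import lm_eval

# Runs lm-evaluation-harness locally instead of on a RunPod instance.
results = lm_eval.simple_evaluate(
    model="hf",                               # or "vllm" if vLLM is installed locally
    model_args="pretrained=gpt2,dtype=auto",  # placeholder model
    tasks=["arc_challenge", "hellaswag"],
    batch_size="auto",
)
print(results["results"])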

Novel errors when running notebook

(Edit: Somehow submitted this before it was ready; give me a moment to pin the bug down.)

I'm getting some unfamiliar errors when running the notebook. I tested lighteval with one task and eq-bench, tried switching the image to the one suggested in the previous image-switching issue, and re-ran a model that worked yesterday.

Container error logs from debug, lighteval:

2024-05-03T22:44:09.217084005Z Traceback (most recent call last):
2024-05-03T22:44:09.217097561Z   File "/lighteval/run_evals_accelerate.py", line 29, in <module>
2024-05-03T22:44:09.217099188Z     from lighteval.main_accelerate import CACHE_DIR, main
2024-05-03T22:44:09.217100293Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/main_accelerate.py", line 31, in <module>
2024-05-03T22:44:09.217112611Z     from lighteval.evaluator import evaluate, make_results_table
2024-05-03T22:44:09.217113371Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/evaluator.py", line 32, in <module>
2024-05-03T22:44:09.217117633Z     from lighteval.logging.evaluation_tracker import EvaluationTracker
2024-05-03T22:44:09.217118437Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/logging/evaluation_tracker.py", line 37, in <module>
2024-05-03T22:44:09.217145402Z     from lighteval.logging.info_loggers import (
2024-05-03T22:44:09.217146831Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/logging/info_loggers.py", line 34, in <module>
2024-05-03T22:44:09.217290883Z     from lighteval.metrics import MetricCategory
2024-05-03T22:44:09.217297198Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/__init__.py", line 25, in <module>
2024-05-03T22:44:09.217298154Z     from lighteval.metrics.metrics import MetricCategory, Metrics
2024-05-03T22:44:09.217299180Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/metrics.py", line 34, in <module>
2024-05-03T22:44:09.217299880Z     from lighteval.metrics.metrics_sample import (
2024-05-03T22:44:09.217301038Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/metrics_sample.py", line 42, in <module>
2024-05-03T22:44:09.217370955Z     from lighteval.metrics.llm_as_judge import JudgeOpenAI
2024-05-03T22:44:09.217374502Z   File "/usr/local/lib/python3.10/dist-packages/lighteval/metrics/llm_as_judge.py", line 30, in <module>
2024-05-03T22:44:09.217375365Z     from openai import OpenAI
2024-05-03T22:44:09.217376370Z ModuleNotFoundError: No module named 'openai'
2024-05-03T22:44:09.724293220Z Traceback (most recent call last):
2024-05-03T22:44:09.724322586Z   File "/usr/local/bin/accelerate", line 8, in <module>
2024-05-03T22:44:09.724335712Z     sys.exit(main())
2024-05-03T22:44:09.724337033Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
2024-05-03T22:44:09.724338223Z     args.func(args)
2024-05-03T22:44:09.724339038Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1082, in launch_command
2024-05-03T22:44:09.724457425Z     simple_launcher(args)
2024-05-03T22:44:09.724458705Z   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 688, in simple_launcher
2024-05-03T22:44:09.724545203Z     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
2024-05-03T22:44:09.724558533Z subprocess.CalledProcessError: Command '['/usr/bin/python', 'run_evals_accelerate.py', '--model_args', 'pretrained=<model name>, '--use_chat_template', '--tasks', 'helm|commonsenseqa|0|0', '--output_dir=./evals/']' returned non-zero exit status 1.
2024-05-03T22:44:10.119077403Z Traceback (most recent call last):
2024-05-03T22:44:10.119092033Z   File "/lighteval/../llm-autoeval/main.py", line 129, in <module>
2024-05-03T22:44:10.119093509Z     raise ValueError(f"The directory {args.directory} does not exist.")
2024-05-03T22:44:10.119094286Z ValueError: The directory ./evals/results does not exist.

Container error logs from debug, eq-bench:

2024-05-04T00:49:11.418739844+02:00 Traceback (most recent call last):
2024-05-04T00:49:11.418762136+02:00   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-05-04T00:49:11.418772005+02:00     return _run_code(code, main_globals, None,
2024-05-04T00:49:11.418774500+02:00   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-05-04T00:49:11.418784579+02:00     exec(code, run_globals)
2024-05-04T00:49:11.418787154+02:00   File "/lm-evaluation-harness/lm_eval/__main__.py", line 401, in <module>
2024-05-04T00:49:11.418867446+02:00     cli_evaluate()
2024-05-04T00:49:11.418877675+02:00   File "/lm-evaluation-harness/lm_eval/__main__.py", line 333, in cli_evaluate
2024-05-04T00:49:11.418913252+02:00     results = evaluator.simple_evaluate(
2024-05-04T00:49:11.418924233+02:00   File "/lm-evaluation-harness/lm_eval/utils.py", line 316, in _wrapper
2024-05-04T00:49:11.418957566+02:00     return fn(*args, **kwargs)
2024-05-04T00:49:11.418964680+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 258, in simple_evaluate
2024-05-04T00:49:11.419002892+02:00     results = evaluate(
2024-05-04T00:49:11.419008102+02:00   File "/lm-evaluation-harness/lm_eval/utils.py", line 316, in _wrapper
2024-05-04T00:49:11.419049490+02:00     return fn(*args, **kwargs)
2024-05-04T00:49:11.419054069+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 592, in evaluate
2024-05-04T00:49:11.419127468+02:00     "n-samples": {
2024-05-04T00:49:11.419131345+02:00   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 595, in <dictcomp>
2024-05-04T00:49:11.419181300+02:00     "effective": min(limit, len(task_output.task.eval_docs)),
2024-05-04T00:49:11.419183995+02:00 TypeError: '<' not supported between instances of 'int' and 'NoneType'
2024-05-04T00:49:11.423606509+02:00 Passed argument batch_size = auto. Detecting largest batch size
2024-05-04T00:49:11.423615215+02:00 Determined Largest batch size: 4
2024-05-04T00:49:12.049730263+02:00 Traceback (most recent call last):
2024-05-04T00:49:12.049745021+02:00   File "/usr/local/bin/accelerate", line 8, in <module>
2024-05-04T00:49:12.049811316+02:00     sys.exit(main())
2024-05-04T00:49:12.049814512+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
2024-05-04T00:49:12.049956972+02:00     args.func(args)
2024-05-04T00:49:12.049966009+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1082, in launch_command
2024-05-04T00:49:12.050293249+02:00     simple_launcher(args)
2024-05-04T00:49:12.050303087+02:00   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 688, in simple_launcher
2024-05-04T00:49:12.050515349+02:00     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
2024-05-04T00:49:12.050519176+02:00 subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'lm_eval', '--model', 'hf', '--model_args', 'pretrained=<model name>,dtype=auto,trust_remote_code=False', '--tasks', 'eq_bench', '--num_fewshot', '0', '--batch_size', 'auto', '--output_path', './evals/eq-bench.json']' returned non-zero exit status 1.
2024-05-04T00:49:12.484988425+02:00 Traceback (most recent call last):
2024-05-04T00:49:12.485006620+02:00   File "/lm-evaluation-harness/../llm-autoeval/main.py", line 129, in <module>
2024-05-04T00:49:12.485009545+02:00     raise ValueError(f"The directory {args.directory} does not exist.")
2024-05-04T00:49:12.485011008+02:00 ValueError: The directory ./evals does not exist.
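
The eq-bench failure reduces to the harness calling min(limit, n) with limit left as None, while the lighteval one is a missing openai dependency in the image. Just to pin down the first bug in plain Python (this is an illustration of the failure mode, not a patch to the harness):

limit = None   # no --limit requested
n_docs = 171   # arbitrary example size

# min(limit, n_docs) -> TypeError: '<' not supported between instances of 'int' and 'NoneType'
effective = n_docs if limit is None else min(limit, n_docs)
print(effective)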

RunPod says 'no longer any instance available'

I was trying to use the notebook to run an eval on one NVIDIA A40 GPU with a 100 GB disk, and I got the following error:

---------------------------------------------------------------------------
QueryError                                Traceback (most recent call last)
<ipython-input-5-05063fbf8ce4> in <cell line: 41>()
     39 
     40 # Create a pod
---> 41 pod = runpod.create_pod(
     42     name=f"Eval {MODEL.split('/')[-1]} on {BENCHMARK.capitalize()}",
     43     image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",

1 frames
/usr/local/lib/python3.10/dist-packages/runpod/api/graphql.py in run_graphql_query(query)
     28 
     29     if "errors" in response.json():
---> 30         raise error.QueryError(
     31             response.json()["errors"][0]["message"],
     32             query

QueryError: There are no longer any instances available with the requested specifications. Please refresh and try again.

I have funds in my RunPod account, and a pod with an A40 can be created on runpod.io. Any ideas on how to go about debugging this?
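
One thing worth checking from the same notebook before calling create_pod is what the RunPod API itself reports as available, for example with the SDK's get_gpus helper (the exact fields returned may differ, and availability can vary between secure and community cloud):

import runpod

runpod.api_key = "<your RunPod API key>"

# List the GPU types RunPod currently knows about; compare against the requested A40.
for gpu in runpod.get_gpus():
    print(gpu.get("id"), gpu.get("displayName"))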

Issues with BF16 models

Hey... Thanks for the great work!

While trying to evaluate a BF16 model, I encountered an error in my RunPod container:
"triu_tril_cuda_template" not implemented for 'BFloat16' (pytorch/pytorch#101932).

Switching the image from runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04
to runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 fixed the issue.

I'm reporting this for others who may run into the same problem.
I don't know whether it makes sense to update the Colab notebook to a newer image, or whether that might reveal other problems.
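
For context, the error comes from the PyTorch issue linked above: torch.triu/torch.tril on CUDA had no BFloat16 kernel in the 2.0.x builds that the older image ships. A minimal check/reproduction you can run inside a container to see which side you're on:

import torch

print(torch.__version__)
x = torch.ones(4, 4, dtype=torch.bfloat16, device="cuda")
# RuntimeError ("triu_tril_cuda_template" not implemented for 'BFloat16') on torch 2.0.x,
# works on the newer image's torch 2.2.x.
print(torch.triu(x))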

Bug with lighteval when running from the notebook

At the end of the run, I am getting:

2024-03-25T13:57:22.568287661Z Traceback (most recent call last):
2024-03-25T13:57:22.568325169Z   File "/lm-evaluation-harness/../llm-autoeval/main.py", line 7, in <module>
2024-03-25T13:57:22.568330133Z     from lighteval.evaluator import make_results_table
2024-03-25T13:57:22.568333613Z ModuleNotFoundError: No module named 'lighteval'
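
Until the image installs lighteval, one possible workaround is to defer the import in main.py so it only happens when the lighteval benchmark is actually selected. This is only a sketch of the idea, not the repository's actual fix, and the function/flag names are illustrative:

def build_results_table(results, benchmark):
    if benchmark == "lighteval":
        # Imported lazily so non-lighteval runs don't need the package installed.
        from lighteval.evaluator import make_results_table
        return make_results_table(results)
    # ... fall back to the existing summary path for lm-evaluation-harness benchmarks ...
    raise NotImplementedError("non-lighteval summary omitted in this sketch")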
