confident-ai / deepeval
The LLM Evaluation Framework
Home Page: https://docs.confident-ai.com/
License: Apache License 2.0
When bulk reviewing a dataset, we need to add a context
column to make sure we review it properly.
Add a text categorisation approach based on AnyScale's blog article: https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper
It would be useful to have DeepEval ML models power the validators in Guardrails AI:
https://github.com/ShreyaR/guardrails
For this, I think it would be useful to just have a guide on how to write a Guardrails validator using DeepEval.
This should be a fairly straightforward tutorial.
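Independent of Guardrails' exact validator interface (not pinned down here), a minimal sketch of the guard logic such a guide could build on, reusing the FactualConsistencyMetric mentioned elsewhere in this tracker; the import path, threshold, and function name are assumptions:

```python
from deepeval.metrics.factual_consistency import FactualConsistencyMetric  # assumed path

def guard_factual_consistency(output: str, context: str, threshold: float = 0.8) -> str:
    """Hypothetical Guardrails-style validator body: reject an LLM output
    whose factual consistency against the retrieved context is too low."""
    metric = FactualConsistencyMetric(minimum_score=threshold)
    score = metric.measure(output=output, context=context)
    if not metric.is_successful():
        raise ValueError(f"Failed factual consistency guard: score {score} < {threshold}")
    return output
```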
For the CLI, we want to be able to record the aggregate metrics at the end of a test run
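As a sketch of what "aggregate metrics" could mean here, mirroring the TestResult fields (metric_name, score, success) visible in the run_test.py excerpt later in this page; this is illustrative, not the planned CLI code:

```python
from collections import defaultdict

def aggregate_results(test_results) -> dict:
    """Summarize a test run: per-metric average score and pass rate,
    computed from TestResult objects (fields: metric_name, score, success)."""
    scores, passes = defaultdict(list), defaultdict(list)
    for r in test_results:
        scores[r.metric_name].append(r.score)
        passes[r.metric_name].append(r.success)
    return {
        name: {
            "avg_score": sum(vals) / len(vals),
            "pass_rate": sum(passes[name]) / len(passes[name]),
        }
        for name, vals in scores.items()
    }
```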
Add an evaluation framework for structured output for LLMs
Is your feature request related to a problem? Please describe.
Add a way to evaluate SQL queries based on maximizing info gain while minimizing the number of rows, for synthetic query generation. Minimizing the number of rows is important for
Describe the solution you'd like
Warning: this API is a WIP. Very open to suggestions.
```
from deepeval.sql import SQLEval

table = SQLEval.load_table(...)
```
Describe alternatives you've considered
Can't really see other alternatives for SQL tables right now.
Additional context
May require a bit of work around building SQL ingestion, and also providing a frontend to make viewing the created table very simple.
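One way the "info gain vs. row count" trade-off could be scored, as a toy heuristic under stated assumptions (entropy over the first result column as the info-gain proxy), not the proposed SQLEval API:

```python
import math
import sqlite3
from collections import Counter

def query_score(conn: sqlite3.Connection, sql: str) -> float:
    """Toy heuristic: reward informative result sets (entropy over the
    first column's values) and penalize row count, so compact but
    informative queries rank highest."""
    rows = conn.execute(sql).fetchall()
    if not rows:
        return 0.0
    counts = Counter(row[0] for row in rows)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / len(rows)  # info gain proxy per row returned
```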
With the improvements in the package, the LangChain guide will need to be updated to demonstrate the new capabilities.
As the CLI flow gets more and more ironed out - will need to add tests to ensure the developer onboarding flow doesn't break.
Hey guys,
I just started exploring your great library today and was curious to understand the factual consistency metric.
Maybe I didn't get it right, but why do we have to create chunks of our context? The chunks seem to have no impact at all, since
scores = self.model.predict([(context, output), (output, context)])
is always called with the full context and output, hence producing the same scores in every loop iteration. The max_score can already be found in the first iteration.
Code: deepeval/metrics/factual_consistency.py:19-32
```
def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        scores = self.model.predict([(context, output), (output, context)])
        print(scores)
        # https://huggingface.co/cross-encoder/nli-deberta-base
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        score = softmax_scores[0][1]
        if score > max_score:
            max_score = score
        second_score = softmax_scores[1][1]
        if second_score > max_score:
            max_score = second_score
```
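Assuming the chunking was intended to matter (my reading of the code, not a maintainer-confirmed fix), the loop presumably meant to score each chunk `c` rather than the full context:

```python
for c in context_list:
    # Score each chunk against the output in both directions,
    # instead of re-scoring the identical full context every iteration.
    scores = self.model.predict([(c, output), (output, c)])
```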
The use case for this: businesses/enterprises will often want to ensure that a specific sentence matches the tone in which the person said something. This check would be perfect for that.
Improve support for Jupyter notebooks by showing how to log data
Add RAGAS metrics to DeepEval.
Key metrics that would be useful:
```
========================================================= warnings summary =========================================================
../../../../../opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204
  /opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: deepeval
    self._mark_plugins_for_rewrite(hook)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)
```
More than one metric score will need to be recorded per test run for the dashboard to be more useful.
Add test case name and filename when logging test cases to the API
Hello deepeval maintainers and community,
I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.
Here are the three specific flows I'd like to test:
Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.
Question: does the deepeval library support this use case of pre-configuring multiple conversation flows?
Thank you for your time, and I look forward to your response!
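A scripted mock user can be layered on top of the turn-level primitives that appear elsewhere on this page (LLMTestCase, assert_test); the import paths, the flow script, and production_bot are placeholders, not a supported deepeval API:

```python
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric  # assumed path
from deepeval.run_test import assert_test
from deepeval.test_case import LLMTestCase

def production_bot(message: str) -> str:
    """Placeholder for the chatbot under test."""
    raise NotImplementedError

# One pre-configured "buying a product" flow, driven by a scripted mock user.
purchase_flow = [
    "I'm looking for a laptop under $1000.",
    "Does it come with a warranty?",
    "Great, how do I check out?",
]

metric = AnswerRelevancyMetric(minimum_score=0.5)
for user_turn in purchase_flow:
    bot_reply = production_bot(user_turn)
    assert_test(LLMTestCase(query=user_turn, output=bot_reply), [metric])
```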
Prompt to use for LLMEvalMetric:
```
We provide a question and the 'ground-truth' answer. We also provide
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity
to the ground-truth. If details provided in predicted answer are reflected
in the ground-truth answer, return "YES". To return "YES", the details don't
need to exactly match. Be lenient in evaluation if the predicted answer
is missing a few details. Try to make sure that there are no blatant mistakes.
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result:
```
As featured in this guide:
https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
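A minimal sketch of how this template could drive a binary eval; `llm` is a placeholder callable (prompt string in, completion string out), and the elided middle of the template is the prompt text above:

```python
EVAL_TEMPLATE = (
    "We provide a question and the 'ground-truth' answer. We also provide "
    "the predicted answer.\n"
    "...\n"  # full instruction text shown above, ending with: Otherwise, return "NO".
    "Question: {question}\n"
    "Ground-truth Answer: {gt_answer}\n"
    "Predicted Answer: {pred_answer}\n"
    "Evaluation Result: "
)

def llm_eval_correctness(question: str, gt_answer: str, pred_answer: str, llm) -> bool:
    """Fill the template and map the model's YES/NO verdict to a bool."""
    prompt = EVAL_TEMPLATE.format(
        question=question, gt_answer=gt_answer, pred_answer=pred_answer
    )
    return llm(prompt).strip().upper().startswith("YES")
```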
As Microsoft Guidance is a guidance language for controlling LLMs, an integration here could be quite useful.
Hey guys,
I see some LiteLLM docs in this repo - curious, did y'all fork it? Totally cool if so, just wondering why fork vs. using the package?
aka a 'Brevity' metric: a simple character count of the answer (this metric might be somewhere else in the code, but I'm not seeing it?). Some models could score higher on relevancy but with more words (in theory); an 'answer length' metric would control for that.
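A minimal sketch of what such a metric could look like, following the measure()/is_successful() pattern visible in the run_test.py excerpt further down this page; the class name, threshold semantics, and lack of a base class are assumptions:

```python
class BrevityMetric:
    """Hypothetical 'answer length' metric: scores the raw character
    count of the output and passes when it stays under a maximum."""

    __name__ = "Brevity"

    def __init__(self, max_chars: int = 500):
        self.max_chars = max_chars
        self.success = False

    def measure(self, test_case) -> float:
        score = len(test_case.output)
        self.success = score <= self.max_chars
        return score

    def is_successful(self) -> bool:
        return self.success
```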
Looks like you've got an API key in your docs here: https://docs.confident-ai.com/docs/tutorials/evaluating-langchain
embeddings = OpenAIEmbeddings(openai_api_key=....)
Is your feature request related to a problem? Please describe.
If you are an engineer, it would be really important to be able to version and compare evaluation results the way you would compare git branches.
Describe the solution you'd like
```
git checkout -b feature/add-guidance
# To compare this against the most recent branch in terms of performance (which is saved/cached)
deepeval compare
# To compare against the main branch
deepeval compare main
```
Ideally it should then output a table comparing the results.
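A sketch of the comparison step, assuming cached per-branch results are stored as JSON mapping metric name to average score; the file layout and function name are invented for illustration:

```python
import json

def compare_runs(current_path: str, baseline_path: str) -> None:
    """Print a per-metric score table for two cached runs (JSON: {metric: avg_score})."""
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    print(f"{'metric':<25}{'baseline':>10}{'current':>10}{'delta':>8}")
    for name in sorted(set(current) & set(baseline)):
        delta = current[name] - baseline[name]
        print(f"{name:<25}{baseline[name]:>10.3f}{current[name]:>10.3f}{delta:>+8.3f}")
```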
Factual consistency can be significantly improved with a larger model, but larger models can have issues when running in environments with limited GPU RAM. Add a section to the documentation on improving performance under these constraints.
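One pattern such a docs section could describe, as a hypothetical helper; the model names are the cross-encoder family already referenced in this tracker, and the memory threshold is illustrative:

```python
import torch

def pick_nli_model() -> str:
    """Fall back to a smaller cross-encoder when GPU memory is tight."""
    if not torch.cuda.is_available():
        return "cross-encoder/nli-deberta-base"  # CPU: keep it small
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    if total_bytes >= 12 * 1024**3:  # >= 12 GiB: afford a larger model
        return "cross-encoder/nli-deberta-v3-large"
    return "cross-encoder/nli-deberta-base"
```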
Downloading models (may take up to 2 minutes if running for the first time)...
```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 32, in run
    self.live.refresh()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 241, in refresh
    with self.console:
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 864, in __exit__
    self._exit_buffer()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 2038, in _check_buffer
    write(text)
  File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u283c' in position 10: character maps to <undefined>
```
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
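Besides the PYTHONIOENCODING environment variable, a Python-side workaround is to reconfigure stdout before deepeval starts rendering (sys.stdout.reconfigure is standard since Python 3.7; whether deepeval should do this itself is an open question):

```python
import sys

# Force UTF-8 output on Windows consoles that default to cp1252,
# so rich's spinner glyphs (e.g. '\u283c') can be encoded.
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")
```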
```
FF.s                                                                                                                         [100%]
============================================================ FAILURES =============================================================
_____________________________________________________________ test_1 ______________________________________________________________

    def test_1():
        # Check to make sure it is relevant
        query = "What is the capital of France?"
        output = "The capital of France is Paris."
        metric = RandomMetric()
        # Comment this out for different metrics/models
        # metric = AnswerRelevancyMetric(minimum_score=0.5)
        test_case = LLMTestCase(query=query, output=output)
>       assert_test(test_case, [metric])

test_sample.py:18:
venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
    return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
    measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
    raise last_error  # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
    result = func(*args, **kwargs)

    @retry(
        max_retries=max_retries, delay=delay, min_success=min_success
    )
    def measure_metric():
        score = metric.measure(test_case)
        success = metric.is_successful()
        if isinstance(test_case, LLMTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                context=test_case.context if test_case.context else "-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                metadata=None,
                context=test_case.context,
            )
        elif isinstance(test_case, SearchTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=str(test_case.output_list)
                if test_case.output_list
                else "-",
                expected_output=str(test_case.golden_list)
                if test_case.golden_list
                else "-",
                context="-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output_list
                if test_case.output_list
                else "-",
                expected_output=test_case.golden_list
                if test_case.golden_list
                else "-",
                metadata=None,
                context="-",
            )
        else:
            raise ValueError("TestCase not supported yet.")
        test_results.append(test_result)
        if raise_error:
>           assert (
                metric.is_successful()
            ), f"{metric.__name__} failed. Score: {score}."
E           AssertionError: Random failed. Score: 0.2304173247532807.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2304173247532807.
Max retries (1) exceeded.
_____________________________________________________________ test_2 ______________________________________________________________

    def test_2():
        # Check to make sure it is factually consistent
        output = "Cells have many major components, including the cell membrane, nucleus, mitochondria, and endoplasmic reticulum."
        context = "Biology"
        metric = RandomMetric()
        # Comment this out for factual consistency tests
        # metric = FactualConsistencyMetric(minimum_score=0.8)
        test_case = LLMTestCase(output=output, context=context)
>       assert_test(test_case, [metric])

test_sample.py:29:
venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
    return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
    measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
    raise last_error  # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
    result = func(*args, **kwargs)

    @retry(
        max_retries=max_retries, delay=delay, min_success=min_success
    )
    def measure_metric():
        score = metric.measure(test_case)
        success = metric.is_successful()
        if isinstance(test_case, LLMTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                context=test_case.context if test_case.context else "-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                metadata=None,
                context=test_case.context,
            )
        elif isinstance(test_case, SearchTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=str(test_case.output_list)
                if test_case.output_list
                else "-",
                expected_output=str(test_case.golden_list)
                if test_case.golden_list
                else "-",
                context="-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output_list
                if test_case.output_list
                else "-",
                expected_output=test_case.golden_list
                if test_case.golden_list
                else "-",
                metadata=None,
                context="-",
            )
        else:
            raise ValueError("TestCase not supported yet.")
        test_results.append(test_result)
        if raise_error:
>           assert (
                metric.is_successful()
            ), f"{metric.__name__} failed. Score: {score}."
E           AssertionError: Random failed. Score: 0.2175566574653277.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2175566574653277.
Max retries (1) exceeded.
====================================================== slowest 10 durations =======================================================
4.99s call test_sample.py::test_1
2.37s call test_sample.py::test_2
2.32s call test_sample.py::test_3
(7 durations < 0.005s hidden. Use -vv to show these durations.)
===================================================== short test summary info =====================================================
FAILED test_sample.py::test_1 - AssertionError: Random failed. Score: 0.2304173247532807.
FAILED test_sample.py::test_2 - AssertionError: Random failed. Score: 0.2175566574653277.
2 failed, 1 passed, 1 skipped in 10.94s

✓ Tests finished! View results on https://app.confident-ai.com/
```
To check for hallucination, we can perform the following:
Develop an automated way to create an evaluation dataset with edge cases so that users don't have to write tests. Then make it super easy to run!
New Design Plan:
For Overall Score, implement the score breakdown to better understand what goes wrong with the score
The tutorial has a bug https://docs.confident-ai.com/docs/tutorials/evaluating-langchain
query = "Who is the president?
should be query = "Who is the president?"
Measure the number of times an LLM is "unsure" of something or "unwilling" to answer. This is a growing pain point of LLMs. Research papers may not have caught up in this area yet, unfortunately, so some form of Conceptual Similarity against a few such refusal prompts should be the easiest way to do this.
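A sketch of the Conceptual Similarity idea using sentence embeddings; the use of sentence-transformers, the refusal templates, and the threshold are all illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

REFUSAL_TEMPLATES = [
    "I'm not sure about that.",
    "I don't have enough information to answer.",
    "I'm sorry, but I can't help with that request.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
refusal_embeddings = model.encode(REFUSAL_TEMPLATES, convert_to_tensor=True)

def is_refusal(output: str, threshold: float = 0.6) -> bool:
    """Flag an answer as 'unsure/unwilling' when it is semantically close
    to any canned refusal template."""
    emb = model.encode(output, convert_to_tensor=True)
    return util.cos_sim(emb, refusal_embeddings).max().item() >= threshold
```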
Currently planning how this looks at the moment (and where we fit in with guardrails).
deepeval auto-generate sample.txt
This would save questions and answers inside a CSV. It would generate a question for each line of the text file.
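A sketch of what such a command could do internally; generate_question is a placeholder for whatever LLM call produces a question per line:

```python
import csv

def auto_generate(text_path: str, csv_path: str, generate_question) -> None:
    """For each non-empty line of the input file, treat the line as the
    answer, ask an LLM for a matching question, and save both to a CSV."""
    with open(text_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["question", "answer"])
        for line in src:
            answer = line.strip()
            if answer:
                writer.writerow([generate_question(answer), answer])
```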
Add an unstructured integration to make creating synthetic data for benchmarking LLM applications from data sources a lot easier!
Support running parallelized tests with pytest-xdist.
Desired developer experience:
```
# Distribute the test run across 4 workers
deepeval test run test_sample.py -num_workers 4
```
Motivation:
One problem we're running into is that it currently takes too long to run unit tests for LLMs, and it's a bad developer experience for engineers to wait 5-10 minutes for their test results.
For example, it takes around 30 seconds to generate an LLM output, depending on the length of the response, so even 10 tests (which isn't a lot) take roughly five minutes when run serially.
We're hoping to fix this with this issue.
I think it's a good idea to integrate with LiteLLM, as it provides an easy way to integrate with 100+ LLMs.
https://github.com/BerriAI/litellm
"LLM as a drop-in replacement for GPT. Use Azure, OpenAI, Cohere, Anthropic, Ollama, VLLM, Sagemaker, HuggingFace, Replicate (100+ LLMs)"
Looking to potentially implement COMET, Unbabel's neural framework for evaluating machine translation, to compare against other AI companies.
Unbabel's COMET:
https://github.com/Unbabel/COMET
This can be important to ensure consistent translation performance when changing models for different queries.
assert_translation_performance(...)
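For reference, scoring with COMET directly looks roughly like this (the checkpoint name and call shape follow Unbabel's published examples, but treat them as assumptions; assert_translation_performance above is the proposed DeepEval wrapper, not shown here):

```python
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Bonjour le monde",  # source sentence
    "mt": "Hello world",        # machine translation under test
    "ref": "Hello, world",      # human reference
}]
# Higher segment scores indicate better translations.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)
```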
Integrate the EvaluationDataset class with HuggingFace datasets.
Developer experience should look something like:
EvaluationDataset.from_huggingface_dataset(...)
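A sketch of how such a constructor might be wired up, using the real datasets.load_dataset API; the import paths, column-name parameters, and TestCase/EvaluationDataset internals are assumptions based on the synthetic-data issue below:

```python
from datasets import load_dataset

from deepeval.dataset import EvaluationDataset  # assumed module path
from deepeval.test_case import TestCase         # see deepeval/test_case.py

def from_huggingface_dataset(path: str, query_column: str,
                             output_column: str, split: str = "train") -> EvaluationDataset:
    """Hypothetical helper: map a HuggingFace dataset's columns onto
    TestCase(query=..., expected_output=...) entries."""
    rows = load_dataset(path, split=split)
    return EvaluationDataset(test_cases=[
        TestCase(query=r[query_column], expected_output=r[output_column])
        for r in rows
    ])
```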
I added FallacyChain to LangChain last month (it checks for logical fallacies; link below). Thinking I could add similar functionality for natural-language model output here. Thoughts?
https://python.langchain.com/docs/guides/safety/logical_fallacy_chain
A super valuable integration would be Cohere's Reranker. Will need to check whether Cohere has anything useful that could help evaluate LLMs.
Add a way to create synthetic data with ChatGPT, and create an EvaluationDataset class that allows you to process things in bulk.
dataset = create_query_answer_pairs(text="""Your content goes here""")
Expected output: the evaluation dataset will be filled with TestCase classes. A TestCase should have a query and an expected_output. See deepeval/test_case.py.
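Under those expectations, downstream usage of the filled dataset might look like this (the iteration interface is an assumption; create_query_answer_pairs is the function proposed above):

```python
dataset = create_query_answer_pairs(text="""Your content goes here""")

# Each generated pair arrives as a TestCase with a query and expected_output.
for test_case in dataset:
    print(test_case.query, "->", test_case.expected_output)
```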
Add functionality to calculate the average score across multiple runs.
This ensures a fair comparison without having to view each test run separately.