confident-ai / deepeval
The LLM Evaluation Framework
Home Page: https://docs.confident-ai.com/
License: Apache License 2.0
When bulk reviewing a dataset, we need to add a context
column to make sure we review it properly.
Add a text categorisation approach based on AnyScale's blog article: https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper
It would be useful to have DeepEval ML models power the validators in Guardrails AI:
https://github.com/ShreyaR/guardrails
For this, I think it would be useful to just have a guide on how to write a Guardrails validator using DeepEval.
This should be a fairly straightforward tutorial.
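Independent of Guardrails' exact validator interface (not pinned down here), a minimal sketch of the guard logic such a guide could build on, reusing the FactualConsistencyMetric mentioned elsewhere in this tracker; the import path, threshold, and function name are assumptions:

```python
from deepeval.metrics.factual_consistency import FactualConsistencyMetric  # assumed path

def guard_factual_consistency(output: str, context: str, threshold: float = 0.8) -> str:
    """Hypothetical Guardrails-style validator body: reject an LLM output
    whose factual consistency against the retrieved context is too low."""
    metric = FactualConsistencyMetric(minimum_score=threshold)
    score = metric.measure(output=output, context=context)
    if not metric.is_successful():
        raise ValueError(f"Failed factual consistency guard: score {score} < {threshold}")
    return output
```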
For the CLI, we want to be able to record the aggregate metrics at the end of a test run
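As a sketch of what "aggregate metrics" could mean here, mirroring the TestResult fields (metric_name, score, success) visible in the run_test.py excerpt later in this page; this is illustrative, not the planned CLI code:

```python
from collections import defaultdict

def aggregate_results(test_results) -> dict:
    """Summarize a test run: per-metric average score and pass rate,
    computed from TestResult objects (fields: metric_name, score, success)."""
    scores, passes = defaultdict(list), defaultdict(list)
    for r in test_results:
        scores[r.metric_name].append(r.score)
        passes[r.metric_name].append(r.success)
    return {
        name: {
            "avg_score": sum(vals) / len(vals),
            "pass_rate": sum(passes[name]) / len(passes[name]),
        }
        for name, vals in scores.items()
    }
```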
Add an evaluation framework for structured output for LLMs
Is your feature request related to a problem? Please describe.
Add a way to evaluate SQL queries based on maximizing info gain while minimizing the number of rows, for synthetic query generation. Minimizing the number of rows is important for
Describe the solution you'd like
Warning: this API is a WIP. Very open to suggestions.
```
from deepeval.sql import SQLEval

table = SQLEval.load_table(...)
```
Describe alternatives you've considered
Can't really see other alternatives for SQL tables right now.
Additional context
May require a bit of work around building SQL ingestion, and also providing a frontend to make viewing the created table very simple.
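One way the "info gain vs. row count" trade-off could be scored, as a toy heuristic under stated assumptions (entropy over the first result column as the info-gain proxy), not the proposed SQLEval API:

```python
import math
import sqlite3
from collections import Counter

def query_score(conn: sqlite3.Connection, sql: str) -> float:
    """Toy heuristic: reward informative result sets (entropy over the
    first column's values) and penalize row count, so compact but
    informative queries rank highest."""
    rows = conn.execute(sql).fetchall()
    if not rows:
        return 0.0
    counts = Counter(row[0] for row in rows)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / len(rows)  # info gain proxy per row returned
```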
With the improvements in the package, the LangChain guide will need to be updated to demonstrate the new capabilities.
As the CLI flow gets more and more ironed out - will need to add tests to ensure the developer onboarding flow doesn't break.
Hey guys,
I just started exploring your great library today and was curious to understand the factual consistency metric.
Maybe I didn't get it right, but why do we have to create chunks of our context? The chunks seem to have no impact at all, since
scores = self.model.predict([(context, output), (output, context)])
is always called with the full context and output, hence producing the same scores in every loop iteration. The max_score can already be found in the first iteration.
Code: deepeval/metrics/factual_consistency.py:19-32
```
def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        scores = self.model.predict([(context, output), (output, context)])
        print(scores)
        # https://huggingface.co/cross-encoder/nli-deberta-base
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        score = softmax_scores[0][1]
        if score > max_score:
            max_score = score
        second_score = softmax_scores[1][1]
        if second_score > max_score:
            max_score = second_score
```
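Assuming the chunking was intended to matter (my reading of the code, not a maintainer-confirmed fix), the loop presumably meant to score each chunk `c` rather than the full context:

```python
for c in context_list:
    # Score each chunk against the output in both directions,
    # instead of re-scoring the identical full context every iteration.
    scores = self.model.predict([(c, output), (output, c)])
```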
The use case for this: businesses/enterprises will often want to ensure that a specific sentence matches the tone in which the person said something. This check would be perfect for that.
Improve support for Jupyter notebooks by showing how to log data
Add RAGAS metrics to DeepEval.
Key metrics that would be useful:
```
========================================================= warnings summary =========================================================
../../../../../opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204
  /opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: deepeval
    self._mark_plugins_for_rewrite(hook)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)
```
More than one metric score will need to be recorded per test run for the dashboard to be more useful.
Add test case name and filename when logging test cases to the API
Hello deepeval maintainers and community,
I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.
Here are the three specific flows I'd like to test:
Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.
Question: does the deepeval library support this use case of pre-configuring multiple conversation flows?
Thank you for your time, and I look forward to your response!
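A scripted mock user can be layered on top of the turn-level primitives that appear elsewhere on this page (LLMTestCase, assert_test); the import paths, the flow script, and production_bot are placeholders, not a supported deepeval API:

```python
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric  # assumed path
from deepeval.run_test import assert_test
from deepeval.test_case import LLMTestCase

def production_bot(message: str) -> str:
    """Placeholder for the chatbot under test."""
    raise NotImplementedError

# One pre-configured "buying a product" flow, driven by a scripted mock user.
purchase_flow = [
    "I'm looking for a laptop under $1000.",
    "Does it come with a warranty?",
    "Great, how do I check out?",
]

metric = AnswerRelevancyMetric(minimum_score=0.5)
for user_turn in purchase_flow:
    bot_reply = production_bot(user_turn)
    assert_test(LLMTestCase(query=user_turn, output=bot_reply), [metric])
```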
Prompt to use for LLMEvalMetric:
```
We provide a question and the 'ground-truth' answer. We also provide
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity
to the ground-truth. If details provided in predicted answer are reflected
in the ground-truth answer, return "YES". To return "YES", the details don't
need to exactly match. Be lenient in evaluation if the predicted answer
is missing a few details. Try to make sure that there are no blatant mistakes.
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result:
```
As featured in this guide:
https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
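A minimal sketch of how this template could drive a binary eval; `llm` is a placeholder callable (prompt string in, completion string out), and the elided middle of the template is the prompt text above:

```python
EVAL_TEMPLATE = (
    "We provide a question and the 'ground-truth' answer. We also provide "
    "the predicted answer.\n"
    "...\n"  # full instruction text shown above, ending with: Otherwise, return "NO".
    "Question: {question}\n"
    "Ground-truth Answer: {gt_answer}\n"
    "Predicted Answer: {pred_answer}\n"
    "Evaluation Result: "
)

def llm_eval_correctness(question: str, gt_answer: str, pred_answer: str, llm) -> bool:
    """Fill the template and map the model's YES/NO verdict to a bool."""
    prompt = EVAL_TEMPLATE.format(
        question=question, gt_answer=gt_answer, pred_answer=pred_answer
    )
    return llm(prompt).strip().upper().startswith("YES")
```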
As Microsoft Guidance is a guidance language for controlling LLMs, an integration here could be quite useful.
Hey guys,
I see some LiteLLM docs in this repo - curious, did y'all fork it? Totally cool if so, just wondering why fork vs. using the package?
aka a 'Brevity' metric: a simple character count of the answer (this metric might be somewhere else in the code, but I'm not seeing it?). Some models could score higher on relevancy but with more words (in theory); an 'answer length' metric would control for that.
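A minimal sketch of what such a metric could look like, following the measure()/is_successful() pattern visible in the run_test.py excerpt further down this page; the class name, threshold semantics, and lack of a base class are assumptions:

```python
class BrevityMetric:
    """Hypothetical 'answer length' metric: scores the raw character
    count of the output and passes when it stays under a maximum."""

    __name__ = "Brevity"

    def __init__(self, max_chars: int = 500):
        self.max_chars = max_chars
        self.success = False

    def measure(self, test_case) -> float:
        score = len(test_case.output)
        self.success = score <= self.max_chars
        return score

    def is_successful(self) -> bool:
        return self.success
```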
Looks like you've got an API key in your docs here: https://docs.confident-ai.com/docs/tutorials/evaluating-langchain
embeddings = OpenAIEmbeddings(openai_api_key=....)
Is your feature request related to a problem? Please describe.
If you are an engineer, it would be really important to be able to version and compare evaluation results the way you would compare git branches.
Describe the solution you'd like
```
git checkout -b feature/add-guidance
# To compare this against the most recent branch in terms of performance (which is saved/cached)
deepeval compare
# To compare against the main branch
deepeval compare main
```
Ideally it should then output a table comparing the results.
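A sketch of the comparison step, assuming cached per-branch results are stored as JSON mapping metric name to average score; the file layout and function name are invented for illustration:

```python
import json

def compare_runs(current_path: str, baseline_path: str) -> None:
    """Print a per-metric score table for two cached runs (JSON: {metric: avg_score})."""
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    print(f"{'metric':<25}{'baseline':>10}{'current':>10}{'delta':>8}")
    for name in sorted(set(current) & set(baseline)):
        delta = current[name] - baseline[name]
        print(f"{name:<25}{baseline[name]:>10.3f}{current[name]:>10.3f}{delta:>+8.3f}")
```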
Factual consistency can be significantly improved with a larger model, but larger models can have issues when running in environments with limited GPU RAM. Add a section to the documentation on improving performance under these constraints.
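One pattern such a docs section could describe, as a hypothetical helper; the model names are the cross-encoder family already referenced in this tracker, and the memory threshold is illustrative:

```python
import torch

def pick_nli_model() -> str:
    """Fall back to a smaller cross-encoder when GPU memory is tight."""
    if not torch.cuda.is_available():
        return "cross-encoder/nli-deberta-base"  # CPU: keep it small
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    if total_bytes >= 12 * 1024**3:  # >= 12 GiB: afford a larger model
        return "cross-encoder/nli-deberta-v3-large"
    return "cross-encoder/nli-deberta-base"
```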
Downloading models (may take up to 2 minutes if running for the first time)...
```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 32, in run
    self.live.refresh()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 241, in refresh
    with self.console:
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 864, in __exit__
    self._exit_buffer()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 2038, in _check_buffer
    write(text)
  File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u283c' in position 10: character maps to <undefined>
```
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
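Besides the PYTHONIOENCODING environment variable, a Python-side workaround is to reconfigure stdout before deepeval starts rendering (sys.stdout.reconfigure is standard since Python 3.7; whether deepeval should do this itself is an open question):

```python
import sys

# Force UTF-8 output on Windows consoles that default to cp1252,
# so rich's spinner glyphs (e.g. '\u283c') can be encoded.
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")
```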
```
FF.s                                                                                                                         [100%]
============================================================ FAILURES =============================================================
_____________________________________________________________ test_1 ______________________________________________________________

    def test_1():
        # Check to make sure it is relevant
        query = "What is the capital of France?"
        output = "The capital of France is Paris."
        metric = RandomMetric()
        # Comment this out for different metrics/models
        # metric = AnswerRelevancyMetric(minimum_score=0.5)
        test_case = LLMTestCase(query=query, output=output)
>       assert_test(test_case, [metric])

test_sample.py:18:
venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
    return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
    measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
    raise last_error  # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
    result = func(*args, **kwargs)

    @retry(
        max_retries=max_retries, delay=delay, min_success=min_success
    )
    def measure_metric():
        score = metric.measure(test_case)
        success = metric.is_successful()
        if isinstance(test_case, LLMTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                context=test_case.context if test_case.context else "-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                metadata=None,
                context=test_case.context,
            )
        elif isinstance(test_case, SearchTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=str(test_case.output_list)
                if test_case.output_list
                else "-",
                expected_output=str(test_case.golden_list)
                if test_case.golden_list
                else "-",
                context="-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output_list
                if test_case.output_list
                else "-",
                expected_output=test_case.golden_list
                if test_case.golden_list
                else "-",
                metadata=None,
                context="-",
            )
        else:
            raise ValueError("TestCase not supported yet.")
        test_results.append(test_result)
        if raise_error:
>           assert (
                metric.is_successful()
            ), f"{metric.__name__} failed. Score: {score}."
E           AssertionError: Random failed. Score: 0.2304173247532807.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2304173247532807.
Max retries (1) exceeded.
_____________________________________________________________ test_2 ______________________________________________________________

    def test_2():
        # Check to make sure it is factually consistent
        output = "Cells have many major components, including the cell membrane, nucleus, mitochondria, and endoplasmic reticulum."
        context = "Biology"
        metric = RandomMetric()
        # Comment this out for factual consistency tests
        # metric = FactualConsistencyMetric(minimum_score=0.8)
        test_case = LLMTestCase(output=output, context=context)
>       assert_test(test_case, [metric])

test_sample.py:29:
venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
    return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
    measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
    raise last_error  # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
    result = func(*args, **kwargs)

    @retry(
        max_retries=max_retries, delay=delay, min_success=min_success
    )
    def measure_metric():
        score = metric.measure(test_case)
        success = metric.is_successful()
        if isinstance(test_case, LLMTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                context=test_case.context if test_case.context else "-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output if test_case.output else "-",
                expected_output=test_case.expected_output
                if test_case.expected_output
                else "-",
                metadata=None,
                context=test_case.context,
            )
        elif isinstance(test_case, SearchTestCase):
            log(
                success=success,
                score=score,
                metric=metric,
                query=test_case.query if test_case.query else "-",
                output=str(test_case.output_list)
                if test_case.output_list
                else "-",
                expected_output=str(test_case.golden_list)
                if test_case.golden_list
                else "-",
                context="-",
            )
            test_result = TestResult(
                success=success,
                score=score,
                metric_name=metric.__name__,
                query=test_case.query if test_case.query else "-",
                output=test_case.output_list
                if test_case.output_list
                else "-",
                expected_output=test_case.golden_list
                if test_case.golden_list
                else "-",
                metadata=None,
                context="-",
            )
        else:
            raise ValueError("TestCase not supported yet.")
        test_results.append(test_result)
        if raise_error:
>           assert (
                metric.is_successful()
            ), f"{metric.__name__} failed. Score: {score}."
E           AssertionError: Random failed. Score: 0.2175566574653277.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2175566574653277.
Max retries (1) exceeded.
====================================================== slowest 10 durations =======================================================
4.99s call test_sample.py::test_1
2.37s call test_sample.py::test_2
2.32s call test_sample.py::test_3
(7 durations < 0.005s hidden. Use -vv to show these durations.)
===================================================== short test summary info =====================================================
FAILED test_sample.py::test_1 - AssertionError: Random failed. Score: 0.2304173247532807.
FAILED test_sample.py::test_2 - AssertionError: Random failed. Score: 0.2175566574653277.
2 failed, 1 passed, 1 skipped in 10.94s

✓ Tests finished! View results on https://app.confident-ai.com/
```
To check for hallucination, we can perform the following:
Develop an automated way to create an evaluation dataset with edge cases so that users don't have to write tests. Then make it super easy to run!
New Design Plan:
For Overall Score, implement the score breakdown to better understand what goes wrong with the score
The tutorial has a bug https://docs.confident-ai.com/docs/tutorials/evaluating-langchain
query = "Who is the president?
should be query = "Who is the president?"
Measure the number of times an LLM is "unsure" of something or "unwilling" to answer. This is a growing pain point of LLMs. Research papers may not have caught up in this area yet, unfortunately, so some form of Conceptual Similarity against a few such refusal prompts should be the easiest way to do this.
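A sketch of the Conceptual Similarity idea using sentence embeddings; the use of sentence-transformers, the refusal templates, and the threshold are all illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

REFUSAL_TEMPLATES = [
    "I'm not sure about that.",
    "I don't have enough information to answer.",
    "I'm sorry, but I can't help with that request.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
refusal_embeddings = model.encode(REFUSAL_TEMPLATES, convert_to_tensor=True)

def is_refusal(output: str, threshold: float = 0.6) -> bool:
    """Flag an answer as 'unsure/unwilling' when it is semantically close
    to any canned refusal template."""
    emb = model.encode(output, convert_to_tensor=True)
    return util.cos_sim(emb, refusal_embeddings).max().item() >= threshold
```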
Currently planning how this looks at the moment (and where we fit in with guardrails).
deepeval auto-generate sample.txt
This would save questions and answers inside a CSV. It would generate a question for each line of the text file.
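A sketch of what such a command could do internally; generate_question is a placeholder for whatever LLM call produces a question per line:

```python
import csv

def auto_generate(text_path: str, csv_path: str, generate_question) -> None:
    """For each non-empty line of the input file, treat the line as the
    answer, ask an LLM for a matching question, and save both to a CSV."""
    with open(text_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["question", "answer"])
        for line in src:
            answer = line.strip()
            if answer:
                writer.writerow([generate_question(answer), answer])
```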
Add an unstructured integration to make creating synthetic data for benchmarking LLM applications from data sources a lot easier!
Support running parallelized tests with pytest-xdist.
Desired developer experience:
```
# Distribute the test run across 4 workers
deepeval test run test_sample.py -num_workers 4
```
Motivation:
One problem we're running into is that it currently takes too long to run unit tests for LLMs, and it's a bad developer experience for engineers to wait 5-10 minutes for their test results.
For example, it takes around 30 seconds to generate an LLM output, depending on the length of the response, so even 10 tests (which isn't a lot) take roughly five minutes when run serially.
We're hoping to fix this with this issue.
I think it's a good idea to integrate with LiteLLM, as it provides an easy way to integrate with 100+ LLMs.
https://github.com/BerriAI/litellm
"LLM as a drop-in replacement for GPT. Use Azure, OpenAI, Cohere, Anthropic, Ollama, VLLM, Sagemaker, HuggingFace, Replicate (100+ LLMs)"
Looking to potentially implement COMET, Unbabel's neural framework for evaluating machine translation, to compare against other AI companies.
Unbabel's COMET:
https://github.com/Unbabel/COMET
This can be important to ensure consistent translation performance when changing models for different queries.
assert_translation_performance(...)
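For reference, scoring with COMET directly looks roughly like this (the checkpoint name and call shape follow Unbabel's published examples, but treat them as assumptions; assert_translation_performance above is the proposed DeepEval wrapper, not shown here):

```python
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Bonjour le monde",  # source sentence
    "mt": "Hello world",        # machine translation under test
    "ref": "Hello, world",      # human reference
}]
# Higher segment scores indicate better translations.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)
```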
Integrate the EvaluationDataset class with HuggingFace datasets.
Developer experience should look something like:
EvaluationDataset.from_huggingface_dataset(...)
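A sketch of how such a constructor might be wired up, using the real datasets.load_dataset API; the import paths, column-name parameters, and TestCase/EvaluationDataset internals are assumptions based on the synthetic-data issue below:

```python
from datasets import load_dataset

from deepeval.dataset import EvaluationDataset  # assumed module path
from deepeval.test_case import TestCase         # see deepeval/test_case.py

def from_huggingface_dataset(path: str, query_column: str,
                             output_column: str, split: str = "train") -> EvaluationDataset:
    """Hypothetical helper: map a HuggingFace dataset's columns onto
    TestCase(query=..., expected_output=...) entries."""
    rows = load_dataset(path, split=split)
    return EvaluationDataset(test_cases=[
        TestCase(query=r[query_column], expected_output=r[output_column])
        for r in rows
    ])
```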
I added FallacyChain to LangChain last month (it checks for logical fallacies; link below). Thinking I could add similar functionality for natural-language model output here. Thoughts?
https://python.langchain.com/docs/guides/safety/logical_fallacy_chain
A super valuable integration would be Cohere's Reranker. Will need to check whether Cohere has anything useful that could help evaluate LLMs.
Add a way to create synthetic data with ChatGPT, and create an EvaluationDataset class that allows you to process things in bulk.
dataset = create_query_answer_pairs(text="""Your content goes here""")
Expected output: the evaluation dataset will be filled with TestCase classes. A TestCase should have a query and an expected_output. See deepeval/test_case.py.
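Under those expectations, downstream usage of the filled dataset might look like this (the iteration interface is an assumption; create_query_answer_pairs is the function proposed above):

```python
dataset = create_query_answer_pairs(text="""Your content goes here""")

# Each generated pair arrives as a TestCase with a query and expected_output.
for test_case in dataset:
    print(test_case.query, "->", test_case.expected_output)
```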
Add functionality to calculate the average score across multiple runs.
This ensures a fair comparison without having to view each test run separately.