deepeval's People

Contributors

agokrani, andrea23romano, andresprez, anindyadeep, bderenzi, colabdog, deeds67, donaldwasserman, elafo, fabian57fabian, j-space-b, ji21, kelp710, kritinv, krrishdholakia, kubre, lbux, mikkeyboi, navkar98, nictuku, pedroallenrevez, peilun-li, penguine-ip, philipchung, pratyush-exe, rohinish404, se-hun, shippy, vasilije1990, vmesel

deepeval's Issues

Add SQL Evaluation

Is your feature request related to a problem? Please describe.
Add a way to evaluate SQL queries based on maximizing information gain while minimizing the number of rows, for synthetic query generation. Minimizing number of rows is important for

Describe the solution you'd like
Warning: this API is a WIP. Very open to suggestions.

from deepeval.sql import SQLEval
table = SQLEval.load_table(...)

Describe alternatives you've considered
Can't really see other alternatives for SQL tables right now.

Additional context
May require a bit of work around building SQL injection and also providing a frontend to make viewing the created table very simple.

Improve LangChain Integration

With the improvements in the package, the LangChain guide will need to be updated to demonstrate the new capabilities.

  • Adding a Langchain callback here

Add unit tests for CLI

As the CLI flow gets more and more ironed out, we will need to add tests to ensure the developer onboarding flow doesn't break.

Why are chunks required in FactualConsistencyMetric

Hey guys,

I just started exploring your great library today and was curious to understand the factual consistency metric.

Maybe I didn't get it right, but why do we have to create chunks of our context? They seem to have no impact at all, since

scores = self.model.predict([(context, output), (output, context)])

is always called with the full context and output, so it produces the same scores on every loop iteration. The max_score is already found in the first iteration.

Code: deepeval/metrics/factual_consistency.py:19-32

def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        scores = self.model.predict([(context, output), (output, context)])
        print(scores)
        # https://huggingface.co/cross-encoder/nli-deberta-base
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        score = softmax_scores[0][1]
        if score > max_score:
            max_score = score

        second_score = softmax_scores[1][1]
        if second_score > max_score:
            max_score = second_score
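
For illustration, here is a minimal sketch of what the loop appears intended to do, assuming each chunk (rather than the full context) should be passed to the model. The helpers `chunk_text`, `fake_predict`, and `max_entailment` are hypothetical stand-ins written for this example; they are not deepeval's actual implementation, and the real cross-encoder is replaced by a toy function.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_text(text, chunk_size=20):
    """Hypothetical stand-in for chunk_text: fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def max_entailment(predict, output, context):
    """Return the highest entailment probability over all context chunks.

    `predict` takes a list of (premise, hypothesis) pairs and returns one
    [contradiction, entailment, neutral] logit row per pair.
    """
    max_score = 0.0
    for chunk in chunk_text(context):
        # Score the chunk, not the full context, in both directions.
        rows = predict([(chunk, output), (output, chunk)])
        for row in rows:
            max_score = max(max_score, softmax(row)[1])
    return max_score


def fake_predict(pairs):
    """Toy model (assumption): entailment only if one text contains the other."""
    return [
        [0.0, 5.0, 0.0] if a in b or b in a else [5.0, 0.0, 0.0]
        for a, b in pairs
    ]
```

With this shape, each iteration sees a different chunk, so the loop can actually change `max_score` after the first pass.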

Add tone similarity

Use case: businesses and enterprises often want to ensure that a generated sentence matches the tone in which a person said something. A tone similarity check would cover this.

Remove warnings

========================================================= warnings summary =========================================================
../../../../../opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204
  /opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204: PytestAssertRewriteWarning: Module already imported so
cannot be rewritten: deepeval
    self._mark_plugins_for_rewrite(hook)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an 
API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to 
`pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See 
https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

Pre-configuring Multiple Responses for a Conversation

Hello deepeval maintainers and community,

I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.

Here are the three specific flows I'd like to test:

  1. Positive Flow: The user agrees with everything and always responds positively, essentially saying 'yes' to all prompts.
  2. Inquisitive Flow: The user asks 1 to 3 out of 5 possible questions during the interaction.
  3. Human Representative Request: Throughout the conversation, the user consistently asks to speak to a human representative.

Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.

Questions:

  1. Does the deepeval library support this use case of pre-configuring multiple conversation flows?
  2. If yes, could you provide any pointers or documentation links on how to set it up?
  3. If no, do you have any plans in the future roadmap to support such functionality or do you know of any other tools/libraries that might assist with this?

Thank you for your time and looking forward to your response!

Add LLMEvalMetric

Prompt to use for LLMEvalMetric

We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \

As featured in this guide:

https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
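
A rough sketch of how the prompt above could be wired into a metric. The function `llm_eval` and the `complete` callable are assumptions made for this example (any function that takes a prompt string and returns the model's text reply would do); this is not an existing deepeval API.

```python
from typing import Callable

# Prompt text taken from the issue; backslashes are line continuations.
EVAL_PROMPT = """We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \
"""


def llm_eval(
    complete: Callable[[str], str],
    question: str,
    gt_answer: str,
    pred_answer: str,
) -> bool:
    """Return True if the judge model's reply starts with YES."""
    prompt = EVAL_PROMPT.format(
        question=question, gt_answer=gt_answer, pred_answer=pred_answer
    )
    reply = complete(prompt)
    # Tolerate leading whitespace and casing differences in the reply.
    return reply.strip().upper().startswith("YES")
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM client.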

LiteLLM Docs

Hey guys,

I see some LiteLLM docs in this repo - curious, did y'all fork it? Totally cool if so, just wondering why fork vs. using the package?

'Answer Length' metric

Also known as a 'Brevity' metric: a simple character count of the answer (this metric might already exist somewhere in the code, but I'm not seeing it). In theory, some models could score higher on relevancy simply by using more words; an 'answer length' metric would control for that.
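
A minimal sketch of such a metric, assuming a plain character count and a hypothetical `max_chars` threshold; the names `answer_length` and `is_brief` are illustrative and not taken from deepeval.

```python
def answer_length(answer: str) -> int:
    """Character count of the answer, ignoring surrounding whitespace."""
    return len(answer.strip())


def is_brief(answer: str, max_chars: int = 200) -> bool:
    """Pass when the answer stays within the (assumed) character budget."""
    return answer_length(answer) <= max_chars
```

The length could then be reported next to a relevancy score so that longer answers are not favored unnoticed.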

Versioning

Is your feature request related to a problem? Please describe.
If you are an engineer, it would be really important to be able to version and compare results as if they were git branches.

Describe the solution you'd like

git checkout -b feature/add-guidance

# To compare this against the most recent branch in terms of performance (which is saved/cached)
deepeval compare

# To compare the main branch
deepeval compare main 

Ideally it should then output a table to compare results
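
One possible shape for that table, sketched under the assumption that each branch's cached results are available as a mapping from metric name to score. `compare_runs` and its parameters are hypothetical and only illustrate the output format.

```python
from typing import Dict, Optional


def _fmt(value: Optional[float]) -> str:
    """Render a score, or '-' when the metric is missing on that branch."""
    return "-" if value is None else f"{value:.3f}"


def compare_runs(
    base: Dict[str, float],
    head: Dict[str, float],
    base_name: str = "main",
    head_name: str = "feature/add-guidance",
) -> str:
    """Build a plain-text table comparing metric scores across two branches."""
    lines = [f"{'metric':<24}{base_name:>12}{head_name:>24}{'delta':>10}"]
    for metric in sorted(set(base) | set(head)):
        b, h = base.get(metric), head.get(metric)
        # A delta is only meaningful when both branches report the metric.
        delta = f"{h - b:+.3f}" if b is not None and h is not None else "n/a"
        lines.append(f"{metric:<24}{_fmt(b):>12}{_fmt(h):>24}{delta:>10}")
    return "\n".join(lines)
```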

Support new factual consistency model

Factual consistency can be significantly improved with a larger model, but a larger model can have issues when running in environments with limited GPU RAM. Add a section to the documentation on improving performance.

error in command -> deepeval test run test_sample.py

⠋ Downloading models (may take up to 2 minutes if running for the first time)...Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 32, in run
self.live.refresh()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 241, in refresh
with self.console:
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 864, in __exit__
self._exit_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 822, in _exit_buffer
self._check_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 2038, in _check_buffer
write(text)
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u283c' in position 10: character maps to <undefined>
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
FF.s [100%]
============================================================ FAILURES =============================================================
_____________________________________________________________ test_1 ______________________________________________________________

def test_1():
    # Check to make sure it is relevant
    query = "What is the capital of France?"
    output = "The capital of France is Paris."
    metric = RandomMetric()
    # Comment this out for differne metrics/models
    # metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output)
  assert_test(test_case, [metric])

test_sample.py:18:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2304173247532807.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2304173247532807.
Max retries (1) exceeded.
_____________________________________________________________ test_2 ______________________________________________________________

def test_2():
    # Check to make sure it is factually consistent
    output = "Cells have many major components, including the cell membrane, nucleus, mitochondria, and endoplasmic reticulum." 
    context = "Biology"
    metric = RandomMetric()
    # Comment this out for factual consistency tests
    # metric = FactualConsistencyMetric(minimum_score=0.8)
    test_case = LLMTestCase(output=output, context=context)
  assert_test(test_case, [metric])

test_sample.py:29:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2175566574653277.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2175566574653277.
Max retries (1) exceeded.
====================================================== slowest 10 durations =======================================================
4.99s call test_sample.py::test_1
2.37s call test_sample.py::test_2
2.32s call test_sample.py::test_3

(7 durations < 0.005s hidden. Use -vv to show these durations.)
===================================================== short test summary info =====================================================
FAILED test_sample.py::test_1 - AssertionError: Random failed. Score: 0.2304173247532807.
FAILED test_sample.py::test_2 - AssertionError: Random failed. Score: 0.2175566574653277.
2 failed, 1 passed, 1 skipped in 10.94s
✅ Tests finished! View results on https://app.confident-ai.com/
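
The `UnicodeEncodeError` at the top of this log comes from the console encoding (cp1252) being unable to represent the Braille spinner character. The log itself suggests setting `PYTHONIOENCODING=utf-8`. As a small sketch, the snippet below reproduces the encoding failure in isolation and shows one in-process alternative, `sys.stdout.reconfigure` (available from Python 3.7). The helper names are made up for this example, and whether this is the right fix for deepeval's CLI is an assumption.

```python
import io
import sys

SPINNER = "\u283c"  # the Braille spinner frame named in the traceback


def can_encode(text: str, encoding: str) -> bool:
    """Report whether `text` can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def force_utf8_stdout() -> None:
    """Switch stdout to UTF-8 when it is a regular text stream."""
    if isinstance(sys.stdout, io.TextIOWrapper):
        sys.stdout.reconfigure(encoding="utf-8")


# cp1252 has no mapping for the spinner character, UTF-8 does.
print(can_encode(SPINNER, "cp1252"), can_encode(SPINNER, "utf-8"))
```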

HallucinationMetric

To check for hallucination, we can perform the following:

  • Grab sources from a Google Search/Query
  • Run Factual Consistency On Top

Auto-create evaluation dataset and edge cases

Develop a way to automatically create an evaluation dataset with edge cases so that users don't have to write tests. Then make it super easy to run!

New Design Plan:

  • Prompt to generate tests - ensure to include edge cases and RAG performance
  • Function to run evaluation on the generated tests
  • Improve the functionality

Improve Overall Score

For Overall Score, implement a score breakdown to make it easier to understand what is lowering the score.

Add a "not sure" / "unwilling to answer" metric

Measure the number of times an LLM is "unsure" of something or "unwilling" to answer. This is a growing pain point with LLMs. Unfortunately, it is not clear whether research papers have caught up in this area, so some sort of Conceptual Similarity across a few such prompts should be the easiest way to do this.

Add CLI for AutoEvals

deepeval auto-generate sample.txt

This would save questions and answers in a CSV file. It would generate questions for each line of the text file.
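
A sketch of the file handling behind such a command, assuming the question/answer generation is supplied as a callable (in practice, presumably an LLM call). `generate_qa_csv` and `placeholder_pair` are hypothetical names used only for this example.

```python
import csv
import io
from typing import Callable, Iterable, TextIO, Tuple


def generate_qa_csv(
    lines: Iterable[str],
    make_pair: Callable[[str], Tuple[str, str]],
    out: TextIO,
) -> int:
    """Write one (question, answer) row per non-empty input line."""
    writer = csv.writer(out)
    writer.writerow(["question", "answer"])
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the source text
        writer.writerow(make_pair(line))
        count += 1
    return count


def placeholder_pair(line: str) -> Tuple[str, str]:
    """Stand-in generator (assumption): a real one would call an LLM."""
    return (f"What does the following statement say? {line}", line)


buffer = io.StringIO()
generate_qa_csv(["Paris is the capital of France."], placeholder_pair, buffer)
print(buffer.getvalue())
```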

Parallelized Tests

Support running parallelized tests with pytest xdist.

Desired developer experience:

# Distribute the number of workers to 4
deepeval test run test_sample.py -num_workers 4

Motivation:
One problem we're running into is that it currently takes too long to run unit tests for LLMs, and it's a bad developer experience for engineers to wait 5-10 minutes for their test results.

For example, it takes around 30 seconds to generate an LLM output, depending on the length of the response, so simply running 10 tests (which isn't a lot) takes quite a while.

We're hoping to fix this with this issue.
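
For reference, with pytest-xdist installed, plain pytest distributes tests across workers via `pytest -n 4`. The sketch below illustrates the same idea at a smaller scale with a thread pool: when each test mostly waits on an LLM response, running the calls concurrently cuts wall-clock time. `slow_llm_call` is a hypothetical stand-in that just sleeps; it is not part of deepeval.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def slow_llm_call(prompt: str) -> str:
    """Stand-in for an LLM request (assumption): waits, then echoes."""
    time.sleep(0.05)
    return prompt.upper()


def run_parallel(prompts: List[str], num_workers: int = 4) -> List[str]:
    """Run the calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(slow_llm_call, prompts))
```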

Litellm integration request

I think it's a good idea to integrate with LiteLLM, as it provides an easy way to integrate with 100+ LLMs.

https://github.com/BerriAI/litellm

LLM as a drop in replacement for GPT. Use Azure, OpenAI, Cohere, Anthropic, Ollama, VLLM, Sagemaker, HuggingFace, Replicate (100+ LLMs)

Add Translation Similarity

Looking to potentially implement COMET's neural translation framework to compare against other AI companies.

Unbabel's COMET
https://github.com/Unbabel/COMET

This can be important to ensure consistent performance when changing models for different queries.

assert_translation_performance(...)

Integration with HuggingFace Datasets

Integrate an EvaluationDataset class with HuggingFace datasets.

Developer experience should look something like:

EvaluationDataset.from_huggingface_dataset(...)
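
A possible shape for that constructor, sketched without the `datasets` dependency. It assumes the dataset can be iterated as dict-like rows (a HuggingFace `Dataset` yields dicts when iterated) and that the column names are passed in. The classes below are simplified stand-ins, and `from_rows` and its parameters are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Mapping, Optional


@dataclass
class TestCase:
    """Simplified stand-in with a query and an expected output."""
    query: str
    expected_output: Optional[str] = None


@dataclass
class EvaluationDataset:
    """Simplified stand-in holding a list of test cases."""
    test_cases: List[TestCase] = field(default_factory=list)

    @classmethod
    def from_rows(
        cls,
        rows: Iterable[Mapping[str, str]],
        query_column: str = "question",
        expected_output_column: str = "answer",
    ) -> "EvaluationDataset":
        """Build a dataset from dict-like rows using the given columns."""
        cases = [
            TestCase(
                query=row[query_column],
                expected_output=row.get(expected_output_column),
            )
            for row in rows
        ]
        return cls(test_cases=cases)
```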

Cohere Integration

A super valuable integration would be Cohere's Reranker integration. Will need to check if Cohere has anything useful that might be able to evaluate LLMs.

Add ChatGPT Synthetic Data Creation

Add a way to create synthetic data with ChatGPT, and create an EvaluationDataset class that allows you to process things in bulk.

dataset = create_query_answer_pairs(text="""Your content goes here""",)

The expected output is an evaluation dataset filled with TestCase objects. A TestCase should have a query and expected_output. See deepeval/test_case.py
