explodinggradients / ragas Goto Github PK

View Code? Open in Web Editor NEW

5.0K 27.0 439.0 5.1 MB

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines

Home Page: https://docs.ragas.io

License: Apache License 2.0

Python 98.97% Makefile 0.96% Dockerfile 0.07%

llm llmops

ragas's People

Contributors

Stargazers

Watchers

Forkers

jjmachan shahules786 vinayreddy100 techthiyanes hjeffreywang inkerton sanyamlakhanpal tomchapin jaytoday rescuerunaways axiomofjoy sangeethavdirectlink mikeldking colabdog mivanovitch cyberflamego williamtran29 cyrilmagsuci stjordanis haoyitedaniu samee99 chris415 nasame hendrikmax biyanisuraj jfeng3 consultingmd codeaudit coding-alt hieutrluu sixvo-labs onirban afilimonov inweb3 yoon-gu venuraja79 data-drone jakubik2023 ishaan-jaff hbcbh1999 ssahgal jakkaj msetbar yongtae723 starrywheat bonzuz shashi792 subratcall emilesilvis yujonglee cdaprod peytontolbert momuno tleyden ayankayode chrisdixson kyemaloy97 buddies0710 inflaton devtribble devanshubrahmbhatt dcoinhub sherryxding kevinqu-00100 logp amandawinkles tch-at-bain monstertruck jxzhangjhu amalreji111 pitmonticone mkchaitanya03 wangjiaqiys poggiolabs rachittshah tianbingsheng yejiahaoye devpod mmaysami dmkwon josephrp dil-abarbosa brunoscaglione arm-diaz marcklingen tinomaxthayil chankeith96 grauvictor austinmw wuxixixi chinchillaa trislee02 jsham042 alebondarenko whitewum gsittyz jeekim mzm-moazam aarnphm ashely94

ragas's Issues

protobuf version incompatible with tf

i am facing problems during the installation phase. It seems that the protobuf version required does not match dependencies required by tensorflow >= 2.13.0.

What does ground_truth refer to?

DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

I wanted to know if the ground truth refers to the ground truth passages that need to be retrieved or the the final answer
I was a bit confused as I saw
ground_truths: list[list[str]] in the documntation.
Should it not be ground_truths: list[str] ?

Answer relevancy use custom embedding

At the moment, this metrics use OpenAIEmbedding by default.
Can we allow this to use custom embedding instead ?

open bigquery table for public

Ensure reproducibility of results.

Ensure results are reproducible by using a Seed generator.

chore: improve caching for Github Actions

Query about calculation of metrics

Hey,

Thanks for creating and maintaining this repository.

I assume that you would be using an LLM to get out the scores for each metric. Or are you guys using some bespoke model for each metric like coherence, faithfulness?

If you rely on an LLM, how do you get the score? Do you ask the LLM to spit out a score?

Please let me know!

Thanks!

Improve context relevancy

TODO

Improve prompt
Add guardrails (score max 1)

OpenAI key not found

Hey guys,
I have a problem that occured all of a sudden. I am using the Azure OpenAI client and it worked until recently. Now I am getting the error ´ Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter.´
I am setting the paramters like so at the beginning of the code:

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_KEY"] = "..................................."
os.environ["OPENAI_API_VERSION"] = "....................................."
os.environ["OPENAI_API_BASE"] = "..........................."

llm = AzureOpenAI(deployment_name = ".................................", openai_api_key=os.environ["OPENAI_API_KEY"])
Passing it as a named parameter to the evaluate method did not solve it by the way.

I have absolutely zero idea where this is coming from. Can anyone hint me in a direction?

Testset Generation: bringing continual learning to RAG pipelines

We started ragas with ground-truthfree evaluations so that you didn't have to put significant upfront effort into building an ideal test set before running evaluations. Creating a test set needs substantial upfront investment in time, money, human hours and expertise to get it right. It is also a continuous process as your product and ML model evolve to cater to diverse use cases. We are exploring the possibilities of synthetic test set generation because

As RAG users get more mature and go into production, having a solid test set and evaluation strategies becomes critical to give users a seamless experience. This means they have to put more time into building solid test sets and evaluation methodologies.
ground-truth free evaluation has its limitations. It is very effective in quantifying aspects like faithfulness but ground-truth free evaluation cannot be used to ensure aspects like answer correctness which is also very important. Here a synthetic test set with ground-truth can be of high utility.

The whole focus of the Ragas library is to help you build more reliable RAG applications which is why with the next leg of Ragas we'll be focusing a lot more on test set generation and continual learning of RAG pipelines. The goal is to leverage custom LLMs and Data-Centric AI techniques to

Build more robust paradigms for test set generation.
1. Many libraries already have some sort of test set generation but they have a few shortcomings. Ideally, the test set should have a good distribution of easy -> hard questions across different tasks/situations as seen in production.
Tools to scale up and reduce the cost of test set generation.
1. Works like self-instruct and evol-instruct have proven that LLMs can generate human-quality synthetic data. We are working on paradigms to generate high-quality synthetic data generation specific to RAG. Ref [1] [2]
Methodologies to continuously add to and improve the test set as your RAG pipelines evolve using other data points like logs and feedback.

there is a lot of work to be done but with the v0.1 release of Ragas, we'll be releasing features in this direction. In the meantime, we would love to hear your opinions, expectations, suggestions and ideas about this too :)

Team Ragas

No evaluation possible with context_relevancy/answer_relevancy/context_recall/harmfullness for Azure OpenAI

Hi guys,

I am using Azure OpenAI and the only evaluation method that is currently working is "faithfullness". All the others fail with the same error Exception: n=3 was passed to generate but the LLM AzureOpenAI Params: {'deployment_name': '.........', 'model_name': 'text-davinci-003', 'temperature': 0.2, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}} does not support it. Raise an issue if you want support for {llm}.

running quickstart.ipynb fails

Hi,

When running the quickstart.ipynb notebook, i got the following errors:

missing tiktoken (fixed by adding !pip install tiktoken)
missing langcahin embedding (fixed by !pip install langchain==0.0.288)

Prevent hallucination in candidate sentence extraction

For measuring context_relevancy, the LLM tries to extract sentences from a given context that is actually useful to answer the question. But in this process, there is a slight chance that the LLM might hallucinate and give out a sentence that is not exactly present in the given context.

It would be good to have some kind of check to ensure that only sentences present in the given context are selected by the LLM.
Another way to look at it is to change the prompt so that the LLM outputs the indices of candidate sentences from a given context. This could bypass the need for an extra check.

Improve metrics docs

Improve documentation of metrics. Try to explain the working of different metrics more deeply.

max sequence length in context relevancy.

Currently, if the sequence length of context is more than max_length (512) it will be truncated before scoring relevancy. Instead, such contexts should be chunked to sequence < 512 tokens before scoring and then averaged.
Changes for this could be made here

Exact versions of libraries

HuggingFace is famous for moving fast and breaking things. Would be great to have the exact dependency versions of all the python libraries. Something like "datasets==2.14.3" instead of datasets.

Inspect Entailment score

Make sure the output distribution from the entailment score is correct.

test

Support vLLM

Exception: n=1 was passed to generate but the LLM VLLM
Params: {} does not support it. Raise an issue if you want support for {llm}.

Following the instructions from the exception.

[BUG] running ragas evaluate tries to post results to local langsmith

I don't have any langchain env vars set but I am getting a network error as seen below.

Possible bug? context_<space>relevancy -> has space in between

Hi team,

I'm new with RAGAS, however I found this :


from ragas.metrics import faithfulness, answer_relevancy, context_relevancy, context_recall
from ragas.langchain import RagasEvaluatorChain

# make eval chains
eval_chains = {
    m.name: RagasEvaluatorChain(metric=m) 
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}

for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(result)[score_name]}")

using template code, i found that the score name for context_ relevancy -> it has space in between. Is it just me?

(dont mind the len=3, i was using different code)

I cannot reproduce the results of the baseline because its context_relevancy is too high.

I cannot reproduce the results of the fiqa baseline in this notebook:
https://github.com/explodinggradients/ragas/blob/main/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb

At the end of this notebook, is shows the score:
{'NLI_score': 0.8655555555555556, 'answer_relevancy': 0.8737666666666667, 'context_ relevancy': 0.8181444444444443, 'ragas_score': 0.8517704492684051}

But when I test, the score I get is:

{
      "context_ relevancy": 0.10368744449698047,
      "faithfulness": 1.0,
      "answer_relevancy": 0.9286177818722253,
      "context_recall": 0.6370370370370371,
      "harmfulness": 0.0,
      "ragas_score": 0.300955397960847,
}

I noticed that my context_relevancy is very low. I know that in the latest pr, the prompt used to test context_relevancy has been modified, but I am not running with this latest version. But I don’t think it’s caused by this, because this jupyter notebook seems to be running with an old version.
In addition to the context_relevancy value, there seem to be some gaps in the other values. So I'm wondering, what could be causing this?

InvalidRequestError when context size to too big

I want to evaluate a single completion of my LLM.

Code:

from ragas import evaluate
from datasets import Dataset
import os

# prepare your huggingface dataset in the format
# Dataset({
#     features: ['question','contexts','answer'],
#     num_rows: 25
# })

data = {
    "question": [query],  # single query, string
    "contexts": [sources],  # single source document in string, I have tried sources[:3000], sources[:3500] to avoid this error
    "answer": [answer]  # single answer
}

# Create the Hugging Face dataset
dataset = Dataset.from_dict(data)

# Set the dataset format
dataset.set_format(
    type="torch", columns=["question", "contexts", "answer"]  # I have tried without type='torch'
)

# Print dataset information
print(dataset)


dataset: Dataset

results = evaluate(dataset)

Complete Output traceback:

Dataset({
    features: ['question', 'contexts', 'answer'],
    num_rows: 1
})
100%|██████████| 1/1 [00:00<00:00,  1.44it/s]
100%|██████████| 52/52 [03:09<00:00,  3.65s/it]
  0%|          | 0/1 [00:02<?, ?it/s]
---------------------------------------------------------------------------
InvalidRequestError                       Traceback (most recent call last)
[<ipython-input-32-8200ac55342c>](https://localhost:8080/#) in <cell line: 38>()
     36 dataset: Dataset
     37 
---> 38 results = evaluate(dataset)

9 frames
[/usr/local/lib/python3.10/dist-packages/ragas/evaluation.py](https://localhost:8080/#) in evaluate(dataset, metrics)
     86     scores = []
     87     for metric in metrics:
---> 88         scores.append(metric.score(dataset).select_columns(metric.name))
     89 
     90     return Result(scores=concatenate_datasets(scores, axis=1), dataset=dataset)

[/usr/local/lib/python3.10/dist-packages/ragas/metrics/factual.py](https://localhost:8080/#) in score(self, dataset)
     71         scores = []
     72         for batch in tqdm(self.get_batches(len(dataset))):
---> 73             score = self._score_batch(dataset.select(batch))
     74             scores.append(score)
     75 

[/usr/local/lib/python3.10/dist-packages/ragas/metrics/factual.py](https://localhost:8080/#) in _score_batch(self, ds)
    101             prompts.append(prompt)
    102 
--> 103         response = openai_completion(prompts)
    104         outputs = response["choices"]  # type: ignore
    105 

[/usr/local/lib/python3.10/dist-packages/backoff/_sync.py](https://localhost:8080/#) in retry(*args, **kwargs)
    103 
    104             try:
--> 105                 ret = target(*args, **kwargs)
    106             except exception as e:
    107                 max_tries_exceeded = (tries == max_tries_value)

[/usr/local/lib/python3.10/dist-packages/ragas/metrics/llms.py](https://localhost:8080/#) in openai_completion(prompts, **kwargs)
     24     - what happens when backoff fails?
     25     """
---> 26     response = openai.Completion.create(
     27         model=kwargs.get("model", "text-davinci-003"),
     28         prompt=prompts,

[/usr/local/lib/python3.10/dist-packages/openai/api_resources/completion.py](https://localhost:8080/#) in create(cls, *args, **kwargs)
     23         while True:
     24             try:
---> 25                 return super().create(*args, **kwargs)
     26             except TryAgain as e:
     27                 if timeout is not None and time.time() > start + timeout:

[/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/engine_api_resource.py](https://localhost:8080/#) in create(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)
    151         )
    152 
--> 153         response, _, api_key = requestor.request(
    154             "post",
    155             url,

[/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py](https://localhost:8080/#) in request(self, method, url, params, headers, files, stream, request_id, request_timeout)
    296             request_timeout=request_timeout,
    297         )
--> 298         resp, got_stream = self._interpret_response(result, stream)
    299         return resp, got_stream, self.api_key
    300 

[/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py](https://localhost:8080/#) in _interpret_response(self, result, stream)
    698         else:
    699             return (
--> 700                 self._interpret_response_line(
    701                     result.content.decode("utf-8"),
    702                     result.status_code,

[/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py](https://localhost:8080/#) in _interpret_response_line(self, rbody, rcode, rheaders, stream)
    761         stream_error = stream and "error" in resp.data
    762         if stream_error or not 200 <= rcode < 300:
--> 763             raise self.handle_error_response(
    764                 rbody, rcode, resp.data, rheaders, stream_error=stream_error
    765             )

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4331 tokens (3831 in your prompt; 500 for the completion). Please reduce your prompt; or completion length.

Even after changing all the lengths inside my dataset, I get the same error every single time. 3831 tokens in my prompt, 500 for completion.

What's context recall?

Hi guys - I'm currently adding Ragas to DeepEval (confident-ai/deepeval#101)
One thing I found odd though was context_recall was done using prompts. Curious to hear if there's other experiments done in this space to answer the question of if the retriever answered the query. Potentially a mean Cross-Encoder QA would be helpful here?

Curious to hear your thoughts guys!

Quality of metrics questionable with Azure OpenAI

Hi guys,
I am evaluating the quality of the metrics in your framework for my RAG use case. So far I a am afraid to say that most prompts and responses are not delivering the expected result and are therefore not reliable. I am using the Azure OpenAI endpoint with GPT 3.5 so it is very possible that it has everything to do with that. Can anyone else confirm this observation?

Expose LLM

Hi,

Great piece of work here, really well done.

It would be fantastic if we could any supported LLM from langchain to do the evaluation.

I have a number of use cases which require sovereignty, which means essentially either using on prem LLMs or Azure OpenAPI locked to a certain region.

Happy to help wherever I can!

ragas evaluate with llama_index and langchain doesn't seem to work with Azure OpenAI

Azure OpenAI requires the special parameter deployment or deployment_id

The langchain wrappers seem to mostly have been updated to accomodate for this but it doesn't seem to work with ragas.

I ended up getting faithfulness working but updating the generate method with:

from


    elif isinstance(llm, BaseChatModel):
        ps = [p.format_messages() for p in prompts]
        result = llm.generate(ps, callbacks=callbacks)

    elif isinstance(llm, BaseChatModel):
        ps = [p.format_messages() for p in prompts]
        result = llm.generate(ps, callbacks=callbacks, deployment_id='<my_id>', api_version='<my_version>')

but with answer_relevancy I hit the same issue when it tries to run:

│    91 │   def calculate_similarity(                                                              │
│    92 │   │   self: t.Self, question: str, generated_questions: list[str]                        │
│    93 │   ):                                                                                     │
│ ❱  94 │   │   question_vec = np.asarray(self.embedding.embed_query(question)).reshape(1, -1)     │
│    95 │   │   gen_question_vec = np.asarray(                                                     │
│    96 │   │   │   self.embedding.embed_documents(generated_questions)                            │
│    97 │   │   )

Any ideas?

what langchain version should I use? got below issue.

from ragas.metrics.answer_relevance import AnswerRelevancy, answer_relevancy
File "####\Python\Python311\Lib\site-packages\ragas\metrics\answer_relevance.py", line 10, in
from langchain.embeddings.base import Embeddings
ModuleNotFoundError: No module named 'langchain.embeddings.base'

my langchain version 0.0.261

Typo in prompt

ragas/src/ragas/metrics/faithfulnes.py

Line 25 in 5cf4975

answer: alochol

add open analytics to ragas

Can't run evaluate() from llamaindex integration

Hello team 👋

When I try to reproduce your llamaindex notebook, without modifying anything, I've got an error with:

result = evaluate(query_engine, metrics, eval_questions, eval_answers)

It says:

TypeError: evaluate() takes 3 positional arguments but 4 were given

Any idea on how to make it work? Thanks!

Open Analytics

We track very basic usage metrics to guide us to figure out what our users want, what is working and what's not. As a young startup, we have to be brutally honest about this which is why we are tracking these metrics. We are also an Open Startup, which is a product or company which operates in the open and shares its statistics publically.

All the data and the code we track will be open-sourced soon

If you don't want to send tracking info, you can easily disable it by setting RAGAS_DO_NOT_TRACK to True.

Keeping this issue open for feedback from the community and further discussions.

Refs

Discord server link in README.md is expired

Current link in README.md https://github.com/explodinggradients/ragas/blob/main/README.md points to https://discord.gg/5djav8GGNZ, which returns an invalid invite.

Hopefully you can refresh the invite link :). If you don't want people to join, probably should remove the link

Negative faithfulness result

I got the result for faithfulness as -1.5 for one of the rows of my dataset even though its defined scale is 0 - 1. I am currently using ragas version 0.0.9.
This did not happen when I previously ran the version 0.0.7 on the same dataset.

Following is the piece of code I used :

gpt3 = ChatOpenAI()
faithfulness_gpt3 = Faithfulness(
    name="faithfulness_gpt3", llm=gpt3, batch_size=3
)
subset = df.iloc[:3]
hg_dataset_1 = Dataset(pa.Table.from_pandas(subset))
result = evaluate(hg_dataset_1, metrics=[faithfulness_gpt3, context_relevancy, answer_relevancy])```

Issue with Context type

I am trying to run evaluate on outputs generated by GPT4. I have the columns structured in the desired format however am running into the following error

Dataset feature "contexts" should be of type Sequence[string[, got <class 'datasets.features.features.Value'>

Any tips on how to resolve? Thank you!

local variable 'result' referenced before assignment

Issue in src/ragas/metrics/llms.py
The result variable is not initiated outside the if elif else blocks.

Use local LLMs

Hi, Thanks for this amazing framework

I would like to know is there any support for using locally trained LLMs, I see that currently we can change the LLM with Langchain, but I don't want to use langchain and use a local LLM like Llama 2

KeyError while using dataset with extra attributes

Ragas can be only used with dataset that contains fixed attributes. Any other attributes other than required causes key errors. For example, here I used dataset with column names [question,answer,contexts,ungrounded_answer]

from datasets import load_dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)
from ragas import evaluate
wikieval = load_dataset("explodinggradients/WikiEval")
wikieval = wikieval['train'].rename_columns({"grounded_answer":"answer","context_v1":"contexts"})
results = evaluate(dataset=wikieval1,metrics=[faithfulness,answer_relevancy])

KeyError                                  Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 results = evaluate(dataset=wikieval,metrics=[context_relevancy,faithfulness,answer_relevancy])

File ~/belar/src/ragas/evaluation.py:89, in evaluate(dataset, metrics, column_map)
     86     metrics = [answer_relevancy, context_relevancy, faithfulness, context_recall]
     88 # remap column names from the dataset
---> 89 dataset = remap_column_names(dataset, column_map)
     91 # validation
     92 validate_evaluation_modes(dataset, metrics)

File ~/belar/src/ragas/validation.py:14, in remap_column_names(dataset, column_map)
      9 """
     10 Remap the column names in case dataset uses different column names
     11 """
     12 inverse_column_map = {v: k for k, v in column_map.items()}
     13 return dataset.from_dict(
---> 14     {inverse_column_map[name]: dataset[name] for name in dataset.column_names}
     15 )

File ~/belar/src/ragas/validation.py:14, in <dictcomp>(.0)
      9 """
     10 Remap the column names in case dataset uses different column names
     11 """
     12 inverse_column_map = {v: k for k, v in column_map.items()}
     13 return dataset.from_dict(
---> 14     {inverse_column_map[name]: dataset[name] for name in dataset.column_names}
     15 )

KeyError: 'ungrounded_answer'

BaseLLM issue with ContextRelevancy and AnswerRelevancy

Hello! Thanks for the great work :)

I've run into a bug while trying to use a local LLM. I cannot compute either ContextRelevancy or AnswerRelevancy when using a langchain BaseLLM, due to this Exception in line 36 in ragas.metrics.llms:

20      def generate(
21          prompts: list[ChatPromptTemplate],
22          llm: BaseLLM | BaseChatModel,
23          n: t.Optional[int] = None,
24          temperature: float = 0,
25          callbacks: t.Optional[Callbacks] = None,
26      ) -> LLMResult:
27          old_n = None
28          n_swapped = False
29          llm.temperature = temperature
30          if n is not None:
31               if isinstance(llm, OpenAI) or isinstance(llm, ChatOpenAI):
32                    old_n = llm.n
33                    llm.n = n
34                    n_swapped = True
35               else:
36  --->           raise Exception(
37                     f"n={n} was passed to generate but the LLM {llm} does not support it."
38                     " Raise an issue if you want support for {llm}."
39                )

The issue arises because when ContextRelevancy and AnswerRelevancy call this function they pass in n=self.strictness, eg in ragas.metrics.answer_relevance:

75        results = generate(
76            prompts,
77            self.llm,
78            n=self.strictness,
79            temperature=self.temperature,
80            callbacks=batch_group,
81        )

However strictness must be an integer due to both class's post_init:

54    def __post_init__(self: t.Self):
55        self.temperature = 0.2 if self.strictness > 0 else 0

Passing strictness=None would resolve the Exception issue but yields another error due to the post_init integer comparison.

Is there any scope to change the Exception to allow for n=0 or similar (eg change line 30 in ragas.metrics.llms to if n is not None and n>0?

If not, we are currently unable to use these metrics with a non-OpenAI LLM as far as I can tell.

Thank you!

RuntimeError: Expected all tensors to be on the same device

Hello, I'm trying to use ragas for a simple evaluation on my dataset with only 2 columns ("question", "answer").

import datasets
from ragas.metrics import answer_relevancy
import os


os.environ["OPENAI_API_KEY"] = {my OpenAI key}

data=datasets.Dataset.from_dict({"question":["2+2","what is it ragas"], "answer":["4","an evaluation metric"]})

results = evaluate(data, metrics=[answer_relevancy])

From this very simple example I receive an error message

RuntimeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 results = evaluate(data, metrics=[answer_relevancy])

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/evaluation.py:89](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/evaluation.py:89), in evaluate(dataset, metrics)
     87 scores = []
     88 for metric in metrics:
---> 89     scores.append(metric.score(dataset).select_columns(metric.name))
     91 # log the evaluation event
     92 metrics_names = [m.name for m in metrics]

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:164](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:164), in AnswerRelevancy.score(self, dataset)
    160 sentence_ds = dataset.map(
    161     self._make_question_answer_pairs, batched=True, batch_size=10
    162 )
    163 # we loose memory here because we have to make it py_list
--> 164 scores = self.model.predict(sentence_ds["sentences"])
    165 return Dataset.from_dict({f"{self.name}": scores})

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:133](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:133), in QGen.predict(self, sentences, batch_size, show_progress)
    131 inputs, labels = data
    132 with torch.no_grad():
--> 133     logits = self.model(**inputs, output_hidden_states=False).logits
    134     loss = self.get_loss(logits, labels)
    135     predictions.append(loss)

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501), in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1683](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1683), in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1680 # Encode if needed (training, first prediction pass)
   1681 if encoder_outputs is None:
   1682     # Convert encoder inputs in embeddings if needed
-> 1683     encoder_outputs = self.encoder(
   1684         input_ids=input_ids,
   1685         attention_mask=attention_mask,
   1686         inputs_embeds=inputs_embeds,
   1687         head_mask=head_mask,
   1688         output_attentions=output_attentions,
   1689         output_hidden_states=output_hidden_states,
   1690         return_dict=return_dict,
   1691     )
   1692 elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
   1693     encoder_outputs = BaseModelOutput(
   1694         last_hidden_state=encoder_outputs[0],
   1695         hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
   1696         attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
   1697     )

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501), in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:988](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:988), in T5Stack.forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    986     if self.embed_tokens is None:
    987         raise ValueError("You have to initialize the model with valid token embeddings")
--> 988     inputs_embeds = self.embed_tokens(input_ids)
    990 batch_size, seq_length = input_shape
    992 # required mask seq length can be calculated via length of past

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501), in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162), in Embedding.forward(self, input)
    161 def forward(self, input: Tensor) -> Tensor:
--> 162     return F.embedding(
    163         input, self.weight, self.padding_idx, self.max_norm,
    164         self.norm_type, self.scale_grad_by_freq, self.sparse)

File [~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/functional.py:2210](https://vscode-remote+ssh-002dremote-002b7b22686f73744e616d65223a226d61636368696e61556e695069227d.vscode-resource.vscode-cdn.net/home/bagnol/progetti/NLG/alpaca-lora/~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/functional.py:2210), in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2204     # Note [embedding_renorm set_grad_enabled]
   2205     # XXX: equivalent to
   2206     # with torch.no_grad():
   2207     #   torch.embedding_renorm_
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

I've python 3.10.10 with:
torch 2.0.1
ragas 0.0.5
datasets 2.12.0

Thanks a lot

Rename citations.md

Citations.md refers to how others should cite ragas.

ground truth

hello, i read the document about the quickstart. ground_truth just required if you are using context_recall.if I do not need context_recall, i do not parpare the context_recall. i do not add context_recall in my metrics,but something wrong happen "Column ground_truths not in the dataset. Current columns in the dataset: ['question', 'answer', 'contexts']"

My faithfulness and context_recall is incorrect。

I set OPENAI_API_BASE to my own deployed model, and then there were some errors in the data evaluated by Ragas.
First, the same data produces different results each time it's evaluated. Second, there are values outside the range, like -1, in faithfulnessn.
Is there something wrong with my model?how can i locate the bug

Update dependancy to Pydantic to 2.X.X

Typo in results key -> context_ relevancy

results.keys() return

dict_keys(['answer_relevancy', 'context_ relevancy', 'faithfulness', 'ragas_score'])

where there is a whitespace in context_ relevancy

Due to

ragas/src/ragas/metrics/context_relevance.py

Line 110 in 9f54d01

name: str = "context_ relevancy"

how to use RAGAS metrics for GPT3.5 Finetuned model

Hi all,
I understand how RAGAS work for RAG system.
I have use case where I have fine-tuned a GPT3.5 model on my data and using this model for question-answering.
I want to know how if I can use RAGAS to evaluate this fine-tuned model, as it does not have context/chunks to be passed into RAGAS metrics function.
Can anyone help me with "how to use RAGAS metrics for my GPT3.5 Finetuned model"