parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Home Page: https://arxiv.org/abs/2401.18059

License: MIT License

Languages: Jupyter Notebook 12.03%, Python 87.97%
Topics: rag retrieval retrieval-augmented-generation clustering language-model machine-learning vector-database agents framework llm

raptor's Introduction


RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. This allows for more efficient and context-aware information retrieval across large texts, addressing common limitations in traditional language models.

For the full methodology and implementation details, refer to the original paper: https://arxiv.org/abs/2401.18059

Installation

Before using RAPTOR, ensure Python 3.8+ is installed. Clone the RAPTOR repository and install necessary dependencies:

git clone https://github.com/parthsarthi03/raptor.git
cd raptor
pip install -r requirements.txt

Basic Usage

To get started with RAPTOR, follow these steps:

Setting Up RAPTOR

First, set your OpenAI API key and initialize the RAPTOR configuration:

import os
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from raptor import RetrievalAugmentation

# Initialize with default configuration. For advanced configurations, check the documentation. [WIP]
RA = RetrievalAugmentation()

Adding Documents to the Tree

Add your text documents to RAPTOR for indexing:

with open('sample.txt', 'r') as file:
    text = file.read()
RA.add_documents(text)

Answering Questions

You can now use RAPTOR to answer questions based on the indexed documents:

question = "How did Cinderella reach her happy ending?"
answer = RA.answer_question(question=question)
print("Answer: ", answer)

Saving and Loading the Tree

Save the constructed tree to a specified path:

SAVE_PATH = "demo/cinderella"
RA.save(SAVE_PATH)

Load the saved tree back into RAPTOR:

RA = RetrievalAugmentation(tree=SAVE_PATH)
answer = RA.answer_question(question=question)

Extending RAPTOR with other Models

RAPTOR is designed to be flexible and allows you to integrate any models for summarization, question-answering (QA), and embedding generation. Here is how to extend RAPTOR with your own models:

Custom Summarization Model

If you wish to use a different language model for summarization, you can do so by extending the BaseSummarizationModel class. Implement the summarize method to integrate your custom summarization logic:

from raptor import BaseSummarizationModel

class CustomSummarizationModel(BaseSummarizationModel):
    def __init__(self):
        # Initialize your model here
        pass

    def summarize(self, context, max_tokens=150):
        # Implement your summarization logic here
        # Return the summary as a string
        summary = "Your summary here"
        return summary
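
As a concrete illustration, here is a minimal sketch of a summarization model backed by a local Hugging Face pipeline. The transformers dependency and the facebook/bart-large-cnn checkpoint are assumptions of this sketch, not requirements of RAPTOR:

from transformers import pipeline
from raptor import BaseSummarizationModel

class BartSummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="facebook/bart-large-cnn"):
        # Load the summarization pipeline once, at construction time
        self.summarizer = pipeline("summarization", model=model_name)

    def summarize(self, context, max_tokens=150):
        # The pipeline returns a list of dicts with a "summary_text" key
        result = self.summarizer(context, max_length=max_tokens, truncation=True)
        return result[0]["summary_text"]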

Custom QA Model

For custom QA models, extend the BaseQAModel class and implement the answer_question method. This method should return the best answer found by your model given a context and a question:

from raptor import BaseQAModel

class CustomQAModel(BaseQAModel):
    def __init__(self):
        # Initialize your model here
        pass

    def answer_question(self, context, question):
        # Implement your QA logic here
        # Return the answer as a string
        answer = "Your answer here"
        return answer
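
If you prefer a local extractive model over an API, here is a minimal sketch using a Hugging Face question-answering pipeline; the transformers dependency and the model choice are assumptions of this sketch:

from transformers import pipeline
from raptor import BaseQAModel

class ExtractiveQAModel(BaseQAModel):
    def __init__(self, model_name="distilbert-base-cased-distilled-squad"):
        self.qa_pipeline = pipeline("question-answering", model=model_name)

    def answer_question(self, context, question):
        # The pipeline returns a dict with "answer", "score", "start", and "end"
        return self.qa_pipeline(question=question, context=context)["answer"]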

Custom Embedding Model

To use a different embedding model, extend the BaseEmbeddingModel class. Implement the create_embedding method, which should return a vector representation of the input text:

from raptor import BaseEmbeddingModel

class CustomEmbeddingModel(BaseEmbeddingModel):
    def __init__(self):
        # Initialize your model here
        pass

    def create_embedding(self, text):
        # Implement your embedding logic here
        # Return the embedding as a numpy array or a list of floats
        embedding = [0.0] * embedding_dim  # Replace with actual embedding logic
        return embedding
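
For example, a minimal sketch of an SBERT-backed embedding model, assuming the sentence-transformers package is installed (see also demo.ipynb):

from sentence_transformers import SentenceTransformer
from raptor import BaseEmbeddingModel

class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        # encode() returns a numpy array, which RAPTOR accepts as an embedding
        return self.model.encode(text)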

Integrating Custom Models with RAPTOR

After implementing your custom models, integrate them with RAPTOR as follows:

from raptor import RetrievalAugmentation, RetrievalAugmentationConfig

# Initialize your custom models
custom_summarizer = CustomSummarizationModel()
custom_qa = CustomQAModel()
custom_embedding = CustomEmbeddingModel()

# Create a config with your custom models
custom_config = RetrievalAugmentationConfig(
    summarization_model=custom_summarizer,
    qa_model=custom_qa,
    embedding_model=custom_embedding
)

# Initialize RAPTOR with your custom config
RA = RetrievalAugmentation(config=custom_config)

Check out demo.ipynb for examples of how to specify your own summarization/QA models (such as Llama, Mistral, or Gemma) and embedding models (such as SBERT) for use with RAPTOR.

Note: More examples and ways to configure RAPTOR are forthcoming. Advanced usage and additional features will be provided in the documentation and repository updates.

Contributing

RAPTOR is an open-source project, and contributions are welcome. Whether you're fixing bugs, adding new features, or improving documentation, your help is appreciated.

License

RAPTOR is released under the MIT License. See the LICENSE file in the repository for full details.

Citation

If RAPTOR assists in your research, please cite it as follows:

@inproceedings{sarthi2024raptor,
    title={RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval},
    author={Sarthi, Parth and Abdullah, Salman and Tuli, Aditi and Khanna, Shubh and Goldie, Anna and Manning, Christopher D.},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2024}
}

Stay tuned for more examples, configuration guides, and updates.

raptor's People

Contributors

extremlapin, llleoli, parthsarthi03


raptor's Issues

Missing implied functionality `add to existing`

# self.add_to_existing(docs)

Problem:
If I want to iteratively add documents to a RAPTOR index, every call to add_documents is met with an unavoidable prompt:

Warning: Overwriting existing tree. Did you mean to call 'add_to_existing' instead? (y/n): 

For the 200 or so documents that I wish to index, typing y each time is not feasible, especially if I attempt to scale to thousands or tens of thousands in the future.

Beyond the inconvenience of there being no way to override this, upon inspecting the code I realized that selecting yes (or y) just terminates the function without doing anything. In the code block responsible for handling the yes, this line is commented out, followed by a return statement:

#self.add_to_existing(docs)
return 

Wishing to still add to the index iteratively, I searched the codebase and realized that the add_to_existing method does not actually exist anywhere.


Speculation:

Perhaps the intended implementation of add_to_existing turned out to be too difficult. The initial approach may have been to build one large tree for all documents and then rebalance the already-constructed tree as new documents are added.

If avoiding that monolithic tree-rebalancing problem is why add_to_existing does not exist, I speculate that RAPTOR would scale better if it instead maintained an index of several smaller disjoint trees. The entry point would then not be the root of a single tree but any node that is semantically close to the question, found in an embedding space spanning multiple trees (every indexed file gets its own tree).

The algorithm would pull in context in both directions (up the hierarchy and down the hierarchy).
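
To make the speculation concrete, here is a rough sketch of such a forest index. All names are hypothetical and none of this is part of RAPTOR; it only illustrates entering at the nearest nodes across many small trees and pulling context in both directions:

import numpy as np

class ForestIndex:
    # Hypothetical index over many small RAPTOR-style trees (one per file)
    def __init__(self):
        self.nodes = []        # node objects with .text, .embedding, .parent, .children
        self.embeddings = []   # vectors for all nodes across all trees

    def add_tree(self, tree_nodes):
        for node in tree_nodes:
            self.nodes.append(node)
            self.embeddings.append(node.embedding)

    def retrieve(self, query_embedding, k=5):
        # Entry points are the k nodes closest to the query, from any tree
        matrix = np.asarray(self.embeddings)
        sims = matrix @ query_embedding / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_embedding)
        )
        context = []
        for i in np.argsort(-sims)[:k]:
            node = self.nodes[i]
            # Walk up the hierarchy for broader summaries...
            parent = node.parent
            while parent is not None:
                context.append(parent)
                parent = parent.parent
            # ...and down the hierarchy for finer detail
            context.append(node)
            context.extend(node.children)
        return context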

Constructing Layer Issue

Hello, I have samples of different lengths; num_layers ranges from 17 to 51, but all of them hit "Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: 2". It seems I can't build a tree with more than 2 layers.

multi doc

Newbie question: does this support adding multiple documents and then running RAG over them? Thank you very much!
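
For what it's worth, with the API shown in the README one workaround is to concatenate several documents into a single add_documents call. This is a sketch of that workaround, not a documented multi-document feature:

texts = []
for path in ["doc1.txt", "doc2.txt", "doc3.txt"]:
    with open(path, "r") as f:
        texts.append(f.read())
RA.add_documents("\n\n".join(texts))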

Custom Model Guidance

Hi, thanks for sharing the code publicly. I have a few questions about using a custom model for RAPTOR:

  1. Do I create a new file for running the lines under "Setting Up RAPTOR" and "Adding Documents to the Tree", or is there a specific location to add these lines of code?
  2. How does setting up RAPTOR and adding documents differ when using a custom model? Do I just follow what's on the README page but skip importing os and setting my OpenAI key?
  3. Would adding documents to the tree be the same regardless of the type of model (custom vs. baseline) I use?
  4. I assume the RetrievalAugmentation function is unique to RAPTOR, correct? In other words, using a different model still preserves the methods of RAPTOR, right?

Thanks in advance! Was looking forward to this work :)

ValueError: Input contains NaN.

I encountered this error when I was adding text. I hope to get a solution for dealing with it. Thank you very much.
Traceback (most recent call last):
File "/home/jyc23/raptor-master/demo/newdemo.py", line 123, in
RA.add_documents(text)
File "/home/jyc23/raptor-master/raptor/RetrievalAugmentation.py", line 217, in add_documents
self.tree = self.tree_builder.build_from_text(text=docs)
File "/home/jyc23/raptor-master/raptor/tree_builder.py", line 280, in build_from_text
root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
File "/home/jyc23/raptor-master/raptor/cluster_tree_builder.py", line 102, in construct_tree
clusters = self.clustering_algorithm.perform_clustering(
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 194, in perform_clustering
clusters = perform_clustering(
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 120, in perform_clustering
reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 32, in global_cluster_embeddings
reduced_embeddings = umap.UMAP(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2887, in fit_transform
self.fit(X, y, force_all_finite)
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2354, in fit
X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 957, in check_array
_assert_all_finite(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 122, in _assert_all_finite
_assert_all_finite_element_wise(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 171, in _assert_all_finite_element_wise
raise ValueError(msg_err)
ValueError: Input contains NaN.
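
If the NaNs originate in the embedding vectors themselves, one possible guard (untested) is to wrap whatever embedding model you use and scrub non-finite values before tree construction:

import numpy as np
from raptor import BaseEmbeddingModel

class SanitizingEmbeddingModel(BaseEmbeddingModel):
    # Hypothetical wrapper around any other embedding model
    def __init__(self, inner_model):
        self.inner_model = inner_model

    def create_embedding(self, text):
        embedding = np.asarray(self.inner_model.create_embedding(text), dtype=np.float32)
        # Replace NaN/Inf so sklearn's check_array inside UMAP does not raise
        return np.nan_to_num(embedding)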

Is it not possible to use it with Azure?

I tried with the following configuration, but it doesn't work.

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://proxy url"
os.environ["OPENAI_API_KEY"] = "key"
os.environ["OPENAI_API_VERSION"] = "version"

Traceback (most recent call last):
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_transports\default.py", line 69, in map_httpcore_exceptions
yield
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_transports\default.py", line 233, in handle_request
resp = self._pool.handle_request(req)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_sync\connection_pool.py", line 216, in handle_request
raise exc from None
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_sync\connection_pool.py", line 196, in handle_request
response = connection.handle_request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_sync\connection.py", line 99, in handle_request
raise exc
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_sync\connection.py", line 76, in handle_request
stream = self._connect(request)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_sync\connection.py", line 154, in _connect
stream = stream.start_tls(**kwargs)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_backends\sync.py", line 168, in start_tls
raise exc
File "C:\Users\10021782\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpcore\_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 858, in _request
response = self._client.send(request, auth=self.custom_auth, stream=stream)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_client.py", line 914, in send
response = self._send_handling_auth(
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_client.py", line 942, in _send_handling_auth
response = self._send_handling_redirects(
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_client.py", line 979, in _send_handling_redirects
response = self._send_single_request(request)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_client.py", line 1015, in _send_single_request
response = transport.handle_request(request)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_transports\default.py", line 233, in handle_request
resp = self._pool.handle_request(req)
File "C:\Users\10021782\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "c:\GPT\raptor\raptor-39\lib\site-packages\httpx\_transports\default.py", line 86, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "c:\GPT\raptor\raptor-39\lib\site-packages\tenacity\__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
File "c:\GPT\raptor\raptor-39\raptor\EmbeddingModels.py", line 26, in create_embedding
self.client.embeddings.create(input=[text], model=self.model)
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\resources\embeddings.py", line 105, in create
return self._post(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 1055, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 834, in request
return self._request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 890, in _request
return self._retry_request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 925, in _retry_request
return self._request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 890, in _request
return self._retry_request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 925, in _retry_request
return self._request(
File "c:\GPT\raptor\raptor-39\lib\site-packages\openai\_base_client.py", line 897, in _request
raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "c:\GPT\raptor\raptor-39\rag.py", line 28, in <module>
RA.add_documents(text)
File "c:\GPT\raptor\raptor-39\raptor\RetrievalAugmentation.py", line 220, in add_documents
self.tree = self.tree_builder.build_from_text(text=docs, use_multithreading=False)
File "c:\GPT\raptor\raptor-39\raptor\tree_builder.py", line 280, in build_from_text
_, node = self.create_node(index, text)
File "c:\GPT\raptor\raptor-39\raptor\tree_builder.py", line 175, in create_node
embeddings = {
File "c:\GPT\raptor\raptor-39\raptor\tree_builder.py", line 176, in <dictcomp>
model_name: model.create_embedding(text)
File "c:\GPT\raptor\raptor-39\lib\site-packages\tenacity\__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
File "c:\GPT\raptor\raptor-39\lib\site-packages\tenacity\__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
File "c:\GPT\raptor\raptor-39\lib\site-packages\tenacity\__init__.py", line 326, in iter
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1c7ff8743d0 state=finished raised APIConnectionError>]
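
For the CERTIFICATE_VERIFY_FAILED part of this trace, one possible workaround (assuming openai>=1.x and that your proxy's CA certificate is available as ca.pem; disabling verification entirely is unsafe) is to hand the OpenAI client an httpx client that trusts the proxy:

import httpx
from openai import OpenAI

# Trust the corporate proxy's self-signed CA instead of failing the TLS handshake
http_client = httpx.Client(verify="ca.pem")
client = OpenAI(api_key="key", base_url="https://proxy-url/v1", http_client=http_client)

For Azure specifically, openai>=1.x exposes an AzureOpenAI client that takes azure_endpoint and api_version directly, rather than the OPENAI_API_TYPE environment variables used above.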

`RA.retrieve`: AttributeError: 'NoneType' object has no attribute 'encode'

I asked this:

question = "What is the topic described in Article 202 ?"

answer = RA.retrieve(question, collapse_tree=True)

print("Answer: ", answer)

and got this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 4
      1 question = "What is the topic described in Article 202?"
      3 # answer = RA.answer_question(question=question)
----> 4 answer = RA.retrieve(question, collapse_tree=True)
      6 print("Answer: ", answer)

File /workspaces/aider_repos/raptor/RetrievalAugmentation.py:250, in RetrievalAugmentation.retrieve(self, question, start_layer, num_layers, max_tokens, collapse_tree, return_layer_information)
    245 if self.retriever is None:
    246     raise ValueError(
    247         "The TreeRetriever instance has not been initialized. Call 'add_documents' first."
    248     )
--> 250 return self.retriever.retrieve(
    251     question,
    252     start_layer,
    253     num_layers,
    254     max_tokens,
    255     collapse_tree,
    256     return_layer_information,
    257 )

File /workspaces/aider_repos/raptor/tree_retriever.py:293, in TreeRetriever.retrieve(self, query, start_layer, num_layers, max_tokens, collapse_tree, return_layer_information)
    291 if collapse_tree:
    292     logging.info(f"Using collapsed_tree")
--> 293     selected_nodes, context = self.retrieve_information_collapse_tree(
    294         query, max_tokens
    295     )
    296 else:
    297     layer_nodes = self.tree.layer_to_nodes[start_layer]

File /workspaces/aider_repos/raptor/tree_retriever.py:176, in TreeRetriever.retrieve_information_collapse_tree(self, query, max_tokens)
    174 for idx in indices:
    175     node = node_list[idx]
--> 176     node_tokens = len(self.tokenizer.encode(node.text))
    178     if total_tokens + node_tokens > max_tokens:
    179         break

AttributeError: 'NoneType' object has no attribute 'encode'

Adding new Document to the existing RAPTOR setup

RAPTOR looks interesting, but I see a big limitation when one wants to incrementally add information to a vector store (quite common in production scenarios). RAPTOR only works by looking globally at the entire pool of documents, as summaries are iteratively computed on clusters. This produces a sort of "immutable" vector store. In other words, if a user simply wants to add a document to an existing store, the full RAPTOR pipeline has to run again so that existing summaries take the new information into account, which may become quite expensive with many documents (in both cost and latency). Maybe one could simply replace the most similar summary at each level? I'd love to hear how people will address this.

Return Citation

Hi team, really awesome library. Is there a way to return the source text to be used in the retrieval as source citation? Similar to llama_index's CitationQueryEngine.
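
There is no built-in citation object, but the retrieve method returns the raw source text used for answering, so one hedged pattern is to surface it alongside the answer:

question = "How did Cinderella reach her happy ending?"
context = RA.retrieve(question, collapse_tree=True)  # the retrieved source text
answer = RA.answer_question(question=question)
print("Answer:", answer)
print("Sources:", context)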

Question about multiple texts

Hey @parthsarthi03, how many pages of text should I go up to for each tree? Also, if I wanted to ingest, say, 100 ebooks, do you recommend putting them all in the same tree? Or maybe adding another layer on top where the LLM chooses from tree-level summaries?

num_layers acts like max_num_layers

I am trying to build a tree with num_layers=3. When I access num_layers via RA.tree.num_layers, the value is 1, so the tree has only one layer.

It seems this parameter acts as a max_num_layers: if some condition is met, the tree never reaches num_layers.

The code used is the following:

from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Imports needed for this snippet to run
from raptor import BaseEmbeddingModel, RetrievalAugmentationConfig

class OpenAIEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model='text-embedding-3-small'):
        self.client = OpenAI()
        self.model = model

    @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def create_embedding(self, text):
        text = text.replace("\n", " ")
        return (
            self.client.embeddings.create(input=[text], model=self.model)
            .data[0]
            .embedding
        )
    
RAC = RetrievalAugmentationConfig(tb_num_layers=3, 
                                  tb_selection_mode='threshold', 
                                  tb_threshold=0.3,
                                  tb_summarization_length=250,
                                  embedding_model=OpenAIEmbeddingModel())

Inquiring about Vector DB implementation

Hi, thanks for the code.
I want to understand why no vector database is used for storing the embeddings for fast retrieval, as we do in conventional RAG.
The paper acknowledges that the collapsed-tree approach calculates cosine similarity against all nodes, and that a better approach is to use a fast k-nearest-neighbor library such as FAISS. So my questions are:
1- What were the considerations behind not integrating a vector database? Was there any benefit?
2- When recommending the adoption of k-nearest-neighbor libraries, is the intention solely to substitute the existing cosine-similarity search, so that you don't need to run the search over all the nodes?
3- And how can I integrate such a library for retrieval with my answer_question method?

Your insights on these matters would be greatly appreciated.

Thanks!
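
On questions 2 and 3, here is a minimal sketch of swapping the exhaustive cosine scan for FAISS. It assumes faiss-cpu is installed and that node_texts and node_embeddings have been collected from the built tree; that collection step is illustrative, not RAPTOR's API:

import faiss
import numpy as np

# node_embeddings: (num_nodes, dim) array; node_texts: parallel list of node texts
embeddings = np.asarray(node_embeddings, dtype=np.float32)
faiss.normalize_L2(embeddings)                 # so inner product equals cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

def retrieve_context(query_embedding, k=10):
    query = np.asarray([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)
    return "\n\n".join(node_texts[i] for i in ids[0])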

Sorry! We've encountered an issue with repetitive patterns in your prompt. Please try again with a different prompt

I wanted to try RAPTOR on some .txt files that I have


Context to summarize:
.
.
.
// A VERY BIG NUMBER OF LINES CONTAINING ONLY THESE DOTS (reduced here)
2024-03-08 18:43:52,574 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
Error code: 400 - {'error': {'message': "Sorry! We've encountered an issue with repetitive patterns in your prompt. Please try again with a different prompt.", 'type': 'invalid_request_error', 'param': 'prompt', 'code': 'invalid_prompt'}}

Is RAPTOR creating a cluster of punctuation (dots) in this case? Because my docs don't contain runs of dots like this.
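
One hedged guess is that chunking produced chunks consisting almost entirely of dots, which the summarizer is then asked to compress. A simple pre-filter before add_documents (a sketch, not part of RAPTOR) would rule that out:

def strip_filler(text):
    # Drop lines that contain nothing but dots, dashes, and whitespace
    kept = [line for line in text.splitlines() if line.strip(" .-\t")]
    return "\n".join(kept)

RA.add_documents(strip_filler(text))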

UMAP n_neighbors must be greater than 1

Hi team, I am currently building with RAPTOR to do open-domain QA as follows: our data is stored as question-answer pairs. Given a user query, I match it against the top-k most related questions in the data, concatenate their answers, and then use RAPTOR to answer the query. But as the docs passed to RA.add_documents(docs) get longer, I get an "n_neighbors must be greater than 1" error from UMAP's fit_transform in this code chunk:
def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    if n_neighbors is None:
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    reduced_embeddings = umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
    return reduced_embeddings
Is there any way to resolve this UMAP issue?
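
One hedged local fix is to clamp n_neighbors to UMAP's minimum of 2, since the default derives it from the number of embeddings and hits 1 for very small inputs (a patch sketch for the snippet above):

if n_neighbors is None:
    # max(2, ...) keeps tiny inputs from producing n_neighbors <= 1
    n_neighbors = max(2, int((len(embeddings) - 1) ** 0.5))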

How to plug in my own inference logic

Thanks for your work. I have chat logic in my own code; for example, I must use Azure OpenAI. How could I include my own inference logic in a class while using the RAPTOR methodology?
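
One possible shape for this, following the custom-model pattern from the README (a sketch assuming openai>=1.x and your own Azure endpoint, key, API version, and deployment name; not an official RAPTOR integration):

from openai import AzureOpenAI
from raptor import BaseQAModel

class AzureQAModel(BaseQAModel):
    def __init__(self, deployment="your-gpt-deployment"):
        self.client = AzureOpenAI(
            azure_endpoint="https://your-resource.openai.azure.com",
            api_key="your-azure-api-key",
            api_version="2023-05-15",
        )
        self.deployment = deployment

    def answer_question(self, context, question):
        response = self.client.chat.completions.create(
            model=self.deployment,
            messages=[
                {"role": "system", "content": "Answer using only the given context."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content

The same pattern applies to summarization and embeddings via BaseSummarizationModel and BaseEmbeddingModel, with the custom models then passed through RetrievalAugmentationConfig as shown in the README.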

NarrativeQA metrics

Can you share your modified NarrativeQA metrics calculation script? I can't find it in the released code.

Want to use RAPTOR for legal research. How to add legislation citations?

First of all, thanks for publishing the paper and the Python code; both are easy to follow. I am trying to use RAPTOR to build a backend for legal research. I input the legislation with section numbers, but in the summarization steps the section number information is lost. Should I amend the ChatGPT prompt to keep the section numbers? What are your recommendations for adapting RAPTOR to legal research that requires citations to section numbers and legislation names?
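
One hedged approach is a summarization prompt that explicitly instructs the model to keep citations, via the BaseSummarizationModel extension point described in the README. The prompt wording and the llm callable here are illustrative:

from raptor import BaseSummarizationModel

CITATION_PROMPT = (
    "Summarize the following legislation. Preserve all section numbers and "
    "legislation names verbatim so the summary remains citable:\n\n{context}"
)

class CitationAwareSummarizer(BaseSummarizationModel):
    def __init__(self, llm):
        self.llm = llm  # any callable mapping a prompt string to a completion string

    def summarize(self, context, max_tokens=150):
        return self.llm(CITATION_PROMPT.format(context=context))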

Clarification about the experiment setting in the paper

Hi, thanks for the work and I found the idea really interesting!

Since I wanted to make a reproduction of the experiments, I have a small question. To my knowledge, your method is to build the tree first and query the tree later. However, in the QuALITY dataset, there is one passage of around 5000 tokens for each question. In this case, do you build one tree for each question using the passage corresponding to that specific question?
Also, for the other datasets (NarrativeQA and QASPER), do you build a shared tree for all questions?

Thanks again for your kind sharing!

RAG seems lazy and doesn't understand the sample

Hello, I used the demo notebook you provided with some questions. Most of the time the model cannot answer questions whose answers can be found in the sample text. Could you provide some guidance on overcoming this issue?

Example:

Question: "How did her life became hard?" 
Answer:  Sure, here's the answer to your question: The passage does not provide any information about how her life became hard, so I cannot answer this question from the context.

Maybe something is wrong with the context? The code is exactly the same as your demo, except for a few lines added to install dependencies and correct typos.

RAPTOR_Clustering() takes no arguments

I encountered an error when I imported a relatively long txt file; I did not encounter it with the txt in the demo. Could you please tell me how to deal with this problem? Thank you very much.

Traceback (most recent call last):
File "/home/jyc23/raptor-master/demo/demo.py", line 132, in
RA.add_documents(text)
File "/home/jyc23/raptor-master/raptor/RetrievalAugmentation.py", line 217, in add_documents
self.tree = self.tree_builder.build_from_text(text=docs)
File "/home/jyc23/raptor-master/raptor/tree_builder.py", line 280, in build_from_text
root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
File "/home/jyc23/raptor-master/raptor/cluster_tree_builder.py", line 102, in construct_tree
clusters = self.clustering_algorithm.perform_clustering(
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 226, in perform_clustering
RAPTOR_Clustering(
TypeError: RAPTOR_Clustering() takes no arguments

Question about experiment

In the paper's experimental section, is the tree constructed using multiple documents or just a single document? Based on the code, it appears that a tree can only be constructed from one document. I look forward to your response!
