
chat-with-github-repo's People

Contributors

dtbuchholz, eltociear, kfish, marciob, nothans, peterw, sanchitram1, theurgi

chat-with-github-repo's Issues

[Discussion] idea to include filenames

This is amazing; I just tried it today on an open-source project I work on. After querying it a number of times, I got to asking it about filenames and realized it doesn't have that context.

Would it be easy to add? I imagine so, but you would know better!
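One lightweight way to give the model filename context is to prepend each chunk's source path to its text before embedding. This is only a sketch of the idea, not the project's actual code; `Doc` below is a stand-in for LangChain's `Document` class:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # stand-in for a LangChain Document: chunk text plus metadata
    page_content: str
    metadata: dict

def prepend_filename(docs):
    """Prefix each chunk with its source path so filenames become searchable."""
    return [
        Doc(
            f"# File: {d.metadata.get('source', 'unknown')}\n{d.page_content}",
            d.metadata,
        )
        for d in docs
    ]
```

Since the path also lives in `metadata`, an alternative is to surface it at answer time instead of embedding time; prepending it just makes filename questions answerable by plain similarity search.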

Expensive

I just ran the code, and before the first build even finished (with "Evaluating ingest" at 30%) it had already spent $100 on the default OpenAI GPT-3.5 API. Is this normal? This cannot continue: the tested repo is really small, so what will happen with a larger codebase, and with the cost of actually talking to the chat?
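$100 to ingest a small repo is far above what embedding alone should cost, which suggests something else (a retry loop, re-ingesting on every run, or a pricier model) is burning money. A back-of-envelope check helps spot this; the ~4 characters per token figure is a common heuristic, and the price is left as a parameter since it changes over time:

```python
def estimate_embedding_cost(total_chars, price_per_1k_tokens):
    """Rough embedding cost estimate, assuming ~4 characters per token."""
    tokens = total_chars / 4
    return tokens / 1000 * price_per_1k_tokens

# e.g. a repo with 2 MB of text at a hypothetical $0.0001 per 1K tokens:
cost = estimate_embedding_cost(2_000_000, price_per_1k_tokens=0.0001)  # roughly $0.05
```

If the estimate comes out in cents but the bill is in dollars, the ingestion is likely running repeatedly or hitting a chat-completion endpoint rather than the embeddings endpoint.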

deeplake.util.exceptions.TensorDoesNotExistError: "Tensor 'id' does not exist."

I'm getting an error during processing. Any suggestions?

Deep Lake Dataset in hub://adrenalineuser/repo-chat already exists, loading from the storage
Traceback (most recent call last):
  File "src/main.py", line 110, in <module>
    main()
  File "src/main.py", line 104, in main
    process_repo(args)
  File "src/main.py", line 33, in process_repo
    process(
  File "/home/username/DevelopmentWSL/Chat-with-Github-Repo/src/utils/process.py", line 105, in process
    db.add_documents(texts)
  File "/home/username/.local/lib/python3.8/site-packages/langchain/vectorstores/base.py", line 72, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "/home/username/.local/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 184, in add_texts
    return self.vectorstore.add(
  File "/home/username/.local/lib/python3.8/site-packages/deeplake/core/vectorstore/deeplake_vectorstore.py", line 230, in add
    processed_tensors, id = dataset_utils.preprocess_tensors(
  File "/home/username/.local/lib/python3.8/site-packages/deeplake/core/vectorstore/vector_search/dataset/dataset.py", line 254, in preprocess_tensors
    if dataset and dataset[tensor_name].htype == "image":
  File "/home/username/.local/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 506, in __getitem__
    raise TensorDoesNotExistError(item)
deeplake.util.exceptions.TensorDoesNotExistError: "Tensor 'id' does not exist."

Embeddings are not getting created in Deep Lake

When I start the streamlit app, with the .env file as below:

OPENAI_API_KEY=sk-xxxxxxxx
ACTIVELOOP_TOKEN=xxxxxxxxxxxxxx
DEEPLAKE_USERNAME=vinodvarma24
DEEPLAKE_DATASET_PATH=hub://orgname/embeddings
DEEPLAKE_REPO_NAME=embeddings
REPO_URL=https://github.com/peterw/Chat-with-Github-Repo
SITE_TITLE="Your Site Title"

I'm getting the message below, saying the dataset is read-only and that `label` got an empty value.

May I know what the issue is here?

hub://talktodata/embeddings loaded successfully.

2023-04-26 07:58:11.346 Deep Lake Dataset in hub://orgname/embeddings already exists, loading from the storage
Dataset(path='hub://orgname/embeddings', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype    shape    dtype  compression
  -------   -------  -------  -------  ------- 
 embedding  generic   (0,)    float32   None   
    ids      text     (0,)      str     None   
 metadata    json     (0,)      str     None   
   text      text     (0,)      str     None   
2023-04-26 07:58:11.348 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.

ValueError

This is the error I got after processing this repo: https://github.com/ArshaanB/question-and-answer-contracts

Any tips?

ValueError: shapes (1536,) and (0,) not aligned: 1536 (dim 0) != 0 (dim 0)
Traceback:
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/Users/arshaan/Desktop/development/ai/Chat-with-Github-Repo/src/utils/chat.py", line 93, in <module>
    run_chat_app(args.activeloop_dataset_path)
File "/Users/arshaan/Desktop/development/ai/Chat-with-Github-Repo/src/utils/chat.py", line 42, in run_chat_app
    output = search_db(db, user_input)
File "/Users/arshaan/Desktop/development/ai/Chat-with-Github-Repo/src/utils/chat.py", line 85, in search_db
    return qa.run(query)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/chains/base.py", line 213, in run
    return self(args[0])[self.output_keys[0]]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/chains/retrieval_qa/base.py", line 109, in _call
    docs = self._get_docs(question)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/chains/retrieval_qa/base.py", line 166, in _get_docs
    return self.retriever.get_relevant_documents(question)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/vectorstores/base.py", line 279, in get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 350, in similarity_search
    return self.search(query=query, k=k, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 294, in search
    indices, scores = vector_search(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 47, in vector_search
    distances = distance_metric_map[distance_metric](query_embedding, data_vectors)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 22, in <lambda>
    "cos": lambda a, b: np.dot(a, b.T)
File "<__array_function__ internals>", line 200, in dot
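The shapes in the error suggest the dataset contains zero embeddings (note the `(0,)` shapes in the "Embeddings are not getting created" issue above), so the cosine-similarity step ends up dotting a 1536-dimensional query vector against an empty array. The NumPy failure can be reproduced in isolation; this is a minimal repro, not project code:

```python
import numpy as np

query_embedding = np.zeros(1536, dtype=np.float32)  # one ada-002-sized query vector
data_vectors = np.empty((0,), dtype=np.float32)     # what an empty dataset yields

try:
    np.dot(query_embedding, data_vectors.T)
except ValueError as exc:
    print(exc)  # shapes (1536,) and (0,) not aligned
```

So the fix is upstream: the `process` step never stored any embeddings, and re-running ingestion successfully should make the search work.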

Exception in chat.py due to maximal_marginal_relevance Invalid Argument in DeepLake Similarity Search

Howdy! I was going through the readme, all was well until I got to the step of doing a search.

Describe the bug
An uncaught exception occurs in the chat.py module when executing a similarity search through the DeepLake vector store. The traceback indicates that the maximal_marginal_relevance argument is not a valid parameter for the search method. This results in a failure of the search_db function, impacting the chat application's ability to process and respond to user inputs.

Steps to reproduce:

1. Set up the environment and dependencies as per the project requirements.
2. Run the chat application with the command `python src/main.py chat --activeloop-dataset-name my-dataset`.
3. Input a query that triggers the `search_db` function, for example "what are the apis of the project".
4. The application throws the exception and terminates.
Expected behavior
The expected behavior is for the application to successfully process the query and return relevant results without crashing. The maximal_marginal_relevance argument should either be correctly handled or removed if it's not applicable to the similarity search method in the DeepLake vector store.
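The `TypeError` comes from a keyword-argument whitelist check inside the newer DeepLake vector-store wrapper, which rejects kwargs it no longer understands. The pattern looks roughly like the following sketch; the function name and whitelist contents here are illustrative, not the library's actual lists:

```python
VALID_SEARCH_KWARGS = {"k", "distance_metric", "filter"}  # illustrative whitelist

def validate_search_kwargs(kwargs, method="search"):
    """Reject unknown keyword arguments, mimicking the library's check."""
    unsupported = set(kwargs) - VALID_SEARCH_KWARGS
    if unsupported:
        raise TypeError(
            f"`{', '.join(sorted(unsupported))}` are not a valid "
            f"argument to {method} method"
        )

validate_search_kwargs({"k": 4})  # passes silently
```

The practical fix is to stop passing `maximal_marginal_relevance` through `search_kwargs`; if MMR retrieval is wanted, recent LangChain versions expose it via `db.as_retriever(search_type="mmr")` instead.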

exception below

2024-02-29 09:03:58.916 Uncaught app exception
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
  File "/Users/mnakhimovich/workspace/Chat-with-Github-Repo/src/utils/chat.py", line 93, in <module>
    run_chat_app(args.activeloop_dataset_path)
  File "/Users/mnakhimovich/workspace/Chat-with-Github-Repo/src/utils/chat.py", line 42, in run_chat_app
    output = search_db(db, user_input)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mnakhimovich/workspace/Chat-with-Github-Repo/src/utils/chat.py", line 85, in search_db
    return qa.run(query)
           ^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_core/_api/deprecation.py", line 145, in warning_emitting_wrapper
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/base.py", line 545, in run
    return self(args[0], callbacks=callbacks, tags=tags, metadata=metadata)[
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_core/_api/deprecation.py", line 145, in warning_emitting_wrapper
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/base.py", line 378, in __call__
    return self.invoke(
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/base.py", line 163, in invoke
    raise e
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/base.py", line 153, in invoke
    self._call(inputs, run_manager=run_manager)
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/retrieval_qa/base.py", line 141, in _call
    docs = self._get_docs(question, run_manager=_run_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/chains/retrieval_qa/base.py", line 221, in _get_docs
    return self.retriever.get_relevant_documents(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_core/retrievers.py", line 244, in get_relevant_documents
    raise e
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_core/retrievers.py", line 237, in get_relevant_documents
    result = self._get_relevant_documents(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_core/vectorstores.py", line 674, in _get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_community/vectorstores/deeplake.py", line 530, in similarity_search
    return self._search(
           ^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_community/vectorstores/deeplake.py", line 402, in _search
    self._validate_kwargs(kwargs, "search")
  File "/opt/homebrew/lib/python3.11/site-packages/langchain_community/vectorstores/deeplake.py", line 929, in _validate_kwargs
    raise TypeError(
TypeError: `maximal_marginal_relevance` are not a valid argument to search method

ValueError when processing repo: `texts` parameter shouldn't be empty.

Description

I followed the README instructions for installation and encountered a ValueError when processing a repository:

(.venv) ➜  Chat-with-Github-Repo (main) python3 src/main.py process --repo-url https://github.com/chrisammon3000/aws-json-dataset               ✭
Your Deep Lake dataset has been successfully created!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/chrisammon3000/aws-json-dataset
hub://chrisammon3000/aws-json-dataset loaded successfully.
fatal: destination path 'repos' already exists and is not an empty directory.
Using embedding function is deprecated and will be removed in the future. Please use embedding instead.
Deep Lake Dataset in hub://chrisammon3000/aws-json-dataset already exists, loading from the storage
Traceback (most recent call last):
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/src/main.py", line 110, in <module>
    main()
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/src/main.py", line 104, in main
    process_repo(args)
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/src/main.py", line 33, in process_repo
    process(
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/src/utils/process.py", line 105, in process
    db.add_documents(texts)
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/.venv/lib/python3.11/site-packages/langchain/vectorstores/base.py", line 101, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ammon/Projects/chrisammon3000/experiments/Chat-with-Github-Repo/.venv/lib/python3.11/site-packages/langchain/vectorstores/deeplake.py", line 217, in add_texts
    raise ValueError("`texts` parameter shouldn't be empty.")
ValueError: `texts` parameter shouldn't be empty.
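Note the earlier log line `fatal: destination path 'repos' already exists and is not an empty directory.`: the clone was skipped, so no files were read and `texts` ended up empty. The real fix is to delete or reuse the stale `repos` directory, but a guard along these lines (a hypothetical helper, not the repo's code) would at least fail with an actionable message:

```python
def require_documents(texts, source="repository"):
    """Fail early with a clear message when extraction produced nothing."""
    if not texts:
        raise ValueError(
            f"No documents were extracted from the {source}; check that the "
            "clone succeeded (delete a stale 'repos' directory) and that the "
            "file filters are not excluding everything."
        )
    return texts
```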

Deeplake Transform failed error

Following is the entire error thread

fatal: destination path './gumroad' already exists and is not an empty directory.
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/sai13579/code_repo_qa2

hub://sai13579/code_repo_qa2 loaded successfully.

Deep Lake Dataset in hub://sai13579/code_repo_qa2 already exists, loading from the storage
Dataset(path='hub://sai13579/code_repo_qa2', tensors=[])

 tensor    htype    shape    dtype  compression
 -------  -------  -------  -------  -------
Evaluating ingest: 0%|                                                                                    | 0/1 [00:02<? 
Error in sys.excepthook:
Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\humbug\report.py", line 540, in _hook 
    self.error_report(error=exception_instance, tags=tags, publish=publish)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\humbug\report.py", line 274, in error_report
    traceback.format_exception(
TypeError: format_exception() got an unexpected keyword argument 'etype'

Original exception was:
Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_tensor.py", line 117, in append
    raise TensorDoesNotExistError(self.name)
deeplake.util.exceptions.TensorDoesNotExistError: "Tensor 'text' does not exist."

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\util\transform.py", line 207, in _transform_and_append_data_slice
    out = transform_sample(sample, pipeline, tensors)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\util\transform.py", line 75, 
in transform_sample
    fn(out, result, *args, **kwargs)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\deeplake.py", line 219, in ingest
    sample_out.append(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_dataset.py", line 67, in append
    self[k].append(v)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_tensor.py", line 127, in append
    raise SampleAppendError(self.name, item) from e
deeplake.util.exceptions.SampleAppendError: Failed to append the sample [core]
        repositoryformatversion = 0
        filemode = false
        bare = false
        logallrefupdates = true
        symlinks = false
        ignorecase = true
[remote "origin"]
        url = https://github.com/sai-krishna-msk/VtopScrapper
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master to the tensor 'text'. See more details in the traceback.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\320117176\OneDrive - Philips\Documents\projects\ai_agent\Chat-with-Github-Repo\github.py", line 53, in <module>
    main(repo_url, root_dir, deeplake_repo_name, deeplake_username)
  File "c:\Users\320117176\OneDrive - Philips\Documents\projects\ai_agent\Chat-with-Github-Repo\github.py", line 44, in main
    db.add_documents(texts)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\base.py", line 
61, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\deeplake.py", line 236, in add_texts
    ingest().eval(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform.py", line 99, in eval
    pipeline.eval(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform.py", line 298, in eval
    raise TransformError(
deeplake.util.exceptions.TransformError: Transform failed at index 0 of the input data on the item: [('[core]\n\trepositoryformatversion = 0\n\tfilemo...n\\HEAD'}, 'a217eccf-e42f-11ed-94dd-f47b099e160e')]. See traceback for more details. 
  • I have tried different repositories; the error above was produced using this very repository.
  • The exact error is raised at line 44 of github.py: `db.add_documents(texts)`
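The sample that failed to append is the contents of a `.git/config` file, which suggests the loader walked into the repository's `.git` directory and tried to embed VCS metadata. A sketch of pruning `.git` while collecting files (illustrative only; the repo's actual loader may differ):

```python
import os

def iter_repo_files(root):
    """Yield file paths under root, skipping .git metadata."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # prune in place
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning `dirnames` in place is how `os.walk` is told not to descend into a directory at all, so nothing under `.git` is ever visited.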

Can anyone please help me understand and resolve this issue?

Thank you in advance 🙌

Why am I getting `openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 16762 tokens`?

Hello, I'm getting this error; any idea how to fix it?

2023-05-02 16:58:57.051 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.
2023-05-02 16:59:06.633 Uncaught app exception
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/Users/xfocus/Downloads/chatRepo/Chat-with-Github-Repo/chat.py", line 104, in <module>
    output = search_db(user_input)
  File "/Users/xfocus/Downloads/chatRepo/Chat-with-Github-Repo/chat.py", line 85, in search_db
    result = qa({"query": query})
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 110, in _call
    answer = self.combine_documents_chain.run(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 216, in run
    return self(kwargs)[self.output_keys[0]]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 75, in _call
    output, extra_return_dict = self.combine_docs(docs, **other_keys)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 82, in combine_docs
    return self.llm_chain.predict(**inputs), {}
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/llm.py", line 151, in predict
    return self(kwargs)[self.output_key]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/llm.py", line 57, in _call
    return self.apply([inputs])[0]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/llm.py", line 118, in apply
    response = self.generate(input_list)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/llm.py", line 62, in generate
    return self.llm.generate_prompt(prompts, stop)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/base.py", line 82, in generate_prompt
    raise e
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/base.py", line 79, in generate_prompt
    output = self.generate(prompt_messages, stop=stop)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/base.py", line 54, in generate
    results = [self._generate(m, stop=stop) for m in messages]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/base.py", line 54, in <listcomp>
    results = [self._generate(m, stop=stop) for m in messages]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 266, in _generate
    response = self.completion_with_retry(messages=message_dicts, **params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 228, in completion_with_retry
    return _completion_with_retry(**kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 226, in _completion_with_retry
    return self.client.create(**kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/openai/api_resources/chat_completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 16762 tokens. Please reduce the length of the messages.
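The 16762-token prompt means the retrieved documents being stuffed into the prompt exceed GPT-3.5's 4097-token window on their own. In practice the fix is to lower the retriever's `k` or the splitter's chunk size; the budget arithmetic can be sketched like this (a hypothetical helper, and a real tokenizer should replace rough token counts):

```python
def max_docs_that_fit(doc_token_counts, prompt_tokens, limit=4097, reserve=512):
    """How many retrieved docs fit in the context window, leaving room to answer."""
    budget = limit - prompt_tokens - reserve
    total = 0
    for i, count in enumerate(doc_token_counts):
        total += count
        if total > budget:
            return i
    return len(doc_token_counts)

# with a ~100-token question template, only 3 of these 1000-token chunks fit:
k = max_docs_that_fit([1000, 1000, 1000, 1000], prompt_tokens=100)  # → 3
```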

Can't replicate the intended behavior

After installing the app and scraping the repo you referred to in the demo (https://github.com/peterw/Gumroad-Landing-Page-Generator) I can't get the chat to analyze the repo.

This is my chat interaction, using the same questions as in the demo. It looks like the repo's embeddings are not being used during inference.
[screenshot of the chat interaction]

This is the logging output in the terminal, not sure if it's relevant.

2023-04-27 19:05:51.248 Deep Lake Dataset in my_test_dataset already exists, loading from the storage
Dataset(path='my_test_dataset', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype    shape    dtype  compression
  -------   -------  -------  -------  ------- 
 embedding  generic   (0,)    float32   None   
    ids      text     (0,)      str     None   
 metadata    json     (0,)      str     None   
   text      text     (0,)      str     None   
2023-04-27 19:05:51.255 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.

TypeError: DeepLake.__init__() got an unexpected keyword argument 'read_only'

2023-04-25 18:37:50.156 Uncaught app exception
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "Documents/Chat-with-Github-Repo/chat.py", line 38, in <module>
    db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.__init__() got an unexpected keyword argument 'read_only'
2023-04-25 18:37:51.843 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
2023-04-25 18:37:54.475 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
2023-04-25 18:37:58.851 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
2023-04-25 18:38:08.388 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
2023-04-25 18:47:02.915 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
2023-04-25 18:47:02.918 Uncaught app exception
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "Documents/Chat-with-Github-Repo/chat.py", line 38, in
db = DeepLake(dataset_path=active_loop_data_set_path, read_only=True, embedding_function=embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepLake.init() got an unexpected keyword argument 'read_only'
(The same traceback repeats on every Streamlit rerun.)
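This `TypeError` is most likely a version mismatch: the installed langchain `DeepLake` wrapper simply does not accept a `read_only` keyword, so upgrading langchain (or dropping the kwarg) should resolve it. A generic defensive pattern, sketched below without importing langchain, is to filter keyword arguments against the callable's actual signature before passing them. `old_init` is a hypothetical stand-in for an older `DeepLake.__init__`:

```python
import inspect

def supported_kwargs(func, **kwargs):
    """Drop keyword arguments that `func` does not accept.

    Guards against wrappers that gained a parameter (here: `read_only`)
    only in newer releases.
    """
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs  # func takes **kwargs; everything is accepted
    return {k: v for k, v in kwargs.items() if k in params}

# Hypothetical stand-in for an older DeepLake.__init__ without `read_only`:
def old_init(dataset_path, embedding_function=None):
    return dataset_path

kwargs = supported_kwargs(
    old_init,
    dataset_path="hub://user/repo",
    read_only=True,
    embedding_function=None,
)
old_init(**kwargs)  # no TypeError: `read_only` was filtered out
```

This keeps the same call site working across library versions at the cost of silently ignoring unsupported options, so it is best paired with a log line noting which kwargs were dropped.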

Deep Lake S3 error

deeplake.util.exceptions.S3DeletionError: An error occurred (AccessDenied) when calling the DeleteObjects operation: User: arn:aws:sts::10000007...:assumed-role/hyper-assumerole/xxx is not authorized to perform: s3:DeleteObject on resource: arn:aws:s3:::snark-hub

openai.error.RateLimitError

The code crashes in the processing step with the following error:

openai.error.RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-vBsMRTgzeINBj79qEcyzb4EI on tokens per min. Limit: 1000000 / min. Current: 1 / min. Contact us through our help center at help.openai.com if you continue to have issues.

I understand this is caused by sending too many tokens at once through the API. Could a timer be added to the code so the tokens are sent at fixed intervals?
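The pacing idea above can be sketched in plain Python: process the chunks in fixed-size batches and sleep between batches so the tokens-per-minute cap is not hit. Nothing below is the repo's actual API; `send` is a stand-in for the real embedding call, and the sleep function is injectable so the throttling can be tested without waiting:

```python
import time

def send_in_batches(chunks, send, batch_size=100, interval=60.0, sleep=time.sleep):
    """Call `send` on fixed-size batches of `chunks`, pausing `interval`
    seconds between batches to stay under a tokens-per-minute limit."""
    results = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        results.extend(send(batch))
        if start + batch_size < len(chunks):
            sleep(interval)  # throttle before the next batch
    return results

# Demo with a fake embedder and a no-op sleep:
fake_embed = lambda batch: [len(text) for text in batch]
out = send_in_batches(["ab", "cde", "f"], fake_embed, batch_size=2, sleep=lambda s: None)
assert out == [2, 3, 1]
```

A fixed interval is the simplest fix; a fuller solution would also catch `RateLimitError` and retry with exponential backoff, since the limit is on tokens rather than requests.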

Corrupted dataset error when running github.py

Traceback (most recent call last):
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 240, in __init__
    self._set_derived_attributes(address=address)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 2065, in _set_derived_attributes
    self._set_read_only(
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 1727, in _set_read_only
    locked = self._lock(err=err)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 1296, in _lock
    raise ReadOnlyModeError()
deeplake.util.exceptions.ReadOnlyModeError: Modification when in read-only mode is not supported!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/api/dataset.py", line 569, in load
    return dataset._load(dataset_kwargs, access_method)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/api/dataset.py", line 638, in _load
    ret = dataset_factory(**dataset_kwargs)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/__init__.py", line 23, in dataset_factory
    ds = clz(path=path, *args, **kwargs)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/core/dataset/dataset.py", line 246, in __init__
    raise ReadOnlyModeError(
deeplake.util.exceptions.ReadOnlyModeError: This dataset cannot be open for writing as you don't have permissions. Try loading the dataset with `read_only=True`.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "github.py", line 46, in <module>
    main(repo_url, root_dir, deeplake_repo_name, deeplake_username)
  File "github.py", line 37, in main
    db = DeepLake(dataset_path=f"hub://{username}/{repo_name}", embedding_function=embeddings)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/langchain/vectorstores/deeplake.py", line 125, in __init__
    self.ds = deeplake.load(
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/util/spinner.py", line 139, in inner
    return func(*args, **kwargs)
  File "/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/api/dataset.py", line 581, in load
    raise DatasetCorruptError(
deeplake.util.exceptions.DatasetCorruptError: Exception occured (see Traceback). The dataset maybe corrupted. Try using reset=True to reset HEAD changes and load the previous commit. This will delete all uncommitted changes on the branch you are trying to load.
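The error message itself names the recovery path: retry the load with `reset=True`, accepting that uncommitted HEAD changes are discarded. A hedged, library-free sketch of that try-then-reset pattern follows; the `load` stub stands in for `deeplake.load` / the `DeepLake(...)` constructor and is not deeplake's real API beyond the `reset` keyword mentioned in the error:

```python
class DatasetCorruptError(Exception):
    """Stand-in for deeplake.util.exceptions.DatasetCorruptError."""

def load(path, reset=False):
    # Hypothetical stub: fails like a dataset with uncommitted HEAD
    # changes unless reset=True is passed.
    if not reset:
        raise DatasetCorruptError("The dataset maybe corrupted.")
    return f"loaded:{path}"

def load_with_recovery(path):
    """Try a normal load; on corruption, retry with reset=True.

    reset=True is destructive: it drops all uncommitted changes on the
    branch, so fall back to it only deliberately.
    """
    try:
        return load(path)
    except DatasetCorruptError:
        return load(path, reset=True)

load_with_recovery("hub://user/repo")  # falls back to the reset load
```

Because the reset discards work, production code would typically log the fallback (or require an explicit flag) rather than resetting silently.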
