
Verba

The Golden RAGtriever


Welcome to Verba: The Golden RAGtriever, an open-source application designed to offer an end-to-end, streamlined, and user-friendly interface for Retrieval-Augmented Generation (RAG) out of the box. In just a few easy steps, explore your datasets and extract insights with ease, either locally with Ollama and Hugging Face or through LLM providers such as Anthropic, Cohere, and OpenAI.

pip install goldenverba

Demo of Verba

What Is Verba?

Verba is a fully-customizable personal assistant utilizing Retrieval Augmented Generation (RAG) for querying and interacting with your data, either locally or deployed via cloud. Resolve questions around your documents, cross-reference multiple data points or gain insights from existing knowledge bases. Verba combines state-of-the-art RAG techniques with Weaviate's context-aware database. Choose between different RAG frameworks, data types, chunking & retrieving techniques, and LLM providers based on your individual use-case.

Open Source Spirit

Weaviate is proud to offer this open-source project for the community. While we strive to address issues as fast as we can, please understand that it may not be maintained with the same rigor as production software. We welcome and encourage community contributions to help keep it running smoothly. Your support in fixing open issues quickly is greatly appreciated.

Watch our newest Verba video here:

VIDEO LINK

Feature Lists

| 🤖 Model Support | Implemented | Description |
| --- | --- | --- |
| Ollama (e.g. Llama3) | ✅ | Local Embedding and Generation Models powered by Ollama |
| HuggingFace (e.g. MiniLMEmbedder) | ✅ | Local Embedding Models powered by HuggingFace |
| Cohere (e.g. Command R+) | ✅ | Embedding and Generation Models by Cohere |
| Anthropic (e.g. Claude Sonnet) | ✅ | Embedding and Generation Models by Anthropic |
| OpenAI (e.g. GPT4) | ✅ | Embedding and Generation Models by OpenAI |

| 🤖 Embedding Support | Implemented | Description |
| --- | --- | --- |
| Weaviate | ✅ | Embedding Models powered by Weaviate |
| Ollama | ✅ | Local Embedding Models powered by Ollama |
| SentenceTransformers | ✅ | Embedding Models powered by HuggingFace |
| Cohere | ✅ | Embedding Models by Cohere |
| VoyageAI | ✅ | Embedding Models by VoyageAI |
| OpenAI | ✅ | Embedding Models by OpenAI |

| 📁 Data Support | Implemented | Description |
| --- | --- | --- |
| UnstructuredIO | ✅ | Import Data through Unstructured |
| Firecrawl | ✅ | Scrape and Crawl URLs through Firecrawl |
| PDF Ingestion | ✅ | Import PDF into Verba |
| GitHub & GitLab | ✅ | Import Files from GitHub and GitLab |
| CSV/XLSX Ingestion | ✅ | Import Table Data into Verba |
| .DOCX | ✅ | Import .docx files |
| Multi-Modal (using AssemblyAI) | ✅ | Import and Transcribe Audio through AssemblyAI |

| ✨ RAG Features | Implemented | Description |
| --- | --- | --- |
| Hybrid Search | ✅ | Semantic Search combined with Keyword Search |
| Autocomplete Suggestion | ✅ | Verba suggests autocompletions |
| Filtering | ✅ | Apply Filters (e.g. documents, document types etc.) before performing RAG |
| Customizable Metadata | ✅ | Free control over Metadata |
| Async Ingestion | ✅ | Ingest data asynchronously to speed up the process |
| Advanced Querying | planned ⏱️ | Task Delegation Based on LLM Evaluation |
| Reranking | planned ⏱️ | Rerank results based on context for improved results |
| RAG Evaluation | planned ⏱️ | Interface for Evaluating RAG pipelines |

| 🗡️ Chunking Techniques | Implemented | Description |
| --- | --- | --- |
| Token | ✅ | Chunk by Token powered by spaCy |
| Sentence | ✅ | Chunk by Sentence powered by spaCy |
| Semantic | ✅ | Chunk and group by semantic sentence similarity |
| Recursive | ✅ | Recursively chunk data based on rules |
| HTML | ✅ | Chunk HTML files |
| Markdown | ✅ | Chunk Markdown files |
| Code | ✅ | Chunk Code files |
| JSON | ✅ | Chunk JSON files |

| 🆒 Cool Bonus | Implemented | Description |
| --- | --- | --- |
| Docker Support | ✅ | Verba is deployable via Docker |
| Customizable Frontend | ✅ | Verba's frontend is fully customizable |
| Vector Viewer | ✅ | Visualize your data in 3D |

| 🤝 RAG Libraries | Implemented | Description |
| --- | --- | --- |
| LangChain | ✅ | Implement LangChain RAG pipelines |
| Haystack | planned ⏱️ | Implement Haystack RAG pipelines |
| LlamaIndex | planned ⏱️ | Implement LlamaIndex RAG pipelines |
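
As an illustration of the Token chunking technique above, here is a minimal sketch with overlap. It uses a plain whitespace split as a stand-in for Verba's actual spaCy tokenizer, and the 250/50 unit and overlap values are illustrative defaults:

```python
# Minimal sketch of token chunking with overlap. Verba's real chunker is
# powered by spaCy; a whitespace split stands in for the tokenizer here.
def chunk_tokens(text: str, units: int = 250, overlap: int = 50) -> list[str]:
    tokens = text.split()
    step = units - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + units]))
        if start + units >= len(tokens):
            break
    return chunks

# 600 pseudo-tokens produce 3 overlapping chunks of at most 250 tokens each.
chunks = chunk_tokens(" ".join(str(i) for i in range(600)))
```

Each chunk shares its last 50 tokens with the start of the next, so retrieved chunks keep some surrounding context.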

Something is missing? Feel free to create a new issue or discussion with your idea!

Showcase of Verba


Getting Started with Verba

You have three deployment options for Verba:

  • Install via pip
pip install goldenverba
  • Build from Source
git clone https://github.com/weaviate/Verba

pip install -e .
  • Use Docker for Deployment

Prerequisites: If you're not using Docker, ensure that you have Python >=3.10.0 installed on your system.

git clone https://github.com/weaviate/Verba

docker compose --env-file <your-env-file> up -d --build

If you're unfamiliar with Python and Virtual Environments, please read the python tutorial guidelines.

API Keys

You can set all API keys in the Verba frontend, but to make your life easier, you can also prepare a .env file in which Verba will automatically look for the keys. Create a .env file in the directory you want to start Verba from. You can find an .env.example file in the goldenverba directory.

Make sure to only set environment variables you intend to use; environment variables with missing or incorrect values may lead to errors.

Below is a comprehensive list of the API keys and variables you may require:

| Environment Variable | Value | Description |
| --- | --- | --- |
| WEAVIATE_URL_VERBA | URL to your hosted Weaviate Cluster | Connect to your WCS Cluster |
| WEAVIATE_API_KEY_VERBA | API Credentials to your hosted Weaviate Cluster | Connect to your WCS Cluster |
| ANTHROPIC_API_KEY | Your Anthropic API Key | Get Access to Anthropic Models |
| OPENAI_API_KEY | Your OpenAI Key | Get Access to OpenAI Models |
| OPENAI_BASE_URL | URL to an OpenAI-compatible instance | Use proxies such as LiteLLM |
| COHERE_API_KEY | Your API Key | Get Access to Cohere Models |
| OLLAMA_URL | URL to your Ollama instance (e.g. http://localhost:11434) | Get Access to Ollama Models |
| UNSTRUCTURED_API_KEY | Your API Key | Get Access to Unstructured Data Ingestion |
| UNSTRUCTURED_API_URL | URL to Unstructured Instance | Get Access to Unstructured Data Ingestion |
| ASSEMBLYAI_API_KEY | Your API Key | Get Access to AssemblyAI Data Ingestion |
| GITHUB_TOKEN | Your GitHub Token | Get Access to Data Ingestion via GitHub |
| GITLAB_TOKEN | Your GitLab Token | Get Access to Data Ingestion via GitLab |
| FIRECRAWL_API_KEY | Your Firecrawl API Key | Get Access to Data Ingestion via Firecrawl |
| VOYAGE_API_KEY | Your VoyageAI API Key | Get Access to Embedding Models via VoyageAI |
| EMBEDDING_SERVICE_URL | URL to your Embedding Service Instance | Get Access to Embedding Models via Weaviate Embedding Service |
| EMBEDDING_SERVICE_KEY | Your Embedding Service Key | Get Access to Embedding Models via Weaviate Embedding Service |
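
For example, a minimal .env for a Weaviate Cloud plus OpenAI setup might look like this (all values are placeholders, not real credentials):

```
WEAVIATE_URL_VERBA=https://your-cluster.weaviate.network
WEAVIATE_API_KEY_VERBA=your-weaviate-api-key
OPENAI_API_KEY=your-openai-api-key
OLLAMA_URL=http://localhost:11434
```

Only include the lines for providers you actually use.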

API Keys in Verba

Weaviate

Verba provides flexibility in connecting to Weaviate instances based on your needs. You have three options:

  1. Local Deployment: Use Weaviate Embedded, which runs locally on your device (not supported on Windows; choose the Docker or Cloud Deployment instead)
  2. Docker Deployment: Choose this option when you're running Verba's Dockerfile.
  3. Cloud Deployment: Use an existing Weaviate instance hosted on WCD to run Verba

๐ŸŒฉ๏ธ Weaviate Cloud Deployment (WCD)

If you prefer a cloud-based solution, Weaviate Cloud (WCD) offers a scalable, managed environment. Learn how to set up a cloud cluster and get the API keys by following the Weaviate Cluster Setup Guide.

๐Ÿณ Docker Deployment Another local alternative is deploying Weaviate using Docker. For more details, follow the How to install Verba with Docker section.

Deployment in Verba

Ollama

Verba supports Ollama models. Download and Install Ollama on your device (https://ollama.com/download). Make sure to install your preferred LLM using ollama run <model>.

Tested with llama3, llama3:70b and mistral. The bigger models generally perform better, but need more computational power.

Make sure the Ollama server runs in the background, and don't ingest documents with different Ollama models: their vector dimensions can vary, which will lead to errors.

You can verify that Ollama is running with the following command:

ollama run llama3

Unstructured

Verba supports importing documents through Unstructured IO (e.g. plain text, .pdf, .csv, and more). To use it, you need the UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL environment variables. You can get them from Unstructured.

UNSTRUCTURED_API_URL is set to https://api.unstructured.io/general/v0/general by default

AssemblyAI

Verba supports importing documents through AssemblyAI (audio files or audio from video files). To use it, you need the ASSEMBLYAI_API_KEY environment variable. You can get it from AssemblyAI.

OpenAI

Verba supports OpenAI Models such as Ada, GPT3, and GPT4. To use them, you need to specify the OPENAI_API_KEY environment variable. You can get it from OpenAI.

You can also set an OPENAI_BASE_URL to use proxies such as LiteLLM (https://github.com/BerriAI/litellm):

OPENAI_BASE_URL=YOUR-OPENAI_BASE_URL

HuggingFace

If you want to use the HuggingFace features, make sure to install the correct Verba package. It will install the packages required to use the local embedding models. Please note that on startup, Verba will automatically download and install embedding models when they are used.

pip install goldenverba[huggingface]

or

pip install '.[huggingface]'

If you're using Docker, modify the Dockerfile accordingly

How to deploy with pip

Python >=3.10.0

  1. (Very Important) Initialize a new Python Environment
python3 -m virtualenv venv
  2. Install Verba
pip install goldenverba
  3. Launch Verba
verba start

You can specify the --port and --host via flags

  4. Access Verba
Visit localhost:8000
  5. (Optional) Create a .env file and add environment variables

How to build from Source

  1. Clone the Verba repo
git clone https://github.com/weaviate/Verba.git
  2. Initialize a new Python Environment
python3 -m virtualenv venv
  3. Install Verba
pip install -e .
  4. Launch Verba
verba start

You can specify the --port and --host via flags

  5. Access Verba
Visit localhost:8000
  6. (Optional) Create a .env file and add environment variables

How to install Verba with Docker

Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. To get started with deploying Verba using Docker, follow the steps below. If you need more detailed instructions on Docker usage, check out the Docker Curriculum.

  1. Clone the Verba repo
Ensure you have Git installed on your system. Then, open a terminal or command prompt and run the following command to clone the Verba repository:
git clone https://github.com/weaviate/Verba.git
  2. Set necessary environment variables
Make sure to set your required environment variables in the .env file. You can read more about how to set them up in the API Keys Section.
  3. Adjust the docker-compose file
You can use the docker-compose.yml to add required environment variables under the verba service, and you can also adjust the Weaviate Docker settings to enable Authentication or change other settings of your database instance. You can read more about the Weaviate configuration in our docker-compose documentation.

Please make sure to only add environment variables that you really need.

  4. Deploy using Docker
With Docker installed and the Verba repository cloned, navigate to the directory containing the Docker Compose file in your terminal or command prompt. Run one of the following commands to start the Verba application in detached mode, which allows it to run in the background:
docker compose up -d
or, to pass an explicit env file and force a rebuild:
docker compose --env-file goldenverba/.env up -d --build

This command will download the necessary Docker images, create containers, and start Verba. Remember, Docker must be installed on your system to use this method. For installation instructions and more details about Docker, visit the official Docker documentation.

  5. Access Verba
  • You can access your local Weaviate instance at localhost:8080

  • You can access the Verba frontend at localhost:8000

If you want your Docker Instance to install a specific version of Verba you can edit the Dockerfile and change the installation line.

RUN pip install -e '.'

Verba Walkthrough

Import Your Data

The first thing you need to do is add your data. You can do this by clicking on Import Data and selecting the Add Files, Add Directory, or Add URL tab. Here you can add all the files you want to ingest. You can then configure each file individually by selecting it and clicking on the Overview or Configure tab. Demo of Verba

Query Your Data

With your data imported, you can use the Chat page to ask any related questions. You will receive relevant chunks that are semantically similar to your question and an answer generated by your chosen model. You can configure the RAG pipeline under the Config tab.

Demo of Verba

Open Source Contribution

Your contributions are always welcome! Feel free to contribute ideas, feedback, or create issues and bug reports if you find any! Before contributing, please read the Contribution Guide. Visit our Weaviate Community Forum if you need any help!

Project Architecture

You can learn more about Verba's architecture and implementation in its technical documentation and frontend documentation. It's recommended to have a look at them before making any contributions.

Known Issues

  • Weaviate Embedded does not currently work on Windows
    • This will be fixed in future versions; until then, please use the Docker or WCS Deployment

FAQ

  • Is Verba Multi-Lingual?

    • This depends on whether your chosen Embedding and Generation Models support multilingual data.
  • Can I use my Ollama Server with the Verba Docker?

    • Yes, you can! Make sure the URL is set to: OLLAMA_URL=http://host.docker.internal:11434
    • If you're running on Linux, you might need to get the IP Gateway of the Ollama server: OLLAMA_URL="http://YOUR-IP-OF-OLLAMA:11434"
  • How to clear Weaviate Embedded Storage?

    • You'll find the stored data here: ~/.local/share/weaviate
  • How can I specify the port?

    • You can use the port and host flags: verba start --port 9000 --host 0.0.0.0

verba's People

Contributors

ajit283, badhansen, cam-barts, cyborgmarina, dannyjameswilliams, dudanogueira, eltociear, erika-cardenas, grski, hholtmann, isikhi, janeggers-hr, jcarme, kjeldahl, lotzf, nasonz, recursionbane, rugveddarwhekar, samos123, taigrr, thomashacker, timsnow, tomaarsen, vankeer


verba's Issues

Customize name + colours

Great job with Verba!
I was really missing a tool like this to quickly prototype demos for customers.

Would be great if we could document where to rename which values in order to replace:

  • "Verba" with "Customer X"
  • Verba logo with my customer's logo
  • Update CSS values with my customer's colours.

If you point me to the location, I can create a PR with some documentation.

Chunks are not imported

image
It said: Imported all chunks
In fact, after checking, I see that the chunks were not imported

Docker doc enhancements

Please consider modifying the Docker instructions a bit

You currently have the following steps

  1. Clone the repo
  2. Deploy using Docker

Suggest that the following might be more informative, as currently written it misses required steps.
Although these steps are documented above, it's not possible to get a working install just from the current Docker section.

  1. Clone the repo using git clone https://github.com/weaviate/Verba.git
  2. Navigate into the new Verba directory
  3. Add a .env file in the root of your Verba folder or add your tokens to the compose.yaml file. At a minimum, you must include x,y and z, but a,b and c are optional depending on your chosen feature set. Adding tokens to .env or compose.yaml file is your choice, but a .env may have some security advantages and allows you to run the compose.yaml file with less modifications, allowing easier upgrades in future. (Maybe these can be configured via the GUI? Is that why they aren't mentioned in the Docker section?)
  4. Run docker compose up -d and your project should build and run. You can test readiness by using your web browser to navigate to http://localhost:8080 or http://DockerHostIP:8080, where you will see API endpoint details
  5. You can then access the web interface at http://localhost:8000 or http://DockerHostIP:8000

Some of this is probably wrong, as my own install isn't yet working.

also under 'Status Page' you say
'Once configured, you can monitor your Verba installation's health and status via the 'Status Verba' page'
but you don't indicate how to reach this page... would be helpful to include that path...

FR: incremental data ingest

It'd be great to allow incremental data ingestion to enhance efficiency and reduce API call costs, i.e., only ingest new or modified files with verba import --path "Path to your dir or file".

Add proper linting

The code lacks support for modern linting tools, which would enable the code to be more standardised and clean.
Suggested solution: ruff & pre-commit.

Display line breaks in chat

Hi,

Suggestion for a small but nice frontend improvement: add whitespace-pre-line as an HTML class in this line:

className={`inline-block p-3 rounded-xl animate-press-in shadow-md font-mono text-sm ${message.type === "user" ? "bg-yellow-200" : "bg-white"

As such, messages in the chat will be displayed with proper line breaks.

With:
image

Without:
image

Embedding Failure With Sufficiently Long Documents

When testing this locally, I noticed that certain documents I was trying to upload continued to fail, and I was able to correlate the issue to the document's size.

Here's what I think is happening:
In this snippet, we define the schema for documents, but there is no explicit skip for the text property. Therefore, if the text is longer than the embedding context allows, it will fail.

"class": "Document",
"description": "Documentation",
"properties": [
    {
        "name": "text",
        "dataType": ["text"],
        "description": "Content of the document",
    },

This is the call that fails:

uuid = client.batch.add_data_object(properties, class_name)

Because this call fails, it can't proceed to chunking, as the next step is setting this generated uuid on each of the chunks.

Locally, I was able to work around this in a couple of different ways, but I don't know how each might affect the rest of the app:

  • Add a skip on the property in the schema definition
  • Truncate the content of document.text here
    "text": str(document.text),
  • (Larger commitment) Run the chunks through something like a map-reduce summary chain in LangChain so the content is just a summary. This would require more time to ingest the document, which would grow with the size of the ingested document.
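
The truncation workaround can be sketched as follows. The 8,000-token cap, the whitespace tokenizer, and the doc_name field are illustrative assumptions, not Verba's actual limits or schema:

```python
# Hypothetical sketch of the truncation workaround: cap the document text
# stored in the "text" property so the object stays within the embedder's
# context limit. The cap and whitespace tokenization are assumptions.
def truncate_text(text: str, max_tokens: int = 8000) -> str:
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])

# Build the properties dict with truncated text before the batch insert.
properties = {
    "text": truncate_text("word " * 20000),
    "doc_name": "example.pdf",  # hypothetical metadata field
}
```

The trade-off is that text beyond the cap never reaches the Document object, though the chunks themselves would still be embedded individually.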

Any plan to implement custom API calls

Hey I'd love to test your repository with my custom model hosted with a custom API (ex with oobabooga).

Do you plan on implementing calls to custom API or do you recommend to do it myself?

Error when importing larger PDF files

When trying to import larger PDF files (tested with upwards of 20 pages) using ADAEmbedder I'm seeing the following error in the front end and in the console.

However the embeddings somehow seem to be generated since asking questions for that context works. But the uploaded document isn't showing in the frontend under the documents section.

frontend console

requests.exceptions.ReadTimeout: The 'objects' creation was cancelled because it took longer than the configured timeout of 60s. Try reducing the batch size (currently 1550) to a lower value. Aim to on average complete batch request within less than 10s

ℹ (1549/6283) Importing chunk of /home/huy/sutta3/mn122.txt (7)
ℹ (1550/6283) Importing chunk of /home/huy/sutta3/mn122.txt (8)
[ERROR] Batch ReadTimeout Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch ReadTimeout Exception occurred! Retrying in 4s. [2/3]
[ERROR] Batch ReadTimeout Exception occurred! Retrying in 6s. [3/3]
Traceback (most recent call last):
File "/home/huy/miniconda3/envs/verba/bin/verba", line 33, in <module>
sys.exit(load_entry_point('goldenverba', 'console_scripts', 'verba')())
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/huy/Verba/goldenverba/ingestion/cli.py", line 68, in import_data_command
import_data(path, model)
File "/home/huy/Verba/goldenverba/ingestion/import_data.py", line 69, in import_data
import_chunks(client=client, chunks=chunks, doc_uuid_map=uuid_map)
File "/home/huy/Verba/goldenverba/ingestion/util.py", line 123, in import_chunks
with client.batch as batch:
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1646, in __exit__
self.flush()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1252, in flush
self._send_batch_requests(force_wait=True)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
response_objects, nr_objects = done_future.result()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/huy/Verba/goldenverba/ingestion/util.py", line 141, in import_chunks
client.batch.add_data_object(properties, "Chunk")
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 569, in add_data_object
self._auto_create()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1242, in _auto_create
self._send_batch_requests(force_wait=False)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
response_objects, nr_objects = done_future.result()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/home/huy/miniconda3/envs/verba/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/huy/miniconda3/envs/verba/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 1099, in _flush_in_thread
response = self._create_data(
File "/home/huy/miniconda3/envs/verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 750, in _create_data
raise ReadTimeout(message) from None
requests.exceptions.ReadTimeout: The 'objects' creation was cancelled because it took longer than the configured timeout of 60s. Try reducing the batch size (currently 1550) to a lower value. Aim to on average complete batch request within less than 10s

Import stopped at chunk number 1550. How can I fix this?
Thanks for the nice app :D

Azure OpenAI support

It seems Azure OpenAI is supported by Weaviate but not by Verba so far; it would be great to implement it. Thanks!

Cohere API Key: no api key found neither in request header:

Hi, I'm starting to try Verba. I cloned the project, modified the .env file, and installed it with pip install, but it seems there is something I have not done: it is not picking up the keys from the .env file when I use verba start. Can you help me?

Regards.

load_data API throw "NoneType' object has no attribute 'tokenize'"

http://0.0.0.0:8000/api/load_data

Request payload:
{
"reader": "SimpleReader",
"chunker": "TokenChunker",
"embedder": "MiniLMEmbedder",
"fileBytes": [
"T3JnYW5pemF0aW9uIE5hbWU6IFJhaGFzYWsgUmVzZWFyY2ggUGFwZXJzCgpEYXRlOiBOb3ZlbWJlciA2LCAyMDIzCgpSZXNlYXJjaCBUb3BpYzogRXhwbG9yaW5nIFN1c3RhaW5hYmxlIEVuZXJneSBTb2x1dGlvbnMgZm9yIFVyYmFuIEVudmlyb25tZW50cwoKVGVzdCBEYXRhIFNldDogU2NpZW50aWZpYyBSZXNlYXJjaCBQYXBlcgoKVGl0bGU6IFRvd2FyZHMgQ2FyYm9uIE5ldXRyYWxpdHk6IEludGVncmF0aW5nIFJlbmV3YWJsZSBFbmVyZ3kgU291cmNlcyBpbiBVcmJhbiBJbmZyYXN0cnVjdHVyZQoKQXV0aG9yczogRHIuIEVtaWx5IEMuIFBhcmtlciwgRHIuIE1pY2hhZWwgSi4gQWRhbXMsIERyLiBTb3BoaWEgTC4gQ2hlbgoKQWJzdHJhY3Q6IFRoaXMgY29tcHJlaGVuc2l2ZSBzY2llbnRpZmljIHJlc2VhcmNoIHBhcGVyIGludmVzdGlnYXRlcyB0aGUgaW50ZWdyYXRpb24gb2YgcmVuZXdhYmxlIGVuZXJneSBzb3VyY2VzIGludG8gdXJiYW4gaW5mcmFzdHJ1Y3R1cmUgdG8gYWNoaWV2ZSBjYXJib24gbmV1dHJhbGl0eS4gVGhlIHN0dWR5IGFuYWx5emVzIHZhcmlvdXMgc3VzdGFpbmFibGUgZW5lcmd5IHNvbHV0aW9ucywgaW5jbHVkaW5nIHNvbGFyIHBvd2VyLCB3aW5kIGVuZXJneSwgYW5kIGh5ZHJvZWxlY3RyaWNpdHksIGFuZCBleHBsb3JlcyB0aGVpciBpbXBsZW1lbnRhdGlvbiBpbiBkZW5zZWx5IHBvcHVsYXRlZCB1cmJhbiBlbnZpcm9ubWVudHMuIFRoZSBwYXBlciBkaXNjdXNzZXMgdGhlIGNoYWxsZW5nZXMsIGJlbmVmaXRzLCBhbmQgcG90ZW50aWFsIGltcGFjdCBvbiByZWR1Y2luZyBjYXJib24gZW1pc3Npb25zIGluIGNpdGllcy4KClB1Ymxpc2hlZCBEYXRlOiBPY3RvYmVyIDI1LCAyMDIzCgpTdW1tYXJ5OgpJbiB0aGlzIGdyb3VuZGJyZWFraW5nIHJlc2VhcmNoIHBhcGVyLCB0aGUgdGVhbSBvZiBleHBlcnRzIGF0IFJhaGFzYWsgUmVzZWFyY2ggUGFwZXJzLCBsZWQgYnkgRHIuIEVtaWx5IEMuIFBhcmtlciwgZGVsdmVzIGRlZXAgaW50byB0aGUgcmVhbG0gb2Ygc3VzdGFpbmFibGUgZW5lcmd5IHNvbHV0aW9ucyBmb3IgdXJiYW4gZW52aXJvbm1lbnRzLiBUaGUgcGFwZXIgcHJvdmlkZXMgYW4gaW4tZGVwdGggYW5hbHlzaXMgb2YgdGhlIGNoYWxsZW5nZXMgZmFjZWQgYnkgbW9kZXJuIGNpdGllcyBpbiBtaXRpZ2F0aW5nIHRoZSBlZmZlY3RzIG9mIGNsaW1hdGUgY2hhbmdlLiBCeSBmb2N1c2luZyBvbiByZW5ld2FibGUgZW5lcmd5IHNvdXJjZXMgc3VjaCBhcyBzb2xhciwgd2luZCwgYW5kIGh5ZHJvZWxlY3RyaWMgcG93ZXIsIHRoZSBhdXRob3JzIHByZXNlbnQgaW5ub3ZhdGl2ZSBzdHJhdGVnaWVzIHRvIHRyYW5zZm9ybSB1cmJhbiBsYW5kc2NhcGVzIGludG8gZW52aXJvbm1lbnRhbGx5IGZyaWVuZGx5IGh1YnMuCgpUaGUgcmVzZWFyY2ggcGFwZXIgbWV0aWN1bG91c2x5IGV4YW1pbmVzIGNhc2Ugc3R1ZGllcyBmcm9tIGVjby1jb25zY2lvdXMgY2l0aWVzIHdvcmxkd2lkZSwgc2hvd2Nhc2luZyBzdWNjZXNzZnVsIGl
tcGxlbWVudGF0aW9ucyBvZiByZW5ld2FibGUgZW5lcmd5IHRlY2hub2xvZ2llcy4gSXQgZGlzY3Vzc2VzIHRoZSBlY29ub21pYyB2aWFiaWxpdHksIGVudmlyb25tZW50YWwgaW1wYWN0LCBhbmQgc29jaWV0YWwgYmVuZWZpdHMgb2YgdHJhbnNpdGlvbmluZyBmcm9tIGZvc3NpbCBmdWVscyB0byByZW5ld2FibGUgc291cmNlcy4gVGhlIGF1dGhvcnMgZW1waGFzaXplIHRoZSBpbXBvcnRhbmNlIG9mIHBvbGljeSBpbnRlcnZlbnRpb25zLCB0ZWNobm9sb2dpY2FsIGFkdmFuY2VtZW50cywgYW5kIGNvbW11bml0eSBlbmdhZ2VtZW50IGluIGZvc3RlcmluZyBhIHN1c3RhaW5hYmxlIGVuZXJneSB0cmFuc2l0aW9uLgoKQWRkaXRpb25hbGx5LCB0aGUgcGFwZXIgZXhwbG9yZXMgdGhlIGludGVncmF0aW9uIG9mIHNtYXJ0IGdyaWQgc3lzdGVtcywgZW5lcmd5IHN0b3JhZ2Ugc29sdXRpb25zLCBhbmQgYWR2YW5jZWQgbW9uaXRvcmluZyB0ZWNobmlxdWVzIHRvIG9wdGltaXplIHRoZSB1dGlsaXphdGlvbiBvZiByZW5ld2FibGUgZW5lcmd5IGluIHVyYmFuIGFyZWFzLiBCeSBhZGRyZXNzaW5nIGJvdGggdGVjaG5pY2FsIGFuZCBzb2NpZXRhbCBhc3BlY3RzLCB0aGUgcmVzZWFyY2ggcGFwZXIgb2ZmZXJzIGEgaG9saXN0aWMgYXBwcm9hY2ggdG8gY3JlYXRpbmcgZW52aXJvbm1lbnRhbGx5IHN1c3RhaW5hYmxlIGNpdGllcy4KCkZ1dHVyZSBJbXBsaWNhdGlvbnM6ClRoaXMgcmVzZWFyY2ggcGFwZXIgc2VydmVzIGFzIGEgZm91bmRhdGlvbmFsIHJlc291cmNlIGZvciBwb2xpY3ltYWtlcnMsIHVyYmFuIHBsYW5uZXJzLCBhbmQgZW52aXJvbm1lbnRhbGlzdHMgc3RyaXZpbmcgdG8gY3JlYXRlIGdyZWVuZXIgY2l0aWVzLiBUaGUgZmluZGluZ3MgYW5kIHJlY29tbWVuZGF0aW9ucyBwcm92aWRlIHZhbHVhYmxlIGluc2lnaHRzIGludG8gdGhlIHByYWN0aWNhbCBpbXBsZW1lbnRhdGlvbiBvZiByZW5ld2FibGUgZW5lcmd5IHNvbHV0aW9ucywgcGF2aW5nIHRoZSB3YXkgZm9yIGNhcmJvbi1uZXV0cmFsIHVyYmFuIGVudmlyb25tZW50cy4gVGhlIHJlc2VhcmNoIGNvbnRyaWJ1dGVzIHNpZ25pZmljYW50bHkgdG8gdGhlIGdsb2JhbCBkaXNjb3Vyc2Ugb24gc3VzdGFpbmFibGUgZGV2ZWxvcG1lbnQsIG9mZmVyaW5nIHRhbmdpYmxlIHNvbHV0aW9ucyB0byBjb21iYXQgY2xpbWF0ZSBjaGFuZ2UgYW5kIGVuc3VyZSBhIG1vcmUgc3VzdGFpbmFibGUgZnV0dXJlIGZvciBnZW5lcmF0aW9ucyB0byBjb21l"
],
"fileNames": [
"data.txt"
],
"filePath": "",
"document_type": "Documentation",
"chunkUnits": 250,
"chunkOverlap": 50
}

Response:
{
"status": "400",
"status_msg": "'NoneType' object has no attribute 'tokenize'"
}

Add a non-api way to load PDFs with pypdf

Hello !

It would be nice if we could load PDFs (which is for me the main reason to use the app) without the dependency on the Unstructured API.

Why not simply use something like pypdf to load PDFs?

Docker deployment: TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Thanks for this project!

I'm seeing an error on CentOS 7.6 with Docker Compose version v2.23.2 and Docker Client 19.03.13 with Docker Engine 24.0.7

Startup seems normal:

docker-compose up -d
WARN[0000] The "OPENAI_API_KEY" variable is not set. Defaulting to a blank string.
WARN[0000] The "OPENAI_API_KEY" variable is not set. Defaulting to a blank string.
[+] Running 2/2
 ✔ Container verba-weaviate-1  Healthy    0.0s
 ✔ Container verba-verba-1     Started

But the verba-verba-1 container immediately fails:

docker logs verba-verba-1
/usr/local/lib/python3.9/site-packages/PyPDF2/__init__.py:21: DeprecationWarning: PyPDF2 is deprecated. Please move to the pypdf library instead.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/bin/verba", line 5, in <module>
    from goldenverba.server.cli import cli
  File "/Verba/goldenverba/server/cli.py", line 6, in <module>
    from goldenverba.verba_manager import VerbaManager
  File "/Verba/goldenverba/verba_manager.py", line 29, in <module>
    class VerbaManager:
  File "/Verba/goldenverba/verba_manager.py", line 156, in VerbaManager
    def setup_client(self) -> Client | None:
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Any ideas to get around this?
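The `Client | None` annotation uses PEP 604 union syntax, which only exists in Python 3.10+; because the annotation is evaluated when the class body is defined, the import fails on the Python 3.9 Docker image. A minimal sketch of a 3.9-compatible workaround (the `Client` class here is a placeholder for `weaviate.Client`; the real method lives in `goldenverba/verba_manager.py`):

```python
# Sketch of a Python 3.9-compatible annotation; `Client` is a stand-in
# for weaviate.Client so this snippet is self-contained.
from typing import Optional


class Client:  # placeholder
    pass


class VerbaManager:
    # Original (fails at import time on Python 3.9):
    #     def setup_client(self) -> Client | None:
    def setup_client(self) -> Optional[Client]:
        # The real method would construct and return a weaviate.Client.
        return None
```

Adding `from __future__ import annotations` at the top of the module, or moving the image to Python 3.10+, would also work.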

Llama model support

Instead of using OpenAI, it would be cool if we could use Llama 2 instead.

If I made the PR and it works well, would you be open to merging it into this repo?

Add poetry support

Currently, Poetry is the go-to standard for managing dependencies, builds, and virtualenvs. I think it'd be useful to incorporate it.

'NoneType' object has no attribute 'tokenize'

I'm using Cohere and unstructured, and I'm receiving that error when trying to load a PDF. It works fine with the simple reader, but not with the PDF-specific options.

This is the log:

โ„น Received Data to Import: READER(PDFReader, Documents 1, Type
Documentation) CHUNKER (TokenChunker, UNITS 250, OVERLAP 50), EMBEDDER
(MiniLMEmbedder)
โœ” Loaded ai-03-00057.pdf
โœ” Loaded 1 documents
Chunking documents: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [00:00<00:00, 37.20it/s]
โœ” Chunking completed
Vectorizing document chunks: 0%| | 0/1 [00:00<?, ?it/s]
โœ˜ Loading data failed 'NoneType' object has no attribute 'tokenize'

Regards.

Loading data failed Expected all tensors to be on the same device,

Hey, I'm using Verba with OpenAI.

I'm trying to import a txt file with Simple Reader and MiniLMEmbedding.

Receiving the following error:

โœ˜ Loading data failed Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu! (when checking argument for
argument index in method wrapper_CUDA__index_select)

It seems to be a PyTorch error related to moving the model to a device.

Here's what's installed after pip install verba:
Successfully installed MarkupSafe-2.1.3 accelerate-0.24.1 aiohttp-3.8.6 aiosignal-1.3.1 annotated-types-0.6.0 anyio-4.0.0 async-timeout-4.0.3 asyncio-3.4.3 attrs-23.1.0 authlib-1.2.1 backoff-2.2.1 blis-0.7.11 catalogue-2.0.10 certifi-2023.7.22 cffi-1.16.0 charset-normalizer-3.3.2 click-8.1.7 cohere-4.33 confection-0.1.3 cryptography-41.0.5 cymem-2.0.8 fastapi-0.102.0 fastavro-1.8.2 filelock-3.13.1 frozenlist-1.4.0 fsspec-2023.10.0 goldenverba-0.3.0 h11-0.14.0 httptools-0.6.1 huggingface-hub-0.19.4 idna-3.4 importlib_metadata-6.8.0 jinja2-3.1.2 joblib-1.3.2 langcodes-3.3.0 mpmath-1.3.0 multidict-6.0.4 murmurhash-1.0.10 networkx-3.2.1 nltk-3.8.1 numpy-1.26.2 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 openai-0.27.9 packaging-23.2 pathy-0.10.3 pillow-10.1.0 preshed-3.0.9 psutil-5.9.6 pycparser-2.21 pydantic-2.5.1 pydantic-core-2.14.3 python-dotenv-1.0.0 pyyaml-6.0.1 regex-2023.10.3 requests-2.31.0 safetensors-0.4.0 scikit-learn-1.3.2 scipy-1.11.3 sentence-transformers-2.2.2 sentencepiece-0.1.99 smart-open-6.4.0 sniffio-1.3.0 spacy-3.6.1 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.4.8 starlette-0.27.0 sympy-1.12 thinc-8.1.12 threadpoolctl-3.2.0 tiktoken-0.5.1 tokenizers-0.15.0 torch-2.1.1 torchvision-0.16.1 tqdm-4.66.1 transformers-4.35.2 triton-2.1.0 typer-0.9.0 typing-extensions-4.8.0 urllib3-2.1.0 uvicorn-0.24.0.post1 uvloop-0.19.0 validators-0.21.0 wasabi-1.1.2 watchfiles-0.21.0 weaviate-client-3.23.1 websockets-12.0 yarl-1.9.2 zipp-3.17.0

Query failed when building main branch from source

โœ˜ Query failed
Query was not successful! Unexpected status code: 500, with response body: {'code': 500, 'message': 'no consumer registered for application/json'}.

verba import --path ./data also returns:

File "C:\Python312\Lib\site-packages\weaviate\schema\crud_schema.py", line 816, in _create_class_with_primitives
    raise UnexpectedStatusCodeException("Create class", response)
weaviate.exceptions.UnexpectedStatusCodeException: Create class! Unexpected status code: 500, with response body: {'code': 500, 'message': 'no consumer registered for application/json'}.

Error code is too vague - Something went wrong! object of type 'NoneType' has no len()|

verba import --path data/minecraft/ --model "gpt-3.5-turbo" --clear True
โœ” Client connected to local Weaviate server
โ„น All schemas available
โ„น Reading data/minecraft/minecraft_wiki.txt
โ„น Reading data/minecraft/minecraft_guide.txt
โœ” Loaded 2 files
โ„น Converted data/minecraft/minecraft_wiki.txt
โ„น Converted data/minecraft/minecraft_guide.txt
โœ” All 2 files successfully loaded

in the FE:
Q: how to play minecraft?
A: Something went wrong! object of type 'NoneType' has no len()|

Error: Missing Embedder and Options Not Displayed in 'Add Document'

I have deployed Verba via Docker, and I can successfully access the frontend side. However, I've encountered an issue where no options are displayed in the "Add document" section. Upon inspecting the Docker logs, I've come across the following error message:
File "/Verba/goldenverba/server/api.py", line 300, in get_components
    config_manager.get_embedder(), embedders[config_manager.get_embedder()]
KeyError: ''

Additionally, I've created my own .env file and have added only the Hugging Face API token.

Could you please help me understand why this error is occurring right after the initialization? Is there any crucial installation step that I might have missed?
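The `KeyError: ''` suggests the saved config holds an empty embedder name that isn't a key of the `embedders` dict. A defensive lookup with a fallback would avoid crashing at startup; everything here (function name, default embedder) is illustrative, not Verba's actual API:

```python
def get_embedder_config(embedders, configured_name, default_name="MiniLMEmbedder"):
    """Return (name, embedder) for the configured embedder, falling back
    to a default when the stored name is empty or unknown."""
    if configured_name in embedders:
        return configured_name, embedders[configured_name]
    # Stored config is empty or stale (e.g. '') -- fall back instead of KeyError.
    if default_name in embedders:
        return default_name, embedders[default_name]
    raise ValueError(f"No usable embedder; got {configured_name!r}")
```

As a workaround in the meantime, deleting the stale config so Verba re-initializes it may get past the error.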

Verba Start fails

$ verba start
INFO:     Will watch for changes in these directories: ['/Users/brianjking']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [28438] using WatchFiles
โ„น Setting up client
โ„น VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 28442
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-28T10:19:02-07:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-28T10:19:02-07:00"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-28T10:19:02-07:00"}
{"action":"grpc_startup","error":"listen tcp :50051: bind: address already in use","level":"fatal","msg":"failed to start grpc server","time":"2023-09-28T10:19:02-07:00"}
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/lib/python3.11/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/opt/homebrew/lib/python3.11/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/homebrew/lib/python3.11/site-packages/uvicorn/server.py", line 68, in serve
    config.load()
  File "/opt/homebrew/lib/python3.11/site-packages/uvicorn/config.py", line 473, in load
    self.loaded_app = import_from_string(self.app)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/homebrew/lib/python3.11/site-packages/goldenverba/server/api.py", line 22, in <module>
    verba_engine = AdvancedVerbaQueryEngine()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/goldenverba/retrieval/interface.py", line 15, in __init__
    VerbaQueryEngine.client = setup_client()
                              ^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/goldenverba/ingestion/util.py", line 41, in setup_client
    client = weaviate.Client(
             ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/weaviate/client.py", line 147, in __init__
    url, embedded_db = self.__parse_url_and_embedded_db(url, embedded_options)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/weaviate/client.py", line 288, in __parse_url_and_embedded_db
    embedded_db.start()
  File "/opt/homebrew/lib/python3.11/site-packages/weaviate/embedded.py", line 229, in start
    self.wait_till_listening()
  File "/opt/homebrew/lib/python3.11/site-packages/weaviate/embedded.py", line 174, in wait_till_listening
    raise WeaviateStartUpError(
weaviate.exceptions.WeaviateStartUpError: Embedded DB did not start listening on port 6666 within 30 seconds
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
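The fatal `listen tcp :50051: bind: address already in use` line means another process (often a stale `weaviate-embedded` left over from a previous run) is already holding the gRPC port, so the embedded DB never starts listening and the 30-second wait on port 6666 times out. A small stdlib check (a hypothetical helper, not part of Verba) to confirm the diagnosis before starting:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; a failed bind means something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

If this returns `False` for 50051 or 6666, find and kill the stale process (e.g. with `lsof -i :50051`) and run `verba start` again.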

websocket error

Same error in both main and dev branches

Python 3.10.12 on Fedora Linux 38

I tried re-installing websocket and websocket-client, but that didn't resolve it.

Note, this is working on my Ubuntu machine with Python 3.11.4, so I'm not sure whether that is relevant. Both are above the recommended 3.9 level.

I appreciate that the fact it works on one machine and not the other points to the issue being on my end. However, I wanted to post the issue in case someone could help me with it, or it was of broader interest.

โœ” Received query: what is ocean infohub
โ„น Retrieved Context of 1158 tokens
โœ” Succesfully processed query: what is ocean infohub
INFO:     192.168.202.109:42404 - "POST /api/query HTTP/1.1" 200 OK
โœ˜ WebSocket Error: type object 'GeneratePayload' has no attribute
'model_validate_json'
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/fils/src/git/Verba/goldenverba/server/api.py", line 581, in websocket_generate_stream
    payload = GeneratePayload.model_validate_json(data)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'GeneratePayload' has no attribute 'model_validate_json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fils/.local/lib/python3.11/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 254, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fils/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fils/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 149, in __call__
    await self.app(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 75, in __call__
    await self.app(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/fils/.local/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/fils/.local/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/routing.py", line 341, in handle
    await self.app(scope, receive, send)
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/routing.py", line 82, in app
    await func(session)
  File "/home/fils/.local/lib/python3.11/site-packages/fastapi/routing.py", line 325, in app
    await dependant.call(**values)
  File "/home/fils/src/git/Verba/goldenverba/server/api.py", line 598, in websocket_generate_stream
    await websocket.send_json(
  File "/home/fils/.local/lib/python3.11/site-packages/starlette/websockets.py", line 171, in send_json
    text = json.dumps(data, separators=(",", ":"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type AttributeError is not JSON serializable
ERROR:    closing handshake failed
Traceback (most recent call last):
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/server.py", line 248, in handler
    await self.close()
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 766, in close
    await self.write_close_frame(Close(code, reason))
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1232, in write_close_frame
    await self.write_frame(True, OP_CLOSE, data, _state=State.CLOSING)
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1205, in write_frame
    await self.drain()
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1194, in drain
    await self.ensure_open()
  File "/home/fils/.local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 935, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received
INFO:     connection closed
INFO:     192.168.202.109:51200 - "GET /api/health HTTP/1.1" 200 OK
โ„น Document ID received: 8a66ba1a-5b25-4c74-83f0-d30c7b7affb0
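`model_validate_json` is a Pydantic v2 method; on Pydantic v1 the equivalent is `parse_raw`, which fits both the AttributeError and the fact that the same code works on one machine and not the other. A version-agnostic sketch of the parsing step (the plain-class fallback exists only so this snippet runs without Pydantic installed):

```python
import json


def parse_json_payload(model_cls, data):
    """Parse a JSON string into `model_cls` on both Pydantic v2 and v1."""
    if hasattr(model_cls, "model_validate_json"):  # Pydantic v2
        return model_cls.model_validate_json(data)
    if hasattr(model_cls, "parse_raw"):            # Pydantic v1
        return model_cls.parse_raw(data)
    return model_cls(**json.loads(data))           # plain-class fallback


# Used as: payload = parse_json_payload(GeneratePayload, data)
# instead of GeneratePayload.model_validate_json(data)
```

Pinning `pydantic>=2` in the project requirements would be the simpler fix.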

Raise an exception when loading a document fails in the verba_manager

Currently, when I call manager.import_data(...), I can get the following issue:

โœ˜ {'errors': {'error': [{'message': 'update vector: unmarshal response
body: json: cannot unmarshal number into Go struct field
openAIApiError.error.code of type string'}]}, 'status': 'FAILED'}

It seems to be related to OpenAI and happens randomly. Any idea?
It would be nice to be able to check whether the import succeeded, because the method won't fail in that case.
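Until the manager itself raises, a caller can inspect the batch result; entries with `'status': 'FAILED'` carry the messages shown above. A sketch (the dict shape is copied from the error in this issue; the function name is hypothetical):

```python
def raise_on_failed_import(result):
    """Raise if a Weaviate batch result dict reports a FAILED import."""
    if result.get("status") == "FAILED":
        messages = [
            err.get("message", "unknown error")
            for err in result.get("errors", {}).get("error", [])
        ]
        raise RuntimeError("Import failed: " + "; ".join(messages))
```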

Either lock input while previous is processing or disable enter/submit

Hi to whoever is reading this! 🤗 By the way, great work with Verba; I've been playing around with it and find it really useful.

Issue

I've realised that you can send more than one input sequentially while the first one is still processing. You could either lock the input bar so that users cannot type while the previous request in that session is being processed, or disable enter/submit so that attempting to send another message shows a toast saying sequential inputs aren't allowed, similar to other chatbots out there.

Thanks in advance!

Add support for LiteLLM

Hi there,
I am getting familiar with the source code, and I'd like the ability to point the embedding and generation settings at an OpenAI proxy server: https://docs.litellm.ai/docs/simple_proxy

We would just need two environment settings: the first is the API key, which you already have, and the second is the URL to point the OpenAI class at.

openai.api_key = "anything"             # this can be anything, we set the key on the proxy
openai.api_base = "http://0.0.0.0:8000" # set api base to the proxy from step 1

These are exactly the same settings you'd use with the library directly; normally the api_base points to Azure:

import openai

# optional; defaults to `os.environ['OPENAI_API_KEY']`
openai.api_key = '...'

# all client options can be configured just like the `OpenAI` instantiation counterpart
openai.base_url = "https://..."
openai.default_headers = {"x-foo": "true"}

Let me know.
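A sketch of how the two settings could be resolved from the environment. The variable name `OPENAI_BASE_URL` is an assumption for illustration, not something Verba currently reads:

```python
import os


def openai_settings_from_env():
    """Resolve the API key and base URL for the OpenAI client.

    OPENAI_BASE_URL is a hypothetical variable here; when pointed at a
    LiteLLM proxy, the key can be any placeholder, since authentication
    happens on the proxy side.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "anything")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url


# The result would then be assigned to openai.api_key and openai.base_url
# (openai.api_base on the pre-1.0 SDK) before any embedding or completion call.
```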
Cheers!

REST API to push external sources

Hi there,
It would be nice to have a simple REST API where I can push text from external sources. For instance, I have a large set of documents coming from YouTube captions and Whisper transcriptions (from audio), and you get the idea.
I'd also like a basic capability to tag the source so I know where it came from: at least a URL and a list of tags.
Cheers!
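A minimal payload shape for such an endpoint might carry the text, a source URL, and tags; everything here (class name, field names) is hypothetical:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PushDocument:
    """Hypothetical request body for a push-text REST endpoint."""
    text: str
    source_url: str = ""
    tags: List[str] = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self))
```

A YouTube-caption ingester could then POST `PushDocument(text=captions, source_url=video_url, tags=["youtube"]).to_json()` to such an endpoint.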

Loading data failed Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Hi,
I've followed the quick installation from pip and use the HuggingFace version.
I have a laptop with 2 GPUs (AMD and Nvidia).
The GUI starts flawlessly, but when I try to upload even the simplest TXT document, I get the following error:


Vectorizing document chunks:   0%|                                                                                                   | 0/1 [00:00<?, ?it/s]
โœ˜ Loading data failed Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu! (when checking argument for
argument index in method wrapper_CUDA__index_select)

Where should I force GPU usage?
Pretty please, help.
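On a hybrid-GPU laptop, the embedding model can end up loaded partly on `cuda:0` and partly on CPU. Forcing everything onto one device when the embedder loads usually avoids this; a sketch (the torch and sentence-transformers calls are shown as comments, since they need a working CUDA setup):

```python
def pick_device(cuda_available, force_cpu=False):
    """Choose a single device for both the model and its input tensors."""
    if force_cpu or not cuda_available:
        return "cpu"
    return "cuda:0"


# With torch and sentence-transformers installed, the choice would be
# applied when loading the embedder, e.g.:
#   import torch
#   from sentence_transformers import SentenceTransformer
#   device = pick_device(torch.cuda.is_available())
#   model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
# Exporting CUDA_VISIBLE_DEVICES="" before `verba start` is another way
# to force CPU-only operation.
```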

context length exceeded

Hi,
I tried to use my data (a text file with 5k lines), but every time I ask for something I get an error:
Something went wrong! This model's maximum context length is 4097 tokens. However, your messages resulted in 5747 tokens. Please reduce the length of the messages.|
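The retrieved context plus the prompt (5,747 tokens) exceeded the model's 4,097-token window. Lowering the chunker's UNITS/OVERLAP settings or the number of retrieved chunks helps; the idea can be sketched as trimming retrieved chunks to a budget (the function name and the word-count heuristic are illustrative; a real implementation would count tokens with tiktoken):

```python
def trim_context(chunks, budget):
    """Keep the highest-scored chunks until a size budget is exhausted.

    `chunks` are (text, score) pairs. Word count is a crude stand-in
    for tokens here.
    """
    kept, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())
        if used + cost > budget:
            continue
        kept.append(text)
        used += cost
    return kept
```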

Docker-compose: container verba starts but API does not work

I'm trying to run Verba using docker-compose. When I run it (I fixed the Dockerfile to use Python 3.10, by the way) I get:

verba-verba-1     | /usr/local/lib/python3.10/site-packages/PyPDF2/__init__.py:21: DeprecationWarning: PyPDF2 is deprecated. Please move to the pypdf library instead.
verba-verba-1     |   warnings.warn(
verba-verba-1     | INFO:     Will watch for changes in these directories: ['/Verba']
verba-verba-1     | INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
verba-verba-1     | INFO:     Started reloader process [1] using WatchFiles
verba-verba-1     | /usr/local/lib/python3.10/site-packages/PyPDF2/__init__.py:21: DeprecationWarning: PyPDF2 is deprecated. Please move to the pypdf library instead.
verba-verba-1     |   warnings.warn(
verba-verba-1     | /usr/local/lib/python3.10/site-packages/fastapi/openapi/models.py:55: DeprecationWarning: `general_plain_validator_function` is deprecated, use `with_info_plain_validator_function` instead.
verba-verba-1     |   return general_plain_validator_function(cls._validate)
verba-verba-1     | /usr/local/lib/python3.10/site-packages/pydantic_core/core_schema.py:3902: DeprecationWarning: `general_plain_validator_function` is deprecated, use `with_info_plain_validator_function` instead.
verba-verba-1     |   warnings.warn(
verba-verba-1     | INFO:     Started server process [39]
verba-verba-1     | INFO:     Waiting for application startup.
verba-verba-1     | INFO:     Application startup complete.
verba-verba-1     | โš  No module named 'torch'
verba-verba-1     | โ„น Setting up client
verba-verba-1     | โ„น No Auth information provided
verba-verba-1     | โœ” Connected to Weaviate
verba-verba-1     | โ„น New Config initialized
verba-verba-1     | โœ” Saved Config
verba-verba-1     | โœ” Set Reader to SimpleReader
verba-verba-1     | โœ” Set Chunker to TokenChunker
verba-verba-1     | โœ” Set Retriever to WindowRetriever
verba-verba-1     | โœ” Saved Config
verba-verba-1     | INFO:     127.0.0.1:45968 - "HEAD / HTTP/1.1" 200 OK
verba-verba-1     | INFO:     127.0.0.1:45976 - "HEAD / HTTP/1.1" 200 OK

In spite of this, I can't access Verba on localhost:8000; I've waited for a couple of minutes. Weaviate seems to work and returns:

{"links":[{"href":"/v1/meta","name":"Meta information about this instance/cluster"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/schema","href":"/v1/schema","name":"view complete schema"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/schema","href":"/v1/schema{/:className}","name":"CRUD schema"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/objects","href":"/v1/objects{/:id}","name":"CRUD objects"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/classification,https://weaviate.io/developers/weaviate/api/rest/classification#knn-classification","href":"/v1/classifications{/:id}","name":"trigger and view status of classifications"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/well-known#liveness","href":"/v1/.well-known/live","name":"check if Weaviate is live (returns 200 on GET when live)"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/well-known#readiness","href":"/v1/.well-known/ready","name":"check if Weaviate is ready (returns 200 on GET when ready)"},{"documentationHref":"https://weaviate.io/developers/weaviate/api/rest/well-known#openid-configuration","href":"/v1/.well-known/openid-configuration","name":"view link to openid configuration (returns 404 on GET if no openid is configured)"}]}

Python 3.10

Hello,
In the latest version on the main branch, you have added some Python 3.10-only code, but the Dockerfile still uses Python 3.9.

Add CI/CD

Having CI/CD would enable automated linting and testing in the pipeline, which would make it easier to contribute to the project, ensure quality, reduce the number of bugs, and so on.
GitHub Actions would do the job just fine.
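A minimal GitHub Actions workflow along these lines would cover linting and tests; the file path, job names, and the black/pytest tooling choice are suggestions, not existing project config:

```yaml
# .github/workflows/ci.yml -- sketch of a lint + test pipeline
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -e . black pytest
      - run: black --check goldenverba
      - run: pytest
```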

Import Error

โ„น Setting up client
โœ” Client connected to Weaviate Cluster
โ„น All schemas available
โ„น Reading data\minecraft\minecraft_guide.txt
โ„น Reading data\minecraft\minecraft_wiki.txt
โœ” Loaded 2 files
โ„น Converted data\minecraft\minecraft_guide.txt
Traceback (most recent call last):
  File "\\?\C:\Users\xxxx\AppData\Local\Programs\Python\Python310\Scripts\verba-script.py", line 33, in <module>
    sys.exit(load_entry_point('goldenverba', 'console_scripts', 'verba')())
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\xxxx\desktop\train\verba\goldenverba\ingestion\cli.py", line 68, in import_data_command
    import_data(path, model)
  File "c:\users\xxxx\desktop\train\verba\goldenverba\ingestion\import_data.py", line 65, in import_data
    documents = convert_files(client, file_contents, nlp=nlp)
  File "c:\users\xxxx\desktop\train\verba\goldenverba\ingestion\preprocess.py", line 178, in convert_files
    if not check_if_file_exits(client, file_name):
  File "c:\users\xxxx\desktop\train\verba\goldenverba\ingestion\preprocess.py", line 207, in check_if_file_exits
    if results["data"]["Get"]["Document"]:
KeyError: 'data'

C:\Users\xxxx\Desktop\train\Verba>

requests.exceptions.ConnectionError - when importing data

When executing the import data command I get a connection error, but when I run the same command again, it passes. However, it appears that the application does read in the information from data.txt.

I don't set VERBA_URL or VERBA_API_KEY.

Command that I execute:
verba import --path "/home/snow/Documents/Projects/weaviate-verba/resources/" --model "gpt-3.5-turbo" --clear True

Traceback:

verba import --path "/home/snow/Documents/Projects/weaviate-verba/resources/" --model "gpt-3.5-turbo" --clear True

===================== Creating Document and Chunk class =====================
โ„น Setting up client
โ„น VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 200093
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:11:07+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:11:07+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_d8Z8DfRI0d6H","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:07+02:00","took":140330}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_ZFmXy4onugDM","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:07+02:00","took":87042}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_0dHcUXJompMJ","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:07+02:00","took":90100}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_aJSePokVcdxW","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:07+02:00","took":104632}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:11:07+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:11:07+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:07+02:00"}
โœ” Client connected to local Weaviate server
Document class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"document_at2uxArUgLG1","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":182906}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"chunk_L4xfAO6z3nSu","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":170259}
โœ” 'Document' and 'Chunk' schemas created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:11:11+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:11+02:00"}
โ„น Done

============================ Creating Cache class ============================
โ„น Setting up client
โ„น VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 200235
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:11:11+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:11:11+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_d8Z8DfRI0d6H","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":59862}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_L4xfAO6z3nSu","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":38679}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_at2uxArUgLG1","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":38062}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_aJSePokVcdxW","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:11+02:00","took":35334}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:11:11+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:11:11+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:11+02:00"}
โœ” Client connected to local Weaviate server
Cache class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"cache_n3X4UDflnGSR","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:13+02:00","took":108556}
✔ 'Cache' schema created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:11:13+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:13+02:00"}
ℹ Done

========================= Creating Suggestion class =========================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 200307
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:11:13+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:11:13+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_n3X4UDflnGSR","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:13+02:00","took":56962}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_L4xfAO6z3nSu","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:13+02:00","took":55732}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_at2uxArUgLG1","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:13+02:00","took":84016}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_aJSePokVcdxW","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:13+02:00","took":60599}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:11:13+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:11:13+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:13+02:00"}
✔ Client connected to local Weaviate server
Suggestion class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"suggestion_JvWuPHTJUNPN","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:14+02:00","took":199752}
✔ 'Suggestion' schema created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:11:14+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:14+02:00"}
ℹ Done

============================ Starting data import ============================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 200378
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:11:14+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:11:14+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_n3X4UDflnGSR","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:14+02:00","took":39691}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_L4xfAO6z3nSu","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:14+02:00","took":54810}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_at2uxArUgLG1","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:14+02:00","took":34729}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_JvWuPHTJUNPN","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:11:14+02:00","took":48080}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:11:14+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:11:14+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:14+02:00"}
✔ Client connected to local Weaviate server
ℹ All schemas available
ℹ Reading
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ Loaded 1 files
ℹ Converted
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ All 1 files successfully loaded
ℹ Starting splitting process
✔ Successful splitting (total 0)
ℹ (1/1) Importing document
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ Imported all docs
✔ Imported all chunks
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:11:14+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:11:14+02:00"}
Exception in thread batchSizeRefresh:
Traceback (most recent call last):
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f5eadaec370>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=6666): Max retries exceeded with url: /v1/nodes (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5eadaec370>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/weaviate/cluster/cluster.py", line 62, in get_nodes_status
    response = self._connection.get(path=path)
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/weaviate/connect/connection.py", line 539, in get
    return self._session.get(
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6666): Max retries exceeded with url: /v1/nodes (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5eadaec370>: Failed to establish a new connection: [Errno 111] Connection refused'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/weaviate/batch/crud_batch.py", line 442, in periodic_check
    status = cluster.get_nodes_status()
  File "/home/snow/venv/weaviate-verba/lib/python3.10/site-packages/weaviate/cluster/cluster.py", line 64, in get_nodes_status
    raise RequestsConnectionError(
requests.exceptions.ConnectionError: Get nodes status failed due to connection error
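The traceback above is not from the import itself: the client's background `batchSizeRefresh` thread polls the `/v1/nodes` endpoint once more after the embedded server on port 6666 has already shut down, so the refused connection surfaces as an unhandled exception in that thread. A minimal sketch of the same failure mode (the helper and URL below are illustrative, not part of Verba or the weaviate client):

```python
from urllib.error import URLError
from urllib.request import urlopen

def nodes_reachable(url: str = "http://localhost:6666/v1/nodes") -> bool:
    """Return True if Weaviate's /v1/nodes endpoint answers, False when the
    connection is refused (e.g. the embedded server has already stopped)."""
    try:
        with urlopen(url, timeout=1):
            return True
    except (URLError, OSError):
        return False
```

The client thread raises in this situation instead of returning quietly; since the import already finished (`Imported all docs` / `Imported all chunks` above), the exception is noisy but appears harmless.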

After simply repeating the command, the import completed without the error:

===================== Creating Document and Chunk class =====================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 205223
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:14:46+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:14:46+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_n3X4UDflnGSR","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:46+02:00","took":192734}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_L4xfAO6z3nSu","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:46+02:00","took":180536}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_at2uxArUgLG1","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:46+02:00","took":261801}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_JvWuPHTJUNPN","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:46+02:00","took":173080}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:14:46+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:14:46+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:46+02:00"}
✔ Client connected to local Weaviate server
Document class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"document_4KF2RtSb7AzF","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":50649}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"chunk_afMWBLITi7Hg","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":36439}
✔ 'Document' and 'Chunk' schemas created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:14:49+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:49+02:00"}
ℹ Done

============================ Creating Cache class ============================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 205337
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:14:49+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:14:49+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_n3X4UDflnGSR","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":53770}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_afMWBLITi7Hg","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":49957}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_4KF2RtSb7AzF","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":48970}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_JvWuPHTJUNPN","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:49+02:00","took":51621}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:14:49+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:14:49+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:49+02:00"}
✔ Client connected to local Weaviate server
Cache class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"cache_dE7829wgUrT0","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:50+02:00","took":80944}
✔ 'Cache' schema created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:14:50+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:50+02:00"}
ℹ Done

========================= Creating Suggestion class =========================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 205405
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:14:50+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:14:50+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_dE7829wgUrT0","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:50+02:00","took":41406}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_afMWBLITi7Hg","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:50+02:00","took":49932}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_4KF2RtSb7AzF","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:50+02:00","took":47882}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_JvWuPHTJUNPN","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:50+02:00","took":49368}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:14:50+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:14:50+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:50+02:00"}
✔ Client connected to local Weaviate server
Suggestion class already exists, do you want to overwrite it? (y/n): y
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"suggestion_nRDqcQr9MEAo","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:51+02:00","took":186550}
✔ 'Suggestion' schema created
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:14:51+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:51+02:00"}
ℹ Done

============================ Starting data import ============================
ℹ Setting up client
ℹ VERBA_URL environment variable not set. Using Weaviate Embedded
Started ./.verba/cache/weaviate-embedded: process ID 205474
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-09-08T00:14:51+02:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-09-08T00:14:51+02:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"cache_dE7829wgUrT0","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:51+02:00","took":196427}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"chunk_afMWBLITi7Hg","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:51+02:00","took":272944}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"document_4KF2RtSb7AzF","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:51+02:00","took":226739}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"suggestion_nRDqcQr9MEAo","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-09-08T00:14:51+02:00","took":157232}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-09-08T00:14:51+02:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-09-08T00:14:51+02:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:51+02:00"}
✔ Client connected to local Weaviate server
ℹ All schemas available
ℹ Reading
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ Loaded 1 files
ℹ Converted
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ All 1 files successfully loaded
ℹ Starting splitting process
✔ Successful splitting (total 0)
ℹ (1/1) Importing document
/home/snow/Documents/Projects/weaviate-verba/resources/data.txt
✔ Imported all docs
✔ Imported all chunks
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-09-08T00:14:52+02:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-09-08T00:14:52+02:00"}
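Both runs print `VERBA_URL environment variable not set. Using Weaviate Embedded`. If a Weaviate instance is already running, setting that variable before the command skips the repeated embedded start/stop cycle; the address below is a placeholder for your deployment, and depending on the setup an API key may also be required:

```shell
# Point Verba at an already-running Weaviate instance instead of letting it
# fall back to Weaviate Embedded. Replace the address with your deployment's.
export VERBA_URL=http://localhost:8080
echo "$VERBA_URL"
```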

Load and display PDF files

This is a feature request to add native support for PDF files.

My understanding is that Verba currently only handles text files such as .txt and .md. We can convert PDF files to text files to use them with Verba, but this removes images and affects formatting. It would be great if we could load .pdf files directly into Verba and have the UI display the PDF with the relevant sections highlighted.

Since many other file formats can be converted to PDF without losing their formatting, this would greatly expand the use cases of Verba.
