
Comments (11)

coreation commented on August 26, 2024

@igiloh-pinecone ... I found the issue after letting the code ponder in the back of my head :) The default encoder has changed to the latest OpenAI embedding (small) model, while my embeddings were still on ada embeddings. In a previous canopy install, in all likelihood the one running the canopy server, the default encoder still points to ada-002... So that explains the trash results my knowledge base gave me while, at the same time, the built-in canopy server returns decent RAG-based results.

I'll see if I can find the time to make a documentation PR so that the encoder is explicitly passed in the advanced example in library.md.
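For reference, this is the kind of explicit wiring I mean (a rough sketch; the import paths and encoder class match what I see in recent canopy versions, so double-check against your installed release):

    from canopy.tokenizer import Tokenizer
    from canopy.knowledge_base import KnowledgeBase
    from canopy.knowledge_base.record_encoder import OpenAIRecordEncoder

    Tokenizer.initialize()

    # Pin the embedding model to the one used when the documents were upserted,
    # instead of relying on the (version-dependent) default encoder.
    encoder = OpenAIRecordEncoder(model_name="text-embedding-ada-002")
    kb = KnowledgeBase(index_name="my-index", record_encoder=encoder)
    kb.connect()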


igiloh-pinecone commented on August 26, 2024

The default encoder has changed to the latest OpenAI embedding (small) model, while my embeddings were still on ada embeddings.

Thanks for the detailed response, @coreation!!
That's definitely an oversight by us. We shouldn't have changed the default like that without at least highlighting it as a breaking change. I will change the issue's name to make it more discoverable by other users encountering the same problem.


igiloh-pinecone commented on August 26, 2024

Gist for other people encountering this problem:
Before version 0.7.0, Canopy's default RecordEncoder was OpenAI(model_name='text-embedding-ada-002'). In version 0.7.0, the default was changed to use OpenAI's new embedding model (text-embedding-3-small).

If you inserted your documents in the past using an older Canopy version, then upgraded Canopy and tried using the query() or chat() functions, your newly loaded instance would be using a different embedding model than the one used to insert the documents.

To fix this problem:

  1. Run the canopy create-config <path> command to generate Canopy's default config templates in your desired <path>.
  2. Edit the default.yaml file, changing the embedding model_name to text-embedding-ada-002 (see the sketch after this list).
  3. Run canopy with the new config: canopy start --config <path>/default.yaml
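For step 2, the relevant part of default.yaml should end up looking roughly like this (the exact type value and nesting may differ between Canopy versions; the key change is model_name):

    chat_engine:
      context_engine:
        knowledge_base:
          record_encoder:
            type: OpenAIRecordEncoder
            params:
              # Match the model that was used when the documents were upserted.
              model_name: text-embedding-ada-002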


igiloh-pinecone commented on August 26, 2024

@coreation off the top of my head, two immediate thoughts:

  1. A chunk size of 4000 is actually quite large; chunk size is usually set in the area of 256-512 tokens. It would be very hard for an embedding model to capture such a long text in a single semantic representation, which might explain the poor retrieval results.
  2. You mentioned changing some default parameters (e.g. chunk_size). How did you change them? By setting them in a config file? If so, when you tested manually with Python code, did you use the same config file and/or the same parameter values? Otherwise it probably used the built-in defaults (one quick way to check is sketched below).
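One quick way to check is to load the same YAML file you pass to canopy start and inspect the parameters your Python code should be using (a sketch; adjust the key path to your config file's actual layout):

    import yaml

    # Load the same config file passed to `canopy start --config ...`
    # and print the parameters to compare against the Python code's values.
    with open("config/default.yaml") as f:
        config = yaml.safe_load(f)

    kb_config = config["chat_engine"]["context_engine"]["knowledge_base"]
    print(kb_config.get("chunker"))         # e.g. chunk_size / overlap settings
    print(kb_config.get("record_encoder"))  # e.g. embedding model_name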


coreation commented on August 26, 2024

@igiloh-pinecone thanks for the quick thoughts,

  1. I found it a bit large as well, but it's the default in this file; the chunk_size there is set to 4000... Am I reading this wrong?

  2. I'm not using the embedding via canopy directly, as I need to embed some things coming from a database, so I used the splitter (hence the "4000" chunk size) and then did the embedding and storing in Pinecone myself (roughly as sketched below). I'll try 1024; I've read a couple of articles, amongst which this one, that point out there's no single "good size", but 1024 seems to be a sweet spot. I'll re-embed everything and see if things get better. If we could get some feedback on the 4000 chunk_size in the langchain_text_splitter, I'm happy to make a small PR changing it to 1024 or 512.
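For context, my pipeline is roughly this (a simplified sketch; the index name and id scheme are made up):

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from openai import OpenAI
    from pinecone import Pinecone

    document_text = "..."  # text pulled from our database

    splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
    chunks = splitter.split_text(document_text)

    # Embed every chunk with the same model the knowledge base queries with.
    openai_client = OpenAI()
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=chunks)

    # Upsert the vectors, keeping the raw chunk text in the metadata.
    index = Pinecone().Index("my-index")
    index.upsert(vectors=[
        {"id": f"doc-{i}", "values": d.embedding, "metadata": {"text": chunk}}
        for i, (d, chunk) in enumerate(zip(resp.data, chunks))
    ])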


coreation commented on August 26, 2024

@igiloh-pinecone, it looks like the "chunk_size" in the splitter actually refers to the number of characters... I'm using the online OpenAI tokenizer to see how many tokens my pieces of text have, and they're all around 700 tokens instead of 4000. So I'd misread "chunk size" as a token count.

So the code that splits up my text uses the defaults described in the LangChain text splitter:

RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
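If I wanted the splitter to measure length in tokens instead of characters, LangChain's tiktoken helper should do it (a sketch; the chunk sizes are just example values):

    import tiktoken
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    some_chunk = "..."  # one of my existing 4000-character chunks

    # Count tokens the same way OpenAI's embedding models do.
    enc = tiktoken.get_encoding("cl100k_base")
    print(len(enc.encode(some_chunk)))  # comes out around ~700 for my chunks

    # Let the splitter measure chunk_size in tokens rather than characters.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", chunk_size=512, chunk_overlap=50
    )
    chunks = splitter.split_text(some_chunk)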


coreation commented on August 26, 2024

@igiloh-pinecone I've received a GitHub comment update, but I don't see your update here on the thread. To answer the question you wrote: I am indeed using the LangChain recursive splitter, which I believe takes the parameter in characters, not tokens. If I take random samples of my vectors, the text is around 700 tokens.

Is your suggestion to lower that amount and use the Canopy Chunker?


igiloh-pinecone commented on August 26, 2024

I noticed your previous message where you stated that you use LangChain directly, so I deleted mine as it was irrelevant.

My main point wasn't actually about the chunk size itself (I guess ~700 tokens is workable), but rather about how you configure your canopy server versus how you configured the direct Python ChatEngine. Are you sure you've used the same config/params?

One more suggestion: could you please try repeating the same question 2-3 times in each scenario (server API vs direct Python class)? Could it be that the underlying LLM is simply a bit "noisy", answering the same question differently every time?


coreation commented on August 26, 2024

Hey @igiloh-pinecone, what you see in the code example is the only configuration I use; it's the same as the variables I export before I start the canopy server. So that's simply the Pinecone API key, the OpenAI API key, and the index/namespace.

I've tried repeating the questions, but the outcome seems the same. See the images at the bottom of the comment: one contains sources, the other does not.

Do you guys offer any paid support, by chance? I'm somewhat knowledgeable, at a high level, about a couple of RAG frameworks; this is the first one that does away with a lot of fluff because it's tailored towards a use case. I could try again using LangChain, but for someone who isn't on top of it day-in, day-out, the state of LangChain is just too much to keep up with.

Responses using the canopy server:
[image: using-canopy-server]

Responses using the code example from library.md:
[image: with-custom-code]


coreation commented on August 26, 2024

@igiloh-pinecone, perhaps not unimportant: the "text" property in Pinecone is often displayed as type "[]" instead of text. Is this because of the newlines from the chunking?


coreation commented on August 26, 2024

Thanks for the swift response @igiloh-pinecone!

