Comments (5)

dosubot commented on June 8, 2024

The issue you're encountering stems from using AutoTokenizer.from_pretrained, which returns a tokenizer, not an embeddings model. Tokenizers prepare text for processing by models but don't generate embeddings themselves. To generate embeddings, you need to use a model designed for this purpose, such as those available from Hugging Face's transformers library.

To use open-source models for embeddings, you can load an embedding-capable model directly from Hugging Face. Here's a simplified approach:

  1. Choose an Embeddings Model: Select a model from Hugging Face that is suitable for generating embeddings, such as sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.

  2. Load the Model and Tokenizer: Use the AutoModel and AutoTokenizer classes to load your model and tokenizer.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

  3. Generate Embeddings: After tokenizing your text, pass the tokenized input to the model to obtain embeddings.

inputs = tokenizer("Your text here", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
# Mean-pool the token vectors into one embedding per input
embeddings = outputs.last_hidden_state.mean(dim=1)

This approach allows you to directly use models from Hugging Face for generating embeddings, bypassing the need for a method like get_text_embedding_batch, which is not a standard method for Hugging Face models or tokenizers.
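
One caveat with the plain mean over last_hidden_state: when several texts are batched with padding=True, padded positions are averaged in as well. A common remedy is to weight each token by the attention mask before averaging. The sketch below shows only that arithmetic on plain Python lists so it runs without any model; the function name masked_mean_pool and the toy values are made up for illustration and are not part of transformers or llama_index.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    token_vectors: one list per sequence, each holding equal-length float vectors.
    attention_mask: one list per sequence of 1 (real token) or 0 (padding).
    """
    pooled = []
    for vectors, mask in zip(token_vectors, attention_mask):
        total = [0.0] * len(vectors[0])
        count = 0
        for vec, keep in zip(vectors, mask):
            if keep:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        # Guard against an all-padding row to avoid dividing by zero.
        pooled.append([t / max(count, 1) for t in total])
    return pooled


# Toy batch (hypothetical values): two real tokens followed by one padded slot.
batch = [[[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]]
mask = [[1, 1, 0]]
print(masked_mean_pool(batch, mask))  # prints [[2.0, 4.0]]
```

With tensors, the same idea is usually written by multiplying the hidden states by the attention mask, summing over the token dimension, and dividing by the mask's sum. For a single unpadded text, the plain mean and the masked mean give the same result.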

Remember, the key is to use a model that is capable of generating embeddings, as opposed to a tokenizer, which is only responsible for preparing text inputs for a model.

from llama_index.

alineberry commented on June 8, 2024

@logan-markewich I think you tagged me by accident

logan-markewich commented on June 8, 2024

Oops I did, whoops!

@NeevrajKB

logan-markewich commented on June 8, 2024

@alineberry you set the embedding model to be a tokenizer. You probably want something like this instead:

pip install llama-index-embeddings-huggingface

from llama_index.embeddings.huggingface import HuggingFaceEmbedding 

embed_model = HuggingFaceEmbedding(model_name="<...>", embed_batch_size=10, device="<cpu or gpu>")

NeevrajKB commented on June 8, 2024

😁 Man you really made my day 😂😂😂
