relevanceai / vectorhub

Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)

Home Page: https://tryrelevance.com

License: Apache License 2.0

Makefile 0.28% Python 99.72%
python vector embeddings encodings vector-similarity transformers tfhub machine-learning deeplearning artificial-intelligence

vectorhub's Introduction

(Vectorhub is deprecated and no longer maintained. We recommend using Sentence Transformers, TFHub and Hugging Face directly. If you are looking to vectorize 100 million+ data points in a parallelized fashion, check out https://tryrelevance.com )



Vector Hub is a library for the publication, discovery, and consumption of state-of-the-art models that turn data into vectors (Text2Vec, Image2Vec, Video2Vec, Face2Vec, Bert2Vec, Inception2Vec, Code2Vec, LegalBert2Vec, etc.).



There are many ways to extract vectors from data. This library aims to bring state-of-the-art models together behind a simple interface so that you can vectorise your data easily.

Vector Hub provides:

  • A low barrier to entry for practitioners (using common methods)
  • Vectorisation of rich and complex data types such as text, image and audio in 3 lines of code
  • A way to retrieve and find information about a model
  • An easy way to handle dependencies for different models
  • A universal format for installation and encoding (using a simple encode method)

In order to provide an easy way for practitioners to quickly experiment, research and build new models and feature vectors, we provide a streamlined way to obtain vectors through our universal encode API.

Every model has the following:

  • encode allows you to turn raw data into a vector
  • bulk_encode allows you to turn multiple objects into multiple vectors
  • encode_documents returns a list of dictionaries with an encoded field
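As a rough illustration of how these three methods relate to each other, here is a minimal sketch built around a toy encoder. `ToyText2Vec`, its hashing scheme, the `encode_documents(fields, documents)` signature and the `_vector_` field suffix are all assumptions made for this example, not VectorHub's actual implementation.

```python
import hashlib
from typing import Dict, List


class ToyText2Vec:
    """Hypothetical stand-in for a VectorHub model (not a real encoder)."""

    vector_length = 4

    def encode(self, text: str) -> List[float]:
        # Derive a deterministic pseudo-vector from a hash of the text.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.vector_length]]

    def bulk_encode(self, texts: List[str]) -> List[List[float]]:
        # One vector per input, in the same order as the inputs.
        return [self.encode(text) for text in texts]

    def encode_documents(self, fields: List[str], documents: List[Dict]) -> List[Dict]:
        # Return copies of the documents with an added vector field per
        # requested field (the "_vector_" suffix is an assumed convention).
        encoded = []
        for document in documents:
            new_document = dict(document)
            for field in fields:
                new_document[field + "_vector_"] = self.encode(document[field])
            encoded.append(new_document)
        return encoded


encoder = ToyText2Vec()
vector = encoder.encode("This is sparta!")
vectors = encoder.bulk_encode(["dogs", "toilet", "paper"])
documents = encoder.encode_documents(["title"], [{"title": "enjoy walking"}])
```

The point of the sketch is only the shape of the interface: one input gives one vector, a list of inputs gives a list of vectors, and documents come back as dictionaries carrying their vectors alongside the original fields.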

For bi-modal models: Question Answering encoders will have:

  • encode_question
  • encode_answer
  • bulk_encode_question
  • bulk_encode_answer

Text Image Bi-encoders will have:

  • encode_image
  • encode_text
  • bulk_encode_image
  • bulk_encode_text
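The bi-encoder method names above can be pictured with a small sketch. `ToyQA2Vec` and its hashed bag-of-words scheme are hypothetical and exist only to show the interface; a real bi-encoder such as a QA model would typically use separately trained encoders for each side, whereas this toy uses the same function for both.

```python
import zlib
from typing import List


class ToyQA2Vec:
    """Hypothetical bi-encoder: one method per side, one shared vector space."""

    vector_length = 8

    def _bag_of_words(self, text: str) -> List[float]:
        # Count tokens into a fixed number of hashed buckets.
        vector = [0.0] * self.vector_length
        for token in text.lower().split():
            cleaned = token.strip("?!.,")
            bucket = zlib.crc32(cleaned.encode("utf-8")) % self.vector_length
            vector[bucket] += 1.0
        return vector

    def encode_question(self, question: str) -> List[float]:
        return self._bag_of_words(question)

    def encode_answer(self, answer: str) -> List[float]:
        return self._bag_of_words(answer)

    def bulk_encode_question(self, questions: List[str]) -> List[List[float]]:
        return [self.encode_question(question) for question in questions]

    def bulk_encode_answer(self, answers: List[str]) -> List[List[float]]:
        return [self.encode_answer(answer) for answer in answers]


def dot(a: List[float], b: List[float]) -> float:
    # Question and answer vectors have the same length, so they can be compared.
    return sum(x * y for x, y in zip(a, b))


model = ToyQA2Vec()
question_vector = model.encode_question("Who is sparta!")
answer_vector = model.encode_answer("Sparta!")
score = dot(question_vector, answer_vector)
```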

There are thousands of _____2Vec models across different use cases/domains. Vectorhub allows people to aggregate their work and share it with the community.


Powered By Relevance AI - Vector Experimentation Platform

Relevance AI is the vector platform for rapid experimentation. Launch great vector-based applications with flexible developer tools for storing, experimenting and deploying vectors.

Check out our Github repository here!


Quickstart:

Intro to Vectors | Model Hub | Google Colab Quickstart | Python Documentation


Installation:

To get started quickly install vectorhub:

pip install vectorhub

Alternatively, if you need more up-to-date models or features and can accept that they may not be fully stable, you can install the nightly version of VectorHub using:

pip install vectorhub-nightly

After this, our built-in dependency manager will tell you what to install when you instantiate a model. The main types of installation options can be found here: https://hub.getvectorai.com/

To install different types of models:

# To install transformer requirements
pip install vectorhub[text-encoder-transformers]

To install all models at once (note: this can take a while! We recommend searching the website for an interesting model, such as USE2Vec or BitMedium2Vec, and following its installation line, or see the examples below):

pip install vectorhub[all]

We recommend activating a new virtual environment and then installing using the following:

python3 -m pip install virtualenv 
python3 -m virtualenv env 
source env/bin/activate
python3 -m pip install --upgrade pip 
python3 -m pip install vectorhub[all]

Updates

Version 1.4

Previous issues with batch-processing:

When inputs were fed in bulk, the whole batch would fail even when only about 1 in 15 inputs was actually causing an error. Because bulk_encode was unreliable, most of the time it was just a list comprehension. This meant we lost any speed improvements we could have been getting, as we had to feed it through matrices every time.

The new design now lets us get the most out of multiple tensor inputs.
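The source does not describe how the new design works, so the following is only a sketch of one way to address the problem described above: try the fast batched path first, and isolate the failing inputs only if the batch raises. `bulk_encode_with_fallback`, `toy_batch_encode` and the placeholder value are hypothetical names chosen for this illustration.

```python
from typing import Callable, List, Sequence

DUMMY_VALUE = 1e-7  # placeholder value for inputs that cannot be encoded


def bulk_encode_with_fallback(
    batch_encode: Callable[[Sequence[str]], List[List[float]]],
    items: Sequence[str],
    vector_length: int,
) -> List[List[float]]:
    """Try the batched path first; fall back to per-item encoding on failure."""
    try:
        return batch_encode(items)
    except Exception:
        vectors = []
        for item in items:
            try:
                vectors.append(batch_encode([item])[0])
            except Exception:
                # Only the offending input receives a placeholder vector.
                vectors.append([DUMMY_VALUE] * vector_length)
        return vectors


def toy_batch_encode(texts: Sequence[str]) -> List[List[float]]:
    # Hypothetical batched encoder that rejects empty strings.
    if any(not text for text in texts):
        raise ValueError("empty input")
    return [[float(len(text)), 1.0] for text in texts]


result = bulk_encode_with_fallback(toy_batch_encode, ["dogs", "", "paper"], vector_length=2)
```

With this shape, a clean batch pays no extra cost, and a batch with one bad input still returns real vectors for the remaining inputs.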


Google's Big Image Transfer model

from vectorhub.encoders.image.tfhub import BitSmall2Vec
image_encoder = BitSmall2Vec()
image_encoder.encode('https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png')
[0.47, 0.83, 0.148, ...]

Google's BERT model

from vectorhub.encoders.text.tfhub import Bert2Vec
text_encoder = Bert2Vec()
text_encoder.encode('This is sparta!')
[0.47, 0.83, 0.148, ...]

Google's USE QA model

from vectorhub.bi_encoders.text.tfhub import UseQA2Vec
text_encoder = UseQA2Vec()
text_encoder.encode_question('Who is sparta!')
[0.47, 0.83, 0.148, ...]
text_encoder.encode_answer('Sparta!')
[0.47, 0.83, 0.148, ...]

HuggingFace Transformer's Albert

from vectorhub.encoders.text import Transformer2Vec
text_encoder = Transformer2Vec('albert-base-v2')
text_encoder.encode('This is sparta!')
[0.47, 0.83, 0.148, ...]

Facebook Dense Passage Retrieval

from vectorhub.bi_encoders.qa.torch_transformers import DPR2Vec
text_encoder = DPR2Vec()
text_encoder.encode_question('Who is sparta!')
[0.47, 0.83, 0.148, ...]
text_encoder.encode_answer('Sparta!')
[0.47, 0.83, 0.148, ...]

Index and search your vectors easily on the cloud using 1 line of code!

#pip install vectorhub[encoders-text-tfhub]
from vectorhub.encoders.text.tfhub import USE2Vec
encoder = USE2Vec()

# You can request an api_key simply by using your username and email.
username = '<your username>'
email = '<your email>'
api_key = encoder.request_api_key(username, email)

# Index in 1 line of code
items = ['dogs', 'toilet', 'paper', 'enjoy walking']
encoder.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
encoder.search('basin')

Add metadata to your search (information about your vectors)

# Add the number of letters of each word
metadata = [7, 6, 5, 12]
encoder.add_documents(username, api_key, items=items, metadata=metadata)

Using a document-oriented approach instead:

from vectorhub.encoders.text import Transformer2Vec
encoder = Transformer2Vec('bert-base-uncased')

from vectorai import ViClient
vi_client = ViClient(username, api_key)
docs = vi_client.create_sample_documents(10)
vi_client.insert_documents('collection_name_here', docs, models={'color': encoder.encode})

# Now we can search through our collection 
vi_client.search('collection_name_here', field='color_vector_', vector=encoder.encode('purple'))

Easily access information with your model!

# If you want additional information about the model, you can access it as below:
text_encoder.definition.repo
text_encoder.definition.description
# If you want all the information in a dictionary, you can call:
text_encoder.definition.create_dict() # returns a dictionary with model id, description, paper, etc.

Turn Off Error-Catching

By default, if encoding raises an error, VectorHub catches it and returns a vector filled with 1e-7, so that if you are encoding and then inserting, it errors out. If you want to turn off this automatic error-catching in VectorHub, simply run:

import vectorhub
vectorhub.options.set_option('catch_vector_errors', False)

If you want to turn it back on again, run:

vectorhub.options.set_option('catch_vector_errors', True)
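To make the behaviour of this option more concrete, here is a minimal sketch of an error-catching wrapper that returns a dummy vector. It is an illustration under assumptions, not VectorHub's code: the `OPTIONS` dictionary, the `catch_vector_errors` decorator and `is_dummy_vector` are hypothetical names. The `is_dummy_vector` helper shows how you might detect that an encode call silently failed.

```python
import functools
import warnings

OPTIONS = {"catch_vector_errors": True}  # hypothetical option store
DUMMY_VALUE = 1e-7


def catch_vector_errors(vector_length):
    """Decorator factory: replace encoding errors with a dummy vector."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not OPTIONS["catch_vector_errors"]:
                # Error-catching is off: let exceptions propagate.
                return func(*args, **kwargs)
            try:
                return func(*args, **kwargs)
            except Exception:
                warnings.warn("Unable to encode. Filling in with dummy vector.")
                return [DUMMY_VALUE] * vector_length
        return wrapper
    return decorator


@catch_vector_errors(vector_length=3)
def encode(text):
    # Toy encoder that only accepts strings.
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return [float(len(text)), 0.0, 1.0]


def is_dummy_vector(vector):
    # True when every value is the placeholder, i.e. encoding failed.
    return all(value == DUMMY_VALUE for value in vector)


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    caught = encode(None)  # error is caught, dummy vector returned

OPTIONS["catch_vector_errors"] = False
try:
    encode(None)
    raised = False
except TypeError:
    raised = True  # with catching off, the original error surfaces
OPTIONS["catch_vector_errors"] = True
```

If an encoder returns only 1e-7 values, turning error-catching off is a quick way to see the underlying exception.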

Instantiate our auto_encoder class as such and use any of the models!

from vectorhub.auto_encoder import AutoEncoder
encoder = AutoEncoder.from_model('text/bert')
encoder.encode("Hello vectorhub!")
[0.47, 0.83, 0.148, ...]

You can choose from our list of models:

['text/albert', 'text/bert', 'text/labse', 'text/use', 'text/use-multi', 'text/use-lite', 'text/legal-bert', 'audio/fairseq', 'audio/speech-embedding', 'audio/trill', 'audio/trill-distilled', 'audio/vggish', 'audio/yamnet', 'audio/wav2vec', 'image/bit', 'image/bit-medium', 'image/inception', 'image/inception-v2', 'image/inception-v3', 'image/inception-resnet', 'image/mobilenet', 'image/mobilenet-v2', 'image/resnet', 'image/resnet-v2', 'qa/use-multi-qa', 'qa/use-qa', 'qa/dpr', 'qa/lareqa-qa']
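The identifiers in this list appear to follow a `<modality>/<model>` pattern. Assuming that pattern holds, a short helper (hypothetical, not part of VectorHub) can group the names by modality, which is handy when browsing for a model of a given type. Only a few names from the list are used here.

```python
from collections import defaultdict

# A subset of the model names listed above.
model_names = ['text/albert', 'text/bert', 'audio/vggish', 'image/bit', 'qa/dpr']


def group_by_modality(names):
    """Group 'modality/model' identifiers into {modality: [model, ...]}."""
    groups = defaultdict(list)
    for name in names:
        modality, model = name.split('/', 1)
        groups[modality].append(model)
    return dict(groups)


groups = group_by_modality(model_names)
```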

What are Vectors?

Common terminology when working with vectors:

  • Vectors (aka embeddings, encodings, neural representations) ~ A vector is a list of numbers that represents a piece of data. E.g. the vector for the word "king" using a Word2Vec model is [0.47, 0.83, 0.148, ...]
  • ____2Vec (aka models, encoders, embedders) ~ Turns data into vectors, e.g. Word2Vec turns words into vectors


How can I use vectors?

Vectors have a broad range of applications. The most common use cases are performing semantic vector search and analysing topics/clusters using vector analytics.

If you are interested in these applications, take a look at Vector AI.
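As a small illustration of semantic vector search, the sketch below ranks stored items by cosine similarity to a query vector. The three-dimensional vectors are made up for the example, and `cosine_similarity` and `search` are hypothetical helpers, not part of any library mentioned here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, indexed, top_k=2):
    """Return the top_k item names most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), item)
        for item, vector in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Made-up vectors, for illustration only.
indexed = {
    "dogs": [0.9, 0.1, 0.0],
    "enjoy walking": [0.7, 0.3, 0.1],
    "paper": [0.0, 0.2, 0.9],
}
results = search([1.0, 0.0, 0.0], indexed, top_k=2)
```

In practice the vectors would come from an encoder, and a vector database would replace the linear scan.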

How can I obtain vectors?

  • Taking the outputs of layers from deep learning models
  • Data cleaning, such as one hot encoding labels
  • Converting graph representations to vectors
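Of the approaches listed above, one-hot encoding is the simplest to show in a few lines. The following sketch uses a hypothetical `one_hot_encode` helper and toy labels; each label becomes a vector with a single 1 at the position of its category.

```python
def one_hot_encode(labels):
    """Map each label to a vector with a 1 at its category's position."""
    categories = sorted(set(labels))
    index = {category: position for position, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[index[label]] = 1
        vectors.append(vector)
    return categories, vectors


categories, vectors = one_hot_encode(["cat", "dog", "cat", "bird"])
```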

How To Upload Your 2Vec Model

Read here if you would like to contribute your model!

Philosophy

The goal of VectorHub is to provide a flexible yet comprehensive framework that lets people easily turn their data into vectors, whatever form the data is in. While our focus is largely on simplicity, customisation should always be an option, and the level of abstraction is always up to the model uploader as long as the reason is justified. For example, with text we chose to keep the encoding at the text level rather than the token level, because selection of text should not be applied at the token level, so that practitioners are aware of what texts go into the actual vectors (i.e. instead of ignoring '[next][SEP][wo][##rd]', we choose to ignore 'next word' explicitly). We think this will allow practitioners to focus better on what should matter when it comes to encoding.

Similarly, when we turn data into vectors, we convert them to native Python objects. The aim is to remove as many dependencies as possible once the vectors are created, specifically those on deep learning frameworks such as TensorFlow and PyTorch. This allows other frameworks to be built on top of VectorHub.
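A minimal sketch of the native-Python idea, under stated assumptions: `to_native` is a hypothetical helper, and the standard library's `array.array` stands in for a framework tensor so the example needs no deep learning dependency. Array-like objects such as NumPy arrays also expose a `tolist()` method, which is what the helper relies on. Only flat vectors are handled.

```python
import json
from array import array


def to_native(vector):
    """Convert a flat array-like vector into a plain list of Python floats."""
    if hasattr(vector, "tolist"):
        vector = vector.tolist()
    return [float(value) for value in vector]


# array.array stands in for a framework-specific tensor here.
framework_like = array("f", [0.5, 0.25, 0.125])
native = to_native(framework_like)

# A plain list serialises without any framework installed.
payload = json.dumps({"vector": native})
```

Once the vector is a plain list, downstream code can store, serialise or index it without importing the framework that produced it.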

Team

This library is maintained by Relevance AI, your go-to solution for data science tooling with vectors. If you are interested in using our API for vector search, visit https://relevance.ai. It's free for public research and open source.

Credit:

This library wouldn't exist if it weren't for the following libraries and the incredible machine learning community that releases their state-of-the-art models:

  1. https://github.com/huggingface/transformers
  2. https://github.com/tensorflow/hub
  3. https://github.com/pytorch/pytorch
  4. Word2Vec image - Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
  5. https://github.com/UKPLab/sentence-transformers

vectorhub's People

Contributors

actions-user, biogeek, boba-and-beer, danvass, jackykoh, withshubh

vectorhub's Issues

Tensorflow 2.4 Support

I'm running into an issue with tensorflow 2.4

import tensorflow as tf
tf.__version__
>>> 2.4.0

from vectorhub.encoders.text.tfhub import Bert2Vec
model = Bert2Vec()
>>> 
    ValueError: Could not find matching function to call loaded from the SavedModel. Got:
      Positional arguments (3 total):
        * {'input_word_ids': <tf.Tensor 'inputs_2:0' shape=(None, 512) dtype=int32>, 'input_mask': <tf.Tensor 'inputs:0' shape=(None, 512) dtype=int32>, 'input_type_ids': <tf.Tensor 'inputs_1:0' shape=(None, 512) dtype=int32>}
        * False
        * None
      Keyword arguments: {}

    Expected these arguments to match one of the following 4 option(s):

    Option 1:
      Positional arguments (3 total):
        * [TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/0'), TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/1'), TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/2')]
        * False
        * None
      Keyword arguments: {}

    Option 2:
      Positional arguments (3 total):
        * [TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/0'), TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/1'), TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/2')]
        * True
        * None
      Keyword arguments: {}

    Option 3:
      Positional arguments (3 total):
        * [TensorSpec(shape=(None, None), dtype=tf.int32, name='input_word_ids'), TensorSpec(shape=(None, None), dtype=tf.int32, name='input_mask'), TensorSpec(shape=(None, None), dtype=tf.int32, name='input_type_ids')]
        * True
        * None
      Keyword arguments: {}

    Option 4:
      Positional arguments (3 total):
        * [TensorSpec(shape=(None, None), dtype=tf.int32, name='input_word_ids'), TensorSpec(shape=(None, None), dtype=tf.int32, name='input_mask'), TensorSpec(shape=(None, None), dtype=tf.int32, name='input_type_ids')]
        * False
        * None
      Keyword arguments: {}

Any suggestion on how to fix that?

SentenceTransformer2Vec example from docs raising a ModelError

Hello,
I've tried following the docs for SentenceTransformer2Vec.
I followed the steps listed on the page, installed the package along with the model, then when trying to import and instantiate it as the docs suggest, I'm seeing the following error:

from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec 
model = SentenceTransformer2Vec('bert-base-uncased') 
model.encode("I enjoy taking long walks along the beach with my dog.")

---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-360-23fbb894168d> in <module>
      1 from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
----> 2 model = SentenceTransformer2Vec('bert-base-uncased')
      3 model.encode("I enjoy taking long walks along the beach with my dog.")

~/opt/anaconda3/envs/kaggle/lib/python3.7/site-packages/vectorhub/encoders/text/sentence_transformers/sentence_auto_transformers.py in __init__(self, model_name)
     45     def __init__(self, model_name: str):
     46         self.list_of_urls = LIST_OF_URLS
---> 47         self.validate_model_url(model_name, LIST_OF_URLS)
     48         self.vector_length = LIST_OF_URLS[model_name]["vector_length"]
     49         self.model = SentenceTransformer(model_name)

~/opt/anaconda3/envs/kaggle/lib/python3.7/site-packages/vectorhub/base.py in validate_model_url(cls, model_url, list_of_urls)
     71             return True
     72         raise ModelError(
---> 73             message="We currently not support this url. If issue persist then contact us.")
     74 
     75     @classmethod

ModelError: 

It seems like bert-base-uncased is not present in the LIST_OF_URLS.

NameError: name 'sf' is not defined

Is soundfile imported correctly? Just curious: I was trying wav2vec and got this error.

I tried reimporting, uninstalling and reinstalling, but with no success.

Any help will be appreciated.

WARNING: tensorflow: 11 out of the last 11 calls

Hello community,

I get this warning and would like to ask what causes it and how to deal with it:

WARNING:tensorflow:11 out of the last 11 calls to <function recreate_function.<locals>.restored_function_body at 0x7fea085935e0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Thanks a lot!

embeddings of 1e-07 values

Hello community,

All img2vec models return only 1e-07 values for me (I used the same picture as in the given example):

from vectorhub.encoders.image.tfhub import BitSmall2Vec
image_encoder = BitSmall2Vec()
# https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png is saved locally as pic.png
sample = image_encoder.read('./pic.png')
emb = image_encoder.encode(sample)
[1e-07, ... 1e-07]

What am I doing wrong?

'hub' is not defined

Hello community,

I got this error:

>>> from vectorhub.encoders.image.tfhub import BitSmall2Vec
>>> model=BitSmall2Vec()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ivaadmin/anaconda3/envs/x2vec/lib/python3.9/site-packages/vectorhub/encoders/image/tfhub/bit.py", line 27, in __init__
    self.init(model_url)
  File "/home/ivaadmin/anaconda3/envs/x2vec/lib/python3.9/site-packages/vectorhub/encoders/image/tfhub/bit.py", line 34, in init
    self.model = hub.load(self.model_url)
NameError: name 'hub' is not defined

What is missing?

Use cleora

Thanks for your very useful and handy project!
https://github.com/Synerise/cleora is a very fast and responsible project written in Rust. Can you leverage the power of this project in your library (just add it to your approaches for graph embeddings)?
I used Cleora on a large amount of data and it is very fast and memory efficient.

Other __2vecs

Congratulations and thank you for this amazing initiative!
How about adding cat2vec, for categorical, tabular data, and node2vec, for network graphs?
Best wishes,
Milcent

Typo in https://hub.getvectorai.com/

The code example for Dense Passage Retrieval doesn't work because an "s" is missing:
from vectorhub.bi_encoder.text_text.torch_transformers import DPR2Vec
should be changed to
from vectorhub.bi_encoders.text_text.torch_transformers import DPR2Vec

CLIP2vec encode_image error

When I try the listed example for getting image vectors from CLIP:

from vectorhub.bi_encoders.text_image.torch import Clip2Vec
model = Clip2Vec()
model.encode_image('https://getvectorai.com/assets/hub-logo-with-text.png')

I get the following trace:

/home/is2961/anaconda3/lib/python3.9/site-packages/vectorhub/base.py:62: UserWarning: Unable to encode. Filling in with dummy vector.
  warnings.warn("Unable to encode. Filling in with dummy vector.")
Traceback (most recent call last):
  File "/home/is2961/anaconda3/lib/python3.9/site-packages/vectorhub/base.py", line 42, in catch_vector
    return func(*args, **kwargs)
  File "/home/is2961/anaconda3/lib/python3.9/site-packages/vectorhub/bi_encoders/text_image/torch/clip.py", line 101, in encode_image
    return self.model.encode_image(image).detach().numpy().tolist()[0]
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/multimodal/model/multimodal_transformer/___torch_mangle_9591.py", line 19, in encode_image
    _0 = self.visual
    input = torch.to(image, torch.device("cuda:0"), 5, False, False, None)
    return (_0).forward(input, )
            ~~~~~~~~~~~ <--- HERE
  def encode_text(self: __torch__.multimodal.model.multimodal_transformer.___torch_mangle_9591.Multimodal,
    input: Tensor) -> Tensor:
  File "code/__torch__/multimodal/model/multimodal_transformer.py", line 20, in forward
    _4 = self.positional_embedding
    _5 = self.class_embedding
    _6 = (self.conv1).forward(input, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _7 = ops.prim.NumToTensor(torch.size(_6, 0))
    _8 = int(_7)
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_9366.py", line 8, in forward
  def forward(self: __torch__.torch.nn.modules.conv.___torch_mangle_9366.Conv2d,
    input: Tensor) -> Tensor:
    x = torch._convolution(input, self.weight, None, [32, 32], [0, 0], [1, 1], False, [0, 0], 1, False, False, True, True)
        ~~~~~~~~~~~~~~~~~~ <--- HERE
    return x
  def forward1(self: __torch__.torch.nn.modules.conv.___torch_mangle_9366.Conv2d,

Traceback of TorchScript, original code (most recent call last):
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py(420): _conv_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py(423): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/workspace/multimodal-pytorch/multimodal/model/multimodal_transformer.py(85): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/workspace/multimodal-pytorch/multimodal/model/multimodal_transformer.py(221): visual_forward
/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py(940): trace_module
<ipython-input-1-40b054242c5d>(36): export_torchscript_models
<ipython-input-2-808c11c4d1cf>(3): <module>
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py(3418): run_code
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py(3338): run_ast_nodes
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py(3147): run_cell_async
/opt/conda/lib/python3.7/site-packages/IPython/core/async_helpers.py(68): _pseudo_sync_runner
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py(2923): _run_cell
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py(2878): run_cell
/opt/conda/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py(555): interact
/opt/conda/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py(564): mainloop
/opt/conda/lib/python3.7/site-packages/IPython/terminal/ipapp.py(356): start
/opt/conda/lib/python3.7/site-packages/traitlets/config/application.py(845): launch_instance
/opt/conda/lib/python3.7/site-packages/IPython/__init__.py(126): start_ipython
/opt/conda/bin/ipython(8): <module>
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [768, 3, 32, 32], but got 5-dimensional input of size [1, 1, 3, 224, 224] instead

Is there an easy fix for this?
