SemanTweet Search

SemanTweet Search allows you to search over all your tweets from the Twitter archive using semantic similarity. A demo is available here.

It preprocesses your tweets, generates embeddings with OpenAI's text-embedding-3-small or text-embedding-3-large model, stores the tweets and embeddings in a LanceDB vector database, and provides a web interface for searching and viewing the results.

You can also pre-filter by time, likes, retweets, media-only tweets, or link-only tweets before running the semantic search.

Pre-filtering with SQL operations not only narrows the results but also shrinks the vector search space, which speeds up the search.
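
To illustrate why pre-filtering shrinks the search space, here is a minimal sketch in plain Python. It does not use LanceDB; the `tweets` records, the `likes` field, and the `prefiltered_search` helper are hypothetical names chosen for this example. The metadata filter runs first, so similarity is computed only for the tweets that pass it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prefiltered_search(tweets, query_vector, min_likes=0, top_k=2):
    """Filter on metadata first, then rank only the survivors by similarity."""
    # Step 1: cheap metadata filter (stands in for a SQL WHERE clause).
    candidates = [t for t in tweets if t["likes"] >= min_likes]
    # Step 2: similarity is computed only for the filtered candidates.
    ranked = sorted(
        candidates,
        key=lambda t: cosine_similarity(t["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: 2-dimensional vectors stand in for real embeddings.
tweets = [
    {"text": "learning rust", "likes": 12, "vector": [0.9, 0.1]},
    {"text": "coffee time", "likes": 3, "vector": [0.1, 0.9]},
    {"text": "rust borrow checker", "likes": 40, "vector": [0.8, 0.2]},
    {"text": "python tips", "likes": 25, "vector": [0.6, 0.4]},
]

results = prefiltered_search(tweets, query_vector=[1.0, 0.0], min_likes=10, top_k=2)
for tweet in results:
    print(tweet["text"])
```

With `min_likes=10`, the low-engagement tweet is dropped before any similarity is computed, which is the effect the SQL pre-filter has on the vector search.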

You can also use or edit projector.py together with the TensorFlow Projector to visualize your tweets with the t-SNE algorithm, as shown here.

UPDATE

  • (25/3/2024) Added support for bge-small-en-v1.5 embeddings. No API key is required. At the time of writing, the model ranked 29th on the MTEB leaderboard. It is about 130 MB in size, produces 384-dimensional embeddings, and has a sequence length of 512. Note that it is not multilingual.

    Technically, many other embedding models from sentence-transformers can be used as well. Refer to the LanceDB docs here.

  • (24/3/2024) Added CLIP-based image search over the tweets_media folder. See app_image_search.py after installing the requirements, i.e. after completing step 5. It is a standalone program, so you do not need to run the setup bash scripts for it. The first run takes roughly 5-10 minutes because it creates the embeddings. Subsequent runs are instant.

    This is also available as a standalone lightweight repository, Embeddit, which works with any folder of images.

Technologies Used:

  • Twitter archive for data
  • OpenAI text-embedding-3-large embeddings by default for semantic search
  • LanceDB for vector search and SQL operations
  • Hybrid search provided by LanceDB, combining BM25 and embedding search
  • Flask for server
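
The hybrid search item above combines a keyword (BM25) ranking with an embedding ranking. One common way to fuse two ranked lists is reciprocal rank fusion, and the sketch below shows that general idea in plain Python. This is an assumption made for illustration, not necessarily how LanceDB or this repo combines the two result sets; the tweet ids and the `reciprocal_rank_fusion` helper are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id earns 1 / (k + rank) from every list it appears in,
    so ids ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists, best match first.
bm25_ranking = ["t3", "t1", "t2"]       # keyword search
embedding_ranking = ["t1", "t4", "t3"]  # semantic search

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Tweets that appear in both lists (`t1` and `t3` here) end up ahead of tweets found by only one of the searches.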

Prerequisites

  • Python 3.x
  • OpenAI API key (not required if using sentence-transformers embeddings)
  • Twitter archive data

Installation

  1. Clone the repository:

    git clone https://github.com/sankalp1999/semantweet-search.git
    
    cd semantweet-search/
    
  2. Download your Twitter archive (it can take about 2 days to become available)

    Go to: More (3 dot button) > Settings and Privacy > Your Account > Download an archive of your data.

    Extract it. Put the extracted folder at the root of this project and rename it to twitter-archive.

  3. Create a virtual environment:

    python3 -m venv venv
    

    Make sure you run this at the root of the project.

  4. Activate the virtual environment:

    • For Unix/Linux:
      source venv/bin/activate
      
    • For Windows:
      venv\Scripts\activate
      
  5. Install the required dependencies:

    pip install -r requirements.txt
    

    If you want to try image vector search, also run the command below. It is not included in requirements.txt because it downloads a 620 MB model, which not everyone will want by default.

    pip install open_clip_torch
    
  6. (Not required if using sentence transformers) Set up your OpenAI API key as an environment variable:

    export OPENAI_API_KEY=your_api_key
    
  7. By default, this repo uses OpenAI's text-embedding-3-large model.

    You can switch to text-embedding-3-small. The change is required in two places.

  8. Run the setup script:

    For OpenAI:

    chmod +x run_scripts.sh
    ./run_scripts.sh

    For sentence-transformers:

    chmod +x run_sentence_tf_scripts.sh
    ./run_sentence_tf_scripts.sh

    Then uncomment lines 15 and 16 in app.py:

    # db = lancedb.connect("data/bge_embeddings")
    # table = db.open_table("bge_table")

  9. Start the application:

    python app.py
    

    or

    flask run
    

Enjoy!

Flow of the program

graph TD
    A[Twitter Archive Data] --> B[preprocess_tweets_one.py]
    B --> C[Preprocessed Tweets CSV]
    C --> D[async_openai_embedding_two.py]
    D --> E[Embeddings CSV]
    E --> F[create_lance_db_table_openai_three.py]
    F --> G[LanceDB Database]
    G --> H[Web Interface]
    H --> I[Search Tweets]
    I --> J[View Results]

The OpenAI embedding flow consists of the following steps:

  1. preprocess_tweets_one.py: This script preprocesses the tweets from the Twitter archive, extracting relevant information and saving it to a CSV file.

  2. async_openai_embedding_two.py: This script reads the preprocessed tweets from the CSV file, generates embeddings using OpenAI's embedding model asynchronously, and saves the embeddings to a new CSV file.

  3. create_lance_db_table_openai_three.py: This script reads the generated embeddings from the CSV file, creates a LanceDB table using the specified schema, and stores the data in the database.

The run_scripts.sh script automates the execution of these steps in the correct order.
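
As a rough Python equivalent of running the three steps in order, a driver could look like the sketch below. The contents of run_scripts.sh are not shown here, so this assumes each script is invoked without arguments from the project root; `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys

# Script names as listed in the flow above, in execution order.
PIPELINE_SCRIPTS = [
    "preprocess_tweets_one.py",
    "async_openai_embedding_two.py",
    "create_lance_db_table_openai_three.py",
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline():
    """Run each step in order, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError if a step exits non-zero,
        # so later steps never run on missing or partial output.
        subprocess.run(command, check=True)


# Only build and show the commands here; call run_pipeline() to execute them.
commands = build_commands("python3")
for command in commands:
    print(" ".join(command))
```

Stopping at the first failed step matters because each script reads the file written by the previous one.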

Additional Notes

  • The project uses the text-embedding-3-large model by default. You can change the model by modifying the MODEL_NAME variable.

  • The batch size for generating embeddings is set to 32 to stay within the token limit. Adjust the batch size if needed.

  • The LanceDB database is stored in the data/openai_db directory.

  • The project also includes a synchronous version of the OpenAI embedding generation script (create_openai_embedding_sync_two.py), which can be used as an alternative to the asynchronous version.
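
The batch size note above can be pictured with a small sketch: the tweets are split into chunks of at most 32 texts, and each chunk would be sent as one embedding request. The `make_batches` helper and the sample texts are hypothetical; the actual script may batch differently.

```python
def make_batches(items, batch_size=32):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 70 placeholder texts produce two full batches and one partial batch.
texts = [f"tweet {i}" for i in range(70)]
batches = make_batches(texts)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Lowering `batch_size` trades more requests for smaller payloads, which helps if long tweets push a batch over the token limit.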

License

This project is licensed under the MIT License. See the LICENSE file for details.
