praveen76 / customize-word-embeddings-for-llms

We use the Natural Questions (NQ) dataset from Google. We first find weak negatives and hard negatives. We then compute embeddings with OpenAI's text-embedding-ada-002 embedding model and compare their accuracy and performance with those of customized word embeddings.

License: GNU General Public License v3.0

Jupyter Notebook 100.00%
large-language-models llms nlp nlq

Customize-Word-Embeddings-for-LLMs

Performance of Default Embeddings vs Customized Word Embeddings:

  • To get a visual representation of the 'behaviour' or performance of the default embeddings, we plot a distribution of cosine similarity.

  • The graphs show how much overlap there is between the distributions of cosine similarities for similar and dissimilar pairs. A high amount of overlap means that some dissimilar pairs have a greater cosine similarity than some similar pairs.

  • The accuracy computed here is the accuracy of a simple rule that predicts 'similar (1)' if the cosine similarity is above some threshold X and otherwise predicts 'dissimilar (0)'.

  • You can check out the similarity distribution plot BEFORE customization here: plot similarity distribution

  • You can check out the similarity distribution plot AFTER customization here: plot similarity distribution
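
As an illustration of the threshold rule described above, here is a minimal sketch in plain Python. The function names (`cosine_similarity`, `threshold_accuracy`, `best_threshold`), the threshold sweep, and the toy vectors are assumptions made for this example; they are not taken from the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def threshold_accuracy(similarities, labels, threshold):
    """Accuracy of the rule: predict 1 if similarity > threshold, else 0."""
    predictions = [1 if s > threshold else 0 for s in similarities]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def best_threshold(similarities, labels, steps=201):
    """Sweep thresholds over [-1, 1]; return (best accuracy, threshold)."""
    candidates = [-1 + 2 * i / (steps - 1) for i in range(steps)]
    return max((threshold_accuracy(similarities, labels, t), t) for t in candidates)

# Toy, made-up pairs: label 1 = similar, label 0 = dissimilar.
pairs = [
    ([1.0, 0.0], [1.0, 0.1], 1),
    ([0.0, 1.0], [0.1, 1.0], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([1.0, 1.0], [1.0, -1.0], 0),
]
similarities = [cosine_similarity(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
accuracy, threshold = best_threshold(similarities, labels)
print(f"best accuracy {accuracy:.2f} at threshold {threshold:.2f}")
```

When the two similarity distributions overlap heavily, no single threshold separates them and the best achievable accuracy drops, which is what the plots make visible.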

Data:

The Natural Questions corpus represents a question-answering dataset, comprising 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each instance consists of a query originating from google.com and an associated Wikipedia page. Within each Wikipedia page, there is an annotated passage, often referred to as the "long answer," which serves as a potential response to the query. Additionally, one or more short spans from this annotated passage contain the actual answer to the query. However, it is important to note that the long and short answer annotations may be left empty. When both the long and short answer annotations are empty, it indicates that no answer is available on the page. If the long answer annotation is non-empty but the short answer annotation remains empty, it suggests that the annotated passage provides a response to the question, yet no explicit short answer can be identified. Lastly, approximately 1% of the documents include a passage annotated with a short answer of "yes" or "no" instead of a list of short spans.
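
The four annotation cases described above can be sketched as a small classifier. The record layout used here (a dict with `long_answer`, `short_answers`, and `yes_no_answer` keys, with `"NONE"` meaning no yes/no answer) is a simplified, hypothetical representation chosen for illustration; the actual dataset files use a richer structure.

```python
def classify_annotation(annotation):
    """Return which of the four annotation cases a record falls into.

    `annotation` is a hypothetical, simplified dict:
      long_answer   -> passage text, or "" / None when empty
      short_answers -> list of short answer spans (possibly empty)
      yes_no_answer -> "YES", "NO", or "NONE"
    """
    long_answer = annotation.get("long_answer")
    short_answers = annotation.get("short_answers") or []
    yes_no = annotation.get("yes_no_answer", "NONE")

    if yes_no in ("YES", "NO"):
        return "yes_no"            # passage answered with yes/no
    if short_answers:
        return "short_answer"      # explicit short span(s) inside the passage
    if long_answer:
        return "long_answer_only"  # passage answers, but no short span exists
    return "no_answer"             # both annotations empty

# Made-up example records, one per case.
examples = [
    {"long_answer": "", "short_answers": [], "yes_no_answer": "NONE"},
    {"long_answer": "Some passage.", "short_answers": [], "yes_no_answer": "NONE"},
    {"long_answer": "Some passage.", "short_answers": ["1998"], "yes_no_answer": "NONE"},
    {"long_answer": "Some passage.", "short_answers": [], "yes_no_answer": "YES"},
]
cases = [classify_annotation(e) for e in examples]
print(cases)
```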

Instructions for Installation

Please install the following dependencies before proceeding further with the project.

Dependencies:

  • openai: 0.28.1
  • torch: 1.13.1+cpu
  • numpy: 1.18.1
  • pandas: 1.0.1
  • plotly: 5.17.0
  • tqdm: 4.42.1
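  • bs4 (installed as beautifulsoup4): 4.8.2
  • sklearn (installed as scikit-learn): 0.22.1
  • re: 2.2.1 (part of the Python standard library; no installation needed)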
  • matplotlib: 3.5.3
  • sentence_transformers: 2.2.2
  • tiktoken
  • functools (part of the Python standard library; no installation needed)
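
To check an environment against the versions listed above, a small script along these lines may help. It is a sketch, not part of the project: the `PINNED` table is transcribed from the list, the pip distribution names (`beautifulsoup4`, `scikit-learn`, `sentence-transformers`) are assumed to correspond to the listed import names, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

# Versions transcribed from the dependency list; None means "any version".
PINNED = {
    "openai": "0.28.1",
    "torch": "1.13.1+cpu",
    "numpy": "1.18.1",
    "pandas": "1.0.1",
    "plotly": "5.17.0",
    "tqdm": "4.42.1",
    "beautifulsoup4": "4.8.2",
    "scikit-learn": "0.22.1",
    "matplotlib": "3.5.3",
    "sentence-transformers": "2.2.2",
    "tiktoken": None,
}

def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def report(pinned):
    """Return (package, wanted, found, status) rows for each pinned package."""
    rows = []
    for package, wanted in pinned.items():
        found = installed_version(package)
        if found is None:
            status = "missing"
        elif wanted is None or found == wanted:
            status = "ok"
        else:
            status = "differs"
        rows.append((package, wanted, found, status))
    return rows

for package, wanted, found, status in report(PINNED):
    print(f"{package:<22} wanted={wanted} found={found} -> {status}")
```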

The code has been tested on Windows. It should work on other operating systems, but this has not yet been tested.

In case of any issue with installation or otherwise, please contact me on LinkedIn.

Steps involved:

The steps involved in our experiment are listed below. You can run the notebook yourself to get more details on the experiment.

    1. Data prep
    • Transforming prepared data
      • Negative sampling
        • i. Weak negatives
        • ii. Hard negatives
    2. Word embeddings
    • How to get embeddings
    • Visualizing embeddings
      • Queries and passages in latent space
    3. Model training
    • Motivating example
    • Performance before training
      • Training
        • Training setup explanation
        • Explaining the 'model' being trained
    4. Model performance after training
    • Visualizing before/after
      • Queries and passages in latent space
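
The negative sampling step can be sketched as follows. This README does not define the two kinds of negatives, so the sketch assumes a common reading: weak negatives are randomly chosen passages that do not belong to the query, and hard negatives are non-matching passages that nevertheless look similar to the query. Token-overlap (Jaccard) similarity stands in for embedding similarity here to keep the example dependency-free; all function names and the toy data are hypothetical.

```python
import random

def jaccard(text_a, text_b):
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def weak_negatives(positive_index, passages, k, rng):
    """Sample k random passage indices, excluding the query's positive passage."""
    candidates = [i for i in range(len(passages)) if i != positive_index]
    return rng.sample(candidates, k)

def hard_negatives(query, positive_index, passages, k):
    """Return the k non-positive passage indices most similar to the query."""
    candidates = [i for i in range(len(passages)) if i != positive_index]
    candidates.sort(key=lambda i: jaccard(query, passages[i]), reverse=True)
    return candidates[:k]

# Made-up toy corpus; passage 0 is the positive passage for the query.
passages = [
    "hamlet is a tragedy written by william shakespeare",
    "macbeth is a tragedy written by william shakespeare",
    "the eiffel tower is located in paris",
    "python is a programming language",
]
query = "which tragedy features prince hamlet"

weak = weak_negatives(0, passages, k=2, rng=random.Random(0))
hard = hard_negatives(query, 0, passages, k=1)
print("weak negatives:", weak)
print("hard negatives:", hard)
```

Hard negatives are the more informative training signal under this reading, because they are the pairs most likely to fall in the overlapping region of the similarity distributions.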

License

This repository and its contents are open-sourced under the MIT License. Feel free to use, modify, and distribute these projects in accordance with the terms specified in the license.

Issues:

If you encounter any issues or have suggestions for improvement, please open an issue in the Issues section of this repository.

Contributing

If you have a Data Science mini-project that you'd like to share, please follow the guidelines in CONTRIBUTING.md.

Code of Conduct

Please adhere to our Code of Conduct in all your interactions with the project.

Happy coding!!

About Me:

I'm a seasoned Data Scientist and founder of TowardsMachineLearning.Org. I've worked on various Machine Learning, NLP, and cutting-edge deep learning frameworks to solve numerous business problems.
