
Comments (9)

ShishirPatil commented on August 15, 2024

Thank you for your kind words, and 893 out of 904 is a good match :) But yeah, you should get a 100% match - I think others have been able to reproduce it. For BM25, which variant of BM25 are you using? We used Okapi BM25 (BM25Okapi from rank_bm25). So we got the embedding for each of the documents (APIs in our case) listed here and then did a cosine similarity match. Similarly, even for OpenAI's embeddings, we didn't use the text-search models; instead we got the embeddings and compared them for each query. Can you try using text-embedding-ada-002-v2 to get the embedding and then do a simple top-1 cosine similarity search? Let me know how it goes or if you run into any issues.
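In case it helps, the whole top-1 search is just: embed every API document once, embed each query, and take the argmax of cosine similarity. A minimal sketch of that (assuming the old openai v0.x Python client; the path and query string below are placeholders for illustration):

import json

import numpy as np
import openai

openai.api_key = '****'

def embed(text, model="text-embedding-ada-002"):
    response = openai.Embedding.create(input=text, model=model)
    return np.array(response['data'][0]['embedding'])

# Embed every API document once (placeholder path).
with open('data/api/huggingface_api.jsonl') as f:
    docs = [json.dumps(json.loads(line)) for line in f]
doc_embs = np.stack([embed(d) for d in docs])
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# Top-1 cosine similarity for one placeholder query.
q = embed("I want to classify images of animals")
q /= np.linalg.norm(q)
print(docs[int(np.argmax(doc_embs @ q))])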


zhilizju commented on August 15, 2024

Great! Thank you very much! It seems that some parts of the code were missing, so I added them and reproduced the experiment. However, I still found some differences in the results: for BM25, 63 out of 904 differ, and for GPT-Index, 99 out of 904 differ.


zhilizju commented on August 15, 2024

Of course, but I have been busy with the rebuttal and paper submission lately. I will submit a PR (pull request) after a while. Thank you once again.


ShishirPatil commented on August 15, 2024

Hey @zhilizju thanks for raising this. What exactly are you referring to? Like how to build and use a retriever?


zhilizju commented on August 15, 2024

Yes, I attempted to reproduce the results of BM25 and GPT-Index. For the Huggingface dataset, 11 of the 904 instances in my reproduced BM25 retrieval results did not match those in the file 'questions_huggingface_bm25.jsonl'. Regarding GPT-Index, I noticed in the issue thread that you utilized Davinci v1, which I also adopted: 'text-search-davinci-query-001' for the queries and 'text-search-davinci-doc-001' for the API docs. I matched with cosine similarity, yet the results diverged considerably from those in 'questions_huggingface_gpt_index.jsonl': 328 of the 904 instances differ. The results I reproduced actually seem closer to the oracle. Of course, this does not affect the conclusions drawn in the paper, but since I hope to follow your nice work, I am eager to reproduce your results exactly.
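For reference, my davinci embedding calls looked roughly like this (old openai v0.x client; the two input strings are placeholders, only the model names are what I actually used):

import numpy as np
import openai

openai.api_key = '****'

def embed(text, model):
    response = openai.Embedding.create(input=text, model=model)
    return np.array(response['data'][0]['embedding'])

# Asymmetric text-search models: one for the query side, one for the document side.
q_emb = embed("example user query", model="text-search-davinci-query-001")
d_emb = embed("example API doc string", model="text-search-davinci-doc-001")

# Cosine similarity between query and document.
score = float(q_emb @ d_emb / (np.linalg.norm(q_emb) * np.linalg.norm(d_emb)))
print(score)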


zhilizju commented on August 15, 2024

For BM25, I use BM25Okapi from rank_bm25, too.
Code:

import json
from rank_bm25 import BM25Okapi


def load_data(file_name):
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API record back to a JSON string so the whole record is indexed.
    return [json.dumps(item) for item in data]

def init_bm25_model(texts):
    # Plain whitespace tokenization, as in the rank_bm25 README.
    tokenized_corpus = [doc.split(" ") for doc in texts]
    return BM25Okapi(tokenized_corpus)

def search(query, bm25, texts):
    # Return the top-1 document by BM25 score.
    tokenized_query = query.split(" ")
    return bm25.get_top_n(tokenized_query, texts, n=1)[0]


data = load_data('data/api/huggingface_api.jsonl')
texts = process_json_data(data)

bm25 = init_bm25_model(texts)

query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_bm25.jsonl')
for item in query_data:
    query = item['text']
    best_doc = search(query, bm25, texts)
    print(best_doc)

For GPT-Index, I tried text-embedding-ada-002-v2. The result is indeed closer to 'questions_huggingface_gpt_index.jsonl' than with text-search-davinci-doc-001, but 123 out of 904 still differ. The code is below:

Obtain the embeddings of the queries and APIs:

import json
import pickle
import openai

openai.api_key = '****'

def load_data(file_name):
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API record back to a JSON string.
    return [json.dumps(item) for item in data]

def text_to_embedding(text):
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def store_text_and_embeddings(texts, embeddings, filename):
    assert len(texts) == len(embeddings), "The length of texts and embeddings must be the same"
    data = {
        'texts': texts,
        'embeddings': embeddings
    }
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

# Embed the evaluation queries.
query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_gpt_index.jsonl')
queries = [item['text'] for item in query_data]
query_embeddings = [text_to_embedding(q) for q in queries]
store_text_and_embeddings(queries, query_embeddings, 'ada_query_texts_and_embeddings.pkl')

# Embed the API documents.
API_data = load_data('data/api/huggingface_api.jsonl')
API_texts = process_json_data(API_data)
API_embeddings = [text_to_embedding(t) for t in API_texts]
store_text_and_embeddings(API_texts, API_embeddings, 'ada_huggingface_api_texts_and_embeddings.pkl')

Calculate similarity:

import pickle

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(query_embedding, text_embeddings):
    # Cosine similarity between one query embedding and every document embedding.
    query_norm = norm(query_embedding)
    similarities = []
    for text_embedding in text_embeddings:
        text_norm = norm(text_embedding)
        similarities.append(dot(query_embedding, text_embedding) / (query_norm * text_norm))
    return similarities

def find_most_similar_text(query_embedding, texts, text_embeddings):
    # Top-1 retrieval: the document with the highest cosine similarity.
    similarities = cosine_similarity(query_embedding, text_embeddings)
    max_similarity_index = similarities.index(max(similarities))
    return texts[max_similarity_index]

def load_texts_and_embeddings(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data['texts'], data['embeddings']

texts, text_embeddings = load_texts_and_embeddings('ada_huggingface_api_texts_and_embeddings.pkl')
queries, query_embeddings = load_texts_and_embeddings('ada_query_texts_and_embeddings.pkl')

for query, query_embedding in zip(queries, query_embeddings):
    similar_text = find_most_similar_text(query_embedding, texts, text_embeddings)
    print(similar_text)


I don't know why. Can you give me some advice on how to reduce the difference? Perhaps you used different data preprocessing?
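To illustrate what I mean by preprocessing: even a small tokenization choice can flip BM25's top-1 result. A toy sketch with made-up documents (the lowercasing variant is just a hypothetical example of the kind of difference I am asking about):

from rank_bm25 import BM25Okapi

docs = [
    "Image Classification with ViT",
    "Text classification with BERT",
    "Object detection with DETR",
]
query = "image classification"

# Variant A: case-sensitive whitespace split, as in my script above.
# Only doc 2 contains the exact token "classification", so it is retrieved.
bm25_a = BM25Okapi([d.split(" ") for d in docs])
print(bm25_a.get_top_n(query.split(" "), docs, n=1)[0])

# Variant B: lowercase before tokenizing.
# Now doc 1 matches both query tokens and is retrieved instead.
bm25_b = BM25Okapi([d.lower().split(" ") for d in docs])
print(bm25_b.get_top_n(query.lower().split(" "), docs, n=1)[0])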


ShishirPatil commented on August 15, 2024

Hey @zhilizju I just merged #61, where we release our retriever code. We were able to verify a 100% match. Can you try this? Thanks!


ShishirPatil commented on August 15, 2024

Hey @zhilizju, do you mind sharing the code that you found missing as a PR? :) Would welcome contributions!


ShishirPatil commented on August 15, 2024

Thanks @zhilizju and good luck with the submissions :) Will close this for now!

