
Comments (9)

ShishirPatil commented on August 15, 2024

Thank you for your kind words, and 893 out of 904 is a good match :) But yeah, you should get a 100% match - I think others have been able to reproduce it. For BM25, which variant of BM25 are you using? We used Okapi BM25 (BM25Okapi from rank_bm25). So we got the embedding for each of the documents (APIs in our case) listed here and then did a cosine similarity match. Similarly, even for OpenAI's embeddings, we didn't use the text-search models; instead we got the embeddings and compared them for each query. Can you try using text-embedding-ada-002-v2 to get the embedding and then do a simple top-1 cosine similarity search? Let me know how it goes or if you run into any issues.
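In case it helps, the whole top-1 search is just: embed every API document once, embed each query, and take the argmax of cosine similarity. A minimal sketch of that (assuming the old openai v0.x Python client; the path and query string below are placeholders for illustration):

import json

import numpy as np
import openai

openai.api_key = '****'

def embed(text, model="text-embedding-ada-002"):
    response = openai.Embedding.create(input=text, model=model)
    return np.array(response['data'][0]['embedding'])

# Embed every API document once (placeholder path).
with open('data/api/huggingface_api.jsonl') as f:
    docs = [json.dumps(json.loads(line)) for line in f]
doc_embs = np.stack([embed(d) for d in docs])
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# Top-1 cosine similarity for one placeholder query.
q = embed("I want to classify images of animals")
q /= np.linalg.norm(q)
print(docs[int(np.argmax(doc_embs @ q))])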


zhilizju commented on August 15, 2024

Great! Thank you very much! It seems that some parts of the code were missing, so I added them and reproduced the experiment. However, I still found some differences in the results: for BM25, 63 out of 904 differ, and for GPT-Index, 99 out of 904 differ.


zhilizju commented on August 15, 2024

Of course, but I have been busy with the rebuttal and paper submission lately. I will submit a PR (pull request) after a while. Thank you once again.


ShishirPatil commented on August 15, 2024

Hey @zhilizju thanks for raising this. What exactly are you referring to? Like how to build and use a retriever?


zhilizju commented on August 15, 2024

Yes, I attempted to reproduce the results of BM25 and GPT-Index. For the Huggingface dataset, 11 of the 904 instances in my reproduced BM25 retrieval results did not match those in the file 'questions_huggingface_bm25.jsonl'. Regarding GPT-Index, I noticed in the issue thread that you utilized Davinci v1, which I also adopted: 'text-search-davinci-query-001' for the queries and 'text-search-davinci-doc-001' for the API docs. I matched with cosine similarity, yet the results diverged considerably from those in 'questions_huggingface_gpt_index.jsonl': 328 of the 904 instances differ. The results I reproduced actually seem closer to the oracle. Of course, this does not affect the conclusions drawn in the paper, but since I hope to follow your nice work, I am eager to reproduce your results exactly.
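For reference, my davinci embedding calls looked roughly like this (old openai v0.x client; the two input strings are placeholders, only the model names are what I actually used):

import numpy as np
import openai

openai.api_key = '****'

def embed(text, model):
    response = openai.Embedding.create(input=text, model=model)
    return np.array(response['data'][0]['embedding'])

# Asymmetric text-search models: one for the query side, one for the document side.
q_emb = embed("example user query", model="text-search-davinci-query-001")
d_emb = embed("example API doc string", model="text-search-davinci-doc-001")

# Cosine similarity between query and document.
score = float(q_emb @ d_emb / (np.linalg.norm(q_emb) * np.linalg.norm(d_emb)))
print(score)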


zhilizju commented on August 15, 2024

For BM25, I use BM25Okapi from rank_bm25, too.
Code:

import json
from rank_bm25 import BM25Okapi


def load_data(file_name):
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API record back to a JSON string so the whole record is indexed.
    return [json.dumps(item) for item in data]

def init_bm25_model(texts):
    # Plain whitespace tokenization, as in the rank_bm25 README.
    tokenized_corpus = [doc.split(" ") for doc in texts]
    return BM25Okapi(tokenized_corpus)

def search(query, bm25, texts):
    # Return the top-1 document by BM25 score.
    tokenized_query = query.split(" ")
    return bm25.get_top_n(tokenized_query, texts, n=1)[0]


data = load_data('data/api/huggingface_api.jsonl')
texts = process_json_data(data)

bm25 = init_bm25_model(texts)

query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_bm25.jsonl')
for item in query_data:
    query = item['text']
    best_doc = search(query, bm25, texts)
    print(best_doc)

For GPT-Index, I tried text-embedding-ada-002-v2. The result is indeed closer to 'questions_huggingface_gpt_index.jsonl' than with text-search-davinci-doc-001, but 123 out of 904 still differ. The code is below:

Obtain the embeddings of the queries and APIs:

import json
import pickle
import openai

openai.api_key = '****'

def load_data(file_name):
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API record back to a JSON string.
    return [json.dumps(item) for item in data]

def text_to_embedding(text):
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def store_text_and_embeddings(texts, embeddings, filename):
    assert len(texts) == len(embeddings), "The length of texts and embeddings must be the same"
    data = {
        'texts': texts,
        'embeddings': embeddings
    }
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

# Embed the evaluation queries.
query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_gpt_index.jsonl')
queries = [item['text'] for item in query_data]
query_embeddings = [text_to_embedding(q) for q in queries]
store_text_and_embeddings(queries, query_embeddings, 'ada_query_texts_and_embeddings.pkl')

# Embed the API documents.
API_data = load_data('data/api/huggingface_api.jsonl')
API_texts = process_json_data(API_data)
API_embeddings = [text_to_embedding(t) for t in API_texts]
store_text_and_embeddings(API_texts, API_embeddings, 'ada_huggingface_api_texts_and_embeddings.pkl')

Calculate similarity:

import pickle

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(query_embedding, text_embeddings):
    # Cosine similarity between one query embedding and every document embedding.
    query_norm = norm(query_embedding)
    similarities = []
    for text_embedding in text_embeddings:
        text_norm = norm(text_embedding)
        similarities.append(dot(query_embedding, text_embedding) / (query_norm * text_norm))
    return similarities

def find_most_similar_text(query_embedding, texts, text_embeddings):
    # Top-1 retrieval: the document with the highest cosine similarity.
    similarities = cosine_similarity(query_embedding, text_embeddings)
    max_similarity_index = similarities.index(max(similarities))
    return texts[max_similarity_index]

def load_texts_and_embeddings(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data['texts'], data['embeddings']

texts, text_embeddings = load_texts_and_embeddings('ada_huggingface_api_texts_and_embeddings.pkl')
queries, query_embeddings = load_texts_and_embeddings('ada_query_texts_and_embeddings.pkl')

for query, query_embedding in zip(queries, query_embeddings):
    similar_text = find_most_similar_text(query_embedding, texts, text_embeddings)
    print(similar_text)


I don't know why. Can you give me some advice on how to reduce the difference? Perhaps you used different data preprocessing?
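To illustrate what I mean by preprocessing: even a small tokenization choice can flip BM25's top-1 result. A toy sketch with made-up documents (the lowercasing variant is just a hypothetical example of the kind of difference I am asking about):

from rank_bm25 import BM25Okapi

docs = [
    "Image Classification with ViT",
    "Text classification with BERT",
    "Object detection with DETR",
]
query = "image classification"

# Variant A: case-sensitive whitespace split, as in my script above.
# Only doc 2 contains the exact token "classification", so it is retrieved.
bm25_a = BM25Okapi([d.split(" ") for d in docs])
print(bm25_a.get_top_n(query.split(" "), docs, n=1)[0])

# Variant B: lowercase before tokenizing.
# Now doc 1 matches both query tokens and is retrieved instead.
bm25_b = BM25Okapi([d.lower().split(" ") for d in docs])
print(bm25_b.get_top_n(query.lower().split(" "), docs, n=1)[0])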


ShishirPatil commented on August 15, 2024

Hey @zhilizju I just merged #61, where we release our retriever code. We were able to verify a 100% match. Can you try this? Thanks!


ShishirPatil commented on August 15, 2024

Hey @zhilizju, do you mind sharing the code that you found missing as a PR? :) Would welcome contributions!


ShishirPatil commented on August 15, 2024

Thanks @zhilizju and good luck with the submissions :) Will close this for now!

