iryna-kondr / scikit-llm Goto Github PK

View Code? Open in Web Editor NEW

3.0K 35.0 235.0 246 KB

Seamlessly integrate LLMs into scikit-learn.

Home Page: https://beastbyte.ai/

License: MIT License

Python 100.00%

chatgpt deep-learning llm machine-learning scikit-learn transformers

scikit-llm's Introduction

Hello there, I'm Iryna 👋

I'm a Data Scientist currently focusing on open-source projects.

🔭 Currently Working on ...

I'm currently contributing to three open-source libraries designed to streamline and simplify various aspects of data science. Our aim is to empower users, regardless of their technical background, to harness the power of data for analysis, modeling, and decision-making.

🤝 Support our projects by giving a star ⭐

🚀 Scikit-LLM

Scikit-LLM logo

Scikit-LLM is a collection of scikit-learn compatible wrappers around large language models, allowing to treat them as regular sklearn estimators.

🌐 Visit the project: Scikit-LLM

📝 Check out the Medium article: Medium Article: Scikit-LLM

🦅 Falcon AutoML

Falcon logo

Falcon is a Python library designed for effortless training of production-ready machine learning models. Our primary objective is to simplify both the training process and the deployment of models to any target platform. To achieve this, we have incorporated native support for exporting the entire pipelines into the ONNX format, enabling seamless deployment of models across various environments. With Falcon, training and deploying your models becomes as easy as a single line of code.

🌐 Visit the project: Falcon AutoML

📝 Check out the Medium article: Medium Article: Falcon AutoML

🦊 Agent Dingo

AgentDingo logo

Dingo is a lightweight microframework designed for streamlining the development of LLM pipelines and autonomous agents.

🌐 Visit the project: Agent Dingo

📝 Check out the Medium article: Medium Article: Agent Dingo

📧 How to reach me: LinkedIn

scikit-llm's People

Contributors

Stargazers

Watchers

Forkers

gsgithub17 fatimamhelmy hangj11 meesala-bfrs01946 yashasdevasurmutt lkafle nadiaelkhodja kp-forks saulocatharino kenichi-segawa techthiyanes sunnyly2016 ai-jie01 akashmavle5 tikna123 djiwandou sidjain24 advit200 statsgary mec-is vanamayaswanth anupqindia khemanta tomakk fran-gen tomkallo marcosoares-92 hector1993prog dsupertramp balakkvj lozanocelia mekongdelta-mind junaidiqbalsyed acedesci davidwhiting dingchaoz jcmo-research mobs75 ai-ml-cv clement-lelievre areyesan jaedukseo josimarviana yehocoh benwaldner henokb mshaek shrahimim pandinosaurus deekshithadprakash evdcush debjyoti003 mzamini92 gyanachand1 abhishekdconviction iamollas roo-shy o7s8r6 ansariparvej hosseinghafarian junaidsheroz rafiulbiswas gilsondsouza sarbashis madeehrehman yiranvang mforootan tspannhw lapisco shweta2146 fsndzomga deyh2020 muneebhashmi7712 msocko420 ishandutta2007 louderthanthunderx1 hhy5277 rpiryani ramondch iuriimattos2 cotp27 puspendra114 davydw pingyangtiaer veerumehta awesome-software eduhayon ukaserge mz0in joshurbandavis maralzar ssahgal glaceage jrcribb richardsonjf dmsama99 julianaguama hoanbi1812000 coulibaly-b redonengineer

scikit-llm's Issues

Safe openai version to work on?

Hi, I try to use the few-shot classifier in the sample code. However, it seems that the openai package is restructuring their codes: https://community.openai.com/t/attributeerror-module-openai-has-no-attribute-embedding/484499.

Here are the error codes:
Could not obtain the completion after 3 retries: `APIRemovedInV1 ::

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.
...
A detailed migration guide is available here: openai/openai-python#742
`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable

So, is there an version of the openai package that is safe to run?

Update Readme for Azure users

I was facing some issues configuring the Api keys and the Api base for my deployed resource on Azure. I was able to solve it using the azure documentation, i think adding these changes would help someone trying to configure it faster.

Below is the proposed change to the ReadMe.

Using Azure OpenAI

from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<YOUR_KEY>")  # use azure key instead
SKLLMConfig.set_azure_api_base("<API_BASE>") # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/

# start with "azure::" prefix when setting the model name
model_name = "azure::<model_name>"
# e.g. ZeroShotGPTClassifier(openai_model="azure::gpt-3.5-turbo")

Note:

Azure OpenAI is not supported by the preprocessors at the moment.
The the openai_model should be the deployment name for the resource.
To find the API_KEY and the Azure Api Base : Azure OpenAi Documentation

Dependency VertexAI breaks Python installation

First of all, it seems that vertexai is not supposed to be used and merely a placeholder for google-cloud-aiplatform.
Secondly, the current wheel (at least on MacOS for me) is broken, as it contains a init.py file with the following content:

# ...

raise ImportError(
    "To use the Vertex AI SDK, install the google-cloud-aiplatform package."
)

However, this is not placed into site-packages/vertexai/init.py, BUT INSTEAD AT THE ROOT OF SITE-PACKAGES. This means every other module can NOT be imported anymore. I think this is an issue with the original wheel file, however since it is not needed I'd appreciate an updated version with the dependency removed.

'GPT4All' object has no attribute 'chat_completion'

if GPT4All is None:
raise ImportError(
"gpt4all is not installed, try pip install scikit-llm[gpt4all]"
)
if model not in _loaded_models.keys():
_loaded_models[model] = GPT4All(model)

return _loaded_models[model].chat_completion(
    messages, verbose=False, streaming=False, temp=1e-10
)

Generating Natural Language

Hi,

First of all, great work and highly commendable.

Is there any way to generate a natural language response like gpt-3.5?

Contribute.MD

We should have a contribute.Md file for users who want to contribute. I would also like to contribute to this project

is there a way to turn off the tqdm progress bar ?

Hi,

I wonder if there is a way to suppress the tqdm progress bar during fitting.

Thank you

ZeroShotGPTClassifier Error

I am running the example code:

from skllm.config import SKLLMConfig
import os

SKLLMConfig.set_openai_key(os.getenv("OPENAI_API_KEY"))
SKLLMConfig.set_openai_org(os.getenv("OpenAI-org"))

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)

I get the following error:

  3%|▎         | 1/30 [00:09<04:21,  9.00s/it]Could not obtain the completion after 3 retries: `AttributeError :: module 'openai' has no attribute 'ChatCompletion'`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable
  3%|▎         | 1/30 [00:15<07:15, 15.00s/it]

What is the `fit` method actually doing?

Hi,
Great work! I have 3 questions:

Refer to your example on Readme. As part of your fit method in ZeroShotGPTClassifier with gpt-3.5-turbo as the model, are you basically freezing the ada-02 embeddings and then adding some layer on top for the classification task? I'm asking this question because OpenAI APIs support fine-tuning only till GPT-3.
Or, are you simply using it as a zero-shot classifier, and no real training is happening? That is, fit method is only mapping to some prompts that is relevant for a classification task?
How to use scikit-llm for fine-tuning (on private data) for tasks such as summarization or question-answering?

Thanks!

Supporting local LLM api server and vLLM

Thanks for your great work!
Since https://github.com/lm-sys/FastChat can initiate a local server on llama2/vicuna, which api is quite similar to openai, it is possible to support FastChat api server, so we can inference with a local api server?

Besides, is there any plan to support batch inference with https://github.com/vllm-project/vllm? The tabular data examples are similar, so batch inference with vLLM could speed up the whole process than gpt4all

Dynamic FewShot GPTClassifier: does it cache Embd locally?

I wonder:

if DynamicFewShotGPTClassifier will cache embeddings by OpenAI locally for the 1st time calling it.
And can we access them as embeds can be used in other cases, so that we can save some budget?

Documentation:

It may be better if we set the Open API via environment variables.

from skllm.config import SKLLMConfig
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())

SKLLMConfig.set_openai_key(os.getenv('OPENAI_API_KEY'))

Documentation

Hi @iryna-kondr

Re: Documentation: Improve the project's documentation, including code comments and README files.

I would love to help document this project including code comments and possibly adding some use case notebooks etc.

Let me know if this is something you're open to or if you have some pointers on where to start?

Thanks!

Web Interface like scikit-learn

Developing a web interface like scikit-learn will help others understand better. The Ui may contain

Model Name
Model Specs
Model Description
Model snippet
Example

Unable to use gpt4all

I followed all the steps as mentioned in the readme but I couldn't use gpt4all in my colab notebook. It keeps prompting me to install it (error: gpt4all is not installed, try 'pip install sckit-llm[gpt4all]') even though I added the installation code before running my code snippet.

It's happening just with gpt4all as gpt 3.5 turbo atleast loads and works.

`GPTVectorizer().fit_transform(X)` always returns `RuntimeError`

Hi! First of all very nice work!

I was trying the embedding utility with something as simple as:

from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()
model = GPTVectorizer()
vectors = model.fit_transform(X)

however, I always get:

RuntimeError: Could not obtain the embedding after retrying 3 times. 
Last captured error: `<empty message>

I also tried with a custom dataset and with some simple strings.

Am I doing something wrong?

Prompt JSON response for DynamicFewShotGPTClassifier is blank

Hi,

I just started using skllm and I have tried to build a simple DynamicFewShotGPTClassifier with the following code:

from skllm import DynamicFewShotGPTClassifier

X = [
    "I love reading science fiction novels, they transport me to other worlds.",
    "A good mystery novel keeps me guessing until the very end.",
    "Historical novels give me a sense of different times and places.",
    "I love watching science fiction movies, they transport me to other galaxies.",
    "A good mystery movie keeps me on the edge of my seat.",
    "Historical movies offer a glimpse into the past.",
]

y = ["books", "books", "books", "movies", "movies", "movies"]

query = "I have fallen deeply in love with this sci-fi book; its unique blend of science and fiction has me spellbound."

clf = DynamicFewShotGPTClassifier(n_examples=1).fit(X, y)

prompt = clf._get_prompt(query)
print(prompt)

Everything seems to work just fine, but the Your JSON response is blank. Any ideas of what could be happening? Thanks

Prompt output:

List of categories: ['books', 'movies']

Training data:

Sample input:
```I love reading science fiction novels, they transport me to other worlds.```

Sample target: books


Sample input:
```I love watching science fiction movies, they transport me to other galaxies.```

Sample target: movies


Text sample: ```I have fallen deeply in love with this sci-fi book; its unique blend of science and fiction has me spellbound.```

Your JSON response:

prompt-engineering ?

hey fellas, according to this : link
prompting to classify can be made better, by using techniques like Chain of Thought and Few Shot training.

Want this incorporated ?

can you share link to Agent Dingo

Could we run those LLM models on CPU for inference?

Hi,

Many thanks for releasing this repo for using LLMs locally!

Could we know if it is possible to run these LLM models on CPU rather than GPU for inference?

Many thanks!

Is that possible not to raise error for few edge cases?

Hi, I am new to scikit-llm. Generally spekaing, ZeroShotGPTClassifier works for my daily work as follows.
clf= ZeroShotGPTClassifier(model=model_name)
clf.fit(X, y)
preds = clf.predict(X)
However, sometimes the input X may have few rows that are too long. It breaks my job by the error of 'context_length_exceeded', so I cannot get preds. Sometimes I fail to get predictions because few rows trigger the OpenAI error of 'content_filter'. (OpenAI's neural multi-class classification models believe my input text contains harmful content, but their predictions are false positives.)

I think the error comes from the retry function.

scikit-llm/skllm/utils.py

Line 92 in 0bdea94

raise RuntimeError(err_msg)

Is there a quick way that I can turn off this error arise in the version 1.0.0?
I am OK if the classification predictions are random if OpenAI API returns any error. Is that doable? My memory may not be accurate, but I remember the old manual of scikit-llm had something like the classifier still works even with an error, but the prediction will be random for that case.

Thank you in advance.

Long Documents for Summarization -> Zero-Shot Multi-Label Classification

How do I summarize an exceptionally long, book-sized document for summarization?
I want to create a summary and then use the LLM classifier on it. The "summary of summaries" approach takes too much time.
When working with a zero-shot multi-label classifier, are individual texts I want to classify treated as separate requests to the LLM API, or are multiple texts combined into one request? Specifically, when using list of lists , would all the text inputs be included in a single prompt, or would they remain separate? How do we manage text limits if they are all aggregated?

The scope of this repo is far beyond than it can be imagined

This is an amazing repo, I got so exited to see this. Because I have been thinking about similar ideas lately. I took a quick look into the repo, and it seems like the main aim (for now) is to support scikit learn datasets and make it some how fused with gpt prompts and make gpt models to do the work. But I think we there are more scopes to do just this.

Since scikit learn has been a huge ecosystem for ML and it is mostly used (till now and will be) by most of the organizations. When it comes to tabular data (100M row +) I do not think scikit LLM can work. But till now there are several real world problems that has this kind of tabular data with different varying patterns. And those kind of problems might not be solved with 'just' LLM.

Also at least with scikit learn models are much more LESS black box then these models are and hence easily interpretable where as these black box are far less interpretable and we might not be able to proof locally that why our models is generating this behavior for some data. Also these models are not deterministic, just change one small part of the input and the whole black box can turn into a different direction.

However, think of like this. Let's think in terms of BFF (Backend for Frontend) design, approach instead making GPT as the backend of computation, make it the front end part of the computation. Provide the dataset link, give it the 'sample dataset', provide the problem statement, give the meta data and using
langchain like tools, and existing awesome ecosystem of scikit learn, we can tell these models to do the computation on bare scikit learn and then come up with the predictions and even for example, if something is showing an anomaly in terms of behaviour, we can use explainable ai like LIME/SHAFT and use GPT in the top of these and may be generate awesome interpretable reports with these local 'fit' curves / graphs. In this way we can automate lot of process, keeping the reliability factor in check.

And then this can be used and 'deployed' in real world systems because the heavy lifting is still done by scikit learn, but with a front end of gpt. It then all boils down to fitting the right information and right instructions into right place to provide the results we want.

Some examples.

I provide a dataset link, the statement, metadata, it makes the ml model stores some where and provides the training report for each stage of ML training.

Then if I just provide a new data like, I have an user (not present in training data) with these 'unseen' feature what will be the prediction for that user. On the backend it might run prediction pipeline and then we can provide lot of follow ups like

why you predicted this
What if the input was this and how the output would have changed

Even systems which use real time ML can also incorporate, because real time ML is highly dependent on interpretable and light weight models as Speed and reliability both are indepedent. Using gpt as a query engine interface on top of it can be used for enhanced telemetry or something else, like automatically generating the use behavior from data drift or something else. All we might have to query with natural language.

In that way we are not using huge amount of tokens, can provide lesser black box results and also an use case of safe and less hallucinating AI. Please provide me your thought in general, I know this description got really big, but let me know, I am always up to discuss more on this, if my thought is aligned with yours.

Thanks

Missing support for using Azure OpenAI services.

Would love to see some support for using LLM in Azure OpenAI services. I couldn't find any such support but would be happy to build and collaborate with anyone interested.

InvalidRequestError POST /v1/openai/deployments/

Hello,
I have a InvalidRequestError when trying to do a ZeroShotGPTClassifier, whereas I can call a ChatCompletion with the same model, here is my code:

from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key(openai.api_key)
SKLLMConfig.set_azure_api_base(openai.api_base)

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X = df[['cause','solution']].values
y = df['Causes_FinalNature_ENG'].values

defining the model

clf = ZeroShotGPTClassifier(openai_model="azure::gpt-4-32k")

fitting the data

clf.fit(X, y)

predicting the data

labels = clf.predict(X)

This code crashes with that error:
ould not obtain the completion after 3 retries: InvalidRequestError :: Invalid URL (POST /v1/openai/deployments/gpt-4-32k/chat/completions)

But when I run this code it works well:
response = openai.ChatCompletion.create(
engine='gpt-4-32k',
messages=[
{"role": "user", "content": "Who won the world series in 2020?"}],
max_tokens=193,
temperature=0,
)

Same stuff with gpt-35-turbo and gpt-3.5-turbo.

Do you know what's wrong with my code ?

Thanks in advance

Supporting NER tasks

Could we add support for named entity recognition tasks to the library? I see that the user interface would not change much from what was applied in the multi-label classification method, with the difference that instead of the entire text input being classified as one or multiple labels, we would have distinct entities from the text being recognized and labeled (the user could also give a semantic list of possible entities to be recognized as input).

A/B Test LLMs?

In production, we don't know if Llama2 is going to provide:

good results
quickly

Would it be helpful to provide a way to easily A/B test between new models in production?

Context - I'm working on LiteLLM and we recently released a way to a/b test straight from the completion endpoint:

Tutorial: https://docs.litellm.ai/docs/tutorials/ab_test_llms

Dashboard:

predict

ZeroShotGPTClassifier.predict must return an np.array instead of a list.

Add flag to control whether unknown labels are returned as None

Additionally, Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be selected randomly (label probabilities are proportional to label occurrences in the training set).

In many pipelines it is better to return a None label to such examples instead of choosing one at random.
Would want a flag to control this behavior:
either set a specific label (like -1) in those cases / set None / select label at random (correct behavior)

Sentiment

Having Sentiment.py in https://github.com/iryna-kondr/scikit-llm/tree/main/skllm/datasets. would it be needed and essential ? @iryna-kondr @Nadav-Barak

Evaluation

How to evaluate the zero shot classifier for multilabel task, can you make it adaptable to sikitlearn classification report

AttributeError: 'ZeroShotGPTClassifier' object has no attribute 'key'

I was trying to work on #55 a bit and ran into some other issues.
Calling any GPT classifier like below raises errors for key and org.

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
clf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ZeroShotGPTClassifier' object has no attribute 'org'

Expected behavior:

clf
ZeroShotGPTClassifier()

The likely cause is how you handle the key and org. You define key and org attributes in the ZeroShotGPTClassifier innit but then set them using _set_keys from the GPTMixin. GPTMixin, however creates 2 new attributes called openai_key and openai_org, the original attributes are never set.
The same issue arrises with FewShot, Dynamic and GPTVectorizer.

The easiest fix would be to do some refactoring in the GPTMixin.

Wrong output in GPTTranslator

Hi,

I have used GPTTranslator in some list of mixed english and spanish text, and the output that I get looks at follows (truncated).

I tried two times and the output ramains the same:

'```Requisitos para entidades culturales.``` \n```Requirements for cultural entities.``` \n```Requisitos para entidades culturales.```',
'Requisito de identificación de alcohol.',
'Ampliar plazo matrimonio. \n\nTranslation: Ampliar el plazo de matrimonio.',
'Bono de invierno incrementado.',
'Mejora del seguro de GPS.',
'Retiro de fondos.',
'Reingreso ilegal penalizado.',
'Modificación de protección del agua.',
'Autorización para portar armas.',
'Adición de reembolso de aerolínea.',
'Licencia adulterada.',
'Renombrar el aeropuerto en honor a Margot Duhalde.',
'Alquiler regulado.',
'Modificaciones laborales.',
'Ley de igualdad de edad.',
'Reforma constitucional.',
'Arborización por nacimiento.',
'Sanciones de armas más duras.',
'```Comités de seguridad.``` \n(Translated to Spanish: Comités de seguridad.)',
'Modificaciones legales para migrantes.',
'Criminalizando la negación de los derechos humanos.',
'Reforma de empresas estatales.',
'Exención de multa.',
'Prioridad de educación sexual de los padres.',
'Secretos bancarios ampliados.',
'Modificación de la ley del VIH.',
'Cambios en la ley de votación.',
'```Subsistema de Inteligencia.``` \n(Spanish)'

The input (truncated) is:

['Requisitos para entidades culturales.',
'Alcohol ID requirement.',
'Ampliar plazo matrimonio.',
'Bono Invierno incrementado.',
'GPS insurance improvement.',
'Retiro de fondos.',
'Illegal re-entry penalized.',
'Water protection modification.',
'Porte de armas autorizado.',
'Airline refund addition.',
'Licencia adulterada.',
'Rename airport after Margot Duhalde.',
'Arriendo regulado.',
'Modificaciones laborales.',
'Age equality law.',
'Constitutional reform.',
'Arborización por nacimiento.',
'Tougher weapon penalties.',
'Comités de seguridad.',
'Legal modifications for migrants.',
'Criminalizing human rights denial.',
'State-owned companies reform.',
'Exención de multa.',
"Parents' sexual education priority.",
'Secretos bancarios ampliados.',
'HIV law modification.',
'Voting law changes.',
'Subsistema de Inteligencia.']

Integration of other LLMs

Anything in pipeline to integrate other Large Language Models such as LLAMA, GPT-J, GPT4ALL -J, AUTO-GPT and others?
Quick Reference: https://github.com/ajayarunachalam/pychatgpt_gui

Closing as duplicate of #40

          Closing as duplicate of #40

Originally posted by @OKUA1 in #39 (comment)
>pip install annoy
WARNING: Ignoring invalid distribution -rotobuf (c:\users\sumeruinfra\appdata\local\programs\python\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ymongo (c:\users\sumeruinfra\appdata\local\programs\python\python310\lib\site-packages)
Collecting annoy
Using cached annoy-1.17.3.tar.gz (647 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy_init_.py -> build\lib.win-amd64-3.10\annoy
copying annoy_init_.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-_z1x63lg\annoy_7c08e48d9418499c876cb0133458a99b\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for annoy
Running setup.py clean for annoy
Failed to build annoy
ERROR: Could not build wheels for annoy, which is required to install pyproject.toml-based projects

[notice] A new release of pip is available: 23.1.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip

C:\Users\sumeruinfra>

can it output probabilities

Hi,i must say this is an excellent work.While using it, is it possible to output probabilities?

Error for DynamicFewShotGPTClassifier when using OpenAI Api

Hello,

My codes for DynamicFewShotGPTClassifier worked well about a month ago, but for some reason, I'm getting error messages today. I still have my openAI account that is pay as you go. I tried running the code with a new API, and I still got errors.

Below is my code for DynamicFewShotGPTClassifier and the error:

from skllm import DynamicFewShotGPTClassifier
GPT_model2 = DynamicFewShotGPTClassifier("gpt-4", n_examples=3)
GPT_model2.fit(X,y)
GPT_model2_predict = GPT_model2.predict(X)

RuntimeError Traceback (most recent call last)
Cell In[45], line 3
1 from skllm import DynamicFewShotGPTClassifier
2 GPT_model2 = DynamicFewShotGPTClassifier(n_examples=3)
----> 3 GPT_model2.fit(X,y)
4 GPT_model2_predict = GPT_model2.predict(X)
6 few_shot = classification_report(y, GPT_model2_predict, output_dict=True)

File ~/anaconda3/lib/python3.10/site-packages/skllm/models/gpt/gpt_dyn_few_shot_clf.py:81, in DynamicFewShotGPTClassifier.fit(self, X, y)
79 partition = X[y == cls]
80 self.data_[cls]["partition"] = partition
---> 81 embeddings = self.embedding_model_.transform(partition)
82 index = AnnoyMemoryIndex(embeddings.shape[1])
83 for i, embedding in enumerate(embeddings):

File ~/anaconda3/lib/python3.10/site-packages/sklearn/utils/_set_output.py:142, in _wrap_method_output..wrapped(self, X, *args, **kwargs)
140 @wraps(f)
141 def wrapped(self, X, *args, **kwargs):
--> 142 data_to_wrap = f(self, X, *args, **kwargs)
143 if isinstance(data_to_wrap, tuple):
144 # only wrap the first output for cross decomposition
145 return (
146 _wrap_data_with_container(method, data_to_wrap[0], X, self),
147 *data_to_wrap[1:],
148 )

File ~/anaconda3/lib/python3.10/site-packages/skllm/preprocessing/gpt_vectorizer.py:74, in GPTVectorizer.transform(self, X)
71 embeddings = []
72 for i in tqdm(range(len(X))):
73 embeddings.append(
---> 74 _get_embedding(X[i], self._get_openai_key(), self._get_openai_org())
75 )
76 embeddings = np.asarray(embeddings)
77 return embeddings

File ~/anaconda3/lib/python3.10/site-packages/skllm/openai/embeddings.py:48, in get_embedding(text, key, org, model, max_retries)
46 error_type = type(e).name
47 sleep(3)
---> 48 raise RuntimeError(
49 f"Could not obtain the embedding after {max_retries} retries: {error_type} :: {error_msg}"
50 )

RuntimeError: Could not obtain the embedding after 3 retries: InvalidRequestError :: Must provide an 'engine' or 'deployment_id' parameter to create a <class 'openai.api_resources.embedding.Embedding'>

Consider updating package requirements

Consider remaining compatible with the openai package.

Specs: MacBook Pro 2019

APIConnectionError

I am running this code:

from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("my key")
SKLLMConfig.set_azure_api_base("my url")

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = ZeroShotGPTClassifier(openai_model="azure::gpt-35-turbo")
clf.fit(X, y)
labels = clf.predict(X)

And i get this error:

1%|          | 1/109 [00:27<49:23, 27.44s/it]
Could not obtain the completion after 3 retries: `APIConnectionError :: Error communicating with OpenAI: HTTPSConnectionPool(host='gen-ai-sweden.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000018D602595D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable

I am using openai version 0.28.0

ERROR: Failed building wheel for annoy

Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy_init_.py -> build\lib.win-amd64-3.10\annoy
copying annoy_init_.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-kyqaegqf\annoy_bbac4dc388c84ce883a7b6302751b8ed\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

Issues is related to Build Wheels Files
@iryna-kondr

How we can use custom train model for predictions

Let us assume we have build model on our own custom labeled data.
we can save model as pickle file , while testing we can load that particular pickle file and do predictions. Is this functionality available with the current implementation.if yes, please share me the sample notebook or code for it.

thanks
chandra

Text Vectorization and Dynamic Few-Shot Text Classification not working

Hello,

I have the following code in my jupyter notebook:

!pip install scikit-llm
from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("")
SKLLMConfig.set_azure_api_base("<my azure api base")

from skllm.datasets import get_multilabel_classification_dataset
X, y = get_classification_dataset()

from skllm import DynamicFewShotGPTClassifier
GPT_model3 = DynamicFewShotGPTClassifier(n_examples=3)
GPT_model3.fit(X, y)
GPT_labels3 = GPT_model3.predict(X)

from skllm.preprocessing import GPTVectorizer
model = GPTVectorizer()
vectors = model.fit_transform(X)

Both dynamic few shot classifier and GPTvectorizer are giving me the following issues:

RuntimeError Traceback (most recent call last)
Cell In[135], line 4
2 X, _ = get_classification_dataset()
3 model = GPTVectorizer()
----> 4 vectors = model.fit_transform(X)

File ~/anaconda3/lib/python3.10/site-packages/skllm/preprocessing/gpt_vectorizer.py:94, in GPTVectorizer.fit_transform(self, X, y, **fit_params)
79 def fit_transform(self, X: Optional[Union[np.ndarray, pd.Series, List[str]]], y=None, **fit_params) -> ndarray:
80 """
81 Fits and transforms a list of strings into a list of GPT embeddings.
82 This is modelled to function as the sklearn fit_transform method
(...)
92 embeddings : np.ndarray
93 """
---> 94 return self.fit(X, y).transform(X)

If you could help me figure out this issue, that would be great!

Literature for dynamic few shot prompting

I understand the reasoning for first retrieving the most relevant (similar) text samples, but I was wondering if you have some papers you could recommend for this topic?

[Feature Request]: Batched Async Prediction

Hi,

The current scikit-llm is implemented in a synchronous way - the prompts are sent to the api one-by-one.

This is not ideal when we have a large dataset and a high tier (high TPM/RPM) account. Is it possible to incorporate batched async feature?

Reference:

oaib

Azure OpenAI Embeddings

You have added support for Azure OpenAI GPT models. Please add support for Azure OpenAI embedding models too. Due to this problem, I can't use the GPTVectorizer as well as Dynamic Few Shot Classification.

是否支持加载本地大模型做文本向量化？

Failed building wheel for annoy In scikit-llm[gpt4all]

Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy_init_.py -> build\lib.win-amd64-3.10\annoy
copying annoy_init_.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-risjy5ab\annoy_0f7489caf9de4e62a7376c32e7609982\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

Feature request: setting seed parameter of OpenAI's chat completions API

Thank you for creating and maintaining this awesome project!

OpenAI recently introduced the seed parameter to make their models' text generation and chat completion behavior (more) reproducible (see https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter).

I think it would be great if you could enable users of your package to control this parameter when using OpenAI models as a backend (i.e., in the files here: https://github.com/iryna-kondr/scikit-llm/tree/main/skllm/models/gpt)

The seed parameter could be hard-coded

scikit-llm/skllm/llm/gpt/clients/openai/completion.py

Line 50 in 0bdea94

temperature=0.0, messages=messages, **model_dict

similar to setting temperature=0.0.

Alternatively, users could pass seed=<SEED> via **kwargs.

Predictions scores for the ZeroShotGPTClassifier

I am trying to use the ZeroShotGPTClassifier and evaluate the results based on the prediction score. Is there a way to get the prediction score?

The HuggingFace ZeroShotClassifier returns a dictionary with labels and scores as lists. I am looking for a way to score the labels. Otherwise, there is no way to evaluate the labels from the ZeroShotClassifier. Especially if it returns a random label when no label matches.

Is it possible to hybrid Bard and ChatGPT?

ChatGPT is good at being creative.
Bard is good at answering and finding good responses.

WDYT?