

ialacol's People

Contributors

chenhunghan, damianoneill, dependabot[bot], donbale, ktaletsk


ialacol's Issues

Error when trying to use Falcon-7B

I get the following error when the Falcon-7B pod starts. It looks like the model file was deleted: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML/commit/7343df6eea4cfef162077380d075b49fdc9364ee

Deploy

helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml

Pod logs:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
Error downloading model: 404 Client Error. (Request ID: Root=1-64d11005-11d0783d650eb9652ee63f84)

Entry Not Found for url: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML/resolve/main/wizard-falcon-7b.ggmlv3.q4_1.bin.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
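A possible workaround, until the example values are updated, is to point DEFAULT_MODEL_FILE at a file that still exists in the TheBloke/WizardLM-Uncensored-Falcon-7B-GGML repo. This is only a sketch; the filename below is a placeholder to replace with one listed on the repo's "Files" tab:

# <existing-file.ggmlv3.bin> is a placeholder; use a filename that actually exists in the repo
helm upgrade --install falcon-7b ialacol/ialacol \
  -f examples/values/falcon-7b.yaml \
  --set-string deployment.env.DEFAULT_MODEL_FILE=<existing-file.ggmlv3.bin>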

Error when trying to use Starchat Beta

I get the following error when requesting anything from the Starchat Beta model.

Deploy

helm install starchat-beta ialacol/ialacol

Request and error

$ curl -X POST -H 'Content-Type: application/json' -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "starchat-beta.ggmlv3.q4_0.bin", "stream": false}' http://starchat-beta:8000/v1/chat/completions
Internal Server Error

Pod logs:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
Downloading (…)beta.ggmlv3.q4_0.bin: 100%|██████████| 10.7G/10.7G [03:09<00:00, 56.6MB/s]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
error loading model: unexpectedly reached end of file
llama_load_model_from_file: failed to load model
INFO:     <ip>:40350 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 227, in app
    solved_result = await solve_dependencies(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastapi/dependencies/utils.py", line 622, in solve_dependencies
    solved = await call(**sub_values)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/get_llm.py", line 45, in get_llm
    return AutoModelForCausalLM.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ctransformers/hub.py", line 157, in from_pretrained
    return LLM(
           ^^^^
  File "/usr/local/lib/python3.11/site-packages/ctransformers/llm.py", line 214, in __init__
    raise RuntimeError(
RuntimeError: Failed to create LLM 'llama' from './models/starchat-beta.ggmlv3.q4_0.bin'.

Any ideas whether I'm doing something wrong (e.g. the request structure is incorrect), or whether there is a legitimate issue with the image/deployment?
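One hedged guess: "unexpectedly reached end of file" often means the file on disk is shorter than expected, e.g. a truncated download, so a quick check is to compare the on-disk size against the ~10.7G shown by the download progress bar. The pod name and model path below are assumptions (the traceback suggests models live under ./models relative to /app):

# pod name is illustrative; adjust to your release
kubectl exec -it <starchat-beta-pod> -- ls -lh /app/models/starchat-beta.ggmlv3.q4_0.bin
# if the size is well short of ~10.7G, remove the partial file so it re-downloads on restart
kubectl exec -it <starchat-beta-pod> -- rm /app/models/starchat-beta.ggmlv3.q4_0.bin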

accelerator for project

Hi @chenhunghan, very cool project. I was looking for something like this when I saw that Falcon was out. Any recommendations on appropriate accelerators for running the 7B and 40B models? I'm going to try your project out, and I'm wondering if you've already experimented with appropriate node pool specifications for running the deployment.

Happy to add some Terraform templates to the repo for different providers if you want to collaborate.

Mixed streaming output and thread count for GPTQ models (bug)

Hi @chenhunghan
When streaming two separate queries to the LLM server at once, tokens from query 1 appear in query 2's response.
For example, if I execute two queries at the same time via two separate python3 scripts:
Script A: uses the streams.py example to ask "What is photosynthesis"
Script B: uses the streams.py example to ask "What is an airplane"

Sometimes the results for "What is an airplane" from Script B show up in the result stream for Script A, and vice versa. What could be the issue? Could it be the THREADS value? Right now I set it to "1" according to the guide, which says to use "1" for GPTQ models. I am using a very large AWS EC2 instance with many vCPUs; should I bump up the threads, or what is the solution to this?
Thank you @chenhunghan

Downloading models fail with timeouts, retry is not enabled.

When downloading a number of models (from Hugging Face), I get the following error every time, seemingly at exactly the same point for each model that I have tried, even very small ones.
ERROR: Error downloading model: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.

I think the code needs to be modified to include retries.

It would also be helpful if the documentation described how we can download models manually from within the pod.
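Until retries are added, a manual download from within the pod can work around the timeouts. This is only a sketch, assuming curl is available in the image and that models are stored under /app/models (inferred from the tracebacks elsewhere on this page); the pod name is illustrative, and the repo/file reuse the Llama 2 example from another issue:

kubectl exec -it <ialacol-pod> -- sh -c \
  'curl -L --retry 5 --retry-delay 10 \
     -o /app/models/llama-2-7b-chat.ggmlv3.q4_0.bin \
     https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin'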

Helm install fails

gengwg@gengwg-mbp:~$ helm repo add ialacol https://chenhunghan.github.io/ialacol
"ialacol" has been added to your repositories
gengwg@gengwg-mbp:~$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ialacol" chart repository
...Successfully got an update from the "harbor" chart repository
Update Complete. ⎈Happy Helming!⎈
gengwg@gengwg-mbp:~$ helm install llama-2-7b-chat ialacol/ialacol
Error: INSTALLATION FAILED: template: ialacol/templates/deployment.yaml:29:29: executing "ialacol/templates/deployment.yaml" at <.Values.deployment.env.DEFAULT_MODEL_HG_REPO_ID>: nil pointer evaluating interface {}.DEFAULT_MODEL_HG_REPO_ID
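The error suggests the chart requires deployment.env.DEFAULT_MODEL_HG_REPO_ID to be set, and the bare helm install above provides no values. A minimal sketch of an install that sets the model explicitly, reusing the repo id and filename from the Llama 2 values snippet further down this page:

helm install llama-2-7b-chat ialacol/ialacol \
  --set-string deployment.env.DEFAULT_MODEL_HG_REPO_ID=TheBloke/Llama-2-7B-Chat-GGML \
  --set-string deployment.env.DEFAULT_MODEL_FILE=llama-2-7b-chat.ggmlv3.q4_0.bin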

Deployment fails to respond with errors

I have been able to deploy this to Okteto; the volume was created and the pod successfully loads a model.
However, when I port-forward into it (which kubectl says worked) and connect to localhost:8000, I only get {"detail":"Not Found"} in my browser, and the log for the pod is as follows:
2023-08-24 17:14:36.55 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: Application startup complete.
2023-08-24 17:14:36.55 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: Uvicorn running on http://0.0.0.0:8000/ (Press CTRL+C to quit)
2023-08-24 17:15:28.35 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 10.8.26.12:32790 - "GET / HTTP/1.1" 404 Not Found
2023-08-24 17:19:14.65 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 127.0.0.1:58060 - "GET / HTTP/1.1" 404 Not Found
2023-08-24 17:19:14.91 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 127.0.0.1:58060 - "GET /favicon.ico HTTP/1.1" 404 Not Found
2023-08-24 17:19:21.12 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 127.0.0.1:58064 - "GET / HTTP/1.1" 404 Not Found
2023-08-24 17:19:22.11 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 127.0.0.1:58064 - "GET / HTTP/1.1" 404 Not Found
2023-08-24 17:42:47.20 UTC llamacpp-64bdc45bc5-fk6h9 llamacpp INFO: 127.0.0.1:58222 - "GET / HTTP/1.1" 404 Not Found

How can I fix this, or is this a known bug in the image? It seems like the web server is misconfigured.
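The 404s above are all for GET /, and the API may simply not route the root path. If so, the server might still be healthy; a better smoke test is the chat completions endpoint used in other issues on this page (the model filename below is illustrative):

curl -X POST -H 'Content-Type: application/json' \
  -d '{ "messages": [{"role": "user", "content": "Hello"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
  http://localhost:8000/v1/chat/completions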

Support GPTQ model

To add support for GPTQ models, we will need CI to build and push an image with a GPTQ tag.

Issue with GPU-accelerated LLAMA 2

I cannot get GPU-accelerated LLAMA 2 working. These are the pod logs when the error happens:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
Downloading (…)chat.ggmlv3.q4_0.bin: 100%|██████████| 3.79G/3.79G [00:09<00:00, 393MB/s]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
WARNING: failed to allocate 0.08 MB of pinned memory: CUDA driver version is insufficient for CUDA runtime version
CUDA error 35 at /home/runner/work/ctransformers/ctransformers/models/ggml/ggml-cuda.cu:4236: CUDA driver version is insufficient for CUDA runtime version

I am using g5.2xlarge EC2 instance with NVIDIA A10G GPU.
This is the output of nvidia-smi from within the pod:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Snippet of my Helm values

deployment:
  image: ghcr.io/chenhunghan/ialacol-cuda11:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/Llama-2-7B-Chat-GGML
    DEFAULT_MODEL_FILE: llama-2-7b-chat.ggmlv3.q4_0.bin
    GPU_LAYERS: 20

Any ideas where the mismatch is coming from?

Storage Class value named differently in PVC templates and documentation examples

There is a mismatch between docs and templates that I noticed:

The templates expect .Values.model.persistence.storageClassName, but the example values all use .Values.model.persistence.storageClass.

As a result, inattentive users might not notice the difference and end up with the default storage class instead of the intended one.
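Until the examples and templates agree, setting the key the templates actually read avoids silently falling back to the default storage class. A sketch, with an illustrative storage class name:

helm upgrade --install falcon-7b ialacol/ialacol \
  -f examples/values/falcon-7b.yaml \
  --set-string model.persistence.storageClassName=<intended-storage-class>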

Images with :cuda and :metal tags

In order to add support for CUDA 11 and Metal, we might want to add :cuda and :metal tags to the Docker images. We might need to update the CI for building the images.

Quickly get started with ialacol.

[Screenshot: 2023-07-21 at 9:20:00 PM]

Hi, so I deployed the pod for openllama-7b on a kind cluster, on a device that does not have any GPU. Will this model be able to run on a system with 16 GB of RAM? I have port-forwarded the pod, but I am not getting any responses with curl. Any suggestions on what might be the issue?

What I did:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol

kubectl port-forward svc/openllama-7b 8000:8000

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "open_llama_7b-q4_0-ggjt.bin"}' \
  http://localhost:8000/v1/chat/completions
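When the curl request just hangs, the pod logs usually show whether the model is still downloading or failed to load; a 7B q4 model should fit in 16 GB of RAM on CPU, but the first response can take a while. A sketch, assuming the chart names the Deployment after the Helm release:

kubectl get pods
# follow the logs while the model downloads and the first request is processed
kubectl logs deploy/openllama-7b -f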

Auto detecting threads

Hi @chenhunghan,

Nice project, it works well so far. Thanks for making it publicly available!

However, I think the thread auto-detection feature does not work correctly: if I do not set the environment variable, it runs on 8 threads even though I have, for example, 16. After setting the env variable to 16, all threads are utilized.

Also, one question: if I send requests in parallel, is it normal that ialacol exits?
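As a workaround until auto-detection is fixed, the thread count can be pinned explicitly via the THREADS environment variable mentioned in the GPTQ issue above. A sketch, with the release name and count as placeholders:

# <release-name> is illustrative; --reuse-values keeps the rest of your configuration
helm upgrade <release-name> ialacol/ialacol --reuse-values \
  --set-string deployment.env.THREADS=16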
