Comments (4)
Hi @jars101 ,
- Ollama does not support
max_loaded_maps
. - You may run the command below to enable
num_parallel
setting:export OLLAMA_NUM_PARALLEL=2 ./ollama serve
from bigdl.
Thank you @sgwhat , most recent versions of ollama do support both OLLAMA_MAX_LOADED_MAPS and OLLAMA_NUM_PARALELL for linux and windows. Running ollama through cuda(ipex-llm) does not seem to keep the settings since for every request on same model it reloads the model into memory. This behaviour on ollama for windows (standalone) does not occur.
llm-cpp snippet log:
base) C:\Users\Admin>conda activate llm-cpp
(llm-cpp) C:\Users\Admin>ollama serve
2024/06/05 04:57:07 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\Users\Admin\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-05T04:57:07.321-07:00 level=INFO source=images.go:704 msg="total blobs: 78"
time=2024-06-05T04:57:07.351-07:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-06-05T04:57:07.361-07:00 level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
from bigdl.
My bad, OLLAMA_NUM_PARALELL does work but OLLAMA_MAX_LOADED_MAPS does not. I went ahead and deployed a new installation of llm-cpp+ollama and I see that now i can make use of both variables. Also, ollama ps is available as well. The only problem I see when setting OLLAMA_KEEP_ALIVE to 600 seconds for instance is the following error:
INFO [print_timings] total time = 38167.02 ms | slot_id=0 t_prompt_processing=2029.651 t_token_generation=36137.372 t_total=38167.023 task_id=3 tid="3404" timestamp=1717656974 [GIN] 2024/06/05 - 23:56:14 | 200 | 47.9829961s | 10.240.0.1 | POST "/api/chat" Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:16685, func:operator() SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code! in function ggml_backend_sycl_buffer_set_tensor at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:16685 GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error" [GIN] 2024/06/05 - 23:58:18 | 200 | 4.1697536s | 10.240.0.1 | POST "/api/chat" Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17384, func:operator() SYCL error: CHECK_TRY_ERROR(g_syclStreams[sycl_ctx->device][0]->memcpy( (char *)tensor->data + offset, data, size).wait()): Meet error in this line code! in function ggml_backend_sycl_set_tensor_async at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17384 GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
Further removing OLLAMA_KEEP_ALIVE and letting it be default of 5miunutes, im observiing the same issue.
from bigdl.
Could you please share the output of pip list
from your environment and also your GPU model? Additionally, it would be helpful for us to resolve the issue if you could provide more information from the Ollama Server
side.
from bigdl.
Related Issues (20)
- intel gpu and ollama error
- Qwen-7B-Chat on Xeon+ARC770 HOT 2
- 请适配GLM-4-9B模型
- vllm-cpu bug - Qwen2Attention' object has no attribute 'kv_scale' HOT 2
- starcoder2 use times out HOT 4
- LLaVA import error on CPU HOT 3
- Chatglm3 model convert to sym_int4 failed HOT 2
- Error: Failed to load the llama dynamic library. Segmentation fault HOT 6
- Add support for StableLM-2-12B on GPUs
- [Windows] Qwen1.5-7B 8K支持 HOT 4
- "can NOT allocate memory block with size larger than 4GB" on Arc A770 GPU when inference HOT 1
- [Windows] Qwen1.5-7B 性能优化
- Is there a way to run ollama with IPEX-LLM on CPU HOT 1
- ubuntu 22.04 MTL 165h benchmark Aborted (core dumped) HOT 2
- IPEX-LLM with Langchain-chatchat runs into httpcore.RemoteProtocolError in MTL with iGPU HOT 2
- IPEX-LLM(llama.cpp) met core dump when run Qwen-7B-Q4_K_M.gguf on Intel ARC770 HOT 4
- Quantized model loading method expects the model should be locally available. HOT 2
- Ollama Linux seg fault with GPU on Ubuntu 22.04 HOT 3
- Ollama on Windows not working HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bigdl.