
Comments (4)

sgwhat commented on June 22, 2024

Hi @jars101 ,

  1. Ollama does not support OLLAMA_MAX_LOADED_MODELS.
  2. You may run the commands below to enable the num_parallel setting (a Windows cmd equivalent is sketched after the list):
    export OLLAMA_NUM_PARALLEL=2
    ./ollama serve
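
If you are on Windows, the cmd equivalent would be along these lines (PowerShell uses the $env: syntax instead):

    rem Windows cmd equivalent of the export above
    set OLLAMA_NUM_PARALLEL=2
    ollama serve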


jars101 commented on June 22, 2024

Thank you @sgwhat. The most recent versions of Ollama do support both OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL on Linux and Windows. Running Ollama through ipex-llm does not seem to keep the settings, since for every request on the same model it reloads the model into memory. This behaviour does not occur with the standalone Ollama for Windows.

llm-cpp snippet log:

(base) C:\Users\Admin>conda activate llm-cpp

(llm-cpp) C:\Users\Admin>ollama serve
2024/06/05 04:57:07 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\Users\Admin\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-05T04:57:07.321-07:00 level=INFO source=images.go:704 msg="total blobs: 78"
time=2024-06-05T04:57:07.351-07:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-06-05T04:57:07.361-07:00 level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
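
A minimal way to check whether the model actually stays resident between requests, assuming a build recent enough to expose GET /api/ps (the endpoint behind ollama ps), would be something like the following; llama3 is just a placeholder for whichever model is being tested, and the quoting is POSIX-shell style:

    # first request loads the model
    curl http://localhost:11434/api/chat -d '{"model": "llama3", "messages": [{"role": "user", "content": "hi"}]}'
    # check what is still resident; the model should be listed here
    curl http://localhost:11434/api/ps
    # second request on the same model; if the server log shows the weights being
    # loaded again, the num_parallel / keep-alive settings are not being honoured
    curl http://localhost:11434/api/chat -d '{"model": "llama3", "messages": [{"role": "user", "content": "hi again"}]}'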


jars101 commented on June 22, 2024

My bad, OLLAMA_NUM_PARALLEL does work but OLLAMA_MAX_LOADED_MODELS does not. I went ahead and deployed a fresh installation of llm-cpp + Ollama, and I can now make use of both variables; ollama ps is available as well. The only problem I see, when setting OLLAMA_KEEP_ALIVE to 600 seconds for instance, is the following error:

INFO [print_timings] total time = 38167.02 ms | slot_id=0 t_prompt_processing=2029.651 t_token_generation=36137.372 t_total=38167.023 task_id=3 tid="3404" timestamp=1717656974
[GIN] 2024/06/05 - 23:56:14 | 200 | 47.9829961s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:16685, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:16685
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[GIN] 2024/06/05 - 23:58:18 | 200 | 4.1697536s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17384, func:operator()
SYCL error: CHECK_TRY_ERROR(g_syclStreams[sycl_ctx->device][0]->memcpy( (char *)tensor->data + offset, data, size).wait()): Meet error in this line code!
  in function ggml_backend_sycl_set_tensor_async at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17384
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

Even after removing OLLAMA_KEEP_ALIVE and letting it fall back to the default of 5 minutes, I'm observing the same issue.
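
For what it's worth, keep-alive can also be requested per call in the API body, which might help rule out the environment variable simply not being picked up; a minimal sketch, again with llama3 as a placeholder model:

    # ask the server to keep this model resident for 10 minutes after the call
    curl http://localhost:11434/api/chat -d '{
      "model": "llama3",
      "keep_alive": "10m",
      "messages": [{"role": "user", "content": "hi"}]
    }'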


sgwhat commented on June 22, 2024

Could you please share the output of pip list from your environment and also your GPU model? Additionally, it would help us resolve the issue if you could provide more information from the Ollama server side.
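
For reference, something along these lines, run from the llm-cpp environment, should capture most of that (sycl-ls assumes the oneAPI runtime is on the PATH):

    conda activate llm-cpp
    pip list
    # list the devices visible to the SYCL runtime
    sycl-ls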

