
Comments (7)

Smocvin avatar Smocvin commented on July 19, 2024

An update: the LLM reloads into VRAM on every call only when using a third-party embedding provider (I use Ollama snow-flake-arctic-22m). There are no issues with the built-in AnythingLLM embedder.

from anything-llm.

timothycarambat avatar timothycarambat commented on July 19, 2024

@Smocvin Are you using Ollama for your embedder or LLM or both? The issue only mentions you using Ollama as your LLM

from anything-llm.

Smocvin avatar Smocvin commented on July 19, 2024

When I use LLM -> Ollama and embedder "snow-flake-arctic-22m" -> Ollama = the LLM keeps unloading and reloading in VRAM
When I use LLM -> Ollama and embedder Original AnythingLLM = no issues with the LLM unloading and reloading in VRAM

from anything-llm.

timothycarambat avatar timothycarambat commented on July 19, 2024

It is related to this issue and would be solved by this PR:

  • Ollama model unloads after 5 minutes #1585

The loading into VRAM is because we mlock it for future call performance. The timeout will resolve that. However, if you use Ollama for both the embedder and the LLM and the system cannot hold both models in VRAM, it will kick one out to make room for the one it needs.
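For context, the timeout behavior discussed in the linked PR corresponds to Ollama's keep_alive request field, which controls how long a model stays resident in VRAM after a call (the default is five minutes, matching the "unloads after 5 minutes" issue title). A minimal sketch of a request payload; the model name is illustrative, not from this thread:

```python
import json

# Sketch: Ollama's /api/generate accepts a keep_alive field controlling how
# long the model stays loaded after the request finishes.
# "5m" is the default; -1 keeps the model loaded indefinitely.
payload = {
    "model": "llama3",          # illustrative model name
    "prompt": "Why is the sky blue?",
    "keep_alive": -1,           # or e.g. "10m" for a ten-minute window
}

# This JSON body would be POSTed to http://localhost:11434/api/generate
body = json.dumps(payload)
```

Setting keep_alive per request (or the OLLAMA_KEEP_ALIVE environment variable server-wide) only delays eviction; it does not help when two models must be resident at once, which is the situation described above.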

That is what is going on here - the linked PR will fix it, so moving the discussion there.

from anything-llm.

Smocvin avatar Smocvin commented on July 19, 2024

I see the problem. The issue is not with AnythingLLM; it's with Ollama. Ollama cannot hold two models simultaneously, even though I am using small models that should fit in the VRAM.

from anything-llm.

Smocvin avatar Smocvin commented on July 19, 2024

Solution to Ollama VRAM Issue
I found the solution to the issue:

  1. Open the file /etc/systemd/system/ollama.service.
  2. In the [Service] section, add the following line:

     Environment="OLLAMA_MAX_LOADED_MODELS=3"

  3. Save the file and close it.
  4. Reload systemd and restart Ollama:

     systemctl daemon-reload
     systemctl restart ollama

This configuration change allows Ollama to keep up to three models loaded simultaneously, resolving the VRAM reloading issue.
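On systems where the ollama.service unit file is managed by the package, an equivalent approach is a systemd drop-in override, so the setting survives package upgrades. A sketch, assuming the standard override location (created automatically by `systemctl edit ollama`):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=3"
```

After writing the override, the same `systemctl daemon-reload` and `systemctl restart ollama` steps apply. Note that raising the limit only helps if the models actually fit in VRAM together; otherwise Ollama will still evict one.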

from anything-llm.

timothycarambat avatar timothycarambat commented on July 19, 2024

@Smocvin TIL! I did not know that was even an ENV key for ollama 😆

from anything-llm.
