
Comments (31)

mokahless avatar mokahless commented on August 16, 2024 17

Ubuntu 22, using the 30B model.

While a significant amount of this issue appears to be CPU processing, I am watching resource usage and notice that every time it finishes an answer it unloads the model from RAM, so for every question I ask it has to read the entire model from disk into RAM again. On my system this wastes about 15 seconds per question. This portion also seems to scale with the threads you give it: the same attempt with a quarter of the threads took ~25 seconds to move the model from disk to RAM.

Seems to me this could be sped up by keeping the model in RAM? Is that possible, or is there something more complex going on that can't be improved upon?
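For illustration, a minimal sketch of the "load once, answer many" idea, assuming the llama-cpp-python bindings and a hypothetical weights path; Serge itself currently shells out to the llama binary per request, so this is not how it works today:

# Sketch only: assumes the llama-cpp-python package and a ggml model at the
# hypothetical path below. The point is that the weights are read from disk
# once and stay resident in RAM across questions.
from llama_cpp import Llama

llm = Llama(model_path="./weights/ggml-model.bin", n_threads=8)  # slow: disk -> RAM, once

def ask(question: str) -> str:
    prompt = f"### Instruction:\n{question}\n### Response:\n"
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]

print(ask("What is llama.cpp?"))   # no reload
print(ask("And what is Serge?"))   # no reload, model already in RAM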


nsarrazin avatar nsarrazin commented on August 16, 2024 11

Yes it's on the list!


gaby avatar gaby commented on August 16, 2024 5

We can take this discussion to Discord


manageseverin avatar manageseverin commented on August 16, 2024 4

On the Alpaca 7B model it takes about a minute for the answer to start to appear. Win 11, Ryzen 5600G / 16 GB RAM.
I thought it depended on the size of the history it needs to feed back, but no: even the first time (when there is no history) it takes the same amount of time.


magicmars35 avatar magicmars35 commented on August 16, 2024 3

@nsarrazin Hi Nathan, do you think it's possible to load the model into RAM and keep it there as long as user queries are being submitted?


y12studio avatar y12studio commented on August 16, 2024 2

@manageseverin

Try this docs URL: Serge Swagger UI at http://localhost:8008/api/docs

[Screenshot: the Serge Swagger UI page]


cyberius0 avatar cyberius0 commented on August 16, 2024 2

Very easy to use, and it works. But yes, it's very slow for me too (30B model).
It takes several minutes until words begin to appear after I enter a prompt, but then the words come out one right after another.

edit: Looking at RAM consumption, it seems the model is indeed unloaded after every response.
System: Windows 10 64-bit, Intel Core i7 8700K @ 3.70GHz, 32.0 GB Dual-Channel DDR4 @ 1600MHz


alph4b3th avatar alph4b3th commented on August 16, 2024 2

I installed it outside of Docker and got the same results already mentioned above. What is going on? I've seen people running Alpaca 7B where it loads and responds in seconds, yet on my machine, which is a powerful server, even 7B is extremely slow while consuming 6 cores (I upgraded from an Intel Xeon to an AMD EPYC, which reduced the response time to 8 minutes and the load time to 18 minutes).


voarsh2 avatar voarsh2 commented on August 16, 2024 1

@voarsh2 you mentioned the default threads are 4. Where is this located? Is there a way to change it?

On the homepage there's "model settings", where you can change the number of threads.


johncadengo avatar johncadengo commented on August 16, 2024 1

I see, thanks @voarsh2. Let me know if you figure out a way to make it more performant on your CPUs since I have a few servers w/ similar CPUs as the ones you mentioned (Xeon E5-2600 v2 series).


johncadengo avatar johncadengo commented on August 16, 2024 1

Looks like the latest PR incorporated updates for the new change to llama.cpp: #118

I'll try it out today and let you know if it helps @voarsh2


alph4b3th avatar alph4b3th commented on August 16, 2024 1

could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen.

4 threads, using 13B model took about 6 minutes to show any text. Excluding the initial read from disk as I had sent a chat before (but it still loads/unloads lots of memory, but not reading from disk......)

About 2 minutes to print out this incomplete text: "Bitcoins are digital currency that can be used as payment online or offline, just like cash and credit cards today. Bitcoins use peer-to-peer technology to operate with no central authority; managing transactions and the issuing of bitcoins is carried out collectively by the network.Bit"

It took 16 minutes here with 6 threads to generate this text: "Surely, Bitcoins work through cryptography and blockchain technology that enables them to be transferred from one user's wallet to another without any middleman or central authority involved. It is an open-source software which can run on anyone's computer hardware with a high degree of security as it uses peer-to-peer networking, making the transactions transparent and verifiable for all users in real time through distributed ledger technology (DLT)."


alph4b3th avatar alph4b3th commented on August 16, 2024

Hey, it's really slow! AMD EPYC, 16 GB RAM, and a lot of delay: more than a minute to load the model into RAM (from SSD).


alph4b3th avatar alph4b3th commented on August 16, 2024

WTF? I have a powerful server! How heavy is this?


futurepr0n avatar futurepr0n commented on August 16, 2024

I also have fairly beefy specs. I find it slow to run, but I'm guessing that's due to the issue pointed out: offloading the model. Keeping it persistent would work best, though I can see a reason for not wanting that; maybe it's tricky to manage multiple sessions if you implement it right now? I don't know. Maybe we could have a superuser option to keep it loaded full time if the RAM is available?


alph4b3th avatar alph4b3th commented on August 16, 2024

The problem is not Serge, it's in llama.cpp. Something doesn't work well with Docker: I saw 'npx dalai serve' run and the model responded in 3-5 seconds, while with Docker here on my server it took between 18 and 60 minutes to initialize and load into RAM, and 8 minutes for the model to finish its relatively small response.


Mattssn avatar Mattssn commented on August 16, 2024


I am playing with the 30B model and seeing the same thing. I am running this in Docker on a pretty beefy box, but getting pretty slow response times. Maybe there could be a variable to keep the model alive for, say, 10 minutes and then shut it down after inactivity; tbh, my server never really passes about 14 GB of memory usage, so an option to keep the model loaded as long as the Docker container is running might be cool too. I know you said you are working on it, just wanted to give my feedback :)
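A rough sketch of that keep-alive-with-idle-timeout idea, purely illustrative; load_model/unload_model are hypothetical stand-ins for whatever actually reads and frees the weights:

import asyncio
import time

IDLE_TIMEOUT = 10 * 60  # seconds: unload after 10 minutes without a request

class ModelKeeper:
    """Holds a loaded model and frees it after a period of inactivity."""

    def __init__(self, load_model, unload_model):
        self._load = load_model      # hypothetical: reads weights into RAM
        self._unload = unload_model  # hypothetical: frees them again
        self._model = None
        self._last_used = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            if self._model is None:
                self._model = self._load()  # pay the disk cost only when needed
            self._last_used = time.monotonic()
            return self._model

    async def reaper(self):
        # Run as a background task alongside the API server.
        while True:
            await asyncio.sleep(30)
            async with self._lock:
                idle = time.monotonic() - self._last_used
                if self._model is not None and idle > IDLE_TIMEOUT:
                    self._unload(self._model)
                    self._model = None  # RAM released until the next chat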


voarsh2 avatar voarsh2 commented on August 16, 2024

Quite unusable to constantly need to load/unload 4 GB in RAM. Not everyone is on an SSD either, so you also have to contend with waiting on disks to load the model into RAM for every chat submission.

I installed it outside of Docker and got the same results already mentioned above. What is going on? I've seen people running Alpaca 7B where it loads and responds in seconds, yet on my machine, which is a powerful server, even 7B is extremely slow while consuming 6 cores (I upgraded from an Intel Xeon to an AMD EPYC, which reduced the response time to 8 minutes and the load time to 18 minutes).

Similar experience.
I've tried 7B and 30B. I've fed it 50 GB of RAM and 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)), and also tried a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) box; it always takes several minutes to load the model back into memory after each response. Realistically, it's a good 5-10 minutes between each simple response, and even longer for it to type out multi-line responses. I have a Ryzen system (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) that doesn't meet the RAM requirement (a Kubernetes node with existing workloads taking up the available RAM), so I won't bother even trying; the dual E5-2697 v2 system is on par with the Ryzen system anyway. I am not sure why the default is 4 threads; increasing the threads for the model doesn't seem to make responses any faster, but it does grind the CPU to starvation.

I have an RTX 260 GPU that I can't use for this project... a shame. I would much rather use a GPU for this type of task; even modern-day CPUs struggle with this project.

Given there's no multi-server/worker support, I can't make use of the Kubernetes deployment beyond single-node compute.

--- edit:
I did test on the AMD Ryzen 5 3600X 6-Core; not much faster by any means (2019 CPU vs 2013 CPU, although they are comparable), though it is quicker at printing the text out word by word. 🤷
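On the thread-count point, here is a rough timing sketch (not Serge code) that reuses the arguments Serge already passes to the llama binary, as seen in the generate.py snippet later in this thread; the binary name, weights path and model file are assumptions, and each run also pays the full load-from-disk cost:

import subprocess
import time

# Assumptions: the "llama" binary is on PATH and accepts the same flags Serge
# passes in generate.py; the model path below is hypothetical.
MODEL = "/usr/src/app/weights/7B.bin"
PROMPT = "### Instruction:\nSay hello.\n### Response:\n"

for threads in (4, 8, 16, 32):
    start = time.monotonic()
    subprocess.run(
        ["llama", "--model", MODEL, "--prompt", PROMPT,
         "--n_predict", "64", "--threads", str(threads)],
        capture_output=True, check=True,
    )
    # Each run includes the model load from disk, i.e. exactly the
    # load/generate cycle this issue is about.
    print(f"{threads} threads: {time.monotonic() - start:.1f}s")

If the times barely change past a certain thread count, the bottleneck is likely memory or disk bandwidth rather than CPU.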


alph4b3th avatar alph4b3th commented on August 16, 2024

I discovered that the problem is in how the new version of llama.cpp is compiled: the parameters passed to the compiler are making the software slower. An older version, like the one used by https://github.com/nomic-ai/gpt4all, works faster because it doesn't have these optimizations.


johncadengo avatar johncadengo commented on August 16, 2024

@voarsh2 you mentioned the default threads are 4. Where is this located? Is there a way to change it?


voarsh2 avatar voarsh2 commented on August 16, 2024

I see, thanks @voarsh2. Let me know if you figure out a way to make it more performant on your CPUs since I have a few servers w/ similar CPUs as the ones you mentioned (Xeon E5-2600 v2 series).

Haha, sure, hopefully the maintainer can work it out from this issue. I'm genuinely curious how the maintainer got the response speed shown in his gif/demo... It's not like anyone in this issue is trying to run it on a Raspberry Pi, lol. People are using EPYC and Ryzen CPUs.

I am going to try dalai next - maybe I'll have better luck with a different codebase.

I discovered that the problem is in how the new version of llama.cpp is compiled: the parameters passed to the compiler are making the software slower. An older version, like the one used by https://github.com/nomic-ai/gpt4all, works faster because it doesn't have these optimizations.

Could you help the maintainer by providing some specifics on these optimisations, along with your evidence?


alph4b3th avatar alph4b3th commented on August 16, 2024

You can try reading the thread.


psociety avatar psociety commented on August 16, 2024

Maybe this helps? I asked ChatGPT to refactor the code so the subprocess remains open rather than being started on each request.

api/src/serge/utils/generate.py:

import asyncio
import logging

from serge.models.chat import Chat, ChatParameters

logger = logging.getLogger(__name__)

async def generate(
    prompt: str,
    params: ChatParameters,
    procLlama: asyncio.subprocess.Process,
    CHUNK_SIZE: int
):
    await params.fetch_all_links()

    args = (
        "llama",
        "--model",
        "/usr/src/app/weights/" + params.model + ".bin",
        "--prompt",
        prompt,
        "--n_predict",
        str(params.max_length),
        "--temp",
        str(params.temperature),
        "--top_k",
        str(params.top_k),
        "--top_p",
        str(params.top_p),
        "--repeat_last_n",
        str(params.repeat_last_n),
        "--repeat_penalty",
        str(params.repeat_penalty),
        "--ctx_size",
        str(params.context_window),
        "--threads",
        str(params.n_threads),
        "--n_parts",
        "1",
    )

    logger.debug(f"Calling LLaMa with arguments", args)
    
    procLlama.stdin.write('\n'.join(args).encode() + b'\n')
    await procLlama.stdin.drain()
    
    while True:
        chunk = await procLlama.stdout.read(CHUNK_SIZE)

        if not chunk:
            # EOF: the process has stopped producing output.
            return_code = await procLlama.wait()

            if return_code != 0:
                error_output = (await procLlama.stderr.read()).decode("utf-8")
                logger.error(error_output)
                raise ValueError(f"RETURN CODE {return_code}\n\n" + error_output)

            return  # clean exit: stop yielding instead of looping on empty chunks

        try:
            chunk = chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip chunks that fail to decode (split multi-byte character)

        yield chunk


async def get_full_prompt_from_chat(chat: Chat, simple_prompt: str, procLlama: asyncio.subprocess.Process):
    await chat.fetch_all_links()
    
    await chat.parameters.fetch_link(ChatParameters.init_prompt)

    prompt = chat.parameters.init_prompt + "\n\n"
    
    if chat.questions is not None:
        for question in chat.questions:
            if question.error is not None:  # skip errored-out prompts
                continue
            prompt += "### Instruction:\n" + question.question + "\n"
            prompt += "### Response:\n" + question.answer + "\n"

    prompt += "### Instruction:\n" + simple_prompt + "\n"
    prompt += "### Response:\n"

    procLlama.stdin.write(prompt.encode() + b'\n')
    await procLlama.stdin.drain()

    return prompt


async def main():
    CHUNK_SIZE = 4
    # Start the llama process once and reuse it for every prompt below.
    procLlama = await asyncio.create_subprocess_exec(
        "llama",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    prompt = "hello"
    params = ChatParameters()

    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    prompt = "world"
    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    procLlama.stdin.write(b"quit\n")
    await procLlama.stdin.drain()
    await procLlama.wait()

if __name__ == "__main__":
    asyncio.run(main())

I don't even know if this is the issue, because I know nothing about Python and I just surfed the code.


johncadengo avatar johncadengo commented on August 16, 2024

There's a new PR that was just merged in: ggerganov/llama.cpp#613

I was able to compile this on my servers (with Xeon E5-2600 v2 series CPUs) and have it work quite well. Is there any way we can get the latest version of llama.cpp in Serge? Might solve all the performance issues. Just have to make sure you use the script migrate-ggml-2023-03-30-pr613.py to migrate the models to work with the new file format.

More context here: ggerganov/llama.cpp#638 (comment)
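If it helps anyone scripting the migration, here is a hedged sketch for converting every model in the weights folder; I'm assuming the script simply takes an input path and an output path (double-check its usage message) and that the weights live in Serge's default directory:

import pathlib
import subprocess

WEIGHTS = pathlib.Path("/usr/src/app/weights")  # Serge's default weights dir

for old in sorted(WEIGHTS.glob("*.bin")):
    if old.name.endswith("-pr613.bin"):
        continue  # already migrated on a previous run
    new = old.with_name(old.stem + "-pr613.bin")
    # Assumed invocation: <input> <output>. Verify against the script's
    # own usage/help output before running it for real.
    subprocess.run(
        ["python3", "migrate-ggml-2023-03-30-pr613.py", str(old), str(new)],
        check=True,
    )
    print(f"migrated {old.name} -> {new.name}")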


johncadengo avatar johncadengo commented on August 16, 2024

I'm pleased to report that as of the latest commit (cf84d0c) the performance is much better, at least on my CPUs, which were impossibly slow before.

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller model and more performant. Getting great results with this test.

By the way, is there a way to get the default threads to be set according to the number of threads available?
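On defaulting the thread count, a small sketch of how it could be derived from what the host actually exposes (standard library only); n_threads is just the name Serge's ChatParameters already uses:

import os

def default_threads() -> int:
    # Prefer the CPUs this process may actually use (respects affinity and,
    # on Linux, container limits); fall back to the raw core count.
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:  # not available on Windows/macOS
        available = os.cpu_count() or 4
    # Leave one core free for the API server itself.
    return max(1, available - 1)

print(default_threads())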


voarsh2 avatar voarsh2 commented on August 16, 2024

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller model and more performant. Getting great results with this test.

Hmmm. GPT4All is MUCH faster... 👁️
It's the same size as 7B, though.

Still rough around the edges. 🗡️
Hoping it gets faster or gains GPU support. Wish consumer GPUs had more VRAM, lol.

But it does seem like cf84d0c and other upstream changes have helped a bit. Once it's past the RAM loading it's somewhat bearable (more performance tuning can't hurt); the major thing for me is still the loading/unloading of the model in memory for every submission.


alph4b3th avatar alph4b3th commented on August 16, 2024

I'm pleased to report that as of the latest commit (cf84d0c) the performance is much better, at least on my CPUs, which were impossibly slow before.

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller model and more performant. Getting great results with this test.

By the way, is there a way to get the default threads to be set according to the number of threads available?

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16 GB of RAM, and it seems to respond after 3 minutes.


voarsh2 avatar voarsh2 commented on August 16, 2024

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16 GB of RAM, and it seems to respond after 3 minutes.

"...giving it 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)) and on a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) - I have a Ryzen 5x (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) system ..."

"I have an RTX 260 GPU that I can't use for this project... a shame."


alph4b3th avatar alph4b3th commented on August 16, 2024

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16 GB of RAM, and it seems to respond after 3 minutes.

"...giving it 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)) and on a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) - I have a Ryzen 5x (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) system ..."

"I have an RTX 260 GPU that I can't use for this project... a shame."

From what I understand, your machine is not virtualized, right? Well, how long does the model take to answer this question: "could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen." Could you test it for me?


voarsh2 avatar voarsh2 commented on August 16, 2024

could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen.

4 threads, using 13B model took about 6 minutes to show any text. Excluding the initial read from disk as I had sent a chat before (but it still loads/unloads lots of memory, but not reading from disk......)

About 2 minutes to print out this incomplete text:
"Bitcoins are digital currency that can be used as payment online or offline, just like cash and credit cards today. Bitcoins use peer-to-peer technology to operate with no central authority; managing transactions and the issuing of bitcoins is carried out collectively by the network.Bit"

7B-Native took about a minute to start printing and finished after two minutes or so.
"Bitcoin is an innovative digital currency that uses cryptography to secure and verify transactions, creating what is known as a blockchain distributed ledger system. It operates through decentralized networks of computers which are constantly verifying the chain of past transactions in order to maintain accuracy and security. The network also continuously creates new blocks or “coins” when users send funds from one address to another. This allows for digital payments without any middleman, making it a truly peer-to-peer system with no central authority controlling its operations."

Ran again and it took about 3 minutes to start printing. Another 2 mins to finish.


voarsh2 avatar voarsh2 commented on August 16, 2024

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16 GB of RAM, and it seems to respond after 3 minutes.

Just noticed the "VPS" bit - I don't know if it's dedicated, that they dedicate cores or not... check the fine print, they may be throttling you in some way. This type of compute requires no trickery from the provider to get full speed needed. Also, IOPS. Naturally I am expecting you to have faster RAM than me, and I am on HDD, and even loading the model is faster than you with 2013 RAM (DDR3) and HDD...... (even worse, I am on Ceph, which is only realistically giving me 80 MB/s..... at best..... (distributed storage, 1Gbe limited, not native SAS/SATA speed, and a distributed storage system not known for... performance, but reliability and data safety). I wouldn't run this on Cloud compute unless you're paying a pretty buck to ensure you're getting the resources with no visible or invisible "tuning" from the provider (as they often overprovision, naturally)

From what I understand, your machine is not virtualized, right?

I am running: physical machine -> Debian/Proxmox -> VM -> Kubernetes/Docker (so the host has other deployments running on K8s as well), with CephFS-backed storage for the workload (I wouldn't normally use CephFS, but the template used RWX).
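For the storage-bandwidth point above, a quick sanity check of what your disks actually deliver when streaming a model file (the path is a placeholder; drop the page cache first, or use a file you haven't read recently, if you want a cold-read number):

import time

MODEL = "/usr/src/app/weights/7B.bin"  # placeholder path
CHUNK = 16 * 1024 * 1024               # 16 MiB sequential reads

start = time.monotonic()
total = 0
with open(MODEL, "rb") as f:
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        total += len(block)
elapsed = time.monotonic() - start
print(f"read {total / 1e9:.2f} GB in {elapsed:.1f}s "
      f"({total / 1e6 / elapsed:.0f} MB/s)")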

