Comments (3)
It seems that the model always needs to re-evaluate its own previous answer as part of the prompt.
In the following examples my own new prompt was quite short every time, yet the number of tokens
in prompt eval is always the number of tokens in the previous answer plus a few more:
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 157.69 ms / 59 runs ( 2.67 ms per token, 374.15 tokens per second)
llama_print_timings: prompt eval time = 143124.12 ms / 356 tokens ( 402.03 ms per token, 2.49 tokens per second)
llama_print_timings: eval time = 28492.60 ms / 58 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 62822.10 ms / 414 tokens
Inference terminated
Llama.generate: prefix-match hit
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 429.62 ms / 154 runs ( 2.79 ms per token, 358.46 tokens per second)
llama_print_timings: prompt eval time = 8277.50 ms / 88 tokens ( 94.06 ms per token, 10.63 tokens per second)
llama_print_timings: eval time = 75161.81 ms / 153 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 86088.57 ms / 241 tokens
Inference terminated
Llama.generate: prefix-match hit
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 195.41 ms / 68 runs ( 2.87 ms per token, 347.98 tokens per second)
llama_print_timings: prompt eval time = 15788.26 ms / 163 tokens ( 96.86 ms per token, 10.32 tokens per second)
llama_print_timings: eval time = 33127.41 ms / 67 runs ( 494.44 ms per token, 2.02 tokens per second)
llama_print_timings: total time = 50100.81 ms / 230 tokens
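The pattern is visible directly in the numbers above: each run's prompt-eval token count is roughly the previous run's generated token count plus the short new user message. A quick check with the figures taken from the timings:

```python
# Numbers copied from the llama_print_timings output above.
prev_answer_tokens = [59, 154]   # "sample time ... runs" from runs 1 and 2
next_prompt_eval   = [88, 163]   # "prompt eval ... tokens" from runs 2 and 3

# Difference = tokens attributable to the new user message plus template overhead.
for ans, pe in zip(prev_answer_tokens, next_prompt_eval):
    print(pe - ans)  # prints 29, then 9
```

So the entire previous answer is being re-processed on every turn, with only a small remainder coming from the actual new request.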
Shouldn't the model already recognize the tokenization of its previous answer? Or could it be that the applied chat template differs slightly from what the model produced in its answer, so it no longer recognizes it?
from llama-cpp-agent.
maybe related to this?
abetlen/llama-cpp-python#893 (comment)
My guess is that the chat template differs from the one used in the model's response (maybe just a \n or similar) and it therefore does not recognise it anymore.
So every answer has to be processed again when it is re-inserted via the template in a multi-turn conversation.
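This would explain the logs: llama.cpp's "prefix-match hit" only reuses the KV cache up to the first token that differs, so a single extra token near the start of a turn forces everything after it to be re-evaluated. A minimal, self-contained sketch of that prefix-matching logic (the token IDs are made up for illustration; the real cache internals differ):

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached context and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Tokens as the model generated them last turn.
cached = [1, 10, 11, 12, 13, 14, 15]

# The same content re-inserted through a chat template that adds one extra
# token (e.g. a stray "\n") right after the BOS token.
new_prompt = [1, 99, 10, 11, 12, 13, 14, 15, 20, 21]

reused = common_prefix_len(cached, new_prompt)
print(reused)                    # prints 1: only the BOS token is reused
print(len(new_prompt) - reused)  # prints 9: the rest must be re-evaluated
```

Note that the mismatch doesn't have to be large: one inserted token invalidates every cached token after it, which matches the observed "previous answer plus a few more" prompt-eval counts.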
Can't we just apply the template only to new requests and keep a record of the history exactly as the model already produced it?
For the next request we would send that exact history plus the new request wrapped in the template.
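One way to sketch this idea on the client side (this is an illustration, not llama-cpp-agent's actual API, and the template strings are placeholders for whatever format the model expects): keep the transcript byte-for-byte as it was sent and received, and only wrap the new user turn with the template before appending it.

```python
# Hypothetical template for a single new user turn; real models use their
# own specific markers.
USER_TMPL = "<|user|>\n{msg}<|end|>\n<|assistant|>\n"

class VerbatimHistory:
    """Keeps the exact text the model has already seen, so each new prompt
    extends the previous one and the KV-cache prefix always matches."""

    def __init__(self, system_prompt=""):
        self.transcript = system_prompt

    def next_prompt(self, user_msg):
        """Prompt for the next request: old transcript + templated new turn only."""
        return self.transcript + USER_TMPL.format(msg=user_msg)

    def record(self, user_msg, model_output):
        """Store the turn exactly as sent and received, byte for byte."""
        self.transcript = self.next_prompt(user_msg) + model_output

h = VerbatimHistory("You are helpful.\n")
p1 = h.next_prompt("Hi")
h.record("Hi", "Hello!<|end|>\n")
p2 = h.next_prompt("How are you?")
assert p2.startswith(p1)  # old prompt is a strict prefix -> full cache hit
```

Because each new prompt strictly extends the previous one, only the new user message and the template wrapper would need to be evaluated on every turn.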
from llama-cpp-agent.