Comments (3)
It seems that the model always needs to re-evaluate its own previous answer as part of the prompt.
In the following examples my own new prompt was quite short every time, yet the number of tokens
in prompt eval is always the number of tokens in the previous answer plus a few more:
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 157.69 ms / 59 runs ( 2.67 ms per token, 374.15 tokens per second)
llama_print_timings: prompt eval time = 143124.12 ms / 356 tokens ( 402.03 ms per token, 2.49 tokens per second)
llama_print_timings: eval time = 28492.60 ms / 58 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 62822.10 ms / 414 tokens
Inference terminated
Llama.generate: prefix-match hit
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 429.62 ms / 154 runs ( 2.79 ms per token, 358.46 tokens per second)
llama_print_timings: prompt eval time = 8277.50 ms / 88 tokens ( 94.06 ms per token, 10.63 tokens per second)
llama_print_timings: eval time = 75161.81 ms / 153 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 86088.57 ms / 241 tokens
Inference terminated
Llama.generate: prefix-match hit
llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 195.41 ms / 68 runs ( 2.87 ms per token, 347.98 tokens per second)
llama_print_timings: prompt eval time = 15788.26 ms / 163 tokens ( 96.86 ms per token, 10.32 tokens per second)
llama_print_timings: eval time = 33127.41 ms / 67 runs ( 494.44 ms per token, 2.02 tokens per second)
llama_print_timings: total time = 50100.81 ms / 230 tokens
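The pattern is visible directly in the numbers above: each run's prompt-eval token count is roughly the previous run's generated token count plus the short new user message. A quick check with the figures taken from the timings:

```python
# Numbers copied from the llama_print_timings output above.
prev_answer_tokens = [59, 154]   # "sample time ... runs" from runs 1 and 2
next_prompt_eval   = [88, 163]   # "prompt eval ... tokens" from runs 2 and 3

# Difference = tokens attributable to the new user message plus template overhead.
for ans, pe in zip(prev_answer_tokens, next_prompt_eval):
    print(pe - ans)  # prints 29, then 9
```

So the entire previous answer is being re-processed on every turn, with only a small remainder coming from the actual new request.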
Shouldn't the model already recognize the tokenization of its previous answer? Or could it be that the applied chat template differs slightly from what the model produced in its answer, so it no longer recognizes it?
from llama-cpp-agent.
maybe related to this?
abetlen/llama-cpp-python#893 (comment)
My guess is that the chat template differs from the one used in the model's response (maybe just a \n or similar) and it therefore does not recognise it anymore.
So every answer has to be processed again when it is re-inserted via the template in a multi-turn conversation.
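This would explain the logs: llama.cpp's "prefix-match hit" only reuses the KV cache up to the first token that differs, so a single extra token near the start of a turn forces everything after it to be re-evaluated. A minimal, self-contained sketch of that prefix-matching logic (the token IDs are made up for illustration; the real cache internals differ):

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached context and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Tokens as the model generated them last turn.
cached = [1, 10, 11, 12, 13, 14, 15]

# The same content re-inserted through a chat template that adds one extra
# token (e.g. a stray "\n") right after the BOS token.
new_prompt = [1, 99, 10, 11, 12, 13, 14, 15, 20, 21]

reused = common_prefix_len(cached, new_prompt)
print(reused)                    # prints 1: only the BOS token is reused
print(len(new_prompt) - reused)  # prints 9: the rest must be re-evaluated
```

Note that the mismatch doesn't have to be large: one inserted token invalidates every cached token after it, which matches the observed "previous answer plus a few more" prompt-eval counts.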
Can't we just apply the template only to new requests and keep a record of the history exactly as the model already produced it?
For the next request we would send that exact history plus the new request wrapped in the template.
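One way to sketch this idea on the client side (this is an illustration, not llama-cpp-agent's actual API, and the template strings are placeholders for whatever format the model expects): keep the transcript byte-for-byte as it was sent and received, and only wrap the new user turn with the template before appending it.

```python
# Hypothetical template for a single new user turn; real models use their
# own specific markers.
USER_TMPL = "<|user|>\n{msg}<|end|>\n<|assistant|>\n"

class VerbatimHistory:
    """Keeps the exact text the model has already seen, so each new prompt
    extends the previous one and the KV-cache prefix always matches."""

    def __init__(self, system_prompt=""):
        self.transcript = system_prompt

    def next_prompt(self, user_msg):
        """Prompt for the next request: old transcript + templated new turn only."""
        return self.transcript + USER_TMPL.format(msg=user_msg)

    def record(self, user_msg, model_output):
        """Store the turn exactly as sent and received, byte for byte."""
        self.transcript = self.next_prompt(user_msg) + model_output

h = VerbatimHistory("You are helpful.\n")
p1 = h.next_prompt("Hi")
h.record("Hi", "Hello!<|end|>\n")
p2 = h.next_prompt("How are you?")
assert p2.startswith(p1)  # old prompt is a strict prefix -> full cache hit
```

Because each new prompt strictly extends the previous one, only the new user message and the template wrapper would need to be evaluated on every turn.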
from llama-cpp-agent.