Comments (8)
I'm not sure how much grammar-based sampling is affecting the performance, but this seems like a huge impact. Can you send me the generated grammar itself? I just want to make sure I didn't mess up the grammar generation.
Hi, I think this is what you mean
>>> from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import generate_gbnf_grammar_from_pydantic_models
>>> generate_gbnf_grammar_from_pydantic_models([Book])
'root ::= (" "| "\\n") grammar-models\ngrammar-models ::= book\nbook ::= "{" ws "\\"title\\"" ": " string "," ws "\\"author\\"" ": " string "," ws "\\"published_year\\"" ": " number "," ws "\\"keywords\\"" ": " book-keywords "," ws "\\"category\\"" ": " book-category "," ws "\\"summary\\"" ": " string ws "}"\nbook-keywords ::= "[" ws string ("," ws string)* ws "]" \nbook-category ::= "\\"Fiction\\"" | "\\"Non-Fiction\\""'
@this-josh I ran some tests, using the same prompt for generation with grammar and without. My results are that it is about 12 times faster without grammar. I still have to do some additional tests.
With grammar:
llama_print_timings: load time = 307.76 ms
llama_print_timings: sample time = 11026.43 ms / 141 runs ( 78.20 ms per token, 12.79 tokens per second)
llama_print_timings: prompt eval time = 307.34 ms / 248 tokens ( 1.24 ms per token, 806.92 tokens per second)
llama_print_timings: eval time = 5111.64 ms / 140 runs ( 36.51 ms per token, 27.39 tokens per second)
llama_print_timings: total time = 17923.20 ms / 388 tokens
Without grammar:
llama_print_timings: load time = 307.76 ms
llama_print_timings: sample time = 844.00 ms / 138 runs ( 6.12 ms per token, 163.51 tokens per second)
llama_print_timings: prompt eval time = 280.42 ms / 218 tokens ( 1.29 ms per token, 777.39 tokens per second)
llama_print_timings: eval time = 3609.99 ms / 137 runs ( 26.35 ms per token, 37.95 tokens per second)
llama_print_timings: total time = 5686.03 ms / 355 tokens
This is my full test code:
from enum import Enum
from llama_cpp import Llama
from pydantic import BaseModel, Field
from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import \
    generate_gbnf_grammar_and_documentation
from llama_cpp_agent.llm_prompt_template import PromptTemplate
from llama_cpp_agent.llm_settings import LlamaLLMGenerationSettings
from llama_cpp_agent.messages_formatter import MessagesFormatterType, get_predefined_messages_formatter
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent
settings = LlamaLLMGenerationSettings(stream=False)
main_model = Llama(
"../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
n_gpu_layers=-1,
use_mlock=False,
embedding=False,
n_threads=12,
n_batch=2048,
n_ctx=2048,
last_n_tokens_size=1024,
verbose=True,
seed=42,
stream=True
)
# Example enum for our output model
class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"
# Example output model
class Book(BaseModel):
"""
Represents an entry about a book.
"""
title: str = Field(..., description="Title of the book.")
author: str = Field(..., description="Author of the book.")
published_year: int = Field(..., description="Publishing year of the book.")
keywords: list[str] = Field(..., description="A list of keywords.")
category: Category = Field(..., description="Category of the book.")
summary: str = Field(..., description="Summary of the book.")
structured_output_agent = StructuredOutputAgent(main_model, llama_generation_settings=settings,
                                                messages_formatter_type=MessagesFormatterType.LLAMA_3,
                                                debug_output=False)
text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
print(structured_output_agent.create_object(Book, text))
grammar, documentation = generate_gbnf_grammar_and_documentation(
    [Book],
    model_prefix="Response Model",
    fields_prefix="Response Model Field",
)
sys_prompt_template = PromptTemplate.from_string(
"You are an advanced AI agent. You are tasked to assist the user by creating structured output in JSON format.\n\n{documentation}"
)
creation_prompt_template = PromptTemplate.from_string(
"Create an JSON response based on the following input.\n\nInput:\n\n{user_input}"
)
sys_prompt = sys_prompt_template.generate_prompt({"documentation": documentation})
msg = creation_prompt_template.generate_prompt({"user_input": text})
sys_msg = {"role": "system", "content": sys_prompt}
user_msg = {"role": "user", "content": msg}
msg_list = [sys_msg, user_msg]
formatter = get_predefined_messages_formatter(MessagesFormatterType.LLAMA_3)
prompt, role = formatter.format_messages(msg_list, "assistant")
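# Adapt the llama-cpp-agent generation settings dict to create_completion's keyword names.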
settings_dic = settings.as_dict()
settings_dic["stop"] = settings_dic["stop_sequences"]
settings_dic.pop("stop_sequences")
settings_dic.pop("print_output")
main_model.reset()
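# Same prompt as the structured output agent above, but without grammar-constrained sampling.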
print(main_model.create_completion(prompt, **settings_dic))
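For completeness, the same prompt can also be run through plain llama-cpp-python with the generated grammar attached, which isolates the cost of grammar-constrained sampling from the rest of the agent machinery. This is only a sketch on top of the code above; it assumes the installed llama-cpp-python version exposes LlamaGrammar and the grammar keyword of create_completion.

from llama_cpp import LlamaGrammar

# Compile the GBNF text produced by generate_gbnf_grammar_and_documentation above.
llama_grammar = LlamaGrammar.from_string(grammar)

# Same prompt and sampling settings as the no-grammar run, with only the grammar added.
main_model.reset()
print(main_model.create_completion(prompt, grammar=llama_grammar, **settings_dic))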
After trying out different forms of grammar, I can say it always takes about 12.5 ms per token.
Thanks for looking into this. Unfortunately, I'm a little confused by your analysis and do not see this issue as closed. I've updated to the latest version of the package and am now using what I believe is the same model, Meta-Llama-3-8B.Q5_K_M.gguf.
You show sample times above of around 75 ms and 6 ms per token with and without grammar respectively, and I can reproduce this roughly 10x ratio using your code. But in the following message you say about 12.5 ms per token; I'm not sure where this figure comes from.
So now we have three sample-time values, none of which is 12.5 ms (which would be a comparatively high figure):
- structured_output_agent.create_object ~75 ms
- main_model.create_completion ~7 ms
- main_model(text) ~0.7 ms
So my question remains: why is the create_object sample time 100x slower than just providing the text to the model?
Perhaps you could expand on your tests.
Sorry, I meant that I tried different versions of the grammar, optimized for performance, and the result is always around 12.5 tokens per second, not 12.5 ms per token. The reason is that grammar-based sampling takes a lot of time. The reason your main_model(text) call is so fast is that it only generates 16 tokens and uses only the text as the prompt. My code (main_model.create_completion) uses the same prompt as the structured output agent, but without grammar sampling, and generates around 140 tokens. Generation slows down as the output gets longer. Without grammar I get 163.51 tokens per second.
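To make the bare call comparable, the prompt and the token budget have to match. As a rough sketch (not from the tests above, and assuming llama-cpp-python's default max_tokens of 16 is what caps the bare call), something like:

# Bare model call with the same formatted prompt and a token budget comparable
# to the ~140-token structured output (the default max_tokens of 16 is far smaller).
main_model.reset()
print(main_model(prompt, max_tokens=150))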
Ah understood.
So is it expected that grammar can increase the sampling time per token by an order of magnitude?
I would say it depends on the complexity of the grammar.
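One way to quantify that would be a small comparison like the sketch below; the time_generation helper and the trivial grammar are only illustrative (not part of llama-cpp-agent), and the Book grammar from the test code above could be passed to the same helper.

import time

from llama_cpp import Llama, LlamaGrammar

# Illustrative helper: measure effective tokens per second for a given grammar (or none).
def time_generation(model, prompt, grammar=None, max_tokens=128):
    model.reset()
    start = time.perf_counter()
    out = model.create_completion(prompt, max_tokens=max_tokens, grammar=grammar)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

model = Llama("../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
              n_gpu_layers=-1, n_ctx=2048, seed=42)

# A deliberately trivial grammar for comparison against the much larger Book grammar.
simple_grammar = LlamaGrammar.from_string('root ::= [a-zA-Z0-9 .,]+')

prompt = "Summarize the Feynman Lectures on Physics in one paragraph."
print("no grammar     :", time_generation(model, prompt), "tokens/s")
print("trivial grammar:", time_generation(model, prompt, simple_grammar), "tokens/s")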