Comments (8)
I'm not sure how much grammar-based sampling is affecting the performance, but this seems like a huge impact. Can you send me the generated grammar itself? I just want to make sure I didn't mess up the grammar generation.
Hi, I think this is what you mean
>>> from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import generate_gbnf_grammar_from_pydantic_models
>>> generate_gbnf_grammar_from_pydantic_models([Book])
'root ::= (" "| "\\n") grammar-models\ngrammar-models ::= book\nbook ::= "{" ws "\\"title\\"" ": " string "," ws "\\"author\\"" ": " string "," ws "\\"published_year\\"" ": " number "," ws "\\"keywords\\"" ": " book-keywords "," ws "\\"category\\"" ": " book-category "," ws "\\"summary\\"" ": " string ws "}"\nbook-keywords ::= "[" ws string ("," ws string)* ws "]" \nbook-category ::= "\\"Fiction\\"" | "\\"Non-Fiction\\""'
@this-josh I ran some tests, using the same prompt for generation with grammar and without. My results are that it is about 12 times faster without grammar. I still have to do some additional tests.
With grammar:
llama_print_timings: load time = 307.76 ms
llama_print_timings: sample time = 11026.43 ms / 141 runs ( 78.20 ms per token, 12.79 tokens per second)
llama_print_timings: prompt eval time = 307.34 ms / 248 tokens ( 1.24 ms per token, 806.92 tokens per second)
llama_print_timings: eval time = 5111.64 ms / 140 runs ( 36.51 ms per token, 27.39 tokens per second)
llama_print_timings: total time = 17923.20 ms / 388 tokens
Without grammar:
llama_print_timings: load time = 307.76 ms
llama_print_timings: sample time = 844.00 ms / 138 runs ( 6.12 ms per token, 163.51 tokens per second)
llama_print_timings: prompt eval time = 280.42 ms / 218 tokens ( 1.29 ms per token, 777.39 tokens per second)
llama_print_timings: eval time = 3609.99 ms / 137 runs ( 26.35 ms per token, 37.95 tokens per second)
llama_print_timings: total time = 5686.03 ms / 355 tokens
This is my full test code:
from enum import Enum
from llama_cpp import Llama
from pydantic import BaseModel, Field
from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import \
    generate_gbnf_grammar_and_documentation
from llama_cpp_agent.llm_prompt_template import PromptTemplate
from llama_cpp_agent.llm_settings import LlamaLLMGenerationSettings
from llama_cpp_agent.messages_formatter import MessagesFormatterType, get_predefined_messages_formatter
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent
settings = LlamaLLMGenerationSettings(stream=False)
main_model = Llama(
"../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
n_gpu_layers=-1,
use_mlock=False,
embedding=False,
n_threads=12,
n_batch=2048,
n_ctx=2048,
last_n_tokens_size=1024,
verbose=True,
seed=42,
stream=True
)
# Example enum for our output model
class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"
# Example output model
class Book(BaseModel):
"""
Represents an entry about a book.
"""
title: str = Field(..., description="Title of the book.")
author: str = Field(..., description="Author of the book.")
published_year: int = Field(..., description="Publishing year of the book.")
keywords: list[str] = Field(..., description="A list of keywords.")
category: Category = Field(..., description="Category of the book.")
summary: str = Field(..., description="Summary of the book.")
structured_output_agent = StructuredOutputAgent(main_model, llama_generation_settings=settings,
                                                messages_formatter_type=MessagesFormatterType.LLAMA_3,
                                                debug_output=False)
text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
print(structured_output_agent.create_object(Book, text))
grammar, documentation = generate_gbnf_grammar_and_documentation(
    [Book],
    model_prefix="Response Model",
    fields_prefix="Response Model Field",
)
sys_prompt_template = PromptTemplate.from_string(
"You are an advanced AI agent. You are tasked to assist the user by creating structured output in JSON format.\n\n{documentation}"
)
creation_prompt_template = PromptTemplate.from_string(
"Create an JSON response based on the following input.\n\nInput:\n\n{user_input}"
)
sys_prompt = sys_prompt_template.generate_prompt({"documentation": documentation})
msg = creation_prompt_template.generate_prompt({"user_input": text})
sys_msg = {"role": "system", "content": sys_prompt}
user_msg = {"role": "user", "content": msg}
msg_list = [sys_msg, user_msg]
formatter = get_predefined_messages_formatter(MessagesFormatterType.LLAMA_3)
prompt, role = formatter.format_messages(msg_list, "assistant")
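# Adapt the llama-cpp-agent generation settings dict to create_completion's keyword names.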
settings_dic = settings.as_dict()
settings_dic["stop"] = settings_dic["stop_sequences"]
settings_dic.pop("stop_sequences")
settings_dic.pop("print_output")
main_model.reset()
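# Same prompt as the structured output agent above, but without grammar-constrained sampling.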
print(main_model.create_completion(prompt, **settings_dic))
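For completeness, the same prompt can also be run through plain llama-cpp-python with the generated grammar attached, which isolates the cost of grammar-constrained sampling from the rest of the agent machinery. This is only a sketch on top of the code above; it assumes the installed llama-cpp-python version exposes LlamaGrammar and the grammar keyword of create_completion.

from llama_cpp import LlamaGrammar

# Compile the GBNF text produced by generate_gbnf_grammar_and_documentation above.
llama_grammar = LlamaGrammar.from_string(grammar)

# Same prompt and sampling settings as the no-grammar run, with only the grammar added.
main_model.reset()
print(main_model.create_completion(prompt, grammar=llama_grammar, **settings_dic))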
After trying out different forms of grammar, I can say it always takes about 12.5 ms per token.
Thanks for looking into this. Unfortunately, I'm a little confused by your analysis and do not see this issue as closed. I've updated to the latest version of the package and am now using what I believe is the same model, Meta-Llama-3-8B.Q5_K_M.gguf.
You show sample times above of around 75 ms and 6 ms per token with and without grammar respectively, and I can reproduce this roughly 10x ratio using your code. But in the following message you say about 12.5 ms per token; I'm not sure where this figure comes from.
So now we have three sample-time values, none of which is 12.5 ms (which would be a comparatively high figure):
- structured_output_agent.create_object ~75 ms
- main_model.create_completion ~7 ms
- main_model(text) ~0.7 ms
So my question remains: why is the create_object sample time 100x slower than just providing the text to the model?
Perhaps you could expand on your tests.
Sorry, I meant that I tried different versions of the grammar, optimized for performance, and the result is always around 12.5 tokens per second, not 12.5 ms per token. The reason is that grammar-based sampling takes a lot of time. The reason your main_model(text) call is so fast is that it only generates 16 tokens and uses only the text as the prompt. My code (main_model.create_completion) uses the same prompt as the structured output agent, but without grammar sampling, and generates around 140 tokens. Generation slows down as the output gets longer. Without grammar I get 163.51 tokens per second.
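To make the bare call comparable, the prompt and the token budget have to match. As a rough sketch (not from the tests above, and assuming llama-cpp-python's default max_tokens of 16 is what caps the bare call), something like:

# Bare model call with the same formatted prompt and a token budget comparable
# to the ~140-token structured output (the default max_tokens of 16 is far smaller).
main_model.reset()
print(main_model(prompt, max_tokens=150))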
Ah understood.
So is it expected that grammar can increase the sampling time per token by an order of magnitude?
I would say it depends on the complexity of the grammar.
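One way to quantify that would be a small comparison like the sketch below; the time_generation helper and the trivial grammar are only illustrative (not part of llama-cpp-agent), and the Book grammar from the test code above could be passed to the same helper.

import time

from llama_cpp import Llama, LlamaGrammar

# Illustrative helper: measure effective tokens per second for a given grammar (or none).
def time_generation(model, prompt, grammar=None, max_tokens=128):
    model.reset()
    start = time.perf_counter()
    out = model.create_completion(prompt, max_tokens=max_tokens, grammar=grammar)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

model = Llama("../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
              n_gpu_layers=-1, n_ctx=2048, seed=42)

# A deliberately trivial grammar for comparison against the much larger Book grammar.
simple_grammar = LlamaGrammar.from_string('root ::= [a-zA-Z0-9 .,]+')

prompt = "Summarize the Feynman Lectures on Physics in one paragraph."
print("no grammar     :", time_generation(model, prompt), "tokens/s")
print("trivial grammar:", time_generation(model, prompt, simple_grammar), "tokens/s")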