
Comments (8)

Maximilian-Winter commented on July 28, 2024

I'm not sure how much grammar-based sampling affects performance, but this looks like a huge impact. Can you send me the generated grammar itself? I just want to make sure I didn't mess up the grammar generation.


this-josh commented on July 28, 2024

Hi, I think this is what you mean:

>>> from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import generate_gbnf_grammar_from_pydantic_models
>>> generate_gbnf_grammar_from_pydantic_models([Book])
'root ::= (" "| "\\n") grammar-models\ngrammar-models ::= book\nbook ::= "{"  ws "\\"title\\"" ": " string ","  ws "\\"author\\"" ": " string ","  ws "\\"published_year\\"" ": " number ","  ws "\\"keywords\\"" ": " book-keywords ","  ws "\\"category\\"" ": " book-category ","  ws "\\"summary\\"" ": " string ws "}"\nbook-keywords ::= "[" ws string ("," ws string)* ws "]" \nbook-category ::= "\\"Fiction\\"" | "\\"Non-Fiction\\""'
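For readability, here is the same grammar with the Python string escapes decoded (the ws, string, and number primitive rules are not shown in this snippet):

root ::= (" "| "\n") grammar-models
grammar-models ::= book
book ::= "{"  ws "\"title\"" ": " string ","  ws "\"author\"" ": " string ","  ws "\"published_year\"" ": " number ","  ws "\"keywords\"" ": " book-keywords ","  ws "\"category\"" ": " book-category ","  ws "\"summary\"" ": " string ws "}"
book-keywords ::= "[" ws string ("," ws string)* ws "]"
book-category ::= "\"Fiction\"" | "\"Non-Fiction\""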


Maximilian-Winter commented on July 28, 2024

@this-josh I ran some tests using the same prompt for generation with grammar and without. My result is that it is about 12 times faster without grammar. I still have to do some additional tests.

With grammar:
llama_print_timings:        load time =     307.76 ms
llama_print_timings:      sample time =   11026.43 ms /   141 runs   (   78.20 ms per token,    12.79 tokens per second)
llama_print_timings: prompt eval time =     307.34 ms /   248 tokens (    1.24 ms per token,   806.92 tokens per second)
llama_print_timings:        eval time =    5111.64 ms /   140 runs   (   36.51 ms per token,    27.39 tokens per second)
llama_print_timings:       total time =   17923.20 ms /   388 tokens

Without grammar:
llama_print_timings:        load time =     307.76 ms
llama_print_timings:      sample time =     844.00 ms /   138 runs   (    6.12 ms per token,   163.51 tokens per second)
llama_print_timings: prompt eval time =     280.42 ms /   218 tokens (    1.29 ms per token,   777.39 tokens per second)
llama_print_timings:        eval time =    3609.99 ms /   137 runs   (   26.35 ms per token,    37.95 tokens per second)
llama_print_timings:       total time =    5686.03 ms /   355 tokens

This is my full test code:

from enum import Enum

from llama_cpp import Llama
from pydantic import BaseModel, Field

from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import \
    generate_gbnf_grammar_and_documentation
from llama_cpp_agent.llm_prompt_template import PromptTemplate
from llama_cpp_agent.llm_settings import LlamaLLMGenerationSettings
from llama_cpp_agent.messages_formatter import MessagesFormatterType, get_predefined_messages_formatter
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent

settings = LlamaLLMGenerationSettings(stream=False)
main_model = Llama(
    "../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
    n_gpu_layers=-1,
    use_mlock=False,
    embedding=False,
    n_threads=12,
    n_batch=2048,
    n_ctx=2048,
    last_n_tokens_size=1024,
    verbose=True,
    seed=42,
    stream=True
)
# Example enum for our output model
class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"


# Example output model
class Book(BaseModel):
    """
    Represents an entry about a book.
    """
    title: str = Field(..., description="Title of the book.")
    author: str = Field(..., description="Author of the book.")
    published_year: int = Field(..., description="Publishing year of the book.")
    keywords: list[str] = Field(..., description="A list of keywords.")
    category: Category = Field(..., description="Category of the book.")
    summary: str = Field(..., description="Summary of the book.")


# With grammar: StructuredOutputAgent generates a GBNF grammar from the Book model and samples with it
structured_output_agent = StructuredOutputAgent(main_model, llama_generation_settings=settings,
                                                messages_formatter_type=MessagesFormatterType.LLAMA_3,
                                                debug_output=False)

text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
print(structured_output_agent.create_object(Book, text))

# Without grammar: rebuild an equivalent prompt by hand and run a plain completion for comparison
grammar, documentation = generate_gbnf_grammar_and_documentation(
    [Book],
    model_prefix="Response Model",
    fields_prefix="Response Model Field",
)

sys_prompt_template = PromptTemplate.from_string(
    "You are an advanced AI agent. You are tasked to assist the user by creating structured output in JSON format.\n\n{documentation}"
)
creation_prompt_template = PromptTemplate.from_string(
    "Create a JSON response based on the following input.\n\nInput:\n\n{user_input}"
)
sys_prompt = sys_prompt_template.generate_prompt({"documentation": documentation})
msg = creation_prompt_template.generate_prompt({"user_input": text})

sys_msg = {"role": "system", "content": sys_prompt}
user_msg = {"role": "user", "content": msg}

msg_list = [sys_msg, user_msg]
formatter = get_predefined_messages_formatter(MessagesFormatterType.LLAMA_3)
prompt, role = formatter.format_messages(msg_list, "assistant")
# create_completion expects "stop" rather than "stop_sequences" and has no "print_output" option
settings_dic = settings.as_dict()
settings_dic["stop"] = settings_dic["stop_sequences"]
settings_dic.pop("stop_sequences")
settings_dic.pop("print_output")
main_model.reset()
print(main_model.create_completion(prompt, **settings_dic))


Maximilian-Winter commented on July 28, 2024

After trying out different forms of grammar, I can say it always takes about 12.5 ms per token.


this-josh commented on July 28, 2024

Hi @Maximilian-Winter,

Thanks for looking into this. Unfortunately, I'm a little confused by your analysis and do not see this issue as closed. I've updated to the latest version of the package and am now using what I believe is the same model, Meta-Llama-3-8B.Q5_K_M.gguf.

You show sample timings above of around 75 ms and 6 ms per token with and without grammar respectively, and I can reproduce this approximate 10x ratio using your code. But in the message after that you say about 12.5 ms per token, and I'm not sure where this figure comes from.

So now we have three sample-time values, none of which is 12.5 ms (which would be a comparatively high figure):

  • structured_output_agent.create_object ~75 ms
  • main_model.create_completion ~7 ms
  • main_model(text) ~0.7 ms

So my question remains: why is the create_object sample time 100x slower than just providing the text to the model?

Perhaps you could expand on your tests.


Maximilian-Winter commented on July 28, 2024

Sorry, I meant that I tried different versions of the grammar, optimized for performance, and the result is always around 12.5 tokens per second, not 12.5 ms per token. The reason is that grammar-based sampling takes a lot of time. The reason your main_model(text) call is so fast is that it only generates 16 tokens and only uses the text as the prompt. My code (main_model.create_completion) uses the same prompt as the structured output agent, but without grammar sampling, and it generates around 140 tokens. Generation slows down the longer the output gets. Without grammar I get 163.51 tokens per second.
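A minimal sketch (not the agent's code path) of how one could isolate the grammar overhead with plain llama-cpp-python: it assumes the Book model from the test script above is defined, the prompt text is illustrative, and LlamaGrammar.from_string is used to load the generated GBNF string. Comparing the "sample time" lines printed for the two runs should show the per-token cost of grammar sampling on its own.

from llama_cpp import Llama, LlamaGrammar

from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import \
    generate_gbnf_grammar_and_documentation

# Build the same grammar the agent uses for the Book model (see the test script above)
grammar_text, documentation = generate_gbnf_grammar_and_documentation(
    [Book], model_prefix="Response Model", fields_prefix="Response Model Field"
)
grammar = LlamaGrammar.from_string(grammar_text)

model = Llama(
    "../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
    n_gpu_layers=-1, n_ctx=2048, seed=42, verbose=True
)
prompt = "Create a JSON object describing The Feynman Lectures on Physics."

# Run 1: constrained sampling with the GBNF grammar
model.reset()
with_grammar = model.create_completion(prompt, max_tokens=256, grammar=grammar)

# Run 2: identical call without the grammar
model.reset()
without_grammar = model.create_completion(prompt, max_tokens=256)
# verbose=True makes llama.cpp print the "sample time" line for each run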


this-josh commented on July 28, 2024

Ah understood.

So is it expected that grammar can increase the sampling time per token by an order of magnitude?


Maximilian-Winter commented on July 28, 2024

I would say it depends on the complexity of the grammar.
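As a hypothetical quick check of that, reusing the model and prompt from the sketch above, one could time the same completion against a trivial one-rule grammar and compare its reported sample time per token with the full Book grammar run.

from llama_cpp import LlamaGrammar

# Trivial grammar that only allows the two category string literals
simple_grammar = LlamaGrammar.from_string('root ::= "\\"Fiction\\"" | "\\"Non-Fiction\\""')

model.reset()
model.create_completion(prompt, max_tokens=8, grammar=simple_grammar)
# Compare the reported sample time per token against the full Book grammar run above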

