
Comments (1)

alexgravx commented on June 11, 2024

I have a similar issue with the /generate API endpoint and the Llama 2 model (meta-llama/Llama-2-7b-chat-hf). I am making asynchronous requests in Python using asyncio and aiohttp.

Here is my code (you'll have to set the SERVER_IP environment variable).

import asyncio
import os

import aiohttp
from dotenv import load_dotenv

async def post(string, session, temperature=0.7, max_new_tokens=50):
    # Load env variables
    load_dotenv()
    # Set url
    url = os.getenv("SERVER_IP") + "/generate"
    # Set headers
    headers = {
        "Content-Type": "application/json",
    }
    # Set data
    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }

    # Asynchronous request
    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        return resp.get('generated_text')

async def main(String_List):
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(*(post(string, session) for string in String_List))
    return responses

String_List = ["Why is the sky blue ?", "Does magic exist?"]
asyncio.run(main(String_List))

This issue seems to happen when the server doesn't get the requests at the exact same time.
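To make the staggered-arrival case easy to reproduce, the requests can be delayed on the client side. A minimal sketch reusing the post() helper above (delayed_post and the spacing value are mine, for illustration):

async def delayed_post(string, session, delay):
    # Hold this request back so it reaches the server later than the others
    await asyncio.sleep(delay)
    return await post(string, session)

async def staggered_main(String_List, spacing=2.0):
    async with aiohttp.ClientSession() as session:
        tasks = (delayed_post(s, session, i * spacing) for i, s in enumerate(String_List))
        return await asyncio.gather(*tasks)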

I don't have the issue with 2 simple simultaneous requests, with String_List = ["Why is the sky blue ?", "Does magic exist?"].
Here is the result:

["\n\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight travels through the Earth's atmosphere. Blue light, which has a shorter wavelength, is scattered more than other colors,", " I don't know, but I do know that it's a powerful force that has captured the imagination of people for centuries. If you believe in magic, then you know that it's not just a trick or a illusion, but"]

Here are the server logs. We can see that the 2 requests arrived at virtually the same time:

2024-05-02T18:46:14.841112Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.03429266s" validation_time="119.75µs" queue_time="24.621µs" inference_time="3.034148508s" time_per_token="60.68297ms" seed="Some(12312515242247638074)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:46:14.841141Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.013721041s" validation_time="113.869µs" queue_time="41.184335ms" inference_time="2.972423037s" time_per_token="59.44846ms" seed="Some(2428907862398705128)"}: text_generation_router::server: router/src/server.rs:322: Success

However, when I make more complex requests with a RAG, I get the same issue as you.
Here are the responses; the first one is empty:

['', '\n\n Pour répondre à cela, nous allons procéder à une analyse des différentesapproches et formulations couramment utilisées pour le dimensionnement et l’évaluation d’architecturesavion. Nous all']

What is interesting is the logs. As you can see, the requests don't arrive at the same time (they are a few seconds apart). Because my RAG is complex and the input prompt is bigger, I think that may be causing the delay. You can also see that the time per token is higher for the empty response (436ms vs 77ms, which is the standard time for this model); in fact, time_per_token equals inference_time for that request, which suggests only a single token was generated:

2024-05-02T18:45:09.400725Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="900.637516ms" validation_time="906.888µs" queue_time="463.669762ms" inference_time="436.061055ms" time_per_token="436.061055ms" seed="Some(6510953659954175863)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:45:12.363474Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.865515768s" validation_time="934.667µs" queue_time="31.45µs" inference_time="3.864549831s" time_per_token="77.290996ms" seed="Some(4827920659267836322)"}: text_generation_router::server: router/src/server.rs:322: Success
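Since the logs show details: false, one way to confirm what the server actually generated might be TGI's details parameter; if I read the API correctly, the response should then include a details object with the finish reason and generated token count (treat the exact field names below as my assumption):

data = {
    "inputs": string,
    "parameters": {
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "details": True,  # assumption: response then carries resp["details"]
    },
}
# e.g. resp["details"]["generated_tokens"] and resp["details"]["finish_reason"]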

When I make 6 simultaneous requests, I have the same issue: the first response is empty with a higher time per token (about 500ms), while the other 5 are standard, at about 70ms per token.
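To see which request is the slow or empty one from the client side, each call can be timed individually; a quick sketch wrapping post() (timed_post is my own helper):

import time

async def timed_post(string, session):
    start = time.perf_counter()
    text = await post(string, session)
    elapsed = time.perf_counter() - start
    # Flag empty generations together with their wall-clock latency
    print(f"{elapsed:.2f}s empty={not text} prompt={string[:40]!r}")
    return text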

I then changed the model to Mistral (mistralai/Mistral-7B-Instruct-v0.2) and the empty first string disappeared; I can't tell why. However, the requests arrived at the same time, so we can link the issue to the arrival times of the requests.

Logs with Mistral:

2024-05-02T19:45:56.665190Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.579128639s" validation_time="2.683284ms" queue_time="23.591µs" inference_time="6.576421954s" time_per_token="131.528439ms" seed="Some(8042501124986571198)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T19:45:56.665218Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.575659955s" validation_time="827.079µs" queue_time="474.720856ms" inference_time="6.10011227s" time_per_token="122.002245ms" seed="Some(18189346514334242789)"}: text_generation_router::server: router/src/server.rs:322: Success

