Comments (1)
I have a similar issue with the /generate API endpoint and the Llama 2 model (meta-llama/Llama-2-7b-chat-hf). I am sending asynchronous requests in Python using asyncio and aiohttp.
Here is my code (you'll have to set env variables).
import asyncio
import os

import aiohttp
from dotenv import load_dotenv

async def post(string, session, temperature=0.7, max_new_tokens=50):
    # Load env variables
    load_dotenv()
    # Set url
    url = os.getenv("SERVER_IP") + "/generate"
    # Set headers
    headers = {
        "Content-Type": "application/json",
    }
    # Set data
    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }
    # Asynchronous request
    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        return resp.get("generated_text")

async def main(String_List):
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(*(post(string, session) for string in String_List))
        return responses

asyncio.run(main(String_List))
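For reference, the setup assumed by the snippet above is just an .env file holding the server address and the list of prompts (the address below is only an example, not my real server):

    # Hypothetical .env file next to the script:
    # SERVER_IP=http://127.0.0.1:8080

    # Prompt list used for the simple 2-request test below
    String_List = ["Why is the sky blue ?", "Does magic exist?"]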
This issue seems to happen when the server doesn't get the requests at the exact same time.
I don't have the issue with 2 simple simultaneous requests, with String_List = ["Why is the sky blue ?", "Does magic exist?"]
Here is the result:
["\n\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight travels through the Earth's atmosphere. Blue light, which has a shorter wavelength, is scattered more than other colors,", " I don't know, but I do know that it's a powerful force that has captured the imagination of people for centuries. If you believe in magic, then you know that it's not just a trick or a illusion, but"]
Here are the server logs. We can see that the 2 requests arrived at the exact same time:
2024-05-02T18:46:14.841112Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.03429266s" validation_time="119.75µs" queue_time="24.621µs" inference_time="3.034148508s" time_per_token="60.68297ms" seed="Some(12312515242247638074)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:46:14.841141Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.013721041s" validation_time="113.869µs" queue_time="41.184335ms" inference_time="2.972423037s" time_per_token="59.44846ms" seed="Some(2428907862398705128)"}: text_generation_router::server: router/src/server.rs:322: Success
However, when I make more complex requests with a RAG pipeline, I get the same issue as you.
Here are the responses; the first one is empty:
['', '\n\n Pour répondre à cela, nous allons procéder à une analyse des différentesapproches et formulations couramment utilisées pour le dimensionnement et l’évaluation d’architecturesavion. Nous all']
What is interesting is the logs. As you can see, the requests don't arrive at the same time (they are a few seconds apart). Because my RAG is complex and the input prompt is bigger, I think that may be what causes the delay. You can also see that the time per token is higher for the empty return (436ms vs 77ms, which is the standard time for this model):
2024-05-02T18:45:09.400725Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="900.637516ms" validation_time="906.888µs" queue_time="463.669762ms" inference_time="436.061055ms" time_per_token="436.061055ms" seed="Some(6510953659954175863)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:45:12.363474Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.865515768s" validation_time="934.667µs" queue_time="31.45µs" inference_time="3.864549831s" time_per_token="77.290996ms" seed="Some(4827920659267836322)"}: text_generation_router::server: router/src/server.rs:322: Success
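If time_per_token is simply inference_time divided by the number of generated tokens (an assumption on my side, but it matches the healthy request: 3.864549831s / 50 ≈ 77.29ms), then the empty response only produced a single token before stopping, probably an immediate end-of-sequence:

    # Assuming time_per_token = inference_time / generated_tokens (not confirmed,
    # but consistent with the healthy request), back out the token counts:
    empty_tokens = 0.436061055 / 0.436061055    # -> 1.0 token for the empty response
    normal_tokens = 3.864549831 / 0.077290996   # -> ~50.0 tokens for the normal one
    print(empty_tokens, normal_tokens)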
When I'm making 6 simultaneous requests, I have the same issue: the first response is empty with a higher time per token (about 500ms), while the 5 others are standard, with about 70ms per token.
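To check what the server actually produces for those empty strings, one option would be to enable the details field that appears in the logged parameters above (details: false). This is only a sketch of how the data dict inside post() would change, assuming the flag is passed through like the other parameters:

    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "details": True,  # same flag that shows up as details: false in the logs
        },
    }
    # The response should then also carry a "details" object (generated token count,
    # finish reason, ...), which would show whether the empty answer is an immediate
    # end-of-sequence token or something else.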
I then changed the model to Mistral (mistralai/Mistral-7B-Instruct-v0.2) and the empty first string disappears; I can't tell why. However, the requests arrived at the same time, so we can link the issue to the arrival time of the requests.
Logs with Mistral:
2024-05-02T19:45:56.665190Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.579128639s" validation_time="2.683284ms" queue_time="23.591µs" inference_time="6.576421954s" time_per_token="131.528439ms" seed="Some(8042501124986571198)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T19:45:56.665218Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.575659955s" validation_time="827.079µs" queue_time="474.720856ms" inference_time="6.10011227s" time_per_token="122.002245ms" seed="Some(18189346514334242789)"}: text_generation_router::server: router/src/server.rs:322: Success
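To confirm that arrival timing is really the trigger, a small variation of the client above could stagger the requests on purpose (the 2-second spacing is arbitrary):

    async def delayed_post(string, session, delay):
        # Deliberately delay some requests so they do NOT reach the server together
        await asyncio.sleep(delay)
        return await post(string, session)

    async def main_staggered(String_List):
        async with aiohttp.ClientSession() as session:
            # First prompt goes out immediately, the others 2 s apart
            tasks = (delayed_post(s, session, i * 2) for i, s in enumerate(String_List))
            return await asyncio.gather(*tasks)

If the first response only comes back empty in the staggered case, that would support the arrival-time explanation.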