
Comments (17)

erew123 commented on September 15, 2024

Hi @gboross

It is something I can look at, though I want to be fair and clear that what I have suggested above is only a theory I have about how to make it work. There is probably 6-10 hours of building/testing a rough prototype to prove the theory and figure out how well it can perform. With multiple requests being sent into a GPU, I have limited experience of how well NVIDIA will time-slice the GPU in practice. My research says it should split resources 50/50 if 2x requests come in, or 33/33/33 if 3 simultaneous requests come in, and I assume this would be enough on a reasonable GPU to keep up with something like multiple streaming requests, though I imagine there is a breaking point somewhere depending on the GPU in question, e.g. a 4090 is going to outperform a 3060, and each of those hardware options will hit a limit somewhere down the line.

If I did get as far as testing a prototype, I would imagine from that point on there is another 40 hours minimum to build/test a working queue system. It would need to handle/set the maximum number of Python instances you could start (so you could alter it based on different hardware), so the queue engine would need to handle anything from 2 to ??? TTS engines being loaded in dynamically (again to cover different hardware scenarios). There would also be a need to handle the situation where all engines are currently in use: hold the request in the queue until one engine becomes available again, perhaps load-balance across multiple GPUs, or maybe fall back to a lower-quality TTS engine that can be processed on the CPU.
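To make the idea concrete, here is a very rough sketch of that kind of worker pool: a configurable number of Python processes, each holding its own TTS engine, fed from a single job queue. Everything here (`load_tts_engine`, `engine_generate`, the pool size) is a hypothetical placeholder for illustration, not AllTalk code.

```python
import multiprocessing as mp

def load_tts_engine():
    # Placeholder: in reality this would load an XTTS model onto the GPU.
    return "xtts-engine"

def engine_generate(engine, text):
    # Placeholder for real inference that returns audio bytes.
    return f"{engine}: {text}".encode()

def tts_worker(jobs, results):
    engine = load_tts_engine()              # one engine per Python process
    while True:
        job_id, text = jobs.get()
        if job_id is None:                  # shutdown sentinel
            break
        results.put((job_id, engine_generate(engine, text)))

if __name__ == "__main__":
    max_instances = 2                       # tunable per GPU (3060 vs 4090 etc.)
    jobs, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=tts_worker, args=(jobs, results))
               for _ in range(max_instances)]
    for w in workers:
        w.start()

    texts = ["First sentence to speak.", "Second sentence to speak."]
    for job_id, text in enumerate(texts):
        jobs.put((job_id, text))
    for _ in texts:
        print(results.get())                # (job_id, audio bytes), in completion order

    for _ in workers:
        jobs.put((None, None))              # tell each worker to exit
    for w in workers:
        w.join()
```

The pool size would be the tunable "maximum number of Python instances" setting, and a CPU-fallback engine could simply be one more worker process added to the same queue.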

Then of course there will be a need to do something in the interface to manage/configure this system etc.

Finally there will be testing/debugging/possible rework of bits, or even the possibility that, for whatever reason, it just does not work as expected.

All in, it could be 50 hours of work at the bottom end and potentially closer to 100 at the upper end, with no absolute guarantee that it will work the way I have proposed. It's also something I would want to re-research and think over, just to make sure I am nailing the design before I even touch any code.

I guess I would need to roll it around my head a little more and firm up my thoughts on it before committing to something.


erew123 commented on September 15, 2024

I have no buffering built in currently and, as far as I am aware, it can only generate one thing at a time... though in all honesty, I haven't tested. I've currently set no lock on the API to stop you trying it, meaning that if you send multiple requests simultaneously, there is nothing in the script to say "No, I'm already processing something, so I am not going to accept that request". I suspect it will either queue up the request or cancel the current generation and start the new one, but I don't know for sure.

The honest truth is, I don't actually know for sure, and it's something I was going to look at at some point.
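For illustration only, a guard of the kind described above ("No, I'm already processing something") might look roughly like this, assuming a FastAPI/uvicorn-style API layer; the route name, parameters and `run_tts` stub are placeholders rather than AllTalk's actual code:

```python
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
busy = asyncio.Lock()

async def run_tts(text: str, voice: str) -> dict:
    # Stand-in for the real generation call.
    await asyncio.sleep(1)
    return {"status": "generate-success", "text": text, "voice": voice}

@app.get("/api/tts-generate-streaming")      # placeholder route name
async def generate(text: str, voice: str):
    if busy.locked():
        # The "I'm already processing something, come back later" response.
        raise HTTPException(status_code=503, detail="TTS engine is busy, retry shortly")
    async with busy:
        return await run_tts(text, voice)
```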


mercuryyy commented on September 15, 2024

Thanks, I'll run some tests and see what happens.


erew123 commented on September 15, 2024

Hi @mercuryyy. Have you found your answer? Would you like me to leave the ticket open?


erew123 commented on September 15, 2024

Hi @mercuryyy, I'm going to assume this is closed for now, but feel free to re-open if needed. Thanks


GiulioZ94 commented on September 15, 2024

Simultaneous calls to the API currently mix chunks between requests, resulting in a mixed WAV file. Is there a way to handle simultaneous calls, possibly using a queue or similar method, to avoid this issue?


erew123 commented on September 15, 2024

Hi @GiulioZ94

I've run no further tests myself on this. It's theoretically possible to build a queue system into the API, and I've not put any locking on the API currently, so if you don't want to handle queue management within your application on your end, I would have to look at this.

What API endpoint are you calling?


GiulioZ94 commented on September 15, 2024

Hi @erew123

Thanks for helping. I'm using the tts-generate-streaming API in GET mode.
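Until there is server-side queueing, one workaround along these lines is to serialise the streaming GET calls on the client side so only one generation is in flight at a time. This is a minimal sketch; the endpoint path and query parameter names are assumptions, so check AllTalk's API documentation for the exact ones:

```python
import threading
import requests

api_lock = threading.Lock()

def stream_tts(base_url: str, text: str, voice: str) -> bytes:
    # A second caller blocks here until the first stream has finished,
    # so chunks from two generations can never interleave on this client.
    with api_lock:
        audio = bytearray()
        with requests.get(
            f"{base_url}/api/tts-generate-streaming",   # adjust to the real route
            params={"text": text, "voice": voice},      # parameter names assumed
            stream=True,
        ) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=4096):
                audio.extend(chunk)
        return bytes(audio)

# Example: stream_tts("http://<host>:<port>", "Hello there", "some_voice.wav")
```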


erew123 commented on September 15, 2024

@GiulioZ94 So let me just confirm. Let's say you have two lots of audio to generate, A and B. You:

  1. Want AllTalk to generate the first set of streaming audio (A) before it starts work on the next stream (B), but would like to be able to send the text of request B to AllTalk while A is still being streamed/generated.

  2. Are not asking for simultaneous generation of A and B, i.e. requests from two different source locations being generated simultaneously and sent back to those two locations at the same time.

I'm assuming you want 1, but I just want to be sure I've got your request correct.

If so, it will take a bit of thinking about. Because of the single-threaded nature of Python, I suspect it might require a slightly different start-up script with uvicorn running multi-threaded, plus a Python queue system coded in, and then ??? amount of testing.
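As a sketch of what option 1 could look like (accept request B immediately, but generate strictly one at a time), something along these lines might work inside a FastAPI/uvicorn app; the route, parameters and `run_tts` stub are illustrative placeholders, not AllTalk's actual internals:

```python
import asyncio
from fastapi import FastAPI, Response

app = FastAPI()
job_queue: asyncio.Queue = asyncio.Queue()

async def run_tts(text: str, voice: str) -> bytes:
    # Stand-in for the real XTTS generation/streaming call.
    await asyncio.sleep(1)
    return f"audio for: {text}".encode()

async def tts_worker():
    # Single consumer, so only one generation touches the GPU at a time.
    while True:
        text, voice, done = await job_queue.get()
        try:
            done.set_result(await run_tts(text, voice))
        except Exception as exc:
            done.set_exception(exc)
        finally:
            job_queue.task_done()

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(tts_worker())

@app.get("/api/tts-generate-streaming")      # placeholder route
async def generate(text: str, voice: str):
    done = asyncio.get_running_loop().create_future()
    await job_queue.put((text, voice, done))
    # Request B parks here while request A is still generating.
    return Response(content=await done, media_type="audio/wav")
```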

FYI, I'm about to sign off for the day, so I won't be responding again for a while.

Thanks


GiulioZ94 commented on September 15, 2024

@erew123, yes, it would be great if it's possible to handle requests simultaneously, so option 1. If not, at least ensure things don't break and handle requests one at a time.


erew123 commented on September 15, 2024

@GiulioZ94 I'll make a note of it. It's something I may or may not be able to figure out soon-ish. I'm currently in the middle of a major update of lots of things and have quite a decent chunk to work on and test on multiple OSs.

So bear with me on this. I've added it to the Feature requests #74 so it's in there as a "to do/investigate".

Thanks


GiulioZ94 commented on September 15, 2024

@erew123 Sure, take your time. Thanks for your work.


Swastik-Mantry commented on September 15, 2024

Hey @erew123, to solve this issue I have tried keeping a pool of multiple XTTSv2 models (about 4 of them) and using a different model for each request's synthesis (via a queue-like implementation), but simultaneous requests lead to errors. [Source of idea]

When there were no errors, the audio still got mixed up between simultaneous requests (audio of request 1 got mixed with request 2 or 3, and eventually all the generated speech was gibberish), while at other times I got the following torch errors:

  1. "Assertion srcIndex < srcSelectDimSize failed"
  2. "probability tensor contains either inf, nan or element < 0" or
  3. "RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect"
  4. The error mentioned in the source of idea link
    ( I think there must be some race condition to cause them)

I had also tried using executor-based multi-threading / multi-processing to serve the requests (i.e. run synthesis on the XTTSv2 models concurrently) and be free of those errors, but it didn't work out.
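For reference, the attempted per-model pool described above would look roughly like the sketch below (`load_xtts_model` and the inference call are placeholders). It is shown only to make the attempt concrete; as reported, the pooled models still share one Python process and CUDA context, so this alone did not prevent the errors:

```python
import asyncio

async def load_xtts_model(index: int):
    # Placeholder: each entry would really be a separately loaded XTTSv2 model.
    return f"xtts-model-{index}"

async def build_pool(size: int = 4) -> asyncio.Queue:
    pool: asyncio.Queue = asyncio.Queue()
    for i in range(size):
        pool.put_nowait(await load_xtts_model(i))
    return pool

async def synthesize(pool: asyncio.Queue, text: str) -> bytes:
    model = await pool.get()            # wait until one of the pooled models is free
    try:
        # Placeholder for the real inference call; in practice all pooled
        # models shared one CUDA context, which is where the errors showed up.
        await asyncio.sleep(0.5)
        return f"{model}: {text}".encode()
    finally:
        pool.put_nowait(model)          # hand the model back to the pool

async def main():
    pool = await build_pool(size=2)
    print(await asyncio.gather(synthesize(pool, "request 1"),
                               synthesize(pool, "request 2")))

asyncio.run(main())
```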

I know you have just worked a lot to release the beta version, so please have a good rest, you really deserve it. In case you work on this problem, please let me know if you come up with a possible solution and your thought process for it. Thanks again.


erew123 commented on September 15, 2024

@Swastik-Mantry Thanks for the info on this. It's certainly one of those problems where I'm not sure whether it is or isn't achievable in some way. Resource sharing, multiple models, and ensuring requests all go back to the correct source are potentially unsolvable issues, and it's highly possible Coqui's scripts won't be capable of supporting it even if AllTalk can. It's something I'll take a look at at some point, but I really appreciate having your experience on it! :) It's a nice head start and at least I can discount a couple of possible routes.


gboross commented on September 15, 2024

Hello everyone,

I’m reaching out to inquire if there have been any updates on the topic you discussed earlier. We are looking to set up an AWS system capable of handling multiple requests simultaneously, without queuing them, provided there are sufficient resources available.

Additionally, I came across these two Coqui variations, but I couldn’t find a solution to our issue there either. Could you possibly assist us with this matter?

https://github.com/idiap/coqui-ai-TTS
https://github.com/daswer123/xtts-api-server?tab=readme-ov-file

Thank you!


erew123 commented on September 15, 2024

Hi @gboross I've done a bit of investigation into this, and there is a limitation of Transformers/CUDA where the tensor cores do not fully segment requests within one Python instance. In other words, if two requests come into one Python instance, the data being sent to the tensor cores from the two requests gets jumbled up into one block of data and comes out as a mess.

I cannot recall all the exact things I looked into at the time; however, there is potentially a way to use a CUDA technology to segregate/track the tensor cores within Python (I believe), but it would require an entire rewrite of the whole Coqui inference scripts to handle that. It's not impossible, but it's not just 10 lines of code in one script; you are looking at a lot of code to do that.

The alternative is to build a queue/multiplexing system where multiple XTTS engines get loaded by multiple separate Python instances, and when one is busy, one of the free ones gets picked and used, which maintains the segregation between the tensor cores and the requests. Again, this is a decent amount of coding, but it doesn't require re-jigging the Coqui scripts in this scenario, just a multiplexing queue/tracking system.
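A minimal sketch of that multiplexing front-end, assuming several independently started XTTS/AllTalk Python instances each listening on its own port (the ports, route and parameter names here are assumptions for illustration):

```python
import asyncio
import httpx
from fastapi import FastAPI, Response

# Assumed addresses of separately started XTTS/AllTalk instances, one per port.
BACKENDS = ["http://127.0.0.1:7851", "http://127.0.0.1:7852"]

app = FastAPI()
free_engines: asyncio.Queue = asyncio.Queue()

@app.on_event("startup")
async def fill_queue():
    for url in BACKENDS:
        free_engines.put_nowait(url)

@app.get("/api/tts-generate-streaming")          # placeholder route
async def proxy(text: str, voice: str):
    engine = await free_engines.get()            # waits here if every engine is busy
    try:
        async with httpx.AsyncClient(timeout=None) as client:
            resp = await client.get(
                f"{engine}/api/tts-generate-streaming",
                params={"text": text, "voice": voice},
            )
            resp.raise_for_status()
            return Response(content=resp.content,
                            media_type=resp.headers.get("content-type", "audio/wav"))
    finally:
        free_engines.put_nowait(engine)          # mark the engine as free again
```

Using `await free_engines.get()` makes a request wait when every engine is busy; swapping it for `get_nowait()` plus a 503 response would instead reject requests once the pool is exhausted.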


gboross commented on September 15, 2024

Thanks for the detailed explanation! Sounds like the tensor cores are throwing a bit of a party with the data. 😂 Could you possibly work on the second option you mentioned? It sounds like a more feasible approach without needing a complete overhaul of the Coqui scripts.

Of course, we’d be willing to pay for your help if you're open to solving this issue.

Thanks a lot!

