
bidirectional_streaming_ai_voice's People

Contributors

ccappetta, danx0r, robotblox


bidirectional_streaming_ai_voice's Issues

suggestion for expressive eyes option

Hi!
I know this is not directly the goal of this project, but I have been struggling to adapt the code to do this, and maybe it'd be of interest to some.

I've seen that in your videos there is a border effect moving around the avatar while Claude is talking.
I wanted a more "immersive" animation, like the one in the ChatGPT app (the dots moving along with the volume of the voice).

What I've been attempting to do for a while, as a non-coder, is to have expressive eyes, like Eve in the Wall-E movie, changing with the feeling Claude is expressing. Your code seems compatible with that because it chunks the ElevenLabs answers.
We would have a Flask server with a page that accepts requests and displays normal eyes by default.

Here is my plan (a rough sketch of steps 2-4 follows the list):
1. Instruct Claude to answer as usual but to add an emoji before every paragraph, depending on the feeling it wants to express.
2. Parse the chunks of the answer for every emoji (instead of ., !, ?).
3. Send the chunk to ElevenLabs (without the emoji) and stream the voice.
4. Send a POST request to change the displayed feeling on the page, then go back to step 3.
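
A minimal sketch of steps 2-4, assuming a hypothetical Flask endpoint /feeling on localhost; the emoji map and URL are placeholders, and `speak` stands in for whatever function the project already uses to send a text chunk to ElevenLabs:

    import re
    import requests

    # Emoji-to-feeling map produced by step 1; extend as needed.
    EMOJI_TO_FEELING = {"😊": "happy", "😢": "sad", "😮": "surprised"}
    EMOJI_PATTERN = re.compile("|".join(map(re.escape, EMOJI_TO_FEELING)))

    def handle_chunk(chunk, speak):
        """Strip the emoji, update the eyes page, then speak the chunk."""
        match = EMOJI_PATTERN.search(chunk)
        if match:
            feeling = EMOJI_TO_FEELING[match.group()]
            chunk = EMOJI_PATTERN.sub("", chunk)
            # step 4: tell the Flask page which eyes to display
            requests.post("http://localhost:5000/feeling", json={"feeling": feeling})
        speak(chunk)  # step 3: stream the cleaned chunk as voice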

Event loop error on launch

Hi Chris,

Great work here (from your videos). I haven't been able to get it running myself, though. I ran into the keyboard permission issue on my Mac as well and had to run the script with elevated privileges, but then I still hit a persistent event loop error:

    async_tasks.py:83>> got Future <Future pending> attached to a different loop")>
    Traceback (most recent call last):

I ran it through GPT along with the source code and a more complete traceback, and it identified the following:

The repeated issue you're facing with the asyncio tasks being attached to a different loop is symptomatic of the way your event loop is managed and tasks are handled across different parts of your application. Given the traceback and the nature of the errors, it seems like there is a conflict between threads and the asyncio event loops they are supposed to manage.

Problem Diagnosis:

  1. Mixing of Threads with Asyncio:

    • Your script uses a combination of threading and asyncio, which can lead to complex interactions, especially when managing event loops across different threads. Each thread in Python can have only one associated event loop.
    • Your main async tasks are being initiated inside a thread (Thread-1 runs run_async_tasks). This approach requires careful management of the event loop, particularly making sure the right event loop is set and used for all asyncio operations within that thread.
  2. Event Loop Handling:

    • The errors indicate that the Future objects are being attached to a different event loop than the one currently running the await expression. This typically happens when a Future or coroutine is created in one loop and awaited in another.

Solution Suggestions:

To resolve these issues, you can take the following approaches:

  1. Proper Event Loop Initialization in Threads:

    • Ensure each thread that runs asyncio operations initializes its own event loop and sets it as the current event loop at the start of the thread. Here’s how you can modify your thread's target function to handle this properly:
    def run_async_tasks():
        loop = asyncio.new_event_loop()  # Create a new event loop for this thread
        asyncio.set_event_loop(loop)     # Set it as the current event loop in this thread
        try:
            loop.run_until_complete(start_async_tasks(text_to_speech_queue))
            loop.run_until_complete(shutdown_event.wait())
        finally:
            loop.close()  # Always close the loop, even if a task raises
  2. Error Handling in Async Tasks:

    • Modify your async tasks to properly handle and log exceptions to prevent unhandled task exceptions:
    async def text_to_speech_consumer(text_to_speech_queue):
        while True:
            text = await text_to_speech_queue.get()
            try:
                await process_text_to_speech(text)
            except Exception as e:
                print(f"Error in text_to_speech_consumer: {e}")
            finally:
                # Only mark done for items actually retrieved by get(), otherwise
                # task_done() can be called more times than there were items.
                text_to_speech_queue.task_done()
  3. Handling Future Objects Correctly:

    • Ensure that any Future or coroutine you create is attached and awaited in the same event loop. Use debug mode in asyncio to help identify where the mismatches occur:
    import asyncio
    asyncio.get_event_loop().set_debug(True)

These steps should help stabilize your application's concurrency model by ensuring that each thread manages its own event loop correctly, and that tasks and futures are not crossing these loop boundaries improperly. Additionally, running with debug mode enabled can provide more insight into where and why the event loop mismatches occur, leading to more targeted fixes.
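
One pattern worth noting beyond the GPT answer: when the producer lives in a different thread than the loop (as here), the standard bridge is asyncio.run_coroutine_threadsafe, which submits a coroutine to a specific loop and returns a thread-safe future. A minimal sketch, assuming the loop created in run_async_tasks is exposed to the main thread:

    import asyncio

    def enqueue_text(loop, text_to_speech_queue, text):
        """Call from the main (non-async) thread to hand work to the background loop."""
        future = asyncio.run_coroutine_threadsafe(
            text_to_speech_queue.put(text), loop)  # put() runs on `loop`, not here
        future.result()  # block until the put() completes on its own loop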

no need for OpenAI key

The README says we need an OpenAI API key, but nothing in the code seems to require OpenAI.

system message is a hard-coded summary

For this project to be usable by multiple users, we will need a way to personalize the system prompt, which is presently hard-coded based on several hours of discussion with Claude (Quill?).

Meta-question for ccappetta: are you interested in maintaining & improving this project with an eye towards multiple users and contributors? If so, it would make sense to discuss how issues like persistence, summaries, and customization are going to be implemented.

I'm here because I was basically coding the same thing but yours is better and farther along. I am interested in seeing where this goes.

Can't run on MacOS with M1

Hi, when running on macOS (M1) with Python 3.10, it throws the exception below:

Traceback (most recent call last):
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/main.py", line 252, in <module>
    main()
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/main.py", line 227, in main
    recording, fs = record_audio()
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/main.py", line 81, in record_audio
    with sd.InputStream(callback=callback, samplerate=fs, channels=2, blocksize=int(fs * block_duration)):
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/installer-v2/venv/lib/python3.10/site-packages/sounddevice.py", line 1421, in __init__
    _StreamBase.__init__(self, kind='input', wrap_callback='array',
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/installer-v2/venv/lib/python3.10/site-packages/sounddevice.py", line 898, in __init__
    _check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
  File "/Users/test/Learning/bedrock/bidirectional_streaming_ai_voice/installer-v2/venv/lib/python3.10/site-packages/sounddevice.py", line 2747, in _check
    raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening InputStream: Invalid number of channels [PaErrorCode -9998]

I tried reinstalling sounddevice and PortAudio, but it still doesn't work.
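
The -9998 error usually means the default input device doesn't support the requested channel count; the built-in mic on many Apple Silicon Macs is mono, while main.py hard-codes channels=2. A possible workaround (untested here) is to query the device and pass its actual channel count:

    import sounddevice as sd

    device_info = sd.query_devices(sd.default.device[0])  # default input device
    channels = min(2, device_info["max_input_channels"])  # often 1 on Apple Silicon mics

    # then in record_audio(), use the detected count instead of a hard-coded 2:
    # sd.InputStream(callback=callback, samplerate=fs, channels=channels,
    #                blocksize=int(fs * block_duration))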

Using another LLM instead of Claude

I would like to use another LLM in this amazing application instead of Anthropic Claude.

I was thinking about Mixtral-8x7b-Groq as the LLM, which has insanely fast inference (because it runs in the cloud on specialized hardware, namely LPUs, Language Processing Units). Furthermore, its quality is quite good, I think.

So how would I have to change the code so that it uses Mixtral-8x7b-Groq as the LLM?
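
For reference, Groq's Python SDK follows the OpenAI chat-completions shape, so swapping the Anthropic streaming call for something like the following might work. The model id and the handle_token callback are assumptions; check Groq's current docs:

    from groq import Groq

    client = Groq(api_key="YOUR_GROQ_API_KEY")

    def stream_mixtral(user_text, handle_token):
        # handle_token stands in for the project's existing chunking logic
        stream = client.chat.completions.create(
            model="mixtral-8x7b-32768",  # Groq's Mixtral id at the time of writing
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                handle_token(delta)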

suggestions

Thank you for creating this. I've been attempting something similar; yours, however, is better and more sophisticated. Sadly, I wasn't able to get this to work on my Linux system.
The keyboard module apparently requires installing with root privileges, which isn't recommended in a virtual environment. Instead of keyboard triggers, changes in the current noise level relative to the long-term noise level could serve to trigger transitions (see the sketch below). Another suggestion is to include a requirements.txt file.
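
A rough sketch of the noise-level idea: track a long-term RMS baseline with an exponential moving average and flag speech whenever short-term energy jumps well above it (the ratio and decay values are guesses to tune):

    import numpy as np

    class LevelTrigger:
        """Flag speech when short-term RMS jumps well above a long-term baseline."""

        def __init__(self, ratio=3.0, decay=0.999):
            self.baseline = None  # long-term noise floor (exponential moving average)
            self.ratio = ratio    # how far above the floor counts as speech
            self.decay = decay    # EMA smoothing factor

        def is_speech(self, block):
            rms = float(np.sqrt(np.mean(block.astype(np.float64) ** 2)))
            if self.baseline is None:
                self.baseline = rms
            self.baseline = self.decay * self.baseline + (1 - self.decay) * rms
            return rms > self.ratio * self.baseline

Feeding each block from the sounddevice callback into is_speech, recording could start on the first True and stop after some number of consecutive quiet blocks.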

Mac users don't have CUDA

Mac users: change these lines so that speech-to-text runs on the CPU.

    compute_type = "float32"

    segments, _ = WhisperModel(
        model_size, device="cpu", compute_type=compute_type
    ).transcribe(temp_file_path)

Reduce latency

Thought it made sense to collect thoughts about latency.

ccappetta did some profiling:
1. Spacebar press.
2. User audio file is generated and faster-whisper transcribes it to text (~4.5 seconds).
3. Transcribed user text is sent to Anthropic and the first LLM response token is streamed back (~3.5 seconds).
4. Enough LLM tokens are streamed back to generate a ~150-character chunk and send that text off to ElevenLabs (~1.25 seconds).
5. ElevenLabs generates and returns the audio file (~1.25 seconds).
6. Pygame playback begins (~0.1 seconds).

IIUC, presently the whisper model doesn't start transcribing until user input is finished (user hits spacebar). Transcription is atomic: all of the user input is transcribed and then returned as a text string.

I also did some profiling; my results seem consistent with CC's. For transcription of input there seems to be a fixed overhead of ~4 seconds even if the user input is very short (2 words). A 60-word input took ~7 seconds. We can model this as t = 4 + w / 20, where t is the time taken in seconds and w is the number of words.
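
For anyone reproducing these numbers, a simple way to instrument the pipeline (the stage names are illustrative):

    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(stage):
        """Print wall-clock time for one pipeline stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            print(f"{stage}: {time.perf_counter() - start:.2f}s")

    # e.g. wrap the existing calls in main():
    #   with timed("transcription"):
    #       segments, _ = model.transcribe(temp_file_path)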

I'm seeing 1.5-2.0 sec response time from Anthropic

Pipeline STT:
The right approach seems to be to start transcribing user input speech asynchronously so it runs in parallel with recording. CC mentioned this project: https://github.com/KoljaB/RealtimeSTT
Another candidate (this one uses whisper): https://github.com/ufal/whisper_streaming

I also see considerable variability in ElevenLabs response time, from about 1 second up to 3-4 seconds. It doesn't seem to correlate with the length of the response. I think this occasional delay was apparent in the last YouTube chat.

Personally, I'm not stuck on ElevenLabs -- TTS has been around a long time, and I'm sure there are mature, optimized, open-source alternatives. I would prefer less "character" in my voices; it's OK for them to be a tad robotic. The emotions ElevenLabs voices express are algorithmically derived by their proprietary models. They are probably great for casual users to have fun with, but I would be fine with a very "neutral" voice like HAL 9000. Just my 2c.
