### Describe the bug
No ASR results are ever produced; after about 30 seconds of audio, the process crashes with a `TypeError`.
### To Reproduce

```shell
whispering --language en --model tiny --debug
```
### Logs

```
(whisper_streaming) fc@Claudios-MacBook-Pro whisper_streaming % whispering --language en --model tiny --debug
[2022-10-09 19:13:13,532] cli.get_wshiper:211 DEBUG -> WhisperConfig: model_name='tiny' device='cpu' language='en' fp16=True
[2022-10-09 19:13:14,103] transcriber._set_dtype:35 WARNING -> FP16 is not supported on CPU; using FP32 instead
Using cache found in /Users/fc/.cache/torch/hub/snakers4_silero-vad_master
[2022-10-09 19:13:16,014] cli.get_context:223 DEBUG -> Context: timestamp=0.0 buffer_tokens=[] buffer_mel=None vad=True temperatures=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0] allow_padding=False patience=None compression_ratio_threshold=2.4 logprob_threshold=-1.0 no_captions_threshold=0.6 best_of=5 beam_size=5 no_speech_threshold=0.6 buffer_threshold=0.5 vad_threshold=0.5
[2022-10-09 19:13:16,014] cli.transcribe_from_mic:51 INFO -> Ready to transcribe
[2022-10-09 19:13:16,058] cli.transcribe_from_mic:62 DEBUG -> Audio #: 0, The rest of queue: 0
[2022-10-09 19:13:19,915] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:19,916] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:20,148] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:20,148] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:20,148] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:20,148] transcriber.transcribe:265 DEBUG -> mel.shape (375) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:20,148] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:20,148] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:20,148] cli.transcribe_from_mic:62 DEBUG -> Audio #: 1, The rest of queue: 0
[2022-10-09 19:13:23,595] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:23,595] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:23,785] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:23,785] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:23,785] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 750])
[2022-10-09 19:13:23,785] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:23,785] transcriber.transcribe:265 DEBUG -> mel.shape (750) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:23,785] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:23,785] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 750])
[2022-10-09 19:13:23,785] cli.transcribe_from_mic:62 DEBUG -> Audio #: 2, The rest of queue: 0
[2022-10-09 19:13:27,425] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:27,425] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:27,474] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:27,475] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 750])
[2022-10-09 19:13:27,475] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 1125])
[2022-10-09 19:13:27,475] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:27,475] transcriber.transcribe:265 DEBUG -> mel.shape (1125) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:27,475] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:27,475] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 1125])
[2022-10-09 19:13:27,475] cli.transcribe_from_mic:62 DEBUG -> Audio #: 3, The rest of queue: 0
[2022-10-09 19:13:31,115] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:31,115] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:31,160] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:31,161] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 1125])
[2022-10-09 19:13:31,161] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 1500])
[2022-10-09 19:13:31,161] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:31,161] transcriber.transcribe:265 DEBUG -> mel.shape (1500) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:31,161] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:31,161] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 1500])
[2022-10-09 19:13:31,161] cli.transcribe_from_mic:62 DEBUG -> Audio #: 4, The rest of queue: 0
[2022-10-09 19:13:34,998] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:34,998] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:35,046] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:35,046] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 1500])
[2022-10-09 19:13:35,046] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 1875])
[2022-10-09 19:13:35,046] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:35,046] transcriber.transcribe:265 DEBUG -> mel.shape (1875) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:35,046] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:35,047] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 1875])
[2022-10-09 19:13:35,047] cli.transcribe_from_mic:62 DEBUG -> Audio #: 5, The rest of queue: 0
[2022-10-09 19:13:38,689] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:38,689] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:38,737] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:38,737] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 1875])
[2022-10-09 19:13:38,737] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 2250])
[2022-10-09 19:13:38,737] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:38,737] transcriber.transcribe:265 DEBUG -> mel.shape (2250) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:38,737] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:38,737] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 2250])
[2022-10-09 19:13:38,737] cli.transcribe_from_mic:62 DEBUG -> Audio #: 6, The rest of queue: 0
[2022-10-09 19:13:42,368] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:42,369] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:42,415] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:42,415] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 2250])
[2022-10-09 19:13:42,416] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 2625])
[2022-10-09 19:13:42,416] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:42,416] transcriber.transcribe:265 DEBUG -> mel.shape (2625) - seek (0) < N_FRAMES (3000)
[2022-10-09 19:13:42,416] transcriber.transcribe:271 DEBUG -> No padding
[2022-10-09 19:13:42,416] transcriber.transcribe:326 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 2625])
[2022-10-09 19:13:42,416] cli.transcribe_from_mic:62 DEBUG -> Audio #: 7, The rest of queue: 0
[2022-10-09 19:13:46,251] cli.transcribe_from_mic:77 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-10-09 19:13:46,251] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-09 19:13:46,298] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-09 19:13:46,298] transcriber.transcribe:256 DEBUG -> buffer_mel.shape: torch.Size([80, 2625])
[2022-10-09 19:13:46,299] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 3000])
[2022-10-09 19:13:46,299] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-09 19:13:46,299] transcriber.transcribe:280 DEBUG -> seek=0, timestamp=0.0, mel.shape: torch.Size([80, 3000]), segment.shape: torch.Size([80, 3000])
[2022-10-09 19:13:46,299] transcriber._decode_with_fallback:103 DEBUG -> DecodeOptions: DecodingOptions(task='transcribe', language='en', temperature=0.0, sample_len=None, best_of=None, beam_size=5, patience=None, length_penalty=None, prompt=[], prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=False, max_initial_timestamp=1.0, fp16=False)
```
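The buffer growth in the log is consistent with Whisper's mel-spectrogram constants (this arithmetic is my inference, not from the whispering source): a 10 ms hop gives 100 frames per second, so each ~3.75 s chunk adds 375 frames, and decoding only begins once the buffer reaches `N_FRAMES = 3000` (30 s):

```python
# Sketch of the mel-buffer growth seen in the debug log.
# Assumption: 100 mel frames per second (10 ms hop), decode at N_FRAMES.
N_FRAMES = 3000
chunk_frames = 375  # frames contributed by each audio chunk in the log

sizes = [chunk_frames * (i + 1) for i in range(N_FRAMES // chunk_frames)]
print(sizes)       # [375, 750, 1125, 1500, 1875, 2250, 2625, 3000]
print(len(sizes))  # 8 chunks (Audio #0 .. #7) before the first decode attempt
```

This matches the log exactly: the first call to `_decode_with_fallback` happens on the eighth chunk, at which point the crash below occurs.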
```
Traceback (most recent call last):
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/bin/whispering", line 8, in <module>
    sys.exit(main())
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whispering/cli.py", line 301, in main
    for text in transcribe_from_mic(
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whispering/cli.py", line 82, in transcribe_from_mic
    for chunk in wsp.transcribe(audio=audio, ctx=ctx):
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whispering/transcriber.py", line 284, in transcribe
    result = self._decode_with_fallback(
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whispering/transcriber.py", line 104, in _decode_with_fallback
    decode_result = self.model.decode(
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whisper/decoding.py", line 700, in decode
    result = DecodingTask(model, options).run(mel)
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whisper/decoding.py", line 472, in __init__
    self.decoder = BeamSearchDecoder(
  File "/Users/fc/.local/share/virtualenvs/whisper_streaming-htMqhiJ1/lib/python3.10/site-packages/whisper/decoding.py", line 283, in __init__
    self.max_candidates: int = round(beam_size * (1.0 + patience))
TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'
```
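The crash is reproducible in isolation: `DecodingOptions(patience=None)` reaches `BeamSearchDecoder.__init__`, which evaluates `round(beam_size * (1.0 + patience))` without a `None` check. A minimal sketch (the guard at the end is a hypothetical workaround, not the project's actual patch):

```python
# Reproduce the failing expression from whisper/decoding.py in isolation.
beam_size = 5
patience = None  # the value whispering passes through DecodingOptions

try:
    max_candidates = round(beam_size * (1.0 + patience))
except TypeError as err:
    print(err)  # unsupported operand type(s) for +: 'float' and 'NoneType'

# Hypothetical guard: coerce None to 0.0 so the expression
# reduces to beam_size, i.e. no extra beam-search patience.
effective_patience = 0.0 if patience is None else patience
max_candidates = round(beam_size * (1.0 + effective_patience))
print(max_candidates)  # 5
```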
### Environment
- OS: macOS Monterey
- Python version: 3.10.3
- Whispering version: 0.5.0