systran / faster-whisper
Faster Whisper transcription with CTranslate2
License: MIT License
Would it be possible to produce shorter segments? (some are way too long)
GPU: A100
wav:test.zip
code:
```python
from tqdm import tqdm
import time
from faster_whisper import WhisperModel
import os

os.environ["OMP_NUM_THREADS"] = "4"

audio_path = 'test.wav'
gpu_model_path = "whisper-large-v2-ct2-float16/"
cpu_model_path = "whisper-large-v2-ct2-int8/"

# Run on GPU with FP16
gpu_model = WhisperModel(gpu_model_path, device="cuda", compute_type="float16")
# or run on CPU with INT8
cpu_model = WhisperModel(cpu_model_path, device="cpu", compute_type="int8")

startTime = time.time()
print('Transcribing with gpu model')
segments, info = gpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('gpu model: %s' % (intermediateTime - startTime))

print('-' * 100)

startTime = time.time()
print('Transcribing with cpu model')
segments, info = cpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('cpu model: %s' % (intermediateTime - startTime))
```
log:
Transcribing with gpu model
gpu model: 0.012365579605102539
----------------------------------------------------------------------------------------------------
Transcribing with cpu model
cpu model: 0.006761789321899414
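The near-zero timings in this log are a measurement artifact: transcribe() returns a lazy generator, so decoding only runs while the segments are iterated. A minimal sketch of timing the full transcription, reusing the model and options above:

```python
# Consume the generator to measure the actual decoding time
startTime = time.time()
segments, info = gpu_model.transcribe(audio_path, beam_size=5, language='zh')
segments = list(segments)  # force the whole transcription to run
print('gpu model: %.2fs for %d segments' % (time.time() - startTime, len(segments)))
```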
With openai/whisper I can select the GPU with device="cuda:0" or device="cuda:1". With faster-whisper, I get this error:
ValueError: unsupported device cuda:1
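For reference, faster-whisper selects GPUs through a separate device_index argument rather than a "cuda:N" string. A sketch, assuming the same model path:

```python
from faster_whisper import WhisperModel

# Select the second GPU via device_index instead of "cuda:1"
model = WhisperModel(model_path, device="cuda", device_index=1, compute_type="float16")
```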
❯ python3 transcribe.py
Traceback (most recent call last):
File "/home/user/git/faster-whisper/transcribe.py", line 13, in <module>
segments, info = model.transcribe("audio.opus", beam_size=5)
File "/home/user/git/faster-whisper/faster_whisper/transcribe.py", line 156, in transcribe
audio = decode_audio(
File "/home/user/git/faster-whisper/faster_whisper/audio.py", line 27, in decode_audio
fifo.write(new_frame)
File "av/audio/fifo.pyx", line 25, in av.audio.fifo.AudioFifo.write
File "av/audio/fifo.pyx", line 90, in av.audio.fifo.AudioFifo.write
File "av/error.pyx", line 336, in av.error.err_check
av.error.ValueError: [Errno 22] Invalid argument
The target file is an Opus file. Is MP3 the only supported file type? Which codecs should the file use?
Edit: it seems to work on smaller/shorter files. The one with the error is a ~12-hour video I wanted to test on.
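As a possible workaround while this is investigated, very long files can be pre-decoded with ffmpeg to the 16 kHz mono WAV that Whisper expects, bypassing the av decoding path entirely. A sketch; the file names are placeholders:

```python
import subprocess

# Pre-decode the long Opus file to 16 kHz mono WAV before transcribing
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.opus", "-ar", "16000", "-ac", "1", "audio.wav"],
    check=True,
)
```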
Hi,
ctranslate2 version 3.9.0 raises the error below; I worked around it by pinning version 3.8.* in requirements.txt.
Traceback (most recent call last):
File "/usr/local/bin/ct2-transformers-converter", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 863, in main
converter.convert_from_args(args)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
return self.convert(
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 89, in convert
model_spec = self._load()
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 99, in _load
spec = loader(model, tokenizer)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 154, in __call__
self.set_config(spec.config, model, tokenizer)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 584, in set_config
range(config.decoder_layers // 2, config.decoder_layers),
AttributeError: 'WhisperConfig' object has no attribute 'decoder_layers'
Hello, I have never used async in Python before. I just wanted to ask: is it possible to get the full result of the transcription once it's done, instead of receiving it lazily?
Instead of this:
```python
def generate_segments(self, features, language, options):
    tokenized_segments = self.generate_tokenized_segments(
        features, language, options
    )

    for start, end, tokens in tokenized_segments:
        text = self.decode_text_tokens(tokens)

        if not text.strip():
            continue

        yield Segment(
            start=start,
            end=end,
            text=text,
        )
```
How would I go about returning an array of all segments? Like this:
```python
def generate_segments(self, features, language, options):
    tokenized_segments = self.generate_tokenized_segments(
        features, language, options
    )

    res = []
    for start, end, tokens in tokenized_segments:
        text = self.decode_text_tokens(tokens)

        if not text.strip():
            continue

        res.append(Segment(
            start=start,
            end=end,
            text=text,
        ))
    return res
```
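For what it's worth, the same effect is available from the public API without modifying the library, since materializing the generator is enough. A sketch, assuming a WhisperModel instance named model:

```python
segments, info = model.transcribe("audio.wav", beam_size=5)
all_segments = list(segments)  # runs the whole transcription eagerly
```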
I'm trying to reuse the same CTranslate2 model instance to handle multiple different 30 s chunks of audio from different streams. To prepare for that, I lifted WhisperModel.model and WhisperModel.tokenizer out into a core instance, so I can initialize a single WhisperModel from faster-whisper per worker while sharing the CTranslate2 Whisper model between them. I'm also handling prompt in the .transcribe() call.
However, I'm getting wildly different results where wording from one thread crosses over to the other thread. This continued even after I implemented a lock around WhisperModel's transcribe call and the subsequent segments iterator. It's as if the CTranslate2 Whisper model has internal state and I'm failing to clear it.
Does anyone have suggestions for what might cause this?
My code, for reference:
```python
def translationWorker(work_queue, language, primer: str, persist, task,
                      core: WhisperModelCore, start_ts: int, id: str):
    # model = whisper.load_model("large")
    options = {
        "language": language,
        "task": task if task is not None else ('translate' if language != 'en' else 'transcribe')
    }

    stripped_primer = ""
    if primer is not None and len(primer) > 3:
        stripped_primer = primer.strip() + " "
        options["initial_prompt"] = stripped_primer

    model = WhisperModel(core)
    first_send = True
    print("OpenAI Whisper Ready")

    while True:
        audio_chunks = work_queue.get()
        if audio_chunks == _DONE:
            print("Finishing up over here too")
            return "ok"

        first_chunk: ChunkRecord
        last_chunk: ChunkRecord
        first_chunk, last_chunk, audio = audio_chunks

        # Transcribe audio into subtitles
        unique = []
        with core.lock:
            out, b = model.transcribe(audio, **options)
            # out is of type Segment
            for seg in out:
                text: str
                start = seg.start
                end = seg.end
                text = seg.text
                stripped_text = text.strip()
                print(
                    colored("[" + str(first_chunk.chunk_id + start) + ":" +
                            str(first_chunk.chunk_id + end) + "]", "dark_grey"),
                    " :: ", colored(text, "green"))
                persist.append({
                    "relstart": start,
                    "relend": end,
                    "start": first_chunk.chunk_id + start,
                    "end": first_chunk.chunk_id + end,
                    "text": stripped_text})
                if len(unique) == 0 or unique[-1]["text"] != stripped_text:
                    unique.append({
                        "relstart": start,
                        "relend": end,
                        "start": first_chunk.chunk_id + start,
                        "end": first_chunk.chunk_id + end,
                        "text": stripped_text})
                else:
                    unique[-1]["end"] = first_chunk.chunk_id + end

        # process UNIQUE into pieces to prompt and prefix.
        local_overlap_split_at = (OVERLAP_LATENCY
                                  if 0 < OVERLAP_LATENCY < EXPECTED_CHUNK_DURATION
                                  else EXPECTED_CHUNK_DURATION)

        # context for the next run is going to be between 0 and OVERLAP_LATENCY.
        options["initial_prompt"] = stripped_primer + " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] < local_overlap_split_at, unique)))

        # prefix for the next run will be between local_overlap_split_at and the end.
        options["prefix"] = " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] >= local_overlap_split_at, unique)))

        send_candidates = list(map(lambda u: {
            "timestamp": (start_ts + u["start"]) * 1000,
            "text": u["text"],
            "duration": int((u["end"] - u["start"]) * 1000.0)
        }, filter(lambda u: first_send or u['relstart'] > (EXPECTED_CHUNK_DURATION - local_overlap_split_at), unique)))

        # context is going to be the OVERLAP level.
        send_translation(id, 'tl' if options["task"] == 'translate' else 'tc', send_candidates)
        # print(out)
        time.sleep(0.0001)
```
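One thing worth double-checking, since transcribe() is lazy: decoding for a chunk only runs while the segments iterator is being consumed, so the full iteration has to finish inside the critical section or two workers can interleave generation on the shared model. A sketch of forcing that, assuming the same core.lock:

```python
with core.lock:
    out, _ = model.transcribe(audio, **options)
    segments = list(out)  # decode completely while the lock is held
```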
I am testing the branch word-level-timestamps (dc780dc). I installed your fork of CTranslate2, https://github.com/guillaumekln/CTranslate2.git, on branch whisper-align.
When I try to run inference with word timestamps, I get:
Traceback (most recent call last):
[...]
segments, info = model.transcribe(audio)
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 206, in transcribe
results = self.model.detect_language(input)
RuntimeError: No SGEMM backend on CPU
I have seen this issue OpenNMT/CTranslate2#646 but in my case it looks different.
I ran cmake with the default options (I also tried the explicit option -DWITH_MKL=ON) and MKL is found:
-- Found MKL include directory: /opt/intel/oneapi/mkl/latest/include
-- Found MKL library directory: /usr/lib/x86_64-linux-gnu
Do you see what I might be missing?
Description:
While running the code, I encountered an error with the following message: "TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'". The error occurs when trying to transcribe an audio file with the 'word_timestamps' argument set to True.
Steps to Reproduce:
Expected Result:
The code should transcribe the audio file and print the start and end times of each word in the audio file.
Actual Result:
The code throws a TypeError with the message "WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'".
Error Message:
TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'
Code Snippet:
```python
from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2-ct2/"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

model = WhisperModel(model_path, device=device, compute_type=compute_type)

# Transcribe video and translate to English
with torch.no_grad():
    segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
    for segment in segments:
        for word in segment.words:
            print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```
Hey great job on this package. Already enjoying the improvements.
I found in your README the following:
Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.
However, the code linked below shows OpenAI's default beam size to be 5 as well.
Hello, first of all thank you very much for your work on this project; it really is much faster and consumes less RAM and VRAM.
I'm testing it and unfortunately a significant part of my audio has been cut.
My audio is in Portuguese and is 13 minutes long; apparently the problem only occurred at the beginning of it.
Is there a way to solve this problem?
I used the following code:
```python
from faster_whisper import WhisperModel

model_path = "whisper-medium-ct2/"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")

segments, info = model.transcribe("jota.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%ds -> %ds] %s" % (segment.start, segment.end, segment.text))
```
The result I got running the standard version of Whisper on the same medium model:
[00:00.000 --> 00:03.500] Afinal de contas, imprimir dinheiro gera ou não gera inflação?
--------------------- CUT ---------------------
[00:03.500 --> 00:05.500] É isso que a gente vai responder neste vídeo.
[00:05.500 --> 00:10.500] Música
[00:10.500 --> 00:13.500] Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[00:13.500 --> 00:16.500] aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[00:16.500 --> 00:21.000] considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[00:21.000 --> 00:24.500] Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[00:24.500 --> 00:30.500] o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
--------------------- CUT ---------------------
[00:30.500 --> 00:34.500] E o debate em torno da relação entre impressão de moeda e inflação,
[00:34.500 --> 00:38.500] ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[00:38.500 --> 00:42.500] quando muitos economistas, banqueiros centrais, políticos,
[00:42.500 --> 00:47.500] de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[00:47.500 --> 00:50.500] afirmavam categoricamente que imprimir dinheiro
[00:50.500 --> 00:54.500] não geraria inflação naquele momento, naquelas circunstâncias.
[00:54.500 --> 00:57.500] E a verdade é que não é tão simples responder essa pergunta,
[00:57.500 --> 01:02.500] porque imprimir dinheiro não necessariamente vai gerar inflação,
[01:02.500 --> 01:05.500] depende de outros fatores, depende das circunstâncias.
[01:05.500 --> 01:09.500] Mas sim que imprimir dinheiro é sempre um fator inflacionário.
The result I got running this faster version of Whisper:
Detected language 'pt' with probability 0.996094
[0s -> 3s] Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[30s -> 36s] E o debate em torno da relação entre impressão de moeda e inflação resurge de tempos em tempos,
[36s -> 42s] como foi lá no início da pandemia, quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s] de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 54s] afirmavam categoricamente que imprimir dinheiro não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 58s] E a verdade é que não é tão simples responder essa pergunta, porque
[58s -> 65s] imprimir dinheiro não necessariamente vai gerar inflação, depende de outros fatores, depende das circunstâncias.
[65s -> 69s] Mas sim que imprimir dinheiro é sempre um fator inflacionário.
As you can see, a part was cut off at the beginning of my audio. In case you want to test my audio to see if you get my results: https://www.dropbox.com/s/m0q30hmzbx6mvt2/jota.mp3?dl=1
And another question: is it possible to get the return as an SRT or VTT file, like the standard Whisper?
Thank you very much.
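On the SRT/VTT question: faster-whisper returns Segment objects rather than writing subtitle files, but a small helper can produce an SRT from them. A minimal sketch, reusing the model from the script above (format_timestamp is my own helper, and note the segments generator can only be consumed once):

```python
def format_timestamp(seconds: float) -> str:
    # SRT timestamps look like HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

segments, info = model.transcribe("jota.mp3", beam_size=5)
with open("jota.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(segments, start=1):
        srt.write(f"{i}\n")
        srt.write(f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n")
        srt.write(segment.text.strip() + "\n\n")
```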
Hello!
Everything is fine, superb! I really like your code on my old first-generation Intel Core i3 (no cpp version works on such an old CPU), but there is a strange problem with Zaz:
https://www.youtube.com/results?search_query=zaz+belle++live+
With whisper it's OK, no problem with the file; I downloaded it again, but the result is exactly the same. With faster-whisper something is wrong: not exactly wrong recognition, it's something with time shift, etc.
It's also not a problem with French; other Zaz songs are fine.
Thank you very much.
Thank you for releasing the code.
Since this implementation requires less memory than other implementations, adding VAD (voice activity detection) would be a good fit. Voice activity detection makes Whisper more accurate, especially for non-English audio.
Would it be possible to add this?
Thank you.
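For reference, later faster-whisper releases gained exactly this: a Silero-VAD-based filter exposed on transcribe(). A sketch, assuming a version that includes it:

```python
# Drop non-speech audio before decoding with the built-in Silero VAD filter
segments, info = model.transcribe("audio.wav", vad_filter=True)
```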
I recently tried this wonderful tool on the CPU of my Windows 10 machine and got quite good results. But when I tried the GPU via model = WhisperModel(model_path, device="cuda", compute_type="float16"),
I received the following error: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.
I have a GTX 1050 Ti and the main driver is 31.0.15.1694. How can I fix this error and run on my GPU card?
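A GTX 1050 Ti (Pascal, compute capability 6.1) has no efficient FP16 path, which is what the error is saying. Requesting a type the card supports avoids it; a sketch, assuming the same model_path:

```python
# Fall back to a compute type Pascal GPUs handle well
model = WhisperModel(model_path, device="cuda", compute_type="float32")
```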
Hi,
I have a simple script to run inference on a WAV file. I noticed that when word_timestamps=True, the processing time is much longer.
I'm using the same WAV file in each of these cases; you can see the duration below for each:
model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=True, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 310.73524594306946s
model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=False, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 225.86893439292908s
Is this expected? Or is there some optimization that could be done for word-level timestamps?
Thanks again for this project. For context, I'm testing it transcribing a live public radio stream; I appreciate the rapid speed and low memory, as it's most useful for providing near-live transcription.
The radio stream is maybe 60% voice on a studio mic, 30% phone voice, 5% voice talking over music, and 5% music.
I have a simple Python script continuously feeding 30 s chunks from the live radio stream into faster-whisper.
I'm using the base model on a cheap VPS with just 2 GB RAM. I'm sure I could get better results with a higher-spec machine, but it's a proof of concept; it would be useful to run it across a large number of different streams here.
Most 30 s chunks take between 6-8 s to transcribe, which is perfect, but roughly 1 in 10 can blow out to between 20-50 s.
I haven't quite figured out what causes it; I wonder if it's when the 30 s chunk has a mix of music and talk, or a mix of different audio sources. Was wondering if, in your experience, you could shed light on the reason? Would a larger model stop the blowouts?
Description
I noticed that the translation feature in Faster-Whisper does not seem to translate the language fully for certain audio/video files. It appears to (at random) only translate parts of the language into English, whereas Whisper is mostly capable of translating the entire language. Is there a reason for this difference in translation capabilities?
Reproduction Steps
```python
from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

# Run on GPU with INT8 or FP16
model = WhisperModel(model_path, device=device, compute_type=compute_type)

# Transcribe video and translate to English
with torch.no_grad():
    segments, info = model.transcribe("test.mp4", beam_size=5, task="translate")

    # Print transcription and translation segments
    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
    for segment in segments:
        print("\n[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
Expected Behavior
The entire language in the video file should be translated into English.
Actual Behavior
The translation feature only translates/transcribes the English portion of the video.
Additional Information
I have attached the video file used to test the translation feature.
16:52:07 kris ~/faster-whisper $ python test_transcription.py
Traceback (most recent call last):
  File "/home/kris/faster-whisper/test_transcription.py", line 8, in <module>
    model = WhisperModel(model_path, device="cpu", compute_type="float16")
  File "/home/kris/faster-whisper/faster_whisper/transcribe.py", line 71, in __init__
    self.model = ctranslate2.models.Whisper(
ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.
I get somewhat the same error with whisper; however, it automatically falls back to float32. Is there a possible fix for this?
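On CPU, float16 is similarly unsupported. Passing "auto" lets CTranslate2 pick the best supported type, mirroring how openai/whisper silently falls back to float32. A sketch:

```python
# Let CTranslate2 choose a supported compute type on this CPU
model = WhisperModel(model_path, device="cpu", compute_type="auto")
```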
In the transcribe.py code I don't see the initial_prompt parameter. Is it somewhere? Can it be added?
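For reference, later releases added this parameter to transcribe(). Assuming a version that includes it, usage looks like:

```python
# Bias the first window's decoding with a prompt, as in openai/whisper
segments, info = model.transcribe(
    "audio.wav",
    initial_prompt="Glossary: CTranslate2, Whisper, beam search.",
)
```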
Sometimes a sentence repeats itself multiple times at the end of a segment, and may continue to repeat in subsequent segments.
This was a known issue of openai/whisper (upstream issue openai/whisper#977 and openai/whisper#1059), and may be fixed by openai/whisper@38f2f4d
When I use "faster-whisper", I encountered the same sentence repetition. I found it's also reported on #35 (comment)
Could you please check if this commit openai/whisper@38f2f4d can/should be ported here?
If I want to host this on a server, which type of server should I use for CPU and for GPU?
I researched a little; for CPU I came across Linode.
I have no experience hosting things like this; I'd appreciate it if anyone can answer.
First of all, I would like to say that I am very grateful for this project. I have a fairly unique problem: the transcript results obtained from mono audio and stereo audio are quite different. The transcripts from stereo audio are better than those from mono audio, even though we know that before transcription the audio is converted to mono.
Parameters used:
compute_type : int8
model : large
device: gpu
beam_size: 5
language: 'id'
Audio link: here
Will this project support ".pt" files instead of the bin models hosted on Hugging Face? Or could this project store converted models in some public storage?
Because the bin models on Hugging Face are very large and my network is pretty slow, downloading a model of this size is a big challenge for me.
Hi,
First, let me start by saying great job. This is awesome! It's crazy how much faster this is than openai/whisper
. I really appreciate the effort here.
I am running into a few issues and just want to better understand them.
I am on the latest ctranslate2==3.6.0 and am using the medium model in both cases with beam_size=5 set. On faster-whisper, I get:
______ available?î Yeah. Speaker? ______, this is
And on openai/whisper, I get:
I am using [NAME] here.
__________ Hello. __________ Hello, is [NAME] available? __________ Yes, speaking. __________ Hi,
I have other examples of this as well but need to redact the text before I can post them.
I've also noticed non-English characters in the output for faster-whisper on English-only audio. And even on 3.6.0, I still appear to get the beginning of my audio chopped off (but not all the time); it's seemingly random.
Is it normal to have some differences? Is there some config difference I am missing between the two?
While the speed increases are great, the inconsistencies are enough that we can't really use this over openai/whisper for our tasks. Anything I can do to help debug, let me know.
As a side note, I've been testing medium.en: the random characters do not seem to happen and accuracy appears better for faster-whisper compared to faster-whisper's multilingual model, whereas medium for openai/whisper appears to work fine.
Thanks in advance!
Hey, when I provide a language to the transcribe method, like segments, info = model.transcribe(file_name, language="english", beam_size=5),
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 25
20 segments, info = model.transcribe(file_name, language="english", beam_size=5)
21
23 print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
---> 25 for segment in segments:
26 print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
(...)
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:187, in WhisperModel.generate_segments(self, features, language, options)
182 def generate_segments(self, features, language, options):
183 tokenized_segments = self.generate_tokenized_segments(
184 features, language, options
185 )
--> 187 for start, end, tokens in tokenized_segments:
188 text = self.decode_text_tokens(tokens)
189 if not text.strip():
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:224, in WhisperModel.generate_tokenized_segments(self, features, language, options)
216 previous_tokens = all_tokens[prompt_reset_since:]
217 prompt = self.get_prompt(
218 language,
219 previous_tokens,
220 task=options.task,
221 without_timestamps=options.without_timestamps,
222 )
--> 224 result, temperature = self.generate_with_fallback(segment, prompt, options)
226 if (
227 result.no_speech_prob > options.no_speech_threshold
228 and result.scores[0] < options.log_prob_threshold
229 ):
230 offset += segment.shape[-1]
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:315, in WhisperModel.generate_with_fallback(self, segment, prompt, options)
309 kwargs = {
310 "beam_size": options.beam_size,
311 "patience": options.patience,
312 }
314 final_temperature = temperature
--> 315 result = self.model.generate(
316 features,
317 [prompt],
318 max_length=max_length,
319 return_scores=True,
320 return_no_speech_prob=True,
321 **kwargs,
322 )[0]
324 tokens = result.sequences_ids[0]
325 text = self.decode_text_tokens(tokens)
TypeError: generate(): incompatible function arguments. The following argument types are supported:
1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]
Invoked with: <ctranslate2._ext.Whisper object at 0x7f19f702bf30>, <ctranslate2._ext.StorageView object at 0x7f19fef2df70>, [[50258, None, 50359]]; kwargs: max_length=448, return_scores=True, return_no_speech_prob=True, beam_size=5, patience=1
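The None inside the invoked prompt [[50258, None, 50359]] is the giveaway: the language token lookup failed because faster-whisper expects ISO 639-1 language codes, not English names. A sketch of the fix:

```python
# Use the language code ("en"), not the language name ("english")
segments, info = model.transcribe(file_name, language="en", beam_size=5)
```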
Hi, I really appreciate you sharing this implementation.
I found it to be very fast with accurate results.
I do not see word-level timestamps in the result. Are word-level timestamps possible?
Being able to have an option to pass the audio buffer directly to faster-whisper instead of creating a file would be really great.
I can see that currently only 30-second chunks are supported by the CTranslate2 model: shorter chunks are padded to 30 s so that model.generate accepts exclusively [batch_size, 80, 3000] inputs.
In some real-time applications, shorter chunks may be used, and the original Whisper model supports shorter chunks despite being trained on 30 s. Would it be possible to allow shorter chunks for faster inference, in contrast to always padding to 30 s?
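On the buffer question: newer faster-whisper releases accept an in-memory waveform (a 16 kHz mono float32 numpy array) directly, so no temporary file is needed. A sketch, assuming such a version:

```python
import numpy as np

# 16 kHz mono float32 is the format Whisper's frontend expects
audio = np.zeros(16000 * 5, dtype=np.float32)  # five seconds of silence as a stand-in
segments, info = model.transcribe(audio, beam_size=5)
```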
Hello, I would like to know if it's possible to add the "--word_timestamps" option to Faster Whisper now, since this new option has been added to the official Whisper repository. It would be very helpful if this option could be included in Faster Whisper. Thank you in advance.
I can't thank you enough for creating faster-whisper. This is faster than whisper.cpp. This is awesome; thank you so much.
I have been unable to convert the model.
Regardless of whether I have tried with or without quantization, or with different models, I have unfortunately had no success.
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output
Downloading: 100%|██████████| 1.92k/1.92k [00:00<00:00, 486kB/s]
Downloading: 100%|██████████| 151M/151M [00:04<00:00, 37.2MB/s]
Segmentation fault: 11
192:faster-whisper$ /usr/local/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '
^^ this one just freezes
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output --quantization float16
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output --quantization float16
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output --quantization float16
Segmentation fault: 11
Hi @guillaumekln,
we've discussed this a bit in #9, but I think it's worth creating an extra issue to keep track of it.
In my tests on Raspberry Pi 4 and Orange Pi 5, Whisper.cpp is actually slower than the original Whisper. Here is an excerpt of the results:
Test date: 2023.02.17
| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
|---|---|---|---|---|---|---|---|
| Whisper original | tiny | 1 | 4 | - | 5.9s | 0.54 | perfect |
| Whisper original | tiny | 2 | 4 | - | 4.3s | 1.19 | perfect |
| Whisper Cpp | ggml-tiny | 1 | 4 | - | 9.1s | 0.83 | perfect |
| Whisper Cpp | ggml-tiny | 2 | 4 | - | 8.6s | 2.39 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 8.4s | 0.76 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 8.0s | 2.22 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 3.9s | 0.36 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 3.2s | 0.90 | perfect |
Test date: 2023.02.19
| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
|---|---|---|---|---|---|---|---|
| Whisper original | tiny | 1 | 4 | - | 3.0s | 0.27 | perfect |
| Whisper original | tiny | 2 | 4 | - | 1.9s | 0.53 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 3.7s | 0.34 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 3.5s | 0.97 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 1.3s | 0.12 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 1.4s | 0.39 | perfect |
I've repeated the tests yesterday on Orange Pi 5 with similar results.
Currently hitting this exception in this block of code. It looks like Hugging Face got rid of tiny.
```python
tokenizer_file = os.path.join(model_path, "tokenizer.json")
if os.path.isfile(tokenizer_file):
    self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
else:
    self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
        "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
    )
```
Just wanted to say thanks for this great port of Whisper to CTranslate2.
I've done some tests and compared it to other ports, like a TFLite version and a C++ version, on Raspberry Pi 4. You can find the results here.
In conclusion, it is as fast as the TFLite version, but smaller, and it has the better API right now 🙂 👍.
Hi again!
I am running into two different issues consistently, mainly with av, but I am not sure if you've seen these before.
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input
and
invalid new backstep -1
with libav.mp3float.
Is there a newer version of av to use, or should I just do the conversion myself and pass down .wav files? Or is it possible my version of ffmpeg is older, since av is just a binding?
Is there any possible way to assign more CPU to this script? Honestly, it's super fast on my Windows machine. However, I've discovered that it only uses maybe 60%-70% of the CPU, so is there any way to make full use of it? Or is there any other way to improve the speed without losing quality?
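On CPU, thread usage is controlled when constructing the model: cpu_threads sets intra-op threads and num_workers allows concurrent transcriptions. A sketch; tune the counts to your core count:

```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "medium",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,   # intra-op parallelism per transcription
    num_workers=1,   # increase for concurrent transcribe() calls
)
```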
I really appreciate your project; I was just comparing it with whisper.cpp. I found the timestamps are not accurate, whereas whisper.cpp's seem accurate. The model is whisper-large-v2.
whisper.cpp
[00:00:00.000 --> 00:00:03.440] [MUSIC PLAYING]
[00:00:03.440 --> 00:00:03.940] Impossible.
[00:00:03.940 --> 00:00:10.440] A woman leading a man's army.
[00:00:10.440 --> 00:00:16.440] It is my duty to fight for the kingdom.
[00:00:16.440 --> 00:00:25.940] The girl who has come to save the dynasty.
[00:00:27.380 --> 00:00:30.380] [SCREAMING]
[00:00:30.380 --> 00:00:35.380] You will die pretending to be something you are not.
[00:00:35.380 --> 00:00:40.380] Get here, I stand.
[00:00:40.380 --> 00:00:48.380] I'm Hua Mulan.
[00:00:48.380 --> 00:00:51.380] I will bring honor to us all.
[00:00:51.380 --> 00:00:53.880] Disney's Mulan, rated PG-13.
[00:00:53.880 --> 00:00:55.880] Streaming September 4th.
[00:00:55.880 --> 00:00:58.380] Exclusively available to Disney+ subscribers
[00:00:58.380 --> 00:01:00.740] with Premier Access.
faster-whisper
[0.00s -> 5.00s] Impossible.
[5.00s -> 11.00s] A woman leading a man's army.
[11.00s -> 17.00s] It is my duty to fight for the kingdom.
[17.00s -> 26.00s] A girl who has come to save the dynasty.
[26.00s -> 39.00s] You will die pretending to be something you're not.
[39.00s -> 47.00s] Yet here I stand.
[47.00s -> 51.00s] I'm Hua Mulan. I will bring honor to us all.
[51.00s -> 56.00s] Disney's Mulan. Rated PG-13. Streaming September 4th.
[56.00s -> 74.00s] Exclusively available to Disney Plus subscribers with Premiere Access.
Here is my test audio link. Please try it; you will find the timestamps are incorrect.
https://stream.lestream.cn/source.mp3
How are you supposed to run this? I'm on Windows 10.
Got error when transcribing segments.
Traceback (most recent call last):
File "E:\AI\faster-whisper\trans.py", line 102, in <module>
gensrt(segments, output_file, True)
File "E:\AI\faster-whisper\trans.py", line 55, in gensrt
for i, segment in enumerate(segments):
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 389, in generate_segments self.add_word_timestamps(
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 547, in add_word_timestamps
alignment = self.find_alignment(tokenizer, text_tokens, mel, num_frames)
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 617, in find_alignment
start_times = jump_times[word_boundaries[:-1]]
IndexError: index 1 is out of bounds for axis 0 with size 1
Hugging Face provides fine-tuning code: https://huggingface.co/blog/fine-tune-whisper
These files are what I obtained after fine-tuning.
When I use the following command, I get the following error:
ct2-transformers-converter --model ./checkpoint-90 --output_dir ./tmp
I compared with the Hugging Face model and found that many files are missing: https://huggingface.co/openai/whisper-small/tree/main
Please tell me how to get the missing files. Seeking help.
Hello,
Thank you for this great library!
Is there any way we can chunk the initial audio into shorter samples, say 50 seconds each, run inference on those, and end up with a final reconstruction?
I came across this article and I wonder if it's possible to get it working here. Any ideas if this is possible?
OS: Arch Linux x86_64, python-numpy-1.24.2, CUDA v12.1, all other system packages at latest versions.
I'm using a GPU for transcription, with cuda-12.1.0-1-x86_64.pkg.tar.zst installed via pacman. Transcription begins just as it did with CUDA v11.8 installed, but an error message is shown in the output:
Traceback (most recent call last):
File "faster-whisper/faster.py", line 15, in <module>
segments, info = model.transcribe(sys.argv[1], beam_size=5)
File "faster_whisper/transcribe.py", line 207, in transcribe
results = self.model.detect_language(input)
RuntimeError: Library libcublas.so.11 is not found or cannot be loaded
cd /var/cache/pacman/pkg
# downgrade CUDA and dependencies to CUDA v11.8, using older package files
sudo pacman -U cuda-11.8.0-1-x86_64.pkg.tar.zst cudnn-8.6.0.163-1-x86_64.pkg.tar.zst
Is there any Google Colab notebook for implementation? It would be very good for people who have no access to GPUs.
First of all, this is amazing work. It mops the floor with vanilla Whisper on Apple M1 chips.
However, I noticed that it only supports float32 types on Apple M1; all other types seem to fall back to that, presumably due to ctranslate2. I read that the Apple Neural Engine supports fp16, int16, and int8 types. Any chance we can support those via the ANE? That might take the performance to an even higher level.
thanks!
Hello, great work! I experimented a bit with this and came across an anomaly. While transcribing George Bush's Columbia talk, the memory stays around 2.5 GB, but then I encounter a sudden spike beyond 3.5 GB (in VRAM in case of GPU usage, and in RAM in case of CPU usage) when using int8, after all spoken text was already out of the model. Is it due to silence at the end or some additional operations? Would you know why this happens and how to prevent it?
!wget https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
```python
from faster_whisper import WhisperModel

model_path = "whisper-large-v2-ct2/"
model = WhisperModel(model_path, device="cuda", compute_type="int8")
segments, info = model.transcribe(
    "./George_W_Bush_Columbia_FINAL.ogg",
    beam_size=1,
    language="en",
    condition_on_previous_text=False,
)
```
The output:
...
[183.06s -> 185.82s] are safely home.
[185.82s -> 192.62s] May God bless the grieving families and may God continue to bless America.
['transcribe /home/ubuntu/src/faster-whisper/run.py:10', 'time_delta', 44.965]
Traceback (most recent call last):
File "/home/ubuntu/src/faster-whisper/run.py", line 16, in <module>
for segment in segments:
File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 285, in generate_segments
result, avg_log_prob, temperature = self.generate_with_fallback(
File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 461, in generate_with_fallback
result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
I installed the master version with
pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
How do I update it?
Thank you.
Thanks for the excellent work on this package. A quick query: I've been looking at the following suggestions on how to get word confidence from vanilla Whisper in greedy mode. Could you provide some pointers on how to implement this in the faster-whisper / CTranslate2 implementation, as the code here deviates significantly? github.com/openai/whisper/discussions/284