systran / faster-whisper
Faster Whisper transcription with CTranslate2
License: MIT License
Would it be possible to produce shorter segments? (some are way too long)
GPU: A100
wav:test.zip
code:
```python
from tqdm import tqdm
import time
from faster_whisper import WhisperModel
import os

os.environ["OMP_NUM_THREADS"] = "4"

audio_path = 'test.wav'
gpu_model_path = "whisper-large-v2-ct2-float16/"
cpu_model_path = "whisper-large-v2-ct2-int8/"

# Run on GPU with FP16
gpu_model = WhisperModel(gpu_model_path, device="cuda", compute_type="float16")
# or run on CPU with INT8
cpu_model = WhisperModel(cpu_model_path, device="cpu", compute_type="int8")

startTime = time.time()
print('Transcribing with gpu model')
segments, info = gpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('gpu model: %s' % (intermediateTime - startTime))

print('-' * 100)

startTime = time.time()
print('Transcribing with cpu model')
segments, info = cpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('cpu model: %s' % (intermediateTime - startTime))
```
log:
Transcribing with gpu model
gpu model: 0.012365579605102539
----------------------------------------------------------------------------------------------------
Transcribing with cpu model
cpu model: 0.006761789321899414
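The near-zero timings in this log are a measurement artifact: transcribe() returns a lazy generator, so decoding only runs while the segments are iterated. A minimal sketch of timing the full transcription, reusing the model and options above:

```python
# Consume the generator to measure the actual decoding time
startTime = time.time()
segments, info = gpu_model.transcribe(audio_path, beam_size=5, language='zh')
segments = list(segments)  # force the whole transcription to run
print('gpu model: %.2fs for %d segments' % (time.time() - startTime, len(segments)))
```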
With openai/whisper I can select the GPU with device="cuda:0" or device="cuda:1". With faster-whisper, I get this error:
ValueError: unsupported device cuda:1
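For reference, faster-whisper selects GPUs through a separate device_index argument rather than a "cuda:N" string. A sketch, assuming the same model path:

```python
from faster_whisper import WhisperModel

# Select the second GPU via device_index instead of "cuda:1"
model = WhisperModel(model_path, device="cuda", device_index=1, compute_type="float16")
```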
❯ python3 transcribe.py
Traceback (most recent call last):
File "/home/user/git/faster-whisper/transcribe.py", line 13, in <module>
segments, info = model.transcribe("audio.opus", beam_size=5)
File "/home/user/git/faster-whisper/faster_whisper/transcribe.py", line 156, in transcribe
audio = decode_audio(
File "/home/user/git/faster-whisper/faster_whisper/audio.py", line 27, in decode_audio
fifo.write(new_frame)
File "av/audio/fifo.pyx", line 25, in av.audio.fifo.AudioFifo.write
File "av/audio/fifo.pyx", line 90, in av.audio.fifo.AudioFifo.write
File "av/error.pyx", line 336, in av.error.err_check
av.error.ValueError: [Errno 22] Invalid argument
The target file is an Opus file. Is MP3 the only supported file type? Which codecs should the file use?
Edit: it seems to work on smaller/shorter files. The one with the error is a ~12-hour video I wanted to test on.
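As a possible workaround while this is investigated, very long files can be pre-decoded with ffmpeg to the 16 kHz mono WAV that Whisper expects, bypassing the av decoding path entirely. A sketch; the file names are placeholders:

```python
import subprocess

# Pre-decode the long Opus file to 16 kHz mono WAV before transcribing
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.opus", "-ar", "16000", "-ac", "1", "audio.wav"],
    check=True,
)
```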
Hi,
ctranslate2 version 3.9.0 raises the error below; I worked around it by pinning version 3.8.* in requirements.txt.
Traceback (most recent call last):
File "/usr/local/bin/ct2-transformers-converter", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 863, in main
converter.convert_from_args(args)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
return self.convert(
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 89, in convert
model_spec = self._load()
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 99, in _load
spec = loader(model, tokenizer)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 154, in __call__
self.set_config(spec.config, model, tokenizer)
File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 584, in set_config
range(config.decoder_layers // 2, config.decoder_layers),
AttributeError: 'WhisperConfig' object has no attribute 'decoder_layers'
Hello, I have never used async in Python before. I just wanted to ask: is it possible to get the full result of the transcription once it's done, instead of receiving it lazily?
Instead of this:
```python
def generate_segments(self, features, language, options):
    tokenized_segments = self.generate_tokenized_segments(
        features, language, options
    )

    for start, end, tokens in tokenized_segments:
        text = self.decode_text_tokens(tokens)

        if not text.strip():
            continue

        yield Segment(
            start=start,
            end=end,
            text=text,
        )
```
How would I go about returning an array of all segments? Like this:
```python
def generate_segments(self, features, language, options):
    tokenized_segments = self.generate_tokenized_segments(
        features, language, options
    )

    res = []
    for start, end, tokens in tokenized_segments:
        text = self.decode_text_tokens(tokens)

        if not text.strip():
            continue

        res.append(Segment(
            start=start,
            end=end,
            text=text,
        ))
    return res
```
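For what it's worth, the same effect is available from the public API without modifying the library, since materializing the generator is enough. A sketch, assuming a WhisperModel instance named model:

```python
segments, info = model.transcribe("audio.wav", beam_size=5)
all_segments = list(segments)  # runs the whole transcription eagerly
```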
I'm trying to reuse the same CTranslate2 model instance to handle multiple different 30 s chunks of audio from different streams. To prepare for that, I lifted WhisperModel.model and WhisperModel.tokenizer out into a core instance, so I can initialize a single WhisperModel from faster-whisper per worker while sharing the CTranslate2 Whisper model between them. I'm also handling prompt in the .transcribe() call.
However, I'm getting wildly different results where wording from one thread crosses over to the other thread. This continued even after I implemented a lock around WhisperModel's transcribe call and the subsequent segments iterator. It's as if the CTranslate2 Whisper model has internal state and I'm failing to clear it.
Does anyone have suggestions for what might cause this?
My code, for reference:
```python
def translationWorker(work_queue, language, primer: str, persist, task,
                      core: WhisperModelCore, start_ts: int, id: str):
    # model = whisper.load_model("large")
    options = {
        "language": language,
        "task": task if task is not None else ('translate' if language != 'en' else 'transcribe')
    }

    stripped_primer = ""
    if primer is not None and len(primer) > 3:
        stripped_primer = primer.strip() + " "
        options["initial_prompt"] = stripped_primer

    model = WhisperModel(core)
    first_send = True
    print("OpenAI Whisper Ready")

    while True:
        audio_chunks = work_queue.get()
        if audio_chunks == _DONE:
            print("Finishing up over here too")
            return "ok"

        first_chunk: ChunkRecord
        last_chunk: ChunkRecord
        first_chunk, last_chunk, audio = audio_chunks

        # Transcribe audio into subtitles
        unique = []
        with core.lock:
            out, b = model.transcribe(audio, **options)
            # out is of type Segment
            for seg in out:
                text: str
                start = seg.start
                end = seg.end
                text = seg.text
                stripped_text = text.strip()
                print(
                    colored("[" + str(first_chunk.chunk_id + start) + ":" +
                            str(first_chunk.chunk_id + end) + "]", "dark_grey"),
                    " :: ", colored(text, "green"))
                persist.append({
                    "relstart": start,
                    "relend": end,
                    "start": first_chunk.chunk_id + start,
                    "end": first_chunk.chunk_id + end,
                    "text": stripped_text})
                if len(unique) == 0 or unique[-1]["text"] != stripped_text:
                    unique.append({
                        "relstart": start,
                        "relend": end,
                        "start": first_chunk.chunk_id + start,
                        "end": first_chunk.chunk_id + end,
                        "text": stripped_text})
                else:
                    unique[-1]["end"] = first_chunk.chunk_id + end

        # process UNIQUE into pieces to prompt and prefix.
        local_overlap_split_at = (OVERLAP_LATENCY
                                  if 0 < OVERLAP_LATENCY < EXPECTED_CHUNK_DURATION
                                  else EXPECTED_CHUNK_DURATION)

        # context for the next run is going to be between 0 and OVERLAP_LATENCY.
        options["initial_prompt"] = stripped_primer + " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] < local_overlap_split_at, unique)))

        # prefix for the next run will be between local_overlap_split_at and the end.
        options["prefix"] = " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] >= local_overlap_split_at, unique)))

        send_candidates = list(map(lambda u: {
            "timestamp": (start_ts + u["start"]) * 1000,
            "text": u["text"],
            "duration": int((u["end"] - u["start"]) * 1000.0)
        }, filter(lambda u: first_send or u['relstart'] > (EXPECTED_CHUNK_DURATION - local_overlap_split_at), unique)))

        # context is going to be the OVERLAP level.
        send_translation(id, 'tl' if options["task"] == 'translate' else 'tc', send_candidates)
        # print(out)
        time.sleep(0.0001)
```
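One thing worth double-checking, since transcribe() is lazy: decoding for a chunk only runs while the segments iterator is being consumed, so the full iteration has to finish inside the critical section or two workers can interleave generation on the shared model. A sketch of forcing that, assuming the same core.lock:

```python
with core.lock:
    out, _ = model.transcribe(audio, **options)
    segments = list(out)  # decode completely while the lock is held
```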
I am testing the branch word-level-timestamps (dc780dc). I installed your fork of CTranslate2, https://github.com/guillaumekln/CTranslate2.git, on branch whisper-align.
When I try to run inference with word timestamps, I get:
Traceback (most recent call last):
[...]
segments, info = model.transcribe(audio)
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 206, in transcribe
results = self.model.detect_language(input)
RuntimeError: No SGEMM backend on CPU
I have seen this issue OpenNMT/CTranslate2#646 but in my case it looks different.
I ran cmake with the default options (I also tried the explicit option -DWITH_MKL=ON) and MKL is found:
-- Found MKL include directory: /opt/intel/oneapi/mkl/latest/include
-- Found MKL library directory: /usr/lib/x86_64-linux-gnu
Do you see what I might be missing?
Description:
While running the code, I encountered an error with the following message: "TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'". The error occurs when trying to transcribe an audio file with the 'word_timestamps' argument set to True.
Steps to Reproduce:
Expected Result:
The code should transcribe the audio file and print the start and end times of each word in the audio file.
Actual Result:
The code throws a TypeError with the message "WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'".
Error Message:
TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'
Code Snippet:
```python
from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2-ct2/"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

model = WhisperModel(model_path, device=device, compute_type=compute_type)

# Transcribe video and translate to English
with torch.no_grad():
    segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
    for segment in segments:
        for word in segment.words:
            print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```
Hey great job on this package. Already enjoying the improvements.
I found in your README the following:
Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.
However, the code linked below shows OpenAI's default beam size to be 5 as well.
Hello, first of all thank you very much for your work on this project; it really is much faster and consumes less RAM and VRAM.
I'm testing it and unfortunately a significant part of my audio has been cut.
My audio is in Portuguese and is 13 minutes long; apparently the problem only occurred at the beginning of it.
Is there a way to solve this problem?
I used the following code:
```python
from faster_whisper import WhisperModel

model_path = "whisper-medium-ct2/"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")

segments, info = model.transcribe("jota.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%ds -> %ds] %s" % (segment.start, segment.end, segment.text))
```
The result I got running the standard version of Whisper on the same medium model:
[00:00.000 --> 00:03.500] Afinal de contas, imprimir dinheiro gera ou não gera inflação?
--------------------- CUT ---------------------
[00:03.500 --> 00:05.500] É isso que a gente vai responder neste vídeo.
[00:05.500 --> 00:10.500] Música
[00:10.500 --> 00:13.500] Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[00:13.500 --> 00:16.500] aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[00:16.500 --> 00:21.000] considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[00:21.000 --> 00:24.500] Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[00:24.500 --> 00:30.500] o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
--------------------- CUT ---------------------
[00:30.500 --> 00:34.500] E o debate em torno da relação entre impressão de moeda e inflação,
[00:34.500 --> 00:38.500] ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[00:38.500 --> 00:42.500] quando muitos economistas, banqueiros centrais, políticos,
[00:42.500 --> 00:47.500] de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[00:47.500 --> 00:50.500] afirmavam categoricamente que imprimir dinheiro
[00:50.500 --> 00:54.500] não geraria inflação naquele momento, naquelas circunstâncias.
[00:54.500 --> 00:57.500] E a verdade é que não é tão simples responder essa pergunta,
[00:57.500 --> 01:02.500] porque imprimir dinheiro não necessariamente vai gerar inflação,
[01:02.500 --> 01:05.500] depende de outros fatores, depende das circunstâncias.
[01:05.500 --> 01:09.500] Mas sim que imprimir dinheiro é sempre um fator inflacionário.
The result I got running this faster version of Whisper:
Detected language 'pt' with probability 0.996094
[0s -> 3s] Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[30s -> 36s] E o debate em torno da relação entre impressão de moeda e inflação resurge de tempos em tempos,
[36s -> 42s] como foi lá no início da pandemia, quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s] de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 54s] afirmavam categoricamente que imprimir dinheiro não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 58s] E a verdade é que não é tão simples responder essa pergunta, porque
[58s -> 65s] imprimir dinheiro não necessariamente vai gerar inflação, depende de outros fatores, depende das circunstâncias.
[65s -> 69s] Mas sim que imprimir dinheiro é sempre um fator inflacionário.
As you can see, a part was cut off at the beginning of my audio. In case you want to test my audio to see if you get my results: https://www.dropbox.com/s/m0q30hmzbx6mvt2/jota.mp3?dl=1
And another question: is it possible to get the return as an SRT or VTT file, like the standard Whisper?
Thank you very much.
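On the SRT/VTT question: faster-whisper returns Segment objects rather than writing subtitle files, but a small helper can produce an SRT from them. A minimal sketch, reusing the model from the script above (format_timestamp is my own helper, and note the segments generator can only be consumed once):

```python
def format_timestamp(seconds: float) -> str:
    # SRT timestamps look like HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

segments, info = model.transcribe("jota.mp3", beam_size=5)
with open("jota.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(segments, start=1):
        srt.write(f"{i}\n")
        srt.write(f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n")
        srt.write(segment.text.strip() + "\n\n")
```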
Hello!
Everything is fine, superb! I really like your code on my old first-generation Intel Core i3 (no cpp version works on such an old CPU), but there is a strange problem with Zaz:
https://www.youtube.com/results?search_query=zaz+belle++live+
With whisper it's OK, no problem with the file; I downloaded it again, but the result is exactly the same. With faster-whisper something is wrong: not exactly wrong recognition, it's something with time shift, etc.
It's also not a problem with French; other Zaz songs are fine.
Thank you very much.
Thank you for releasing the code.
Since this implementation requires less memory than other implementations, adding VAD (voice activity detection) would be a good fit. Voice activity detection makes Whisper more accurate, especially for non-English audio.
Would it be possible to add this?
Thank you.
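For reference, later faster-whisper releases gained exactly this: a Silero-VAD-based filter exposed on transcribe(). A sketch, assuming a version that includes it:

```python
# Drop non-speech audio before decoding with the built-in Silero VAD filter
segments, info = model.transcribe("audio.wav", vad_filter=True)
```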
I recently tried this wonderful tool on the CPU of my Windows 10 machine and got quite good results. But when I tried the GPU via model = WhisperModel(model_path, device="cuda", compute_type="float16"),
I received the following error: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.
I have a GTX 1050 Ti and the main driver is 31.0.15.1694. How can I fix this error and run on my GPU card?
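A GTX 1050 Ti (Pascal, compute capability 6.1) has no efficient FP16 path, which is what the error is saying. Requesting a type the card supports avoids it; a sketch, assuming the same model_path:

```python
# Fall back to a compute type Pascal GPUs handle well
model = WhisperModel(model_path, device="cuda", compute_type="float32")
```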
Hi,
I have a simple script to run inference on a WAV file. I noticed that when word_timestamps=True, the processing time is much longer.
I'm using the same WAV file in each of these cases; you can see the duration below for each:
model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=True, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 310.73524594306946s
model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=False, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 225.86893439292908s
Is this expected? Or is there some optimization that could be done for word-level timestamps?
Thanks again for this project. For context, I'm testing it transcribing a live public radio stream; I appreciate the rapid speed and low memory, as it's most useful for providing near-live transcription.
The radio stream is maybe 60% voice on a studio mic, 30% phone voice, 5% voice talking over music, and 5% music.
I have a simple Python script continuously feeding 30 s chunks from the live radio stream into faster-whisper.
I'm using the base model on a cheap VPS with just 2 GB RAM. I'm sure I could get better results with a higher-spec machine, but it's a proof of concept; it would be useful to run it across a large number of different streams here.
Most 30 s chunks take between 6-8 s to transcribe, which is perfect, but roughly 1 in 10 can blow out to between 20-50 s.
I haven't quite figured out what causes it; I wonder if it's when the 30 s chunk has a mix of music and talk, or a mix of different audio sources. Was wondering if, in your experience, you could shed light on the reason? Would a larger model stop the blowouts?
Description
I noticed that the translation feature in Faster-Whisper does not seem to translate the language fully for certain audio/video files. It appears to (at random) only translate parts of the language into English, whereas Whisper is mostly capable of translating the entire language. Is there a reason for this difference in translation capabilities?
Reproduction Steps
```python
from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

# Run on GPU with INT8 or FP16
model = WhisperModel(model_path, device=device, compute_type=compute_type)

# Transcribe video and translate to English
with torch.no_grad():
    segments, info = model.transcribe("test.mp4", beam_size=5, task="translate")

    # Print transcription and translation segments
    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
    for segment in segments:
        print("\n[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
Expected Behavior
The entire language in the video file should be translated into English.
Actual Behavior
The translation feature only translates/transcribes the English portion of the video.
Additional Information
I have attached the video file used to test the translation feature.
16:52:07 kris ~/faster-whisper $ python test_transcription.py
Traceback (most recent call last):
  File "/home/kris/faster-whisper/test_transcription.py", line 8, in <module>
    model = WhisperModel(model_path, device="cpu", compute_type="float16")
  File "/home/kris/faster-whisper/faster_whisper/transcribe.py", line 71, in __init__
    self.model = ctranslate2.models.Whisper(
ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.
I get somewhat the same error with whisper; however, it automatically falls back to float32. Is there a possible fix for this?
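On CPU, float16 is similarly unsupported. Passing "auto" lets CTranslate2 pick the best supported type, mirroring how openai/whisper silently falls back to float32. A sketch:

```python
# Let CTranslate2 choose a supported compute type on this CPU
model = WhisperModel(model_path, device="cpu", compute_type="auto")
```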
In the transcribe.py code I don't see the initial_prompt parameter. Is it somewhere? Can it be added?
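For reference, later releases added this parameter to transcribe(). Assuming a version that includes it, usage looks like:

```python
# Bias the first window's decoding with a prompt, as in openai/whisper
segments, info = model.transcribe(
    "audio.wav",
    initial_prompt="Glossary: CTranslate2, Whisper, beam search.",
)
```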
Sometimes a sentence repeats itself multiple times at the end of a segment, and may continue to repeat in subsequent segments.
This was a known issue of openai/whisper (upstream issue openai/whisper#977 and openai/whisper#1059), and may be fixed by openai/whisper@38f2f4d
When I use "faster-whisper", I encountered the same sentence repetition. I found it's also reported on #35 (comment)
Could you please check if this commit openai/whisper@38f2f4d can/should be ported here?
If I want to host this on a server, which type of server should I use for CPU and for GPU?
I researched a little; for CPU I came across Linode.
I have no experience hosting things like this; I'd appreciate it if anyone can answer.
First of all, I would like to say that I am very grateful for this project. I have a fairly unique problem: the transcript results obtained from mono audio and stereo audio are quite different. The transcripts from stereo audio are better than those from mono audio, even though we know that before transcription the audio is converted to mono.
Parameters used:
compute_type : int8
model : large
device: gpu
beam_size: 5
language: 'id'
Audio link: here
Will this project support ".pt" files instead of the bin models hosted on Hugging Face? Or could this project store converted models in some public storage?
Because the bin models on Hugging Face are very large and my network is pretty slow, downloading a model of this size is a big challenge for me.
Hi,
First, let me start by saying great job. This is awesome! It's crazy how much faster this is than openai/whisper
. I really appreciate the effort here.
I am running into a few issues and just want to better understand them.
I am on the latest ctranslate2==3.6.0 and am using the medium model in both cases with beam_size=5 set. On faster-whisper, I get:
______ available?î Yeah. Speaker? ______, this is
And on openai/whisper, I get:
I am using [NAME] here.
__________ Hello. __________ Hello, is [NAME] available? __________ Yes, speaking. __________ Hi,
I have other examples of this as well but need to redact the text before I can post them.
I've also noticed non-English characters in the output for faster-whisper on English-only audio. And even on 3.6.0, I still appear to get the beginning of my audio chopped off (but not all the time); it's seemingly random.
Is it normal to have some differences? Is there some config difference I am missing between the two?
While the speed increases are great, the inconsistencies are enough that we can't really use this over openai/whisper for our tasks. Anything I can do to help debug, let me know.
As a side note, I've been testing medium.en: the random characters do not seem to happen and accuracy appears better for faster-whisper compared to faster-whisper's multilingual model, whereas medium for openai/whisper appears to work fine.
Thanks in advance!
Hey, when I provide a language to the transcribe method, like segments, info = model.transcribe(file_name, language="english", beam_size=5),
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 25
20 segments, info = model.transcribe(file_name, language="english", beam_size=5)
21
23 print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
---> 25 for segment in segments:
26 print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
(...)
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:187, in WhisperModel.generate_segments(self, features, language, options)
182 def generate_segments(self, features, language, options):
183 tokenized_segments = self.generate_tokenized_segments(
184 features, language, options
185 )
--> 187 for start, end, tokens in tokenized_segments:
188 text = self.decode_text_tokens(tokens)
189 if not text.strip():
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:224, in WhisperModel.generate_tokenized_segments(self, features, language, options)
216 previous_tokens = all_tokens[prompt_reset_since:]
217 prompt = self.get_prompt(
218 language,
219 previous_tokens,
220 task=options.task,
221 without_timestamps=options.without_timestamps,
222 )
--> 224 result, temperature = self.generate_with_fallback(segment, prompt, options)
226 if (
227 result.no_speech_prob > options.no_speech_threshold
228 and result.scores[0] < options.log_prob_threshold
229 ):
230 offset += segment.shape[-1]
File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:315, in WhisperModel.generate_with_fallback(self, segment, prompt, options)
309 kwargs = {
310 "beam_size": options.beam_size,
311 "patience": options.patience,
312 }
314 final_temperature = temperature
--> 315 result = self.model.generate(
316 features,
317 [prompt],
318 max_length=max_length,
319 return_scores=True,
320 return_no_speech_prob=True,
321 **kwargs,
322 )[0]
324 tokens = result.sequences_ids[0]
325 text = self.decode_text_tokens(tokens)
TypeError: generate(): incompatible function arguments. The following argument types are supported:
1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]
Invoked with: <ctranslate2._ext.Whisper object at 0x7f19f702bf30>, <ctranslate2._ext.StorageView object at 0x7f19fef2df70>, [[50258, None, 50359]]; kwargs: max_length=448, return_scores=True, return_no_speech_prob=True, beam_size=5, patience=1
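The None inside the invoked prompt [[50258, None, 50359]] is the giveaway: the language token lookup failed because faster-whisper expects ISO 639-1 language codes, not English names. A sketch of the fix:

```python
# Use the language code ("en"), not the language name ("english")
segments, info = model.transcribe(file_name, language="en", beam_size=5)
```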
Hi, I really appreciate you sharing this implementation.
I found it to be very fast with accurate results.
I do not see word-level timestamps in the result. Are word-level timestamps possible?
Being able to have an option to pass the audio buffer directly to faster-whisper instead of creating a file would be really great.
I can see that currently only 30-second chunks are supported by the CTranslate2 model: shorter chunks are padded to 30 s so that model.generate accepts exclusively [batch_size, 80, 3000] inputs.
In some real-time applications, shorter chunks may be used, and the original Whisper model supports shorter chunks despite being trained on 30 s. Would it be possible to allow shorter chunks for faster inference, in contrast to always padding to 30 s?
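On the buffer question: newer faster-whisper releases accept an in-memory waveform (a 16 kHz mono float32 numpy array) directly, so no temporary file is needed. A sketch, assuming such a version:

```python
import numpy as np

# 16 kHz mono float32 is the format Whisper's frontend expects
audio = np.zeros(16000 * 5, dtype=np.float32)  # five seconds of silence as a stand-in
segments, info = model.transcribe(audio, beam_size=5)
```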
Hello, I would like to know if it's possible to add the "--word_timestamps" option to Faster Whisper now, since this new option has been added to the official Whisper repository. It would be very helpful if this option could be included in Faster Whisper. Thank you in advance.
I can't thank you enough for creating faster-whisper. This is faster than whisper.cpp. This is awesome; thank you so much.
I have been unable to convert the model.
Regardless of whether I have tried with or without quantization, or with different models, I have unfortunately had no success.
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output
Downloading: 100%|██████████| 1.92k/1.92k [00:00<00:00, 486kB/s]
Downloading: 100%|██████████| 151M/151M [00:04<00:00, 37.2MB/s]
Segmentation fault: 11
192:faster-whisper$ /usr/local/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '
^^ this one just freezes
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output --quantization float16
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output --quantization float16
Segmentation fault: 11
192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output --quantization float16
Segmentation fault: 11
Hi @guillaumekln,
we've discussed this a bit in #9, but I think it's worth creating an extra issue to keep track of it.
In my tests on Raspberry Pi 4 and Orange Pi 5, Whisper.cpp is actually slower than the original Whisper. Here is an excerpt of the results:
Test date: 2023.02.17
| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
|---|---|---|---|---|---|---|---|
| Whisper original | tiny | 1 | 4 | - | 5.9s | 0.54 | perfect |
| Whisper original | tiny | 2 | 4 | - | 4.3s | 1.19 | perfect |
| Whisper Cpp | ggml-tiny | 1 | 4 | - | 9.1s | 0.83 | perfect |
| Whisper Cpp | ggml-tiny | 2 | 4 | - | 8.6s | 2.39 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 8.4s | 0.76 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 8.0s | 2.22 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 3.9s | 0.36 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 3.2s | 0.90 | perfect |
Test date: 2023.02.19
| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
|---|---|---|---|---|---|---|---|
| Whisper original | tiny | 1 | 4 | - | 3.0s | 0.27 | perfect |
| Whisper original | tiny | 2 | 4 | - | 1.9s | 0.53 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 3.7s | 0.34 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 3.5s | 0.97 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 1.3s | 0.12 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 1.4s | 0.39 | perfect |
I've repeated the tests yesterday on Orange Pi 5 with similar results.
Currently hitting this exception in this block of code. It looks like Hugging Face got rid of tiny.
```python
tokenizer_file = os.path.join(model_path, "tokenizer.json")
if os.path.isfile(tokenizer_file):
    self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
else:
    self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
        "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
    )
```
Just wanted to say thanks for this great port of Whisper to CTranslate2.
I've done some tests and compared it to other ports, like a TFLite version and a C++ version, on Raspberry Pi 4. You can find the results here.
In conclusion, it is as fast as the TFLite version, but smaller, and it has the better API right now 🙂 👍.
Hi again!
I am running into two different issues consistently, mainly with av, but I am not sure if you've seen these before.
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input
and
invalid new backstep -1
with libav.mp3float.
Is there a newer version of av to use, or should I just do the conversion myself and pass down .wav files? Or is it possible my version of ffmpeg is older, since av is just a binding?
Is there any possible way to assign more CPU to this script? Honestly, it's super fast on my Windows machine. However, I've discovered that it only uses maybe 60%-70% of the CPU, so is there any way to make full use of it? Or is there any other way to improve the speed without losing quality?
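On CPU, thread usage is controlled when constructing the model: cpu_threads sets intra-op threads and num_workers allows concurrent transcriptions. A sketch; tune the counts to your core count:

```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "medium",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,   # intra-op parallelism per transcription
    num_workers=1,   # increase for concurrent transcribe() calls
)
```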
I really appreciate your project; I was just comparing it with whisper.cpp. I found the timestamps are not accurate, whereas whisper.cpp's seem accurate. The model is whisper-large-v2.
whisper.cpp
[00:00:00.000 --> 00:00:03.440] [MUSIC PLAYING]
[00:00:03.440 --> 00:00:03.940] Impossible.
[00:00:03.940 --> 00:00:10.440] A woman leading a man's army.
[00:00:10.440 --> 00:00:16.440] It is my duty to fight for the kingdom.
[00:00:16.440 --> 00:00:25.940] The girl who has come to save the dynasty.
[00:00:27.380 --> 00:00:30.380] [SCREAMING]
[00:00:30.380 --> 00:00:35.380] You will die pretending to be something you are not.
[00:00:35.380 --> 00:00:40.380] Get here, I stand.
[00:00:40.380 --> 00:00:48.380] I'm Hua Mulan.
[00:00:48.380 --> 00:00:51.380] I will bring honor to us all.
[00:00:51.380 --> 00:00:53.880] Disney's Mulan, rated PG-13.
[00:00:53.880 --> 00:00:55.880] Streaming September 4th.
[00:00:55.880 --> 00:00:58.380] Exclusively available to Disney+ subscribers
[00:00:58.380 --> 00:01:00.740] with Premier Access.
faster-whisper
[0.00s -> 5.00s] Impossible.
[5.00s -> 11.00s] A woman leading a man's army.
[11.00s -> 17.00s] It is my duty to fight for the kingdom.
[17.00s -> 26.00s] A girl who has come to save the dynasty.
[26.00s -> 39.00s] You will die pretending to be something you're not.
[39.00s -> 47.00s] Yet here I stand.
[47.00s -> 51.00s] I'm Hua Mulan. I will bring honor to us all.
[51.00s -> 56.00s] Disney's Mulan. Rated PG-13. Streaming September 4th.
[56.00s -> 74.00s] Exclusively available to Disney Plus subscribers with Premiere Access.
Here is my test audio link. Please try it; you will find the timestamps are incorrect.
https://stream.lestream.cn/source.mp3
How are you supposed to run this? I'm on Windows 10.
Got error when transcribing segments.
Traceback (most recent call last):
File "E:\AI\faster-whisper\trans.py", line 102, in <module>
gensrt(segments, output_file, True)
File "E:\AI\faster-whisper\trans.py", line 55, in gensrt
for i, segment in enumerate(segments):
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 389, in generate_segments self.add_word_timestamps(
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 547, in add_word_timestamps
alignment = self.find_alignment(tokenizer, text_tokens, mel, num_frames)
File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 617, in find_alignment
start_times = jump_times[word_boundaries[:-1]]
IndexError: index 1 is out of bounds for axis 0 with size 1
Hugging Face provides fine-tuning code: https://huggingface.co/blog/fine-tune-whisper
These files are what I obtained after fine-tuning.
When I use the following command, I get the following error:
ct2-transformers-converter --model ./checkpoint-90 --output_dir ./tmp
I compared with the Hugging Face model and found that many files are missing: https://huggingface.co/openai/whisper-small/tree/main
Please tell me how to get the missing files. Seeking help.
Hello,
Thank you for this great library!
Is there any way we can chunk the initial audio into shorter samples, say 50 seconds each, run inference on those, and end up with a final reconstruction?
I came across this article and I wonder if it's possible to get it working here. Any ideas if this is possible?
OS: Arch Linux x86_64, python-numpy-1.24.2, CUDA v12.1, all other system packages at latest versions.
I'm using a GPU for transcription, with cuda-12.1.0-1-x86_64.pkg.tar.zst installed via pacman. Transcription begins just as it did with CUDA v11.8 installed, but an error message is shown in the output:
Traceback (most recent call last):
File "faster-whisper/faster.py", line 15, in <module>
segments, info = model.transcribe(sys.argv[1], beam_size=5)
File "faster_whisper/transcribe.py", line 207, in transcribe
results = self.model.detect_language(input)
RuntimeError: Library libcublas.so.11 is not found or cannot be loaded
cd /var/cache/pacman/pkg
# downgrade CUDA and dependencies to CUDA v11.8, using older package files
sudo pacman -U cuda-11.8.0-1-x86_64.pkg.tar.zst cudnn-8.6.0.163-1-x86_64.pkg.tar.zst
Is there any Google Colab notebook for implementation? It would be very good for people who have no access to GPUs.
First of all, this is amazing work. It mops the floor with vanilla Whisper on Apple M1 chips.
However, I noticed that it only supports float32 types on Apple M1; all other types seem to fall back to that, presumably due to ctranslate2. I read that the Apple Neural Engine supports fp16, int16, and int8 types. Any chance we can support those via the ANE? That might take the performance to an even higher level.
thanks!
Hello, great work! I experimented a bit with this and came across an anomaly. While transcribing George Bush's Columbia talk, the memory stays around 2.5 GB, but then I encounter a sudden spike beyond 3.5 GB (in VRAM in case of GPU usage, and in RAM in case of CPU usage) when using int8, after all spoken text was already out of the model. Is it due to silence at the end or some additional operations? Would you know why this happens and how to prevent it?
!wget https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
```python
from faster_whisper import WhisperModel

model_path = "whisper-large-v2-ct2/"
model = WhisperModel(model_path, device="cuda", compute_type="int8")
segments, info = model.transcribe(
    "./George_W_Bush_Columbia_FINAL.ogg",
    beam_size=1,
    language="en",
    condition_on_previous_text=False,
)
```
The output:
...
[183.06s -> 185.82s] are safely home.
[185.82s -> 192.62s] May God bless the grieving families and may God continue to bless America.
['transcribe /home/ubuntu/src/faster-whisper/run.py:10', 'time_delta', 44.965]
Traceback (most recent call last):
File "/home/ubuntu/src/faster-whisper/run.py", line 16, in <module>
for segment in segments:
File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 285, in generate_segments
result, avg_log_prob, temperature = self.generate_with_fallback(
File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 461, in generate_with_fallback
result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
I installed the master version with
pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
How do I update it?
Thank you.
Thanks for the excellent work on this package. A quick query: I've been looking at the following suggestions on how to get word confidence from vanilla Whisper in greedy mode. Could you provide some pointers on how to implement this in the faster-whisper / CTranslate2 implementation, as the code here deviates significantly? github.com/openai/whisper/discussions/284