
whisper-plus's Introduction

WhisperPlus: Faster, Smarter, and More Capable 🚀

teaser

🛠️ Installation

pip install whisperplus git+https://github.com/huggingface/transformers
pip install flash-attn --no-build-isolation

🤗 Model Hub

You can find the models on the Hugging Face Model Hub.

🎙️ Usage

To use the whisperplus library, follow the steps below for different tasks:

🎵 YouTube URL to Audio

from whisperplus import SpeechToTextPipeline, download_youtube_to_mp3
from transformers import BitsAndBytesConfig, HqqConfig
import torch

url = "https://www.youtube.com/watch?v=di3rHkEZuUw"
audio_path = download_youtube_to_mp3(url, output_dir="downloads", filename="test")

hqq_config = HqqConfig(
    nbits=4,
    group_size=64,
    quant_zero=False,
    quant_scale=False,
    axis=0,
    offload_meta=False,
)  # axis=0 is used by default

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

pipeline = SpeechToTextPipeline(
    model_id="distil-whisper/distil-large-v3",
    quant_config=hqq_config,
    flash_attention_2=True,
)

transcript = pipeline(
    audio_path=audio_path,
    chunk_length_s=30,
    stride_length_s=5,
    max_new_tokens=128,
    batch_size=100,
    language="english",
    return_timestamps=False,
)

print(transcript)
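
If you plan to feed the transcript into the summarization or RAG examples below, you can save it to a text file first. A minimal sketch, assuming the pipeline returns either a plain string or a dict with a "text" key (the transcript.txt filename is simply the one the chat example reads):

# Persist the transcript for the summarization / RAG examples below.
text = transcript if isinstance(transcript, str) else transcript["text"]
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(text)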

🍎 Apple MLX

from whisperplus.pipelines import mlx_whisper
from whisperplus import download_youtube_to_mp3

url = "https://www.youtube.com/watch?v=1__CAdTJ5JU"
audio_path = download_youtube_to_mp3(url)

text = mlx_whisper.transcribe(
    audio_path, path_or_hf_repo="mlx-community/whisper-large-v3-mlx"
)["text"]
print(text)

🍏 Lightning MLX Whisper

from whisperplus.pipelines.lightning_whisper_mlx import LightningWhisperMLX
from whisperplus import download_youtube_to_mp3

url = "https://www.youtube.com/watch?v=1__CAdTJ5JU"
audio_path = download_youtube_to_mp3(url)

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12, quant=None)
output = whisper.transcribe(audio_path=audio_path)["text"]
print(output)

📰 Summarization

from whisperplus.pipelines.summarization import TextSummarizationPipeline

summarizer = TextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary = summarizer.summarize(transcript)
print(summary[0]["summary_text"])

📰 Long Text Support Summarization

from whisperplus.pipelines.long_text_summarization import LongTextSummarizationPipeline

summarizer = LongTextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary_text = summarizer.summarize(transcript)
print(summary_text)

💬 Speaker Diarization

You must confirm the licensing permissions of these two models. The pyannote diarization model is gated on the Hugging Face Hub, so accept its user conditions and authenticate with a Hugging Face token (for example via huggingface-cli login below) before running the pipeline.

pip install -r speaker_diarization.txt
pip install -U "huggingface_hub[cli]"
huggingface-cli login
from whisperplus.pipelines.whisper_diarize import ASRDiarizationPipeline
from whisperplus import download_youtube_to_mp3, format_speech_to_dialogue

audio_path = download_youtube_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")

device = "cuda"  # cpu or mps
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token=False,
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)

⭐ RAG - Chat with Video (LanceDB)

pip install sentence-transformers ctransformers langchain
from whisperplus.pipelines.chatbot import ChatWithVideo

chat = ChatWithVideo(
    input_file="trascript.txt",
    llm_model_name="TheBloke/Mistral-7B-v0.1-GGUF",
    llm_model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    llm_model_type="mistral",
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
)

query = "what is this video about ?"
response = chat.run_query(query)
print(response)

🌠 RAG - Chat with Video (AutoLLM)

pip install "autollm>=0.1.9"
from whisperplus.pipelines.autollm_chatbot import AutoLLMChatWithVideo

# service_context_params
system_prompt = """
You are an friendly ai assistant that help users find the most relevant and accurate answers
to their questions based on the documents you have access to.
When answering the questions, mostly rely on the info in documents.
"""
query_wrapper_prompt = """
The document information is below.
---------------------
{context_str}
---------------------
Using the document information and mostly relying on it,
answer the query.
Query: {query_str}
Answer:
"""

chat = AutoLLMChatWithVideo(
    input_file="input_dir",  # path of mp3 file
    openai_key="YOUR_OPENAI_KEY",  # optional
    huggingface_key="YOUR_HUGGINGFACE_KEY",  # optional
    llm_model="gpt-3.5-turbo",
    llm_max_tokens="256",
    llm_temperature="0.1",
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    embed_model="huggingface/BAAI/bge-large-zh",  # "text-embedding-ada-002"
)

query = "what is this video about ?"
response = chat.run_query(query)
print(response)

🎙️ Text to Speech

from whisperplus.pipelines.text2speech import TextToSpeechPipeline

tts = TextToSpeechPipeline(model_id="suno/bark")
audio = tts(text="Hello World", voice_preset="v2/en_speaker_6")
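
The example assigns the generated audio to audio without writing it anywhere. A minimal sketch for saving it as a WAV file, assuming the pipeline returns a NumPy-compatible waveform and using Bark's default 24 kHz sample rate (both assumptions, not part of the original example):

import numpy as np
import scipy.io.wavfile

# Assumed: `audio` is a 1-D float waveform; 24_000 Hz is Bark's default sample rate.
scipy.io.wavfile.write("hello_world.wav", rate=24_000, data=np.asarray(audio))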

🎥 AutoCaption

pip install moviepy
apt install imagemagick libmagick++-dev
sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
from whisperplus.pipelines.whisper_autocaption import WhisperAutoCaptionPipeline
from whisperplus import download_youtube_to_mp4

video_path = download_youtube_to_mp4(
    "https://www.youtube.com/watch?v=di3rHkEZuUw",
    output_dir="downloads",
    filename="test",
)  # Optional

caption = WhisperAutoCaptionPipeline(model_id="openai/whisper-large-v3")
caption(video_path=video_path, output_path="output.mp4", language="english")

😍 Contributing

pip install pre-commit
pre-commit install
pre-commit run --all-files

📜 License

This project is licensed under the terms of the Apache License 2.0.

🤗 Citation

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

whisper-plus's People

Contributors

akashad98, cobanov, criminact, kadirnar, pre-commit-ci[bot]


whisper-plus's Issues

RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

Hello.
I'm trying to run Whisper Plus but keep getting this error message:

RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

Using the example code (Youtube URL to Audio):

from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3

url = "https://www.youtube.com/watch?v=di3rHkEZuUw"

audio_path = download_and_convert_to_mp3(url)
pipeline = SpeechToTextPipeline(model_id="distil-whisper/distil-large-v2")
transcript = pipeline(audio_path, "distil-whisper/distil-large-v2", "english")

print(transcript)

I created a new environment with Miniconda (Python 3.11 on Windows 10) called "whisper", then:

  1. conda activate whisper -> pip install whisperplus -> y
  2. Ran the code above in VS Code with the correct environment selected -> error: "ffmpeg" not located
  3. pip install ffmpeg -> ran again -> error: "ffmpeg" not located
  4. pip uninstall ffmpeg -> conda install conda-forge::ffmpeg
  5. Ran again -> RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

Checked the "requirements.txt" and everything is installed accordingly.

Trying to run the AutoCaption code gets an "ImportError: cannot import name 'WhisperAutoCaptionPipeline' from 'whisperplus' (C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\whisperplus\__init__.py)".

I don't know what is going on!
Any insight would be greatly appreciated!
Thanks!!!

Full error:

C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\utils\generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\utils\generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\utils\generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\pyannote\audio\core\io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher 
enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch_audiomentations\utils\io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
2024-02-05 00:55:52,780 - INFO - Downloading started... output\test.mp3
2024-02-05 00:55:55,000 - INFO - Download and conversion successful. File saved at: output\test.mp3
2024-02-05 00:55:55,000 - INFO - Loading model...
2024-02-05 00:55:55,959 - INFO - Model loaded successfully.
2024-02-05 00:55:55,959 - INFO - Using device: cpu
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-05 00:55:56,886 - INFO - Transcribing audio...
Traceback (most recent call last):
  File "c:\Users\Daniel\Videos\Work Work\import os.py", line 7, in <module>
    transcript = pipeline(audio_path, "distil-whisper/distil-large-v2", "english")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\whisperplus\pipelines\whisper.py", line 74, in __call__
    result = pipe(audio_path)["text"]
             ^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\base.py", line 1132, in __call__
    return next(
           ^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 555, in _forward
    encoder_outputs=encoder(inputs, attention_mask=attention_mask),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\transformers\models\whisper\modeling_whisper.py", line 1119, in forward
    inputs_embeds = nn.functional.gelu(self.conv1(input_features))
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Daniel\miniconda3\envs\whisper\Lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

discussion: how to clear the GPU memory usage

In the snippet below, where we initialise a whisperplus pipeline, how do we clear out its memory usage once we are done with it?

pipeline = ASRDiarizationPipeline.from_pretrained(
        asr_model="openai/whisper-large-v3",
        diarizer_model="pyannote/speaker-diarization",
        chunk_length_s=30,
        device=DEVICE,
        use_auth_token=<auth_token>
    )

Would appreciate some help on this. I tried del pipeline but it doesn't seem to work
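
A common workaround (general PyTorch practice, not a whisperplus-specific API) is to drop every reference to the pipeline and then release PyTorch's cached CUDA memory; a minimal sketch:

import gc
import torch

# Delete the pipeline (and any other references to its models),
# run garbage collection, then release the cached CUDA memory.
del pipeline
gc.collect()
torch.cuda.empty_cache()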

Error: ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings.

I am trying to use the Diarization example from your readme. I've hardcoded the path to a file (tested .mp3 and .wav). I always get the error: ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings.
This is the code I use:
from whisperplus import (
    ASRDiarizationPipeline,
    download_and_convert_to_mp3,
    format_speech_to_dialogue,
)

#audio_path = download_and_convert_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")
audio_path = "/home/piotr/WhisperPlus/Tutorial/zycie.mp3"

device = "cuda" # cpu or mps
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization",
    use_auth_token=False,
    # use_auth_token="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",
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)

This is the traceback:

(WhisperPlus38) (base) piotr@Legion7:~/WhisperPlus/Tutorial$ /home/piotr/anaconda3/envs/WhisperPlus38/bin/python /home/piotr/WhisperPlus/Tutorial/diarizoation.py
2024-05-07 13:00:27,424 - WARNING - /home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-07 13:00:32,154 - INFO - Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.4. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
Traceback (most recent call last):
File "/home/piotr/WhisperPlus/Tutorial/diarizoation.py", line 11, in
pipeline = ASRDiarizationPipeline.from_pretrained(
File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/whisperplus/pipelines/whisper_diarize.py", line 43, in from_pretrained
diarization_pipeline = Pipeline.from_pretrained(diarizer_model, use_auth_token=use_auth_token)
File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/core/pipeline.py", line 136, in from_pretrained
pipeline = Klass(**params)
File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 167, in init
self._embedding = PretrainedSpeakerEmbedding(
File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 754, in PretrainedSpeakerEmbedding
return SpeechBrainPretrainedSpeakerEmbedding(
File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 245, in init
raise ImportError(
ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings. Visit https://speechbrain.github.io for installation instructions.


AttributeError: 'HQQLinear' object has no attribute 'weight'

Thank you for creating this amazing package. It looks very promising. But I am facing an issue with installation on a Linux Mint (Ubuntu) desktop, where I have an NVIDIA RTX 3060 GPU with 12 GB VRAM. I am just trying the same code as provided in the reference readme.

It can download the YouTube video and convert it to MP3, and the model is downloaded as well. It loaded the model into memory (not sure), but then I am stuck. Below is the error. If you need any trace log, I can share it.

2024-05-10 16:10:34,324 - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Traceback (most recent call last):
  File "/home/ml/whisperplus/demo-test.py", line 24, in <module>
    pipeline = SpeechToTextPipeline(
  File "/home/ml/.local/lib/python3.10/site-packages/whisperplus/pipelines/whisper.py", line 26, in __init__
    self.load_model(model_id, quant_config)
  File "/home/ml/.local/lib/python3.10/site-packages/whisperplus/pipelines/whisper.py", line 43, in load_model
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
  File "/home/ml/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ml/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3689, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ml/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4123, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ml/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 847, in _load_state_dict_into_meta_model
    old_param = getattr(old_param, split)
  File "/home/ml/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
**AttributeError: 'HQQLinear' object has no attribute 'weight'**

ImageMagick is not installed on your computer BUT it's installed.

I am trying to run the autocaption example. I've run into this error. ImageMagick is in fact installed.

(WhisperPlus38) (base) piotr@Legion7:~/WhisperPlus/Tutorial$ /home/piotr/anaconda3/envs/WhisperPlus38/bin/python /home/piotr/WhisperPlus/Tutorial/autocaption.py
2024-05-07 12:08:18,308 - WARNING - /home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")

2024-05-07 12:08:18,390 - INFO - Loading model...
2024-05-07 12:08:19,331 - INFO - Model loaded successfully.
2024-05-07 12:08:19,331 - INFO - Using device: cuda
ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)
  configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/home/piotr/WhisperPlus/Tutorial/Wiktor.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.20.100
  Duration: 00:02:27.75, start: 0.000000, bitrate: 15237 kb/s
  Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1920x1072, 15036 kb/s, 24 fps, 24 tbr, 12288 tbn (default)
    Metadata:
      handler_name    : Core Media Video
      vendor_id       : [0][0][0][0]
  Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 196 kb/s (default)
    Metadata:
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to '/home/piotr/WhisperPlus/Tutorial/Wiktor.mp3':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    TSSE            : Lavf60.16.100
  Stream #0:0(und): Audio: mp3, 44100 Hz, stereo, fltp (default)
    Metadata:
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
      encoder         : Lavc60.31.102 libmp3lame
[out#0/mp3 @ 0x5617a2ec8140] video:0kB audio:2309kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.015059%
size=    2309kB time=00:02:27.69 bitrate= 128.1kbits/s speed=58.8x    
2024-05-07 12:08:22,008 - INFO - Transcribing audio...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
Traceback (most recent call last):
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/moviepy/video/VideoClip.py", line 1137, in __init__
    subprocess_call(cmd, logger=None)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/moviepy/tools.py", line 54, in subprocess_call
    raise IOError(err.decode('utf8'))
OSError: convert-im6.q16: attempt to perform an operation not allowed by the security policy `@/tmp/tmp1bxytlbo.txt' @ error/property.c/InterpretImageProperties/3771.
convert-im6.q16: label expected `@/tmp/tmp1bxytlbo.txt' @ error/annotate.c/GetMultilineTypeMetrics/782.
convert-im6.q16: no images defined `PNG32:/tmp/tmp10k7g0kv.png' @ error/convert.c/ConvertImageCommand/3234.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/piotr/WhisperPlus/Tutorial/autocaption.py", line 4, in <module>
    caption(video_path="/home/piotr/WhisperPlus/Tutorial/Wiktor.mp4", output_path="output.mp4", language="spanish")
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/whisperplus/pipelines/whisper_autocaption.py", line 88, in __call__
    return self.add_subtitles_to_video(video_path, result['chunks'], output_path)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/whisperplus/pipelines/whisper_autocaption.py", line 59, in add_subtitles_to_video
    txt_clip = TextClip(text, fontsize=24, color='white', bg_color='black', size=(max_width, None))
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/moviepy/video/VideoClip.py", line 1146, in __init__
    raise IOError(error)
OSError: MoviePy Error: creation of None failed because of the following error:

convert-im6.q16: attempt to perform an operation not allowed by the security policy `@/tmp/tmp1bxytlbo.txt' @ error/property.c/InterpretImageProperties/3771.
convert-im6.q16: label expected `@/tmp/tmp1bxytlbo.txt' @ error/annotate.c/GetMultilineTypeMetrics/782.
convert-im6.q16: no images defined `PNG32:/tmp/tmp10k7g0kv.png' @ error/convert.c/ConvertImageCommand/3234.
.

.This error can be due to the fact that ImageMagick is not installed on your computer, or (for Windows users) that you didn't specify the path to the ImageMagick binary in file conf.py, or that the path you specified is incorrect
(WhisperPlus38) (base) piotr@Legion7:~/WhisperPlus/Tutorial$ imagemagick
imagemagick: command not found
(WhisperPlus38) (base) piotr@Legion7:~/WhisperPlus/Tutorial$ sudo apt install imagemagick
[sudo] password for piotr: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
imagemagick is already the newest version (8:6.9.12.98+dfsg1-5.2build2).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
(WhisperPlus38) (base) piotr@Legion7:~/WhisperPlus/Tutorial$ 



[Feature Request]: Add AutoLLM Library

Integrate the RAG-based Language Model for interactive question and answer functionality. Users can utilize any text input to query their preferred language model. The model will respond with answers derived from the specified document, enhancing user engagement and interactivity. 🤖

GitHub: AutoLLM: Ship RAG based LLM ⚙️

Typo in Documentation

The title for the TextToSpeechPipeline function is written as "Speech to Text".


These are just different things is all.

Issue: diarization = self.diarization_pipeline( TypeError: 'NoneType' object is not callable

I am getting this error. Should I put an HF auth token somewhere?

Could not download 'pyannote/speaker-diarization' pipeline.
It might be because the pipeline is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

   >>> Pipeline.from_pretrained('pyannote/speaker-diarization',
   ...                          use_auth_token=YOUR_AUTH_TOKEN)

If this still does not work, it might be because the pipeline is gated:
visit https://hf.co/pyannote/speaker-diarization to accept the user conditions.
Traceback (most recent call last):
  File "/home/piotr/WhisperPlus/Tutorial/Diarization.py", line 18, in <module>
    output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
  File "/home/piotr/anaconda3/envs/WhisperPlus308/lib/python3.8/site-packages/whisperplus/pipelines/whisper_diarize.py", line 104, in __call__
    diarization = self.diarization_pipeline(
TypeError: 'NoneType' object is not callable

Additional Information

It would be great to have additional information on how this package works and what it does differently compared to, say, Whisper.cpp, faster-whisper, whisperX, distil-whisper and so on.

new Installation - many errors in Conda Python 3.8, 3.10 and 3.12

Hi,
I am trying to install WhisperPlus but face many issues.

Traceback (most recent call last):
  File "/home/piotr/WhisperPlus/Tutorial/caption.py", line 1, in <module>
    from whisperplus import WhisperAutoCaptionPipeline
ImportError: cannot import name 'WhisperAutoCaptionPipeline' from 'whisperplus' (/home/piotr/anaconda3/envs/WhisperPlus310/lib/python3.10/site-packages/whisperplus/__init__.py)

OR

Traceback (most recent call last):
  File "/home/piotr/WhisperPlus/Tutorial/youtubetoaudio.py", line 20, in <module>
    bnb_4bit_compute_dtype=torch.bfloat16,
NameError: name 'torch' is not defined

I have a new installation and get many errors in Conda with Python 3.8, 3.10 and 3.12.
I've created a new conda environment and followed the installation procedure:

pip install whisperplus git+https://github.com/huggingface/transformers

pip install flash-attn --no-build-isolation

NVIDIA CUDA toolkit 12.4, newest drivers.

ValueError: We expect a numpy ndarray as input, got `<class 'NoneType'>`

Hi,
I am trying the YouTube-to-text example. I am getting the following error.

2024-05-07 07:58:17,383 - WARNING - /home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")

2024-05-07 07:58:19,889 - ERROR - An error occurred: __init__: could not find match for ^\w+\W
2024-05-07 07:58:20,578 - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2024-05-07 07:58:21,842 - INFO - Model loaded successfully.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-07 07:58:22,965 - INFO - Transcribing audio...
Traceback (most recent call last):
  File "/home/piotr/WhisperPlus/Tutorial/YoutubeToText.py", line 31, in <module>
    transcript = pipeline(
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/whisperplus/pipelines/whisper.py", line 91, in __call__
    result = pipe(audio_path)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 285, in __call__
    return super().__call__(inputs, **kwargs)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1234, in __call__
    return next(
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/pt_utils.py", line 186, in __next__
    processed = next(self.subiterator)
  File "/home/piotr/anaconda3/envs/WhisperPlus38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 410, in preprocess
    raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
ValueError: We expect a numpy ndarray as input, got `<class 'NoneType'>`
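
The log above suggests the download step failed (note the "could not find match" error), so the pipeline received None instead of an audio path. A minimal guard that surfaces this earlier, assuming the download helper returns None on failure (an assumption, not documented behaviour):

from whisperplus import download_youtube_to_mp3

url = "https://www.youtube.com/watch?v=di3rHkEZuUw"
audio_path = download_youtube_to_mp3(url, output_dir="downloads", filename="test")

# Fail fast instead of passing None into the ASR pipeline.
if audio_path is None:
    raise RuntimeError(f"Download or conversion failed for {url}; check yt-dlp/ffmpeg.")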

[CONTRIBUTION] Speech Dataset Generator

Hi everyone!

My name is David Martin Rius and I have just published this project on GitHub: https://github.com/davidmartinrius/speech-dataset-generator/

Now you can create datasets automatically with any audio or lists of audios.

I hope you find it useful.

Here are the key functionalities of the project:

  1. Dataset Generation: Creation of multilingual datasets with Mean Opinion Score (MOS).

  2. Silence Removal: It includes a feature to remove silences from audio files, enhancing the overall quality.

  3. Sound Quality Improvement: It improves the quality of the audio when needed.

  4. Audio Segmentation: It can segment audio files within specified second ranges.

  5. Transcription: The project transcribes the segmented audio, providing a textual representation.

  6. Gender Identification: It identifies the gender of each speaker in the audio.

  7. Pyannote Embeddings: Utilizes pyannote embeddings for speaker detection across multiple audio files.

  8. Automatic Speaker Naming: Automatically assigns names to speakers detected in multiple audios.

  9. Multiple Speaker Detection: Capable of detecting multiple speakers within each audio file.

  10. Store speaker embeddings: The speakers are detected and stored in a Chroma database, so you do not need to assign a speaker name.

  11. Syllabic and words-per-minute metrics

Feel free to explore the project at https://github.com/davidmartinrius/speech-dataset-generator

ValueError: We expect a numpy ndarray as input, got `<class 'NoneType'>`

I followed the steps of "Youtube URL to Audio", but the following error occurred:

2023-12-18 11:06:54,971 - INFO - Transcribing audio...
Traceback (most recent call last):
  File "/Users/jinlin/Desktop/learn/whisperplus/index.py", line 11, in <module>
    transcript = pipeline(
                 ^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/whisperplus/pipelines/whisper.py", line 74, in __call__
    result = pipe(audio_path)["text"]
             ^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1132, in __call__
    return next(
           ^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                           ^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 183, in __next__
    processed = next(self.subiterator)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jinlin/Desktop/learn/whisperplus/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 482, in preprocess
    raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
ValueError: We expect a numpy ndarray as input, got `<class 'NoneType'>`

Here is my config of env:
home = /opt/homebrew/opt/python@3.11/bin
include-system-site-packages = false
version = 3.11.4
executable = /opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/bin/python3.11
command = /opt/homebrew/opt/python@3.11/bin/python3.11 -m venv /Users/jinlin/Desktop/learn/whisperplus

RuntimeError: CUDA error: device-side assert triggered

I followed the steps of "Youtube URL to Audio", The first time I ran it successfully, the second time it failed, the following error occurred, :

from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3

url = "https://www.youtube.com/watch?v=7S-2XKufdus&ab_channel=MitchAsser"
audio_path = download_and_convert_to_mp3(url)
pipeline = SpeechToTextPipeline(model_id="openai/whisper-large-v3")
transcript = pipeline(audio_path, "openai/whisper-large-v3", "english")

print(transcript)

RuntimeError                              Traceback (most recent call last)
[<ipython-input-40-cf07cfb4f81e>](https://localhost:8080/#) in <cell line: 5>()
      3 url = "https://www.youtube.com/watch?v=7S-2XKufdus&ab_channel=MitchAsser"
      4 audio_path = download_and_convert_to_mp3(url)
----> 5 pipeline = SpeechToTextPipeline(model_id="openai/whisper-large-v3")
      6 transcript = pipeline(audio_path, "openai/whisper-large-v3", "english")
      7 

7 frames
[/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py](https://localhost:8080/#) in empty_cache()
    160     """
    161     if is_initialized():
--> 162         torch._C._cuda_emptyCache()
    163 
    164 

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
GPU 0: Tesla T4 (UUID: GPU-ea2578e1-f112-e896-e103-fae036848bcb)
Thu Mar 28 16:33:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0              29W /  70W |  10449MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

In Google Colab.

[Feature Request] Add Functionality for MP3 Download and Text Generation with Whisper Model 🎶

Description:

  1. 🌐 MP3 Download Functionality: Implement code using the YT-DLP library to download the audio of a specific YouTube video in .mp3 format.

  2. 🤖 Text Generation with Whisper Model: Develop the capability to feed the downloaded MP3 file into the Whisper model, generating text based on the audio.

  3. 🚀 Expected Behavior: When the user runs the project, the specified YouTube video's audio should be downloaded in .mp3 format, and this audio file should be processed by the Whisper model to generate corresponding text.

Can't run Diarization on MPS device

I did not manage to run the Speaker Diarization from the README example on an Apple MPS device.

I got this error and don't know how to fix it:


% python app-plus.py
/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
2024-04-09 11:17:30,670 - INFO - Downloading started... output/test.mp3
2024-04-09 11:18:41,270 - INFO - Download and conversion successful. File saved at: output/test.mp3
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-04-09 11:18:51,055 - INFO - Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.2. Bad things might happen unless you revert torch to 1.x.
Traceback (most recent call last):
  File "/Users/jeanjerome/PROJETS/voxcatalyst/app-plus.py", line 10, in <module>
    pipeline = ASRDiarizationPipeline.from_pretrained(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/whisperplus/pipelines/whisper_diarize.py", line 43, in from_pretrained
    diarization_pipeline = Pipeline.from_pretrained(diarizer_model, use_auth_token=use_auth_token)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/pyannote/audio/core/pipeline.py", line 136, in from_pretrained
    pipeline = Klass(**params)
               ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 167, in __init__
    self._embedding = PretrainedSpeakerEmbedding(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 754, in PretrainedSpeakerEmbedding
    return SpeechBrainPretrainedSpeakerEmbedding(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/vox-catalyst/lib/python3.11/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 245, in __init__
    raise ImportError(
ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings. Visit https://speechbrain.github.io for installation instructions.

Also speechbrain is installed:

% pip list | grep speechbrain
speechbrain               1.0.0

And the HF token is declared in the use_auth_token attribute...

Any idea?
Thanks for your response... and your great work!

【Installing Error】Using pip install on Mac M1 Pro

Hi there, I've got an issue when installing on a Mac M1 Pro. Could anyone help and tell me anything I could do about this error?

Many thanks !

ERROR: Exception:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
status = run_func(*args)
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/cli/req_command.py", line 245, in wrapper
return func(self, options, args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 95, in resolve
result = self._result = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 427, in resolve
failure_causes = self._attempt_to_pin_criterion(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 239, in _attempt_to_pin_criterion
criteria = self._get_updated_criteria(candidate)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 230, in _get_updated_criteria
self._add_to_criteria(criteria, requirement, parent=candidate)
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
if not criterion.candidates:
^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/resolvelib/structs.py", line 156, in bool
return bool(self._sequence)
^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in bool
return any(self)
^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in
return (c for c in iterator if id(c) not in self._incompatible_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
candidate = func()
^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 182, in _make_candidate_from_link
base: Optional[BaseCandidate] = self._make_base_candidate_from_link(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 228, in _make_base_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 290, in init
super().init(
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in init
self.dist = self._prepare()
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 222, in _prepare
dist = self._prepare_distribution()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 301, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 525, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 640, in _prepare_linked_requirement
dist = _get_prepared_distribution(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 71, in _get_prepared_distribution
abstract_dist.prepare_distribution_metadata(
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 54, in prepare_distribution_metadata
self._install_build_reqs(finder)
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 124, in _install_build_reqs
build_reqs = self._get_build_requires_wheel()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 101, in _get_build_requires_wheel
return backend.get_requires_for_build_wheel()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_internal/utils/misc.py", line 745, in get_requires_for_build_wheel
return super().get_requires_for_build_wheel(config_settings=cs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_impl.py", line 166, in get_requires_for_build_wheel
return self._call_hook('get_requires_for_build_wheel', {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_impl.py", line 321, in _call_hook
raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pyproject_hooks._impl.BackendUnavailable: Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
obj = import_module(mod_path)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/init.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1387, in _gcd_import
File "", line 1360, in _find_and_load
File "", line 1310, in _find_and_load_unlocked
File "", line 488, in _call_with_frames_removed
File "", line 1387, in _gcd_import
File "", line 1360, in _find_and_load
File "", line 1331, in _find_and_load_unlocked
File "", line 935, in _load_unlocked
File "", line 994, in exec_module
File "", line 488, in _call_with_frames_removed
File "/private/var/folders/c7/7bf0077d7811df5w0nykrzmc0000gn/T/pip-build-env-xks54qe6/overlay/lib/python3.12/site-packages/setuptools/init.py", line 10, in
import distutils.core
ModuleNotFoundError: No module named 'distutils'

ModuleNotFoundError: No module named "Whisper"

Hi, I might be doing something incorrectly, but I am having trouble installing this.

I started by using a venv then activating it, then installing everything as shown on the instructions:

python -m venv .venv
source .venv/bin/activate
pip install whisperplus git+https://github.com/huggingface/transformers
pip install flash-attn --no-build-isolation

All of this worked fine. But then when I tried to import anything, I got ModuleNotFoundError: No module named 'whisper'

Full traceback:

(.venv) classgpu@host:~/whisperplustesting$ python test22.py 
Traceback (most recent call last):
  File "/home/classgpu/whisperplustesting/test22.py", line 1, in <module>
    from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3
  File "/home/classgpu/whisperplustesting/.venv/lib/python3.10/site-packages/whisperplus/__init__.py", line 1, in <module>
    from whisper.pipelines.whisper_autocaption import WhisperAutoCaptionPipeline
ModuleNotFoundError: No module named 'whisper'

I then tried installing whisper directly through pip install openai-whisper, but then I got the error ModuleNotFoundError: No module named 'whisper.pipelines'

contribution: vague info for using speaker diarization

In this snippet taken from README.md, it should be stated that an auth_token is required when using a gated diarization model from pyannote.

from whisperplus import (
    ASRDiarizationPipeline,
    download_and_convert_to_mp3,
    format_speech_to_dialogue,
)

audio_path = download_and_convert_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")

device = "cuda"  # cpu or mps
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization",
    use_auth_token=False,
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)

huggingface token required

When running the sample diarization pipeline, I get:

Could not download 'pyannote/speaker-diarization' pipeline.

It might be because the pipeline is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

   >>> Pipeline.from_pretrained('pyannote/speaker-diarization',
   ...                          use_auth_token=YOUR_AUTH_TOKEN)

Where would I include the auth token in the function call?

Or would I have to include it in the env? etc.
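
A minimal sketch of one way to do it, assuming use_auth_token accepts a Hugging Face access token string, as the pyannote hint above suggests (YOUR_HF_TOKEN is a placeholder):

from whisperplus.pipelines.whisper_diarize import ASRDiarizationPipeline

# Accept the gated model's user conditions on the Hub first, then pass your
# access token (placeholder below) so the pyannote pipeline can be downloaded.
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
    chunk_length_s=30,
    device="cuda",
)

Alternatively, running huggingface-cli login first (as in the README's diarization setup) caches the token locally for huggingface_hub to use.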

bug: whisperplus pyannote.audio dependencies

Hey, I needed some assistance in setting whisper-plus up with diarization support.

Here are the installed whisperplus specs and the dependency issues:

whisperplus==0.2.7
 - torch [required: >=2.0.0, installed: 1.13.1]
 - torchaudio [required: >=2.0.0, installed: 0.13.1]
 - pyannote.audio [required: ==3.1.0, installed: 0.0.1]
 - pyannote.core [required: ==5.0.0, installed: 4.5]
 - pyannote.database [required: ==5.0.1, installed: 4.1.3]
 - pyannote.pipeline [required: ==3.0.1, installed: 2.3]

However, when running a script using whisperplus I get this -

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.1+cu121. Bad things might happen unless you revert torch to 1.x.

Please, I would appreciate some insight on this.
