

Audiotext

A desktop application that transcribes audio from files, microphone input or YouTube videos with the option to translate the content and create subtitles.


Report Bug · Request Feature · Ask Question


About the Project


Audiotext transcribes the audio from an audio file, video file, microphone input, directory, or YouTube video into any of the 99 different languages it supports. You can transcribe using the Google Speech-to-Text API, the Whisper API, or WhisperX. The last two methods can even translate the transcription or generate subtitles!

You can also choose the theme you like best. It can be dark, light, or the one configured in the system.

Dark theme
Light theme

Supported Languages

  • Afrikaans
  • Albanian
  • Amharic
  • Arabic
  • Armenian
  • Assamese
  • Azerbaijani
  • Bashkir
  • Basque
  • Belarusian
  • Bengali
  • Bosnian
  • Breton
  • Bulgarian
  • Burmese
  • Catalan
  • Chinese
  • Chinese (Yue)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Faroese
  • Farsi
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Gujarati
  • Haitian
  • Hausa
  • Hawaiian
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Italian
  • Japanese
  • Javanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Lao
  • Latin
  • Latvian
  • Lingala
  • Lithuanian
  • Luxembourgish
  • Macedonian
  • Malagasy
  • Malay
  • Malayalam
  • Maltese
  • Maori
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian
  • Norwegian Nynorsk
  • Occitan
  • Pashto
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Sanskrit
  • Serbian
  • Shona
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Sundanese
  • Swahili
  • Swedish
  • Tagalog
  • Tajik
  • Tamil
  • Tatar
  • Telugu
  • Thai
  • Tibetan
  • Turkish
  • Turkmen
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Welsh
  • Yiddish
  • Yoruba

Supported File Types

Audio file formats
  • .aac
  • .flac
  • .mp3
  • .mpeg
  • .oga
  • .ogg
  • .opus
  • .wav
  • .wma
Video file formats
  • .3g2
  • .3gp2
  • .3gp
  • .3gpp2
  • .3gpp
  • .asf
  • .avi
  • .f4a
  • .f4b
  • .f4v
  • .flv
  • .m4a
  • .m4b
  • .m4r
  • .m4v
  • .mkv
  • .mov
  • .mp4
  • .ogv
  • .ogx
  • .webm
  • .wmv

Project Structure

ASCII folder structure
│   .gitignore
│   audiotext.spec
│   LICENSE
│   README.md
│   requirements.txt
│
├───.github
│   │   CONTRIBUTING.md
│   │   FUNDING.yml
│   │
│   ├───ISSUE_TEMPLATE
│   │       bug_report_template.md
│   │       feature_request_template.md
│   │
│   └───PULL_REQUEST_TEMPLATE
│           pull_request_template.md
│
├───docs/
│
├───res
│   ├───img
│   │       icon.ico
│   │
│   └───locales
│       │   main_controller.pot
│       │   main_window.pot
│       │
│       ├───en
│       │   └───LC_MESSAGES
│       │           app.mo
│       │           app.po
│       │           main_controller.po
│       │           main_window.po
│       │
│       └───es
│           └───LC_MESSAGES
│                   app.mo
│                   app.po
│                   main_controller.po
│                   main_window.po
│
└───src
    │   app.py
    │
    ├───controllers
    │       __init__.py
    │       main_controller.py
    │
    ├───handlers
    │       file_handler.py
    │       google_api_handler.py
    │       openai_api_handler.py
    │       whisperx_handler.py
    │       youtube_handler.py
    │
    ├───interfaces
    │       transcribable.py
    │
    ├───models
    │   │   __init__.py
    │   │   transcription.py
    │   │
    │   └───config
    │           __init__.py
    │           config_subtitles.py
    │           config_system.py
    │           config_transcription.py
    │           config_whisper_api.py
    │           config_whisperx.py
    │
    ├───utils
    │       __init__.py
    │       audio_utils.py
    │       config_manager.py
    │       constants.py
    │       dict_utils.py
    │       enums.py
    │       env_keys.py
    │       path_helper.py
    │
    └───views
        │   __init__.py
        │   main_window.py
        │
        └───custom_widgets
                __init__.py
                ctk_scrollable_dropdown/
                ctk_input_dialog.py

Built With

  • CTkScrollableDropdown for the scrollable option menu to display the full list of supported languages.
  • CustomTkinter for the GUI.
  • moviepy for video processing, from which the program extracts the audio to be transcribed.
  • OpenAI Python API library for using the Whisper API.
  • PyAudio for recording microphone audio.
  • pydub for audio processing.
  • python-dotenv for handling environment variables.
  • PyTorch for building and training neural networks.
  • PyTorch-CUDA for enabling GPU support (CUDA) with PyTorch. CUDA is a parallel computing platform and application programming interface model created by NVIDIA.
  • pytube for downloading the audio of YouTube videos.
  • SpeechRecognition for using the Google Speech-To-Text API.
  • Torchaudio for audio processing tasks, including speech recognition and audio classification.
  • WhisperX for fast automatic speech recognition. This product includes software developed by Max Bain. Uses faster-whisper, which is a reimplementation of OpenAI's Whisper model using CTranslate2.

(back to top)

Getting Started

Installation

  1. Install FFmpeg. The program won't be able to process audio files without it.

    To check if you have it installed on your system, run ffmpeg -version. It should return something similar to this:

    ffmpeg version 5.1.2-essentials_build-www.gyan.dev Copyright (c) 2000-2022 the FFmpeg developers
    built with gcc 12.1.0 (Rev2, Built by MSYS2 project)
    configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
    libavutil      57. 28.100 / 57. 28.100
    libavcodec     59. 37.100 / 59. 37.100
    libavformat    59. 27.100 / 59. 27.100
    libavdevice    59.  7.100 / 59.  7.100
    libavfilter     8. 44.100 /  8. 44.100
    libswscale      6.  7.100 /  6.  7.100
    libswresample   4.  7.100 /  4.  7.100
    

    If the command returns an error, your system can't find ffmpeg, most likely because it isn't installed. To install ffmpeg, open a command prompt and run the command for your operating system:

    # on Ubuntu or Debian
    sudo apt update && sudo apt install ffmpeg
    
    # on Arch Linux
    sudo pacman -S ffmpeg
    
    # on MacOS using Homebrew (https://brew.sh/)
    brew install ffmpeg
    
    # on Windows using Chocolatey (https://chocolatey.org/)
    choco install ffmpeg
    
    # on Windows using Scoop (https://scoop.sh/)
    scoop install ffmpeg
    
  2. Go to releases and download the latest.

  3. Decompress the downloaded file.

  4. Open the audiotext folder and double-click the Audiotext executable file.

Setting Up the Project Locally

  1. Clone the repository by running git clone https://github.com/HenestrosaDev/audiotext.git.
  2. Change the current working directory to audiotext by running cd audiotext.
  3. (Optional but recommended) Create a Python virtual environment in the project root. If you're using virtualenv, you would run virtualenv venv.
  4. (Optional but recommended) Activate the virtual environment:
    # on Windows
    . venv/Scripts/activate
    # if you get the error `FullyQualifiedErrorId : UnauthorizedAccess`, run this:
    Set-ExecutionPolicy Unrestricted -Scope Process
    # and then . venv/Scripts/activate
    
    # on macOS and Linux
    source venv/bin/activate
  5. Run pip install -r requirements.txt to install the dependencies.
  6. (Optional) If you intend to contribute to the project, run pip install -r requirements-dev.txt to install the development dependencies.
  7. (Optional) If you followed step 6, run pre-commit install to install the pre-commit hooks in your .git/ directory.
  8. Copy the .env.example file to the root of the project and rename the copy to .env.
  9. Run python src/app.py to start the program.

Notes

  • You cannot generate a single executable file for this project with PyInstaller due to the dependency with the CustomTkinter package (reason here).
  • For Apple Silicon Macs and Ubuntu users: An error occurs when trying to install the pyaudio package. Here is a StackOverflow post explaining how to solve this issue.
  • I had to comment out the pprint(response_text, indent=4) lines in the recognize_google function of the SpeechRecognition package's __init__.py to avoid opening a command line window along with the GUI. Otherwise, the program couldn't use the Google API transcription method, because pprint throws an error if it cannot print to the CLI, which prevents the code from generating the transcription. The same applies to the lines using the logger package in the moviepy/audio/io/ffmpeg_audiowriter file of the moviepy package, where line 169 was also changed from logger=logger to logger=None to avoid more errors related to opening the console. If you run from source, a less invasive alternative is sketched below.
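
If you run Audiotext from source, one less invasive alternative is to redirect stdout around the offending call instead of editing the installed packages. This is an illustrative sketch only, not how Audiotext itself handles the problem:

import contextlib
import io

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = recognizer.record(source)

# Swallow anything the library tries to print (such as pprint debug output),
# which would otherwise fail in a windowed build with no console attached
with contextlib.redirect_stdout(io.StringIO()):
    text = recognizer.recognize_google(audio)

print(text)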

(back to top)

Usage

Once you open the Audiotext executable file (explained in the Getting Started section), you'll see something like this:

Main window

Transcription Language

The target language for the transcription. If you use the Whisper API or WhisperX transcription methods, you can set this to a language other than the one spoken in the audio to translate the transcription into the selected language.

For example, to translate English audio into French, you would set Transcription language to French, as shown in the video below:

english-to-french.mp4

This is an unofficial way to perform translations, so be sure to double-check the generated transcription for errors.

Transcription Method

There are three transcription methods available in Audiotext:

  • Google Speech-To-Text API (hereafter referred to as Google API): Requires an Internet connection. It doesn't punctuate sentences (the punctuation is added by Audiotext), and the resulting transcriptions often need manual adjustment because their quality is lower than that of the Whisper API or WhisperX. In its free tier, usage is limited to 60 minutes per month, but this limit can be extended by adding an API key.

  • Whisper API: Requires an Internet connection. This method is intended for people whose machines are not powerful enough to run WhisperX smoothly. It has fewer options than WhisperX, but the quality of its transcriptions is similar to those generated by WhisperX's large-v2 model. However, you need to set an OpenAI API key to use this method. See the Whisper API Key section for more information.

  • WhisperX: Selected by default. It doesn't require an Internet connection because the entire transcription process runs locally on your computer. As a result, it's much more demanding on hardware than the other two methods, which transcribe remotely. WhisperX can run on CPUs and CUDA GPUs, although it performs better on the latter. The quality of the transcription depends on the selected model size and compute type. In addition, WhisperX offers a wider range of features, including a more customizable subtitle generation process than the Whisper API and more output file types, and it has no usage restrictions while remaining completely free. A minimal sketch of its transcription flow is shown below.
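
For context, here is a minimal sketch of the WhisperX flow in Python, based on the library's documented API. The model name, file path, and settings are illustrative; this is not Audiotext's actual code:

import whisperx

device = "cpu"  # or "cuda" for an NVIDIA GPU

# int8 is the default compute type on CPU (see Compute Type below)
model = whisperx.load_model("large-v2", device, compute_type="int8")

# load_audio decodes the file with ffmpeg; transcription runs in batches
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=8)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])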

Audio Source

You can transcribe from four different audio sources:

  • File (see image above): Click the file explorer icon to select the file you want to transcribe, or manually enter the path to the file in the Path input field. You can transcribe audio from both audio and video files.

    Note that the file explorer has the All supported files option selected by default. To select only audio files or video files, click the combo box in the lower right corner of the file explorer to change the file type, as marked in red in the following image:

    File explorer

    Supported files

  • Directory: Click the file explorer icon to select the directory containing the files you want to transcribe, or manually enter the path to the directory in the Path input field. Note that the Autosave option is checked and cannot be unchecked because each file's transcription will automatically be saved in the same path as the source file.


    For example, let's use the following directory as a reference:

    └───files-to-transcribe
        │   paranoid-android.mp3
        │   the-past-recedes.flac
        │
        └───movies
                mulholland-dr-2001.avi
                seul-contre-tous-1998.mp4
    

    After transcribing the files-to-transcribe directory using WhisperX, with the Overwrite existing files option unchecked and the output file types .vtt and .txt selected, the folder structure will look like this:

    └───files-to-transcribe
        │   paranoid-android.mp3
        │   paranoid-android.txt
        │   paranoid-android.vtt
        │   the-past-recedes.flac
        │   the-past-recedes.txt
        │   the-past-recedes.vtt
        │
        └───movies
                mulholland-dr-2001.avi
                mulholland-dr-2001.txt
                mulholland-dr-2001.vtt
                seul-contre-tous-1998.mp4
                seul-contre-tous-1998.txt
                seul-contre-tous-1998.vtt
    

    If we transcribe the directory again with the Google API and the Overwrite existing files option unchecked, Audiotext won't process any files, because .txt files already exist for all the files in the directory. However, if we added the file endors-toi.wav to the root of files-to-transcribe, it would be the only file processed, because it has no .txt file associated with it. The same would happen in the WhisperX scenario, since endors-toi.wav has no generated transcription files.

    Note that if we check the Overwrite existing files option, all files will be processed again and the existing transcription files will be overwritten.

  • Microphone: Click the Start recording button to begin recording. The text of the button will change to Stop recording and its color will change to red. Click it again to stop recording and generate the transcription.

    Here is a video demonstrating this feature:

    english.mp4

    Note that your operating system must recognize an input source, otherwise an error message will appear in the text box indicating that no input source was detected.

  • YouTube video: Requires an Internet connection to get the audio of the video. To generate the transcription, simply enter the URL of the video in the YouTube video URL field and click the Generate transcription button when you are finished adjusting the settings.


Save Transcription

When you click the Save transcription button, a file explorer opens where you can name the transcription file and select the path where you want to save it. Please note that any text entered or modified in the text box WILL NOT be included in the saved transcription.

Autosave

If checked, the transcription will automatically be saved in the root of the folder containing the file to transcribe. Files that already exist with the same name won't be overwritten; to overwrite them, you'll need to check the Overwrite existing files option (see below).

Note that if you create a transcription using the Microphone or YouTube audio sources with the Autosave action enabled, the transcription files will be saved in the root of the audiotext-vX.X.X directory.

Overwrite Existing Files

This option can only be checked if the Autosave option is checked. If Overwrite existing files is checked, existing transcriptions in the root directory of the file to be transcribed will be overwritten when saving.

For example, let's use this directory as a reference:

└───audios
        foo.mp3
        foo.srt
        foo.txt

If we transcribe the audio file foo.mp3 with the output file types .json, .txt and .srt and the Autosave and Overwrite existing files options checked, the files foo.srt and foo.txt will be overwritten and the file foo.json will be created.

On the other hand, if we transcribe the audio file foo.mp3 with the same output file types, with the option Autosave checked but without the option Overwrite existing files, the file foo.json will still be created, but the files foo.srt and foo.txt will remain unchanged.
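
To make the rule concrete, here is a hypothetical helper (the function name and signature are mine, not Audiotext's) that decides which transcription files to write:

from pathlib import Path

def outputs_to_write(audio_file: Path, extensions: list[str], overwrite: bool) -> list[Path]:
    # All transcription files that the selected output types would produce
    outputs = [audio_file.with_suffix(ext) for ext in extensions]
    if overwrite:
        return outputs  # rewrite everything, replacing existing files
    return [out for out in outputs if not out.exists()]  # only the missing files

# With foo.srt and foo.txt already on disk, only foo.json is returned:
print(outputs_to_write(Path("audios/foo.mp3"), [".json", ".txt", ".srt"], overwrite=False))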

Google Speech-To-Text API Options

The Google API options frame appears if the selected transcription method is Google API. See the Transcription Method section to learn more about the Google API.

google-api-options

Google API Key

Since the program uses the free Google API tier by default, which allows you to transcribe up to 60 minutes of audio per month, you may need to add an API key if you want to make extensive use of this feature. To do so, click the Set API key button. You'll be presented with a dialog box where you can enter your API key, which will only be used to make requests to the API.

Google API key dialog

Remember that WhisperX provides fast, unlimited audio transcription that supports translation and subtitle generation for free, unlike the Google API. Also note that Google charges for the use of the API key, for which Audiotext is not responsible.
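
For reference, this transcription method goes through the SpeechRecognition package listed in Built With. A minimal sketch, where the file name and language are placeholder values and key=None falls back to the library's free tier:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = recognizer.record(source)  # read the entire audio file

# Pass your own API key via key=... to extend the monthly limit
text = recognizer.recognize_google(audio, key=None, language="en-US")
print(text)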

Whisper API Options

The Whisper API options frame appears if the selected transcription method is Whisper API. See the Transcription Method section to learn more about the Whisper API.

Whisper API options

Whisper API Key

As noted in the Transcription Method section, an OpenAI API key is required to use this transcription method.

To add it, click the Set OpenAI API key button. You'll be presented with a dialog box where you can enter your API key, which will only be used to make requests to the API.

OpenAI API key dialog

OpenAI charges for the use of the API key, for which Audiotext is not responsible. See the Troubleshooting section if you get error 429 on your first request with an API key.

Response Format

The format of the transcription output. One of the following options:

  • json
  • srt (subtitle file type)
  • text
  • verbose_json
  • vtt (subtitle file type)

Defaults to text.

Temperature

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Defaults to 0.

Timestamp Granularities

The timestamp granularities to populate for this transcription. The response format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word and segment.

Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

Defaults to segment.
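
Putting the three options above together, a Whisper API request through the OpenAI Python library might look like the following sketch. The file name and option values are illustrative, not Audiotext's exact code:

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # required for timestamp granularities
        temperature=0,
        timestamp_granularities=["segment"],
    )

print(transcript)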

WhisperX Options

The WhisperX options appear when the selected transcription method is WhisperX. You can select the output file types of the transcription and whether to translate the transcription into English.

WhisperX options

Output File Types

You can select one or more of the following transcription output file types:

  • .aud
  • .json
  • .srt (subtitle file type)
  • .tsv
  • .txt
  • .vtt (subtitle file type)

If you select one of the two subtitle file types (.vtt and .srt), the Subtitle options frame will be displayed with more options (read more here).

Translate to English

To translate the transcription to English, simply check the Translate to English checkbox before generating the transcription, as shown in the video below.

spanish-to-english.mp4

If you want to translate the audio to another language, check the Transcription Language section.

Subtitle Options

When you select the .srt and/or .vtt output file type(s), the Subtitle options frame will be displayed. Note that these options only apply to the .srt and .vtt files:

Subtitle options

To get the subtitle file(s) after the audio is transcribed, you can either check the Autosave option before generating the transcription or click Save transcription and select the path where you want to save them as explained in the Save Transcription section.

Highlight Words

Underline each word as it's spoken in .srt and .vtt subtitle files. Not checked by default.

Max. Line Count

The maximum number of lines in a segment. 2 by default.

Max. Line Width

The maximum number of characters in a line before breaking the line. 42 by default.
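
These three options correspond to the writer options of Whisper-style subtitle writers. The sketch below assumes that whisperx.utils.get_writer mirrors openai-whisper's writer interface; the paths and values are illustrative rather than Audiotext's actual code:

import whisperx
from whisperx.utils import get_writer

device = "cpu"
model = whisperx.load_model("small", device, compute_type="int8")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=8)

# Word-level alignment is needed for accurate subtitle timestamps (and highlighting)
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
aligned["language"] = result["language"]  # the writers expect the language tag

writer = get_writer("srt", ".")  # output format and output directory
writer(aligned, "audio.mp3", {"highlight_words": False, "max_line_count": 2, "max_line_width": 42})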

Advanced Options

When you click the Show advanced options button in the WhisperX options frame, the Advanced options frame appears, as shown in the figure below.

WhisperX advanced options

It's highly recommended that you don't change the default configuration unless you're having problems with WhisperX or you know exactly what you're doing, especially the Compute type and Batch size options. Change them at your own risk and be aware that you may experience problems, such as having to reboot your system if the GPU runs out of VRAM.

Model Size

There are five main ASR (Automatic Speech Recognition) model sizes that trade off speed against accuracy. The larger the model, the more VRAM it uses and the longer transcription takes. WhisperX hasn't published performance data for each model, so the table below is based on the one in OpenAI's Whisper README. According to the WhisperX README, the large-v2 model requires less than 8 GB of GPU memory and batches inference for 70x real-time transcription.

Model   Parameters  Required VRAM
tiny    39 M        ~1 GB
base    74 M        ~1 GB
small   244 M       ~2 GB
medium  769 M       ~5 GB
large   1550 M      <8 GB

Note

large is divided into three versions: large-v1, large-v2, and large-v3. The default model size is large-v2, since large-v3 has some bugs that weren't as common in large-v2, such as hallucination and repetition, especially for certain languages like Japanese. There are also more prevalent problems with missing punctuation and capitalization. See the announcements for the large-v2 and the large-v3 models for more insight into their differences and the issues encountered with each.

The larger the model size, the lower the WER (Word Error Rate in %). The table below is taken from this Medium article, which analyzes the performance of pre-trained Whisper models on common Dutch speech.

Model     WER (%)
tiny      50.98
small     17.90
large-v2   7.81

Compute Type

This term refers to different data types used in computing, particularly in the context of numerical representation. It determines how numbers are stored and represented in a computer's memory. The higher the precision, the more resources will be needed and the better the transcription will be.

There are three possible values for Audiotext (a short device-selection sketch follows the list):

  • int8: Default if using CPU. It represents whole numbers without any fractional part. Its size is 8 bits (1 byte) and it can represent integer values from -128 to 127 (signed) or 0 to 255 (unsigned). It is used in scenarios where memory efficiency is critical, such as in quantized neural networks or edge devices with limited computational resources.
  • float16: Default if using CUDA GPU. It's a half precision type representing 16-bit floating point numbers. Its size is 16 bits (2 bytes). It has a smaller range and precision compared to float32. It's often used in applications where memory is a critical resource, such as in deep learning models running on GPUs or TPUs.
  • float32: Recommended for CUDA GPUs with more than 8 GB of VRAM. It's a single precision type representing 32-bit floating point numbers, which is a standard for representing real numbers in computers. Its size is 32 bits (4 bytes). It can represent a wide range of real numbers with a reasonable level of precision.
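
As a minimal sketch of how these defaults could be chosen (this mirrors the behavior described above; it is not necessarily Audiotext's exact logic):

import torch

# float16 on CUDA GPUs, int8 on CPUs, as described in the list above
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
print(device, compute_type)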

Batch Size

This option determines how many audio segments are processed together in a single pass. It doesn't affect the quality of the transcription, only the generation speed (the smaller the batch, the slower the transcription).

For simplicity, let's divide the possible batch size values into two groups:

  • Small batch size (0 < x <= 8): Uses less memory, which can be important when working with limited resources, at the cost of speed. 8 is the default value.
  • Large batch size (> 8): Speeds up transcription, especially on hardware optimized for parallel processing such as GPUs, at the cost of more memory. 16 is the maximum recommended value.

Use CPU

WhisperX will use the CPU for transcription if checked. Checked by default if there is no CUDA GPU.

As noted in the Compute Type section, the default compute type value for the CPU is int8, since many CPUs don't support efficient float16 or float32 computation, which would result in an error. Change it at your own risk.

Troubleshooting

The program is unresponsive when using WhisperX

The first transcription created by WhisperX will take longer than subsequent ones. That's because Audiotext needs to load the model, which can take a few minutes, depending on the hardware the program is running on. It may appear to be unresponsive, but do not close it, as it will eventually return to a normal state.

Once the model is loaded, you'll notice a dramatic increase in the speed of subsequent transcriptions using this method.

I get the error RuntimeError: CUDA Out of memory when using WhisperX

Try any of the following remedies, taken from the WhisperX README (options 2 and 3 can reduce quality); a sketch follows the list:

  1. Reduce batch size, e.g. 4
  2. Use a smaller ASR model, e.g. base
  3. Use lighter compute type, e.g. int8
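
In code terms, these remedies amount to lighter settings when loading and running the model. A sketch with illustrative values:

import whisperx

# A smaller model and a lighter compute type reduce VRAM usage;
# a smaller batch size reduces peak memory during inference
model = whisperx.load_model("base", "cuda", compute_type="int8")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=4)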

Is it possible to reduce the GPU/CPU memory requirements when using WhisperX?

You can follow the steps above. See the Model Size section for how much memory each model requires.

The program takes too much time to generate a transcription

Try using a smaller ASR model and/or a lighter computation type, as indicated in the point above. Keep in mind that the first WhisperX transcription will take some time to load the model. Also remember that the transcription process depends heavily on your system's hardware, so don't expect instant results on modest CPUs. Alternatively, you can use the Whisper API or Google API transcription methods, which are much less hardware intensive than WhisperX because the transcriptions are generated remotely, but you'll be dependent on the speed of your Internet connection.

When I try to generate a transcription using the Whisper API method, I get the error 429

You'll be prompted with an error like this:

RateLimitError("Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}")

This is either because your account ran out of credits or because you need to fund your account before you can use the API for the first time (even if you have free credits available). To fix this, purchase credits for your account (starting at $5) with a credit or debit card in the Billing section of your OpenAI account settings.

After funds are added to your account, it may take up to 10 minutes for your account to become active.

If you are using an API key that was created before you funded your account for the first time, and the error still persists after about 10 minutes, you'll need to create a new API key and change it in Audiotext (see the Whisper API Key section to change it).

(back to top)

Roadmap

  • Add support for WhisperX.
  • Generate .srt and .vtt files for subtitles (only for WhisperX).
  • Add "Stop recording" button state when recording from the microphone.
  • Add a dialog to let users input their Google Speech-To-Text API key.
  • Add subtitle options.
  • Add advanced options for WhisperX.
  • Add the option to transcribe YouTube videos.
  • Add checkbox to automatically save the generated transcription (#17).
  • Allow transcription of multiple files from a directory.
  • Add Output file types option to WhisperX options.
  • Add support for .json, .tsv and .aud output file types when using WhisperX as transcription method.
  • Add appearance_mode to config.ini.
  • Add support for Whisper's API (#42).
  • Add pre-commit to use the ruff and mypy hooks.
  • Set up a CI pipeline to check code quality on pull requests and pushes to main by running the pre-commit hooks.
  • Change the Generate transcription button to Cancel transcription when a transcription is in progress.
  • Generate executables for macOS and Linux.
  • Add tests.

You can propose a new feature by creating an issue.

Authors

See also the list of contributors who participated in this project.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please read the CONTRIBUTING.md file, where you can find more detailed information about how to contribute to the project.

Acknowledgments

I used the following resources to create this project:

License

Distributed under the BSD-4-Clause license. See LICENSE for more information.

Support

Would you like to support the project? That's very kind of you! However, I would suggest that you first consider supporting the packages I used to build it. If you still want to support this particular project, you can go to my Ko-fi profile by clicking the button below!

ko-fi

(back to top)


audiotext's Issues

No Audiotext.app in the downloaded folder

This looks great! Excited to try it! I have downloaded the latest 2.2 version and unzipped it on my Mac.

The instructions on the main page say:
"Open the audiotext folder and double-click the Audiotext executable file (.exe for Windows and .app for macOS)."

However, there is no "Audiotext" folder within the "v2.2.0" zip file when unzipped.

I did find Audiotext.exe in the root of that folder, but not Audiotext.app.

Thank you!!


Real time Transcription

Make the recording option of the transcription real-time

This will make it easier to use the application in a live transcription scenario.

Instead of only recording the microphone for a set time and then transcribing, it would keep transcribing and appending the text as it records.

I'm willing to implement this as a means of contributing if the guidance is provided.

Thanks for the awesome software!

[Bug] Windows, a few errors

Steps to reproduce

Windows.

Downloaded the latest release, already have ffmpeg installed.

Transcription Language: Swedish
Audio source: file (file.mkv)
Transcription method: WhisperX
Output filetype: srt

Clicked on "Generate transcription"

Took around an hour, then I got:

Traceback (most recent call last):
  File "handlers\whisperx_handler.py", line 53, in transcribe_file
  File "whisperx\alignment.py", line 71, in load_align_model
    raise ValueError(f"No default align-model for language: {language_code}")
ValueError: No default align-model for language: sv

An .srt file was created, and looking at the result (here are the first 11 lines):

1
00:00:23,660 --> 00:00:52,381
–Trodde du att jag hade glömt bort dig? –Risto, vad är det du gör? –Varför betalar du inte för? –Jag har inte sett nåt! Jesper! Jag betalar för att du får dubbelt så mycket jag lovar! –Jag vill inte! –Risto, gör inte det! –Titta på mig! –Titta mig i ögonen!

2
00:00:59,838 --> 00:01:01,510
För en väckbara.

3
00:03:30,452 --> 00:03:59,684
–Vad är det som har hänt? –Jag kan tyvärr inte berätta. –Jag ska besöka en vän som bor här. –Vad heter den personen? –Jakob Fivel. –Jag ska kalla på nån. Vad sa du nyligen?

It seems it does a decent job, but it can't split the dialogue correctly.

Perhaps it's because there is no align model?

[Bug] I am encountering an error while using the software

Traceback (most recent call last):
  File "handlers\whisperx_handler.py", line 53, in transcribe_file
  File "whisperx\alignment.py", line 73, in load_align_model
    align_model = bundle.get_model(dl_kwargs={"model_dir": model_dir}).to(device)
  File "torchaudio\pipelines\_wav2vec2\impl.py", line 126, in get_model
  File "torchaudio\pipelines\_wav2vec2\impl.py", line 202, in _get_state_dict
  File "torchaudio\pipelines\_wav2vec2\impl.py", line 89, in _get_state_dict
  File "torch\hub.py", line 750, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "torch\serialization.py", line 797, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "torch\serialization.py", line 283, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

Steps to reproduce

I followed the guide here https://github.com/HenestrosaDev/audiotext#set-up-the-project-locally

With the options shown in the attached screenshot.

Expected behaviour

It should transcribe the audio file selected

Actual behaviour

Errors out: RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version


Traceback (most recent call last):
  File "C:\Users\HP\Documents\GitHub\Tarm\audiotext\src\handlers\whisperx_handler.py", line 30, in transcribe_file
    model = whisperx.load_model(
            ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\whisperx\asr.py", line 288, in load_model
    model = model or WhisperModel(whisper_arch,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\faster_whisper\transcribe.py", line 133, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

System information


  • System: "Windows 10 x64"
  • System language: English
  • Audiotext version: v2.2.3

Unable to run the executable in GNU/Linux

Hi!
There seems to be some problem with the GNU/Linux executable. I am running Arch Linux. I tried audiotext-v1.2.2-unix.zip and audiotext-v1.2.0-unix.zip.

In both cases, when I run ./audiotext I get the error cannot execute binary file

I did:

file audiotext
audiotext: Mach-O 64-bit arm64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|PIE>

So it seems that the provided binary is compiled only for macOS. Can you provide a Linux compatible binary?

Not extracting audio using ffmpeg

Hello! I am using the recently released latest version of Audiotext, v2.1.0. Using the WhisperX option to transcribe and generate subtitles, I got the following error even though the Use CPU option was selected when running:

RuntimeError: CUDA error: CUDA driver version is insufficient for CUDA runtime version

But I fixed this by manually editing the config file (use_cpu=True and use_gpu=False), and the issue was resolved.

But now I am facing another error:

Traceback (most recent call last):
  File "controller\main_controller.py", line 179, in _transcribe_using_whisperx
  File "whisperx\audio.py", line 61, in load_audio
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
  File "subprocess.py", line 501, in run
  File "subprocess.py", line 969, in __init__
  File "subprocess.py", line 1438, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified

How to solve the issue?

My device specs:
Windows 11
Ryzen 3 processor
12 GB of RAM

Issue generating subtitles

Steps to reproduce

  1. Tried to generate subtitles for this classic movie: https://1drv.ms/v/s!AvxL3H5dkUh1h4k-5Z8uG14x1YrLfQ?e=nAnSTF
  2. Settings:
    - Language: English (also tested with Spanish, same result)
    - Transcribe from File
    - Transcribe using WhisperX
    - Translate to English: NOT CHECKED
    - Generate subtitles: CHECKED
    - Highlight words: NOT CHECKED
    - Max. line count: 2
    - Max. line width: 42
    - Model size: large-v2
    - Compute type: float32
    - Batch size: 8
    - Use CPU: NOT CHECKED
    - Running 2.2.1 on Windows 10 x64
  3. Generate transcription works OK.

Expected behaviour

Save transcription generates .txt, .srt and .vtt files.

It does behave as expected with other audio files.

Actual behaviour

Save transcription gets stuck on an empty 0-byte .txt. Neither .srt nor .vtt are generated.

System information

  • Windows 10 x64
  • Spanish system language (es-ES)
  • Audiotext 2.2.1

AttributeError('Could not find PyAudio; check installation')

Following the resolution of issue #40, I tried using transcription from the microphone. I got AttributeError('Could not find PyAudio; check installation').

When I run pip install pyaudio==0.2.13, here's the output:

Collecting pyaudio==0.2.13
  Using cached PyAudio-0.2.13.tar.gz (46 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [33 lines of output]
      Traceback (most recent call last):
        File "c:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "c:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "c:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 112, in get_requires_for_build_wheel
          backend = _build_backend()
                    ^^^^^^^^^^^^^^^^
        File "c:\Users\HP\Documents\GitHub\Tarm\audiotext\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 77, in _build_backend
          obj = import_module(mod_path)
                ^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Python312\Lib\importlib\__init__.py", line 90, in import_module
          return _bootstrap._gcd_import(name[level:], package, level)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
        File "<frozen importlib._bootstrap_external>", line 995, in exec_module
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "C:\Users\HP\AppData\Local\Temp\pip-build-env-ohlyygnb\overlay\Lib\site-packages\setuptools\__init__.py", line 16, in <module>
          import setuptools.version
        File "C:\Users\HP\AppData\Local\Temp\pip-build-env-ohlyygnb\overlay\Lib\site-packages\setuptools\version.py", line 1, in <module>
          import pkg_resources
        File "C:\Users\HP\AppData\Local\Temp\pip-build-env-ohlyygnb\overlay\Lib\site-packages\pkg_resources\__init__.py", line 2191, in <module>
          register_finder(pkgutil.ImpImporter, find_on_path)
                          ^^^^^^^^^^^^^^^^^^^
      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Unhandled Exception Occurred: KeyError: 'English_United States'

When I ran the audiotext.exe file from audiotext-v1.2.0.zip after extraction, I got the exception below.

Traceback (most recent call last):
  File "main.py", line 261, in <module>
  File "main.py", line 19, in __init__
  File "main.py", line 62, in create_sidebar
KeyError: 'English_United States'

[Bug] Program doesn't start on Windows 10 with KeyError: 'de_DE'

Executing the release v1.2.1-windows gives me the following error.
I don't know what my locale code has to do with keeping the program from starting successfully.

Traceback (most recent call last):
  File "app.py", line 46, in <module>
  File "app.py", line 32, in __init__
  File "view\main_window.py", line 17, in __init__
  File "view\main_window.py", line 58, in _init_sidebar
KeyError: 'de_DE'

[Bug] pyaudio error, module 'pkgutil' has no attribute 'ImpImporter'

Steps to reproduce

Followed the guide, cloned repo, started the virtualenv, run 'pip install -r requirements.txt'
Get the error message:

Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [33 lines of output]
      Traceback (most recent call last):
        File "/home/magnus/xx/audiotext/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/magnus/xx/audiotext/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/magnus/xx/audiotext/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 112, in get_requires_for_build_wheel
          backend = _build_backend()
                    ^^^^^^^^^^^^^^^^
        File "/home/magnus/xx/audiotext/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
          obj = import_module(mod_path)
                ^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
          return _bootstrap._gcd_import(name[level:], package, level)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
        File "<frozen importlib._bootstrap_external>", line 995, in exec_module
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "/tmp/pip-build-env-3e896ujr/overlay/lib/python3.12/site-packages/setuptools/__init__.py", line 16, in <module>
          import setuptools.version
        File "/tmp/pip-build-env-3e896ujr/overlay/lib/python3.12/site-packages/setuptools/version.py", line 1, in <module>
          import pkg_resources
        File "/tmp/pip-build-env-3e896ujr/overlay/lib/python3.12/site-packages/pkg_resources/__init__.py", line 2191, in <module>
          register_finder(pkgutil.ImpImporter, find_on_path)
                          ^^^^^^^^^^^^^^^^^^^
      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

I'm on Arch, which uses Python 3.12.4.

I saw the other bug report on this issue that it should be fixed, but it still happens to me.

Help

Hi!
I am not very familiar with this stuff and I require some assistance. I am using macOS, I have installed FFmpeg, and I have downloaded the Audiotext unix file, but when I try to open it, it just opens the text editor. Do I need to run the file with ffmpeg? Could you help me please? :) I would really like to learn and use this program; it seems really useful to me as a student.

CUDA error when CPU selected

Steps to reproduce

I got a CUDA error on several retries to transcribe an English mp3 file of about 4 MB on Windows 11.
Version of audiotext used was 2.1.0

Expected behaviour

Transcribed text of audio file

Actual behaviour

Traceback (most recent call last):
  File "controller\main_controller.py", line 170, in _transcribe_using_whisperx
  File "whisperx\asr.py", line 287, in load_model
    model = model or WhisperModel(whisper_arch,
  File "faster_whisper\transcribe.py", line 130, in __init__
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version


Asking a Question.

Can I use more than 10 minutes of audio to convert to text? And is it free or not? Do I have to buy a Google API key? Thanks.

HELP!!!

Hello! Congratulations on your amazing project! But could you please tell me step by step what I have to do to use it? I am new to this and I don't know how!

Auto Save Transcript When Complete

Overview

I propose a checkbox that automatically saves an .srt, .vtt, or .txt file in the directory (assuming it isn't overwriting an existing file with the same name) as soon as the transcription of a file is complete.

Motivation

The problem that this solves is not necessarily for one-off files. But if in the future a feature is added that allows transcription of multiple files at once this would be extremely useful.

Proposal

The way to implement this would likely be an additional checkbox on the GUI that asks the user if they want to autosave the transcript and in what formats they want autosaved. Before saving there should be a check to see if there is an existing file with the same name. An additional checkbox could be added to either always overwrite, with the default being to not overwrite any files with the same name.

Alternatives Considered

None. I'm open to hearing what others think of though

Additional Context

As stated above, I think that the greatest advantage of this feature would be if additional functionality gets added in the future that transcribes multiple files at once.

Next Steps

I am unfamiliar with how to go about executing the next steps myself, but after messing around with ChatGPT, this is what it thinks the best next steps are:

Technical Implementation Details

  1. UI Changes:

Add Checkboxes to the GUI:
Create a new checkbox labeled "Autosave Transcript" in the transcription settings panel.
Add additional checkboxes or a dropdown menu for selecting the file formats (SRT, VTT, TXT) for autosaving.
Add a checkbox labeled "Overwrite Existing Files" with a default setting of not overwriting existing files.

Pseudocode for GUI elements

autosave_checkbox = CheckBox("Autosave Transcript")
format_checkbox_srt = CheckBox("Save as SRT")
format_checkbox_vtt = CheckBox("Save as VTT")
format_checkbox_txt = CheckBox("Save as TXT")
overwrite_checkbox = CheckBox("Overwrite Existing Files", default=False)

settings_panel.add(autosave_checkbox)
settings_panel.add(format_checkbox_srt)
settings_panel.add(format_checkbox_vtt)
settings_panel.add(format_checkbox_txt)
settings_panel.add(overwrite_checkbox)

  2. File Saving Logic:

Determine File Path and Name:
Construct the file path based on the directory where the original file is located and append the appropriate file extension for the selected formats.

Pseudocode

import os  # needed by the path helpers below

def get_file_path(original_file_path, extension):
    directory = os.path.dirname(original_file_path)
    base_name = os.path.splitext(os.path.basename(original_file_path))[0]
    return os.path.join(directory, f"{base_name}.{extension}")

Check for Existing Files:
Before saving, check if a file with the same name already exists in the directory.
If the "Overwrite Existing Files" checkbox is not selected and a file exists, generate a new file name or skip saving.

Pseudocode

def check_and_save_file(file_path, content, overwrite):
    if os.path.exists(file_path):
        if overwrite:
            save_file(file_path, content)
        else:
            new_file_path = generate_unique_filename(file_path)
            save_file(new_file_path, content)
    else:
        save_file(file_path, content)

def generate_unique_filename(file_path):
    base, extension = os.path.splitext(file_path)
    counter = 1
    new_file_path = f"{base}_{counter}{extension}"
    while os.path.exists(new_file_path):
        counter += 1
        new_file_path = f"{base}_{counter}{extension}"
    return new_file_path

3. Saving the Transcription:

Function to Save Files:
Implement the logic to save the transcription content into the specified file formats.

Pseudocode

def save_file(file_path, content):
    with open(file_path, 'w') as file:
        file.write(content)

def autosave_transcription(original_file_path, transcript, formats, overwrite):
    for fmt in formats:
        file_path = get_file_path(original_file_path, fmt)
        check_and_save_file(file_path, transcript, overwrite)

  4. Integration with Transcription Completion:

Hook into the Transcription Process:
Modify the transcription process to include a call to the autosave function when the transcription is complete.

Pseudocode

def on_transcription_complete(original_file_path, transcript):
    if autosave_checkbox.is_checked():
        selected_formats = []
        if format_checkbox_srt.is_checked():
            selected_formats.append('srt')
        if format_checkbox_vtt.is_checked():
            selected_formats.append('vtt')
        if format_checkbox_txt.is_checked():
            selected_formats.append('txt')

        overwrite = overwrite_checkbox.is_checked()
        autosave_transcription(original_file_path, transcript, selected_formats, overwrite)

  5. Testing and Validation:

Unit Tests:
Write unit tests to validate the file saving logic, including scenarios with existing files and different user preferences for overwriting.

Pseudocode

def test_generate_unique_filename():
    # Add tests to validate unique file name generation
    ...

def test_check_and_save_file():
    # Add tests to validate file saving logic with and without overwriting
    ...

def test_autosave_transcription():
    # Add tests to validate the autosave functionality for different formats
    ...

  6. Documentation:

Update User Guide:
Include instructions and screenshots in the user guide to explain how to use the new autosave feature and configure the settings.

Multiple file support

Overview

Allow more than one file to be selected for transcription.

Add a checkbox to skip files that already have transcriptions.

Add the ability to have it monitor a list of folders periodically and transcribe everything in them.
