whisper.el

Speech-to-text interface for Emacs using OpenAI’s whisper speech recognition model. For the inference engine it uses whisper.cpp, the awesome C/C++ port that can run on a consumer-grade CPU (without requiring a high-end GPU).

You can capture audio with your local input device (microphone) or choose a media file on disk, spoken in your local language, and have the transcribed text pasted into your Emacs buffer (optionally after translating it to English). This runs offline, so you get decent results without relying on a non-free cloud service (though the result quality of whisper varies widely depending on the language, see below).

Install and Usage

Aside from a C++ compiler (to compile whisper.cpp), the system needs to have FFmpeg for recording audio.

You can install whisper.el by cloning this repo somewhere, and then using it like this:

(use-package whisper
  :load-path "path/to/whisper.el"
  :bind ("C-H-r" . whisper-run)
  :config
  (setq whisper-install-directory "/tmp/"
        whisper-model "base"
        whisper-language "en"
        whisper-translate nil))

You will use these functions:

  • whisper-run: Toggle between recording from your microphone and transcribing
  • whisper-file: Same as above, but transcribes a local file on disk instead of a recording

Invoking whisper-run with a prefix argument (C-u) has the same effect as whisper-file.

Both of these functions will automatically compile the whisper.cpp dependency and download the language model the first time they are run. When a recording is in progress, invoking them stops it and starts transcribing. Otherwise, if a compilation, a download (of the model file) or a transcription job is in progress, calling them again cancels it.

Variables

  • whisper-install-directory: Location where whisper.cpp will be installed. Default is ~/.emacs.d/.cache/.
  • whisper-language: Specify your spoken language. Default is en. For all possible short-codes: see here. You can also set it to auto to let whisper.cpp infer the language from the first 30 seconds of audio. How well whisper works will vary depending on the language. Some scores can be found in the original paper, or here.
  • whisper-model: Which language model to use. Default is base. Values are: tiny, base, small, medium, large-v1, large. Bigger models are more accurate, but take more time and more RAM to run (aside from more disk space and a bigger download), see: resource requirements. Note that these come with .en variants that might be faster, but are for English only.
  • whisper-translate: Default nil means the transcription output language is the same as the spoken language. Setting it to t translates the output to English first.
  • whisper-use-threads: Default nil means let whisper.cpp choose an appropriate value (which it sets with the formula min(4, num_of_cores)). If you want to use more than 4 threads (because you have more than 4 CPU cores), set this number manually.
  • whisper-recording-timeout: Default is 300 seconds. We do not want to start recording and then forget about it. The intermediate temporary file is stored in uncompressed wav format (roughly 4.5 MB per minute, but this can vary), so it can grow and fill the disk even though /tmp/ is used for it by default.
  • whisper-enable-speed-up: Default is nil. This can supposedly speed up transcribing by up to 2x, at the expense of some accuracy. You should experiment to see whether it works for you, especially when using larger models.
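
As a minimal sketch of tuning two of the variables above (the values are illustrative, not recommendations from the author), the thread count can be derived from the machine and the recording timeout lowered to bound the size of the temporary file. It assumes Emacs 28.1 or later for num-processors:

```elisp
;; Use all cores instead of whisper.cpp's default of min(4, cores).
;; Assumption: Emacs 28.1+, where `num-processors' is built in; on older
;; versions this leaves the value nil, i.e. whisper.cpp's own default.
(setq whisper-use-threads
      (when (fboundp 'num-processors)
        (num-processors)))

;; Cap recordings at 2 minutes. At the rough figure of 4.5 MB per minute
;; quoted above, that bounds the temporary wav file at about 9 MB (the
;; default of 300 seconds would allow roughly 22.5 MB).
(setq whisper-recording-timeout 120)
```

More threads than physical cores is unlikely to help, so it may be worth comparing a few values on your own hardware.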

Additionally, depending on your input device and system, you may need to modify these variables to get recording to work:

  • whisper--ffmpeg-input-format: This is what you would pass to the -f flag of FFmpeg to record audio. Default is pulse on Linux, avfoundation on macOS and dshow on Windows.
  • whisper--ffmpeg-input-device: This is what you would pass to the -i flag of FFmpeg to record audio, something like hw:0,2. There is no default (unless you are using PulseAudio, in which case it is default), so this will likely need to be set.
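
As a rough sketch, assuming an ALSA capture device on Linux (the alsa format name and the hw:0,2 address are examples only, not defaults of whisper.el), the two variables might be set like this:

```elisp
;; Assumption: the microphone is ALSA card 0, device 2; substitute your own.
;; `arecord -l` lists ALSA capture devices; on macOS,
;; `ffmpeg -f avfoundation -list_devices true -i ""` lists avfoundation ones.
(setq whisper--ffmpeg-input-format "alsa"      ; passed to FFmpeg's -f flag
      whisper--ffmpeg-input-device "hw:0,2")   ; passed to FFmpeg's -i flag
```

If recording still fails, running FFmpeg by hand with the same -f and -i values is a quick way to check that the pair is valid before putting it in your Emacs config.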

Caveats

  • Whisper is open-source in the sense that the weights and the engine source are available. But the training data and methodology are not.
  • Real-time transcription is probably not feasible with it yet. The accuracy is better when it has a bigger window of surrounding context. Plus, it would need beefy hardware to keep up, possibly with a smaller model. There is some interesting activity going on at whisper.cpp upstream, but in the end I don’t see the appeal of that in my workflow (yet).

Contributors

munen, natrys
