
Comments (15)

mkiol commented on August 28, 2024

I've pushed the 4.3.0 release to Flathub. The updated package is even bigger than the previous one - sorry.

I'm still working on implementing the "modular approach", but it is not ready yet.

To address the problem, I'm also publishing a "Tiny" Flatpak package. You can download it from the Releases page. This "Tiny" version is much smaller and contains only the basic features.

Comparison between the "Flathub" and "Tiny" Flatpak packages:

| Feature | Flathub | Tiny |
|---|---|---|
| Coqui/DeepSpeech STT | ✔️ | ✔️ |
| Vosk STT | ✔️ | ✔️ |
| Whisper STT | ✔️ | ✔️ |
| Whisper STT GPU | ✔️ | |
| Faster Whisper STT | ✔️ | |
| April-ASR STT | ✔️ | ✔️ |
| eSpeak TTS | ✔️ | ✔️ |
| MBROLA TTS | ✔️ | ✔️ |
| Piper TTS | ✔️ | ✔️ |
| RHVoice TTS | ✔️ | ✔️ |
| Coqui TTS | ✔️ | |
| Mimic3 TTS | ✔️ | |
| Punctuation restoration | ✔️ | |

from dsnote.

mkiol commented on August 28, 2024

Exactly, the Flatpak package contains most (but not all) of the dependencies. This includes dozens of Python libraries, CUDA, a partial ROCm framework, etc. The libraries have to be shipped inside the Speech Note package because the Flatpak sandbox blocks any use of system libraries.

In detail, it looks like this (the biggest ones):

| Lib | Size (MB) | Role |
|---|---|---|
| PyTorch | 676 | Coqui TTS, Restore Punctuation |
| CUDA | 606 | GPU acceleration for Whisper (NVIDIA) |
| unidic_lite | 248 | TTS for Japanese |
| ROCm (only OpenCL part) | 135 | GPU acceleration for Whisper (AMD) |
| gruut_lang_es | 128 | TTS for Spanish |
| llvmlite | 103 | Coqui TTS |
| pypinyin | 100 | Coqui TTS |
| scipy | 79 | Coqui TTS |
| gruut_lang_de | 71 | TTS for German |
| pandas | 60 | Coqui TTS |
| transformers | 56 | Coqui TTS, Restore Punctuation |
| Perl | 54 | TTS for Korean |
| OpenBLAS | 40 | Whisper STT, Vosk STT, Translator |
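To put the table in perspective, here is a quick back-of-the-envelope sum of the listed libraries (sizes copied from the table above; a rough sketch, not exact package accounting):

```python
# Sizes (MB) of the largest bundled libraries, copied from the table above.
sizes_mb = {
    "PyTorch": 676, "CUDA": 606, "unidic_lite": 248, "ROCm (OpenCL part)": 135,
    "gruut_lang_es": 128, "llvmlite": 103, "pypinyin": 100, "scipy": 79,
    "gruut_lang_de": 71, "pandas": 60, "transformers": 56, "Perl": 54,
    "OpenBLAS": 40,
}

total_mb = sum(sizes_mb.values())
print(f"Largest libs alone: {total_mb} MB (~{total_mb / 1024:.1f} GiB)")
# → Largest libs alone: 2356 MB (~2.3 GiB)
```

So the thirteen biggest dependencies alone account for well over 2 GiB of the package.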

The most problematic are the Python libraries. They have a tendency to be ridiculously huge.

> is it not better to move those to download mode just like language data?

Good idea, but not so easy to implement. I really don't want to create a separate "package system" just for this app.

Maybe the solution is to distribute in a non-sandboxed package format, in which the app would use the libraries installed on the system. Actually, here is a pull request adding an AUR package. It works quite well, I must say. Maybe next will be a package for Debian.

mkiol commented on August 28, 2024

@rezad1393 Thank you for sharing. There is very little documentation regarding this functionality.

mkiol commented on August 28, 2024

"Modular" version has been released 🎉

It looks like this (section from README):

Starting from v4.4.0, the app distributed via Flatpak (published on Flathub) consists of the following packages:

  • Base package "Speech Note" (net.mkiol.SpeechNote)
  • Add-on for AMD graphics card "Speech Note AMD" (net.mkiol.SpeechNote.Addon.amd)
  • Add-on for NVIDIA graphics card "Speech Note NVIDIA" (net.mkiol.SpeechNote.Addon.nvidia)

Comparison between the Base, Tiny and Add-on Flatpak packages:

| Sizes | Base | Tiny | AMD add-on | NVIDIA add-on |
|---|---|---|---|---|
| Download size | 0.9 GiB | 70 MiB | +2.1 GiB | +3.8 GiB |
| Unpacked size | 2.9 GiB | 170 MiB | +11.5 GiB | +6.9 GiB |

| Features | Base | Tiny | AMD add-on | NVIDIA add-on |
|---|---|---|---|---|
| Coqui/DeepSpeech STT | + | + | | |
| Vosk STT | + | + | | |
| Whisper (whisper.cpp) STT | + | + | | |
| Whisper (whisper.cpp) STT AMD GPU | - | - | + | |
| Whisper (whisper.cpp) STT NVIDIA GPU | - | - | | + |
| Faster Whisper STT | + | - | | |
| Faster Whisper STT NVIDIA GPU | - | - | | + |
| April-ASR STT | + | + | | |
| eSpeak TTS | + | + | | |
| MBROLA TTS | + | + | | |
| Piper TTS | + | + | | |
| RHVoice TTS | + | + | | |
| Coqui TTS | + | - | | |
| Coqui TTS AMD GPU | - | - | + | |
| Coqui TTS NVIDIA GPU | - | - | | + |
| Mimic3 TTS | + | - | | |
| Punctuation restoration | + | - | | |
| Translator | + | + | | |
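Since the add-ons install on top of the Base package, the effective download for each setup is the sum of the two columns; a small sketch using the figures from the table above:

```python
# Download sizes in GiB, taken from the comparison table (Tiny is 70 MiB).
BASE, TINY = 0.9, 70 / 1024
AMD_ADDON, NVIDIA_ADDON = 2.1, 3.8

print(f"Base only:            {BASE:.1f} GiB")
print(f"Base + AMD add-on:    {BASE + AMD_ADDON:.1f} GiB")
print(f"Base + NVIDIA add-on: {BASE + NVIDIA_ADDON:.1f} GiB")
```

In other words, GPU users still download a few GiB, but everyone else now gets a 0.9 GiB Base instead of one monolithic multi-GiB package.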

harkai84 commented on August 28, 2024

Cool. Thank you for this overview. Very interesting.

rezad1393 commented on August 28, 2024

Right now the app is too big.

I don't have NVIDIA, and my GPU is integrated (an AMD APU), so I don't think ROCm is even applicable to me.
I don't use Japanese, nor Spanish, nor German, so that 1.2 GB is not for me.

About integration: I am not a Flatpak dev, but I see things like this in Flatpak when I try to install the Steam client:

    app/com.valvesoftware.Steam/x86_64/stable
    10) runtime/com.valvesoftware.Steam.Utility.MangoHud/x86_64/stable
    11) runtime/com.valvesoftware.Steam.Utility.steamtinkerlaunch/x86_64/stable
    12) app/com.steamgriddb.steam-rom-manager/x86_64/stable
    13) runtime/com.valvesoftware.Steam.Utility.vkBasalt/x86_64/stable
    14) runtime/com.valvesoftware.Steam.CompatibilityTool.Proton-Exp/x86_64/stable
    15) runtime/com.valvesoftware.Steam.Utility.gamescope/x86_64/stable

It is as if Steam is the main app and the others are add-ons you can install.

So maybe you can separate CUDA and the other big pieces into Flatpak add-ons, and they would get integrated (or could be integrated) with the main app if the user installs them:

    net.mkiol.SpeechNote
    net.mkiol.SpeechNote.cuda
    net.mkiol.SpeechNote.rocm

and so on.

Another way to do it is to use Flatpak for the main app but put the CUDA libs and the others into separate downloads hosted on GitHub, which the user downloads inside the app (not system-installed), just like language model data is right now.

mkiol commented on August 28, 2024

> as if steam is the main app and other are addons you can add.
> so maybe you can separate those cuda and stuff to addons in flatpak and they would get integrated (or can be integrated) with main app like is user install them.
> net.mkiol.SpeechNote
> net.mkiol.SpeechNote.cuda
> net.mkiol.SpeechNote.rocm

Thanks for pointing that out. I wasn't aware of this possibility. In Flatpak vocabulary these are called "extensions". I will investigate what can be done.
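For reference, the mechanism discussed here is declared in the app's flatpak-builder manifest via `add-extensions`. A minimal sketch, assuming a hypothetical extension point name (the real Speech Note manifest may use different keys and paths; see the flatpak-builder and flatpak-metadata documentation for details):

```json
{
  "add-extensions": {
    "net.mkiol.SpeechNote.Addon": {
      "directory": "addons",
      "subdirectories": true,
      "no-autodownload": true,
      "autodelete": true
    }
  }
}
```

Each add-on (e.g. a CUDA or ROCm package) would then be built as its own Flatpak that installs into the declared extension directory, and the user would pull it in with a regular `flatpak install`.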

rezad1393 commented on August 28, 2024

flatpak/flatpak-docs#18 (comment)
Maybe that helps, though it is not an explanation from a Flatpak dev.

When I searched for "extension" and "flatpak" I got a lot of GNOME extension results, which wasn't helpful.

These are way better:

https://blog.tingping.se/2018/03/18/flatpaking-plugins.html

https://old.reddit.com/r/flatpak/comments/hoeenw/example_of_extension_packages_with_extradata/

rezad1393 commented on August 28, 2024

Thank you.

I am not in a hurry, and I am playing with the speech-to-text part.
I don't think the two-version approach is the best model.

I can wait if you are going to release a modular version.

By the way, sometimes GitHub will ban/censor you for hosting large files (like the add-ons), so please consider hosting them as add-ons on Flathub.

This has the benefit of shipping customized, working versions of the add-ons (engines and data).

rezad1393 commented on August 28, 2024

Thank you for the new release.
I don't see any AMD add-on, but I have no discrete AMD card, so it is not that important for me.

By the way, can you add the Tiny version to Flathub?

Also, what is "Punctuation restoration" exactly?

mkiol commented on August 28, 2024

> btw can you add the tiny version to flathub?

Adding it as a new app to Flathub would be problematic. The "Tiny" package is de facto "Base" without everything that depends on Python. Actually, I'm thinking about creating an additional add-on with all the Python dependencies, so that "Tiny" would become "Base". This might work 🤔 On the other hand, I don't want to make too many add-ons, to not confuse users too much.

> also what is "Punctuation restoration" exactly?

Yeah, the function of this is definitely not well explained in the app.

"Punctuation restoration" is extra text processing after STT. It uses an additional ML model (you need to download it in the model browser) to guess the right punctuation marks (,.?!:) in the text. It is only needed, and only enabled, for DeepSpeech, Vosk and a few April-ASR models. Whisper and Faster Whisper natively support punctuation, so this feature is not used for those models. I've made "Punctuation restoration" an option (vs. always enabled) because the currently configured ML model for restoring punctuation is quite slow and memory-hungry. If you are looking for speed, this feature should be disabled.

rezad1393 commented on August 28, 2024

Thanks for the answer.
By the way, I am not familiar with all the ML TTS models.

Is any one of them better than the others?
I think all of them are open source?

I use Whisper because I saw it first, and the others that I tried were not good even with English.
I tried them on my own Persian language, and Whisper was the least problematic one (though none were good enough for Persian).

mkiol commented on August 28, 2024

> is there anyone of them that is better than other ones?
> I think all of them are open source?

The license is not the same for all models. In v4.4.0 you can check each model's license in the model browser. Not all models have license information, because I wasn't able to label all of them. Simply, there are too many models! In general, models can be divided into two groups: "free to use" and "free to use only for non-commercial purposes". A non-commercial model should have the correct license attached in the model browser. When a license is missing, most likely the model is "free to use".

Which TTS model is the best? As with everything, it depends :) My favorite is Coqui XTTS. It is a multilingual voice-cloning model that produces very natural-sounding speech. It is also quite slow, but GPU acceleration helps. If you are looking for something lightweight, Piper models are the best. Unfortunately, Farsi doesn't have many TTS options. There is one Coqui model, which is terrible, and a somewhat better Mimic3 one.

rezad1393 commented on August 28, 2024

I incorrectly said TTS.
I meant speech to text.

Is there any model that gives the best results for English speech to text?

Thank you for the TTS part of the answer.

mkiol commented on August 28, 2024

> I meant speech to text.

Right 😄

I won't tell you anything you don't know. The best STT models are from the "Whisper" family. I usually use "Whisper Large-v2" because it has outstanding accuracy and, with GPU acceleration enabled, it is perfectly usable in terms of speed. If you can't use a GPU, I would go with "Distil-FasterWhisper Small".
