mkiol / dsnote Goto Github PK

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.

License: Mozilla Public License 2.0

QML 9.50% C++ 88.25% CMake 2.01% Shell 0.23%

asr sailfishos stt tts flatpak-applications linux-desktop nmt offline translator machine-translation

dsnote's Introduction

Speech Note

Linux desktop and Sailfish OS app for note taking, reading and translating with offline Speech to Text, Text to Speech and Machine Translation

Description

Speech Note let you take, read and translate notes in multiple languages. It uses Speech to Text, Text to Speech and Machine Translation to do so. Text and voice processing take place entirely offline, locally on your computer, without using a network connection. Your privacy is always respected. No data is sent to the Internet.

Speech Note uses many different processing engines to do its job. Currently these are used:

Speech to Text (STT)
Text to Speech (TTS)
- espeak-ng
- MBROLA
- Piper
- RHVoice
- Coqui TTS
- Mimic 3
- WhisperSpeech
Machine Translation (MT)
- Bergamot Translator

Languages and Models

Following languages are supported:

Lang ID	Name	DeepSpeech (STT)	Whisper (STT)	Vosk (STT)	April-ASR (STT)	Piper (TTS)	RHVoice (TTS)	espeak (TTS)	MBROLA (TTS)	Coqui (TTS)	Mimic3 (TTS)	WhisperSpeech (TTS)	Bergamot (MT)
af	Afrikaans		●					●			●
am	Amharic	● (e)	●					●		●
ar	Arabic		●	●		●		●	●	●			●
bg	Bulgarian		●					●		●
bn	Bengali		●					●		●	●
bs	Bosnian		●					●
ca	Catalan	●	●	●		●		●		●			●
cs	Czech	●	●	●		●	●	●	●	●			●
cy	Welsh					●
da	Danish		●			●		●		●			●
de	German	●	●	●		●		●		●	●		●
el	Greek	● (e)	●			●		●		●	●		●
en	English	●	●	●	●	●	●	●		●	●	●	●
eo	Esperanto			●			●	●
es	Spanish	●	●	●		●		●		●	●		●
et	Estonian	● (e)	●					●	●	●			●
eu	Basque	● (e)	●					●		●
fa	Persian	●	●	●		●		●	●	●	●		●
fi	Finnish	●	●			●		●		●	●		●
fr	French	●	●	●	●	●		●		●	●		●
ga	Irish							●		●
gu	Gujarati		●					●			●
ha	Hausa		●								●
he	Hebrew		●							●
hi	Hindi		●	●				●
hr	Croatian		●					●	●	●
hu	Hungarian	● (e)	●			●		●	●	●	●		●
id	Indonesian	● (e)	●					●	●	●
is	Icelandic		●			●		●		●			●
it	Italian	●	●	●		●		●		●	●		●
ja	Japanese		●	●				●		●
jv	Javanese		●								●
ka	Georgian		●			●	●	●
kk	Kazakh		●	●		●		●		●
ko	Korean		●	●				●		●
ky	Kyrgyz						●	●
la	Latin							●		●
lb	Luxembourgish					●
lt	Lithuanian		●					●	●	●			●
lv	Latvian	●	●					●		●
mk	Macedonian		●				●	●
mn	Mongolian	● (e)	●							●
mr	Marathi		●							●
ms	Malay		●					●	●	●
mt	Maltese		●					●		●
ne	Nepali		●			●		●			●
nl	Dutch	● (e)	●	●		●		●		●	●		●
no	Norwegian		●			●		●					●
pl	Polish	●	●	●	●	●	●	●	●	●	●	●	●
pt	Portuguese	● (e)	●	●		●		●	●	●			●
ro	Romanian	● (e)	●			●		●	●	●
ru	Russian	●	●	●		●	●	●			●		●
sk	Slovak		●			●	●	●		●
sl	Slovenian	● (e)	●			●		●		●			●
sq	Albanian		●				●	●		●
sr	Serbian		●			●	●	●
sv	Swedish		●	●		●		●	●	●	●
sw	Swahili	●	●			●		●		●
te	Telugu		●					●			●
th	Thai	● (e)	●					●		●
tl	Tagalog		●	●						●
tn	Tswana		●					●			●
tr	Turkish	● (e)	●	●		●		●	●	●			●
tt	Tatar		●				●	●		●
uk	Ukrainian	●	●	●		●	●	●		●	●		●
uz	Uzbek		●	●				●		●
vi	Vietnamese		●	●		●		●		●
yo	Yoruba	● (e)	●							●	●
zh	Chinese	●	●	●		●		●		●

^{(e) experimental, most likely doesn't work well}

Faster Whisper, Coqui TTS and Mimic3 models are only available on x86-64.

Language models can be downloaded directly from the app.

Details of models which are currently configured for download are described in models.json (GitHub) or models.json (GitLab).

How to install

Linux Desktop: Flatpak
Arch Linux (AUR):
- dsnote
- dsnote-git
Sailfish OS: OpenRepos

Flatpak packages

Starting from v4.4.0, the app distributed via Flatpak (published on Flathub) consists of the following packages:

Base package "Speech Note" (net.mkiol.SpeechNote)
Add-on for AMD graphics card "Speech Note AMD" (net.mkiol.SpeechNote.Addon.amd)
Add-on for NVIDIA graphics card "Speech Note NVIDIA" (net.mkiol.SpeechNote.Addon.nvidia)

Base package includes all the dependencies needed to run every feature of the application. Add-ons add the capability of GPU acceleration, which speeds up some operations in the application.

Base package and add-ons contain many "heavy" libraries like CUDA, ROCm, Torch and Python libraries. Due to this, the size of the packages and the space required after installation are significant. If you don't need all the functionalities, you can use much smaller "Tiny" package (available on Releases page), which provides only the basic features. If you need, you can also use "Tiny" packages together with GPU acceleration add-on.

Comparison between Base, Tiny and Add-ons Flatpak packages:

Sizes	Base	Tiny	AMD add-on	NVIDIA add-on
Download size	0.9 GiB	70 MiB	+2.1 GiB	+3.8 GiB
Unpacked size	2.9 GiB	170 MiB	+11.5 GiB	+6.9 GiB

Features	Base	Tiny	AMD add-on	NVIDIA add-on
Coqui/DeepSpeech STT	+	+
Vosk STT	+	+
Whisper (whisper.cpp) STT	+	+
Whisper (whisper.cpp) STT AMD GPU	-	-	+
Whisper (whisper.cpp) STT NVIDIA GPU	-	-		+
Faster Whisper STT	+	-
Faster Whisper STT NVIDIA GPU	-	-		+
April-ASR STT	+	+
eSpeak TTS	+	+
MBROLA TTS	+	+
Piper TTS	+	+
RHVoice TTS	+	+
Coqui TTS	+	-
Coqui TTS AMD GPU	-	-	+
Coqui TTS NVIDIA GPU	-	-		+
Mimic3 TTS	+	-
WhisperSpeech TTS	+	-
WhisperSpeech TTS AMD GPU	-	-	+
WhisperSpeech TTS NVIDIA GPU	-	-		+
Punctuation restoration	+	-
Translator	+	+

Beta version

In addition to the stable version in the Flathub repository, you can try to test the "Beta" version of the upcoming release. This version is usable, but may contain more bugs.

Beta version is available in "flathub-beta" repository. Follow these instructions to enable flathub-beta on your computer.

Building from sources

Arch Linux

It is also possible to build and install the latest development (git) or latest stable (release) version from the repository using the provided PKGBUILD file (please note that the same remarks about building on Linux apply):

git clone <git repository url>

cd dsnote/arch/git      # build latest git version
# or
cd dsnote/arch/release  # build latest release version

makepkg -si

Flatpak

git clone <git repository url>

cd dsnote/flatpak

flatpak-builder --user --install-deps-from=flathub --repo="/path/to/local/flatpak/repo" "/path/to/output/dir" net.mkiol.SpeechNote.yaml

Sailfish OS

git clone <git repository url>

cd dsnote
mkdir build
cd build

sfdk config --session specfile=../sfos/harbour-dsnote.spec
sfdk config --session target=SailfishOS-4.4.0.58-aarch64
sfdk cmake ../ -DCMAKE_BUILD_TYPE=Release -DWITH_SFOS=ON -DWITH_PY=OFF
sfdk package

Linux (direct build)

Speech Note has many build-time and run-time dependencies. This includes shared and static libraries, 3rd-party executables, Python and Perl scripts. Because of these complexity, the recommended way to build is to use Flatpak tool-chain (Flatpak manifest file and flatpak-builder). If you want to make a direct build (i.e. without flatpak) it is also possible but more complicated.

git clone <git repository url>

cd dsnote
mkdir build
cd build

cmake ../ -DCMAKE_BUILD_TYPE=Release -DWITH_DESKTOP=ON
make

To make build without support for Python components, add -DWITH_PY=OFF in cmake step.

To see other build options search for option(BUILD_XXX) in CMakeList.txt file.

How to enable a custom model

All models available for download are specified in the configuration file (config/models.json). To enable a custom model that is compatible with currently supported engines, simply edit this file and restart the application.

When you first run the application, the models configuration file is created in:

~/.local/share/net.mkiol/dsnote/models.json, or
~/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/models.json (Flatpak), or
~/.local/share/org.mkiol/dsnote/models.json (Sailfish OS)

You can freely edit currently enabled models or add new ones.

Model definition looks like this:

{
    "name": "<model name>",
    "model_id": "<model unique id>",
    "engine": "<engine type>",
    "lang_id": "<lang id>",
    "checksum": "<md5 checksum>",
    "checksum_quick": "<partial md5 checksum>",
    "comp": "<compression type",
    "urls": [
        <model URLs>
    ],
    "size": "<download size of all files>"
}

Allowed engine types: stt_ds, stt_vosk, stt_april, stt_whisper, stt_fasterwhisper, tts_piper, tts_rhvoice, tts_espeak, tts_coqui, tts_mimic3, mnt_bergamot

Allowed compression types: none, gz, xz, tarxz, targz, zip, zipall, dir, dirgz

Allowed URL types: http, https, file

Checksums are calculated for all files after unpacking. If you are adding a new model, you can use the --gen-checksums command line option to find the right checksums. To do this, put empty strings in both checksum and checksum_quick, save the file and run Speech Note with the mentioned option.

For example:

{
    "name": "New Piper Voice",
    "model_id": "en_piper_new",
    "engine": "tts_piper",
    "lang_id": "en",
    "checksum": "",
    "checksum_quick": "",
    "size": ""
    "comp": "dir",
    "urls": [
        "file:///home/me/models/new-model-medium.onnx",
        "file:///home/me/models/new-model-medium.onnx.json"
    ]
}

flatpak run net.mkiol.SpeechNote --verbose --gen-checksums

Contributing to Speech Note

Any contribution is very welcome!

Project is hosted both on GitHub and GitLab. Feel free to make a PR/MR, report an issue or reqest for new feature on the platform you prefer the most.

Translation

Translation files in Qt format are in translations directory.

Preferred way to contribute translation is via Transifex service, but if you would like to make a direct PR/MR, please do it.

How to support

If you find Speech Note useful and would like to support this project, please consider doing one or two of the following:

Give a ⭐ on GitHub or/and GitLab.
Write a review in your applications manager app (Discover, Software or any other).
Tell others about this app by mentioning it on social media.
If you have spare money, make a small donation via Liberapay.

Libraries

Speech Note relies on following open source projects:

Reviews and demos

Screenshots (Speech Note 4.5)
Speech Note 4.5 changes video (Speech Note 4.5)
Marco's Box (Speech Note 4.4, Italian)
Marco's Box video (Speech Note 4.4, Italian)
alternativalinux (Speech Note 4.4, Italian)
alternativalinux video (Speech Note 4.4, Italian)
ZDNET (Speech Note 4.2)
Translator feature video demo on Sailfish OS (Speech Note 4.0)
Translator feature video demo on PinePhone (Speech Note 4.0)
DebugPoint.com (Speech Note 4.0)
DebugPoint.com video (Speech Note 4.0)
OMG! Linux (Speech Note 4.0)
LinuxLinks (Speech Note 4.0)
The Linux Cast video (Speech Note 4.0)
CONNECTwww.com (Speech Note 4.0)

License

Speech Note is an open source project. Source code is released under the Mozilla Public License Version 2.0.

3rd party libraries:

Coqui STT, released under the Mozilla Public License Version 2.0
Coqui TTS, released under the Mozilla Public License Version 2.0
Vosk API, released uder the Apache License 2.0
whisper.cpp, released under the MIT License
WebRTC, released under this license
libarchive, released under the BSD License
RNNoise-nu, released under the BSD 3-Clause License
{fmt}, released uder this license
Hugging Face Transformers, released under the Apache License 2.0
Piper, released under the MIT License
RHVoice, released under the GNU General Public License v2.0
ssplit-cpp, released under the Apache License 2.0
espeak-ng, released under the GNU General Public License v3.0
bergamot-translator, released under the Mozilla Public License 2.0
Rubber Band Library, released under the GNU General Public License (version 2 or later)
simdjson, released under the Apache License 2.0
Nlohmann JSON, released under the MIT License
uroman, released under this license
astrunc, released under the MIT License
FFmpeg, released under the GNU Lesser General Public License version 2.1 or later
LAME, released under the LGPL
Vorbis, released under this license
TagLib, released under the GNU Lesser General Public License (LGPL) and Mozilla Public License (MPL)
libnumbertext, released under the BSD License
KDBusAddons, released under the LGPL licenses
QHotkey, released under the BSD-3-Clause License
faster-whisper, released under the MIT License
Mimic 3, released under the AGPL-3.0 license
Unikud, released under the MIT License
april-asr, released under the GNU General Public License v3.0
libopus, released under this license
html2md, released under the MIT License
maddy, released under the MIT License
WhisperSpeech, released under the MIT License

The files in the directory nonbreaking_prefixes were copied from mosesdecoder project and distributed under the GNU Lesser General Public License v2.1.

dsnote's People

Contributors

Stargazers

Watchers

Forkers

popanz black-sheep-dev lafricain albanobattistella francewhoa pushbike zishan-rahman devsjr flimm devenbl alivededsec sean666888 maxugly yssource chrisshaw atbrakhi hopelessdecoy wabashave

dsnote's Issues

Save to audio file seemingly not working for large texts

I must first say that this project is amazing, really a game changer for me since I don't need to fiddle with conda environments in terminals to get different models working.

I am right now trying to transcribe a book with about 700 pages, since there is no audio book version, and especially the Piper Joe Medium model sounded amazing.

But it just doesn't save. It does though if I cut it in smaller chunks. I tried wav and Opus, thinking compression might have broke it, but nothing seems to make it save. It outputs a initialization error. "Error: text to speech initialization engine has failed"

Also, it refuses to initialize TTS again afterwards, and the app needs a restart.

I am on a fedora linux 38 system. I'm using the latest version of Speech Note.

Here are outputs from terminal upon trying to save the Wav file:

Same colorful text all over til the very end.

Interestingly Vorbis had the same pattern, but something different at the very end:

TTS RHVoice and Coqui (and maybe others) fails to run speech engine for texts with new lines.

Flatpak 4.1.0
For now i tested only following engines:

Coqui (MMS, Mai VITS)
Espeak (MBROLA, Robot)
Piper
RHVoice

Espeak and Piper works for every text so far. Coqui and RHVoice can't read text if there's at least one new line.

Cause is probably that for newline it's creating empty task.

[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task: SENTENCE_BEFORE_NEW_LINE
[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task: 
[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task: SENTENCE_AFTER_NEW_LINE

[E] 20:14:59.438 0x7fc8d09ff600 operator():260 - py error: ValueError: You need to define either `text` (for sythesis) or a `reference_wav` (for voice conversion) to use the Coqui TTS API.

Spellcheck

Speech note is an excellent software that can solve a lot of my tasks. A small improvement proposal on my part would be the implementation of a spell check (e.g., Hunspell, Aspell) in the notepad. This would be very useful, for example, if you want to have text translated and make sure that there are no unnecessary errors before translation due to small typos. Probably the best solution would be an integration of grammar checks via LanguageTool (remote API or local server).

GPU not working

Selecting GPU to transcribe an audio file is causing a crash

QIBusPlatformInputContext: invalid portal bus.
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme  ""
whisper_init_from_file_no_state: loading model from '/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_small.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 9
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 3
whisper_model_load: mem required  =  459.00 MB (+   16.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  180.95 MB
ggml_opencl: selecting platform: 'Clover'
ggml_opencl: selecting device: 'AMD Radeon RX 6800M (navi22, LLVM 15.0.7, DRM 3.54, 6.5.5-1-linux)'
ggml_opencl: device FP16 support: false
ggml_opencl: kernel compile error:

fatal error: cannot open file '/usr/lib/x86_64-linux-gnu/GL/default/share/clc/gfx1031-amdgcn-mesa-mesa3d.bc': No such file or directory

'--action' and dbus issue

Thanks for this app!

I found the following issues while exploring the automation tools provided via the beta flatpak.

First, invoking any of the reading actions (start-reading, start-reading-clipboard, or pause-resume-reading) through --action cmdline option will not work, the program just prints:

Invalid action. Use one option from the following: start-listening, start-listening-active-window, start-listening-clipboard, stop-listening, start-reading, start-reading-clipboard, pause-resume-reading, cancel.

Second, I didn't have any problem using the dbus org.freedesktop.Application interface, calling ActivateAction works perfectly fine. But I could not find what is defined in dbus/org.mkiol.Speech.xml on the dbus session, it seems that powerful interface isn't exposed at all. Is this normal?

Add support of FasterWhisper

Hello,

I really appreciate your project! I think it's going in a very nice and useful direction!

I note that you support the Coqui STT, Vosk and whisper.cpp engines.
Would it be possible to add guillaumekln's fasterwhisper STT engine? (Here)

FasterWhisper has the advantage of being incredibly faster than whisper.cpp, while consuming relatively little extra RAM (the differences are shown in a table on its github).
So I think it would be a great idea! The models have, if I've understood correctly, been modified but are available on HuggingFace (again, everything is very well indicated on its github).

Thanks in advance! Good luck with the rest of the project ;)

Breizhux

why flatpak app is so big?

does it have whisper and other apps included?

is it not better to move those to download mode just like language data?

Speech Note crashes on start

I'm on OpenSUSE Tumbleweed and I'm using the Flatpak version of Speech Note.

$ flatpak run net.mkiol.SpeechNote 
Qt: Session management error: Could not open network socket
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
free(): invalid size

Any way to fix pronunciation of words?

I find this is a brilliant app for Linux. But, is there some way to fix how certain names or words are pronounced?

Adding LLM(Large Language Model) support using Ollama

Also, we could add Large language models to the application. Starting with smaller models. And adding bigger ones over time.
This could be really helpful, as LLM's that are open source that replace ChatGPT, are getting more powerful every day.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

What are your thoughts on this?
Is this feasible?

Crash (Illegal instruction) with DeepSpeech model

Original issue #8

backtrace:

Thread 1 "dsnote" received signal SIGILL, Illegal instruction.
0x00007fffd02795a7 in ?? () from /app/lib/libkenlm.so

cpu flags:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti dtherm

service isn't restarted after switching storage directory

In Speech Notes' settings, when changing the directory where the Deep Speech models are stored, the harbour-dsnote.service isn't restarted and keep looking at the old (wrong) path.

Context:

I've been using Speech Notes on my Xperia XA2 (32GB edition, not that much storage available for /home)
I'm storing the models on the external SD card
I've recently switch to an Xperia 10iii, and installed Speech Notes on it too
I've moved the SD card to the new device
I've changed the Location of language files to point to the path on the SD Card.
Speech Notes now sees the already installed language models.
Speech Notes can even download new models.
BUT the settings doesn't allow me to select a model, only to download new
and the main pannel complains that no model has been configured.

Current work-around:

as the user (E.g.: nemo or defaultUser)

systemctl --user restart harbour-dsnote.service

Request:

Could it be possible for the app to trigger the service restart?
Or could you change the API so clients such as Speech Notes could send a "please restart" command?

[TTS] Add Mimic 3 models

I would like to see the Mimic 3 models in this app

A link to the GitHub is HERE.

It does a better job than Piper in my opinion and sounds more real.

P.S Awesome Project, keep up the good work.

Random jumps in language downloading menu.

Flatpak 4.1.0

After clicking download, scroll is jumping to top of list (most often if while clicked it was scrolled down), then when i will click download to some model at top, scroll may jump down.

Nothing is showing on output when runned with dsnote --verbose command, except following, but these occur only when opening languages menu.

[W] 20:33:49.225 0x7fe9ee845d80 () -   OpenType support missing for "Unifont", script 12
[W] 20:33:49.312 0x7fe9d1066600 () -   OpenType support missing for "Unifont", script 12
[W] 20:33:49.371 0x7fe9ee845d80 () -   OpenType support missing for "Biwidth", script 11
[W] 20:33:49.380 0x7fe9ee845d80 () -   OpenType support missing for "Fixed", script 11
[W] 20:33:49.398 0x7fe9d1066600 () -   OpenType support missing for "Biwidth", script 11
[W] 20:33:49.407 0x7fe9d1066600 () -   OpenType support missing for "Fixed", script 11

you should add edge models

https://github.com/rany2/edge-tts

Not sanboxed package format (AUR, deb, rpm)

Flatpak is a great package format but has few limitations. The major ones are as follows:

UI theme is not synced with the OS
- especially this affects Dark theme under GNOME
- even on KDE Plasma, app does not use native theme
GPU computation acceleration does not work out-of-the-box
- Flatpak runtime lacks CUDA and ROCm runtimes (all dependencies have to be shipped with the package)
- ROCm requires extra elevated permission --device=all to start working
Package size is huge

Not-sandboxed package formats for consideration:

distribution via AUR (probably the easiest option)
deb (Debian and all derivatives)
rpm (Fedora, OpenSUSE)

Missing support for certain file formats

In the latest version, it is not possible to select video files to transcribe the audio. Additionally, these formats are not supported:

ogg opus (gets stuck at 100%, no transcribed text in output) example file
mp3

Transcribe from URL

Whishper allows transcribing from any URL supported by yt-dlp, it would be very nice to have this feature available for this desktop app.

Pause function :)

Hello,

thank you for this amazing program! It would be nice, if you can add a pause button for TTS.
Have a nice day.

[idea] Translate option for non-english whisper models

As whisper is now supported (great stuff, thank You) it would be really cool if one could tick a box maybe and use the ability of whisper to translate to english, would be really handy when going abroad to be able just record people speaking local language and get instant translation

Add open dyslexic font

Hi,
It might be very useful to add open dyslexic fonts for some people who need it.
Also, import PDF files for transformation into audio files.
Thanks.
A.

DeepSpeech inference with libstt 1.4 on ARM is much slower comparing to libstt 1.1

Measurements for similar voice sample:

platform	libstt 1.1	libstt 1.4
x86_64 (AMD Ryzen 7 3700X 8-Core)	2490 ms	2550 ms
aarch64 (Xperia 10 III)	3800 ms	4600 ms
arm32 (Xperia 10)	10700 ms	21400 ms

Text not visible in dark mode

https://gitlab.com/mkiol/dsnote/-/issues/1

Deteting eSpeak model deletes all other models as well

https://gitlab.com/mkiol/dsnote/-/issues/2

How to get GPU acceleration working? (Debian 12.2, Gnome, Wayland, X11, Nvidia P1000 GPU, Zbook Studio G5)

Hello,

I got the "Speech Note" Flatpak working on my Debian 12 system (Zbook Studio G5). I can use Whisper in offline mode here. After downloading Whisper (large and/or medium), the speech recognition is quite good, but very slow (50 sec.). GPU acceleration would help, so I installed the Nivida drivers for my P1000. They work just fine with games, eg., but not with "Speech Note" and Whisper. Any ideas how to fix this? How do I get my Nvidia card to accelerate the speech recognition of whisper on Debian 12? Maybe this is a bug?

My Nvidia Driver Version: 525.125.06

I have already libcudart11.0 and nvidia-cuda-toolkit installed.

I tried both Wayland and X11.

My Card, the P1000, seems to supports CUDA 6.1 - this should be enough?

Terminal output, when starting Speech Note:

flatpak run net.mkiol.SpeechNote 
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme  ""
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)



Some screens: 
![Bildschirmfoto vom 2023-10-16 19-43-06](https://github.com/mkiol/dsnote/assets/148144728/9fd7c5af-15b6-405c-bb53-d69e603fda99)
![Bildschirmfoto vom 2023-10-16 19-42-36](https://github.com/mkiol/dsnote/assets/148144728/a79775e9-9f99-43bf-b80c-36a9ed15a3a4)

Support GNOME Wayland for dsnote

Summary

The challenge is that as of August 23, 2023, dsnote does not support GNOME Wayland. This is a challenge because, by default, most of the recent versions of Linux distributions (distros) now use GNOME Wayland by default. Not GNOME X11. Linux's distributions, such as, but not limited to, Debian, Fedora, Manjaro, Red Hat Enterprise Linux, Ubuntu, etc.

The suggested resolution is to configure your Flatpak package appropriately so that it supports Wayland. The end result is that both GNOME Wayland and X11 are supported. If you're interested in this, this documentation about Flatpak sandbox might be useful. If somehow this documentation is not available, this archived page might be of interest. Alternatively, Flatpak support for maintainers is available here.

Below is the same as above. But with details if you're interested in those.

Steps to reproduce

Using Linux Debian 10 (Buster) 64 bits, using GNOME 3.30.2, using GNOME with its option Wayland, using the steps from https://flathub.org install the Flatpak for dsnote 4.1.0
dsnote will fail to start, and will not open. In other words, it is not usable. This is the challenge. When starting dsnote from Terminal/Console, this error message is display:

QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.wayland: Creating a fake screen in order for Qt not to crash
qt.qpa.qgnomeplatform: Could not find color scheme  ""
Complété

The needed end result is that dsnote is able to start with GNOME Wayland.
Close the present GNOME Wayland session
Using the GNOME user log-in front page, click on the cogwheel button on the right side of the log-in field. Using this button drop down menu, temporarily change the GNOME option from Wayland to the unsecured X11.
Using GNOME X11, start dsnote
It will open successfully. I don't know why dsnote does open in X11 but not in Wayland. My guess is that, somehow, dsnote does not yet support GNOME Wayland.
Log out GNOME. If appropriate, switch back to Wayland.

Flatpak page

https://flathub.org/fr/apps/net.mkiol.SpeechNote

Contribute

If needed, both me and the Ubertus.org team would be happy to contribute beta testing and documentation for this improvement or new feature. Any volunteer for a patch?

CUDA not recognized

I'm not sure why, but does seem to be related to flatpak

On system:
NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2
NVIDIA GeForce 940M (2GB VRAM) (Should be enough to run the small whisper)

On flatpak:

nvidia-535-104-05 org.freedesktop.Platform.GL.nvidia-535-104-05 1.4 user
nvidia-535-113-01 org.freedesktop.Platform.GL.nvidia-535-113-01 1.4 user
nvidia-535-98 org.freedesktop.Platform.GL.nvidia-535-98 1.4 user
nvidia-535-104-05 org.freedesktop.Platform.GL32.nvidia-535-104-05 1.4 user
nvidia-535-113-01 org.freedesktop.Platform.GL32.nvidia-535-113-01 1.4 user
nvidia-535-98 org.freedesktop.Platform.GL32.nvidia-535-98 1.4 user

Logs

[D] 14:13:45.593 0x7f5d825ff600 process_buff:226 - vad: no speech
[D] 14:13:45.593 0x7f5d825ff600 set_processing_state:430 - processing state: idle => decoding
[D] 14:13:45.593 0x7f5d825ff600 set_speech_detection_status:508 - speech detection status: speech-detected => decoding (no-speech)
[D] 14:13:45.593 0x7f5d825ff600 () - service refresh status, new state: listening-single-sentence
[D] 14:13:45.593 0x7f5d825ff600 () - task state changed: 1 => 2
[D] 14:13:45.593 0x7f5d825ff600 process_buff:284 - speech frame: samples=51360
[D] 14:13:45.593 0x7f5d825ff600 decode_speech:350 - speech decoding started
[D] 14:13:45.597 0x7f5de77bbd80 () - app task state: speech-detected => processing
CUDA error 209 at /run/build/whispercpp-cublas/ggml-cuda.cu:6102: no kernel image is available for execution on the device
[W] 14:13:46.168 0x7f5d825ff600 () - QObject::killTimer: Timers cannot be stopped from another thread
[W] 14:13:46.169 0x7f5d825ff600 () - QObject::~QObject: Timers cannot be stopped from another thread
[D] 14:13:46.178 0x7f5d825ff600 () - speech service dtor
[W] 14:13:46.179 0x7f5d825ff600 () - QtDBus: cannot relay signals from parent speech_service(0x5647aeab6ea0 "") unless they are emitted in the object's thread QThread(0x5647af143ed0 ""). Current thread is QThread(0x7f5d5c0016e0 "").
[D] 14:13:46.179 0x7f5d825ff600 () - mic source dtor
[W] 14:13:46.179 0x7f5d825ff600 () - QObject::killTimer: Timers cannot be stopped from another thread

Should OpenCL work on an Ice Lake v11 intel processor?

On stable and beta versions, it is saying that a suitable GPU isn't available. I've installed OpenCL packages on Fedora 38 and the equivalent flatpak OpenCL packages but it still says not available.

I get that it might be the case that it wouldn't be useful given it isn't a powerful discrete gpu, but wondered if it might be a bug causing it to report as unavailable.

Drag and drop support

Is drag and drop support for .mp3 files a possibility? Having to choose File:Transcribe a file:selecting a directory and changing from audio to all files for .mp3 to show up is tedious. A bonus would be for the name of the audio file to auto populate the text save dialog box. Maybe it could be fixed with Flatseal but I am not sure how.

Side note:
Using the whisper model gives great results. I can confirm that enabling GPU support in the settings does work as I see the GPU memory and usage spike when the transcription is occurring using Mint 21.2 and an Nvidia RTX 3050.
Would love to make a monetary contribution but I am unable to find a link unless I overlooked it.

Flathub description has wrong verb tense and no article before network connection.

Speech Note enables you to take and read notes with your voice with multiple languages. It uses Speech to Text and Text to Speech conversions to do so. All voice processing is entirely done off-line, locally on your computer without the use of network connection. Your privacy is always respected. No data is send to the Internet.

It should be

a network connection
sent

Support for aprilasr

Hello,

I really appreciate your project! I think it's going in a very nice and useful direction!

LiveCaptions uses aprilasr, which is very fast and only needs the CPU.

I think it would be great if you could add aprilasr as one of the speech recognition options in your project.

It would add a lot of value to your project by offering a fast and lightweight option for user's who don't have access to GPUs or want to conserve battery life on mobile devices.

Thanks in advance! Good luck with the rest of the project ;)

Command line arguments for automation

I would love it if there was a way to create audio files via commandline for a bit more automation

Use the same font size in speech panel as in text editor

https://gitlab.com/mkiol/dsnote/-/issues/18

Add option for bigger whisper models

Looking through whisper.cpp it needs 3x less memory than the original, which would make it possible to run even large model on xperia 10 III (3.3GB vs 10GB), this would probably be overkill, especially speed would suffer a lot, but adding small and medium would probably make sense

Add reading speed and esport audio in other format.

HI,
thanks for this awesome APP.
Very useful for students and teachers and for students with some problems.
I would suggest entering various reading speeds when text is read.
Also, the ability to export audio to other formats like MP3,ogg,etc.
Thank you.
V/R,
A.

Transcribe a file does not work with mounted Google Drive on Gnome

How it looks when it hangs.

To reproduce:

I launch Speech Note
I go to Files
I go to my mounted drive
- (This drive is integrated as a Gnome Settings/Online Account)
I select a file
It hangs forever

If I first move the file to Downloads and then select it, it will start transcribing.

Context

Device

Startup logs

Sorry for how long this is, I don't really know what's useful here...

[chrisshaw@chris-fedora ~]$ flatpak run net.mkiol.SpeechNote --verbose
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme  ""
[I] 13:28:20.174 0x7f658be10d80 init:49 - logging to stderr enabled
[D] 13:28:20.174 0x7f658be10d80 () - translation: "en_US"
[W] 13:28:20.174 0x7f658be10d80 () - failed to install translation
[D] 13:28:20.174 0x7f658be10d80 () - starting standalone app
[D] 13:28:20.175 0x7f658be10d80 () - app: net.mkiol dsnote
[D] 13:28:20.175 0x7f658be10d80 () - config location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/config"
[D] 13:28:20.175 0x7f658be10d80 () - data location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote"
[D] 13:28:20.175 0x7f658be10d80 () - cache location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote"
[D] 13:28:20.175 0x7f658be10d80 () - settings file: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/config/net.mkiol/dsnote/settings.conf"
[D] 13:28:20.176 0x7f658be10d80 () - available styles: ("Default", "Fusion", "Imagine", "Material", "org.kde.breeze", "org.kde.desktop", "Plasma", "Universal")
[D] 13:28:20.176 0x7f658be10d80 () - style paths: ("/usr/lib/qml/QtQuick/Controls.2")
[D] 13:28:20.176 0x7f658be10d80 () - switching to style: "org.kde.desktop"
[D] 13:28:20.343 0x7f658be10d80 () - supported audio input devices:
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
[D] 13:28:20.359 0x7f658be10d80 () - "pulse"
[D] 13:28:20.427 0x7f658be10d80 () - "upmix"
[D] 13:28:20.588 0x7f658be10d80 () - "default"
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_input.usb-046d_HD_Pro_Webcam_C920_2AE889FF-02.analog-stereo"
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_input.pci-0000_00_1f.3.analog-stereo"
[D] 13:28:20.598 0x7f658be10d80 add_cuda_devices:226 - scanning for cuda devices
[D] 13:28:20.601 0x7f658be10d80 add_cuda_devices:235 - cuda version: driver=0, runtime=12020
[D] 13:28:20.601 0x7f658be10d80 add_cuda_devices:240 - cudaGetDeviceCount returned: 35
[D] 13:28:20.601 0x7f658be10d80 add_hip_devices:263 - scanning for hip devices
[D] 13:28:20.601 0x7f658be10d80 hip_api:170 - failed to open hip lib: libamdhip64.so: cannot open shared object file: No such file or directory
[D] 13:28:20.601 0x7f658be10d80 add_opencl_devices:300 - scanning for opencl devices
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:317 - opencl number of platforms: 2
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:342 - opencl platform: 0, name=Clover, vendor=Mesa
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:356 - opencl number of devices: 0
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:342 - opencl platform: 1, name=AMD Accelerated Parallel Processing, vendor=Advanced Micro Devices, Inc.
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:356 - opencl number of devices: 0
[D] 13:28:20.815 0x7f6563fff600 loop:58 - py executor loop started
[D] 13:28:20.851 0x7f658be10d80 () - starting service: app-standalone
[D] 13:28:20.858 0x7f65621fe600 () - config version: 34 34
[D] 13:28:20.860 0x7f65621fe600 () - checksum ok: "6571cb18" "en_whisper_base.ggml"
[D] 13:28:20.860 0x7f65621fe600 () - found model: "en_whisper_base"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "am_espeak_am"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "ar_espeak_ar"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "bg_espeak_bg"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "bs_espeak_bs"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "ca_espeak_ca"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "cs_espeak_cs"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "da_espeak_da"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "de_espeak_de"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "el_espeak_el"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "en_espeak_en"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "eo_espeak_eo"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "es_espeak_es"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "et_espeak_et"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "eu_espeak_eu"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "is_espeak_is"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fa_espeak_fa"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fi_espeak_fi"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fr_espeak_fr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hi_espeak_hi"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hr_espeak_hr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hu_espeak_hu"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "id_espeak_id"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "it_espeak_it"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ja_espeak_ja"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "kk_espeak_kk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ko_espeak_ko"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "lv_espeak_lv"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "lt_espeak_lt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "mk_espeak_mk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ms_espeak_ms"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ne_espeak_ne"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "nl_espeak_nl"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "no_espeak_no"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "pt_espeak_pt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "pt_espeak_pt_br"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ro_espeak_ro"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ru_espeak_ru"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sk_espeak_sk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sl_espeak_sl"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sr_espeak_sr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sv_espeak_sv"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sw_espeak_sw"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "th_espeak_th"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "tr_espeak_tr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "uk_espeak_uk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ka_espeak_ka"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ky_espeak_ky"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "la_espeak_la"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "tt_espeak_tt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sq_espeak_sq"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "uz_espeak_uz"
[D] 13:28:20.864 0x7f658be10d80 () - module already unpacked: "rhvoicedata"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "vi_espeak_vi"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_yue"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_hak"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_cmn"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "ga_espeak_ga"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "mt_espeak_mt"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "bn_espeak_bn"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "pl_espeak_pl"
[D] 13:28:20.865 0x7f658be10d80 () - module already unpacked: "rhvoiceconfig"
[D] 13:28:20.868 0x7f65621fe600 () - models changed
[D] 13:28:20.876 0x7f658be10d80 () - module already unpacked: "espeakdata"
[D] 13:28:20.877 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:20.877 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:20.877 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:20.877 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:28:20.877 0x7f658be10d80 () - service state changed: unknown => idle
[D] 13:28:21.115 0x7f658be10d80 () - starting app: app-standalone
[D] 13:28:21.115 0x7f658be10d80 () - app service state: unknown => idle
[D] 13:28:21.115 0x7f658be10d80 () - app stt available models: 0 => 1
[D] 13:28:21.115 0x7f658be10d80 () - update listen
[D] 13:28:21.115 0x7f658be10d80 () - app active stt model: "" => "en_whisper_base"
[D] 13:28:21.115 0x7f658be10d80 () - update listen
[W] 13:28:21.116 0x7f658be10d80 () - no available mnt langs
[W] 13:28:21.116 0x7f658be10d80 () - no available mnt out langs
[W] 13:28:21.116 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:21.116 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:28:21.116 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:21.116 0x7f658be10d80 () - app stt configured: false => true
logger error: invalid format string
qrc:/qml/main.qml:165:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/main.qml:156:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/Notepad.qml:24:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/Translator.qml:29:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
[D] 13:28:21.309 0x7f658be10d80 onCompleted:85 - default font pixel size: 14
[D] 13:28:21.328 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:21.328 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:21.328 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:21.328 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:28:21.328 0x7f658be10d80 () - service refresh status, new state: idle
[W] 13:28:21.380 0x7f658be10d80 ():164 - qrc:/qml/Translator.qml:164:9: QML ColumnLayout (parent or ancestor of QQuickLayoutAttached): Binding loop detected for property "preferredWidth"
[D] 13:28:21.524 0x7f658be10d80 () - stt models changed
[D] 13:28:21.525 0x7f658be10d80 () - update listen
[D] 13:28:21.525 0x7f658be10d80 () - tts models changed
[D] 13:28:21.525 0x7f658be10d80 () - update listen
[W] 13:28:21.525 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:21.525 0x7f658be10d80 () - no available tts models for out mnt
[D] 13:28:21.525 0x7f658be10d80 () - ttt models changed
[D] 13:28:21.526 0x7f658be10d80 () - mnt langs changed
[D] 13:28:21.526 0x7f658be10d80 () - update listen
[W] 13:28:21.526 0x7f658be10d80 () - no available mnt langs
[W] 13:28:21.526 0x7f658be10d80 () - no available mnt out langs
[D] 13:28:35.806 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:35.807 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:35.807 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:35.807 0x7f658be10d80 () - choosing model for id: "en_whisper_base" "en"
[D] 13:28:35.807 0x7f658be10d80 () - restart stt engine config: "lang=en, model-files=[model-file=/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_base.ggml, scorer-file=, ttt-model-file=], speech-mode=automatic, vad-mode=aggressiveness-3, speech-started=0, use-gpu=0, gpu-device=[id=-1, api=opencl, name=, platform-name=]"
[D] 13:28:35.807 0x7f658be10d80 () - new stt engine required
[D] 13:28:35.808 0x7f658be10d80 open_whisper_lib:109 - using whisper-openblas
[D] 13:28:37.109 0x7f658be10d80 make_wparams:340 - cpu info: arch=x86_64, cores=4
[D] 13:28:37.110 0x7f658be10d80 make_wparams:342 - using threads: 4/4
[D] 13:28:37.110 0x7f658be10d80 make_wparams:344 - system info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 
[D] 13:28:37.110 0x7f658be10d80 start:199 - starting engine
[D] 13:28:37.110 0x7f658be10d80 start:207 - engine started
[D] 13:28:37.110 0x7f658be10d80 () - creating audio source
[D] 13:28:37.110 0x7f658be10d80 () - mic source created
[D] 13:28:37.110 0x7f64fbc15600 start_processing:244 - processing started
[D] 13:28:37.110 0x7f64fbc15600 set_processing_state:430 - processing state: idle => initializing
[D] 13:28:37.110 0x7f64fbc15600 set_processing_state:437 - speech detection status: no-speech => initializing (no-speech)
[D] 13:28:37.110 0x7f64fbc15600 () - service refresh status, new state: idle
[D] 13:28:37.110 0x7f64fbc15600 () - task state changed: 0 => 3
[D] 13:28:37.110 0x7f64fbc15600 create_whisper_model:175 - creating whisper model
whisper_init_from_file_no_state: loading model from '/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_base.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 9
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   56.51 MB
[D] 13:28:37.340 0x7f658be10d80 () - using audio input: "alsa_input.usb-046d_HD_Pro_Webcam_C920_2AE889FF-02.analog-stereo"
whisper_model_load: model size    =   56.38 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.10 MB
whisper_init_state: compute buffer (encode) =   81.85 MB
whisper_init_state: compute buffer (cross)  =    4.40 MB
whisper_init_state: compute buffer (decode) =   24.61 MB
[D] 13:28:37.440 0x7f64fbc15600 create_whisper_model:185 - whisper model created
[D] 13:28:37.440 0x7f64fbc15600 set_processing_state:430 - processing state: initializing => idle
[D] 13:28:37.440 0x7f64fbc15600 set_processing_state:437 - speech detection status: initializing => no-speech (no-speech)
[D] 13:28:37.440 0x7f64fbc15600 () - service refresh status, new state: idle
[D] 13:28:37.440 0x7f64fbc15600 () - task state changed: 3 => 0
[D] 13:28:37.657 0x7f658be10d80 () - audio state: IdleState
[D] 13:28:37.658 0x7f658be10d80 () - service refresh status, new state: listening-auto
[D] 13:28:37.658 0x7f658be10d80 () - service state changed: idle => listening-auto
[W] 13:28:37.660 0x7f658be10d80 () - ignore TaskStatePropertyChanged signal
[W] 13:28:37.660 0x7f658be10d80 () - ignore TaskStatePropertyChanged signal
[D] 13:28:37.660 0x7f658be10d80 () - app current task: -1 => 0
[W] 13:28:37.660 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:37.660 0x7f658be10d80 () - app service state: idle => listening-auto
[W] 13:28:37.664 0x7f658be10d80 () - no available mnt langs
[W] 13:28:37.664 0x7f658be10d80 () - no available mnt out langs
[W] 13:28:37.664 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:37.664 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:28:37.664 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:37.847 0x7f658be10d80 () - audio state: ActiveState
[D] 13:28:39.178 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=true, eof=false
[D] 13:28:39.210 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:40.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:40.795 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:42.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:42.194 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:43.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:43.597 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:45.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:45.201 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:46.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false

** (dsnote:2): WARNING **: 13:28:46.596: atk-bridge: get_device_events_reply: unknown signature
[D] 13:28:46.600 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:48.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:48.202 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:49.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:49.800 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:51.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:51.200 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:52.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:52.797 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:54.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:54.175 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:55.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:55.593 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:57.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:57.184 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:58.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:58.774 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:00.164 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:29:00.181 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:01.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:29:01.798 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:02.215 0x7f658be10d80 () - cancel
[D] 13:29:02.215 0x7f658be10d80 () - stop stt engine
[D] 13:29:02.215 0x7f658be10d80 stop:225 - stop requested
[D] 13:29:02.215 0x7f658be10d80 stop_processing_impl:166 - whisper cancel
[D] 13:29:02.215 0x7f64fbc15600 flush:446 - flush: exit
[D] 13:29:02.215 0x7f64fbc15600 reset_in_processing:356 - reset in processing
[D] 13:29:02.215 0x7f64fbc15600 start_processing:279 - processing ended
[D] 13:29:02.215 0x7f658be10d80 stop:240 - stop completed
[D] 13:29:02.215 0x7f658be10d80 () - mic source dtor
[D] 13:29:02.215 0x7f658be10d80 () - audio state: SuspendedState
[D] 13:29:02.215 0x7f658be10d80 () - audio ended
[D] 13:29:02.217 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:29:02.217 0x7f658be10d80 () - service state changed: listening-auto => idle
[D] 13:29:02.217 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:29:02.217 0x7f658be10d80 () - app current task: 0 => -1
[W] 13:29:02.217 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:29:02.217 0x7f658be10d80 () - app service state: listening-auto => idle
[W] 13:29:02.221 0x7f658be10d80 () - no available mnt langs
[W] 13:29:02.221 0x7f658be10d80 () - no available mnt out langs
[W] 13:29:02.221 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:29:02.221 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:29:02.221 0x7f658be10d80 () - invalid task, reseting task state

Text cleaning before TTS

https://fe.disroot.org/notice/AaDjgOLRNpEFqFHciO

SpeechNote crashes

I have installed and run SpeechNote from flatpak. It starts up fine, but as soon as I press Listen, it loads the speech model and crashes.

$ flatpak run net.mkiol.SpeechNote
Gtk-Message: 13:44:53.142: Failed to load module "xapp-gtk3-module"
Qt: Session management error: Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed

I select the Speech to text model and press Listen

whisper_init_from_file_no_state: loading model from '/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/multilang_whisper_base.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  218,00 MB (+    6,00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  140,60 MB

And SpeechNote crashes.
The same situation occurs for each selected speech model.

(Linux Mint 21.1, Xfce 4.16)

Language auto detection in Translator

https://fe.disroot.org/objects/276dc9db-ac27-4629-b9dd-a7004dc07bde

French translation (of the gui)

Some parts of Speechnote are note well translated.
I would like to help. I forked the git. Can I just use git, or do you have an other tool for translation?

Rename dsnote repository to SpeechNote

I notice the only way i can find your git repo is via the FlatHub package.
Would you be willing to change the name to "SpeechNote" as its better SEO as your tool will likely come up better on search and more people will find it i think.

Let me know what you think.

Not an issue. I am just very impressed with the quality of certain voices.

I am impressed with the Piper Ryan high and Piper Lessac high version is very human sounding. And I am impressed that this little program makes all the voices easily accessible.

Unable to add language second language model on Mint

If you have a language and language model in place, and your only interest is in changing the language model, it is confusing to have to select a language before seeing alternative language models. Some explanation would be nice.

Rewind forward/backward for text reading

https://fe.disroot.org/notice/AaDjgOLRNpEFqFHciO

"Error: couldn't download the model file"

Hello. First of all, thank you for your work. It looks fantastic. At least, until now, I couldn't try it, since the following problem appears (I took a screenshot so you can see it: "01. Text to speech Spanish - Error") By the way, I had no problem downloading in English.

Sorry if this request is not well made, it's my first time using Github.

Thanks for your job.

DS note inside any text box?

Are there any upcoming plans to introduce a feature that enables DS Note to seamlessly insert dictated text into a any selected text box/where the cursor is -- , similar to the functionality found in Windows where you can simply press Windows + H?

As someone with SEVERELY limited dexterity and mobility due to a disability, this function is crucial for me to do my normal working day-to-day and personally, a big barrier for me in making a full-time switch from Windows to Linux -- especially when I need to work. Unfortunately, I lack the programming skills or the capacity to grasp anything more complex than a basic "Hello, World!" program. I'm curious to know if such a feature is feasible within the DS Note program.

But for what's it worth right now, just having something on flathub where I can have something similar at least, is a gamechanger.

Donate button

Hello,

the new version of your program is really nice. Is it possible to spend you money for your hard work?
I look into the github page but i dont see any button.
:)

Hotkeys would be nice :)

hello,

enter for reading
p for pause
and more if you like.
Maybe you can add some things in the settings to config all hotkeys.
:)

This is really excellent software - I am genuinely impressed - but it needs a little bit of polishing.

It needs better instructions and information.
The OLD E-Speak Robot voice uses almost no processing power and it's very fast for text to speech conversions

OK the most basic voice uses almost no processing power, and the very best voices, use loads of processing power and unless you have a very powerful computer, it's lagging and buffering a lot.

The text to speech - save audio file to MP3 - that is great, but the need for processing power - if it does not exist - then it can take 12 to 18 hours to do a high rate conversion of a 400 page document to MP3.

So we need to see the data rate each voice operates at, kind of like internet speeds of dial up, ADSL etc..
And we need to have a selection of MP3 conversion rates, while super high fidelity is excellent 44 Khz is just fine for small file sizes and fairly good audio quality, where are super large file sizes and 256Khz - some people might like that - but for most text to speech work on written files instead of enormous amounts of reading... where are automated scripts for movie production.. We need a choice.
The audio player needs a speed and pitch control, along with pause and stop - as cancel - yeah I get it that it's a new to market product - while generally excellent - cancel kind of crunches the point of it.

So the voice names need a scale beside them - I figure that the small, medium and large designations MIGHT be linked to a data rate, but they might be linked to a download file size...

For most of my work I have to read LARGE documents, like 400 pages etc.. and it's better to read them out, and save them as an MP3, so I can listen to them when driving long distances or when resting etc..

I don't need stereo phonic high fidelity... just low resolution audio... that is fine...

I also lack computers that are much beyond office work and playing a few videos.. So the down scale options are needed.. "Oh voice X uses 200 times the resources of E-Speak Robot.. Hmmmm brilliant, but I will be happy with 25 times the processing power of E-speak robot...

I am REALLY impressed with what you all have done so far... It's incredible... I mean this is really good.

no main menu .desktop

I just installed the software through flat hub and it does not produce a main menu icon in my start menu, I am using Zorin 16 os. I can get it running through the command but there is no entry in my start menu.

mkiol / dsnote Goto Github PK

dsnote's Introduction

Speech Note

Contents of this README

Description

Languages and Models

How to install

Flatpak packages

Beta version

Building from sources

Arch Linux

Flatpak

Sailfish OS

Linux (direct build)

How to enable a custom model

Contributing to Speech Note

Translation

How to support

Libraries

Reviews and demos

License

dsnote's People

Contributors

Stargazers

Watchers

Forkers

dsnote's Issues

Summary

Steps to reproduce

Flatpak page

Contribute

It should be

To reproduce:

Context

Recommend Projects

Recommend Topics

Recommend Org