bugbakery / transcribee

open source audio and video transcription software

Home Page: https://transcribee.net

License: GNU Affero General Public License v3.0

Languages: HTML 0.12%, JavaScript 1.26%, TypeScript 41.01%, CSS 0.07%, Python 39.54%, Nix 1.50%, Procfile 0.05%, Shell 0.50%, Dockerfile 0.08%, Mako 0.10%, Jupyter Notebook 15.76%

Topics: transcription, collaborative, speech-to-text

transcribee's Introduction

transcribee logo

🎤 transcribee ✍️

[going to be] an open source audio and video transcription software

Note:

Currently, transcribee is heavily work-in-progress and not yet ready for production use. Please come back in a few weeks / months.

transcribee 🐝 aims to make the workflow for media transcription easier, faster and more accessible.

  • It can automatically generate a draft transcript of your audio
  • It allows you to quickly improve the automatic draft and fix any errors
  • It's collaborative – split the work with your friends or colleagues
  • It's open-source

Develop!

To get started with developing or to try the current state of transcribee, follow the instructions in the development setup document.

How does it work?

Note:

We're working heavily on transcribee; not all steps described here are implemented yet.

Creating a transcript with transcribee 🐝 is done with the following steps:

  1. Import your media file

    During import, your audio file is automatically converted to text using state-of-the-art models¹. transcribee 🐝 also automatically detects different speakers in your file.

  2. Manually improve the transcript

    After the automatic transcript is created, you can edit it to correct any mistakes the automatic transcription made.² You can also name the speakers.

    Since transcribee 🐝 is a collaborative software, you can do this step (and all other manual steps) together with others. All changes are instantly synced with everyone working on the transcript.

  3. Automatic re-alignment

    To make sure that the timestamps of your corrected text are still correct, transcribee 🐝 matches this text back up with the audio.

  4. Manual re-alignment

    Now you can check the automatically generated timestamps and correct them.

  5. Export

    Once you are happy with the transcript, you can export it.

Acknowledgements

  • Funded from March 2023 until September 2023 by the Bundesministerium für Bildung und Forschung, the Prototype Fund and the Open Knowledge Foundation Deutschland

Footnotes

  1. At the moment we use whisper.cpp for transcription, Wav2Vec2 for realignment and speechbrain for speaker identification.

  2. The editor is based on slate with collaboration using the automerge CRDT.

transcribee's People

Contributors

anuejn, dnkbln, moeffju, pajowu, phlmn, rroohhh, voronov007


transcribee's Issues

Better Syncing Protocol

At the moment the sync works this way:

  1. The frontend connects to the backend via WS
  2. The backend sends all existing changes to the frontend
  3. If the user edits something, the change is sent back to the backend, which stores it and relays it to all other connected frontends.

The first needed change is that the backend should indicate when it has sent all stored changes. Once we have decided on a CRDT (see #16), we should also implement the syncing protocol of that CRDT.
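This could be as simple as an explicit end-of-backlog message after the initial replay. A minimal sketch of the idea, assuming an async websocket handler and JSON messages (the message types and the function are hypothetical, not the current backend API):

    import json

    async def replay_changes(websocket, stored_changes):
        # Send every change that is already stored for this document.
        for change in stored_changes:
            await websocket.send(json.dumps({"type": "change", "data": change}))
        # Hypothetical marker telling the frontend the backlog is fully transmitted;
        # only after this should the editor consider itself "in sync".
        await websocket.send(json.dumps({"type": "fully_synced"}))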

Whisper configuration

Currently the parameters configuring whisper are hardcoded to be the same as those of the whisper.cpp CLI.
This notably includes the sampling strategy and the maximum segment length.
Do we want to make this configurable?

Manual experiments seem to indicate that the best_of parameter of the greedy sampling strategy has a huge influence.

Interesting parameters:

  • maximum segment length
  • sampling strategy + configuration
  • initial prompt
  • ....
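If we do make them configurable, one possible shape is a small config object that the worker passes through to whisper.cpp. A sketch assuming a plain dataclass; the field names and defaults are illustrative, not the current worker config:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WhisperConfig:
        max_segment_length: int = 60          # maximum tokens per segment
        sampling_strategy: str = "greedy"     # "greedy" or "beam_search"
        best_of: int = 5                      # candidates drawn for greedy sampling
        beam_size: int = 5                    # only relevant for beam search
        initial_prompt: Optional[str] = None  # text to prime the decoder with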

What does confidence mean?

Currently we store a single confidence value per Atom, which is the "transcript confidence" generated by whisper.

However, there are at least three confidence values that we will encounter:

  • transcript confidence
  • speaker assignment confidence
  • alignment confidence

Will the latter two ever be interesting to a user / interesting to store somewhere?
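If we do store them, the Atom could simply carry one optional field per confidence type. A sketch of that shape; this is not the current schema, and only the transcript confidence exists today:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AtomConfidences:
        transcript: float                           # produced by whisper
        speaker_assignment: Optional[float] = None  # hypothetical, from diarization
        alignment: Optional[float] = None           # hypothetical, from wav2vec2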

whisper -> wav2vec2 conversion

For alignment we use wav2vec2 with a character-level alphabet. The transcript generated by whisper has to be converted into a transcript using only the characters known to the wav2vec2 model before alignment can be performed.

The conversion the worker currently performs for this is very unsophisticated:

  1. First, every space is replaced with |, which is the token used by wav2vec2 as a word separator.
  2. Any character not present in the wav2vec2 alphabet is simply dropped.

There are several improvements to this:

  1. Use something like Unicode normalization (for example NFKD) to replace characters in the whisper transcript that are not present in the wav2vec2 alphabet with ones that could be.
  2. Handle languages with no word separators (Chinese, Japanese, etc.).
  3. Add handling for punctuation (. is also a word separator. What about words joined using a -?).
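A sketch of the current two-step conversion plus the NFKD idea from point 1 (simplified; it ignores points 2 and 3 and assumes the alphabet is passed in as a set of characters):

    import unicodedata

    def to_wav2vec2_alphabet(text: str, alphabet: set[str]) -> str:
        # wav2vec2 uses "|" instead of a space as the word separator.
        text = text.replace(" ", "|")
        # NFKD decomposition splits e.g. "é" into "e" plus a combining accent,
        # so at least the base character survives the filtering below.
        text = unicodedata.normalize("NFKD", text)
        # Drop every character the model does not know about.
        return "".join(c for c in text if c == "|" or c in alphabet)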

Alignment in frontend

The current wav2vec2 based alignment could possibly be performed in the frontend without too much effort and seems to be a lot quicker than the transcription using whisper.

Do we want to perform alignment in the frontend? This would be useful for working offline and for reducing the workload of the workers.

Rethink worker tasks & dependencies

Maybe it makes sense to make the dependencies more granular. E.g. the alignment could start aligning the already transcribed parts (maybe with some safety distance) even before the whole document is transcribed.
This could also lead to some cool UX where we display one worker per task (similar to normal users) and each worker can report its own status.

Better handling for invalid utf-8

Currently we store all text of an Atom as utf-8. This interacts badly with whisper, as not every token it generates is a valid utf-8 sequence.
There are two cases:

  1. Back-to-back tokens generated by whisper are valid utf-8 when combined, but not valid utf-8 on their own.
  2. The tokens are just completely invalid.

The first case is currently handled by combining the tokens generated by whisper into an Atom until the combined text is valid utf-8. However, this does not solve the second case and will just cause the whole Paragraph to be empty (as no Atom will ever be emitted for the segment).

Furthermore, the handling for the first case assumes these issues are always contained in a single Segment. This might not always be true.
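A sketch of the accumulate-until-valid approach described above, which also shows where case 2 falls apart (simplified; the real worker code differs):

    def merge_tokens_to_atoms(token_bytes: list[bytes]) -> list[str]:
        atoms = []
        buffer = b""
        for token in token_bytes:
            buffer += token
            try:
                atoms.append(buffer.decode("utf-8"))
                buffer = b""
            except UnicodeDecodeError:
                # Keep accumulating; the next token may complete the sequence.
                continue
        # Case 2: if the buffer never becomes valid utf-8, it is silently
        # dropped here and the whole segment can end up without any Atoms.
        return atoms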

I see two ways forward:

  1. Save all text as bytes and only decode as utf-8 string whenever necessary.
  2. Add more sophisticated handling for the cases where the generated byte stream is not valid utf-8.

The first option would allow "lossless" storage of everything generated by whisper; however, it is unclear how the invalid utf-8 sequences should be interpreted.

Non speech text

There are several non-speech segments generated by whisper.
Broadly, these come in three classes:

  • Punctuation
  • Special tokens (like start of transcript, end of transcript, etc)
  • Special annotations (like *Musik*, ..., etc)

The first two classes are easy to detect and, if necessary for some processing step, to split out. Punctuation is drawn from a limited set of possible characters (like ., ,, - or ...) and the special tokens have specific token IDs that can be filtered out.

The third class, however, is generated just like "normal" transcript text. The official whisper implementation seems to have some heuristics to filter them: https://github.com/openai/whisper/blob/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/tokenizer.py#L237
However, this also looks like a basic heuristic that will not work for *Musik*, for example.

Do we care about this third class? Are there cases where we want to filter these?
One case where I think it could be useful to filter them is the alignment, but it is to be determined how important this is.
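For the third class, even a deliberately naive heuristic could go a long way for the alignment use case. A sketch that only catches segments consisting entirely of a starred or bracketed annotation such as *Musik* (hypothetical, not the whisper heuristic linked above):

    import re

    # Matches segments that are nothing but a bracketed/starred annotation,
    # e.g. "*Musik*", "[Applaus]" or "(laughter)".
    ANNOTATION_RE = re.compile(r"^\s*[\*\[\(].*[\*\]\)]\s*$")

    def is_annotation(segment_text: str) -> bool:
        return bool(ANNOTATION_RE.match(segment_text))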

Recombine paragraphs

We currently use the paragraphs generated by whisper.cpp more or less directly. In my experience this often splits sentences into one long paragraph and a short one containing only the last few words.

It's probably a good idea to recombine these segments using some kind of heuristic (one possible heuristic: if a paragraph is exactly at the token limit, does not end with punctuation, and the next paragraph ends with punctuation, recombine them).

See also the discussion: #49 (comment)
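The heuristic from above as a sketch, assuming each paragraph is a dict with a text and a token count (illustrative, not the actual worker data model):

    SENTENCE_END = (".", "!", "?")

    def recombine(paragraphs: list[dict], token_limit: int) -> list[dict]:
        result: list[dict] = []
        for para in paragraphs:
            prev = result[-1] if result else None
            if (
                prev is not None
                and prev["token_count"] == token_limit  # hit the limit mid-sentence
                and not prev["text"].rstrip().endswith(SENTENCE_END)
                and para["text"].rstrip().endswith(SENTENCE_END)
            ):
                prev["text"] += " " + para["text"]
                prev["token_count"] += para["token_count"]
            else:
                result.append(dict(para))
        return result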

Evaluate CRDTs

We currently have an editor based on the Yjs CRDT. We want to evaluate other CRDTs as well, before committing to one.

Requirements:

  • Works with TypeScript
  • Works with Python
  • Works with slate
  • Fast enough for editing a longer document

Nice-to-Have:

  • Existing TypeScript & Python bindings
  • Easy way to inspect the history of a document and to roll back to specific versions

Possible Candidates:

  • Yjs
  • Automerge
  • [your candidate here]

Pauses and spaces in `Atom`s

Currently we generate Atoms for approximately every whisper token. Notably, these tokens include the spaces that separate words.

Alignment is performed on the basis of these Atoms, and the timing stored in an Atom includes these spaces.
For spaces inside a sentence this is not really significant, but for audio files with long pauses between sentences (like music) or a long pause at the start of the file, the timing of the first Atom following the pause will usually include the whole pause.

This means that when the user clicks on the Atom to start playback from this point, they have to listen to the whole pause before the interesting part with speech starts.

Do we think this is acceptable?

I think the timing specified for an Atom should always cover all "characters" in the Atom's text. So if a leading space is included in the Atom, the pause for that space should also be included in the timing.
So if we think making the user wait through a long pause is not acceptable, the only way forward is splitting spaces and punctuation into separate Atoms.
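If we do split, a sketch of what separating a leading pause into its own space Atom could look like; the Atom fields and the speech_start parameter are illustrative, not the current data model:

    from dataclasses import dataclass

    @dataclass
    class Atom:
        text: str
        start: float  # seconds
        end: float    # seconds

    def split_leading_pause(atom: Atom, speech_start: float) -> list[Atom]:
        # speech_start: where the aligner first detects speech inside this atom.
        # Everything before it (the pause) becomes its own space-only Atom.
        if not atom.text.startswith(" ") or speech_start <= atom.start:
            return [atom]
        return [
            Atom(text=" ", start=atom.start, end=speech_start),
            Atom(text=atom.text.lstrip(" "), start=speech_start, end=atom.end),
        ]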

Diarization is broken

worker   | WARNING:root:Worker failed with exception
worker   | Traceback (most recent call last):
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 315, in run_task
worker   |     task_result = await self.perform_task(task)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 110, in perform_task
worker   |     await self.diarize(task)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 213, in diarize
worker   |     diarization = diarize(document_audio, progress_callback=progress_callback)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/diarize.py", line 65, in diarize
worker   |     diarization = pipeline(audio, hook=_hook)
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/core/pipeline.py", line 324, in __call__
worker   |     return self.apply(file, **kwargs)
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_diarization.py", line 496, in apply
worker   |     embeddings = self.get_embeddings(
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_diarization.py", line 337, in get_embeddings
worker   |     embedding_batch: np.ndarray = self._embedding(
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_verification.py", line 318, in __call__
worker   |     assert num_channels == 1
worker   | AssertionError

Improve readme

There should be at least a link to https://prototypefund.de/project/transcribee/ – or, even better, a translation of the following paragraphs:

To this end, we are building a powerful backend and an easy-to-use web UI. For the web interface we are developing a React-based app; the backend is built with Python, since most open-source libraries are either written in Python themselves or offer good Python interfaces. For transcription these are, for example, Kaldi/Vosk or Facebook's Wav2Vec2 models. For detecting and distinguishing speakers we use pyannote.audio, and for temporally matching text to its audio source ("alignment") aeneas or the Montreal Forced Aligner.

To create a transcript, users go through five steps within the tool. First they import an audio or video file, during which an automatic transcript is created. This is followed by manual correction of that transcript. The text is then automatically re-aligned to the audio. In the fourth step the re-alignment is corrected manually, and finally the transcript is exported. Good open-source libraries already exist for many of these sub-problems; we connect them and make them easy to use.

Whisper parameters

Do we want the whisper parameters to be configurable?

If not, which ones do we want to choose?

Beam search seems to be preferred over greedy sampling, for example.

Speaker segmentation

The worker currently cannot perform speaker segmentation.

Investigate how we want to do this.

Playback position not moved when clicking left of text

When clicking left of the transcript text (but right of the speaker name column), the text cursor is placed at the start of the transcript line. However, the playback position is not moved to the same point.

Screen.Recording.2023-05-02.at.14.01.22.mov

Config option(s) for ml devices

We should add an option to the worker config to choose the device to be used for the ML tasks. I think at the moment only the pytorch-based models support this (i.e. pyannote for diarization and torchaudio for alignment). #49 (comment)
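A sketch of how such an option could be resolved in the worker, assuming the config just carries a device string that is handed to torch (the function and config field are hypothetical):

    import torch

    def resolve_device(configured: str | None = None) -> torch.device:
        # "cuda", "mps" or "cpu" from the worker config; fall back to auto-detection.
        if configured:
            return torch.device(configured)
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")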
