bugbakery / transcribee

open source audio and video transcription software

Home Page: https://transcribee.net

License: GNU Affero General Public License v3.0

Languages: HTML 0.12%, JavaScript 1.26%, TypeScript 41.01%, CSS 0.07%, Python 39.54%, Nix 1.50%, Procfile 0.05%, Shell 0.50%, Dockerfile 0.08%, Mako 0.10%, Jupyter Notebook 15.76%

Topics: transcription, collaborative, speech-to-text

transcribee's Introduction

transcribee logo

🎤 transcribee ✍️

[going to be] an open source audio and video transcription software

Note:

Currently, transcribee is heavily work-in-progress and not yet ready for production use. Please come back in a few weeks / months.

transcribee 🐝 aims to make the workflow for media transcription easier, faster and more accessible.

  • It can automatically generate a draft transcript of your audio
  • It allows you to quickly improve the automatic draft and fix any errors
  • It's collaborative – split the work with your friends or colleagues
  • It's open-source

Develop!

To get started with developing or to try the current state of transcribee, follow the instructions in the development setup document.

How does it work?

Note:

We're working heavily on transcribee; not all steps described here are implemented yet.

Creating a transcript with transcribee 🐝 is done with the following steps:

  1. Import your media file

    During import, your audio file is automatically converted to text using state-of-the-art models¹. transcribee 🐝 also automatically detects different speakers in your file.

  2. Manually improve the transcript

    After the automatic transcript is created, you can edit it to correct any mistakes the automatic transcription made.² You can also name the speakers.

    Since transcribee 🐝 is a collaborative software, you can do this step (and all other manual steps) together with others. All changes are instantly synced with everyone working on the transcript.

  3. Automatic re-alignment

    To make sure that the timestamps of your corrected text are still correct, transcribee 🐝 matches this text back up with the audio.

  4. Manual re-alignment

    Now you can check the automatically generated timestamps and correct them.

  5. Export

    Once you are happy with the transcript, you can export it.

Acknowledgements

  • Funded from March 2023 until September 2023 by the Bundesministerium für Bildung und Forschung, the Prototype Fund and the Open Knowledge Foundation Deutschland

Footnotes

  1. At the moment we use whisper.cpp for transcription, Wav2Vec2 for realignment and speechbrain for speaker identification.

  2. The editor is based on slate with collaboration using the automerge CRDT.

transcribee's People

Contributors

anuejn, dnkbln, moeffju, pajowu, phlmn, rroohhh, voronov007


transcribee's Issues

Better Syncing Protocol

At the moment the sync works this way:

  1. The frontend connects to the backend via WS
  2. The backend sends all existing changes to the frontend
  3. If the user edits something, the change is sent back to the backend, which stores it and relays it to all other connected frontends.

The first needed change is that the backend should indicate when it has sent all stored changes. Once we have decided on a CRDT (see #16), we should also implement the syncing protocol of that CRDT.
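This could be as simple as an explicit end-of-backlog message after the initial replay. A minimal sketch of the idea, assuming an async websocket handler and JSON messages (the message types and the function are hypothetical, not the current backend API):

    import json

    async def replay_changes(websocket, stored_changes):
        # Send every change that is already stored for this document.
        for change in stored_changes:
            await websocket.send(json.dumps({"type": "change", "data": change}))
        # Hypothetical marker telling the frontend the backlog is fully transmitted;
        # only after this should the editor consider itself "in sync".
        await websocket.send(json.dumps({"type": "fully_synced"}))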

Whisper configuration

Currently the parameters configuring whisper are hardcoded to be the same as those of the whisper.cpp CLI.
This notably includes the sampling strategy and the maximum segment length.
Do we want to make this configurable?

Manual experiments seem to indicate that the best_of parameter of the greedy sampling strategy has a huge influence.

Interesting parameters:

  • maximum segment length
  • sampling strategy + configuration
  • initial prompt
  • ....
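If we do make them configurable, one possible shape is a small config object that the worker passes through to whisper.cpp. A sketch assuming a plain dataclass; the field names and defaults are illustrative, not the current worker config:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WhisperConfig:
        max_segment_length: int = 60          # maximum tokens per segment
        sampling_strategy: str = "greedy"     # "greedy" or "beam_search"
        best_of: int = 5                      # candidates drawn for greedy sampling
        beam_size: int = 5                    # only relevant for beam search
        initial_prompt: Optional[str] = None  # text to prime the decoder with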

What does confidence mean?

Currently we store a single confidence value per Atom, which is the "transcript confidence" generated by whisper.

However, there are at least three confidence values that we will encounter:

  • transcript confidence
  • speaker assignment confidence
  • alignment confidence

Will the latter two ever be interesting to a user / interesting to store somewhere?
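If we do store them, the Atom could simply carry one optional field per confidence type. A sketch of that shape; this is not the current schema, and only the transcript confidence exists today:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AtomConfidences:
        transcript: float                           # produced by whisper
        speaker_assignment: Optional[float] = None  # hypothetical, from diarization
        alignment: Optional[float] = None           # hypothetical, from wav2vec2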

whisper -> wav2vec2 conversion

For alignment we use wav2vec2 with a character-level alphabet. The transcript generated by whisper has to be converted into a transcript using only the characters known to the wav2vec2 model before alignment can be performed.

The conversion the worker currently performs for this is very unsophisticated:

  1. First, every space is replaced with |, which is the token used by wav2vec2 as a word separator.
  2. Any character not present in the wav2vec2 alphabet is simply dropped.

There are several improvements to this:

  1. Use something like Unicode normalization (for example NFKD) to replace characters in the whisper transcript that are not present in the wav2vec2 alphabet with ones that could be.
  2. Handle languages with no word separators (Chinese, Japanese, etc.).
  3. Add handling for punctuation (. is also a word separator. What about words joined using a -?).
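A sketch of the current two-step conversion plus the NFKD idea from point 1 (simplified; it ignores points 2 and 3 and assumes the alphabet is passed in as a set of characters):

    import unicodedata

    def to_wav2vec2_alphabet(text: str, alphabet: set[str]) -> str:
        # wav2vec2 uses "|" instead of a space as the word separator.
        text = text.replace(" ", "|")
        # NFKD decomposition splits e.g. "é" into "e" plus a combining accent,
        # so at least the base character survives the filtering below.
        text = unicodedata.normalize("NFKD", text)
        # Drop every character the model does not know about.
        return "".join(c for c in text if c == "|" or c in alphabet)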

Alignment in frontend

The current wav2vec2 based alignment could possibly be performed in the frontend without too much effort and seems to be a lot quicker than the transcription using whisper.

Do we want to perform alignment in the frontend? This would be useful for working offline and for reducing the workload of the workers.

Rethink worker tasks & dependencies

Maybe it makes sense to make the dependencies more granular. E.g. the alignment could start aligning the already transcribed parts (maybe with some safety distance) even before the whole document is transcribed.
This could also lead to some cool UX where we display one worker per task (similar to normal users) and each worker can report its own status.

Better handling for invalid utf-8

Currently we store all text of an Atom as utf-8. This interacts badly with whisper, as not every token it generates is a valid utf-8 sequence.
There are two cases:

  1. Back-to-back tokens generated by whisper are valid utf-8 when combined, but not valid utf-8 on their own.
  2. The tokens are just completely invalid.

The first case is currently handled by combining the tokens generated by whisper into an Atom until the combined text is valid utf-8. However, this does not solve the second case and will just cause the whole Paragraph to be empty (as no Atom will ever be emitted for the segment).

Furthermore, the handling for the first case assumes these issues are always contained in a single Segment. This might not always be true.
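A sketch of the accumulate-until-valid approach described above, which also shows where case 2 falls apart (simplified; the real worker code differs):

    def merge_tokens_to_atoms(token_bytes: list[bytes]) -> list[str]:
        atoms = []
        buffer = b""
        for token in token_bytes:
            buffer += token
            try:
                atoms.append(buffer.decode("utf-8"))
                buffer = b""
            except UnicodeDecodeError:
                # Keep accumulating; the next token may complete the sequence.
                continue
        # Case 2: if the buffer never becomes valid utf-8, it is silently
        # dropped here and the whole segment can end up without any Atoms.
        return atoms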

I see two ways forward:

  1. Save all text as bytes and only decode as utf-8 string whenever necessary.
  2. Add more sophisticated handling for the cases where the generated byte stream is not valid utf-8.

The first option would allow "lossless" storage of everything generated by whisper; however, it is unclear how the invalid utf-8 sequences should be interpreted.

Non speech text

There are several non-speech segments generated by whisper.
Broadly, these come in three classes:

  • Punctuation
  • Special tokens (like start of transcript, end of transcript, etc)
  • Special annotations (like *Musik*, ..., etc)

The first two classes are easy to detect and, if necessary for some processing step, to split out. Punctuation is drawn from a limited set of possible characters (like ., ,, - or ...) and the special tokens have specific token IDs that can be filtered out.

The third class, however, is generated just like "normal" transcript text. The official whisper implementation seems to have some heuristics to filter them: https://github.com/openai/whisper/blob/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/tokenizer.py#L237
However, this also looks like a basic heuristic that will not work for *Musik*, for example.

Do we care about this third class? Are there cases where we want to filter these?
One case where I think it could be useful to filter them is the alignment, but it is to be determined how important this is.
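For the third class, even a deliberately naive heuristic could go a long way for the alignment use case. A sketch that only catches segments consisting entirely of a starred or bracketed annotation such as *Musik* (hypothetical, not the whisper heuristic linked above):

    import re

    # Matches segments that are nothing but a bracketed/starred annotation,
    # e.g. "*Musik*", "[Applaus]" or "(laughter)".
    ANNOTATION_RE = re.compile(r"^\s*[\*\[\(].*[\*\]\)]\s*$")

    def is_annotation(segment_text: str) -> bool:
        return bool(ANNOTATION_RE.match(segment_text))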

Recombine paragraphs

We currently use the paragraphs generated by whisper.cpp more or less directly. In my experience this often splits sentences into one long paragraph and a short one containing only the last few words.

It's probably a good idea to recombine these segments using some kind of heuristic (one possible heuristic: if a paragraph is exactly at the token limit, does not end with punctuation, and the next paragraph ends with punctuation, recombine them).

See also the discussion: #49 (comment)
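The heuristic from above as a sketch, assuming each paragraph is a dict with a text and a token count (illustrative, not the actual worker data model):

    SENTENCE_END = (".", "!", "?")

    def recombine(paragraphs: list[dict], token_limit: int) -> list[dict]:
        result: list[dict] = []
        for para in paragraphs:
            prev = result[-1] if result else None
            if (
                prev is not None
                and prev["token_count"] == token_limit  # hit the limit mid-sentence
                and not prev["text"].rstrip().endswith(SENTENCE_END)
                and para["text"].rstrip().endswith(SENTENCE_END)
            ):
                prev["text"] += " " + para["text"]
                prev["token_count"] += para["token_count"]
            else:
                result.append(dict(para))
        return result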

Evaluate CRDTs

We currently have an editor based on the Yjs CRDT. We want to evaluate other CRDTs as well, before committing to one.

Requirements:

  • Works with TypeScript
  • Works with Python
  • Works with slate
  • Fast enough for editing a longer document

Nice-to-Have:

  • Existing TypeScript & Python bindings
  • Easy way to inspect the history of a document and to roll back to specific versions

Possible Candidates:

  • Yjs
  • Automerge
  • [your candidate here]

Pauses and spaces in `Atom`s

Currently we generate Atoms for approximately every whisper token. Notably, these tokens include the spaces that separate words.

Alignment is performed on the basis of these Atoms, and the timing stored in an Atom includes these spaces.
For spaces inside a sentence this is not really significant, but for audio files with long pauses between sentences (like music) or a long pause at the start of the file, the timing of the first Atom following the pause will usually include the whole pause.

This means that when the user clicks on the Atom to start playback from this point, they have to listen to the whole pause before the interesting part with speech starts.

Do we think this is acceptable?

I think the timing specified for an Atom should always cover all "characters" in the Atom's text. So if a leading space is included in the Atom, the pause for that space should also be included in the timing.
So if we think making the user wait through a long pause is not acceptable, the only way forward is splitting spaces and punctuation into separate Atoms.
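If we do split, a sketch of what separating a leading pause into its own space Atom could look like; the Atom fields and the speech_start parameter are illustrative, not the current data model:

    from dataclasses import dataclass

    @dataclass
    class Atom:
        text: str
        start: float  # seconds
        end: float    # seconds

    def split_leading_pause(atom: Atom, speech_start: float) -> list[Atom]:
        # speech_start: where the aligner first detects speech inside this atom.
        # Everything before it (the pause) becomes its own space-only Atom.
        if not atom.text.startswith(" ") or speech_start <= atom.start:
            return [atom]
        return [
            Atom(text=" ", start=atom.start, end=speech_start),
            Atom(text=atom.text.lstrip(" "), start=speech_start, end=atom.end),
        ]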

Diarization is broken

worker   | WARNING:root:Worker failed with exception
worker   | Traceback (most recent call last):
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 315, in run_task
worker   |     task_result = await self.perform_task(task)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 110, in perform_task
worker   |     await self.diarize(task)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/worker.py", line 213, in diarize
worker   |     diarization = diarize(document_audio, progress_callback=progress_callback)
worker   |   File "/data/projects/transcribee/transcribee/worker/transcribee_worker/diarize.py", line 65, in diarize
worker   |     diarization = pipeline(audio, hook=_hook)
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/core/pipeline.py", line 324, in __call__
worker   |     return self.apply(file, **kwargs)
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_diarization.py", line 496, in apply
worker   |     embeddings = self.get_embeddings(
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_diarization.py", line 337, in get_embeddings
worker   |     embedding_batch: np.ndarray = self._embedding(
worker   |   File "/data/projects/transcribee/transcribee/worker/__pypackages__/3.10/lib/pyannote/audio/pipelines/speaker_verification.py", line 318, in __call__
worker   |     assert num_channels == 1
worker   | AssertionError

Improve readme

There should be at least a link to https://prototypefund.de/project/transcribee/ – or, even better, a translation of the following paragraphs:

To this end, we are building a powerful backend and an easy-to-use web UI. For the web interface we are developing a React-based app; the backend is built with Python, since most open-source libraries are either written in Python themselves or offer good Python interfaces. For transcription these are, for example, Kaldi/Vosk or Facebook's Wav2Vec2 models. For detecting and distinguishing speakers we use pyannote.audio, and for temporally matching text to its audio source ("alignment") aeneas or the Montreal Forced Aligner.

To create a transcript, users go through five steps within the tool. First they import an audio or video file, during which an automatic transcript is created. This is followed by manual correction of that transcript. The text is then automatically re-aligned to the audio. In the fourth step the re-alignment is corrected manually, and finally the transcript is exported. Good open-source libraries already exist for many of these sub-problems; we connect them and make them easy to use.

Whisper parameters

Do we want the whisper parameters to be configurable?

If not, which ones do we want to choose?

Beam search seems to be preferred over greedy sampling, for example.

Speaker segmentation

The worker currently cannot perform speaker segmentation.

Investigate how we want to do this.

Playback position not moved when clicking left of text

When clicking left of the transcript text (but right of the speaker name column), the text cursor is placed at the start of the transcript line. However, the playback position is not moved to the same point.

Screen.Recording.2023-05-02.at.14.01.22.mov

Config option(s) for ml devices

We should add an option to the worker config to choose the device to be used for the ML tasks. I think at the moment only the pytorch-based models support this (i.e. pyannote for diarization and torchaudio for alignment). #49 (comment)
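A sketch of how such an option could be resolved in the worker, assuming the config just carries a device string that is handed to torch (the function and config field are hypothetical):

    import torch

    def resolve_device(configured: str | None = None) -> torch.device:
        # "cuda", "mps" or "cpu" from the worker config; fall back to auto-detection.
        if configured:
            return torch.device(configured)
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")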
