
speechline's People

Contributors

anantoj · davidsamuell · dependabot[bot] · w11wo


speechline's Issues

Failed to Build Documentation

Description

Documentation fails to build on GitHub Actions.

Expected Behavior

Documentation gets built automatically and subsequently served on GitHub Pages.

Actual Behavior

Failed docs deployment. See Failed Job.

Steps to Reproduce

N/A

[ML] Unbatched Classification

Further testing shows that the new batched classification pipeline takes up too much RAM. In contrast, single-audio classification is more memory-efficient, although likely slower than batching. Since time is not really a constraint here, unbatched, single-audio classification is preferable.

  • Source Code
  • Tests
  • Documentation
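
A minimal sketch of the unbatched loop, assuming classify_fn stands in for a single-audio call to the classifier (e.g. a HuggingFace audio-classification pipeline):

```python
from typing import Callable, Dict, List

def classify_unbatched(
    classify_fn: Callable[[str], Dict], paths: List[str]
) -> List[Dict]:
    """Classify audio files one at a time instead of in batches.

    Trades throughput for a flat memory profile: only one audio is
    ever resident in memory, unlike a batched forward pass.
    """
    results = []
    for path in paths:
        # classify_fn wraps a single-input model call; any callable works
        results.append(classify_fn(path))
    return results
```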

[ML] Noise Tagging

In every non-transcribed section, infer what kind of noise is present. Make this optional, and make the minimum non-transcribed section length (in seconds) configurable, so as to not keep inferring super-short segments. The inferred tag is the predicted label with maximum probability above a certain threshold. Add the tag as a chunk with a special bracketed label, e.g. [speech].
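
A sketch of the proposed flow, where classify_fn is a placeholder for the audio classifier returning label-to-probability scores (audio slicing is omitted for brevity):

```python
from typing import Callable, Dict, List, Tuple

def tag_noise_segments(
    segments: List[Tuple[float, float]],
    classify_fn: Callable[[Tuple[float, float]], Dict[str, float]],
    min_length: float = 0.5,
    threshold: float = 0.6,
) -> List[Dict]:
    """Tag each sufficiently long non-transcribed segment.

    Hypothetical helper: min_length and threshold defaults are
    illustrative, not the project's actual configuration.
    """
    chunks = []
    for start, end in segments:
        if end - start < min_length:
            continue  # skip super-short segments
        probs = classify_fn((start, end))
        label, prob = max(probs.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            # store as a special bracketed tag, e.g. [speech]
            chunks.append({"start": start, "end": end, "tag": f"[{label}]"})
    return chunks
```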

[ML] End-to-End Pipeline

  • Source Code
    • Load Data
    • Classify Audios
    • Transcribe Audios
    • Segment Audio
  • Test Suite
  • Formatting & Static-Type Tests
  • Documentation

[ML] Support Other Model Architectures

Description

We need your help to support more models in SpeechLine. Currently, we support:

Automatic Speech Recognizers:

  • wav2vec 2.0
  • OpenAI Whisper
  • HuBERT
  • SpeechT5

Audio Classifiers:

  • wav2vec 2.0
  • Audio Spectrogram Transformer

You can view examples here: classifier and transcribers. We mainly use HuggingFace pipelines for convenience.

Phoneme Error Rate fails on "see"

Description

PER breaks when the predicted phoneme sequence is longer than the pronunciation in the lexicon, causing an out-of-range error.

Expected Behavior

PER has no index error

Actual Behavior

PER has an index error

lexicon = {"see": [['s', 'i']]}
words = ["see"]
prediction = ['s', 'i', 'd']
per = PhonemeErrorRate(lexicon)
per(words=words, prediction=prediction)
IndexError                                Traceback (most recent call last)
Cell In[11], line 4
      2 prediction = ['s', 'i', 'd']
      3 per = PhonemeErrorRate(lexicon)
----> 4 per(words=words, prediction=prediction)

Cell In[9], line 18, in PhonemeErrorRate.__call__(self, words, prediction)
     15 for tag, i1, i2, j1, j2 in s.get_opcodes():
     16     if tag != "equal":
     17         # if there happens to be multiple valid phoneme swaps in current index
---> 18         if i1 == idx and len(stack[idx]) > 1:
     19             # get current substring
     20             expected = reference[i1:i2]
     21             predicted = prediction[j1:j2]

IndexError: list index out of range

Inaccurate Phoneme Timestamps

Phoneme timestamps extracted from wav2vec2 transcriber may be inaccurate due to excessive padding.

Description

Passing padded outputs to ctc-segmentation results in inaccurate timings.

Expected Behavior

Phoneme-level accurate timings.

Actual Behavior

Last phoneme may have inaccurate timings as the algorithm tries to align padding tokens as well, which is unnecessary.

Steps to Reproduce

N/A
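
One possible fix is to drop the logit frames that correspond to padding before handing them to ctc-segmentation. A sketch, assuming frames map roughly linearly onto input samples (the real frame count depends on the model's stride, so this is an approximation):

```python
from typing import List

def trim_padded_logits(
    logits: List[List[float]], num_samples: int, padded_samples: int
) -> List[List[float]]:
    """Keep only the logit frames covering the un-padded audio.

    Hypothetical helper: assumes a linear mapping from samples to
    frames, so the valid fraction of the padded input gives the
    valid fraction of frames.
    """
    num_frames = len(logits)
    valid = round(num_frames * num_samples / padded_samples)
    # frames past `valid` would align only against padding tokens
    return logits[:valid]
```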

[ML] Phoneme Overlap Runner

  • Integrate Phoneme Overlap to Runner
  • Lexicon defaults to Lexikos; also add an optional config that points to a JSON file with extra lexicon entries
  • Test Suite
  • Documentation

[ML] Audio Transcriber

  • Inference
  • Timestamp Extraction
  • Test Suites
  • Styling Test
  • Static-Type Test
  • Documentation

[UT] Punctuation Forced Aligner

To potentially allow wav2vec2 as a duration extractor for other speech tasks like speech synthesis, and since w2v2 doesn't classify punctuation, implement a punctuation forced aligner (PFA). PFA takes the predicted phoneme offsets and the ground-truth text with punctuation, and inserts the punctuation marks into the predicted phoneme offsets.

  • Source Code
  • Test Suites
  • Docs
  • Formatting
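
A simplified word-level sketch of the idea (the phoneme-level version would work the same way); offsets are assumed to align one-to-one with the non-punctuation tokens of the text:

```python
import string
from typing import Dict, List

def insert_punctuation(offsets: List[Dict], text: str) -> List[Dict]:
    """Insert punctuation tokens into predicted offsets.

    Hypothetical helper: assumes `offsets` aligns one-to-one with
    the non-punctuation tokens of `text`.
    """
    merged, i = [], 0
    for token in text.split():
        if token.strip(string.punctuation):
            merged.append(offsets[i])
            i += 1
        trailing = token[len(token.rstrip(string.punctuation)):]
        for mark in trailing:
            end = merged[-1]["end"] if merged else 0.0
            # punctuation carries no audio, so give it zero duration
            merged.append({"text": mark, "start": end, "end": end})
    return merged
```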

`IndexError` on Pronunciation Stack Building

Description

PER breaks with the following lexicon entry:

lexicon = {
    "4806": [
        ['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'ɔ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's']
    ]
}
per = PhonemeErrorRate(lexicon)
per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])

See the error below.

This happens due to the way epsilons are added to the pronunciation stack. First, ['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'] is set as the longest pronunciation, against which all other pronunciations are compared. However, notice that there are insertions coming from the shorter sequences, like ['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of ɹ) and similarly ['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of l and ɹ). The current code only handles deletions, where the shorter sequences are padded to the maximum length. Because of this, the original longest pronunciation ends up shorter than the padded ones, since insertions aren't accounted for.

Expected Behavior

The longest pronunciation gets similarly padded with epsilons. All padded pronunciations must have the same length; only then can the pronunciation stack be generated and the PER calculated accordingly.
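
One simple way to restore the equal-length invariant is to pad every variant, including the longest, with epsilons. End-padding is shown here as a sketch; a proper fix would place epsilons at the aligned positions rather than at the end:

```python
from typing import List, Set

EPS = "ε"  # epsilon placeholder for padding

def pad_pronunciations(pronunciations: List[List[str]]) -> List[List[str]]:
    """Pad every pronunciation variant to a common length.

    Deliberately naive end-padding: it guarantees the equal-length
    invariant the stack builder needs, so indexing cannot go out of
    range, even though it ignores where insertions actually occur.
    """
    target = max(len(p) for p in pronunciations)
    return [list(p) + [EPS] * (target - len(p)) for p in pronunciations]

def build_stack(pronunciations: List[List[str]]) -> List[Set[str]]:
    padded = pad_pronunciations(pronunciations)
    length = len(padded[0])
    # every index is now valid for every variant, so no IndexError
    return [{p[i] for p in padded} for i in range(length)]
```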

Actual Behavior

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[89], line 1
----> 1 per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])

Cell In[79], line 37, in PhonemeErrorRate.compute_measures(self, words, prediction)
     34 def compute_measures(
     35     self, words: List[str], prediction: List[str]
     36 ) -> Dict[str, int]:
---> 37     stack = self._build_pronunciation_stack(words)
     38     reference = [
     39         phoneme for word in words for phoneme in max(self.lexicon[word], key=len)
     40     ]
     42     editops = Levenshtein.editops(reference, prediction)

Cell In[79], line 81, in PhonemeErrorRate._build_pronunciation_stack(self, words)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
---> 81     word_stack = [
     82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

Cell In[79], line 82, in <listcomp>(.0)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
     81     word_stack = [
---> 82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

Cell In[79], line 82, in <genexpr>(.0)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
     81     word_stack = [
---> 82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

IndexError: list index out of range

[OS] Pipeline Documentation

  • For example, what does the pandas dataframe contain? How often does it get populated, etc?
  • How do you convert from a pandas dataframe to a Hugging Face dataset? What is the purpose of the transformation?
  • What does the audio classifier do to separate audio (i.e. children voices from adult voices, noise from true audible speech, etc)?
  • Following classification, what is the purpose of the audio transcription?
  • What are the resulting audio chunks? At the phoneme level?

[OS] Docstrings Cleanup

Make docstring style uniform as follows:

  • Newline after """
  • Newline for arguments and returns
  • Markdown-wrap all Python statements.
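
For illustration, a docstring following the three rules above might look like this (the function itself is a made-up example):

```python
def load_audio(path: str):
    """
    Load an audio file as a waveform array.

    Args:
        path (`str`):
            Path to the audio file.

    Returns:
        `np.ndarray`:
            Waveform array.
    """
    ...
```

Note the newline directly after the opening `"""`, the newlines before the argument and return descriptions, and the backtick-wrapped Python identifiers.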

[ML] PER Lexicon Validator

Add a lexicon validator to ensure that all words in the sequences passed have entries in the lexicon.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting
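
A minimal sketch of such a validator (the helper name is hypothetical):

```python
from typing import Dict, List

def validate_lexicon(lexicon: Dict[str, list], words: List[str]) -> None:
    """Raise early if any word lacks a lexicon entry.

    Failing fast here avoids a later KeyError deep inside the
    PER computation.
    """
    missing = [w for w in words if w not in lexicon]
    if missing:
        raise KeyError(f"Words missing from lexicon: {missing}")
```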

[ML] Phoneme Error Rate

Design a flexible phoneme error rate, where users can specify their own lexicon, consisting of words and their allowed pronunciation variants.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting

[ML] Add Filter Empty Transcript Option

In the word-overlap segmentation method, we might want to immediately skip audios with empty transcripts -- they can't possibly overlap with the transcribed text anyway. Add an option to filter out such entries in the pipeline.
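
A sketch of what the filter might look like, assuming entries carry a "transcript" field (the schema is an assumption):

```python
from typing import Dict, Iterable, List

def filter_empty_transcripts(
    records: Iterable[Dict], enabled: bool = True
) -> List[Dict]:
    """Drop entries whose transcript is empty or whitespace-only.

    With enabled=False, records pass through untouched, keeping the
    behavior opt-in as the issue proposes.
    """
    if not enabled:
        return list(records)
    return [r for r in records if r.get("transcript", "").strip()]
```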

[ML] Improve PER

  • Replicate WER metric from HuggingFace Evaluate
  • Add lexicon validator
  • Test Suites
  • Documentation
  • Formatting

[OS] Improve Docs

  • Fix return types not correctly rendered
  • Change all example blocks to pycon

[ML] Variable Phoneme Length PER

Extend PER calculation to allow for differing phoneme lengths during comparison. This is mainly to allow for phoneme insertions that may occur in different accent variations.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting

[ML] Modify Classification to use Pipeline

To better unify the API design, re-implement batched audio classification via a HuggingFace pipeline, verifying equivalence in logits, preprocessing, and runtime speed.

  • Source Code
  • Test Suites
  • Formatting

[ML] Segmentation Strategies

  • Allow for unsegmented output
  • Allow for different segmentation techniques (currently only silence-based; also implement ground-truth overlapping)
