bookbot-kids / speechline
An end-to-end, offline audio categorization, transcription, and segmentation toolkit.
Home Page: https://bookbot-kids.github.io/speechline/
License: Apache License 2.0
Documentation fails to build on GitHub Actions.
Documentation gets built automatically and subsequently served on GitHub Pages.
Failed docs deployment. See Failed Job.
N/A
After further testing, the new batched classification pipeline seems to take up too much RAM. In contrast, single-audio classification is more memory-efficient, although likely slower than batching. However, since time is not really a constraint here, unbatched, single-audio classification is preferable. A minimal sketch of the unbatched approach follows.
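For illustration, a hedged sketch of unbatched classification through a HuggingFace audio-classification pipeline; the checkpoint name and file paths are placeholders, not SpeechLine's actual configuration:

from transformers import pipeline

# placeholder checkpoint; swap in the project's actual classifier
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

audio_paths = ["audio/sample_1.wav", "audio/sample_2.wav"]

# classify one file at a time: peak memory is bounded by a single
# example's tensors instead of a whole batch
for path in audio_paths:
    predictions = classifier(path)  # list of {"label", "score"}, highest score first
    print(path, predictions[0]["label"], predictions[0]["score"])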
In every non-transcribed section, infer what kind of noise is present. Make this optional, and expose a minimum non-transcribed section length in seconds (so as to not keep inferring on super-short segments). The inferred tag is the predicted label with the maximum probability above a certain threshold. Add the tag as a chunk with a special marker, e.g. [speech]. A sketch of this idea follows.
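A minimal sketch of the proposed behavior; Chunk, classify_segment, and infer_noise_tags are hypothetical names, not SpeechLine's actual API:

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Chunk:
    start: float  # seconds
    end: float    # seconds
    text: str

def infer_noise_tags(
    gaps: List[Chunk],  # non-transcribed sections
    classify_segment: Callable[[float, float], Tuple[str, float]],
    min_duration: float = 1.0,  # skip super-short segments
    threshold: float = 0.8,  # minimum probability to accept a label
) -> List[Chunk]:
    tagged = []
    for gap in gaps:
        if gap.end - gap.start < min_duration:
            continue  # too short to classify reliably
        label, probability = classify_segment(gap.start, gap.end)
        if probability >= threshold:
            # store the inferred label as a special bracketed tag, e.g. [speech]
            tagged.append(Chunk(gap.start, gap.end, f"[{label}]"))
    return tagged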
We need your help to support more models in SpeechLine. Currently, we support:
Automatic Speech Recognizers:
Audio Classifiers:
You can view examples here: classifiers and transcribers. We mainly use HuggingFace pipelines for convenience.
PER breaks when the predicted phoneme sequence is longer than the pronunciation in the lexicon, causing an out-of-range error.
PER has no index error
PER has an index error
lexicon = {"see": [['s', 'i']]}
words = ["see"]
prediction = ['s', 'i', 'd']
per = PhonemeErrorRate(lexicon)
per(words=words, prediction=prediction)
IndexError Traceback (most recent call last)
Cell In[11], line 4
2 prediction = ['s', 'i', 'd']
3 per = PhonemeErrorRate(lexicon)
----> 4 per(words=words, prediction=prediction)
Cell In[9], line 18, in PhonemeErrorRate.__call__(self, words, prediction)
15 for tag, i1, i2, j1, j2 in s.get_opcodes():
16 if tag != "equal":
17 # if there happens to be multiple valid phoneme swaps in current index
---> 18 if i1 == idx and len(stack[idx]) > 1:
19 # get current substring
20 expected = reference[i1:i2]
21 predicted = prediction[j1:j2]
IndexError: list index out of range
Phoneme timestamps extracted from the wav2vec2 transcriber may be inaccurate due to excessive padding.
Passing padded outputs to ctc-segmentation results in inaccurate timings.
Phoneme-level accurate timings.
The last phoneme may have inaccurate timings because the algorithm tries to align padding tokens as well, which is unnecessary. A sketch of trimming the padded frames follows below.
N/A
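One possible mitigation, sketched here under the assumption that CTC frames are spread uniformly over the padded waveform (illustrative only, not SpeechLine's actual code): trim the logit frames that correspond to padding before handing them to ctc-segmentation.

import numpy as np

def trim_padded_logits(
    logits: np.ndarray,  # (frames, vocab_size) CTC output for one utterance
    input_length: int,  # original waveform length in samples
    padded_length: int,  # waveform length after batch padding
) -> np.ndarray:
    # assumption: frame count scales linearly with waveform length
    n_frames = logits.shape[0]
    valid_frames = int(round(n_frames * input_length / padded_length))
    return logits[:valid_frames]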
To potentially allow wav2vec2 as a duration extractor for other speech tasks like speech synthesis, and since w2v2 doesn't predict punctuation, implement a punctuation forced aligner (PFA). PFA takes the predicted phoneme offsets and the ground-truth text with punctuation, and inserts the punctuation marks into the predicted phoneme offsets. A minimal sketch follows.
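A hedged sketch of the idea; insert_punctuation and the offset layout are assumptions for illustration, not the eventual implementation:

from typing import List, Tuple

PUNCTUATION = set(".,!?;:")

def insert_punctuation(
    phoneme_offsets: List[List[Tuple[str, float, float]]],  # per word: (phoneme, start, end)
    words: List[str],  # ground-truth words, possibly ending in punctuation
) -> List[Tuple[str, float, float]]:
    aligned = []
    for offsets, word in zip(phoneme_offsets, words):
        aligned.extend(offsets)
        if word and word[-1] in PUNCTUATION and offsets:
            # anchor the mark as a zero-duration token at the word's end time
            end = offsets[-1][2]
            aligned.append((word[-1], end, end))
    return aligned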
PER breaks with the following lexicon entry:
lexicon = {
"4806": [
['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
['f', 'ɔ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'],
['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's']
]
}
per = PhonemeErrorRate(lexicon)
per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])
See the error below.
This happens due to the way epsilons are added into the pronunciation stack. First, ['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'] is set as the longest pronunciation, against which all other pronunciations are compared. However, notice that there are insertions coming from the shorter sequences, like ['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of ɹ) and similarly ['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of l and ɹ). The current code only handles deletions, where the shorter sequences are padded to the maximum length. Because of this, the original longest pronunciation ends up shorter than the padded ones, since insertions aren't accounted for.
The fix: the longest pronunciation must be padded with epsilons in the same way. All padded pronunciations must have the same length; only then can the pronunciation stack be generated and PER calculated correctly. A sketch of one possible fix follows the traceback below.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[89], line 1
----> 1 per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])
Cell In[79], line 37, in PhonemeErrorRate.compute_measures(self, words, prediction)
34 def compute_measures(
35 self, words: List[str], prediction: List[str]
36 ) -> Dict[str, int]:
---> 37 stack = self._build_pronunciation_stack(words)
38 reference = [
39 phoneme for word in words for phoneme in max(self.lexicon[word], key=len)
40 ]
42 editops = Levenshtein.editops(reference, prediction)
Cell In[79], line 81, in PhonemeErrorRate._build_pronunciation_stack(self, words)
79 print(pronunciations)
80 length = len(pronunciations[0])
---> 81 word_stack = [
82 set(pron[i] for pron in pronunciations) for i in range(length)
83 ]
84 stack += word_stack
85 return stack
Cell In[79], line 82, in <listcomp>(.0)
79 print(pronunciations)
80 length = len(pronunciations[0])
81 word_stack = [
---> 82 set(pron[i] for pron in pronunciations) for i in range(length)
83 ]
84 stack += word_stack
85 return stack
Cell In[79], line 82, in <genexpr>(.0)
79 print(pronunciations)
80 length = len(pronunciations[0])
81 word_stack = [
---> 82 set(pron[i] for pron in pronunciations) for i in range(length)
83 ]
84 stack += word_stack
85 return stack
IndexError: list index out of range
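A sketch of one way to realize the fix (illustrative, not the actual patch): fold all pronunciation variants into a common supersequence, then pad every variant against it with epsilons, so all padded variants share one length before the stack is built.

from difflib import SequenceMatcher
from typing import List, Set

EPS = "ε"  # epsilon padding symbol

def fold(reference: List[str], pron: List[str]) -> List[str]:
    # merge pron into reference so that both remain subsequences of the result
    merged = []
    matcher = SequenceMatcher(a=reference, b=pron, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            merged.extend(pron[j1:j2])
        elif tag == "replace":  # keep both sides, in order
            merged.extend(reference[i1:i2])
            merged.extend(pron[j1:j2])
        else:  # "equal" or "delete": keep the reference side
            merged.extend(reference[i1:i2])
    return merged

def pad(pron: List[str], reference: List[str]) -> List[str]:
    # greedily embed pron into the supersequence, filling gaps with epsilons
    padded, i = [], 0
    for symbol in reference:
        if i < len(pron) and pron[i] == symbol:
            padded.append(pron[i])
            i += 1
        else:
            padded.append(EPS)
    assert i == len(pron), "reference must be a supersequence"
    return padded

def build_stack(pronunciations: List[List[str]]) -> List[Set[str]]:
    reference = list(pronunciations[0])
    for pron in pronunciations[1:]:
        reference = fold(reference, pron)
    padded = [pad(p, reference) for p in pronunciations]
    # every padded variant now has the same length, so per-position
    # phoneme sets are well-defined
    return [set(column) for column in zip(*padded)]

With the "4806" lexicon above, build_stack no longer raises an IndexError, because insertions such as ɹ and l widen the shared reference instead of overflowing it.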
Make the docstring style uniform, as follows:
"""
Add a lexicon validator to ensure that all words in the sequences passed have entries in the lexicon.
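A minimal sketch of such a validator; the helper name is hypothetical:

from typing import Dict, List

def validate_lexicon(words: List[str], lexicon: Dict[str, List[List[str]]]) -> None:
    # fail fast with every missing word instead of an opaque KeyError later
    missing = [word for word in words if word not in lexicon]
    if missing:
        raise KeyError(f"Words missing from lexicon: {missing}")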
Design a flexible phoneme error rate, where users can specify their own lexicon, consisting of words and their allowed pronunciation variants.
In the word-overlap segmentation method, we might want to immediately skip audios with empty transcripts, because they can't possibly overlap with the transcribed text anyway. Add an option to filter out these entries in the pipeline. A minimal sketch follows.
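A minimal sketch of the proposed filter; the entry layout and "transcript" key are assumptions, not SpeechLine's actual schema:

from typing import Dict, List

def filter_empty_transcripts(entries: List[Dict]) -> List[Dict]:
    # keep only entries whose transcript contains non-whitespace text
    return [entry for entry in entries if entry.get("transcript", "").strip()]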
Integrate OpenAI Whisper into Runner.
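For reference, a hedged sketch of running Whisper through the HuggingFace ASR pipeline; the checkpoint and file path are illustrative, and the actual Runner integration may differ:

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
)
result = transcriber("audio/sample_1.wav", return_timestamps=True)
print(result["text"])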
Automatically format SpeechLine output files as a HuggingFace dataset via a script.
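A hedged sketch of what such a script might do; column names, paths, and the target repo are placeholders:

from datasets import Audio, Dataset

records = {
    "audio": ["outputs/chunk_0.wav", "outputs/chunk_1.wav"],
    "text": ["hello world", "good morning"],
}
dataset = Dataset.from_dict(records)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset.save_to_disk("speechline_dataset")  # or dataset.push_to_hub("your-org/...")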
Extend PER calculation to allow for differing phoneme lengths during comparison. This is mainly to allow for phoneme insertions that may occur in different accent variations.
To better unify the API design, re-implement batched audio classification via a HuggingFace pipeline, noting equivalence in logits, preprocessing, and runtime speed. A sketch of an equivalence check follows.
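A minimal sketch of checking that batched and unbatched runs agree; the checkpoint, file paths, and tolerance are illustrative:

import numpy as np
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # placeholder checkpoint
)
files = ["audio/sample_1.wav", "audio/sample_2.wav"]

single = [classifier(f) for f in files]
batched = classifier(files, batch_size=2)

for s, b in zip(single, batched):
    assert s[0]["label"] == b[0]["label"]
    assert np.isclose(s[0]["score"], b[0]["score"], atol=1e-5)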