
speechline's People

Contributors

anantoj · davidsamuell · dependabot[bot] · w11wo


speechline's Issues

Failed to Build Documentation

Description

Documentation fails to build on GitHub Actions.

Expected Behavior

Documentation gets built automatically and subsequently served on GitHub Pages.

Actual Behavior

Failed docs deployment. See Failed Job.

Steps to Reproduce

N/A

[ML] Unbatched Classification

Further testing shows that the new batched classification pipeline takes up too much RAM. In contrast, single-audio classification is more memory-efficient, although likely slower than batching. Since time is not really a constraint here, unbatched, single-audio classification is preferable.

  • Source Code
  • Tests
  • Documentation
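
A minimal sketch of the unbatched loop, assuming classify_fn stands in for a single-audio call to the classifier (e.g. a HuggingFace audio-classification pipeline):

```python
from typing import Callable, Dict, List

def classify_unbatched(
    classify_fn: Callable[[str], Dict], paths: List[str]
) -> List[Dict]:
    """Classify audio files one at a time instead of in batches.

    Trades throughput for a flat memory profile: only one audio is
    ever resident in memory, unlike a batched forward pass.
    """
    results = []
    for path in paths:
        # classify_fn wraps a single-input model call; any callable works
        results.append(classify_fn(path))
    return results
```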

[ML] Noise Tagging

In every non-transcribed section, infer what kind of noise is present. Make this optional, and make the minimum non-transcribed section length (in seconds) configurable, so as to not keep inferring super-short segments. The inferred tag is the predicted label with maximum probability above a certain threshold. Add the tag as a chunk with a special bracketed label, e.g. [speech].
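
A sketch of the proposed flow, where classify_fn is a placeholder for the audio classifier returning label-to-probability scores (audio slicing is omitted for brevity):

```python
from typing import Callable, Dict, List, Tuple

def tag_noise_segments(
    segments: List[Tuple[float, float]],
    classify_fn: Callable[[Tuple[float, float]], Dict[str, float]],
    min_length: float = 0.5,
    threshold: float = 0.6,
) -> List[Dict]:
    """Tag each sufficiently long non-transcribed segment.

    Hypothetical helper: min_length and threshold defaults are
    illustrative, not the project's actual configuration.
    """
    chunks = []
    for start, end in segments:
        if end - start < min_length:
            continue  # skip super-short segments
        probs = classify_fn((start, end))
        label, prob = max(probs.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            # store as a special bracketed tag, e.g. [speech]
            chunks.append({"start": start, "end": end, "tag": f"[{label}]"})
    return chunks
```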

[ML] End-to-End Pipeline

  • Source Code
    • Load Data
    • Classify Audios
    • Transcribe Audios
    • Segment Audio
  • Test Suite
  • Formatting & Static-Type Tests
  • Documentation

[ML] Support Other Model Architectures

Description

We need your help to support more models in SpeechLine. Currently, we support:

Automatic Speech Recognizers:

  • wav2vec 2.0
  • OpenAI Whisper
  • HuBERT
  • SpeechT5

Audio Classifiers:

  • wav2vec 2.0
  • Audio Spectrogram Transformer

You can view examples here: classifier and transcribers. We mainly use HuggingFace pipelines for convenience.

Phoneme Error Rate fails on "see"

Description

PER breaks when the predicted phoneme sequence is longer than the pronunciation in the lexicon, causing an out-of-range error.

Expected Behavior

PER has no index error

Actual Behavior

PER has an index error

lexicon = {"see": [['s', 'i']]}
words = ["see"]
prediction = ['s', 'i', 'd']
per = PhonemeErrorRate(lexicon)
per(words=words, prediction=prediction)
IndexError                                Traceback (most recent call last)
Cell In[11], line 4
      2 prediction = ['s', 'i', 'd']
      3 per = PhonemeErrorRate(lexicon)
----> 4 per(words=words, prediction=prediction)

Cell In[9], line 18, in PhonemeErrorRate.__call__(self, words, prediction)
     15 for tag, i1, i2, j1, j2 in s.get_opcodes():
     16     if tag != "equal":
     17         # if there happens to be multiple valid phoneme swaps in current index
---> 18         if i1 == idx and len(stack[idx]) > 1:
     19             # get current substring
     20             expected = reference[i1:i2]
     21             predicted = prediction[j1:j2]

IndexError: list index out of range

Inaccurate Phoneme Timestamps

Phoneme timestamps extracted from wav2vec2 transcriber may be inaccurate due to excessive padding.

Description

Passing padded outputs to ctc-segmentation results in inaccurate timings.

Expected Behavior

Phoneme-level accurate timings.

Actual Behavior

Last phoneme may have inaccurate timings as the algorithm tries to align padding tokens as well, which is unnecessary.

Steps to Reproduce

N/A
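
One possible fix is to drop the logit frames that correspond to padding before handing them to ctc-segmentation. A sketch, assuming frames map roughly linearly onto input samples (the real frame count depends on the model's stride, so this is an approximation):

```python
from typing import List

def trim_padded_logits(
    logits: List[List[float]], num_samples: int, padded_samples: int
) -> List[List[float]]:
    """Keep only the logit frames covering the un-padded audio.

    Hypothetical helper: assumes a linear mapping from samples to
    frames, so the valid fraction of the padded input gives the
    valid fraction of frames.
    """
    num_frames = len(logits)
    valid = round(num_frames * num_samples / padded_samples)
    # frames past `valid` would align only against padding tokens
    return logits[:valid]
```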

[ML] Phoneme Overlap Runner

  • Integrate Phoneme Overlap to Runner
  • Lexicon defaults to Lexikos; also add an optional config that points to a JSON file with extra lexicon entries
  • Test Suite
  • Documentation

[ML] Audio Transcriber

  • Inference
  • Timestamp Extraction
  • Test Suites
  • Styling Test
  • Static-Type Test
  • Documentation

[UT] Punctuation Forced Aligner

To potentially allow wav2vec2 as a duration extractor for other speech tasks like speech synthesis, and since w2v2 doesn't classify punctuation, implement a punctuation forced aligner (PFA). PFA takes the predicted phoneme offsets and the ground-truth text with punctuation, and inserts the punctuation marks into the predicted phoneme offsets.

  • Source Code
  • Test Suites
  • Docs
  • Formatting
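
A simplified word-level sketch of the idea (the phoneme-level version would work the same way); offsets are assumed to align one-to-one with the non-punctuation tokens of the text:

```python
import string
from typing import Dict, List

def insert_punctuation(offsets: List[Dict], text: str) -> List[Dict]:
    """Insert punctuation tokens into predicted offsets.

    Hypothetical helper: assumes `offsets` aligns one-to-one with
    the non-punctuation tokens of `text`.
    """
    merged, i = [], 0
    for token in text.split():
        if token.strip(string.punctuation):
            merged.append(offsets[i])
            i += 1
        trailing = token[len(token.rstrip(string.punctuation)):]
        for mark in trailing:
            end = merged[-1]["end"] if merged else 0.0
            # punctuation carries no audio, so give it zero duration
            merged.append({"text": mark, "start": end, "end": end})
    return merged
```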

`IndexError` on Pronunciation Stack Building

Description

PER breaks with the following lexicon entry:

lexicon = {
    "4806": [
        ['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'ɔ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'],
        ['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's']
    ]
}
per = PhonemeErrorRate(lexicon)
per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])

See the error below.

This happens due to the way epsilons are added to the pronunciation stack. First, ['f', 'ɔ', 'eɪ', 't', 'z', 'ɪ', 'ɹ', 'oʊ', 's', 'ɪ', 'k', 's'] is set as the longest pronunciation, against which all other pronunciations are compared. However, notice that there are insertions coming from the shorter sequences, like ['f', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of ɹ) and similarly ['f', 'l', 'ɔ', 'ɹ', 'eɪ', 't', 'oʊ', 's', 'ɪ', 'k', 's'] (addition of l and ɹ). The current code only handles deletions, where the shorter sequences are padded to the maximum length. Because of this, the original longest pronunciation ends up shorter than the padded ones, since insertions aren't accounted for.

Expected Behavior

The longest pronunciation gets similarly padded with epsilons. All padded pronunciations must have the same length; only then can the pronunciation stack be generated and the PER calculated accordingly.
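
One simple way to restore the equal-length invariant is to pad every variant, including the longest, with epsilons. End-padding is shown here as a sketch; a proper fix would place epsilons at the aligned positions rather than at the end:

```python
from typing import List, Set

EPS = "ε"  # epsilon placeholder for padding

def pad_pronunciations(pronunciations: List[List[str]]) -> List[List[str]]:
    """Pad every pronunciation variant to a common length.

    Deliberately naive end-padding: it guarantees the equal-length
    invariant the stack builder needs, so indexing cannot go out of
    range, even though it ignores where insertions actually occur.
    """
    target = max(len(p) for p in pronunciations)
    return [list(p) + [EPS] * (target - len(p)) for p in pronunciations]

def build_stack(pronunciations: List[List[str]]) -> List[Set[str]]:
    padded = pad_pronunciations(pronunciations)
    length = len(padded[0])
    # every index is now valid for every variant, so no IndexError
    return [{p[i] for p in padded} for i in range(length)]
```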

Actual Behavior

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[89], line 1
----> 1 per.compute_measures(words=["4806"], prediction=['f', 'ɔ', 'ɹ', 'eɪ', 't', 'ə', 's', 'ɪ', 'k', 's'])

Cell In[79], line 37, in PhonemeErrorRate.compute_measures(self, words, prediction)
     34 def compute_measures(
     35     self, words: List[str], prediction: List[str]
     36 ) -> Dict[str, int]:
---> 37     stack = self._build_pronunciation_stack(words)
     38     reference = [
     39         phoneme for word in words for phoneme in max(self.lexicon[word], key=len)
     40     ]
     42     editops = Levenshtein.editops(reference, prediction)

Cell In[79], line 81, in PhonemeErrorRate._build_pronunciation_stack(self, words)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
---> 81     word_stack = [
     82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

Cell In[79], line 82, in <listcomp>(.0)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
     81     word_stack = [
---> 82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

Cell In[79], line 82, in <genexpr>(.0)
     79     print(pronunciations)
     80     length = len(pronunciations[0])
     81     word_stack = [
---> 82         set(pron[i] for pron in pronunciations) for i in range(length)
     83     ]
     84     stack += word_stack
     85 return stack

IndexError: list index out of range

[OS] Pipeline Documentation

  • For example, what does the pandas dataframe contain? How often does it get populated, etc?
  • How do you convert from a pandas dataframe to a Hugging Face dataset? What is the purpose of the transformation?
  • What does the audio classifier do to separate audio (i.e. children voices from adult voices, noise from true audible speech, etc)?
  • Following classification, what is the purpose of the audio transcription?
  • What are the resulting audio chunks? At the phoneme level?

[OS] Docstrings Cleanup

Make docstring style uniform as follows:

  • Newline after """
  • Newline for arguments and returns
  • Markdown-wrap all Python statements.
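
For illustration, a docstring following the three rules above might look like this (the function itself is a made-up example):

```python
def load_audio(path: str):
    """
    Load an audio file as a waveform array.

    Args:
        path (`str`):
            Path to the audio file.

    Returns:
        `np.ndarray`:
            Waveform array.
    """
    ...
```

Note the newline directly after the opening `"""`, the newlines before the argument and return descriptions, and the backtick-wrapped Python identifiers.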

[ML] PER Lexicon Validator

Add a lexicon validator to ensure that all words in the sequences passed have entries in the lexicon.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting
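
A minimal sketch of such a validator (the helper name is hypothetical):

```python
from typing import Dict, List

def validate_lexicon(lexicon: Dict[str, list], words: List[str]) -> None:
    """Raise early if any word lacks a lexicon entry.

    Failing fast here avoids a later KeyError deep inside the
    PER computation.
    """
    missing = [w for w in words if w not in lexicon]
    if missing:
        raise KeyError(f"Words missing from lexicon: {missing}")
```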

[ML] Phoneme Error Rate

Design a flexible phoneme error rate, where users can specify their own lexicon, consisting of words and their allowed pronunciation variants.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting

[ML] Add Filter Empty Transcript Option

In the word-overlap segmentation method, we might want to immediately skip audios with empty transcripts -- they can't possibly overlap with the transcribed text anyway. Add an option to filter out such entries in the pipeline.
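
A sketch of what the filter might look like, assuming entries carry a "transcript" field (the schema is an assumption):

```python
from typing import Dict, Iterable, List

def filter_empty_transcripts(
    records: Iterable[Dict], enabled: bool = True
) -> List[Dict]:
    """Drop entries whose transcript is empty or whitespace-only.

    With enabled=False, records pass through untouched, keeping the
    behavior opt-in as the issue proposes.
    """
    if not enabled:
        return list(records)
    return [r for r in records if r.get("transcript", "").strip()]
```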

[ML] Improve PER

  • Replicate WER metric from HuggingFace Evaluate
  • Add lexicon validator
  • Test Suites
  • Documentation
  • Formatting

[OS] Improve Docs

  • Fix return types not correctly rendered
  • Change all example blocks to pycon

[ML] Variable Phoneme Length PER

Extend PER calculation to allow for differing phoneme lengths during comparison. This is mainly to allow for phoneme insertions that may occur in different accent variations.

  • Source Code
  • Test Suites
  • Documentation
  • Formatting

[ML] Modify Classification to use Pipeline

To better unify the API design, re-implement batched audio classification via a HuggingFace pipeline, verifying equivalence in logits, preprocessing, and runtime speed.

  • Source Code
  • Test Suites
  • Formatting

[ML] Segmentation Strategies

  • Allow for unsegmented output
  • Allow for different segmentation techniques (currently only silence-based; also implement ground-truth overlapping)
