Forced Alignment with Hugging Face CTC Models

Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!

This Python package efficiently performs forced alignment between text and audio using pretrained Hugging Face models. It supports Wav2Vec2, HuBERT, and MMS models for accurate alignment, which makes it a useful tool for building speech corpora.

Features

  • At least 5x less memory usage: The implementation uses much less memory than the TorchAudio forced alignment API.
  • Wide range of language support: Works with multiple languages including English, Arabic, Russian, German, and 1126 more languages.
  • Flexibility in alignment granularity: Choose between aligning on a sentence, word, or character level.
  • Customizable alignment parameters: Control the frequency of <star> token insertion, merge threshold for segment merging, and more.
  • Integration with Hugging Face's models: Leverage the power of pretrained Wav2Vec2, HuBERT, and MMS models for accurate alignment.
  • GPU acceleration: Utilize your GPU for faster inference.
  • Output in JSON format: Provides clear and structured alignment results for easy analysis and integration.

Installation

pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

Usage

Terminal Usage

ctc-forced-aligner --audio_path "path/to/audio.wav" --text_path "path/to/text.txt" --language "eng" --romanize

Arguments

| Argument | Description | Default |
|---|---|---|
| --audio_path | Path to the audio file | Required |
| --text_path | Path to the text file | Required |
| --language | Language as an ISO 639-3 code | Required |
| --romanize | Enable romanization for non-Latin scripts, or for multilingual models regardless of the language; required when using the default model | False |
| --split_size | Alignment granularity: "sentence", "word", or "char" | "word" |
| --star_frequency | Frequency of the <star> token: "segment" or "edges" | "edges" |
| --merge_threshold | Merge threshold for segment merging | 0.00 |
| --alignment_model | Name of the alignment model | MahmoudAshraf/mms-300m-1130-forced-aligner |
| --compute_dtype | Compute dtype for inference | "float32" |
| --batch_size | Batch size for inference | 4 |
| --window_size | Window size in seconds for audio chunking | 30 |
| --context_size | Overlap between chunks in seconds | 2 |
| --attn_implementation | Attention implementation | "eager" |
| --device | Device to use for inference: "cuda" or "cpu" | "cuda" if available, else "cpu" |
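
To make --window_size and --context_size more concrete, the sketch below computes chunk boundaries for a recording of a given length. It assumes each window is extended by context_size seconds on both sides and clipped to the recording, which is one common way to chunk audio for CTC inference. The package's actual chunking may differ, and chunk_bounds is a hypothetical helper, not part of the package.

```python
def chunk_bounds(duration, window_size=30.0, context_size=2.0):
    """Return (start, end) times in seconds for each chunk of a recording.

    Assumption: every window gets `context_size` seconds of extra audio on
    each side, clipped to [0, duration]. This is illustrative only.
    """
    bounds = []
    start = 0.0
    while start < duration:
        lo = max(0.0, start - context_size)
        hi = min(duration, start + window_size + context_size)
        bounds.append((lo, hi))
        start += window_size
    return bounds


# A 70-second recording with the default 30 s window and 2 s context
for lo, hi in chunk_bounds(70.0):
    print(f"{lo:.1f}s to {hi:.1f}s")
```

Under this assumption, a larger context gives the model more surrounding audio at chunk edges at the cost of recomputing the overlapping regions.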

Examples

# Align an English audio file with the text file
ctc-forced-aligner --audio_path "english_audio.wav" --text_path "english_text.txt" --language "eng" --romanize

# Align a Russian audio file with romanized text
ctc-forced-aligner --audio_path "russian_audio.wav" --text_path "russian_text.txt" --language "rus" --romanize

# Align on a sentence level
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "eng" --split_size "sentence" --romanize

# Align using a model with native vocabulary
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "ara" --alignment_model "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
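
To align many files, the commands above can be driven from a script. The following is a minimal sketch that uses only the flags documented in the arguments table. build_command and align_folder are hypothetical helpers, and the sketch assumes each .wav file has a transcript with the same name and a .txt extension.

```python
import subprocess
from pathlib import Path


def build_command(audio_path, text_path, language, romanize=True, split_size="word"):
    """Build the ctc-forced-aligner argument list (hypothetical helper)."""
    cmd = [
        "ctc-forced-aligner",
        "--audio_path", str(audio_path),
        "--text_path", str(text_path),
        "--language", language,
        "--split_size", split_size,
    ]
    if romanize:
        # Required when using the default model
        cmd.append("--romanize")
    return cmd


def align_folder(folder, language):
    """Run the CLI on every .wav in `folder` that has a matching .txt file."""
    for audio in sorted(Path(folder).glob("*.wav")):
        text = audio.with_suffix(".txt")
        if text.exists():
            # check=True raises CalledProcessError if the aligner fails
            subprocess.run(build_command(audio, text, language), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.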

Python Usage

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "your/audio/path"
text_path = "your/text/path"
language = "eng"  # ISO 639-3 language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

# Load the default alignment model; use half precision only on GPU
alignment_model, alignment_tokenizer, alignment_dictionary = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio with the same dtype and device as the model
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and collapse it into a single line
with open(text_path, "r", encoding="utf-8") as f:
    text = f.read().replace("\n", " ").strip()

# Run the model over the audio to get the emissions
emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

# Romanize and tokenize the text, inserting <star> tokens
tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

# Align the tokens to the emissions
segments, scores, blank_id = get_alignments(
    emissions,
    tokens_starred,
    alignment_dictionary,
)

# Turn the alignment into spans, then into timestamps
spans = get_spans(tokens_starred, segments, alignment_tokenizer.decode(blank_id))

word_timestamps = postprocess_results(text_starred, spans, stride, scores)
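
The Python pipeline above leaves the result in memory rather than writing a file. Here is a minimal sketch for saving it in the layout described under Output, assuming word_timestamps is a list of dicts with start, end, and text keys (an assumption; inspect the value returned by your installed version). save_alignment is a hypothetical helper, not part of the package.

```python
import json


def save_alignment(text, segments, out_path):
    """Write the text and its timed segments to a JSON file.

    Assumption: `segments` is a list of dicts with "start", "end", "text".
    """
    result = {"text": text, "segments": segments}
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin transcripts readable in the file
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result


# Usage with the variables from the pipeline above:
# save_alignment(text, word_timestamps, "alignment.json")
```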

Output

The alignment results will be saved to a file containing the following information in JSON format:

  • text: The aligned text.
  • segments: A list of segments, each containing the start and end time of the corresponding text segment.
{
  "text": "This is a sample text to be aligned with the audio.",
  "segments": [
    {
      "start": 0.000,
      "end": 1.234,
      "text": "This"
    },
    {
      "start": 1.234,
      "end": 2.567,
      "text": "is"
    },
    {
      "start": 2.567,
      "end": 3.890,
      "text": "a"
    },
    {
      "start": 3.890,
      "end": 5.213,
      "text": "sample"
    },
    {
      "start": 5.213,
      "end": 6.536,
      "text": "text"
    },
    {
      "start": 6.536,
      "end": 7.859,
      "text": "to"
    },
    {
      "start": 7.859,
      "end": 9.182,
      "text": "be"
    },
    {
      "start": 9.182,
      "end": 10.405,
      "text": "aligned"
    },
    {
      "start": 10.405,
      "end": 11.728,
      "text": "with"
    },
    {
      "start": 11.728,
      "end": 13.051,
      "text": "the"
    },
    {
      "start": 13.051,
      "end": 14.374,
      "text": "audio."
    }
  ]
}
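
Because the output is plain JSON, it is easy to convert to other formats. As an illustration, the sketch below turns the segments into SRT subtitle text. It assumes the structure shown above (a segments list with start, end, and text); format_timestamp and segments_to_srt are hypothetical helpers, not part of the package.

```python
import json


def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(alignment):
    """Convert an alignment dict (as in the JSON above) to SRT text."""
    blocks = []
    for index, seg in enumerate(alignment["segments"], start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)


alignment = json.loads(
    '{"text": "This is", "segments": ['
    '{"start": 0.000, "end": 1.234, "text": "This"},'
    '{"start": 1.234, "end": 2.567, "text": "is"}]}'
)
print(segments_to_srt(alignment))
```

For subtitles, aligning with --split_size "sentence" usually gives more readable entries than one entry per word.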

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

License

This project is licensed under the BSD License. Note that the default model is licensed under CC-BY-NC 4.0, so use a different model for commercial purposes.

Acknowledgements

This project is based on the work of the FAIR MMS team.
