LibriCrowd

A large-scale crowdsourced English speech corpus with clean and noisy human transcriptions.

Dataset Summary

LibriCrowd is a corpus of approximately 100 hours of scripted English speech with both clean and noisy human transcriptions. The raw audio files and ground truth transcriptions are selected from a subset of the well-known LibriSpeech corpus. More information about the dataset statistics and the baseline models trained on it can be found in our paper "Human Transcription Quality Improvement".

Supported Tasks

  • Human Transcription Error Detection and Correction:

The dataset contains crowdsourced human transcriptions at various noise levels. Transcription quality is measured by the Transcription Word Error Rate (TWER), computed between the noisy human transcription and the ground truth reference. The task is to improve crowdsourced human transcription quality by developing Confidence Estimation Models (CEMs) to detect human errors and Error Correction Models (ECMs) to refine human transcriptions. CEM performance is evaluated by error prediction accuracy (precision, recall, F1) at the word or utterance level. ECM performance is evaluated by the TWER reduction between the raw and refined human transcriptions.
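The metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper: it assumes TWER is a standard word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, with whitespace tokenization and no text normalization. The function names `edit_distance`, `word_error_rate`, and `precision_recall_f1` are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, transcriptions):
    """Corpus-level error rate in percent (assumed definition of TWER/WER)."""
    total_errors = 0
    total_ref_words = 0
    for ref, hyp in zip(references, transcriptions):
        ref_words = ref.split()
        total_errors += edit_distance(ref_words, hyp.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_ref_words


def precision_recall_f1(true_errors, predicted_errors):
    """Error-detection metrics; 1 marks an erroneous word or utterance."""
    pairs = list(zip(true_errors, predicted_errors))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: TWER reduction between a raw and a refined transcription.
reference = ["a b c d"]
raw_twer = word_error_rate(reference, ["a x c"])      # 2 errors / 4 words
refined_twer = word_error_rate(reference, ["a b c"])  # 1 error / 4 words
twer_reduction = raw_twer - refined_twer
```

In this sketch the TWER reduction is an absolute difference in percentage points; a relative reduction could be reported instead if that better matches the intended comparison.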

  • Robust Automatic Speech Recognition (ASR) System Evaluation:

The dataset can be used to finetune pretrained speech representations using each audio file and its human transcription text. Each transcription has a clean version as well as a noisy version containing a certain amount of human error. A robust ASR system is expected to show limited performance degradation when it is trained on noisy data compared with the same model trained on clean data. The ASR system is evaluated by the Word Error Rate (WER %) of the ASR hypothesis compared to the ground truth reference.

A controlled noise level (TWER %) can be obtained by randomly mixing the noisy human transcriptions with the ground truth reference. The robustness of ASR models is evaluated by using transcriptions at different noise levels as the training data and then measuring the WER.
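One way to read the mixing step above is sketched below. It assumes mixing happens at the utterance level: a chosen fraction of utterances keep their noisy human transcription and the rest use the ground truth reference. The README does not state the granularity, so this is an assumption, and `mix_transcriptions` and its parameters are hypothetical names.

```python
import random


def mix_transcriptions(clean, noisy, noisy_fraction, seed=0):
    """Return one transcription per utterance, drawn from `noisy` for a
    random `noisy_fraction` of utterances and from `clean` otherwise."""
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy must be aligned per utterance")
    if not 0.0 <= noisy_fraction <= 1.0:
        raise ValueError("noisy_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_noisy = round(noisy_fraction * len(clean))
    noisy_ids = set(rng.sample(range(len(clean)), n_noisy))
    return [noisy[i] if i in noisy_ids else clean[i]
            for i in range(len(clean))]


# Toy example: half of the utterances keep their noisy transcription.
clean = ["the cat sat", "on the mat", "it was warm", "and quiet"]
noisy = ["the cat sit", "on a mat", "it is warm", "an quiet"]
mixed = mix_transcriptions(clean, noisy, noisy_fraction=0.5)
```

Under this assumption, the TWER of the mixed set scales roughly with `noisy_fraction` times the TWER of the fully noisy set, so sweeping `noisy_fraction` from 0 to 1 yields training sets at increasing noise levels.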

Dataset Structure and Statistics

The dataset is split into seven subsets. For training, there are three subsets: 'train-other-10h', 'train-other-60h', and 'train-mixed-10h'. For evaluation, there are four subsets: 'dev-clean', 'dev-other', 'test-clean', and 'test-other', which match the LibriSpeech dev/test subsets. The entire dataset contains approximately 100 hours of English speech. Detailed statistics are listed below:

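| Subset          | # Utterances | Speech Hours | # Workers | # Responses |
|-----------------|--------------|--------------|-----------|-------------|
| train-other-10h | 3165         | 10.0         | 1258      | 18673       |
| train-other-60h | 17816        | 60.0         | 1136      | 20187       |
| train-mixed-10h | 2763         | 9.8          | 616       | 14231       |
| dev-clean       | 2703         | 5.4          | 523       | 13994       |
| test-clean      | 2620         | 5.4          | 527       | 13587       |
| dev-other       | 2864         | 5.3          | 620       | 15235       |
| test-other      | 2939         | 5.1          | 989       | 15950       |
| all             | 34870        | 101.0        | 4433      | 111857      |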

Download

Licensing Information

Acknowledgements

  • The LibriSpeech dataset is used under the CC BY 4.0 license.
  • The Libri-Light dataset is used under the MIT license.
  • The LibriVox project provides free public domain audiobooks read by volunteers from around the world. All LibriVox recordings are in the public domain and free to use without any restriction.
