
voxdiy's Introduction

About

This repository provides the data and code for the paper "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription".

The collected transcriptions are stored in data/*-crowd.tsv, and the ground-truth transcriptions are stored in data/*-gt.txt. We also provide code for the annotation process and for speech synthesis in the annotation and speech_sythesis folders, respectively.

Data

The CrowdSpeech and VoxDIY datasets are stored in the data folder. Each dataset is associated with two files: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first file contains three columns: INPUT:audio (an audio file given to crowd workers), OUTPUT:transcription (the worker's transcription), and ASSIGNMENT:worker_id (a unique worker identifier). The second file contains two tab-separated columns without a header: an audio file and its ground-truth transcription.
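The file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not code from this repository: the function names read_crowd and read_ground_truth are hypothetical, the sample rows are made up, and it assumes that the crowd file is tab-separated with a header row and that fields are not quoted.

```python
import csv
import io


def read_crowd(stream):
    """Return (audio, transcription, worker_id) tuples from a crowd TSV stream."""
    # Assumption: tab-separated values with a header row and no quoting.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["INPUT:audio"], row["OUTPUT:transcription"], row["ASSIGNMENT:worker_id"])
        for row in reader
    ]


def read_ground_truth(stream):
    """Return {audio: transcription} from a headerless two-column TSV stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) == 2}


# Made-up rows standing in for the real files.
crowd_stream = io.StringIO(
    "INPUT:audio\tOUTPUT:transcription\tASSIGNMENT:worker_id\n"
    "a.mp3\thello world\tw1\n"
    "a.mp3\thello word\tw2\n"
)
gt_stream = io.StringIO("a.mp3\thello world\n")

crowd_rows = read_crowd(crowd_stream)
ground_truth = read_ground_truth(gt_stream)
print(len(crowd_rows), ground_truth["a.mp3"])
```

For the real files, pass a handle opened with open(path, newline="", encoding="utf-8") instead of the in-memory streams.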

You can also download the CrowdSpeech dataset from HuggingFace.

Evaluation

First, you may need to install some dependencies:

pip3 install crowd-kit toloka-kit jiwer

Then, you can evaluate all of our baseline aggregation methods with a single command:

python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
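To give a rough idea of what an aggregation method does with the crowd file, here is a minimal majority-vote sketch. It is an illustration only: this README does not say which methods baselines.py implements, so majority voting is an assumption made for the example, and the function name majority_vote and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(crowd_rows):
    """Pick the most frequent transcription for each audio file."""
    by_audio = defaultdict(list)
    for audio, transcription, _worker_id in crowd_rows:
        by_audio[audio].append(transcription)
    # Ties are resolved in favour of the transcription seen first.
    return {
        audio: Counter(texts).most_common(1)[0][0]
        for audio, texts in by_audio.items()
    }


# Made-up (audio, transcription, worker_id) rows.
rows = [
    ("a.mp3", "hello world", "w1"),
    ("a.mp3", "hello word", "w2"),
    ("a.mp3", "hello world", "w3"),
    ("b.mp3", "good morning", "w1"),
]
aggregated = majority_vote(rows)
print(aggregated)
```

Exact-match voting breaks down when every worker makes a different small mistake, which is common for long recordings.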

To get the Oracle result, run:

python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
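This README does not define the Oracle result. The sketch below assumes that, for each recording, the Oracle picks the worker transcription closest to the ground truth by word error rate (WER). The jiwer package from the dependency list computes WER. The example uses a small pure-Python version so that it runs without extra packages. The function names are hypothetical and this is not the code in oracle.py.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle(ground_truth, candidates):
    """Return the candidate with the lowest WER against the ground truth."""
    return min(candidates, key=lambda text: wer(ground_truth, text))


# Made-up worker transcriptions for a single recording.
candidates = ["hello word", "hello world", "yellow world"]
best = oracle("hello world", candidates)
print(best, wer("hello world", best))
```

Under this assumption the Oracle score is a bound for methods that select one of the submitted transcriptions. Methods that compose a new transcription from several workers may do better.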

You can also get the Inter-Rater Agreement by running

python3 agreement.py data/<dataset>-crowd.tsv

VoxDIY

You can find an IPython notebook with the code for the VoxDIY data collection process. For quality control, we use a special class, TaskProcessor, that gets all submissions that have not yet been accepted or rejected, calculates workers' skills, and checks whether each submission should be accepted or rejected.
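The control flow described above can be sketched in a few lines. Everything here is hypothetical: the Submission class, the score field, the skill formula (a worker's mean score), and the min_skill threshold are made up for illustration. It is not the TaskProcessor from the notebook and it does not call the Toloka API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Submission:
    worker_id: str
    score: float  # hypothetical quality estimate in [0, 1]
    status: str = "PENDING"  # "PENDING", "ACCEPTED" or "REJECTED"


def process(submissions, min_skill=0.5):
    """Accept or reject every pending submission based on its worker's skill."""
    pending = [s for s in submissions if s.status == "PENDING"]

    # Hypothetical skill: the mean score over all of a worker's submissions.
    scores = defaultdict(list)
    for s in submissions:
        scores[s.worker_id].append(s.score)
    skills = {worker: sum(vals) / len(vals) for worker, vals in scores.items()}

    for s in pending:
        s.status = "ACCEPTED" if skills[s.worker_id] >= min_skill else "REJECTED"
    return skills


submissions = [
    Submission("w1", 0.9),
    Submission("w1", 0.7),
    Submission("w2", 0.2),
    Submission("w2", 0.4, status="REJECTED"),  # already decided, left untouched
]
skills = process(submissions)
print([s.status for s in submissions])
```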

T5 Model

Our data is also available on the HuggingFace Hub, together with a T5 model trained on the train-clean, dev-clean and dev-other parts of CrowdSpeech.

The following snippet shows an example of model inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the model from the HuggingFace Hub.
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Worker transcriptions of the same recording, separated by " | ".
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
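The example input above joins several transcriptions with " | ". A small helper can build such a string from a list of worker transcriptions. This is a sketch: the name build_model_input is hypothetical, and skipping empty transcriptions is an assumption, not documented behaviour of the model.

```python
def build_model_input(transcriptions):
    """Join worker transcriptions with ' | ', as in the example input above."""
    # Assumption: empty or whitespace-only transcriptions are dropped.
    cleaned = [t.strip() for t in transcriptions if t.strip()]
    return " | ".join(cleaned)


text = build_model_input(["samplee text", "sampl text", "sample textt"])
print(text)
```

The resulting string can be passed to tokenizer.encode in place of the hard-coded example.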

License

Code

© YANDEX LLC, 2021. Licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

Data

© YANDEX LLC, 2021. Licensed under the Creative Commons Attribution 4.0 license. See the data/LICENSE file for more details.

Acknowledgements

The LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.

The CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.

Contributors

dustalov, pilot7747
