About

This repository provides the data and code for the paper "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription".

The collected transcriptions are stored in data/*-crowd.tsv, and the ground-truth transcriptions are stored in data/*-gt.txt. We also provide code for the annotation process and speech synthesis in the annotation and speech_sythesis folders, respectively.

Data

The CrowdSpeech and VoxDIY datasets are stored in the data folder. Each dataset is associated with two files: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first file contains three columns: INPUT:audio (the audio file given to crowd workers), OUTPUT:transcription (the worker's transcription), and ASSIGNMENT:worker_id (a unique worker identifier). The second file contains two tab-separated columns without a header: the audio file and its ground-truth transcription.
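The two file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not code from this repository: the function names load_crowd and load_ground_truth are hypothetical, the crowd file is assumed to be tab-separated with a header row (as its .tsv extension suggests), and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_crowd(fileobj):
    """Group (worker_id, transcription) pairs by audio file.

    Assumes a tab-separated stream whose header row contains the three
    columns named in this README.
    """
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    by_audio = defaultdict(list)
    for row in reader:
        by_audio[row["INPUT:audio"]].append(
            (row["ASSIGNMENT:worker_id"], row["OUTPUT:transcription"])
        )
    return dict(by_audio)


def load_ground_truth(fileobj):
    """Map each audio file to its ground-truth transcription.

    Assumes two tab-separated columns and no header row.
    """
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Made-up sample data standing in for the real files.
crowd_sample = io.StringIO(
    "INPUT:audio\tOUTPUT:transcription\tASSIGNMENT:worker_id\n"
    "a.wav\thello word\tw1\n"
    "a.wav\thello world\tw2\n"
    "b.wav\tgood morning\tw1\n"
)
gt_sample = io.StringIO("a.wav\thello world\nb.wav\tgood morning\n")

crowd = load_crowd(crowd_sample)
ground_truth = load_ground_truth(gt_sample)
print(len(crowd["a.wav"]), ground_truth["b.wav"])
```

To read the real files, pass a file object opened with encoding="utf-8" and newline="" instead of the in-memory samples.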

You can also download the CrowdSpeech dataset from HuggingFace.

Evaluation

First, you may need to install some dependencies:

pip3 install crowd-kit toloka-kit jiwer

Then, you can evaluate all of our baseline aggregation methods with a single command:

python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

To get the Oracle result, run:

python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
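For intuition only, an oracle score can be sketched as follows, under the assumption that "Oracle" means picking, for each audio file, the worker transcription closest to the ground truth, and that closeness is measured by word error rate (WER). This is not the code of oracle.py, which may normalize text or average differently; the function names and sample data are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a non-empty reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_wer(ground_truth, crowd):
    """Mean, over audio files, of the lowest WER among worker transcriptions.

    ground_truth maps audio -> reference text; crowd maps audio -> list of
    worker transcriptions. Audio files without transcriptions are skipped.
    """
    best = [
        min(wer(reference, hyp) for hyp in crowd[audio])
        for audio, reference in ground_truth.items()
        if crowd.get(audio)
    ]
    if not best:
        raise ValueError("no audio file has any transcription")
    return sum(best) / len(best)


# Made-up sample data.
gt = {"1.wav": "hello world", "2.wav": "good morning"}
workers = {
    "1.wav": ["hello word", "hello world"],
    "2.wav": ["good mourning", "god morning"],
}
print(oracle_wer(gt, workers))  # 0.25
```

This sketch averages per-utterance WER; a corpus-level WER (total errors divided by total reference words) would be an equally plausible choice and can give different numbers.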

You can also compute the inter-rater agreement by running:

python3 agreement.py data/<dataset>-crowd.tsv

VoxDIY

You can find an IPython notebook with the code for the VoxDIY data collection process. For quality control, we use a dedicated class, TaskProcessor, which fetches all submissions that have not yet been accepted or rejected, calculates workers' skills, and decides whether each submission should be accepted or rejected.

T5 Model

Our data is also available on the HuggingFace Hub, along with the T5 model trained on the train-clean, dev-clean, and dev-other parts of CrowdSpeech.

This snippet shows an example of model inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the model from the HuggingFace Hub
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Noisy transcriptions of the same audio, separated by " | "
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
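The input string in the snippet above joins several transcriptions of the same audio with " | ". A small helper for building such a string from a list of worker transcriptions might look like the sketch below. The helper name build_aggregation_input is hypothetical, and the separator is inferred only from the example string; whether the model expects a particular number or order of transcriptions is not stated here.

```python
def build_aggregation_input(transcriptions, separator=" | "):
    """Join worker transcriptions of one audio file into a single string.

    Empty or whitespace-only transcriptions are dropped, since they would
    produce empty segments between separators.
    """
    cleaned = [t.strip() for t in transcriptions if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty transcription is required")
    return separator.join(cleaned)


model_input = build_aggregation_input(
    ["samplee text", "sampl text", "sample textt"]
)
print(model_input)  # samplee text | sampl text | sample textt
```

The resulting string can be passed to tokenizer.encode in place of the hard-coded example.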

License

Code

Data

Acknowledgements

The LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.

The CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.
