Giter VIP home page Giter VIP logo

open-speech-corpora's Introduction

Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list has a preference for free (i.e. no $ cost) and truly open corpora (i.e. some kind of Creative Commons license). Not all these corpora may meet those criteria, but all the following corpora are accessible and usable for research and/or commercial use.

Feel free to propse additions to the list!

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CommonVoice English English 780 hours (validated); 1,087 hours (total) 39,577 speakers (reported: 11% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice German German 325 hours (validated); 340 hours (total) 5,007 speakers (reported: 10% female / 68% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice French French 173 hours (validated); 184 hours (total) 3,005 speakers (reported: 9% female / 70% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Welsh Welsh 42 hours (validated); 48 hours (total) 748 speakers (reported: 18% female / 33% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Breton Breton 3 hours (validated); 10 hours (total) 118 speakers (reported: 3% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Chuvash Chuvash 1 hour (validated); 2 hours (total) 38 speakers (reported: 0% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Turkish Turkish 9 hours (validated); 10 hours (total) 344 speakers (reported: 11% female / 70% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Tatar Tatar 22 hours (validated); 26 hours (total) 132 speakers (reported: 2% female / 83% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Kyrgyz Kyrgyz 8 hours (validated); 20 hours (total) 97 speakers (reported: 47% female / 44% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Irish Irish 2 hour (validated); 3 hour (total) 63 speakers (reported: 15% female / 62% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Kabyle Kabyle 181 hours (validated); 192 hours (total) 584 speakers (reported: 15% female / 57% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Catalan Catalan 107 hours (validated); 120 hours (total) 1,834 speakers (reported: 37% female / 43% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Chinese (Taiwan) Mandarin (Taiwan) 33 hours (validated); 43 hours (total) 949 speakers (reported: 29% female / 46% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Slovenian Slovenian 2 hour (validated); 5 hours (total) 42 speakers (reported: 20% female / 74% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Italian Italian 36 hours (validated); 40 hours (total) 602 speakers (reported: 18% female / 62% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Dutch Dutch 18 hours (validated); 23 hours (total) 502 speakers (reported: 2% female / 72% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Hakha Chin Hakha Chin 2 hours (validated); 4 hours (total) 280 speakers (reported: 20% female / 24% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Esperanto Esperanto 13 hours (validated); 16 hours (total) 129 speakers (reported: 11% female / 51% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Estonian Estonian 11 hours (validated); 12 hours (total) 225 speakers (reported: 37% female / 57% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Persian Persian 67 hours (validated); 70 hours (total) 1,240 speakers (reported: 13% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Basque Basque 46 hours (validated); 83 hours (total) 508 speakers (reported: 22% female / 53% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Spanish Spanish 27 hours (validated); 31 hours (total) 611 speakers (reported: 9% female / 74% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Chinese (China) Mandarin (China) 11 hours (validated); 12 hours (total) 288 speakers (reported: 0% female / 76% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Mongolian Mongolian 8 hours (validated); 9 hours (total) 230 speakers (reported: 22% female / 35% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Sakha Sakha 3 hours (validated); 6 hours (total) 35 speakers (reported: 10% female / 54% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Dhivehi Dhivehi 5 hours (validated); 8 hours (total) 92 speakers (reported: 65% female / 27% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Kinyarwanda Kinyarwanda <1 hours (validated); 1 hours (total) 32 speakers (reported: 0% female / 13% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Swedish Swedish 3 hours (validated); 3 hours (total) 44 speakers (reported: 3% female / 75% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Russian Russian 27 hours (validated); 31 hours (total) 64 speakers (reported: 42% female / 55% male) https://voice.mozilla.org/en/datasets CC-0
Yesno Hebrew 6 mins one male http://www.openslr.org/1/ CC-0
LJ Speech Corpus English ~24 hours one female https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 CC-0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
ARU Speech Corpus English (UK) 720 utterances / speaker 12 (6 femals; 6 male) http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip CC-BY 3.0
Althingi Parliamentary Speech Corpus Icelandic 542 hours and 25 minutes 196 speakers http://www.malfong.is/index.php?dlid=73&lang=en CC-BY 4.0
Alþingisumræður Parliamentary Speech Corpus Icelandic ~21 hours http://www.malfong.is/index.php?dlid=8&lang=en CC-BY 3.0
Hjal Corpus Icelandic ~41,000 recordings 883 speakers http://www.malfong.is/index.php?dlid=5&lang=en CC-BY 3.0
The Malromur Corpus Icelandic 152 hours 563 speakers http://www.malfong.is/index.php?dlid=65&lang=en CC-BY 4.0
Telecooperation German Corpus for Kinect German ~35 hours ~180 speakers http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz CC-BY 2.0
African Speech Technology English-English Speech Corpus English ~21 hours https://repo.sadilar.org/handle/20.500.12185/283 CC-BY 2.5 South Africa
African Speech Technology isiXhosa Speech Corpus isiXhosa ~26 hours https://repo.sadilar.org/handle/20.500.12185/305 CC-BY 2.5 South Africa
NCHLT Afrikaans Afrikaans 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/280 CC-BY 3.0
NCHLT English English 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/274 CC-BY 3.0
NCHLT isiNdebele isiNdebele 56 hours 148 speakers (78 female / 70 male) https://repo.sadilar.org/handle/20.500.12185/272 CC-BY 3.0
NCHLT isiXhosa isiXhosa 56 hours 209 speakers (106 female / 103 male) https://repo.sadilar.org/handle/20.500.12185/279 CC-BY 3.0
NCHLT isiZulu isiZulu 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/275 CC-BY 3.0
NCHLT Sepedi Sepedi 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/270 CC-BY 3.0
NCHLT Sesotho Sesotho 56 hours 210 speakers (113 female / 97 male) https://repo.sadilar.org/handle/20.500.12185/278 CC-BY 3.0
NCHLT Setswana Setswana 56 hours 210 speakers (109 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/281 CC-BY 3.0
NCHLT Siswati Siswati 56 hours 197 speakers (96 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/271 CC-BY 3.0
NCHLT Tshivenda Tshivenda 56 hours 208 speakers (83 female / 125 male) https://repo.sadilar.org/handle/20.500.12185/276 CC-BY 3.0
NCHLT Xitsonga Xitsonga 56 hours 198 speakers (95 female/103 male) https://repo.sadilar.org/handle/20.500.12185/277 CC-BY 3.0
Lwazi II Cross-lingual Proper Name Corpus Afrikaans; English; isiZulu; Sesotho 2 hours 5 mins 20 speakers https://repo.sadilar.org/handle/20.500.12185/445 CC-BY 3.0
Lwazi II Proper Name Call Routing Telephone Corpus English 2 hours 7 mins https://repo.sadilar.org/handle/20.500.12185/448 CC-BY 3.0
Lwazi II Afrikaans Trajectory Tracking Corpus Afrikaans 4 hours one male https://repo.sadilar.org/handle/20.500.12185/442 CC-BY 3.0
LibriSpeech English ~1000 hours 2484 speakers (1201 female / 1283 male) http://www.openslr.org/12/ CC-BY 4.0
Zeroth-Korean Korean 52.8 hours 115 speakers http://www.openslr.org/40/ CC-BY 4.0
Speech Commands English 17.8 hours >1,000 speakers https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html CC-BY 4.0
ParlamentParla Catalan 320 hours https://www.openslr.org/59/ CC-BY 4.0
SIWIS French ~10 hours one female http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip CC-BY 4.0
VCTK English 44 hours 109 speakers http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip CC-BY 4.0
LibriTTS English 586 hours 2,456 speakers (1,185 female / 1,271 male) http://www.openslr.org/60/ CC-BY 4.0
Augmented LibriSpeech Audio (English); Text (English, French) 236 hours https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset CC-BY 4.0
Helsinki Prosody Corpus English 262.5 hours 1,230 speakers https://github.com/Helsinki-NLP/prosody CC-BY 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Iban Iban 8 hours http://www.openslr.org/24/ https://github.com/sarahjuan/iban CC-BY-SA 2.0
Vystadial English; Czech 41 hours; 15 hours http://www.openslr.org/6/ CC-BY-SA 3.0 US
Free Spoken Digit Dataset English 2,000 isolated digits 4 speakers https://github.com/Jakobovski/free-spoken-digit-dataset CC-BY-SA 4.0
Google Javanese Javanese 296 hours 1019 speakers http://www.openslr.org/35/ CC-BY-SA 4.0
Google Nepali Nepali 165 hours 527 speakers http://www.openslr.org/54/ CC-BY-SA 4.0
Google Bengali Bengali 229 hours 508 speakers http://www.openslr.org/53/ CC-BY-SA 4.0
Google Sinhala Sinhala 224 hours 478 speakers http://www.openslr.org/52/ CC-BY-SA 4.0
Google Sundanese Sundanese 333 hours 542 speakers http://www.openslr.org/36/ CC-BY-SA 4.0
Spokend Wikipedia Corpus (SWC-2017) English; German; Dutch 182 hours; 249 hours; 79 hours 395 speakers; 339 speakers; 145 speakers https://nats.gitlab.io/swc/ CC-BY-SA 4.0
Chuvash TTS Chuvash 4 hours 1 speaker https://github.com/ftyers/Turkic_TTS CC-BY-SA 4.0
Forschergeist German 2 hours 2 speakers (1 female; 1 male) female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz CC-BY-SA 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
IBM Recorded Debates v1 English 5 hours 10 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND
IBM Recorded Debates v2 English ~14 hours 14 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
TV3Parla Catalan 240 hours http://laklak.eu/share/tv3_0.3.tar.gz CC-BY-NC 4.0
Russian Open STT Corpus Russian ~7000 hours https://github.com/snakers4/open_stt/#links CC-BY-NC 4.0 with some expections
Russian Open TTS Corpus Russian 145 hours 3 males https://github.com/snakers4/open_tts/#links CC-BY-NC 4.0 with some expections
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CHiME-Home English 6.8 hours https://archive.org/details/chime-home CC-BY-NC-SA 3.0
Cameroon Pidgin English Corpus Cameroon Pidgin English ~17 hours http://ota.ox.ac.uk/text/2563.zip CC-BY-NC-SA 3.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Tatoeba-Eng English ~250 hours (rough estimate) 6 speakers https://voice.mozilla.org/en/datasets CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text)
TED-LIUM English 118 hours 685 speakers (36h female / 81h male) http://www.openslr.org/7/ CC-BY-NC-ND 3.0
TED-LIUM-2 English 207 hours 1242 speakers (66h female / 141h male) http://www.openslr.org/19/ CC-BY-NC-ND 3.0
TED-LIUM-3 English 452 hours 2028 speakers (134h female / 316h male) http://www.openslr.org/51/ CC-BY-NC-ND 3.0
Pansori TEDxKR Korean 3 hours 41 speakers http://www.openslr.org/58/ CC-BY-NC-ND 4.0
Primewords Mandarin Mandarin 100 hours 296 speakers http://www.openslr.org/47/ CC-BY-NC-ND 4.0
MuST-C v1.0 Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair https://ict.fbk.eu/must-c-release-v1-0/ CC-BY-NC-ND 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
VoxForge English ~120 hours ~2966 speakers http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets GNU-GPL 3.0
VoxForge Russian http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0
VoxForge German http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
AISHELL-1 Mandarin 170 hours 400 speakers http://www.openslr.org/33/ Apache 2.0
Tunisian_MSA Modern Standard Arabic (Tunisia) 11.2 hours 118 speakers http://www.openslr.org/46/ Apache 2.0
African Accented French French 22 hours 232 speakers http://www.openslr.org/57/ Apache 2.0
THCHS-30 Mandarin Chinese 33.57 hours (13,389 utterances) 40 speakers (31 female; 9 male) http://www.openslr.org/18/ Apache 2.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
ALFFA Amharic;Hausa (paid); Swahili; Wolof http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC MIT
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
M-AILABS German Corpus German 237 hours and 22 minutes http://www.caito.de/data/Training/stt_tts/de_DE.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Queen's English Corpus Queen's English 45 hours and 35 minutes http://www.caito.de/data/Training/stt_tts/en_UK.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS US English Corpus American English 102 hours and 7 minutes http://www.caito.de/data/Training/stt_tts/en_US.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Spanish Corpus Spanish Spanish 108 hours and 34 minutes http://www.caito.de/data/Training/stt_tts/es_ES.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Italian Corpus Italian 127 hours and 40 minutes http://www.caito.de/data/Training/stt_tts/it_IT.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Ukrainian Corpus Ukrainian 87 hours and 8 minutes http://www.caito.de/data/Training/stt_tts/uk_UK.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Russian Corpus Russian 46 hours and 47 minutes http://www.caito.de/data/Training/stt_tts/ru_RU.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS French-v0.9 Corpus French 190 hours and 30 minutes http://www.caito.de/data/Training/stt_tts/fr_FR.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Polish Corpus Polish 53 hours and 50 minutes http://www.caito.de/data/Training/stt_tts/pl_PL.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Fluent Speech Commands Corpus English 19 hours (30,043 utterances) 97 speakers http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz Fluent Speech Commands Public License
CMU Wilderness 700 Langs Alignments distributed without audio or text total:~14,000 hours; per lang: ~20 hours https://github.com/festvox/datasets-CMU_Wilderness Questionable Legality: https://live.bible.is/terms
CHiME-5 English 50 hours 48 speakers http://spandh.dcs.shef.ac.uk/chime_challenge/data.html CHiME-5 License
FalaBrasil-LAPS-Constituicao Brazilian-Portuguese 9 hours 1 speaker https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPSMail Brazilian-Portuguese 1 hour 25 speakers https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPS Benchmark Brazilian-Portuguese 1 hour 1 speaker https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
Fearless Steps Corpus English 19,000 hours (20 hours transcribed) ~450 speakers http://fearlesssteps.exploreapollo.org/
Microsoft Speech Corpus (Indian languages) Telugu; Tamil; Gujarati https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e Non-Commercial Microsoft Speech Corpus (Indian Languages) License
Microsoft Speech Language Translation Corpus English; Chinese; Japanese https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 Non-Commercial Microsoft Research Data License Agreement
Hey Snips Corpus English 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances 2215 speakers (positive & negative) and 4028 speakers (negative only) https://research.snips.ai/datasets/keyword-spotting Snips Data License
Snips SLU Corpus English; French 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances English: 69 speakers; French: 30 speakers https://research.snips.ai/datasets/spoken-language-understanding Snips Data License

open-speech-corpora's People

Contributors

andrenatal avatar choufractal avatar ftyers avatar jrmeyer avatar shacharmirkin avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.