magichub-awesome-datasets-and-competitions's Introduction

Magichub - Awesome Audio and Text Corpus Collections

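Magichub is a community site built by Magic Data, a data service provider in the artificial intelligence field, to offer its own datasets to the whole industry free of charge and as open source.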

Magichub has open-sourced 68 datasets so far, and the collection is still being updated.

Competitions

Overview | Datasets for Training | Baseline
PS: You need to be logged in to an account to download the datasets for free.



Datasets

Datasets License

Magic Data Open-Source License

This open-source dataset consists of 6 hours of transcribed Mandarin Chinese scripted speech for keyword spotting at fast, normal, and slow speaking rates, with 11,030 utterances contributed by 37 speakers.

This open-source dataset consists of 5.04 hours of transcribed English conversational speech beyond telephony, covering 13 conversations.

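The Chinese-language description lists this dataset as 5.04 hours of English telephone-channel conversational audio with transcripts, covering 10 conversations.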

This open-source dataset consists of 1.44 hours of transcribed Chinese English scripted speech from children, with 2,266 utterances contributed by ten speakers aged 7 or younger.


This open-source dataset consists of 4 hours of transcribed Pakistani English scripted speech focusing on daily use sentences, with 2,191 utterances contributed by seven speakers.



French Audio Datasets

This open-source dataset consists of 1.1 hours of transcribed French conversational speech on certain topics, covering six conversations between two speakers.



Korean Audio Datasets

This open-source dataset consists of 5.22 hours of transcribed Korean conversational speech on given topics, covering 22 conversations between seven pairs of speakers.



German Audio Datasets

This open-source dataset consists of 6.55 hours of transcribed German conversational speech on given topics, covering 10 conversations between two pairs of speakers.

This open-source dataset consists of 0.71 hours of transcribed German scripted speech focusing on commands and queries, with 597 utterances contributed by ten speakers.



Japanese Audio Datasets

This open-source dataset consists of 18 hours of transcribed Japanese scripted speech focusing on daily use sentences, with 17,372 utterances contributed by 37 speakers.



Italian Audio Datasets

This open-source dataset consists of 0.9 hours of transcribed Italian scripted speech focusing on commands and queries, with 982 utterances contributed by ten speakers.

This open-source dataset consists of 10.43 hours of transcribed Italian conversational speech on given topics, covering 28 conversations between three pairs of speakers.



Spanish Audio Datasets

This open-source dataset consists of 5.56 hours of transcribed Peninsular Spanish conversational speech on given topics, covering 17 conversations between four pairs of speakers.

This open-source dataset consists of 4.08 hours of transcribed American Spanish scripted speech focusing on daily use sentences, with 5,159 utterances contributed by ten speakers.



Russian Audio Datasets

This open-source dataset consists of 6.57 hours of transcribed Russian scripted speech focusing on daily use sentences, with 3,842 utterances contributed by ten speakers.



Indonesian Audio Datasets

This open-source dataset consists of 4.54 hours of transcribed Indonesian conversational speech on given topics, covering seven conversations between two pairs of speakers.

This open-source dataset consists of 3.5 hours of transcribed Indonesian scripted speech focusing on daily use sentences, with 3,296 utterances contributed by ten speakers.



English Text Datasets

This dataset contains 100 news items.

This open-source dataset consists of a Chinese-English parallel corpus of 100 finance-related daily use sentences, translated from Chinese to English.

This open-source dataset consists of 50 English-language text dialogues set in healthcare-related customer service scenarios.



Korean Text Datasets

This open-source dataset consists of 100 sentences of commands and queries in Korean.



Japanese Text Datasets

This open-source dataset consists of 100 sentences of commands and queries in Japanese.



Magic Data Proprietary Datasets

Contact us if you need more training datasets for ML. [email protected]
