KALLAAMA

This repository contains data gathered for the KALLAAMA project.
This project was funded for 1 year in 2023 by Lacuna Fund.
It was led by Jokalante (Dakar, Senegal). Orange Innovation (Lannion, France) and Ecole Polytechnique de Thiès (Thiès, Senegal) were also involved as stakeholders.

Project description

KALLAAMA was a collaborative project that aimed to create the resources required for developing speech technologies. The project focused exclusively on the three most widely spoken languages in Senegal: Wolof, Pulaar and Sereer.

The goal of this work was to create resources that will one day let people access information, and all the digital resources available today, simply by querying their device by voice in the language they use every day. This can be achieved by developing robust speech-to-text (ASR) and text-to-speech (TTS) models, which are fundamental building blocks of voicebots.
Currently, many Senegalese people are excluded from digital information because speech technologies for these languages remain underdeveloped.

To that end, this repository provides the datasets created for ASR modeling.
The resources include speech recordings with orthographic transcriptions, an open-source text collection gathered from the Web, and wordlists with phonetic transcriptions. We also provide a grapheme-to-phoneme model trained to phonetize out-of-vocabulary words for Wolof.

Data description

The main topic of the recordings is agriculture.
The audio files originally belonged to Jokalante SARL.

  • The Wolof (ISO 639-3 code: wol) speech dataset contains 55 hours of transcribed speech, including almost 13 hours whose transcriptions were checked by an expert.
  • The Pulaar (ISO 639-3 code: fuc) speech dataset contains nearly 32 hours of transcribed speech, including almost 11 hours whose transcriptions were checked by an expert.
  • The Sereer (ISO 639-3 code: srr) speech dataset contains 38 hours of transcribed speech, including almost 11 hours whose transcriptions were checked by an expert.

In total, we provide 125 hours of transcribed speech, including 35 hours of checked transcriptions.
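
As a quick sanity check, the totals above can be recomputed from the per-language figures. The sketch below uses the rounded hour counts quoted in the list (the source says "nearly" and "almost" for some of them), so the real durations may differ slightly; the `HOURS` table and the `totals` helper are illustrative names, not part of the repository.

```python
# Rounded figures copied from the per-language list above (in hours).
# These are approximations: the list says "nearly 32" and "almost 13/11".
HOURS = {
    "wol": {"transcribed": 55, "checked": 13},
    "fuc": {"transcribed": 32, "checked": 11},
    "srr": {"transcribed": 38, "checked": 11},
}


def totals(hours):
    """Return (total transcribed hours, total expert-checked hours)."""
    transcribed = sum(v["transcribed"] for v in hours.values())
    checked = sum(v["checked"] for v in hours.values())
    return transcribed, checked


if __name__ == "__main__":
    transcribed, checked = totals(HOURS)
    print(f"transcribed: {transcribed} h, checked: {checked} h")
```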

Citation

See the following publication for more details on data collection (if you use the data, please cite it with the BibTeX entry below):

@inproceedings{kallaama2024dataset,
  title={Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal},
  author={Gauthier, Elodie and Ndiaye, Aminata and Guissé, Abdoulaye},
  booktitle={Proceedings of the Fifth workshop on Resources for African Indigenous Languages (RAIL 2024)},
  year={2024}
}

Repository structure

.    
├── LICENSE    
├── README.md    
└── data/    
    ├── README.md    
    ├── lexicons/    
    ├── text_corpora/    
    └── transcriptions/    
        ├── checked/    
        └── raw/    
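
To get an overview of a local checkout, one option is to count the files under each directory shown in the tree. This is a minimal sketch using only the Python standard library; it assumes the layout above and makes no assumption about file names or formats inside the directories. `SUBDIRS` and `count_files` are hypothetical names introduced for this example.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the repository root.
SUBDIRS = [
    "data/lexicons",
    "data/text_corpora",
    "data/transcriptions/checked",
    "data/transcriptions/raw",
]


def count_files(repo_root):
    """Map each known sub-directory to the number of regular files below it.

    A value of None means the directory is missing from this checkout.
    """
    root = Path(repo_root)
    counts = {}
    for sub in SUBDIRS:
        directory = root / sub
        if directory.is_dir():
            counts[sub] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[sub] = None
    return counts


if __name__ == "__main__":
    # Run from the repository root.
    for sub, n in count_files(".").items():
        print(f"{sub}: {'missing' if n is None else n}")
```

Comparing the counts for `checked/` and `raw/` gives a rough idea of how much of the transcribed material has been expert-checked, although file counts are only a proxy for hours of speech.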

Contacts
