mTEDx_auxiliary

This repo collects the helper files I wrote while building an ASR model for the mTEDx dataset. It contains the following files:

  • download_mtedx.sh: A bash script that downloads the ASR part of the mTEDx dataset.
  • process_mtedx.py: A Python script that processes the mTEDx dataset. Processing means:
    • Splitting the audio of each TED talk in the dataset into shorter segments that can be used for training ASR models.
    • Normalizing the audio files to a sample rate of 16000 Hz and a single channel (mono).
    • Aligning the audio segments with their text.
  • stats.xlsx: An Excel file summarizing the details of the data found in the dataset.
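
The three processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in process_mtedx.py. It assumes a Kaldi-style segments file with lines of the form `<segment_id> <talk_id> <start_sec> <end_sec>`, one transcript line per segment in the same order, source audio stored as `<talk_id>.flac`, and ffmpeg as the tool that cuts and resamples. The function names (`parse_segment_line`, `build_ffmpeg_cmd`, `plan_segments`) are hypothetical.

```python
from pathlib import Path


def parse_segment_line(line):
    """Parse one assumed Kaldi-style line: <segment_id> <talk_id> <start> <end>."""
    seg_id, talk_id, start, end = line.split()
    return seg_id, talk_id, float(start), float(end)


def build_ffmpeg_cmd(src, dst, start, end, sample_rate=16000, channels=1):
    """Build an ffmpeg command that cuts [start, end] and normalizes the audio."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ss", f"{start:.2f}",     # segment start, in seconds
        "-to", f"{end:.2f}",       # segment end, in seconds
        "-ar", str(sample_rate),   # resample to 16000 Hz
        "-ac", str(channels),      # downmix to mono
        str(dst),
    ]


def plan_segments(segment_lines, transcript_lines, audio_dir, out_dir):
    """Pair every segment with its transcript and the command that extracts it."""
    if len(segment_lines) != len(transcript_lines):
        raise ValueError("segments and transcripts must have the same length")
    plan = []
    for seg_line, text in zip(segment_lines, transcript_lines):
        seg_id, talk_id, start, end = parse_segment_line(seg_line)
        src = Path(audio_dir) / f"{talk_id}.flac"   # assumed source layout
        dst = Path(out_dir) / f"{seg_id}.wav"
        plan.append({
            "id": seg_id,
            "text": text.strip(),
            "cmd": build_ffmpeg_cmd(src, dst, start, end),
        })
    return plan
```

Each `cmd` entry is a plain argument list, so it can be inspected first and then run with `subprocess.run(cmd, check=True)` once ffmpeg is installed.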

How to download the data

You can download mTEDx data by running the following bash script:

bash download_mtedx.sh

What the data should look like

After downloading the audio data, you need to process it: split each full talk into segments and align every segment with its text. The following Python script does all of that:

python process_mtedx.py \
  --in [IN_DATA_PATH] \
  --out [OUT_DATA_PATH] \
  --langs [LANGS] \
  --groups [GROUPS]
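
The four flags above could be declared with argparse along these lines. This is only a sketch of one plausible interface and the real process_mtedx.py may parse its arguments differently. The `dest` names `in_dir` and `out_dir`, and the default for `--groups`, are assumptions. Note that `in` is a Python keyword, so `args.in` would be a syntax error, which is why the sketch gives `--in` an explicit `dest`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the --in/--out/--langs/--groups flags."""
    parser = argparse.ArgumentParser(description="Process the mTEDx dataset.")
    # "in" is a reserved word, so store the value under a different attribute.
    parser.add_argument("--in", dest="in_dir", required=True,
                        help="path of the downloaded mTEDx data")
    parser.add_argument("--out", dest="out_dir", required=True,
                        help="path where processed audio is written")
    # nargs="+" accepts one or more space-separated values per flag.
    parser.add_argument("--langs", nargs="+", required=True,
                        help="language codes to process, e.g. ar el es")
    parser.add_argument("--groups", nargs="+",
                        default=["train", "valid", "test"],
                        help="data splits to process")
    return parser
```

With this parser, `--langs ar el es` arrives in the script as the list `["ar", "el", "es"]`.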

For example:

python process_mtedx.py \
  --in /scratch/1/user/manwar/data/mTEDx \
  --out /scratch/1/user/manwar/data/mTEDx_wav \
  --langs ar el es de fr it pt ru \
  --groups test valid train

This will process all audio data and organize it in the following tree:

mTEDx
├── ar
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── de
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── el
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── es
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── fr
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── it
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
├── pt
│   ├── test
│   ├── test.json
│   ├── train
│   ├── train.json
│   ├── valid
│   └── valid.json
└── ru
    ├── test
    ├── test.json
    ├── train
    ├── train.json
    ├── valid
    └── valid.json
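
To confirm that a processing run produced the tree above, a small check like the following may help. It is a sketch that relies only on the layout shown: one directory and one JSON file per language and group. It does not assume anything about the contents of the JSON files, and the function names are hypothetical.

```python
from pathlib import Path


def expected_entries(root, langs, groups):
    """List the directory and JSON file expected for every language and group."""
    root = Path(root)
    entries = []
    for lang in langs:
        for group in groups:
            entries.append(root / lang / group)            # e.g. ar/test
            entries.append(root / lang / f"{group}.json")  # e.g. ar/test.json
    return entries


def missing_entries(root, langs, groups):
    """Return the expected entries that are absent or of the wrong kind."""
    missing = []
    for path in expected_entries(root, langs, groups):
        found = path.is_file() if path.suffix == ".json" else path.is_dir()
        if not found:
            missing.append(path)
    return missing
```

For example, `missing_entries("mTEDx", ["ar", "de"], ["test", "valid", "train"])` returns an empty list when both languages are fully processed.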

Contributors

anwarvic
