Giter VIP home page Giter VIP logo

arabicasrchallenge2016's Introduction

MGB Challenge 2016 Arabic Track

Introduction

This repository is for additional resources for Arabic Broadcast Transcription Challenge 2016. For introduction and dataset released for the challenge, please refer to the challenge website http://alt.qcri.org/mgb-ar-2016.

Resources

  • SCLITE

    wget ftp://jaguar.ncsl.nist.gov/pub/sctk-2.4.10-20151007-1312Z.tar.bz2

  • DDSCORE

Transcript Generation

This section explains the xml transcript generation process that we followed to generate training transcripts (xml) and the dev transcripts (xml) from the original transcripts provided by AL Jazeera

  • Aligning data:

First, we perform a lightly supervised alignment between the ctm transcription file generated by the ASR system and the manual transcription provided by AL Jazeera for each program (which does not have any timing information). For more details read this paper : http://alt.qcri.org/MGB_challenge_Arabic_Track_2016/MGB_Arabic_description_2016.pdf

The code for the alignment is at the location: /transcriptgeneration/src/mgbmain/ArabicASRChallenge

When this code is run giving it the appropriate arguments, it generates a transcript file: 'ALL_MOD.tra' which a tab delimited file, whose each line is a speech utterance and its details:

01E4B05D-E8C2-4F92-819F-140B607ABA50.xml_mtHdv-bAsm-dAr-Al<ftA'-AlflsTynyp_00:00:34,65_00:00:37,15 الأقصى المبارك فإنه قد ثبت بالوجه الشرعي Words:7 Correct:3 Correct:42 Ins:0 Del:4 WMER:57.0 PMER:50.0 AWD:2.8 Start:1 End:0

Semantics of the above line are as follows: <program_ID>.xml_<Speaker_name (in Buckwalter Format)><space><utterance_text><\t><some statistics from the alignment><\t><correct%><\t><insertions and deletions (alignments)><\t><word error rate%><\t><graheme match error rate%><\t><Average Word Duration>

The word error rate is calculated using levenshtein distance (No Substitutions, Insertions and Deletions cost 1)

The transcript file 'ALL_MOD.tra' takes considerable time to generate but it is ok, as it is a one time process. We are currently looking into making the code faster

  • Generating transcript xmls

Once we have the 'ALL_MOD.tra', we now parse this file to generate the trancript xml files. XML files are generated using the java class /transcriptgeneration/src/mgbmain/XMLGeneration.java.

We generate both the bukwalter and the utf8 formats. The flags that need to be set for the program are:

-a <arg> Yes/No: annotate latin words in the corpus with the tag @@LAT@@

-b <arg> Yes/No: remove segments from the dev set that have ovelap speech

-bw <arg> Yes/No: Convert to bukwalter or not

-d <arg> transcript xml files destination folder

-h <arg> Yes/No: remove hesitation mark or not

-hh <arg> Yes/No: remove double ## at the start of a word

-norm <arg> Yes/No: normalize the arabic text or not, remove punctuations

-num <arg> Yes/No: flag for mapping the arabic indic numbers the Arabic numberals

-p <arg> Absolute Path: program tdf file path

-rd <arg> Yes/No: remove diacritics

-rmEnglish <arg> Yes/No Remove the english words

-rmNum <arg> Yes/No Remove numbers

-t <arg> Train/Dev: Generate xml for training or dev set

-tra <arg> Absolute Path: Give the ALL.tra file `

The settings that we use for generating the train xml trancripts for bukwalter are:

-a yes

-b no

-bw yes

-d /Users/alt-sameerk/Documents/mgb/dev_xml_2016_03_1/bw/

-h yes

-hh no

-norm yes

-num yes

-rmNum no

-rmEnglish no

-p /Users/alt-sameerk/Documents/mgb/exp-2015-11-10/aja_tdf.txt

-rd yes

-t train

-tra /Users/alt-sameerk/Documents/mgb/ALL_MOD.tra

and for the utf8 xml transcripts:

-a no

-b no

-bw no

-d /Users/alt-sameerk/Documents/mgb/dev_xml_2016_03_1/utf8/

-h no

-hh no

-norm no

-num no

-rmNum no

-rmEnglish no

-p /Users/alt-sameerk/Documents/mgb/exp-2015-11-10/aja_tdf.txt

-rd no

-t train

-tra /Users/alt-sameerk/Documents/mgb/ALL_MOD.tra

For the dev xml transcript generation we use the same class, but just changing the -t flag to dev and also generating the dev transcript file using the python script /extras/trs2xml.py

Language Modelling Text

We provide the normalized and unnormalized language modelling text by parsing the articles archive provided by AL Jazeera. The normalized bukwalter LM text is generated using the java class /transcriptgeneration/src/mgbmain/ExtractLMText.java

Kaldi Recipe

To get you started with an ASR system, we have provided a recipe (under progress) which can be used as a starting point. The recipe is at /baseline/recipe. The recipe scripts are well commented and you can get all the information by looking at the scripts

Lexicon

Lexicon is under progress and has been put in github and participants will be notified when any changes are made to the lexicon

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.