
code-switching-sentence-generation-by-gan's Introduction

Code switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation

Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee

Interspeech 2019

arXiv:1811.02356

Abstract

Code-switching is about dealing with alternative languages in speech or text. It is partially speaker-dependent and domain-related, so completely explaining the phenomenon with linguistic rules is challenging. Compared to most monolingual tasks, insufficient data is an issue for code-switching. To mitigate the issue without expensive human annotation, we propose an unsupervised method for code-switching data augmentation. By utilizing a generative adversarial network, we can generate intra-sentential code-switching sentences from monolingual sentences. We applied the proposed method to two corpora, and the results show that the generated code-switching sentences improve the performance of code-switching language models.

Outline

  1. Introduction
  2. Methodology
  3. Experimental setup
    • Corpora
    • Model Setup
  4. Results
    • Code-switching Point Prediction
    • Generated Text Quality
    • Language Modeling
    • Examples
  5. Conclusion

Corpora

  1. LectureSS: The recording of the “Signal and System” (SS) course taught by a Taiwanese instructor at National Taiwan University in 2006.
  2. SEAME: South East Asia Mandarin-English, a conversational speech corpus of Singaporean and Malaysian speakers with an almost balanced gender ratio, recorded at Nanyang Technological University and Universiti Sains Malaysia.

Experimental setup

Prerequisites

  1. Python packages
    • python 3
    • keras 2
    • numpy
    • jieba
    • h5py
    • tqdm
  2. Data
    • text files
      • Training set
        1. corpus/XXX/text/train.mono.txt: Mono sentences in H
        2. corpus/XXX/text/train.cs.txt: CS sentences
      • Development set
        1. corpus/XXX/text/dev.mono.txt: Mono sentences in H translated from CS sentences (aligned to 2.)
        2. corpus/XXX/text/dev.cs.txt: CS sentences
      • Testing set
        1. corpus/XXX/text/test.mono.txt: Mono sentences in H
      • Note
        • Sentences should be segmented into words separated by spaces.
        • Words are tokenized according to the H language.
        • If a word in the H language maps to a multi-word phrase in the G language, the phrase is joined with dashes into a single token (a small sketch follows the notes below).
    • local/XXX/translator.txt: Translation table from the H language to the G language
    • local/XXX/dict.txt: Word list for training the word embeddings
    • local/postag.txt: POS tag list for training the POS embeddings
| Type | Example |
| --- | --- |
| CS | Causality 這個 也是 你 所 讀 過 的 就是 指 我 output at-any-time 只 depend-on input |
| Mono from CS in H | 因果性 這個 也是 你 所 讀 過 的 就是 指 我 輸出 在任意時間 只 取決於 輸入 |
  • Note
    • Mono: monolingual
    • CS: code-switching
    • H: host (language)
    • G: guest (language)
    • ASR: automatic speech recognition
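
A minimal sketch of the dash convention above. The exact format of local/XXX/translator.txt is an assumption (one tab-separated "H word → G translation" entry per line); it is not documented in this repository:

```python
# Sketch only: the translator.txt format (tab-separated host word -> guest phrase)
# is an assumption, not documented in this repository.
translator = {}
with open('local/XXX/translator.txt', encoding='utf-8') as f:
    for line in f:
        host, guest = line.rstrip('\n').split('\t')
        # A multi-word guest phrase becomes one dash-joined token,
        # e.g. "at any time" -> "at-any-time".
        translator[host] = '-'.join(guest.split())

sentence = '輸出 在任意時間 只 取決於 輸入'.split()      # space-segmented host sentence
switched = [translator.get(word, word) for word in sentence]
print(' '.join(switched))   # words with a translator entry are replaced by their guest token
```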

Preprocess Data

  • Use Jieba to obtain the part-of-speech (POS) tags of the text files for the proposed + POS model (a small sketch follows this list)
    • Path:
      • Training set
        1. corpus/XXX/pos/train.mono.txt: POS of Mono sentences of training set
        2. corpus/XXX/pos/train.cs.txt: POS of CS sentences of training set
      • Development set
        1. corpus/XXX/pos/dev.mono.txt: POS of Mono sentences of the development set
      • Testing set
        1. corpus/XXX/pos/test.mono.txt: POS of Mono sentences of testing set
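
Since the sentences are already space-segmented, one simple way to produce the POS files is to tag each existing token with Jieba's part-of-speech module. A minimal sketch follows; the paths and the per-token tagging shortcut are assumptions:

```python
# Sketch: write one POS sequence per line, aligned with the word sequence
# in the corresponding text file.
import jieba.posseg as pseg

with open('corpus/XXX/text/train.mono.txt', encoding='utf-8') as fin, \
     open('corpus/XXX/pos/train.mono.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        words = line.strip().split()        # sentences are already space-segmented
        tags = []
        for w in words:
            # Tag each pre-segmented token on its own; taking the first tag is a
            # simplification for tokens that Jieba would split further.
            pairs = list(pseg.cut(w))
            tags.append(pairs[0].flag if pairs else 'x')
        fout.write(' '.join(tags) + '\n')
```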

Train Model
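
The repository does not document the training commands here. Purely as an illustration of the idea in the abstract (a generator proposes code-switching points on monolingual host sentences, and a discriminator judges whether a sentence looks like real code-switched text), a minimal Keras 2 sketch of the two networks and one discriminator update might look like the following. All sizes, names, and the omitted substitution and generator-update steps are assumptions, not the authors' implementation:

```python
# Illustrative sketch only -- NOT the authors' training script.
# Generator: per-position switch probabilities for a host-language sentence.
# Discriminator: real vs. generated code-switched sentence.
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense

VOCAB, MAXLEN, EMB, HID = 10000, 40, 128, 128   # assumed hyper-parameters

def build_generator():
    inp = Input(shape=(MAXLEN,), dtype='int32')
    x = Embedding(VOCAB, EMB)(inp)
    x = Bidirectional(LSTM(HID, return_sequences=True))(x)
    out = TimeDistributed(Dense(1, activation='sigmoid'))(x)   # P(switch at position t)
    return Model(inp, out)

def build_discriminator():
    inp = Input(shape=(MAXLEN,), dtype='int32')
    x = Embedding(VOCAB, EMB)(inp)
    x = LSTM(HID)(x)
    out = Dense(1, activation='sigmoid')(x)                    # 1 = real CS, 0 = generated
    return Model(inp, out)

G = build_generator()
D = build_discriminator()
D.compile(optimizer='adam', loss='binary_crossentropy')

# One illustrative discriminator update with placeholder data.
mono_batch    = np.random.randint(1, VOCAB, size=(32, MAXLEN))   # host-language sentences
real_cs_batch = np.random.randint(1, VOCAB, size=(32, MAXLEN))   # real CS sentences
switch_prob = G.predict(mono_batch)[..., 0]                       # shape (32, MAXLEN)
decisions = np.random.rand(*switch_prob.shape) < switch_prob      # sampled switch points
# The actual method would now replace the switched words with guest-language
# translations (translator.txt) to build the generated batch; that step and the
# adversarial generator update (see the paper) are omitted from this sketch.
fake_cs_batch = mono_batch                                        # stand-in for the substituted batch
x = np.concatenate([real_cs_batch, fake_cs_batch])
y = np.concatenate([np.ones((32, 1)), np.zeros((32, 1))])
D.train_on_batch(x, y)
```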

Results

  • Baselines (the Random and Noun variants are sketched after this list):
    • ZH
    • EN
    • Random
    • Noun
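
Assuming, as the names suggest, that the Random baseline picks switch points at random and the Noun baseline switches only noun-tagged words (this reading of the baseline names is an assumption), a minimal sketch:

```python
import random

def random_baseline(words, translator, p=0.3):
    """Switch each host word that has a translation with probability p (p is assumed)."""
    return [translator[w] if w in translator and random.random() < p else w
            for w in words]

def noun_baseline(words, pos_tags, translator):
    """Switch only words whose Jieba POS tag marks a noun (tags starting with 'n')."""
    return [translator[w] if tag.startswith('n') and w in translator else w
            for w, tag in zip(words, pos_tags)]
```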

Code-switching Point Prediction

  • Precision
  • Recall
  • F-measure
  • BLEU-1
  • Word Error Rate (WER)
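
A minimal sketch of how the switch-point precision, recall, and F-measure listed above can be computed, treating the predicted and reference code-switching positions of a sentence as sets of indices (BLEU-1 and WER are instead computed on the full generated sentences):

```python
def switch_point_prf(pred_positions, ref_positions):
    """pred_positions / ref_positions: sets of word indices that were code-switched."""
    tp = len(pred_positions & ref_positions)
    precision = tp / len(pred_positions) if pred_positions else 0.0
    recall = tp / len(ref_positions) if ref_positions else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: reference switches at positions {9, 10, 12}, prediction at {9, 12, 13}
print(switch_point_prf({9, 12, 13}, {9, 10, 12}))   # (0.666..., 0.666..., 0.666...)
```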

Generated Text Quality

Prerequisites

  1. Installation
    • N-gram model
    • Recurrent Neural Network based Language Model (RNNLM)
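
Whichever toolkit is installed for the n-gram model and the RNNLM, generated-text quality is typically compared via language-model scores such as perplexity (the exact metric is not documented in this README). A toolkit-agnostic sketch of turning per-word log-probabilities into perplexity:

```python
import math

def perplexity(word_log_probs):
    """Perplexity from a list of natural-log word probabilities over a test set."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# Example with made-up per-word log-probabilities for a five-word sentence:
print(perplexity([-2.3, -1.7, -4.1, -0.9, -3.0]))   # about 11.0
```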

Language Modeling

Automatic Speech Recognition

This is an extended experiment that is not reported in the paper.

Prerequisites

  1. Installation
  2. Data
    • speech WAV files and their corresponding text files

code-switching-sentence-generation-by-gan's Issues

Problem in training GAN model

Hi, I was trying to use the code to generate CS data from a monolingual corpus.
While running the GAN, I noticed that training finishes almost as soon as it starts, within a few seconds. My input data is about 15,000 sentences, and a 100-epoch run completes within a minute. Although I do get an output file with CS data generated by the model, could you help me debug what the issue might be?
