Detecting German Terms in English Speech Transcripts of Language Learners

Abstract

Code-switching (CS) has emerged as a critical challenge in natural language processing (NLP), particularly when dealing with speech and text produced by multilingual speakers in informal settings. This paper develops two distinct strategies for language identification (LID) within a code-switching corpus collected from English language learners who are native German speakers. The corpus consists of transcripts generated by a speech-to-text (STT) model.

In our first approach, we use the Bidirectional Encoder Representations from Transformers (BERT) model to encode words into vectors, and a multi-layer perceptron (MLP) then classifies the language of each word. We explore various word encoding methods, optimize hyperparameters, and address corpus imbalance through augmentation techniques and weight adjustments. Although this model achieves high accuracy, its F1 score remains a challenge.
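
The gap between accuracy and F1 score is typical of imbalanced corpora, where German words are rare compared to English ones. The sketch below illustrates the effect on made-up toy labels; the function name `binary_metrics`, the label encoding (0 = English, 1 = German), and the numbers are assumptions for illustration only and are not taken from the project's data or results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy data (assumption): 95 English words (0) and 5 German words (1).
y_true = [0] * 95 + [1] * 5
# Hypothetical classifier: one false alarm on an English word,
# one German word found, four German words missed.
y_pred = [0] * 94 + [1] + [1] + [0] * 4

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

On this toy example the classifier is right on 95 of 100 words, yet it finds only one of the five German words, so the F1 score for the German class stays far below the accuracy.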

Additionally, we investigate fine-tuning the TongueSwitcher BERT (tsBERT) model on our code-switching corpus. The resulting classifier demonstrates notable accuracy improvements and a better, although still not exceptional, F1 score.
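
Fine-tuning a BERT-style model for word-level language identification usually requires mapping word labels onto subword tokens. The following is a minimal sketch of one common convention, not necessarily the one used in this repository: only the first subword of each word keeps the label, and all other positions receive an ignore value. The function name `align_labels`, the example sentence, and its subword split are hypothetical; the `word_ids` list mimics what the `word_ids()` method of a Hugging Face fast tokenizer encoding returns.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_labels, word_ids):
    """Spread word-level labels over subword tokens.

    word_labels: one label per word (assumed here: 0 = English, 1 = German).
    word_ids: for each token, the index of the word it belongs to,
              or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical sentence: "I need a Regenschirm"
word_labels = [0, 0, 0, 1]
# Hypothetical tokenization: [CLS] I need a Regen ##schirm [SEP]
word_ids = [None, 0, 1, 2, 3, 3, None]

aligned = align_labels(word_labels, word_ids)
print(aligned)
```

Labeling only the first subword keeps the loss and the evaluation at word level, which matches how the corpus is annotated under the assumptions above.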

GitHub structure

The file structure of the project is as follows:

  • BERT_MLP_pipeline.ipynb is a notebook containing the pipeline for the BERT-MLP implementation.

  • tsBERT_fine_tuning_and_evaluation.ipynb is a notebook containing the tsBERT implementation.

  • corpus_augmentation.ipynb can be used to augment the corpus so that it contains more German words.

  • bert_encoder.py contains the functions used for the BERT part of the BERT-MLP pipeline.

  • data_loading.py contains functions for loading and cleaning the data.

  • mlp.py contains the classes and functions for the MLP part of the BERT-MLP model.

  • translation.py contains functions for augmenting the dataset.

  • tsBERT_data_processing contains functions for data processing in the tsBERT pipeline.

  • BERT_MLP_variants contains multiple variants of the BERT-MLP pipeline that were tested for the project.

    • Each folder contains a .ipynb file with the pipeline implementation and .csv files with the results.
    • gs_data_test.csv contains the results of the grid search, if one was performed.
    • results.csv contains the statistical results of the model.
    • word_labels.csv contains the model's labels for German words.

Dependencies

Dependencies needed to run the notebooks in this repository:

  • googletrans
  • numpy
  • pandas
  • pickle
  • pytorch
  • seaborn
  • scikit-learn
  • tqdm
  • transformers

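All of them can be installed using pip, except pickle, which is part of the Python standard library. Note that the pip package for PyTorch is named torch.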

Please note that this repo does not contain the data used for training the models.

Authors

Marko Simić, Dušan Cvijetić and Danae Papadopoulos
