Detecting German Terms in English Speech Transcripts of Language Learners

Abstract

Code-switching (CS) has emerged as a critical challenge in natural language processing (NLP), particularly when dealing with speech and text produced by multilingual speakers in informal settings. This paper develops two distinct strategies for language identification (LID) within a code-switching corpus that originates from English language learners who are native German speakers. The corpus is generated by a speech-to-text (STT) model.

In our first approach, we use the Bidirectional Encoder Representations from Transformers (BERT) model to encode words as vectors. A multi-layer perceptron (MLP) is then employed for language classification. We explore various word encoding methods, optimize hyperparameters, and address corpus imbalance through augmentation techniques and weight adjustments. Although this model achieves high accuracy, its F1 score remains a weak point.
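
The gap between accuracy and F1 on an imbalanced corpus can be illustrated with a small, self-contained sketch. The labels below are made up for illustration and are not taken from the project's data. The inverse-frequency weighting shown is one common reading of "weight adjustments" and is an assumption, not necessarily what the notebooks do.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of words whose predicted language matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """F1 for a single class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 95 English words followed by 5 German words.
y_true = ["en"] * 95 + ["de"] * 5
# A classifier that raises one false alarm and finds only one German word.
y_pred = ["en"] * 94 + ["de"] + ["en"] * 4 + ["de"]

acc = accuracy(y_true, y_pred)
f1_de = f1_score(y_true, y_pred, positive="de")
weights = balanced_class_weights(y_true)

print(f"accuracy = {acc:.2f}, F1 (German) = {f1_de:.2f}")
print(weights)
```

On this toy data the classifier is right on 95 of 100 words, yet it recovers only one of the five German words, so the F1 score for the German class is far lower than the accuracy. Weights such as these could be passed to a weighted loss so that errors on the rare class cost more.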

Additionally, we investigate fine-tuning the TongueSwitcher BERT (tsBERT) model on our code-switching corpus. The resulting classifier demonstrates notable accuracy improvements and a better, although still not exceptional, F1 score.
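
Fine-tuning a BERT-style model for word-level language labels usually requires mapping each word label onto the subword tokens the tokenizer produces. The sketch below shows one common convention, assuming a list of word indices like the one returned by `word_ids()` on a Hugging Face fast tokenizer encoding. The helper name `align_labels`, the example split, and the label scheme (1 for German, 0 for English) are hypothetical, and the tsBERT notebook may handle this step differently.

```python
# -100 is the default ignore_index of torch.nn.CrossEntropyLoss,
# so positions carrying it do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(word_labels, word_ids):
    """Give each word's label to its first subword and ignore the rest.

    word_labels: one integer label per word.
    word_ids: one entry per token; the index of the word the token
        belongs to, or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword
        previous = wid
    return aligned


# Hypothetical example: "ich like Kartoffelsalat", where the last word
# is assumed to split into three subword tokens.
word_labels = [1, 0, 1]
word_ids = [None, 0, 1, 2, 2, 2, None]
aligned = align_labels(word_labels, word_ids)
print(aligned)  # [-100, 1, 0, 1, -100, -100, -100]
```

Labelling only the first subword keeps one prediction per word, which matches word-level evaluation. Labelling every subword is an equally valid alternative.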

GitHub structure

The file structure of the project is as follows:

BERT_MLP_pipeline.ipynb is a notebook containing the pipeline for the BERT-MLP implementation.
tsBERT_fine_tuning_and_evaluation.ipynb is a notebook containing the tsBERT implementation.
corpus_augmentation.ipynb can be used to augment the corpus so that it contains more German words.
bert_encoder.py contains functions used for the BERT part of the BERT-MLP pipeline.
data_loading.py contains functions for loading and cleaning the data.
mlp.py contains classes and functions for the MLP part of the BERT-MLP model.
translation.py contains functions for augmenting the dataset.
tsBERT_data_processing contains functions for data processing in the tsBERT pipeline.
BERT_MLP_variants contains multiple variants of the BERT-MLP pipeline that were tested for the project.
- Each folder contains a .ipynb file with the pipeline implementation and .csv files with results.
- gs_data_test.csv contains the results of the grid search, if one was performed.
- results.csv contains the statistical results of the model.
- word_labels.csv contains the model's labels for German words.
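
As a rough illustration of what augmenting the corpus with more German words could look like, the sketch below swaps some English words for German translations and relabels them. It is not the code in translation.py or corpus_augmentation.ipynb: the function `augment_sentence`, the `rate` parameter, the "en"/"de" label strings, and the toy lexicon are all hypothetical, and a dictionary lookup stands in for a real translation service so the example runs offline.

```python
import random


def augment_sentence(words, labels, translate, rate=0.3, rng=None):
    """Replace some English words with German ones and relabel them.

    translate: callable mapping an English word to a German word,
        or to None when no translation is available (an assumption).
    rate: probability of replacing a translatable English word.
    """
    rng = rng or random.Random()
    new_words, new_labels = [], []
    for word, label in zip(words, labels):
        german = translate(word) if label == "en" else None
        if german is not None and rng.random() < rate:
            new_words.append(german)
            new_labels.append("de")
        else:
            new_words.append(word)
            new_labels.append(label)
    return new_words, new_labels


# Toy lexicon standing in for a translation backend (hypothetical).
TOY_LEXICON = {"house": "Haus", "dog": "Hund", "bread": "Brot"}

words = ["the", "dog", "eats", "bread"]
labels = ["en", "en", "en", "en"]

# rate=1.0 replaces every word that has a translation.
aug_words, aug_labels = augment_sentence(
    words, labels, TOY_LEXICON.get, rate=1.0, rng=random.Random(0)
)
print(aug_words)   # ['the', 'Hund', 'eats', 'Brot']
print(aug_labels)  # ['en', 'de', 'en', 'de']
```

Passing the translator in as a callable keeps the replacement logic testable without network access. A lower `rate` would keep the augmented text closer to the original distribution.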

Dependencies

Dependencies needed to run the notebooks in this repository:

googletrans
numpy
pandas
pickle
pytorch
seaborn
scikit-learn
tqdm
transformers

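All of these except pickle, which is part of the Python standard library, can be installed using pip. Note that PyTorch is published on PyPI under the package name torch.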

Please note that this repo does not contain the data used for training the models.

Authors

Marko Simić, Dušan Cvijetić and Danae Papadopoulos
