Aengmusae (앵무새)

A demo is available at aengmusae.tttm.us.

Instructions for how to train/run the model are at the bottom.

Reversing Korean romanization with AI

Korean romanization has always been inconsistent because the many competing romanization systems never reached a consensus. The language is also hard to romanize because many words share a pronunciation yet differ in spelling in ways that are difficult to express with the English alphabet. This prevents something like Chinese Pinyin from existing.

Since 2000, the Revised Romanization of Korean has been the official romanization system of South Korea. Even with an official standard, non-standard romanization remains common because the standard is rarely enforced or followed outside of government-related material.

Curious how a neural network would interpret the different romanization methods and make predictions, I built an LSTM RNN model trained on romanized/Korean word pairs. I didn't have high hopes for it: the model throws out all sentence context and trains on individual words, so it completely ignores the problem of words that share a pronunciation but are spelled differently.

Dataset

I scraped romanized/Korean lyrics from Color Coded Lyrics in the hope that community romanization would be more varied and might contain non-standard patterns the model could learn from.

Additionally, I got more romanized/Korean pairs from the Korean dictionary from Kaikki, which was created using Wiktextract: Wiktionary as Machine-Readable Structured Data. Using this dictionary may have been a mistake: a dictionary like this presumably uses a single romanization system, which would reduce the model's overall ability to generalize across romanization patterns.

In total, there were 465,247 romanized/Korean word pairs to train on.
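To make the "word pairs" idea concrete, here is a minimal sketch of how one romanized lyric line and its Korean counterpart could be split into pairs. This is not taken from data.py; the function name word_pairs and the rule of dropping lines whose word counts differ are assumptions made for illustration.

```python
def word_pairs(romanized_line: str, korean_line: str) -> list[tuple[str, str]]:
    """Pair whitespace-separated words from two parallel lines.

    Hypothetical helper: returns an empty list when the two lines have
    a different number of words, because they cannot be aligned reliably.
    """
    romanized_words = romanized_line.lower().split()
    korean_words = korean_line.split()
    if len(romanized_words) != len(korean_words):
        return []
    return list(zip(romanized_words, korean_words))


pairs = word_pairs("nan mwonga dalla dalla", "난 뭔가 달라 달라")
mismatched = word_pairs("nan mwonga", "난 뭔가 달라")
print(pairs)
```

Note that after this step each pair stands alone, so whatever the rest of the sentence said is no longer available to the model.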

train.csv

Issues

The model does not address the issue of different words having the same pronunciation. As mentioned above, the training data is split into individual words and the sentences (and the context of the word) are discarded. Since Korean is a language that depends on the context of a word to know its spelling, this seemingly minor issue makes this model essentially useless in terms of reliably getting the correct Korean word from a romanized input.
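The ambiguity can be shown with a small sketch. The pairs below are hand-picked illustrations, not rows from the dataset: under the Revised Romanization, each group of words is written with the same Latin letters, so a word-level lookup has nothing to choose between them.

```python
from collections import defaultdict

# Illustrative pairs (not from data.csv): different Korean spellings
# that the Revised Romanization writes identically.
example_pairs = [
    ("nat", "낫"),  # sickle
    ("nat", "낮"),  # daytime
    ("nat", "낯"),  # face
    ("bit", "빗"),  # comb
    ("bit", "빚"),  # debt
    ("bit", "빛"),  # light
]

# Group every Korean spelling under its romanized form.
candidates = defaultdict(set)
for romanized, korean in example_pairs:
    candidates[romanized].add(korean)

# Any romanized form with more than one spelling is ambiguous without context.
ambiguous = {r: sorted(k) for r, k in candidates.items() if len(k) > 1}
print(ambiguous)
```

With only the single word as input, any of the listed spellings is an equally valid answer, which is why sentence context matters.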

Although the dataset incorporated varied romanization standards, the model was still heavily biased towards the Revised Romanization of Korean, the official standard, and often returned incorrect Korean for words that were not written in that standard.

I also noticed that the model failed to predict the last syllable of a word correctly unless the last letter of the word was repeated one or more times.

Expected: nan mwonga dalla dalla => 난 뭔가 달라 달라

Input: nan mwonga dalla dalla
Output: 난 뭔가 달ㄹ 달ㄹ
                       .      .
Input: nan mwonga dallaa dallaa
Output: 난 뭔가 달라 달라

Expected: jagiya => 자기야

Input: jagiya
Output: 자기
             .
Input: jagiyaa
Output: 자기ㅇ
             ..
Input: jagiyaaa
Output: 자기야
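One way to read these outputs is to break each syllable into its individual letters (jamo). The sketch below does that with plain Unicode arithmetic rather than the jamo package, so it runs without extra dependencies. It assumes, without confirmation from the model code, that the model emits one jamo at a time; under that assumption the failing outputs look like the expected sequence cut off one or two jamo early.

```python
# Compatibility jamo in the order used by the Unicode Hangul syllable block.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables; pass other characters through."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 19 * 21 * 28:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)  # already a lone jamo, a space, etc.
    return "".join(out)


expected = to_jamo("자기야")
for output in ["자기", "자기ㅇ", "자기야"]:
    jamo = to_jamo(output)
    # Each failing output is a prefix of the expected jamo sequence.
    print(output, jamo, expected.startswith(jamo), len(expected) - len(jamo))
```

For the examples above, 자기 is two jamo short, 자기ㅇ is one jamo short, and 달ㄹ is missing only the final ㅏ of 달라.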

Possible Improvement

I think this model would be far more useful if it were trained with context (on sentences instead of words), so that it could be used to convert large amounts of non-standard romanized Korean text to Korean.

What I learned

  • Generalizing human-made content is very hard
  • I need more data (look mom I'm becoming a large corporation)

How to Run

You can download the dataset from data.csv and skip the data collection process.

You can download the model from model.pt and skip the training process.

Put the files in /out.

Make sure to download both to run the API.

Data Collection

# Install dependencies
pip3 install bs4 requests

# Run data collection
python3 data.py

This creates a data.csv file in /out. Scraping takes quite a while, so I recommend just downloading the premade dataset if you only want to train the model.

Training

Training requires data.csv to be in /out.

Install torch for your system

# Run your specific torch installation command
# e.g. pip3 install torch

# Install other dependencies
pip3 install jamo

# Train the model
python3 model.py

Make sure to use any hardware acceleration you have, since training takes quite a bit of time.

GPU                      | Time (24 epochs) | Average time/epoch    | Rental price
NVIDIA GeForce GTX 980   | 9h 42m 43s       | ~24 Minutes           | N/A
NVIDIA GeForce RTX 4090  | 39m 36s          | ~1 Minute 30 Seconds  | ~$0.26/hour
NVIDIA H100              | 33m 44s          | ~1 Minute 20 Seconds  | ~$2.80/hour

(The rental H100 was severely bottlenecked by its processor, so it probably would have been faster otherwise.)

The RTX 4090 definitely had the best value to rent, especially for a small project like this. Poor GTX 980 probably cost more in electricity lol 😭
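The value comparison can be worked out from the numbers in the table. This is a rough sketch that assumes the rental is billed for exactly the training time at the listed hourly rate, with no minimum charge or setup time, which real providers may not match.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


EPOCHS = 24

# Training time and hourly rental rate (USD) copied from the table above.
runs = {
    "RTX 4090": (to_seconds(minutes=39, seconds=36), 0.26),
    "H100": (to_seconds(minutes=33, seconds=44), 2.80),
}

costs = {}
for gpu, (total_seconds, rate_per_hour) in runs.items():
    # Assumption: billed only for the time spent training.
    costs[gpu] = total_seconds / 3600 * rate_per_hour
    per_epoch = total_seconds / EPOCHS
    print(f"{gpu}: ~${costs[gpu]:.2f} total, ~{per_epoch:.0f}s per epoch")
```

Under that assumption a full run comes to roughly $0.17 on the RTX 4090 and roughly $1.57 on the H100, so the H100 costs about nine times as much for a run that finishes about six minutes sooner.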

The training process creates a model.pt file in /out.

API

Running the API requires the same dependencies as training the model. It requires both data.csv and model.pt to exist in /out.

# Install dependencies (+training dependencies)
pip3 install flask flask_cors waitress

# Run the API
python3 api.py

This runs a server on port 8080. To make a query, send a POST request to / with a JSON body whose query field is set to your input (romanized Korean).

POST localhost:8080 { "query": "aengmusae" }

=> 앵무새
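If you'd rather query the API from Python, a minimal client using only the standard library could look like the sketch below. The helper names build_request and romanized_to_korean are made up for this example, and since the exact response format isn't documented here, the body is simply returned as decoded text.

```python
import json
import urllib.request

API_URL = "http://localhost:8080/"  # assumes the API is running locally


def build_request(query: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request with a JSON body of the form {"query": ...}."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def romanized_to_korean(query: str, url: str = API_URL) -> str:
    """Send the query and return the raw response body as text.

    Assumption: the response is UTF-8 text; adjust if the API returns JSON.
    """
    with urllib.request.urlopen(build_request(query, url)) as response:
        return response.read().decode("utf-8")
```

With the server running, calling romanized_to_korean("aengmusae") sends the same request as the example above.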


Thank you so much for reading!
