deepchar / data-archived

Datasets and data generation for transliteration

Python 100.00%

data-archived's Issues

Update project to accept multiple languages.

Currently the project downloads only a monolingual corpus on each execution. We want to extract multilingual corpora as well.

Data generation

We can basically extend generate.js from https://github.com/deeplanguageclass/fairseq-transliteration-data.

The most important part besides applying the mapping randomly is throwing out the lines which do not change.
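The two steps above (apply the mapping randomly, drop unchanged lines) could be sketched in Python roughly as follows. This is not a port of generate.js; `EXAMPLE_MAPPING`, `transliterate` and `generate_pairs` are hypothetical names, and the mapping is a toy one for illustration only.

```python
import random

# Hypothetical toy mapping (assumption): each source character maps to one
# or more candidate romanizations. A real mapping would be defined per
# language/script pair.
EXAMPLE_MAPPING = {
    "ш": ["sh", "š"],
    "а": ["a"],
    "п": ["p"],
    "к": ["k"],
}


def transliterate(line, mapping, rng):
    """Replace each mapped character with a randomly chosen candidate."""
    out = []
    for ch in line:
        options = mapping.get(ch)
        # Characters without a mapping are passed through unchanged.
        out.append(rng.choice(options) if options else ch)
    return "".join(out)


def generate_pairs(lines, mapping, seed=0):
    """Return (transliterated, original) pairs, skipping unchanged lines."""
    rng = random.Random(seed)
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        translit = transliterate(line, mapping, rng)
        # Throw out lines the mapping did not change: they carry no
        # training signal for transliteration.
        if translit != line:
            pairs.append((translit, line))
    return pairs
```

Seeding the generator keeps a run reproducible while still varying which candidate is picked for each character.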

Script for downloading common corpora

For example, Wikipedia, but potentially others, such as film subtitles, which contain more conversational data but are still free of translit.
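A minimal sketch of such a script for the Wikipedia case, assuming the public dumps at dumps.wikimedia.org and simple language codes such as `hy` or `ru`. The function names and the `corpora` output directory are hypothetical, not part of this project.

```python
import urllib.request
from pathlib import Path

# URL pattern of the latest article dump for a given Wikipedia language code.
# Assumption: the code is a simple one (e.g. "hy", "ru"); codes containing
# hyphens use a different database name and are not handled here.
DUMP_URL = (
    "https://dumps.wikimedia.org/{lang}wiki/latest/"
    "{lang}wiki-latest-pages-articles.xml.bz2"
)


def dump_url(lang):
    """Build the dump URL for one language code."""
    return DUMP_URL.format(lang=lang)


def download_dumps(langs, out_dir="corpora"):
    """Download the dump for each language, skipping files already present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for lang in langs:
        target = out / f"{lang}wiki-latest-pages-articles.xml.bz2"
        if not target.exists():
            # Dumps can be very large; this blocks until the download ends.
            urllib.request.urlretrieve(dump_url(lang), str(target))
        paths.append(target)
    return paths
```

Calling `download_dumps(["hy", "ru"])` would fetch one dump per language, which also covers the multilingual case. Other sources such as subtitles would need their own client.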

Update requirements files

Since OpenSubClient.py has been added, the dependencies in requirements.txt need to be updated.
We also need to add a description of OpenSubClient.py to Readme.txt.

Update readme file

Add a description of how to use the main script to extract text and generate character-level transliterations.

Clean downloaded data

We should at least uniquify, which will downweight the boilerplate like headers and footers. But it would be even better to remove the most common lines.

Basically we should only preserve lines in a document that did not appear in any other document in the corpus.

We can clean very aggressively because lack of data is not a problem.
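The aggressive variant could look roughly like this: a sketch that assumes the corpus fits in memory as a list of documents, each a list of lines. `clean_corpus` is a hypothetical name.

```python
from collections import Counter


def clean_corpus(documents):
    """Keep only lines that occur in exactly one document of the corpus.

    documents: list of documents, each a list of text lines.
    Returns a list of cleaned documents in the same order.
    """
    # Count in how many documents each (stripped) line appears.
    doc_counts = Counter()
    for doc in documents:
        doc_counts.update({line.strip() for line in doc})

    cleaned = []
    for doc in documents:
        seen = set()
        kept = []
        for line in doc:
            key = line.strip()
            # Skip empty lines and repeats within the same document.
            if not key or key in seen:
                continue
            seen.add(key)
            # Boilerplate such as headers and footers shows up in more
            # than one document, so it is dropped here.
            if doc_counts[key] == 1:
                kept.append(key)
        cleaned.append(kept)
    return cleaned
```

This removes every shared line, not just the most common ones, which matches the point that losing some data is acceptable.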

More realistic data generation

Add documentation for "How to add a new language.script pair"

Implement proper slicing in the generation script to provide an overlap of lines
