deepchar / data-archived
Datasets and data generation for transliteration
Currently the project downloads only a monolingual corpus on each execution. We want to extract multilingual corpora as well.
We can basically extend generate.js in https://github.com/deeplanguageclass/fairseq-transliteration-data.
The most important part, besides applying the mapping randomly, is throwing out the lines that do not change.
For example, Wikipedia, but potentially other sources as well, such as film subtitles, which have more conversational data but are still free of translit.
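The two steps named here (apply the mapping with random variant choices, then drop lines that come out unchanged) can be sketched as follows. This is a minimal illustration, not the logic of generate.js: the `MAPPING` table, the function names, and the sample lines are all hypothetical.

```python
import random

# Hypothetical mapping for illustration only: each source character
# maps to one or more possible Latin renderings.
MAPPING = {"ա": ["a"], "բ": ["b"], "խ": ["kh", "x"]}


def transliterate(line, mapping, rng):
    """Replace each mapped character with a randomly chosen variant."""
    return "".join(
        rng.choice(mapping[ch]) if ch in mapping else ch for ch in line
    )


def make_pairs(lines, mapping, rng=None):
    """Return (source, transliterated) pairs, skipping unchanged lines."""
    rng = rng or random.Random()
    pairs = []
    for line in lines:
        converted = transliterate(line, mapping, rng)
        # A line with no mapped characters carries no training signal.
        if converted != line:
            pairs.append((line, converted))
    return pairs
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing generated datasets.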
Since OpenSubClient.py has been added, the dependencies in requirements.txt need to be updated.
Readme.txt also needs a section about OpenSubClient.py.
Add a description of how to use the main script to extract text and generate character-level transliterations.
We should at least uniquify, which will downweight the boilerplate like headers and footers. But it would be even better to remove the most common lines.
Basically we should only preserve lines in a document that did not appear in any other document in the corpus.
We can clean very aggressively because lack of data is not a problem.
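A minimal sketch of this aggressive cleaning, assuming each document is already available as a list of lines (the function name and the input shape are assumptions, not part of the project):

```python
from collections import Counter


def drop_shared_lines(documents):
    """Keep only lines that occur in exactly one document of the corpus.

    `documents` is a list of documents, each a list of line strings.
    Repeats within a single document are also collapsed (uniquified).
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each line once per document, not once per occurrence.
        doc_freq.update(set(doc))

    cleaned = []
    for doc in documents:
        unique_lines = dict.fromkeys(doc)  # preserves first-seen order
        cleaned.append([ln for ln in unique_lines if doc_freq[ln] == 1])
    return cleaned
```

This holds every distinct line in memory, so a very large corpus would likely need hashing or an on-disk count instead.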
Keep the style consistent within a single word and a single row.
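One way to get this consistency is to draw the random variant once per source character per row and reuse it for the rest of that row. The sketch below assumes a hypothetical `VARIANTS` table and function name; it is not taken from the project.

```python
import random

# Hypothetical variant table for illustration only.
VARIANTS = {"ա": ["a"], "խ": ["kh", "x"]}


def transliterate_row(row, variants, rng):
    """Transliterate one row, using a single variant per source character."""
    chosen = {}  # source character -> variant picked for this row
    out = []
    for ch in row:
        if ch in variants:
            if ch not in chosen:
                chosen[ch] = rng.choice(variants[ch])
            out.append(chosen[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Different rows can still pick different variants, so the dataset as a whole keeps its variety.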
Be smart about uppercase digraphs (e.g. Խ => KH vs Kh, depending on context).
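A possible heuristic for this, assuming the context is the casing of the other letters in the same word: use the fully uppercase digraph only when every other letter is uppercase, and title case otherwise. The function name and the rule itself are assumptions for illustration.

```python
def cased_digraph(word, index, digraph):
    """Return `digraph` cased to match the letter at word[index].

    Lowercase source letter       -> "kh"
    Uppercase in an all-caps word -> "KH"
    Uppercase otherwise           -> "Kh"
    """
    if not word[index].isupper():
        return digraph.lower()
    # Look at the other letters of the word to decide between KH and Kh.
    others = [c for c in word[:index] + word[index + 1:] if c.isalpha()]
    if others and all(c.isupper() for c in others):
        return digraph.upper()
    return digraph.capitalize()
```

A single-letter uppercase word is ambiguous under this rule and falls back to title case.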