Sanakirju Simplifier

Simplifies the Sanakirju XML dataset for easier parsing.

Usage

pipenv install

pipenv run python main.py

This generates the simplified XML dataset in src/sanakirju_simplifier/build.

Motivation

The original dataset that Sanakirju uses is a huge, deeply nested set of XML files. Parsing it automatically with common Node.js libraries produces some incorrectly parsed data. One example of these parsing issues is an additional XML element inside XML text content. Most parsers pop that element out as its own element, which makes it quite tricky to put its text content back in the correct location.
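The nested-element problem can be illustrated with a short sketch. It uses Python's standard xml.etree.ElementTree rather than the Node.js libraries mentioned above, and the fragment and its tag names (def, i) are made up for illustration, not taken from the dataset. It only shows how a tree parser splits one sentence into several pieces when an element sits inside text content.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: an <i> element sits in the middle of the text of <def>.
fragment = "<def>before <i>inner</i> after</def>"
root = ET.fromstring(fragment)
child = root[0]

# The sentence is split into three separate pieces by the parser.
print(repr(root.text))   # text before the nested element
print(repr(child.text))  # text inside the nested element
print(repr(child.tail))  # text after the nested element

# Joining all text nodes in document order restores the original sentence.
flat = "".join(root.itertext())
print(flat)
```

ElementTree keeps the trailing text next to the child element, so the order can be recovered here. Parsers that convert XML into plain objects may not keep that ordering, which appears to be the situation described above.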

Sanakirju does not really need most of that XML data; it only needs the text inside the elements.

What the simplifier does

Most of the problematic tags can simply be searched for and replaced with regular expressions. The simplifier goes through the whole dataset and saves it again as new XML files that have fewer and less deeply nested elements. In short, the end goal is to find, replace, and remove markup that would be parsed incorrectly, while keeping the text content inside it.
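A minimal sketch of the search/replace idea, not the project's actual code: the tag names in INLINE_TAGS, the function strip_inline_tags, and the sample entry are all hypothetical, since this README does not list which tags are problematic. The sketch removes the opening and closing markers of the chosen tags and leaves their text in place.

```python
import re

# Hypothetical inline tag names; the real problematic tags are not listed here.
INLINE_TAGS = ("i", "b")


def strip_inline_tags(xml_text: str, tags=INLINE_TAGS) -> str:
    """Remove opening and closing markers of the given tags, keeping their text."""
    names = "|".join(re.escape(tag) for tag in tags)
    # Matches <tag>, <tag attr="...">, and </tag> for the listed names only,
    # so a longer name such as <item> is left untouched.
    pattern = re.compile(rf"</?(?:{names})(?:\s[^<>]*)?>")
    return pattern.sub("", xml_text)


# Made-up sample entry, used only to show the effect.
sample = "<entry><def>first <i>nested</i> last</def></entry>"
simplified = strip_inline_tags(sample)
print(simplified)
```

Because the nested markers are removed before parsing, the text of the def element stays in one piece and a parser no longer has to reassemble it.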

Sources

Words and translations are from Karjalan Kielen Sanakirja, created by the Institute for the Languages of Finland. The original material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
