Giter VIP home page Giter VIP logo

dewiktionary-htmldump-parser's Introduction

De-Wiktionary HTML dump parser

This project parses the German Wiktionary HTML dump (right now only to extract Czech-German data) into JSON that can easily be used in any application.

Besides using the the HTML dump it also scrapes some pages, because the Flexion namespace in the German Wiktionary is not included the HTML dump (hopefully this will change), it has some code to also scrape them.

Comparison with other projects

Other projects parsing the German Wiktionary use the XML dump, but this has the downside of not getting any data from expanded templates. For many languages the inflections are not manually typed in, but instead generated by a Lua template. These can only be accessed by parsing the HTML directly (using beautifulsoup). (The other option would be expanding the templates manually, which takes more CPU power and is pretty difficult to implement - for English the wikitextprocessor has been developed to do this for example)

Running the project

You should download the newest dewiktionarydump-NS0 dump from https://dumps.wikimedia.org/other/enterprise_html/runs/

Then should clone the project, install poetry and then run poetry install. You can run python files like poetry run python main.py. Main.py downloads the basic data from HTML dump. Scrape_category_html.py downloads and parses the Flexion pages that are not included in the HTML dump. Generate_dictionary.py generates a Tabfile dictionary, but this can be changed to one of the many output formats that pyglossary supports.

Interesting projects

For the English wiktionary I can recommend https://kaikki.org/, which has parsed the Wiktionary data in a great format and also retains all grammatical information for inflections, among other things.

dewiktionary-htmldump-parser's People

Contributors

vuizur avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.