Giter VIP home page Giter VIP logo

wiki-utils's Introduction

Converting MediaWiki Dump to MapTool Properties Help File

These scripts and auxiliary files are used to extract the documentation from the wiki.rptools.info website and store them here under the processed/ directory.

The MediaWiki software adds some boilerplate that is removed in these files – we only keep the contents of <div id="mw-content-text"> but even that has pieces removed, like the table of contents and all HTML comments.

The end result is a Java properties file that we can manually copy to ../../src/main/resources/net/rptools/maptool/language/macro_descriptions/ under the name i18n.properties.

Hopefully, at some point these scripts can be modified to handle other languages!

Conversion Process

I recommend that you execute these scripts as shown and compare the output to the tests/ directory.

It’s possible that changes to the wiki dump file format or the content of the pages themseelves could cause the scripts to break. The easiest way to detect that is to run the script(s) and capture the output, then compare that output to the previous version.

In my latest tests, this detected a change to the Python lxml module that cause the script(s) to produce incorrect output but without any errors. If I hadn’t done that, I could’ve spent a lot of time trying to figure out what went wrong with the wiki data.

By producing intermediate files, a human can look at the content — and even modify it — before proceeding to the next step.

Overall operation

  1. Extract a list of pages that appear to be macro functions and save the list in the file wiki-has.txt

  2. Read that list and retrieve the specified pages, storing them under wiki.rptools.info/.

    In the past, the list needed to be tweaked manually as some pages were not appearing for various reasons. The pages have since been updated so that valid pages are included and invalid pages are not.

  3. Convert the wiki page HTML into small snippets of HTML with just the page content.

    The HTML for the wiki pages can be messy, but it includes all of the formatting that we want (bold, italics, etc) and some things we don’t want (like the table of contents and the "stub" template content). So we run the HTML through an XSLT that extracts just the div#mw-content-text text and throws away everything else.

    The results are stored under the processed/ directory under the name of the page.

  4. Create a one-line description for each macro function.

    The content under the processed/ directory contains everything needed for the i18n.properties file, but the macro editor also wants a one-line description for each function. So we read those files, extract the first sentence, and write the result to processed.txt.

    This is a great place to eyeball the content and see if anything looks out of place. For example, if MediaWiki changed the structure of the generated page, you might see the wrong description lines show up here. But if everything looks good, continue to the next step.

    (Note that some parts of this should probably change. Right now, the macro editor doesn’t support links in the help doc in all cases. But if it does in the future, we’ll want the links in the href attribute to point to the corresponding wiki page. Until then, we should probably remove all attributes for anchor tags; that allows them to continue to display as links, but there’s no chance of them being activated.)

  5. Combine the one-line description and the full description into the i18n.properties file.

    The lines are read from processed.txt and combined with the full description from the file under processed/ to create the properties file.

All done!

1-wiki-getfnnames.py > wiki-has.txt

This Python3 script reads the MediaWiki dump file and makes a list of all pages that appear to describe macro functions.

This list is stored in wiki-has.txt and is used in the next step.

2-get-wiki-pages.sh

This script reads the wiki-has.txt file to determine which pages to retrieve.

It grabs them using wget and puts them under the wiki.rptools.info directory. The actual pages are under index.php/ and auxiliary files are under other subdirectories (like JavaScripts, CSS, and images).

The auxiliary files will be ignored.

3-extract-mw-content.sh

This script extracts the proper <div> from the pages under wiki.rptools.info/index.php/

It checks to ensure the HTML is valid, then reformats and reindents the document when generating the output.

That output is put under processed/ for use in the next step.

4-generate-macro-list.sh > processed.txt

4-generate-macro-list.pl > processed.txt

This script reads the files under processed/ and creates an index that lists each filename and the one-line summary of the macro that the file documents.

It looks for <div class="template_description"> to isolate the description text. If not found, it falls back to <div class="template_description">.

The output of this script becomes input for the next step.

5-create-properties.sh > i18n.properties

This script reads the index from the previous step, processed.txt and uses it to generate the .description records in the properties file.

It also reads the HTML snippets stored under the processed/ directory to create the .summary records.

The output of this script should be redirected to i18n.properties; it is the final output from this process and should be copied to ../../src/main/resources/net/rptools/maptool/language/macro_descriptions/ after completion.

These scripts aren’t perfect, as they doesn’t always detect the .description field properly. Future updates to the wiki pages may correct that.

(It currently tries to grab the first "sentence", meaning a string of text that ends with a period, question mark, or exclamation mark, followed by at least one space. Unfortunately, it can be confused by nested HTML elements within that first sentence.)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.