
wiki_as_base-py's Introduction

wiki_as_base-py

[MVP] Use MediaWiki wiki page content as a read-only database. Python library implementation. See https://github.com/fititnt/openstreetmap-serverless-functions/tree/main/function/wiki-as-base

GitHub | PyPI: wiki_as_base



Installing

pip install wiki_as_base --upgrade

## Alternative:
# pip install wiki_as_base==0.5.10

Environment variables

Customize these for your needs. They're shared between the command line and the library.

export WIKI_API='https://wiki.openstreetmap.org/w/api.php'
export WIKI_NS='osmwiki'
export CACHE_TTL='82800'  # 82800 seconds = 23 hours

Suggested: customize the user agent, following the logic of the MediaWiki user agent policy. Without a WIKI_AS_BASE_BOT_CONTACT customization, recursive and paginated requests that are not already cached locally will be far slower, with a 10-second delay. This default may increase in future releases. It does not affect direct requests (likely ones with fewer than 50 pages).

export WIKI_AS_BASE_BOT_CONTACT='https://github.com/fititnt/wiki_as_base-py; [email protected]'

Command line Usage

Quickstart

These examples will request two wikis: a live page on the OpenStreetMap wiki (the default) and a live page on Wikidata.

wiki_as_base --help

## Use remote storage (defined on WIKI_API)
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'

# The output is JSON-LD. Feel free to further filter the data
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq .data[1]

## Example of parsing Wiki markup directly from stdin instead of using WIKI_API. Output is JSON-LD
cat tests/data/multiple.wiki.txt | wiki_as_base --input-stdin

## Output a zip file instead of JSON-LD. --verbose also adds wikiasbase.jsonld to the file
cat tests/data/chatbot-por.wiki.txt | wiki_as_base --input-stdin --verbose --output-zip-file tests/temp/chatbot-por.zip

## Use a different wiki with an ad-hoc change of the env vars WIKI_API and WIKI_NS
WIKI_NS=wikidatawiki \
  WIKI_API=https://www.wikidata.org/w/api.php \
  wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'
More examples for other wikis:

# For suggestions of RDF namespaces, see https://dumps.wikimedia.org/backup-index.html
WIKI_NS=specieswiki \
  WIKI_API=https://species.wikimedia.org/w/api.php \
  wiki_as_base --titles 'Paubrasilia_echinata'

# @TODO implement support for the MediaWiki version used by wikis like this one
WIKI_NS=smwwiki \
  WIKI_API=https://www.semantic-mediawiki.org/w/api.php \
  wiki_as_base --titles 'Help:Using_SPARQL_and_RDF_stores'

Use of permanent IDs for pages, the MediaWiki pageids

If the pages are already known upfront (such as in automation), then using the numeric pageid is a better choice.

# "--pageids '295916'" is equivalent to "--titles 'User:EmericusPetro/sandbox/Wiki-as-base'"
wiki_as_base --pageids '295916'

However, if for some reason you need to strictly enforce not just an exact page but an exact version of one or more pages, and getting the latest version is not essential, then you can use revids:

# "--revids '2460131'" is an older version of --pageids '295916' and
# "--titles 'User:EmericusPetro/sandbox/Wiki-as-base'"
wiki_as_base --revids '2460131'

Request multiple pages at once, either by pageid or titles

Each MediaWiki API may have different limits for batch requests; however, even unauthenticated users often have decent limits (e.g. 50 pages).

Some wikis may allow very high limits for authenticated accounts (500 pages); however, the current version does not implement authenticated requests.

## All the following commands are equivalent for the default WIKI_API

wiki_as_base --input-autodetect '295916|296167'
wiki_as_base --input-autodetect 'User:EmericusPetro/sandbox/Wiki-as-base|User:EmericusPetro/sandbox/Wiki-as-base/data-validation'
wiki_as_base --pageids '295916|296167'
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base|User:EmericusPetro/sandbox/Wiki-as-base/data-validation'

Trivia: since this library and the CLI fetch content directly from the MediaWiki API and parse Wikitext (not raw HTML), requesting several pages this way causes much less server load than requesting big pages with a high number of template calls 😉.

Advanced filter with jq

When working with the JSON-LD output, you can use jq ("a lightweight and flexible command-line JSON processor", see https://stedolan.github.io/jq/) to filter the data.

## Filter tables
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq '.data[] | select(.["@type"] == "wtxt:Table")'

## Filter Templates
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq '.data[] | select(.["@type"] == "wtxt:Template")'

Save JSON-LD extracted as files

Use the --output-zip-file parameter. One example:

wiki_as_base --input-autodetect 'Category:References' --output-zip-file ~/Downloads/Category:References.zip

Library usage

NOTE: for production usage (if you can't review releases or are not locked into Docker images), consider pinning a very specific release.

Production usage

# requirements.txt
wiki_as_base==0.5.10

Other cases (or use on your local machine)

# Run this via the CLI to force redownloading the latest release. Do not use --pre (pre-releases)
pip install wiki_as_base --upgrade
# requirements.txt
wiki_as_base

Basic use

import json
from wiki_as_base import WikitextAsData

wtxt = WikitextAsData().set_pages_autodetect("295916|296167")
wtxt_jsonld = wtxt.output_jsonld()

print(f'Total: {len(wtxt_jsonld["data"])}')

for resource in wtxt_jsonld["data"]:
    if resource["@type"] == "wtxt:Table":
        print("table found!")
        print(resource["wtxt:tableData"])

print("Pretty print full JSON output")

print(json.dumps(wtxt.output_jsonld(), ensure_ascii=False, indent=2))

Cache remote requests locally

TODO: port the requests-cache approach (local SQLite cache database) used on https://github.com/fititnt/openstreetmap-serverless-functions/blob/main/function/wiki-as-base/handler.py .
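Until that lands, here is a minimal sketch of the same idea using the requests-cache package. This is not wired into wiki_as_base; it only illustrates a local SQLite HTTP cache, and the query below is a plain MediaWiki API request used as an example.

import requests_cache

# Cache HTTP responses in a local SQLite database for 23 hours (82800 seconds),
# mirroring the CACHE_TTL value suggested earlier in this README.
session = requests_cache.CachedSession(
    "wiki_as_base_cache",  # creates wiki_as_base_cache.sqlite
    backend="sqlite",
    expire_after=82800,
)

response = session.get(
    "https://wiki.openstreetmap.org/w/api.php",
    params={
        "action": "query",
        "titles": "User:EmericusPetro/sandbox/Wiki-as-base",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    },
)

# .from_cache is added by requests-cache; False on the first call, True afterwards
print(response.from_cache, response.status_code)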

Save inferred data as individual files

import sys
import zipfile
from wiki_as_base import WikitextAsData

wtxt = WikitextAsData().set_pages_autodetect("295916|296167")

# Both output_jsonld() and output_zip() call prepare() (which actually
# makes the remote request) plus is_success() on demand.
# However, the pythonic way would be try/except.
if not wtxt.prepare().is_success():
    print("error")
    print(wtxt.errors)
    sys.exit(1)

wtxt.output_zip("/tmp/wikitext.zip")

# Using Python's zipfile.ZipFile, you can process the file with Python
zip_file = zipfile.ZipFile("/tmp/wikitext.zip")

print("Files inside the zip:")
print(zip_file.namelist())

# @TODO improve this example on future releases
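# A small, optional extension (not part of the original example): read one of
# the extracted entries back, assuming the entries are plain-text data files.
first_name = zip_file.namelist()[0]
print(f"Contents of {first_name}:")
print(zip_file.read(first_name).decode("utf-8", errors="replace"))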

The JSON-LD Specification

NOTE: work in progress.

https://wtxt.etica.ai/

Disclaimer / Trivia

The wiki_as_base library allows simpler, not-as-complete data extraction from MediaWiki markup text, directly via the MediaWiki API or from direct input, without the need to install server extensions.

Check also wikimedia/Wikibase, a full server version (which inspired the name).

License

Public domain

wiki_as_base-py's People

Contributors

fititnt

wiki_as_base-py's Issues

Implement pagination for cli interface

While the current version only allows users to select pages one by one, the next release would also detect the "Category:" prefix and make an implicit call for the user. The MediaWiki API defaults to 50 pages at once (500 for users with special limits, which we still do not support, but eventually could).

However, since a "Category:" may sometimes load a bit more than 50 pages, we get an error like:

{
      "error": {
        "code": "toomanyvalues",
        "info": "Too many values supplied for parameter \"pageids\". The limit is 50.",
        "limit": 50,
        "lowlimit": 50,
        "highlimit": 500,
        "docref": "See https://wiki.openstreetmap.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
      }
}

So this issue is about making the default behavior ignore the case where the user adds more than the typical low limit of pages, and then allowing pagination via additional CLI requests. One limitation is that (unless cache is enabled) the next request would also ask again for the categories and then load up to the next portion of the pages.

Anyway, the implementation of "#1", even if by default it would have a short limit, makes sense, because the limit for pages in one category allows up to 500 even for non-authenticated requests.
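A minimal sketch of the batching idea behind this issue (the helper below is illustrative, not part of the current API; only the 50-item limit comes from the error above):

def chunk_pageids(pageids, batch_size=50):
    """Split a list of pageids into API-sized batches (50 by default)."""
    for start in range(0, len(pageids), batch_size):
        yield pageids[start : start + batch_size]

# Illustrative use: each batch becomes one "pageids" value, joined with "|",
# which is how the MediaWiki API expects multiple values.
all_pageids = ["295916", "296167"]  # imagine hundreds, resolved from a Category:
for batch in chunk_pageids(all_pageids):
    print("|".join(batch))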

v0.5.x code refactoring

The current version released as the pip package, v0.5.5, works. However, it might be worth refactoring to use the Python package https://github.com/5j9/wikitextparser instead of continuing to redo everything from scratch.

The parsing of tables somewhat works, but we still have not implemented corner cases. Also, the Wiki Templates parsing (a.k.a. the infoboxes) is even less complete than the table parsing, and too hardcoded to exact cases.

So the idea here is to do some code refactoring. The CLI interface and maybe even the library will not change, but the lower-level internals will.
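For reference, a minimal sketch of what wikitextparser already offers for tables and templates (this shows the third-party package's API, not the current wiki_as_base internals):

import wikitextparser as wtp

wikitext = """
{| class="wikitable"
! name !! value
|-
| a || 1
|}
{{Infobox example|key=value}}
"""

parsed = wtp.parse(wikitext)

# Tables are returned as objects whose data() method yields rows as lists
for table in parsed.tables:
    print(table.data())  # e.g. [['name', 'value'], ['a', '1']]

# Templates expose their name and arguments
for template in parsed.templates:
    print(template.name.strip(),
          [(a.name.strip(), a.value.strip()) for a in template.arguments])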
