mcwiki

A scraping library for the Minecraft Wiki.

import mcwiki

page = mcwiki.load("Data Pack")
print(page["pack.mcmeta"].extract(mcwiki.TREE))
[TAG_Compound]
The root object.
└─ pack
   [TAG_Compound]
   Holds the data pack information.
   ├─ description
   │  [TAG_String, TAG_List, TAG_Compound]
   │  A JSON text that appears when hovering over the data pack's name in
   │  the list given by the /datapack list command, or when viewing the pack
   │  in the Create World screen.
   └─ pack_format
      [TAG_Int]
      Pack version: If this number does not match the current required
      number, the data pack displays a warning and requires additional
      confirmation to load the pack. Requires 4 for 1.13–1.14.4. Requires 5
      for 1.15–1.16.1. Requires 6 for 1.16.2–1.16.5. Requires 7 for 1.17.

Introduction

The Minecraft Wiki is a well-maintained source of information but is a bit too organic to be used as anything more than a reference. This project tries its best to make it possible to locate and extract the information you're interested in and use it as a programmatic source of truth for developing Minecraft-related tooling.

Features

  • Easily navigate through page sections
  • Extract paragraphs, code blocks and recursive tree-like hierarchies
  • Create custom extractors or extend the provided ones

Installation

The package can be installed with pip.

$ pip install mcwiki

Getting Started

The load function allows you to load a page from the Minecraft Wiki. The page can be specified by providing a URL or simply the title of the page.

mcwiki.load("https://minecraft.fandom.com/wiki/Data_Pack")
mcwiki.load("Data Pack")

You can use the load_file function to read a locally downloaded page, or the from_markup function if you already have the HTML loaded in a string.

mcwiki.load_file("Data_Pack.html")
mcwiki.from_markup("<!DOCTYPE html>\n<html ...")
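The two entry points can be combined if you prefer to handle the file yourself. A minimal sketch, assuming a local copy saved as Data_Pack.html: reading the markup manually and passing it to from_markup should behave like calling load_file directly.

import mcwiki

# Read a previously saved copy of the page and parse it from a string.
with open("Data_Pack.html", encoding="utf-8") as f:
    page = mcwiki.from_markup(f.read())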

Page sections can then be manipulated like dictionaries. Keys are case-insensitive and are associated with subsections.

page = mcwiki.load("https://minecraft.fandom.com/wiki/Advancement/JSON_format")

print(page["List of triggers"])
<PageSection ['minecraft:bee_nest_destroyed', 'minecraft:bred_animals', ...]>
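Since keys are case-insensitive, any capitalization resolves to the same subsection. A quick sketch continuing with the page loaded above:

# Any capitalization of the key refers to the same subsection.
print(page["list of triggers"])
print(page["LIST OF TRIGGERS"])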

Extracting Data

There are 4 built-in extractors. Extractors are instantiated with a CSS selector and define a process method that produces an item for each element returned by the selector.

Extractor    Type                   Extracted Item
PARAGRAPH    TextExtractor("p")     String containing the text content of a paragraph
CODE         TextExtractor("code")  String containing the text content of a code span
CODE_BLOCK   TextExtractor("pre")   String containing the text content of a code block
TREE         TreeExtractor()        An instance of mcwiki.Tree containing the treeview data
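Because extractors are plain instances configured with a CSS selector, you can build your own by reusing the classes shown in the table. The sketch below is an assumption rather than documented API: the mcwiki.TextExtractor import path comes from the table above, and the "li" selector is purely illustrative.

import mcwiki

# Hypothetical extractor targeting list items; "li" is an illustrative selector.
LIST_ITEM = mcwiki.TextExtractor("li")

page = mcwiki.load("Data Pack")
for item in page.extract_all(LIST_ITEM, limit=5):
    print(item)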

Page sections can invoke extractors by using the extract and extract_all methods. The extract method will return the first item in the page section or None if the extractor couldn't extract anything.

print(page.extract(mcwiki.PARAGRAPH))
Custom advancements in data packs of a Minecraft world store the advancement data for that world as separate JSON files.

You can use the index argument to specify which paragraph to extract.

print(page.extract(mcwiki.PARAGRAPH, index=1))
All advancement JSON files are structured according to the following format:

The extract_all method will return a lazy sequence-like container of all the items the extractor could extract from the page section.

for paragraph in page.extract_all(mcwiki.PARAGRAPH):
    print(paragraph)

You can use the limit argument or slice the returned sequence to limit the number of extracted items.

# Both yield exactly the same list
paragraphs = page.extract_all(mcwiki.PARAGRAPH)[:10]
paragraphs = list(page.extract_all(mcwiki.PARAGRAPH, limit=10))

Tree Structures

The TREE extractor returns recursive tree-like hierarchies. You can use the children property to iterate through the direct children of a tree.

def print_nodes(tree: mcwiki.Tree):
    for key, node in tree.children:
        print(key, node.text, node.icons)
        print_nodes(node.content)

# "section" is any previously retrieved PageSection, e.g. page["pack.mcmeta"]
print_nodes(section.extract(mcwiki.TREE))

Folded entries are automatically fetched, inlined, and cached. This means that iterating over the children property can yield a node that's already been visited, so make sure to handle infinite recursion where appropriate.
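One way to guard against cycles is to keep track of the nodes you've already visited. Below is a minimal sketch of the idea; the id()-based bookkeeping is an assumption on top of mcwiki, not part of the library.

def print_nodes_safely(tree: mcwiki.Tree, seen=None):
    # Remember which nodes were already visited so an inlined folded entry
    # pointing back to an ancestor doesn't cause infinite recursion.
    seen = set() if seen is None else seen
    for key, node in tree.children:
        if id(node) in seen:
            continue
        seen.add(id(node))
        print(key, node.text, node.icons)
        print_nodes_safely(node.content, seen)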

Tree nodes have 3 attributes that can all be empty (see the sketch after this list):

  • The text attribute holds the text content of the node
  • The icons attribute is a tuple that stores the names of the icons associated with the node
  • The content attribute is a tree containing the children of the node
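For example, the sketch below (assuming the Data Pack page shown at the top of this README) prints those attributes for each direct child of the pack.mcmeta tree:

import mcwiki

page = mcwiki.load("Data Pack")
tree = page["pack.mcmeta"].extract(mcwiki.TREE)

for key, node in tree.children:
    print(key)           # e.g. "pack" in the example at the top of this README
    print(node.text)     # text content of the node (may be empty)
    print(node.icons)    # tuple of icon names (may be empty)
    print(node.content)  # nested mcwiki.Tree holding the node's children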

You can transform the tree into a shallow dictionary with the as_dict method.

# Both yield exactly the same dictionary
nodes = tree.as_dict()
nodes = dict(tree.children)

Contributing

Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses poetry.

$ poetry install

You can run the tests with poetry run pytest.

$ poetry run pytest

The project must type-check with pyright. If you're using VSCode, the Pylance extension should report diagnostics automatically. You can also install the type checker locally with npm install and run it from the command line.

$ npm run watch
$ npm run check

The code follows the black code style. Import statements are sorted with isort.

$ poetry run isort mcwiki tests
$ poetry run black mcwiki tests
$ poetry run black --check mcwiki tests

License - MIT
