mcwiki

A scraping library for the Minecraft Wiki.

import mcwiki

page = mcwiki.load("Data Pack")
print(page["pack.mcmeta"].extract(mcwiki.TREE))
[TAG_Compound]
The root object.
└─ pack
   [TAG_Compound]
   Holds the data pack information.
   ├─ description
   │  [TAG_String, TAG_List, TAG_Compound]
   │  A JSON text that appears when hovering over the data pack's name in
   │  the list given by the /datapack list command, or when viewing the pack
   │  in the Create World screen.
   └─ pack_format
      [TAG_Int]
      Pack version: If this number does not match the current required
      number, the data pack displays a warning and requires additional
      confirmation to load the pack. Requires 4 for 1.13–1.14.4. Requires 5
      for 1.15–1.16.1. Requires 6 for 1.16.2–1.16.5. Requires 7 for 1.17.

Introduction

The Minecraft Wiki is a well-maintained source of information but is a bit too organic to be used as anything more than a reference. This project tries its best to make it possible to locate and extract the information you're interested in and use it as a programmatic source of truth for developing Minecraft-related tooling.

Features

  • Easily navigate through page sections
  • Extract paragraphs, code blocks and recursive tree-like hierarchies
  • Create custom extractors or extend the provided ones

Installation

The package can be installed with pip.

$ pip install mcwiki

Getting Started

The load function allows you to load a page from the Minecraft Wiki. The page can be specified by providing a URL or simply the title of the page.

mcwiki.load("https://minecraft.fandom.com/wiki/Data_Pack")
mcwiki.load("Data Pack")

You can use the load_file function to read a locally downloaded page, or the from_markup function if you already have the HTML loaded in a string.

mcwiki.load_file("Data_Pack.html")
mcwiki.from_markup("<!DOCTYPE html>\n<html ...")
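The two entry points can be combined if you prefer to handle the file yourself. A minimal sketch, assuming a local copy saved as Data_Pack.html: reading the markup manually and passing it to from_markup should behave like calling load_file directly.

import mcwiki

# Read a previously saved copy of the page and parse it from a string.
with open("Data_Pack.html", encoding="utf-8") as f:
    page = mcwiki.from_markup(f.read())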

Page sections can then be manipulated like dictionaries. Keys are case-insensitive and are associated with subsections.

page = mcwiki.load("https://minecraft.fandom.com/wiki/Advancement/JSON_format")

print(page["List of triggers"])
<PageSection ['minecraft:bee_nest_destroyed', 'minecraft:bred_animals', ...]>
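Since keys are case-insensitive, any capitalization resolves to the same subsection. A quick sketch continuing with the page loaded above:

# Any capitalization of the key refers to the same subsection.
print(page["list of triggers"])
print(page["LIST OF TRIGGERS"])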

Extracting Data

There are 4 built-in extractors. Extractors are instantiated with a CSS selector and define a process method that produces an item for each element returned by the selector.

Extractor    Type                   Extracted Item
PARAGRAPH    TextExtractor("p")     String containing the text content of a paragraph
CODE         TextExtractor("code")  String containing the text content of a code span
CODE_BLOCK   TextExtractor("pre")   String containing the text content of a code block
TREE         TreeExtractor()        An instance of mcwiki.Tree containing the treeview data
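Because extractors are plain instances configured with a CSS selector, you can build your own by reusing the classes shown in the table. The sketch below is an assumption rather than documented API: the mcwiki.TextExtractor import path comes from the table above, and the "li" selector is purely illustrative.

import mcwiki

# Hypothetical extractor targeting list items; "li" is an illustrative selector.
LIST_ITEM = mcwiki.TextExtractor("li")

page = mcwiki.load("Data Pack")
for item in page.extract_all(LIST_ITEM, limit=5):
    print(item)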

Page sections can invoke extractors by using the extract and extract_all methods. The extract method will return the first item in the page section or None if the extractor couldn't extract anything.

print(page.extract(mcwiki.PARAGRAPH))
Custom advancements in data packs of a Minecraft world store the advancement data for that world as separate JSON files.

You can use the index argument to specify which paragraph to extract.

print(page.extract(mcwiki.PARAGRAPH, index=1))
All advancement JSON files are structured according to the following format:

The extract_all method will return a lazy sequence-like container of all the items the extractor could extract from the page section.

for paragraph in page.extract_all(mcwiki.PARAGRAPH):
    print(paragraph)

You can use the limit argument or slice the returned sequence to limit the number of extracted items.

# Both yield exactly the same list
paragraphs = page.extract_all(mcwiki.PARAGRAPH)[:10]
paragraphs = list(page.extract_all(mcwiki.PARAGRAPH, limit=10))

Tree Structures

The TREE extractor returns recursive tree-like hierarchies. You can use the children property to iterate through the direct children of a tree.

def print_nodes(tree: mcwiki.Tree):
    for key, node in tree.children:
        print(key, node.text, node.icons)
        print_nodes(node.content)

# "section" is any previously retrieved PageSection, e.g. page["pack.mcmeta"]
print_nodes(section.extract(mcwiki.TREE))

Folded entries are automatically fetched, inlined, and cached. This means that iterating over the children property can yield a node that's already been visited, so make sure to handle infinite recursion where appropriate.
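One way to guard against cycles is to keep track of the nodes you've already visited. Below is a minimal sketch of the idea; the id()-based bookkeeping is an assumption on top of mcwiki, not part of the library.

def print_nodes_safely(tree: mcwiki.Tree, seen=None):
    # Remember which nodes were already visited so an inlined folded entry
    # pointing back to an ancestor doesn't cause infinite recursion.
    seen = set() if seen is None else seen
    for key, node in tree.children:
        if id(node) in seen:
            continue
        seen.add(id(node))
        print(key, node.text, node.icons)
        print_nodes_safely(node.content, seen)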

Tree nodes have 3 attributes that can all be empty (see the sketch after this list):

  • The text attribute holds the text content of the node
  • The icons attribute is a tuple that stores the names of the icons associated with the node
  • The content attribute is a tree containing the children of the node
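For example, the sketch below (assuming the Data Pack page shown at the top of this README) prints those attributes for each direct child of the pack.mcmeta tree:

import mcwiki

page = mcwiki.load("Data Pack")
tree = page["pack.mcmeta"].extract(mcwiki.TREE)

for key, node in tree.children:
    print(key)           # e.g. "pack" in the example at the top of this README
    print(node.text)     # text content of the node (may be empty)
    print(node.icons)    # tuple of icon names (may be empty)
    print(node.content)  # nested mcwiki.Tree holding the node's children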

You can transform the tree into a shallow dictionary with the as_dict method.

# Both yield exactly the same dictionary
nodes = tree.as_dict()
nodes = dict(tree.children)

Contributing

Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses poetry.

$ poetry install

You can run the tests with poetry run pytest.

$ poetry run pytest

The project must type-check with pyright. If you're using VSCode, the Pylance extension should report diagnostics automatically. You can also install the type checker locally with npm install and run it from the command line.

$ npm run watch
$ npm run check

The code follows the black code style. Import statements are sorted with isort.

$ poetry run isort mcwiki tests
$ poetry run black mcwiki tests
$ poetry run black --check mcwiki tests

License - MIT
