Hi. I am opening this as a general issue, but describing it through the example of

How to write an extension that looks at raw HTML? about markdown HOT 13 CLOSED

oprypin commented on June 9, 2024

How to write an extension that looks at raw HTML?

from markdown.

Comments (13)

waylan commented on June 9, 2024

If I were to ignore your various explanations and simply answer the question in the title, I would give the following answer:

How you access raw HTML would depend on when you want to process it and what you want to do with it. One option is to override/replace the existing preprocessor. Another is to access the HTML fragments stored in the stash. And the third is to use a postprocessor after the final HTML output has been rendered.

And now for some more specific information...

The library has always parsed raw HTML as a preprocessor (even years before I took over). In the most recent refactor, I first attempted to incorporate HTML parsing into the blockparser. However, that would have required some backward incompatible changes to the existing extension API so it was abandoned as compatibility with existing extensions was deemed more important than a better internal implementation. Believe me, I would really prefer the changes to both (the blockparser and the htmlparser), but we would rather not break every third-party extension out there that uses a block processor. And so we are stuck with what we have.

In most cases when someone wants to take action on HTML, we have always directed them to use a post-processor as they generally want to operate on the final rendered document. However, I get the sense that is not the case here. Although, as I don't know what you are trying to do with ids, maybe a regex which finds all instances of id="sometext" in a postprocessor could work. But I doubt it. Presumably you would also want the element it is connected with.
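A minimal sketch of the postprocessor-plus-regex idea mentioned above. The names `IdScanPostprocessor`, `IdScanExtension`, `id_scan` and the `collected_ids` attribute are made up for illustration, and the priority of 0 is an assumption chosen so the scan runs after the built-in postprocessors have restored the raw HTML. The regex only looks for double-quoted `id="..."` attributes and knows nothing about the element they belong to.

```python
import re

import markdown
from markdown.extensions import Extension
from markdown.postprocessors import Postprocessor

# Matches id="..." but not data-id="..." or similar; double quotes only.
ID_RE = re.compile(r'(?<![\w-])id\s*=\s*"([^"]*)"')


class IdScanPostprocessor(Postprocessor):
    """Record every id attribute found in the final rendered HTML."""

    def run(self, text):
        # Hypothetical attribute; Markdown itself does not define it.
        self.md.collected_ids = ID_RE.findall(text)
        return text


class IdScanExtension(Extension):
    def extendMarkdown(self, md):
        # A low priority makes this run last, after raw HTML is swapped back in.
        md.postprocessors.register(IdScanPostprocessor(md), 'id_scan', 0)


md = markdown.Markdown(extensions=[IdScanExtension()])
html = md.convert('<div id="raw-block">hello</div>\n\nA paragraph.')
print(md.collected_ids)
```

This sees ids from both raw HTML and rendered Markdown, because it only looks at the final string.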

The issue with getting the element is that that would require additional HTML parsing. Even when I attempted to incorporate HTML parsing into the blockparser, each block was still going to be saved to the stash. One of the things about Markdown is that it does not make any changes to a block of raw HTML. Even all insignificant whitespace is preserved. However, if we parsed the raw HTML into etree Element objects, then that would become more difficult to ensure.

And so, the best way for you to access the individual blocks of raw HTML is to get them from the instance of the HtmlStash which is on the Markdown class as Markdown.htmlStash. Or perhaps you could replace the RawHtmlPostprocessor and intercept the blocks as they are being placed back into the document.

As an aside, I had previously considered using a blockparser to place the stashed items back into the document tree as raw blocks. However, this would require creating a custom etree Element type and would require the serializer to special case that type by having it skip HTML escaping. As some users are defining their own serializers, that could break things for them and so it was never attempted.

The final option is for you to replace the preprocessor. Of course, that is supported as we do that ourselves. I'll have to look more closely at your proposals in that regard later and follow up.


oprypin commented on June 9, 2024

I am aiming to only look at HTML that the user explicitly wrote. And I don't need it as an element. Indeed it would be enough for my case to determine that id="sometext" is present anywhere.
I will not consider parsing HTML with regexp as an option.


oprypin commented on June 9, 2024

I haven't considered the possibility of looking inside htmlStash, that could be interesting.
Still requires an extra pass of parsing but maybe it's ok


waylan commented on June 9, 2024

So it sounds like you just want a list of all ids in the final document and nothing more. Yeah, I would be inclined to get them from a treeprocessor (from rendered Markdown) and from the stash (from raw HTML). The question is which way to access the stash. (1) You could iterate over the items in md.htmlStash.rawHtmlBlocks (I see that attribute is not documented; we should probably fix that) after the last of the treeprocessors has completed. Or (2) you could override the RawHtmlPostprocessor and record the id for any elements swapped back into the document.
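Option (1) could look roughly like the sketch below. It assumes the stash is read right after `convert()` and before any `reset()`, and it uses the standard library `html.parser` for the extra parsing pass. `IdCollector` and `ids_from_stash` are illustrative names, not part of Python-Markdown. Non-string stash items are skipped, since some extensions may store other objects there.

```python
from html.parser import HTMLParser

import markdown


class IdCollector(HTMLParser):
    """Collect the value of every id attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        for name, value in attrs:
            if name == 'id' and value:
                self.ids.append(value)


def ids_from_stash(md):
    """Parse every string item in md.htmlStash.rawHtmlBlocks for ids."""
    collector = IdCollector()
    for block in md.htmlStash.rawHtmlBlocks:
        if isinstance(block, str):
            collector.feed(block)
    collector.close()
    return collector.ids


md = markdown.Markdown()
md.convert(
    '<div id="raw-block">hello</div>\n\n'
    'Text with <span id="inline">raw</span> HTML.'
)
print(ids_from_stash(md))
```

Note that this reads everything in the stash, whether or not it ends up in the final document.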

Of the two, I might suggest the latter. It is possible that an item in the stash will not end up in the final document. After all, other third-party extensions have been known to use the stash for other non-HTML strings. Even the codehilite extension uses the stash, although that is HTML which could conceivably contain ids; so you would want that. In any event, when we swap out the placeholders for the raw text, we find the placeholders in the document and then look them up in the stash. We do not iterate through the stash and then look for them in the document as we can't be sure every placeholder will exist in the final rendered document. The point is, by overriding RawHtmlPostprocessor you can ensure that only ids which get included in the final document will be collected.
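One way to approximate option (2) without subclassing RawHtmlPostprocessor is sketched here: a separate postprocessor that runs just before the built-in `raw_html` one and only looks up stash items whose placeholders are actually present in the document. This is a variation on the suggestion, not the override itself. The priority of 35 assumes `raw_html` is registered at 30, and `UsedIdsPostprocessor`, `UsedIdsExtension` and `raw_html_ids` are hypothetical names. Placeholders nested inside other stash items are not followed.

```python
from html.parser import HTMLParser

import markdown
from markdown.extensions import Extension
from markdown.postprocessors import Postprocessor
from markdown.util import HTML_PLACEHOLDER_RE


class IdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'id' and value:
                self.ids.append(value)


class UsedIdsPostprocessor(Postprocessor):
    """Collect ids only from stash items whose placeholder is in the text."""

    def run(self, text):
        collector = IdCollector()
        blocks = self.md.htmlStash.rawHtmlBlocks
        for match in HTML_PLACEHOLDER_RE.finditer(text):
            index = int(match.group(1))
            if index < len(blocks) and isinstance(blocks[index], str):
                collector.feed(blocks[index])
        collector.close()
        self.md.raw_html_ids = collector.ids  # hypothetical attribute
        return text


class UsedIdsExtension(Extension):
    def extendMarkdown(self, md):
        # Assumed to be higher than raw_html's priority, so this runs first.
        md.postprocessors.register(UsedIdsPostprocessor(md), 'used_ids', 35)


md = markdown.Markdown(extensions=[UsedIdsExtension()])
md.htmlStash.store('<i id="unused"></i>')  # stashed but never placed
html = md.convert('<div id="used">x</div>')
print(md.raw_html_ids)
```

The item stored by hand stands in for a stash entry that some other extension added but never put into the document.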

That said, as you note, either of the above would require additional parsing of the HTML. Of course, to avoid that one could conceivably override the preprocessor with its HTML parser and simply inject some code to collect all ids. In fact, your example is exactly how I would do this myself if that was the method I chose:

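from markdown.htmlparser import HTMLExtractor
from markdown.preprocessors import Preprocessor

class MyParser(HTMLExtractor):
    def __init__(self, md, *args, **kwargs):
        # Per-instance list; a class-level list would be shared across parsers.
        self.idcollection = []
        super().__init__(md, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so look the id up via a dict
        attrs_dict = dict(attrs)
        if attrs_dict.get('id'):
            self.idcollection.append(attrs_dict['id'])

        return super().handle_starttag(tag, attrs)

class MyHtmlBlockPreprocessor(Preprocessor):
    def run(self, lines):
        parser = MyParser(self.md)
        parser.feed('\n'.join(lines))
        parser.close()
        self.md.idcollection = parser.idcollection
        return ''.join(parser.cleandoc).split('\n')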

I will note that even if we provided a way to override the parser class in a subclass, you would still need to override the run method to get the ids anyway (second to last line in example above). And the run method is only a few lines of code, so that's not a big deal (in fact, our own subclass in the md_in_html extension needs to do the same). I suppose your parser could even work completely differently and return its result using a different API. You have complete flexibility there.

However, as noted above, this method could result in ids in your collection which do not end up in the final document. In the simple cases, that should never happen, but I don't make any assumptions about what third-party extensions a user may be using or what weird input they provide that could result in unexpected output.


oprypin commented on June 9, 2024

Thanks.

> I will note that even if we provided a way to override the parser class in a subclass, you would still need to override the run method to get the ids anyway (second to last line in example above). And the run method is only a few lines of code, so that's not a big deal (in fact, our own subclass in the md_in_html extension needs to do the same). I suppose your parser could even work completely differently and return its result using a different API. You have complete flexibility there.

I would like to point out 2 nuances here.

1. You wouldn't have to override the run method with my proposal. Not that it's a big deal indeed, but yes.
2. This still requires two passes of HTML parsing (the original behavior and my new class separately); however, if the class is overridable, then I can find a way to tack additional behavior onto the existing HTMLParser subclass.

But if we made a small edit to Python-Markdown, like this:
oprypin@c39fb9b
(to make the choice of the HTMLExtractor class overridable by a variable), then I could indeed replace that class with a class that does the same work but also has a handle_starttag method that does a bit of additional work.


waylan commented on June 9, 2024

> You wouldn't have to override the run method with my proposal. Not that it's a big deal indeed, but yes..

Yes you would. It would be necessary to obtain the collected ids from the parser as I demonstrate in my example. I'm not sure how else you would do it.

> This still requires two passes of HTML parsing (original behavior and my new class separately) however if the class is overridable then I can find a way to tack additional behavior onto the existing HTMLParser subclass

No, just use your subclass of the HTML parser. It will only get parsed once.


oprypin commented on June 9, 2024

Again, the nuance that is being missed is that md.preprocessors['html_block'] is not necessarily an instance of markdown.htmlparser.htmlparser.HTMLParser. It could have been replaced by the md_in_html extension. So forcefully overwriting it with an extended version of that specific class will definitely lose functionality. So I have to detect md_in_html and provide yet another pair of subclasses. Or maybe there's yet another subclass from another unknown extension, which I can't predict at all.

So please compare my attempt of implementing these 2×2 subclasses - very verbose and with a limitation:

oprypin/mkdocs@cf17645

Versus a small override if the ability to do so is unlocked via oprypin@c39fb9b:

oprypin/mkdocs@8ec1eb8

(and yes, I checked that the code actually works)


waylan commented on June 9, 2024

I finally had time to look at this. Thank you for your patience.

> Again the nuance that is being missed is that md.preprocessors['html_block'] is not necessarily an instance of markdown.htmlparser.htmlparser.HTMLParser. It could have been replaced by the md_in_html extension.

Actually, that is my point. It could also be replaced by something else entirely. We don't special case for built-in extensions in the core. The assumption in the core is that anything could replace md.preprocessors['html_block'] including something which uses a completely different way to parse HTML.

True, your proposed change to the core won't break anything in the library, but it creates a false comfort that extension devs can assume that the preprocessor will always make use of a subclass of the standard lib htmlparser. That is what is giving me pause.

In fact, your example usage has no safeguards against such a thing. Of course, it wouldn't be terribly difficult to add them, but you only did so in the counterexample. The fact that you didn't bother to implement the same safeguards when using your proposed change suggests to me that my concern is well founded.


oprypin commented on June 9, 2024

Hmm, what you say is a safeguard was actually a limitation. I didn't "not bother"; instead I was claiming that it is totally generic and that the limitation doesn't need to exist. You're right that it's not fully safe to assume that it's generic, but I thought that any subclasses should be compatible.


waylan commented on June 9, 2024

> I thought that any subclasses should be compatible.

But you are not checking that you are working with a subclass of markdown.preprocessors.HtmlBlockPreprocessor. In fact, another extension could have replaced that class with a completely different preprocessor class which works in a completely different way. And even if it was a subclass, what's to stop the extension dev from doing something incompatible?

For that matter, an extension could have completely removed any reference to an html_block preprocessor altogether. I know from experience that a subset of users are in that situation, which would result in a KeyError for both of your example implementations. The fact that you didn't account for this reinforces the idea that you are making too many assumptions about what other extensions are doing.
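The kind of safeguard being discussed could be sketched as follows. `find_html_block_preprocessor` is a hypothetical helper name. The sketch only checks that the registry entry exists and is an instance of the core class; it cannot guarantee that a subclass behaves compatibly, which is exactly the concern raised above.

```python
import markdown
from markdown.preprocessors import HtmlBlockPreprocessor


def find_html_block_preprocessor(md):
    """Return the 'html_block' preprocessor if it looks like the core one.

    Returns None when the entry was removed or replaced by an unrelated
    class, so callers can fall back to a separate parsing pass.
    """
    if 'html_block' not in md.preprocessors:
        return None  # avoids the KeyError mentioned above
    preprocessor = md.preprocessors['html_block']
    if not isinstance(preprocessor, HtmlBlockPreprocessor):
        return None
    return preprocessor


md = markdown.Markdown()
found_by_default = find_html_block_preprocessor(md)

# Simulate an extension that removed the preprocessor entirely.
md.preprocessors.deregister('html_block')
found_after_removal = find_html_block_preprocessor(md)
print(found_by_default, found_after_removal)
```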


oprypin commented on June 9, 2024

Thanks for the responses.

Nothing that can be done here then, it has to be another pass of parsing HTML.

I already also had an implementation ready that just does that.
mkdocs/mkdocs@3bcc2e1#diff-81f285125d041f8ec1fca83e7d0a170b496aea52e8f288edf3fd9a4d5773bfa4

Just that I was actually getting some really weird import errors only in CI
https://github.com/mkdocs/mkdocs/actions/runs/6843595123/job/18606357439#step:5:88
so I'll need to keep looking into that and somehow fix it.

    import markdown.htmlparser
ModuleNotFoundError: No module named 'markdown.htmlparser'


oprypin commented on June 9, 2024

I have opened another thread to try to get some help - still no idea why in CI it always says "No module named 'markdown.htmlparser'"

mkdocs/mkdocs#3504


oprypin commented on June 9, 2024

Ah I figured it out 🤦 - the minimum required version of Markdown for this is 3.3 but I had an older version configured for the "minimum requirements" test
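For anyone hitting the same error, a small check like the one below can make the failure clearer than a bare ModuleNotFoundError. It uses only the standard library to test whether the installed Markdown ships `markdown.htmlparser`. `has_markdown_htmlparser` is an illustrative name.

```python
import importlib.util


def has_markdown_htmlparser():
    """Return True if the installed Markdown provides markdown.htmlparser."""
    try:
        return importlib.util.find_spec('markdown.htmlparser') is not None
    except ModuleNotFoundError:
        # Raised when the parent package 'markdown' is not installed at all.
        return False


if not has_markdown_htmlparser():
    print('markdown.htmlparser is unavailable; Markdown 3.3 or newer is needed')
```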
