Markdown validator about md2conf HOT 8 CLOSED

hunyadi commented on August 14, 2024

Markdown validator

from md2conf.

Comments (8)

hunyadi commented on August 14, 2024

The relevant part of the exception trace appears to be the following:

md2conf.converter.ParseError: Opening and ending tag mismatch: br line 201 and pre, line 201, column 70 (, line 201)

The lxml parser, which md2conf uses internally, is strict; a sequence such as

<pre><br></pre>

will likely be treated as invalid because the parent element <pre> is closed with </pre> before the child element <br> would be closed. The correct way to say this (in this particular context) would be

<pre><br/></pre>

i.e. <br/> is a self-closing element.

Unfortunately, without any content, I can't tell where <br> is coming from. I would recommend that you share a minimal ("non-working") example with all personal information removed but still triggering the exception. This should help us locate (and hopefully eliminate) the root cause.
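
The difference between the two forms can be reproduced with any strict XML parser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made to keep the example dependency-free; lxml's error messages differ), and the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # A strict parser raises on mismatched tags instead of guessing.
        return False


# <br> is opened but never closed before </pre> arrives.
print(is_well_formed("<pre><br></pre>"))   # False

# <br/> opens and closes the element in one token.
print(is_well_formed("<pre><br/></pre>"))  # True
```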

hunyadi commented on August 14, 2024

The likely root cause of the issue has been identified (use of <br> instead of XHTML-compatible <br/>). Further action requires the offending content (with personal information removed).

ojacques commented on August 14, 2024

I had the same issue, with markdown generated by terraform-docs. In converter.py, it looks like it can be fixed by using a more permissive parser (note the recover=True), as per lxml documentation:

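    import lxml.etree as ET  # assumed alias, matching the ET prefix used below

    parser = ET.XMLParser(
        remove_blank_text=True,
        strip_cdata=False,
        load_dtd=True,
        recover=True,  # tolerate malformed markup instead of raising
    )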

What do you think @hunyadi ?

/cc @chell0veck - if you want to give this a try on your own.
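
To see what the recover=True option changes, here is a minimal sketch that parses the same fragment with a strict and a lenient lxml parser. It assumes lxml is installed and omits the other parser options from the snippet above for brevity; the fragment itself is made up for illustration. The recovered tree is printed rather than asserted, since how libxml2 repairs the markup is exactly the part worth inspecting.

```python
from lxml import etree

# Hypothetical fragment resembling terraform-docs output: <br> is never closed.
fragment = "<pre>line one<br>line two</pre>"

# Default (strict) parser: malformed input raises XMLSyntaxError.
strict = etree.XMLParser()
try:
    etree.fromstring(fragment, strict)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print(f"strict parser rejected the fragment: {exc}")

# Lenient parser: libxml2 repairs the tree as best it can.
lenient = etree.XMLParser(recover=True)
root = etree.fromstring(fragment, lenient)
print(etree.tostring(root, encoding="unicode"))

# The errors are not lost; they are recorded on the parser's error log.
for entry in lenient.error_log:
    print(f"recovered from: {entry.message} (line {entry.line})")
```

Printing the error log after a lenient parse is one way to keep the problems visible even when they no longer stop the conversion.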

hunyadi commented on August 14, 2024

I am somewhat concerned about making the parser more permissive in general, as it can lead to accepting malformed input, in which case it's usually a bad idea to have the implementation second-guess the original intent. However, if there are specific cases with terraform-docs that need to be handled, e.g. replacing <br> with <br/>, I am more than happy to incorporate this in a pre-processing step. Can you share a minimal example that fails?
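
A pre-processing step of this kind could look roughly like the sketch below. The function name normalize_br and the regular expression are hypothetical and not part of md2conf; the standard library's xml.etree.ElementTree again stands in for the strict lxml parser. As noted later in the thread, this only addresses the br case and may not be sufficient for real terraform-docs output.

```python
import re
import xml.etree.ElementTree as ET

# Matches <br>, <br > and <BR>, but not the already valid <br/> or <br />.
_BR_PATTERN = re.compile(r"<br\s*>", re.IGNORECASE)


def normalize_br(html: str) -> str:
    """Rewrite HTML-style <br> tags as XHTML-compatible <br/>."""
    return _BR_PATTERN.sub("<br/>", html)


raw = "<pre>first<br>second<BR >third<br/></pre>"
fixed = normalize_br(raw)
print(fixed)  # <pre>first<br/>second<br/>third<br/></pre>

# The normalized fragment now passes a strict parser.
ET.fromstring(fixed)
```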

chell0veck commented on August 14, 2024

Thank you.
We'll try to check with terraform-docs to find out what kind of markdown they are generating.

ojacques commented on August 14, 2024

That makes sense. I already tried replacing <br> with <br/> in the terraform-docs output, but that was only part of the issue, and I bumped into more parsing challenges (attributed parsing).

The thing is that the terraform-docs tool made some implementation choices to ensure that the docs it generates are readable with GitHub and GitLab flavored markdown, which is where it is mostly used. There is an interesting discussion about it in the project, in issue 500.
But the workaround mentioned in that issue (no-html no-anchor) is not enough to pass lxml strict parsing. In that issue, the use case was the same: publishing Terraform module documentation to Confluence (but using another tool).

I tried 3 options:

1. replace br's - not enough, I got into further issues
2. use no-html from terraform-docs, but the doc was not well formatted for my use case: code blocks inside tables were missing line returns
3. make lxml parsing a bit more relaxed: proposal above, which eventually worked for this case, providing an acceptable layout on Confluence

It all depends on the direction you are setting for the md2conf project, @hunyadi. If the goal is to provide a means to publish strict markdown to Confluence, then strict parsing is necessary. If the goal is to offer a Swiss Army knife to publish many types of markdown to Confluence, then relaxing the parsing may be an option.

I personally have 2 use cases: markdown used by mkdocs and especially mkdocs-material theme, and terraform-docs generated markdown. The former uses many markdown extensions, but as md2conf uses pymdownx, this is (mostly) compatible.

Let us know where you want to go.

hunyadi commented on August 14, 2024

Does it help if we switch to lxml.etree.XHTMLParser or lxml.etree.HTMLParser? As a last resort, we can set recover = True on lxml.etree.XMLParser but I would like to have that as a last resort when we have exhausted other options. Lenient parsing has a tendency to hide true document structure errors.

Currently, the parser is called in the function elements_from_strings. This is used to parse the output of pymdownx and to parse Confluence storage format XML documents. We could trust pymdownx to detect document structure errors and assume that whatever pymdownx outputs is OK from our perspective.

Can you confirm that there are no phantom changes with recover = True? In other words, no documents are reported as if they were changed (e.g. due to Confluence reordering tags) unless they have been intentionally changed?
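
One way to reason about phantom changes is to compare documents in a canonical form, so that differences with no semantic meaning do not count as edits. The sketch below is an illustration only and does not reflect how md2conf actually detects changes; the helper same_document is hypothetical, and it relies on xml.etree.ElementTree.canonicalize from the standard library (Python 3.8+).

```python
from xml.etree.ElementTree import canonicalize


def same_document(local_xml: str, remote_xml: str) -> bool:
    """Compare two XML documents after C14N canonicalization.

    Canonicalization sorts attributes and expands empty elements, so
    purely cosmetic differences do not register as changes.
    """
    return canonicalize(local_xml, strip_text=True) == canonicalize(
        remote_xml, strip_text=True
    )


local = '<p class="note" id="x">text</p>'
remote = '<p id="x" class="note">text</p>'  # same content, attributes reordered

print(same_document(local, remote))                 # True
print(same_document(local, "<p>other text</p>"))    # False
```

A lenient parser can undermine this kind of check: if recovery restructures the local tree differently from what the server stores, the two sides may never compare equal.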

ojacques commented on August 14, 2024

> Does it help if we switch to lxml.etree.XHTMLParser or lxml.etree.HTMLParser? As a last resort, we can set recover = True on lxml.etree.XMLParser but I would like to have that as a last resort when we have exhausted other options. Lenient parsing has a tendency to hide true document structure errors.

I have tried both without success: each produced parsing errors. I am not familiar with how to tune those parsers, though, so I may be doing something incorrectly.

> Currently, the parser is called in the function elements_from_strings. This is used to parse the output of pymdownx and to parse Confluence storage format XML documents.

Yes! And the usage of pymdownx was THE argument that made me want to adopt your implementation. For my docs, I use mkdocs material, which leverages pymdownx. I have a branch where I added some extensions and appropriate configuration to enhance the compatibility with markdown written for mkdocs material.

> Can you confirm that there are no phantom changes with recover = True? In other words, no documents are reported as if they were changed (e.g. due to Confluence reordering tags) unless they have been intentionally changed?

I confirm that there are phantom changes and that those problematic pages are republished every time. Not an issue on my end, as it does not seem to affect the Confluence page version (I have to double check this).

Next steps for me:

- submit PRs with enhanced support for pymdownx styled markdown (for example code blocks indented under a bullet list)
- maybe issue a PR for a CLI option to allow parsing recovery at the cost of phantom changes?
