Comments (8)

hunyadi commented on August 14, 2024

The relevant part of the exception trace appears to be the following:

md2conf.converter.ParseError: Opening and ending tag mismatch: br line 201 and pre, line 201, column 70 (, line 201)

The lxml parser, which md2conf uses internally, is strict. A sequence such as

<pre><br></pre>

will likely be treated as invalid because the parent element <pre> is closed with </pre> before the child element <br> has been closed. The correct way to write this (in this particular context) would be

<pre><br/></pre>

i.e. <br/> is a self-closing element.
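The difference can be reproduced in a few lines. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so that it runs without extra packages); both reject mismatched tags, although the wording of the error message differs. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # Raised for errors such as a closing tag that does not match
        # the most recently opened element.
        return False


# <br> is never closed, so </pre> does not match the open element.
print(is_well_formed("<pre><br></pre>"))

# <br/> opens and closes itself, so </pre> matches <pre>.
print(is_well_formed("<pre><br/></pre>"))
```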

Unfortunately, without any content, I can't tell where <br> is coming from. I would recommend that you share a minimum ("non-working") example with all personal information removed but still triggering the exception. This should help us locate (and hopefully eliminate) the root cause.

from md2conf.

hunyadi commented on August 14, 2024

The likely root cause of the issue has been identified (use of <br> instead of XHTML-compatible <br/>). Further action requires offending content (with personal information removed).


ojacques commented on August 14, 2024

I had the same issue with markdown generated by terraform-docs. In converter.py, it looks like it can be fixed by using a more permissive parser (note the recover=True), per the lxml documentation:

    parser = ET.XMLParser(
        remove_blank_text=True,
        strip_cdata=False,
        load_dtd=True,
        recover=True
    )

What do you think @hunyadi ?

/cc @chell0veck - if you want to give this a try on your own.


hunyadi commented on August 14, 2024

I am somewhat concerned about making the parser more permissive in general, as it can lead to accepting malformed input such as <p><b></p></b> in which case it's usually a bad idea to have the implementation second-guess the original intent. However, if there are specific cases with terraform-docs that need to be handled, e.g. replace <br> with <br/>, I am more than happy to incorporate this in a pre-processing step. Can you share a minimum example that fails?
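A pre-processing step of this kind could look roughly like the following. This is only a sketch, not md2conf's implementation: the function name close_br_tags and the regular expression are assumptions, and the standard library's xml.etree.ElementTree again stands in for lxml. It rewrites only bare <br> tags, so genuinely malformed input such as <p><b></p></b> is left alone and is still rejected by a strict parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches <br>, <BR> and <br > but not <br/> or <br />, which are
# already valid. A <br> carrying attributes is not handled here.
_BR_TAG = re.compile(r"<br\s*>", re.IGNORECASE)


def close_br_tags(html: str) -> str:
    """Rewrite bare <br> tags as XHTML-compatible <br/>."""
    return _BR_TAG.sub("<br/>", html)


fixed = close_br_tags("<pre>first<br>second</pre>")
ET.fromstring(fixed)  # parses without raising

# Crossed tags are untouched, so strict parsing still reports them.
try:
    ET.fromstring(close_br_tags("<p><b></p></b>"))
except ET.ParseError as exc:
    print("still rejected:", exc)
```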


chell0veck commented on August 14, 2024

Thank you.
We'll try to check with terraform-docs to find out what kind of markdown they are generating.


ojacques commented on August 14, 2024

That makes sense. I already tried replacing <br> with <br/> in the terraform-docs output, but that was only part of the issue, and I bumped into more parsing challenges (attributed parsing).

The thing is that the terraform-docs tool makes some implementation choices to ensure that the docs it generates are readable with GitHub and GitLab flavored markdown, which is where it is mostly used. There is an interesting discussion about it in the project, in issue 500.
However, the workaround mentioned in that issue (no-html no-anchor) is not enough to pass lxml strict parsing. The use case there was the same: publishing Terraform module documentation to Confluence (but using another tool).

I tried 3 options:

  • replace br's - not enough, I got into further issues
  • use no-html from terraform-docs, but the doc was not well formatted for my use case: code blocks inside tables were missing line returns
  • make lxml parsing a bit more relaxed: proposal above, which eventually worked for this case, providing an acceptable layout on Confluence

It all depends on the direction you are setting for the md2conf project, @hunyadi. If the goal is to provide a means to publish strict markdown to Confluence, then strict parsing is necessary. If the goal is to offer a Swiss Army knife to publish many types of markdown to Confluence, then relaxing the parsing may be an option.

I personally have 2 use cases: markdown used by mkdocs (especially the mkdocs-material theme), and markdown generated by terraform-docs. The former uses many markdown extensions, but as md2conf uses pymdownx, this is (mostly) compatible.

Let us know where you want to go.


hunyadi commented on August 14, 2024

Does it help if we switch to lxml.etree.XHTMLParser or lxml.etree.HTMLParser? As a last resort, we can set recover = True on lxml.etree.XMLParser but I would like to have that as a last resort when we have exhausted other options. Lenient parsing has a tendency to hide true document structure errors.

Currently, the parser is called in the function elements_from_strings. This is used to parse the output of pymdownx and to parse Confluence storage format XML documents. We could trust pymdownx to detect document structure errors and assume that whatever pymdownx outputs is OK from our perspective.

Can you confirm that there are no phantom changes with recover = True? In other words, no documents are reported as if they were changed (e.g. due to Confluence reordering tags) unless they have been intentionally changed?
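One way to think about phantom changes is to compare documents in a canonical form, so that purely cosmetic differences are not reported as edits. The sketch below is an illustration only and does not reflect how md2conf actually compares pages. It uses xml.etree.ElementTree.canonicalize from the standard library (Python 3.8+), and the helper name same_structure is hypothetical. Inputs must be well-formed XML with any namespace prefixes declared.

```python
import xml.etree.ElementTree as ET


def same_structure(old_xml: str, new_xml: str) -> bool:
    """Compare two XML documents after C14N 2.0 canonicalization.

    Canonicalization sorts attributes and writes empty elements as
    explicit start/end pairs, so those differences compare equal.
    strip_text=True also ignores leading/trailing whitespace in text.
    """
    old_canonical = ET.canonicalize(old_xml, strip_text=True)
    new_canonical = ET.canonicalize(new_xml, strip_text=True)
    return old_canonical == new_canonical


# Attribute order and <br/> versus <br></br> are cosmetic differences.
print(same_structure('<p a="1" b="2"><br/></p>', '<p b="2" a="1"><br></br></p>'))

# A change in text content is a real difference.
print(same_structure("<p>old</p>", "<p>new</p>"))
```

If a lenient parser silently restructures a document, the canonical forms of the local and the published page may differ on every run, which is one plausible way for a page to be reported as changed each time.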


ojacques commented on August 14, 2024

> Does it help if we switch to lxml.etree.XHTMLParser or lxml.etree.HTMLParser? As a last resort, we can set recover = True on lxml.etree.XMLParser but I would like to have that as a last resort when we have exhausted other options. Lenient parsing has a tendency to hide true document structure errors.

I tried both without success: each still produced parsing errors. However, I am not familiar with how to tune those parsers, so I may be doing something incorrectly.

> Currently, the parser is called in the function elements_from_strings. This is used to parse the output of pymdownx and to parse Confluence storage format XML documents.

Yes! And the use of pymdownx was THE argument that made me want to adopt your implementation. For my docs, I use mkdocs material, which leverages pymdownx. I have a branch where I added some extensions and the appropriate configuration to improve compatibility with markdown written for mkdocs material.

> Can you confirm that there are no phantom changes with recover = True? In other words, no documents are reported as if they were changed (e.g. due to Confluence reordering

I confirm that there are phantom changes and that those problematic pages are republished every time. Not an issue on my end, as it does not seem to affect the Confluence page version (I have to double check this).

Next steps for me:

  • submit PRs with enhanced support for pymdownx styled markdown (for example code blocks indented under a bullet list)
  • maybe issue a PR for a CLI option to allow parsing recovery at the cost of phantom changes?
