Comments (6)
Thanks for the report. This appears to be an issue with the abbr extension. In fact, the behavior can be replicated using that extension only. Specifically, the extension correctly identifies the line of input as an abbreviation definition and tries to build a regex to match instances of ^1^ in the document. The regex it builds then fails to compile and raises an error.
Obviously, the caret (^) has special meaning in regex, so it would need to be escaped to match the actual character. In fact, the code accounts for the fact that a user could include anything in an abbreviation and wraps each character in a character set ([]). In other words, the regex for the abbreviation HTML would be [H][T][M][L]. As it turns out, there are only 4 characters which have special meaning in character sets (^, \, -, and ]) and it appears we do not do anything to account for them.
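To make the failure concrete, here is a minimal sketch of the wrapping scheme described above. The helper names naive_pattern and compiles are made up for illustration and are not the extension's actual code; the sketch only assumes that each character is placed in its own character set.

```python
import re


def naive_pattern(abbr):
    # Hypothetical helper: wrap every character in its own character set,
    # with no escaping at all.
    return "".join("[" + ch + "]" for ch in abbr)


def compiles(pattern):
    # Report whether the pattern is accepted by Python's regex compiler.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


for abbr in ("HTML", "^1^"):
    pattern = naive_pattern(abbr)
    print(abbr, pattern, compiles(pattern))
```

For ^1^ the generated pattern is [^][1][^]. The first [^ opens a negated set, so the ] right after it is read as a literal member rather than the closing bracket, and the final set is never closed.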
We have a few options to address this:
- Retain the current method of constructing regex but backslash escape the 4 special characters in character sets.
- Utilize a different method of constructing regex. For example, we could abandon character sets and do literal characters; either backslash escaping every character (\H\T\M\L) or only backslash escaping special characters listed here. Or maybe some other completely different approach.
- Restrict the characters allowed in abbreviations to exclude most punctuation.
Option 3 would be simple in that it would mostly eliminate the possibility of special characters being used in the generated regex. But it could break users' existing documents, as the current permissive approach has been in place for over a decade. And I say it would only "mostly eliminate" the issue because we would likely want to allow - at a minimum, so we would still need to do some escaping. Option 2 would be more dramatic than option 1 and is listed for completeness. We might want to go that way if it were more performant or provided some other significant benefit. However, if we are just fixing the immediate bug, then option 1 seems like the best choice.
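As a rough illustration of option 1, the sketch below keeps the one-character-per-set construction but backslash escapes the four characters that are special inside a set. The names SPECIAL_IN_SET and escaped_set_pattern are hypothetical and chosen only for this example.

```python
import re

# The four characters with special meaning inside a character set.
SPECIAL_IN_SET = set("^\\-]")


def escaped_set_pattern(abbr):
    # Hypothetical helper: one character set per character, with a
    # backslash added in front of any character that is special in a set.
    parts = []
    for ch in abbr:
        if ch in SPECIAL_IN_SET:
            parts.append("[\\" + ch + "]")
        else:
            parts.append("[" + ch + "]")
    return "".join(parts)


pattern = escaped_set_pattern("^1^")
print(pattern)
print(bool(re.fullmatch(pattern, "^1^")))
```

With this change the pattern for ^1^ compiles and matches the literal text, while patterns for ordinary abbreviations such as HTML are unchanged.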
>>> import markdown
>>> txt = "*[^1^]: This is going to crash if extra extension is enabled"
>>> markdown.markdown(txt, extensions=['abbr'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\code\md\markdown\core.py", line 482, in markdown
return md.convert(text)
File "C:\code\md\markdown\core.py", line 357, in convert
root = self.parser.parseDocument(self.lines).getroot()
File "C:\code\md\markdown\blockparser.py", line 117, in parseDocument
self.parseChunk(self.root, '\n'.join(lines))
File "C:\code\md\markdown\blockparser.py", line 136, in parseChunk
self.parseBlocks(parent, text.split('\n\n'))
File "C:\code\md\markdown\blockparser.py", line 158, in parseBlocks
if processor.run(parent, blocks) is not False:
File "C:\code\md\markdown\extensions\abbr.py", line 61, in run
AbbrInlineProcessor(self._generate_pattern(abbr), title), 'abbr-%s' % abbr, 2
File "C:\code\md\markdown\extensions\abbr.py", line 94, in __init__
super().__init__(pattern)
File "C:\code\md\markdown\inlinepatterns.py", line 297, in __init__
self.compiled_re = re.compile(pattern, re.DOTALL | re.UNICODE)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\re.py", line 250, in compile
return _compile(pattern, flags)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\re.py", line 302, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 834, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\wlimberg\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 549, in _parse
raise source.error("unterminated character set",
re.error: unterminated character set at position 17
from markdown.
On second thought, we could go with option 3 and restrict abbreviations so that they cannot include the 3 characters ^, \, and ]. As it is, those 3 characters would never have worked, so we aren't breaking anyone's existing documents. And then we don't need to worry about escaping them. While the backslash is a special character in Markdown, it does not make sense as part of an abbreviation, so not permitting it there should be fine. In fact, we have never supported escaping characters in abbreviations, which also means that a ] would never end up in one (actually, it already is restricted).
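The restriction can be sketched as a definition pattern that simply refuses to match when the abbreviation contains one of the three characters. The regex ABBR_DEF and the function parse_definition below are assumptions written for this example; they are not the extension's real definition pattern.

```python
import re

# Hypothetical definition pattern for lines like `*[ABBR]: title`.
# The abbreviation may contain anything except `]`, `^`, and `\`.
ABBR_DEF = re.compile(r"^\*\[(?P<abbr>[^\]\^\\]+)\]:[ ]*(?P<title>.*)$")


def parse_definition(line):
    # Return (abbr, title) for a valid definition, or None otherwise.
    match = ABBR_DEF.match(line)
    if match is None:
        return None
    return match.group("abbr"), match.group("title")


print(parse_definition("*[HTML]: Hyper Text Markup Language"))
print(parse_definition("*[^1^]: This is going to crash"))
```

A line that is not recognized as a definition would presumably be left in the document as ordinary text, so no pattern is ever built from it.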
Finally, I will note that while the - does have special meaning in character sets, it is treated as a regular character if it is the first or last character of a character set. As we only ever include a single character in a character set, it has worked fine without escaping and should continue to work without modification. Therefore, it does not need to be addressed at all.
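The point about the hyphen is easy to check with Python's re module. In this small sketch, single_char_set is a hypothetical helper that mirrors the one-character-per-set construction.

```python
import re


def single_char_set(ch):
    # Hypothetical helper: a character set containing exactly one character.
    return "[" + ch + "]"


# A lone hyphen is both the first and the last member, so it is literal.
print(bool(re.fullmatch(single_char_set("-"), "-")))

# Between two characters the hyphen forms a range instead.
print(bool(re.fullmatch("[a-c]", "b")))

# At the start or end of a set it is literal again, so "b" is not matched.
print(bool(re.fullmatch("[-ac]", "b")))
print(bool(re.fullmatch("[ac-]", "b")))
```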
from markdown.
I am rather opposed to the chosen solution, on two fronts:
- It is strange to let the regex implementation details dictate the behavior.
- It is really easy to implement the correct solution: replace the _generate_pattern function entirely with just re.escape
from markdown.
- It is strange to let the regex implementation details dictate the behavior.
I will note that there is no change in behavior, which is why I took this route. I simply pushed a bug fix which maintains the existing behavior. Specifically, the change prevents an error from being raised. So the change stands.
- It is really easy to implement the correct solution: replace the _generate_pattern function entirely with just re.escape
Apparently, way back when I first wrote this extension, I didn't realize that re.escape existed. If I had, I would have used it. In any event, a change in behavior would be considered in a separate PR.
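For comparison, a minimal sketch of the re.escape approach suggested above. The function name literal_pattern is hypothetical, and the sketch leaves out whatever boundary handling and group names the real extension wraps around the escaped text.

```python
import re


def literal_pattern(abbr):
    # re.escape backslash escapes every regex metacharacter, so the
    # result matches the abbreviation text literally.
    return re.escape(abbr)


for abbr in ("HTML", "^1^", "C++"):
    pattern = literal_pattern(abbr)
    print(abbr, bool(re.fullmatch(pattern, abbr)))
```

Because re.escape is part of the standard library, this removes the need to track which characters are special in which regex context.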
from markdown.
I just pushed #1449. It still doesn't allow backslashes, but because of their meaning in Markdown, not due to the regex implementation.
from markdown.
Nice, thanks!
from markdown.