greghendershott / markdown

101.0 10.0 28.0 718 KB

Markdown parser written in Racket.

Racket 72.66% Makefile 0.52% HTML 26.83%

racket markdown-parser

markdown's Introduction

Documentation.

markdown's Issues

reference-style link label between list items parses as separate lists

-> (parse-markdown "1. [A][a]\n\n[a]: google.com\n\n2. item2")
'((ol () (li () (a ((href "google.com")) "A"))) (ol () (li () "item2")))

Same example on babelmark

Looks like babelmark results are mixed. I expected it to parse as one list, but not sure which is correct.

Parse error from header separated by many blank lines

[Reported by @stchang. Moving to its own issue.]

Getting a parse error for two headers separated by more than 3 lines.

> (parse-markdown "# abc\n\n\n\n## def")
parsack.rkt:345:0: parse ERROR: at <string>:5:1:10
unexpected: "non-empty input"
  expected: "end-of-file"
> (parse-markdown "# \n\n\n\n## ")
parsack.rkt:345:0: parse ERROR: at <string>:5:1:7
unexpected: "non-empty input"
  expected: "end-of-file"
> (parse-markdown "# \n\n\n\n\n## ")
parsack.rkt:345:0: parse ERROR: at <string>:6:1:8
unexpected: "non-empty input"
  expected: "end-of-file"
> (parse-markdown "# \n\n\n## ")
'((h1 ((id ""))) (p () " ##"))

Images not parsed correctly if the path contains space

Image links are not parsed correctly if the path contains a space:

![hello](hel lo.jpg)

will not produce an image sexpr.

The CommonMark spec says that the path to the image should conform to the same rules as a regular link. If you want a space in a regular link you should enclose the destination in <>, which would make the above example

![hello](<hel lo.jpg>)

which looks stupid, and also does not work with racket markdown.

Conforming to some kind of spec is of course nice, but wouldn't doing the expected thing (rendering the image anyway) also be nice? GitHub markdown does what I would expect (both examples work), pandoc as well.

Best regards
Linus

Reference style links for images don't seem to be working

Example of what should work:

![caption][url]

[url]: http://path/to/some/image

I thought I was handling this correctly but it doesn't seem to be working. For now, just quickly logging this to look at the code later.

Intra-block formatting not recognized for headings

## Heading **with** _formatting_

should be formatted:

Heading with formatting

but instead is

Heading *_with_* formatting

Add support for footnotes

Example:

Here is a footnote[^1].

...

[^1]: Here is the footnote definition.

Note: If this is in Gruber's spec, I don't see it. But I noticed it in the markdown mode for Emacs.

One spec: http://rephrase.net/box/word/footnotes/syntax/
Another: http://pythonhosted.org/Markdown/extensions/footnotes.html

performance experiments

I've been trying to improve the performance of the markdown parser through parsack improvements but haven't had much luck so far. I think the bracketed style of the grammar simply leads to too much backtracking, which my parser doesn't handle well.

That did give me an idea to profile some examples (ie your tests) to see if I could improve the ordering on any of the <or> options, to cut down on the backtracking.

For example, the following changes produce a decent speedup of perf-test.rkt:

Flip the choices in $normal-char.
Remove the singleton <or> in source-url.
In $inline, drop $smart-punctuation down to above $code.
In $inline, drop the 4+ ... down to above $footnote-ref.

Obvious caveats:

I don't know how representative perf-tests.rkt is.
I know sometimes the order of the <or> choices matters, when the choices are not disjoint, so I don't know if I've actually messed something up. But all the tests do pass.
I do occasionally get a perf warning from random-test.rkt. I guess this is unsurprising since random-test emphasizes special chars?

Some numbers:

Before: non-strict: 3200, strict: 3340
After: non-strict: 2598, strict: 2547

Have you played around with the <or> ordering any?

Parsing HTML can omit spaces

Parse:

<p>X <a>Y</a> Z</p>

Expected:

(p () "X " (a () "Y") " Z")

Actual:

Note lack of spaces:

(p () "X" (a () "Y") "Z")

which as HTML is:

<p>X<a>Y</a>Z</p>

How about adding a table of contents?

Footnotes are already supported, so how about a TOC?

Newlines not preserved in <pre> blocks

<pre>...</pre> blocks need to be detected early on -- as blocks, and before any line-break consolidation that is acceptable or desirable for intra-block elements. (In other words, the fix is not to detect all HTML elements earlier on.)

Parsing after footnote

[^def]: Definition of footnote def.


aaaaa

Parsing fails because $footnote-def doesn't consume the two #\newline characters (I don't know whether that is its responsibility).

diff --git a/markdown/parse.rkt b/markdown/parse.rkt
index 30f2b85..60d29ee 100644
--- a/markdown/parse.rkt
+++ b/markdown/parse.rkt
@@ -805,7 +805,7 @@
             (optional $indent)
             (xs <- (sepBy $raw-lines
                           (try (>> $blank-line $indent))))
-            (optional $blank-line)
+            (many $blank-line)
             (return
              (begin
                (on-footnote-def! label (string-join xs "\n"))

This change solves the problem.

can't nest styles

I can't nest code in bold, italic in bold, bold in italic, etc. It causes the outer style to fail to be interpreted.

Do not require blank line after closing backticks in fenced code blocks

See greghendershott/frog#59

For example it should work as here on GitHub:

```
yo
```
ABC

is:

yo

ABC

need gobs of blank lines with  after code block

I had a code block using triple-backticks followed by a  marker, but unless I had three blank lines between them it failed to be processed correctly.

Generate scribble?

It would be truly awesome if this could generate scribble data types, and maybe it wouldn't be so hard, given the xexpr you're already generating.

Reference links with non-string labels not matched

A link with an empty URI like this:

[label][]

Is a reference link. It must be defined elsewhere in the Markdown:

[label]: www.example.com/

Now that we allow arbitrary markdown in link labels (see #5), if a reference link label is something that's not a simple string, for example the following as a result of adding smart quotes (see #13 ):

[some 'quotes' here][]
...
[some 'quotes' here]: www.example.com/

Then the reference is unresolved, because the reference link definition is still trying to match on the original string "some 'quotes' here", as opposed to what it is after running it through intra-block, namely '("some " lsquo "quotes" rsquo " here").

Pass third-party markdown test suites

See: https://github.com/trentm/python-markdown2/wiki/Testing-Notes

Input for which parser doesn't terminate

Discovered via randomized testing:

#lang racket

(require markdown)

(define input @~a{___*  '   "[<br />

<br />        ipsum'<div>"(<    ")_<]<div>[**<br /> **lorem*>***__```]```***"ipsum
<br />'*)***ipsum___    <br />***    *lorem___*ipsum**<br /></div><div> _</div>    <ipsumipsum &__
ipsum_[   <div>__<div>&ipsum(


(***<div>lorem`]  ___ </div>&lorem("**[ipsum__&   *

loremlorem*<br />lorem]``')___lorem&___''   ``">



```lorem<div>**    </div>[ lorem`]ipsum``<br />**&


__lorem   <br />lorem`*lorem  </div>lorem_ 'lorem```
})

(parse-markdown input)

Doesn't terminate (for at least 30 seconds).

can't nest literal HTML entities in link

I was trying to do:

[\[\[Stuff\]\]](stuff.html)

but since (as Issue #8 says) I can't escape brackets, I tried using the literal HTML entities [ and ] instead, but this caused the link to fail to be correctly interpreted. I worked around it with literal HTML, as always.

Reference link IDs are supposed to be case-INsensitive

http://daringfireball.net/projects/markdown/syntax#link

According to that, IDs are clearly case-insensitive in the [label][ref] case, with an explicit ID ref.

Is an implicit ref ID from the label, e.g. [some _arbitrary_ markdown **here**][] supposed to match case-insensitive, too? I guess so.
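
One way to get case-insensitive matching is to normalize every ID before storing or looking it up. This is only a sketch: normalize-ref-id, add-ref!, and lookup-ref are hypothetical names, not functions from this library, and collapsing runs of whitespace is an extra assumption beyond what the syntax page requires.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical helper: fold case and collapse whitespace so that
;; "Foo Bar" and "foo   BAR" map to the same key.
(define (normalize-ref-id s)
  (string-foldcase (string-normalize-spaces s)))

;; Hypothetical table of reference definitions, keyed by normalized ID.
(define refs (make-hash))

(define (add-ref! id url)
  (hash-set! refs (normalize-ref-id id) url))

;; Returns the URL, or #f when the reference is not defined.
(define (lookup-ref id)
  (hash-ref refs (normalize-ref-id id) #f))
```

The same normalization would apply to an implicit ID taken from the label, provided the label is compared as source text rather than as parsed xexprs.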

en dashes at beginning of line don't work

IIRC the issue here was a line that starts with -- which is supposed to be an en dash. This got ignored and passed through as two dashes instead.

Questions about some corner cases

I'm trying to set up random testing of markdown inputs as well. I still have some kinks to work out but it found one case that I thought I would bring up.

A trailing space at the end of the input gets parsed as br but as " " when it's not at the end. Is this expected?

> (parse-markdown "![UoFEK](pFivY) ")
'((img ((src "pFivY") (alt "UoFEK"))) (br ()))
> (parse-markdown "![UoFEK](pFivY) [ToFEwaE](VYiYppJd)")
'((img ((src "pFivY") (alt "UoFEK"))) " " (a ((href "VYiYppJd")) "ToFEwaE"))

fenced code block close-marker is sensitive to white space

The second example has an extra space at the end and therefore doesn't parse properly. Is this the intended behavior?

$ racket
Welcome to Racket v6.0.0.3.
-> (require markdown)
-> (parse-markdown "```racket\n(define x 10)\n```")
'((pre ((class "brush: racket")) (code () "(define x 10)")))
-> (parse-markdown "```racket\n(define x 10)\n``` ")
'((p () (code () "racket\n(define x 10)")))
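
If trailing whitespace should be tolerated, the close-marker check could look something like the sketch below. fence-close? is a hypothetical predicate written for illustration; it is not how the parser currently recognizes the marker, and it only handles backtick fences.

```racket
#lang racket/base

;; Hypothetical predicate: a closing fence is three backticks followed
;; only by optional spaces or tabs.
(define (fence-close? line)
  (regexp-match? #px"^```[ \t]*$" line))
```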

Open HTML tag at start of a long file causes parser to take forever

I wrote a long blog post using Frog and left an unclosed HTML tag towards the front of the file (<TODO>). When trying to convert the file to HTML the parser hung. It sent my CPU to 100% and didn't finish after 5 min. The memory usage was oscillating around 200MB so it appears to be doing a lot of backtracking.

This sounds related to issue #43, but I wanted to provide a test case I came across.

spaces before list items crash parser

A list of the form

    * blah blah
    * blah blah
    * blah blah

crashes the parser, whereas if the list items have no spaces before them it works fine.

Self-closing HTML tags aren't being recognized as HTML

This:

<img src="img/yunocoros.jpg"/> The topic of coroutines (or
fibers, or continuations) for JavaScript comes up from time to time

does not preserve the <img src="/img/yunocoros.jpg"/> HTML.

The problem seems to be with "self-closing" (if that's the correct terminology) tags.

<img src="foo" />

does not work, whereas

<img src="foo"></img>

does work

cf greghendershott/frog#20

Double-hyphen in URL

Example:

Jay McCarthy posted about a macro to do a [`case` with `break` in Racket](http://jeapostrophe.github.io/2013-06-24-cas-cad--post.html).

Creates an invalid xexpr:

'(p () "Jay McCarthy posted about a macro to do a " (a () ((href "http://jeapostrophe.github.io/2013-06-24-cas-cad" (span mdash) "post.html")) (code () "case") " with " (code () "break") " in Racket") ".")

Note the (span mdash) in the href attribute, from the -- in the URI.

parser cannot handle Windows-style line breaks

I believe the parser does not handle Windows-style line breaks, ie "\r\n", properly. For example, all the tests that rely on test.md fail in Windows.

Here is another example:

> (parse-markdown "[test][1]\n\n[1]:http://test.com \"test-title\"")
'((p () (a ((href "http://test.com") (title "test-title")) "test")))
> (parse-markdown "[test][1]\r\n\r\n[1]:http://test.com \"test-title\"")
Reference link not defined: (linkref "1")
'((a ((href "")) "test") "\r \r [1]:http://test.com " ldquo "test-title" rdquo)
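
Until the parser handles "\r\n" itself, a possible workaround is to normalize line endings before parsing. This is a sketch only: normalize-newlines is a hypothetical helper, and the assumption is that the result is then passed along as (parse-markdown (normalize-newlines input)).

```racket
#lang racket/base
(require racket/string)

;; Hypothetical helper: convert CRLF pairs, then any remaining lone CR,
;; to LF.
(define (normalize-newlines s)
  (string-replace (string-replace s "\r\n" "\n") "\r" "\n"))
```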

Undeclared dependency

raco setup: --- summary of missing dependencies ---
raco setup: undeclared dependency detected
raco setup:   for package: "markdown"
raco setup:   on packages:
raco setup:    "base"
raco setup:    "sandbox-lib"
raco setup:    "scribble-lib"
raco setup:    "srfi-lite-lib"

Parsing of not-quite-malformed Markdown takes a long time

When the markdown parser parses the text below, it takes a long time to do it:

% time .../m2h try-md.md 
Doing try-md.md
'#(#<void>)
.../m2h try-md.md  6.68s user 0.17s system 100% cpu 6.836 total
%

(the m2h script does little more than parse the markdown and then dump it to HTML).

Now, the text here is obviously not very good markdown – it's just a list of filenames – but it is just text, and I would have expected the parser to broadly cope with it, even if the result isn't pretty. The case this is a reduction of consisted of about double this number of lines, and I thought the parser had crashed. Is there perhaps something O(n^2) happening here?

Now, one response to ‘the parser is slow when I give it rubbish markdown’ is ‘well, don't do that, then’. That is perfectly fair. But if there's some way of getting the parser to give up gracefully, that would be nice.

font-util-1.3.0_1   Create an index of X font files in a directory
fontconfig-2.11.0_1,1 XML-based font configuration API for X Windows
fontsproto-2.1.2    Fonts extension headers
freetype2-2.5.2     Free and portable TrueType font rendering engine
gdk-pixbuf2-2.28.2  Graphic library for GTK+
gettext-0.18.3.1    GNU gettext package
git-1.8.5.2         Distributed source code management tool
glib-2.36.3_1       Some useful routines of C programming (current stable versi
gmake-3.82_1        GNU version of 'make' utility
gnome_subr-1.0      Common startup and shutdown subroutines used by GNOME scrip
gnomehier-3.0       A utility port that creates the GNOME directory tree
gobject-introspection-1.36.0_2 Generate interface introspection data for GObject libraries
graphite2-1.2.4     Rendering capabilities for complex non-Roman writing system
gtk-update-icon-cache-2.24.22 Gtk-update-icon-cache utility from the Gtk+ toolkit
gtk3-3.8.8          Gimp Toolkit for X11 GUI (current stable version)
harfbuzz-0.9.25     OpenType text shaping engine
hicolor-icon-theme-0.12 A high-color icon theme shell from the FreeDesktop project
icu-50.1.2          International Components for Unicode (from IBM)
inputproto-2.3      Input extension headers
intltool-0.50.2     Tools to internationalize various kinds of data files
jasper-1.900.1_12   An implementation of the codec specified in the JPEG-2000 s
jbigkit-1.6         Lossless compression for bi-level images such as scanned pa
jpeg-8_4            IJG's jpeg compression utilities
kbproto-1.0.6       KB extension headers
lcms2-2.5           Accurate, fast, and small-footprint color management engine
libICE-1.0.8,1      Inter Client Exchange library for X11
libSM-1.2.2,1       Session Management library for X11
libX11-1.6.2,1      X11 library
libXau-1.0.8        Authentication Protocol library for X11
libXcomposite-0.4.4,1 X Composite extension library
libXcursor-1.1.14   X client-side cursor loading library
libXdamage-1.1.4    X Damage extension library
libXdmcp-1.1.1      X Display Manager Control Protocol library
libXext-1.3.2,1     X11 Extension library
libXfixes-5.0.1     X Fixes extension library
libXfont-1.4.7,1    X font library
libXft-2.3.1        Client-sided font API for X applications
libXi-1.7.2,1       X Input extension library
libXinerama-1.1.3,1 X11 Xinerama library
libXrandr-1.4.2     X Resize and Rotate extension library
libXrender-0.9.8    X Render extension library
libXt-1.1.4,1       X Toolkit library
libXtst-1.2.2       X Test extension
libcheck-0.9.11     Unit test framework for C
libgcrypt-1.5.3     General purpose crypto library based on code used in GnuPG
libgpg-error-1.12   Common error values for all GnuPG components
libiconv-1.14_1     A character set conversion library
libpthread-stubs-0.3_4 This library provides weak aliases for pthread functions
libtool-2.4.2_2     Generic shared library support script
libxcb-1.9.3        The X protocol C-language Binding (XCB) library
libxml2-2.8.0_3     XML parser library for GNOME
libxslt-1.1.28_1    The XSLT C library for GNOME

`(\_ -> ...)` does not generate <code> correctly

A Haskell lambda function does not generate <code> correctly, although it works in the Mou markdown editor and on GitHub.

This is Haskell lambda `(\_ -> ...)` code.

Result:
This is Haskell lambda (_ -> ...) code.

Github result:
This is Haskell lambda (\_ -> ...) code.

I read the code: when I put code before escape in the intra-block function, it worked fine :).

Possibly incorrect html emitted with script tag

See what happens to the async attribute.

➜  markdown git:(master) racket markdown/main.rkt
<script async class="speakerdeck-embed" data-id="f0b571b0759a0131f0bd026a5a2b7ed1" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<!DOCTYPE html>^D
<html>
 <head>
  <meta charset="utf-8" /></head>
 <body>
  <script async="async" class="speakerdeck-embed" data-id="f0b571b0759a0131f0bd026a5a2b7ed1" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script></body></html>%

Block image should parse links in label

(parse-markdown "![[A _label_](/url/)](/png/)")
; Actual =>
'((div ((class "figure"))
       (img ((src "/png/")
             (alt "[A _label_](/url/)")))
       (p ((class "caption"))
          "[A " (em () "label") "](/url/)")))
; Expected => 
'((div ((class "figure"))
       (img ((src "/png/")
             (alt "[A _label_](/url/)")))
       (p ((class "caption"))
          (a ((href "/url/")) "A " (em () "label")))))

Bold/italic in the middle of words does not work: should it?

If you type something like **fr**ozen bl**og** then I think you should get frozen blog: you currently get something with embedded asterisks instead (this is using markdown in frog of course). I don't know what the markdown standard, as far as there is one, says about this.

I think this used to work (at least, when I wrote that markup originally I assume I checked the output!).

This is a non-major issue.

Error parsing HTML comment in list

See original report and discussion: greghendershott/frog#106

Generated HTML5 should use <img /> not <img></img>

cf greghendershott/frog#41

No license file

Unlicensed software is nonfree software... could we get a COPYING? :)

Smart quotes

cf greghendershott/frog#25

`\nBold blah blah.\n` results in bullet list instead of <strong>

cf greghendershott/frog#33

\n**Bold** blah blah.\n results in bullet list instead of bold.

Likewise, \n1.23 blah blah.\n results in number list instead of number.

In both cases, the issue is that a #\space should be required after *, -, +, or 1. otherwise it is not a list.
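
The rule can be sketched as a regular expression. list-marker? is a hypothetical predicate for illustration, not the parser's actual grammar (which is written with parsack combinators), and it ignores leading indentation.

```racket
#lang racket/base

;; Hypothetical predicate: a line starts a list item only when the
;; marker (*, -, +, or digits followed by a dot) is followed by a space
;; or tab.
(define (list-marker? line)
  (regexp-match? #px"^(?:[*+-]|[0-9]+\\.)[ \t]" line))
```

With this check, "**Bold** blah blah." and "1.23 blah blah." are rejected because the character after the marker is not whitespace.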

Support MathJax delimiters

In the sense that text between \[ and \] or between $ and $ is used literally, not parsed as markdown.

See greghendershott/frog#129 (comment)

Because \ already has a meaning in markdown -- for example \[ means a literal [, not e.g. part of markdown syntax for a link -- this may need to be \\[ and so on.

Smart-quote conversion not working with nested single quotes

Originally submitted by @chipotle as mbutterick/pollen#17, but seems to also belong here. This Markdown snippet:

"Reductive. Yes."

Is correctly parsed with its smart quotes like this:

<p>&ldquo;Reductive. Yes.&rdquo;</p>

But when inner single quotes are added:

"'Reductive.' Yes."

Then the parser ignores the outer double quotes:

<p>"&lsquo;Reductive.&rsquo; Yes."</p>

can't escape brackets

I believe markdown allows escaping square brackets with a backslash but this doesn't work.

Footnote definitions not appearing in numerical order

Footnote definitions should appear in numerical order, but currently they are appearing in the order they are defined in the Markdown text.

Another way to put it: footnote definitions should appear in the order of their usage, not of their definition.

Example:

A usage[^foo] and another[^bar].

[^bar]: Bar note.

[^foo]: Foo note.

The foo note will be number 1 and bar will be number 2. The footnotes should appear in that order.

Obviously someone can work around this by reordering the footnote definitions, but it should work correctly especially when using the label variant.
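
The intended ordering can be sketched as follows. order-footnotes is a hypothetical function, and its inputs (a list of labels in order of use, plus a hash from label to definition text) are assumed shapes, not the parser's internal data structures.

```racket
#lang racket/base
(require racket/list)

;; Hypothetical sketch: number footnotes by first use, not by where
;; they are defined. Labels used but never defined are skipped.
(define (order-footnotes uses defs)
  (define labels
    (filter (λ (l) (hash-has-key? defs l))
            (remove-duplicates uses)))
  (for/list ([label (in-list labels)]
             [n (in-naturals 1)])
    (list n label (hash-ref defs label))))
```

For the example above, the uses are foo then bar, so foo is numbered 1 and bar is numbered 2 regardless of the order of the definitions.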

can't nest images in links

I've tried nesting images in links either with:

[![foo](foo.jpg)](foo.html)

[<img src="foo.jpg"/>](foo.html)

and both cause the outer brackets to fail to be treated as a link. I can work around with literal HTML.

Emphasis of non-alphanumeric not working

(parse-markdown "*[alt](/url)*")
;; =>
'((p ()
     "*"
     (a ((href "/url")) "alt")
     "*"))

;; but should be:
'((p ()
     (em ()
         (a ((href "/url")) "alt"))))

code blocks not preceded by a blank line fail

A code block that isn't preceded by a blank line fails to be interpreted correctly.

no support for em dashes

Other markdown parsers treat three dashes as an em dash, but here they're ignored.

Backtick (a.k.a. quasiquote) in fenced code block

(parse-markdown "```racket\n'(foo)\n```")
;; => '((pre ((class "brush: racket")) (code () "'(foo)"))) 
;; good

(parse-markdown "```racket\n`(foo)\n```")
;; => '((p () (code () "racket\n`(foo)")))
;; bad

At first glance, I don't understand why. $verbatim/fenced is using $any-line, which should not be affected by the ```. Somehow it is getting parsed as an inline $code instead.

Allow configurability of the xexprs generated by the Markdown parse

Specific case first:

The Markdown library currently generates (h1 () ...) for Markdown ==== headers, and (h2 () ...) for ---- headers. For my application I'd prefer these to be (h2 () ...) and (h3 () ...) respectively. That's easy to fix, since I can just walk the tree adjusting them post-parse. However it would be neater if I could ask the parser to do that for me.

I see that this overlaps with the discussion of tree-walking, and its costs, in pull request #48, so might relate to that.

Looking at $setext-heading/para/plain in parse.rkt, I see that this h1/h2 interpretation is implemented by a (match ...) expression, so that looks like a natural location for a parameter, but you'll have a better idea than me of how expensive that would be.

There's a potential more general point, in that there are other places where it looks natural for a user to impose a different interpretation on the parse – functions $_emph and $_strong – but those cases are more marginal, and it might be that the sort of configurability I'm suggesting above is not worth generalising.
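
For reference, the post-parse tree walk mentioned above might look like this sketch. demote-headings, demote-tag, and attrs? are hypothetical helpers written for illustration; they are not part of the library.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical helper: shift h1..h5 down one level; leave other tags.
(define (demote-tag tag)
  (case tag
    [(h1) 'h2] [(h2) 'h3] [(h3) 'h4] [(h4) 'h5] [(h5) 'h6]
    [else tag]))

;; An attribute list is a list of (symbol string) pairs, possibly empty.
(define (attrs? v)
  (and (list? v)
       (andmap (λ (a) (and (list? a) (= (length a) 2)
                           (symbol? (car a)) (string? (cadr a))))
               v)))

;; Walk an xexpr, demoting every heading element it contains.
(define (demote-headings x)
  (match x
    [(list (? symbol? tag) (? attrs? as) body ...)
     `(,(demote-tag tag) ,as ,@(map demote-headings body))]
    [(list (? symbol? tag) body ...)
     `(,(demote-tag tag) ,@(map demote-headings body))]
    [_ x]))
```

It would be applied to each top-level element of the parse result, for example with (map demote-headings xexprs). A parameter inside the parser would avoid this extra pass over the tree.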

observe `xexpr-drop-empty-attributes`?

The xexpr-drop-empty-attributes parameter in xml/xexpr controls whether empty attribute lists are omitted from X-expressions.

parse-markdown makes X-expressions, but does not observe this parameter. Can it? Should it?

(parameterize ([xexpr-drop-empty-attributes #t])
  (parse-markdown "I am _emph_ and I am **strong**."))

> '((p () "I am " (em () "emph") " and I am " (strong () "strong") "."))

;; vs. '((p "I am " (em "emph") " and I am " (strong "strong") "."))
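
In the meantime, the empty attribute lists can be stripped after parsing. This is a sketch: drop-empty-attrs is a hypothetical helper, not something the library provides, and it does not consult the xexpr-drop-empty-attributes parameter.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical helper: remove () attribute lists from an xexpr,
;; keeping non-empty attribute lists as they are.
(define (drop-empty-attrs x)
  (match x
    [(list (? symbol? tag) '() body ...)
     (cons tag (map drop-empty-attrs body))]
    [(list (? symbol? tag) body ...)
     (cons tag (map drop-empty-attrs body))]
    [_ x]))
```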
