CommonMark parser and renderer in JavaScript

License: Other

commonmark.js's Introduction

commonmark.js

CommonMark is a rationalized version of Markdown syntax, with a spec and BSD-licensed reference implementations in C and JavaScript.

For more information, see http://commonmark.org.

This repository contains the JavaScript reference implementation. It provides a library with functions for parsing CommonMark documents to an abstract syntax tree (AST), manipulating the AST, and rendering the document to HTML or to an XML representation of the AST.

To play with this library without installing it, see the live dingus at http://try.commonmark.org/.

Installing

You can install the library using npm:

npm install commonmark

This package includes the commonmark library and a command-line executable, commonmark.

For client-side use, you can use one of the single-file distributions provided in the dist/ subdirectory of the node installation (node_modules/commonmark/dist/). Use either commonmark.js (readable source) or commonmark.min.js (minimized source).

Alternatively, bower install commonmark will install the needed distribution files in bower_components/commonmark/dist.

You can also use the version hosted by unpkg: for example, https://unpkg.com/commonmark@0.29.3/dist/commonmark.js for the unminimized version 0.29.3.

Building

Make sure to fetch dependencies with:

npm install

To run tests for the JavaScript library:

npm test

(Running the tests will also rebuild distribution files in dist/.)

To run benchmarks against some other JavaScript converters:

make bench

To start an interactive dingus that you can use to try out the library:

make dingus

Usage

Instead of converting Markdown directly to HTML, as most converters do, commonmark.js parses Markdown to an AST (abstract syntax tree), and then renders this AST as HTML. This opens up the possibility of manipulating the AST between parsing and rendering. For example, one could transform emphasis into ALL CAPS.

Here's a basic usage example:

var reader = new commonmark.Parser();
var writer = new commonmark.HtmlRenderer();
var parsed = reader.parse("Hello *world*"); // parsed is a 'Node' tree
// transform parsed if you like...
var result = writer.render(parsed); // result is a String

The constructors for Parser and HtmlRenderer take an optional options parameter:

var reader = new commonmark.Parser({smart: true});
var writer = new commonmark.HtmlRenderer({sourcepos: true});

Parser currently supports the following:

smart: if true, straight quotes will be made curly, -- will be changed to an en dash, --- will be changed to an em dash, and ... will be changed to ellipses.

Both HtmlRenderer and XmlRenderer (see below) support these options:

sourcepos: if true, source position information for block-level elements will be rendered in the data-sourcepos attribute (for HTML) or the sourcepos attribute (for XML).
safe: if true, raw HTML will not be passed through to HTML output (it will be replaced by comments), and potentially unsafe URLs in links and images (those beginning with javascript:, vbscript:, file:, and with a few exceptions data:) will be replaced with empty strings.
softbreak: specify raw string to be used for a softbreak.
esc: specify a function to be used to escape strings. Its argument is the string.

For example, to make soft breaks render as hard breaks in HTML:

var writer = new commonmark.HtmlRenderer({softbreak: "<br />"});

To make them render as spaces:

var writer = new commonmark.HtmlRenderer({softbreak: " "});

XmlRenderer serves as an alternative to HtmlRenderer and will produce an XML representation of the AST:

var writer = new commonmark.XmlRenderer({sourcepos: true});

The parser returns a Node. The following public properties are defined (those marked "read-only" have only a getter, not a setter):

type (read-only): a String, one of text, softbreak, linebreak, emph, strong, html_inline, link, image, code, document, paragraph, block_quote, item, list, heading, code_block, html_block, thematic_break.
firstChild (read-only): a Node or null.
lastChild (read-only): a Node or null.
next (read-only): a Node or null.
prev (read-only): a Node or null.
parent (read-only): a Node or null.
sourcepos (read-only): an Array with the following form: [[startline, startcolumn], [endline, endcolumn]].
isContainer (read-only): true if the Node can contain other Nodes as children.
literal: the literal String content of the node or null.
destination: link or image destination (String) or null.
title: link or image title (String) or null.
info: fenced code block info string (String) or null.
level: heading level (Number).
listType: a String, either bullet or ordered.
listTight: true if list is tight.
listStart: a Number, the starting number of an ordered list.
listDelimiter: a String, either ) or . for an ordered list.
onEnter, onExit: Strings, used only for custom_block or custom_inline.

Nodes have the following public methods:

appendChild(child): Append a Node child to the end of the Node's children.
prependChild(child): Prepend a Node child to the beginning of the Node's children.
unlink(): Remove the Node from the tree, severing its links with siblings and parents, and closing up gaps as needed.
insertAfter(sibling): Insert a Node sibling after the Node.
insertBefore(sibling): Insert a Node sibling before the Node.
walker(): Returns a NodeWalker that can be used to iterate through the Node tree rooted in the Node.

The NodeWalker returned by walker() has two methods:

next(): Returns an object with properties entering (a boolean, which is true when we enter a Node from a parent or sibling, and false when we reenter it from a child) and node (the current Node). Returns null when we have finished walking the tree.
resumeAt(node, entering): Resets the iterator to resume at the specified node and setting for entering. (Normally this isn't needed unless you do destructive updates to the Node tree.)

Here is an example of the use of a NodeWalker to iterate through the tree, making transformations. This simple example converts the contents of all text nodes to ALL CAPS:

var walker = parsed.walker();
var event, node;

while ((event = walker.next())) {
  node = event.node;
  if (event.entering && node.type === 'text') {
    node.literal = node.literal.toUpperCase();
  }
}

This more complex example converts emphasis to ALL CAPS:

var walker = parsed.walker();
var event, node;
var inEmph = false;

while ((event = walker.next())) {
  node = event.node;
  if (node.type === 'emph') {
    if (event.entering) {
      inEmph = true;
    } else {
      inEmph = false;
      // add Emph node's children as siblings
      while (node.firstChild) {
        node.insertBefore(node.firstChild);
      }
      // remove the empty Emph node
      node.unlink();
    }
  } else if (inEmph && node.type === 'text') {
    node.literal = node.literal.toUpperCase();
  }
}

Exercises for the reader: write a transform to

De-linkify a document, transforming links to regular text.
Remove all raw HTML (html_inline and html_block nodes).
Run fenced code blocks marked with a language name through a syntax highlighting library, replacing them with an HtmlBlock containing the highlighted code.
Print warnings to the console for images without image descriptions or titles.

Command line

The command line executable parses CommonMark input from the specified files, or from stdin if no files are specified, and renders the result to stdout as HTML. If multiple input files are specified, their contents are concatenated before parsing, with newlines between them.

commonmark inputfile.md > outputfile.html
commonmark intro.md chapter1.md chapter2.md > book.html

Use commonmark --help to get a summary of options.

A note on security

The library does not attempt to sanitize link attributes or raw HTML. If you use this library in applications that accept untrusted user input, you should either enable the safe option (see above) or run the output through an HTML sanitizer to protect against XSS attacks.

Performance

Performance is excellent, roughly on par with marked. On a benchmark converting an 11 MB Markdown file built by concatenating the Markdown sources of all localizations of the first edition of Pro Git by Scott Chacon, the command-line tool commonmark is just a bit slower than the C program discount, roughly ten times faster than PHP Markdown, a hundred times faster than Python Markdown, and more than a thousand times faster than Markdown.pl.

Here are some focused benchmarks of four JavaScript libraries (using versions available on 24 Jan 2015). They test performance on different kinds of Markdown texts. (Most of these samples are taken from the markdown-it repository.) Results show a ratio of ops/second (higher is better) against showdown (which is usually the slowest implementation). Versions: showdown 1.3.0, marked 0.3.5, commonmark.js 0.22.1, markdown-it 5.0.2, node 5.3.0. Hardware: 1.6GHz Intel Core i5, Mac OSX.

Sample	showdown	commonmark	marked	markdown-it
README.md	1	3.6	3.1	3.9
block-bq-flat.md	1	4.8	4.9	4.9
block-bq-nested.md	1	11.9	6.8	10.7
block-code.md	1	4.7	12.1	23.0
block-fences.md	1	6.2	21.2	19.1
block-heading.md	1	5.0	4.8	6.5
block-hr.md	1	3.5	3.3	3.5
block-html.md	1	2.1	0.9	3.8
block-lheading.md	1	5.1	4.9	3.9
block-list-flat.md	1	4.7	4.4	7.4
block-list-nested.md	1	9.5	7.8	17.6
block-ref-flat.md	1	0.8	0.5	0.6
block-ref-nested.md	1	0.7	0.6	0.9
inline-autolink.md	1	2.3	3.4	2.5
inline-backticks.md	1	7.6	5.3	8.2
inline-em-flat.md	1	1.5	1.1	1.6
inline-em-nested.md	1	1.8	1.3	1.7
inline-em-worst.md	1	2.4	1.5	2.5
inline-entity.md	1	2.0	3.8	2.7
inline-escape.md	1	2.2	1.4	5.0
inline-html.md	1	2.9	3.7	3.3
inline-links-flat.md	1	2.7	2.7	2.2
inline-links-nested.md	1	1.4	0.5	0.5
inline-newlines.md	1	2.3	2.0	3.5
lorem1.md	1	6.0	2.9	3.3
rawtabs.md	1	4.6	3.9	6.7

To generate this table:

make bench-detailed

Authors

John MacFarlane wrote the first version of the JavaScript implementation. The block parsing algorithm was worked out together with David Greenspan. Kārlis Gaņģis helped work out a better parsing algorithm for links and emphasis, eliminating several worst-case performance issues. Vitaly Puzrin has offered much good advice about optimization and other issues.

commonmark.js's Issues

AST confusion

Given the following script:

{Parser} = require 'commonmark'

text = """
# test

some text

- list item one
- list item two
- list item three

more text
"""

parser = new Parser()
walker = parser.parse(text).walker()
console.log('''
hasNext | hasPrev | type | literal
------- | ------- | ---- | -------
''')
while event = walker.next()
  node = event.node
  console.log(
    [
      node.next isnt null
      node.prev isnt null
      node.type
      node.literal
    ].join(' | ')
  )

I get the following output:

hasNext	hasPrev	type	literal
false	false	Document
true	false	Header
false	false	Text	test
true	false	Header
true	true	Paragraph
false	false	Text	some text
true	true	Paragraph
true	true	List
true	false	Item
false	false	Paragraph
false	false	Text	list item one
false	false	Paragraph
true	false	Item
true	true	Item
false	false	Paragraph
false	false	Text	list item two
false	false	Paragraph
true	true	Item
false	true	Item
false	false	Paragraph
false	false	Text	list item three
false	false	Paragraph
false	true	Item
true	true	List
false	true	Paragraph
false	false	Text	more text
false	true	Paragraph
false	false	Document

Which has a couple problems:

Child-nodes are returned without using the walker in each node (which makes the next/prev values really confusing)
Container nodes are repeated (like the Header-Text-Header and Paragraph-Text-Paragraph sets). This would make sense if the AST were flat and didn't show child/parent nodes (meaning that the duplicate nodes represent start/end HTML tags). However, there's no indication of which entries are start tags and which are end tags, and the AST isn't flat.
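The repeated rows are easier to read once each event is labelled as an enter or an exit. The sketch below is a simplified stand-in, not the library's walker: it assumes plain objects with a children array (a hypothetical shape) and a generator named walk, and it emits two events for containers and one for leaves, which is the pattern visible in the table above.

```javascript
// Simplified stand-in for the AST: containers have a `children` array,
// leaves (such as text) do not. This shape is an assumption for illustration.
function* walk(node) {
  yield { entering: true, node: node };
  if (Array.isArray(node.children)) {
    for (var child of node.children) {
      yield* walk(child);
    }
    // Containers are visited a second time, on the way out.
    yield { entering: false, node: node };
  }
}

var tree = {
  type: "document",
  children: [
    { type: "heading", children: [{ type: "text", literal: "test" }] },
    { type: "paragraph", children: [{ type: "text", literal: "some text" }] }
  ]
};

// Label every event so start and end visits can be told apart.
var events = Array.from(walk(tree)).map(function (event) {
  return (event.entering ? "enter " : "exit ") + event.node.type;
});
console.log(events.join("\n"));
```

With the real library, the entering property on each event returned by walker.next() plays the role of the label printed here.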

roadmap

I cannot find any roadmap for commonmark. Can anybody point me to one?

Named HTML entities with multiple codepoints not parsed correctly

See the following example: http://spec.commonmark.org/dingus/?text=%26ngE%3B%0A%0A%26gE%3B

&ngE; should be rendered as "≧̸" (U+02267 U+00338), but it's actually rendered as "≧" (which is the same as &gE;)

It looks like other such named entities are also not handled correctly.

Would probably also be good to add such an entity to the spec so that implementations are checked for this.
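The difference between the two entities can be checked with standard string functions alone. This is a small sketch using the code points quoted above; codePoints is a hypothetical helper name, and nothing here touches the library's entity table.

```javascript
// &ngE; should decode to two code points: U+2267 followed by U+0338.
var ngE = String.fromCodePoint(0x2267, 0x0338);
// &gE; decodes to the single code point U+2267.
var gE = String.fromCodePoint(0x2267);

// Hypothetical helper: list the code points of a string as hex strings.
function codePoints(s) {
  return Array.from(s).map(function (ch) {
    return ch.codePointAt(0).toString(16);
  });
}

console.log(codePoints(ngE)); // two entries
console.log(codePoints(gE));  // one entry
```

A decoder that stores only one code point per entity name would produce the second string for both inputs, which matches the behaviour reported here.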

semver

Does commonmark follow semver? No features were added in the last two minor versions.

Bad perf case: Lots of delimiters that can close but no openers

Paste the text from this gist into dingus.
It takes about 1.3 seconds to parse
Paste the text a second time (append, so that the input is doubled in size)
It takes 6 to 13 seconds to parse or even longer

The test input is a_ (a, underscore, space) repeated 20000 times. The problem seems to be that in processEmphasis, for each potential closer, the opener is searched all the way back to the stack bottom.

Maybe it could be improved by removing a closer after not finding a corresponding opener iff the closer can not be an opener, so as to not have to check it again. Not sure if this is correct in all cases though or if there are other worst case inputs (need to think about it more).
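For anyone without access to the gist, the input is easy to regenerate. The snippet below is a sketch: timeParse and countUnderscores are hypothetical helpers, and countUnderscores is only a stand-in so the snippet runs without the library installed.

```javascript
// The test input described above: "a_ " repeated 20000 times.
var input = "a_ ".repeat(20000);
// The second measurement uses the same text appended to itself.
var doubled = input + input;

// Hypothetical timing helper: parseFn is whatever function you want to measure.
function timeParse(parseFn, text) {
  var start = Date.now();
  parseFn(text);
  return Date.now() - start; // elapsed milliseconds
}

// Stand-in "parser" so the snippet is self-contained. To measure the library,
// pass function (s) { return new commonmark.Parser().parse(s); } instead.
function countUnderscores(text) {
  return text.split("_").length - 1;
}

console.log(input.length, doubled.length);
console.log(timeParse(countUnderscores, input) + " ms");
```

If parse time grows much faster than the input size when moving from input to doubled, that points to the repeated search back to the stack bottom described above.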

Emphasis regression?

http://spec.commonmark.org/dingus/?text=_%28hai%29_.%20%3C-%20bad%0A%0A_%28hai%29_%0A%0A*%28hai%29*.%0A%0A

Happened somewhere after 0.15.

Is this a bug, or is it intentional behaviour, with users expected to use * instead?

No way to set list attributes on a commonmark.Node

When using the abstract syntax tree directly, there is no way to set the list attributes (listType, listDelimiter, listStart and listTight) without initializing _listData yourself. As it starts with an underscore, I think it is meant to be a private property and should not be interacted with directly.

For example:

var node = new commonmark.Node('List');
node.listType = 'bullet'; // TypeError: Cannot set property 'delimiter' of null

Ordered lists starting with 0

According to the spec:

An ordered list marker is a sequence of one or more digits (0-9), followed by either a . character or a ) character.

0. seems to be recognized as an ordered list marker, but the resulting start attribute does not get set to 0 as expected: http://spec.commonmark.org/dingus/?text=0.%20Zero%0A1.%20One

(This issue may also affect cmark - I have not tested that)

Dingus doesn't display list markers if first block in item is Code Block

If you try the following bit of markdown in the dingus:

-     new list with indented code block

Show in dingus

the bullet will not be displayed.

This seems to be due to a clash with the Bootstrap CSS definition for pre. Bootstrap is doing a

pre {
  overflow: auto;
}

which is causing the list markers to be invisible. If you add overflow: visible to the definition for pre in dingus.css it is visible again.

pre {
  display: block;
  padding: 0.5em;
  color: #333;
  background: #f8f8ff;
  overflow: visible;
}

"Smart" replacement of hyphens with em/en dash seems strange

The current tests in smart_punct.txt for en/em dashes don't define behavior for certain longer runs of hyphens, and the current code leaves hanging hyphens in cases where the whole run could easily be replaced with em/en dashes alone.

For example, a series of 10 hyphens results in 3 em dashes followed by a hyphen. In my opinion, it would make more sense for this to result in 5 en dashes. Additionally, 7 hyphens are converted into 2 em dashes and a hyphen, but I believe it should be 1 em dash and 2 en dashes. That is:

Current: ---------- => --- --- --- -  => ———-
 Better: ---------- => -- -- -- -- -- => –––––

Current: ------- => --- --- - => ——-
 Better: ------- => --- -- -- => —––

To achieve this behavior, each group of hyphens would be collected and counted at once, assuming it is 2 hyphens or more (eg, /^(?<!-)(-{2,})/), and then the most optimal grouping would be figured out (in PHP for thephpleague/commonmark, but should be able to be converted to JavaScript/C fairly easily):

$count = strlen($matched);
$en_dash = '–';
$en_count = 0;
$em_dash = '—';
$em_count = 0;
if ($count % 3 === 0) { // If divisible by 3, use all em dashes
    $em_count = $count / 3;
} elseif ($count % 2 === 0) { // If divisible by 2, use all en dashes
    $en_count = $count / 2;
} elseif (($count - 2) % 3 === 0) { // If 2 extra dashes, use en dash for last 2; em dashes for rest
    $em_count = floor(($count - 2) / 3);
    $en_count = 1;
} else { // Use en dashes for last 4 hyphens; em dashes for rest
    $em_count = floor(($count - 4) / 3);
    $en_count = 2;
}
$inlineContext->getInlines()->add(new Text(
    str_repeat($em_dash, $em_count).
    str_repeat($en_dash, $en_count)
));
return true;
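The same grouping logic translates to JavaScript fairly directly. This is a sketch of the proposal above, not code taken from commonmark.js; smartDashes is a hypothetical name, and runs shorter than two hyphens are returned unchanged as an added assumption.

```javascript
// Port of the PHP grouping logic: turn a run of `count` hyphens into
// em dashes and en dashes with no hyphen left over.
function smartDashes(count) {
  var EN_DASH = "\u2013";
  var EM_DASH = "\u2014";
  var emCount = 0;
  var enCount = 0;

  // Assumption: runs of fewer than two hyphens are left as they are.
  if (count < 2) {
    return "-".repeat(count);
  }
  if (count % 3 === 0) {
    // Divisible by 3: all em dashes.
    emCount = count / 3;
  } else if (count % 2 === 0) {
    // Divisible by 2: all en dashes.
    enCount = count / 2;
  } else if ((count - 2) % 3 === 0) {
    // Two extra hyphens: one en dash, em dashes for the rest.
    emCount = (count - 2) / 3;
    enCount = 1;
  } else {
    // Four extra hyphens: two en dashes, em dashes for the rest.
    emCount = (count - 4) / 3;
    enCount = 2;
  }
  return EM_DASH.repeat(emCount) + EN_DASH.repeat(enCount);
}

console.log(smartDashes(10)); // five en dashes
console.log(smartDashes(7));  // one em dash followed by two en dashes
```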

Is this something that CommonMark would be interested in implementing? (I can do it; I just don't want to spend the time writing the code if it won't be accepted.) Or should the smart_punct.txt file be updated with tests that check for these edge cases?

Make HTML renderer more customizable

It would be nice if renderers for individual elements (e.g. links) could be customized by setting properties of the renderer object.

Definitions of `can_open` and `can_close`

This is not really an issue -- more like an extended comment, which could be used to simplify the code and/or the definitions if you think it's better. (Posting it here though it could be a comment on the spec or probably the cmark code.)

On first reading, I found the definitions of left/right flanking and their use in can-open/close very confusing. My guess is that the flanking concept initially simplified things, but IMO it's not as helpful now that more conditions have piled up. With the recent additional change I resorted to drawing Karnaugh maps, and the result is much shorter:

var sp_after  = reWhitespaceChar.test(char_after);
var sp_before = reWhitespaceChar.test(char_before);
var pn_after  = rePunctuation.test(char_after);
var pn_before = rePunctuation.test(char_before);
can_open  = !sp_after  && (pn_before || sp_before || (cc===C_ASTERISK && !pn_after ));
can_close = !sp_before && (pn_after  || sp_after  || (cc===C_ASTERISK && !pn_before));

The length is of course not too relevant, and I'm guessing that the speed would be practically the same. What seems important to me, and this might be just me, is that it's much easier to read, and makes more easily sense as a definition (had it been in the text).
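To try the two expressions on concrete inputs, they can be wrapped in a function. This is a sketch only: the two regexes are simplified ASCII stand-ins for the library's reWhitespaceChar and rePunctuation, and delimiterFlags is a hypothetical name.

```javascript
// Simplified stand-ins (assumptions, ASCII only) for the library's regexes.
var reWhitespaceChar = /^\s$/;
var rePunctuation = /^[!-\/:-@\[-`{-~]$/;

// charBefore / charAfter are the characters around the delimiter run;
// pass "\n" at the start or end of the text. delim is "*" or "_".
function delimiterFlags(charBefore, delim, charAfter) {
  var spAfter = reWhitespaceChar.test(charAfter);
  var spBefore = reWhitespaceChar.test(charBefore);
  var pnAfter = rePunctuation.test(charAfter);
  var pnBefore = rePunctuation.test(charBefore);
  var isAsterisk = delim === "*";
  return {
    canOpen: !spAfter && (pnBefore || spBefore || (isAsterisk && !pnAfter)),
    canClose: !spBefore && (pnAfter || spAfter || (isAsterisk && !pnBefore))
  };
}

console.log(delimiterFlags("a", "*", "b")); // intraword asterisk
console.log(delimiterFlags("a", "_", "b")); // intraword underscore
console.log(delimiterFlags(" ", "*", "a")); // asterisk after a space
```

Under these expressions an intraword * can both open and close, an intraword _ can do neither, and a delimiter preceded by whitespace can only open.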

isContainer, attribute or method?

Hi,

in the README, isContainer is described as an attribute of a node. But in the code, it's a method.

Updating the README should be enough, but maybe it makes more sense to modify the code to make it a real attribute. What do you think?

Typo in blocks.js?

While porting this to a typed language (Haxe) I noticed a potential typo in lib/blocks.js:
https://github.com/jgm/commonmark.js/blob/master/lib/blocks.js#L686

I'm thinking above should just be block._parent as we want to exit here when tip is null:
https://github.com/jgm/commonmark.js/blob/master/lib/blocks.js#L737

Or maybe not... I'm new here ;)

Separate parser library

I've been using the commonmark.js parser combined with react for rendering.

Since I am not using the html or xml renderer, it would be great to have a separate js file built which only contains the parser, node, and walker objects.

Test #8 from smart_punct.txt

It looks like test #8 from smart_punct.txt fails, at least in dingus.

"A paragraph with no closing quote.

"Second paragraph by same speaker, in fiction."

Dingus shows the first paragraph with a closing quote, where there should be an opening quote.

HTTPS download link

The readme has a link to the compiled source here: http://spec.commonmark.org/js/commonmark.js

I think we should try to be encouraging developers to get their code over HTTPS, to prevent problems like the recent Xcode attack.

Simply changing the above link to https didn't work (cert domain error). It seems to be hosted on GitHub but I couldn't figure out the correct invocation through GitHub pages or whatever. This URL does work though: https://raw.githubusercontent.com/jgm/CommonMark-site/gh-pages/js/commonmark.js

Thanks.

Edit: this would also apply to the whole CommonMark site, but one step at a time...

Any plans for a Grunt plugin?

Wrong parse on nested links/emphasis

% bin/commonmark 
**x [a*b**c*](d)
<p><em><em>x <a href="d">a<em>b</em>c</a></em></em></p>

See commonmark/cmark#59.

Inconsistent handling of malformed link reference titles

Consider these two samples. Both have malformed link titles.

[foo]: /url
"title

[foo]

http://spec.commonmark.org/dingus/?text=%5Bfoo%5D%3A%20%2Furl%0A%22title%0A%0A%5Bfoo%5D

and

[foo]: /url
"title" ok

[foo]

http://spec.commonmark.org/dingus/?text=%5Bfoo%5D%3A%20%2Furl%0A%22title%22%20ok%0A%0A%5Bfoo%5D

The first sample creates a link reference, but the second one doesn't. I believe they should both be treated the same - ie. links in both cases.

Parser adds lines to a tip which can't accept them

Here's an example which exhibits this behavior:

10. Bullet

        code


Test

The issue occurs when parsing line 5. As you can see, it checks whether the container (a CodeBlock) accepts lines, but then adds the line to the tip instead (which is a Document):

According to the comments on lines 117-118, the tip should be checked to see if it handles lines. Would it therefore be true that line 723 should check the tip type instead of the container type? Or is there perhaps an issue with the tip being out-of-sync with the container?

sourcepos on links

Is there a reason why we don't get data-sourcepos on links?

test [test](https://google.com)

Currently I get:
<p data-sourcepos="1:1-1:31">test <a href="https://google.com">test</a></p>
Would be nice to get:
<p data-sourcepos="1:1-1:31">test <a data-sourcepos="1:6-1:31" href="https://google.com">test</a></p>

Fuse/merge adjacent text nodes

Currently, the AST returned by commonmark.js can contain multiple adjacent text nodes. E.g. the following (dingus):

https://www.google.com/?q=foo_bar

Results in this AST:

<document>
  <paragraph>
    <text>https://www.google.com/?q=foo</text>
    <text>_</text>
    <text>bar</text>
  </paragraph>
</document>

For uses that require post-processing the AST before rendering (e.g. autolinking plain URLs), this makes it a little bit more difficult, because adjacent text nodes may have to be merged first.

Could this be implemented in the inline parser directly, so that the resulting AST never contains adjacent text nodes?

This isn't really a bug report. It's more of a discussion starter, and to hear your thoughts about this. I'm thinking about how to implement auto-linking in my implementation of CommonMark, and post-processing might be a good option.
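As a rough illustration of the post-processing step, merging can be done in a single pass over a list of siblings. The sketch below works on a simplified stand-in (an array of objects with type and literal), not on commonmark.js Node objects, and mergeAdjacentText is a hypothetical name.

```javascript
// Simplified stand-in for a paragraph's inline children.
// Real commonmark.js nodes are linked through next/prev instead of an array.
function mergeAdjacentText(children) {
  var merged = [];
  children.forEach(function (child) {
    var last = merged[merged.length - 1];
    if (last && last.type === "text" && child.type === "text") {
      // Fold this text node into the previous one.
      last.literal += child.literal;
    } else {
      // Copy, so the objects in the input array are not mutated.
      merged.push(Object.assign({}, child));
    }
  });
  return merged;
}

var inlines = [
  { type: "text", literal: "https://www.google.com/?q=foo" },
  { type: "text", literal: "_" },
  { type: "text", literal: "bar" }
];
console.log(mergeAdjacentText(inlines));
```

On the real AST the same idea would use the documented next, literal and unlink() members while walking each container's children.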

Update mdurl dependency

Unfortunately the google closure compiler is unable to parse the commonmark library as there is a variable called "char" which is a reserved keyword in javascript.

This has been fixed upstream in mdurl.

See: markdown-it/mdurl#1 (comment)

ETA until stability?

I'm considering switching dox over to using commonmark, but I got bit by the massive API and AST changes between 0.12 and 0.17 and am hesitant to make it a dependency while things are changing so much.

Do you have an idea of when the package might be stable for a 1.0 release?

CommonMark renderer

The C library (cmark) has a CommonMark renderer. This could be ported over to commonmark.js (the code would be fairly similar), but I haven't done it yet.

Softbreak and Hardbreak

According to the README and the code, there are two types of breaks: Softbreak and Hardbreak. A Hardbreak is preceded by a double space, but I cannot understand how a Softbreak is detected. Can you help me?

## Expected
Input: "YOLO  \nmd ftw\n"
AST:
  Document: null
  Paragraph: null
  Text: YOLO
  Hardbreak: null // it’s fine, it’s how it is supposed to be
  Text: md ftw
  Paragraph: null
  Document: null

## Not Expected
Input: "YOLO\n\n\nmd ftw\n"
AST:
  Document: null
  Paragraph: null
  Text: YOLO
  // where is Softbreak here?
  Paragraph: null
  Paragraph: null
  Text: md ftw
  Paragraph: null
  Document: null

Invalid node type

I ported commonmark.js to Java (https://github.com/hidekatsu-izuno/commonmark4j).

While doing so I found the nonexistent node type HtmlInline in xml.js. Should this not be HtmlBlock?

        unescapedContents = nodetype === 'Html' || nodetype === 'HtmlInline';

0.22.0 minified version does not parse inline links correctly

With example input:

# H1

Lorem ipsum.

## H2

[link][foo]

[foo]: http://foo.com

### H3

1. Item 1
2. Item 2

* Bullet 1
* Bullet 2

    ~~~
    blockquote here
    ~~~

* An example [link](http://example.com 'link title').

And the unminified dist/commonmark.js, the (correct) parse tree is:

Document
.Header
..Text# H1
.Paragraph
..Text# Lorem i...
.Header
..Text# H2
.Paragraph
..Link
...Text# link
.Header
..Text# H3
.List
..Item
...Paragraph
....Text# Item 1
..Item
...Paragraph
....Text# Item 2
.List
..Item
...Paragraph
....Text# Bullet 1
..Item
...Paragraph
....Text# Bullet 2
...CodeBlock# blockqu...
..Item
...Paragraph
....Text# An exam...
....Link
.....Text# link
....Text# .

However, using dist/commonmark.min.js, the parse tree is:

Document
.Header
..Text# H1
.Paragraph
..Text# Lorem i...
.Header
..Text# H2
.Paragraph
..Link
...Text# link
.Header
..Text# H3
.List
..Item
...Paragraph
....Text# Item 1
..Item
...Paragraph
....Text# Item 2
.List
..Item
...Paragraph
....Text# Bullet 1
..Item
...Paragraph
....Text# Bullet 2
...CodeBlock# blockqu...
..Item
...Paragraph
....Text# An exam...
....Text# [
....Text# link
....Text# ]
....Text# (http:/...
....Text# '
....Text# link title
....Text# '
....Text# ).

Any ideas? I am just reporting this observation, I have not tried to look into why the minified version displays this behavior. I am seeing the same thing with master 8fefa4954a76bd1b78fe7144c4aef7d4eb499cc3 as well.

Regards,
Paul

Italics inside Bold text can parse as double em instead of strong

When trying to parse:
1. **one t*w*o three**
you are returned:

<ol>
<li><em><em>one t</em>w</em>o three**</li>
</ol>

You would expect:

<ol>
<li><strong>one t<em>w</em>o three</strong></li>
</ol>

It's taking the first two asterisks as opening EMs and the other two around the "w" as closing.

Live example: http://spec.commonmark.org/dingus/?text=1.%20**one%20t*w*o%20three**

note: it also has the same effect if you put the string in without the list: **one t*w*o three**
Live example: http://spec.commonmark.org/dingus/?text=**one%20t*w*o%20three**

Source maps

See commonmark/commonmark-spec#57, especially @zdne's comment.

Instead of starting and ending line and column for each element, we need to associate each element with a possibly non-contiguous range of positions in the source.

This is because CommonMark inline elements can be broken by indicators of block structure:

> *emphasized
> text*

Here the second > should not be considered part of the emphasized text, even though it occurs after the start of the emphasized text and before the end.
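One possible shape for such a source map is a list of per-line ranges. The sketch below is purely illustrative: the range format ({line, startColumn, endColumn}, 1-based and inclusive) and the helper sliceRanges are assumptions, not an existing commonmark.js API.

```javascript
var source = "> *emphasized\n> text*";

// Hypothetical representation: the emphasis covers two separate ranges,
// skipping the block quote marker at the start of the second line.
var emphRanges = [
  { line: 1, startColumn: 3, endColumn: 13 },
  { line: 2, startColumn: 3, endColumn: 7 }
];

// Return the source text covered by each range.
function sliceRanges(text, ranges) {
  var lines = text.split("\n");
  return ranges.map(function (r) {
    return lines[r.line - 1].slice(r.startColumn - 1, r.endColumn);
  });
}

console.log(sliceRanges(source, emphRanges));
```

A single start/end pair covering 1:3 to 2:7 would also include the second "> ", which is what the list of ranges avoids.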

Dingus shows incorrect sourcepos value for paragraphs

http://spec.commonmark.org/dingus.html?text=123

At first the value seems correct (<paragraph data-sourcepos="1:1-1:3">) but as soon as you start typing/clicking in the editor, it goes off: <paragraph data-sourcepos="19:1-1:12">

Probably because this.lineNumber is not reset.

header vs heading

Why do H1-6 titles have the type header instead of the more W3C-compliant heading?

PS. Sorry for asking too many questions, is it okay?

Allow Node classes to be set

I have to transform some custom syntax like this to HTML:

# Demo

T> This is some tip.

W> This is some warning.

I realized I could get 90% there by transforming those into blockquotes. That's not enough, though. I would need to attach that metadata (tip, warning) there to style these appropriately.

Here's what I came up with:

'use strict';
var markdown = require('commonmark');

var mdReader = new markdown.Parser();
var mdWriter = new markdown.HtmlRenderer();

main();

function main() {
    var content = '# demo\nT> some tip\n\nW> some warning\n'
    var content2 = '> some';
    var parsed = mdReader.parse(content);

    parsed = transform(parsed);

    var result = mdWriter.render(parsed);

    console.log(result);
}

function transform(parsed) {
    var walker = parsed.walker();
    var event, node;

    while ((event = walker.next())) {
      node = event.node;
      if (event.entering && node.type === 'Text') {
        if(node.literal.indexOf('T>') === 0) {
            node._parent._classes = ['tip']; // XXX: not possible yet
            node._parent._type = 'BlockQuote';
            node.literal = node.literal.slice(2).trim();
        }
        if(node.literal.indexOf('W>') === 0) {
            // ... same thing for warning
        }
      }
    }

    return parsed;
}

Do you think it would be alright to add support for something like ._classes? This would make it so much easier to do custom stuff like this. No doubt there are some other applications.

Fix url normalizer

Discussed here commonmark/commonmark-spec#270

My last attempt to use Node's url module for proper parsing broke a lot of tests (see commonmark/commonmark-spec#270 (comment)):

It always adds the missing / after the domain name (http://example.com?abc -> http://example.com/?abc)
It replaces \ with / in the query http://example.com?abc nodejs/node-v0.x-archive@f7ede33

We need to decide what to do with (tests|spec|implementation). I've stopped working on this issue until a direction is given. The code is available in a separate branch https://github.com/markdown-it/markdown-it/commits/normalize (last commit).

Example:

node -e "console.log(require('url').parse('http://example.com?foo'))"

{ protocol: 'http:',
  slashes: true,
  auth: null,
  host: 'example.com',
  port: null,
  hostname: 'example.com',
  hash: null,
  search: '?foo',
  query: 'foo',
  pathname: '/',
  path: '/?foo',
  href: 'http://example.com/?foo' }

see href
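The added slash is not specific to url.parse. As a small cross-check, the WHATWG URL class, available as a global in current Node.js and in browsers, fills in the path the same way. This is only an illustration of the behaviour, not a proposal for how the normalizer should work.

```javascript
// WHATWG URL parsing of the same input as above.
var normalized = new URL("http://example.com?foo");

console.log(normalized.pathname); // the empty path becomes "/"
console.log(normalized.search);   // "?foo"
console.log(normalized.href);     // the serialized URL includes the added "/"
```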

the README links to an old version of commonmark.js (on spec.commonmark.org)

There's a link in the README to http://spec.commonmark.org/js/commonmark.js, which is old enough that it refers to DocParser instead of Parser, so then the example usage later in the README is wrong.

If I knew where the source of that website was, I'd give you a pull request updating it. :)

Putting slightly misaligned blockquote in a list

Is the following a bug in commonmark.js? The behaviour seems to disobey rule 4 of List Items.

 > Blockquote
> continued here.


1.  > Blockquote
   > continued here.

Dingus here.

class injection in code block renderer

Did you know that CommonMark supports buttons? Apparently it does:

The trouble is that the renderer blindly appends the info string to the language- part. If there are spaces there, we'll end up with <code class="language-foo bar">, which is two separate classes.

Well, technically the space character (0x20) is filtered. But HTML5 defines five space characters, according to section 2.4 of the spec. I'll quote:

The space characters, for the purposes of this specification, are U+0020 SPACE, "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).

So, this code block will have three classes: language-foo, bar and baz:

```foo&#x09;bar&#x0C;baz
code
```

The second and third are essentially user-supplied, which is usually a bad thing.
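One possible mitigation, sketched below, is to keep only the first word of the info string, splitting on all five HTML5 space characters rather than on U+0020 alone. The helper name safeLanguageClass is hypothetical and is not part of commonmark.js; it assumes entities in the info string have already been decoded.

```javascript
// Hypothetical helper, not part of commonmark.js.
// HTML5 space characters: SPACE, TAB, LF, FF, CR.
const HTML_SPACE = /[ \t\n\f\r]+/;

function safeLanguageClass(info) {
  // Keep only the first non-empty word so no extra class can be injected.
  const firstWord = info.split(HTML_SPACE).filter(Boolean)[0];
  return firstWord ? "language-" + firstWord : "";
}

// The info string from the example, with its entities already decoded.
console.log(safeLanguageClass("foo\tbar\fbaz")); // language-foo
```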

Add version of dingus that works better with screen readers.

See http://talk.commonmark.org/t/the-commonmark-dingus-edit-box-is-inaccessible-to-some-screen-reader-users/1067/4.

Odd list behavior

10000. ok
    1. ok

should give a list with two items, but it does not.
Similarly

   10. hi
    11. there

The trigger is a four space indent. See this topic.

Diagnosis: In lib/blocks.js, around lines 384-399, the parser assumes that if a line is indented 4 or more spaces but it's not a code block (because it would be interrupting a paragraph), then it's a lazy paragraph continuation. That assumption is wrong, because it might be a list item.

Small typo in README.md

Hi. README.md says:

prependChild(child): Prepend a Node child to the end of the Node's children.

Should that say "to the beginning of the Node's children"?

Add source positions for inline elements

This would make it possible to use commonmark.js for syntax highlighting.

unnecessary \n in empty blockquotes

This is not a bug, only a suggestion:

now:

<blockquote>
</blockquote>


could be:

<blockquote></blockquote>

Currently this is the only thing we have to normalize in the markdown-it tests (because I decided not to add special cases to the renderer).

If you would rather not make this change, just close this ticket.

Blockquote termination edge-case

The spec says, “An indented code block cannot interrupt a paragraph.”

So this looks right:

$ echo -e '> 111\n    222' | ./commonmark-0.20 
<blockquote>
<p>111
222</p>
</blockquote>

But a list can:

$ echo -e '> 111\n - 222' | ./commonmark-0.20 
<blockquote>
<p>111</p>
</blockquote>
<ul>
<li>222</li>
</ul>

The trouble comes when the thing after the blockquote looks like this:

> 111
    - 222

$ echo -e '> 111\n    - 222' | ./commonmark-0.20 
<blockquote>
<p>111</p>
</blockquote>
<pre><code>- 222
</code></pre>

It terminates a blockquote saying “hey, I'm a list”, but is parsed as a code block afterwards.

Imho, it should be parsed like a paragraph continuation after all:

<blockquote>
<p>111
- 222</p>
</blockquote>

sourcepos not correct

The sourcepos attribute in the AST is incorrect when the same parser is run more than once:

     var reader = new commonmark.Parser();
     var writer = new commonmark.HtmlRenderer({sourcepos:true});
     console.log(writer.render(reader.parse("Hello *world*")));
     console.log(writer.render(reader.parse("Hello *world*")));

Now the result is

    <p data-sourcepos="1:1-1:13">Hello <em>world</em></p>
    <p data-sourcepos="2:1-1:13">Hello <em>world</em></p>

Also, why is sourcepos 1-based rather than 0-based?
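One plausible cause of output like this is a line counter that lives on the parser object and is not reset at the start of each parse() call. The sketch below uses a toy parser to show the effect; ToyParser and its resetOnParse option are hypothetical and do not reflect commonmark.js's actual code.

```javascript
// Toy stand-in for a parser; NOT the real commonmark.js implementation.
class ToyParser {
  constructor({ resetOnParse }) {
    this.resetOnParse = resetOnParse;
    this.lineNumber = 0; // state that outlives a single parse() call
  }

  parse(input) {
    if (this.resetOnParse) {
      this.lineNumber = 0; // the fix: start every document at line 1
    }
    const startLine = this.lineNumber + 1;
    this.lineNumber += input.split("\n").length;
    return { startLine, endLine: this.lineNumber };
  }
}

const buggy = new ToyParser({ resetOnParse: false });
const first = buggy.parse("Hello *world*");
const second = buggy.parse("Hello *world*"); // start line drifts to 2

const fixed = new ToyParser({ resetOnParse: true });
fixed.parse("Hello *world*");
const fixedSecond = fixed.parse("Hello *world*"); // start line stays at 1

console.log(first.startLine, second.startLine, fixedSecond.startLine);
```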

Dingus permalink doesn't include smart punctuation flag

To replicate the issue:

Visit this example
Check the "Smart punctuation" box
Click on "permalink" - note that "Smart punctuation" is not checked

tests for AST tree

Do you remember my topic about commonmark API design?

I'm in the process of figuring out the best way to do it. As a proof of concept, I chose to create a helper module. One of its methods is ast(), but I have no idea how to test it properly, and I didn't find any AST-related tests in commonmark.js either =(.

First I thought about deepEqual, but it failed because of circular references. Then I tried a simple equality check on JSON.stringify() output, but the circular references broke that too.

How can I verify that the AST has the proper structure? Any advice or tips would be useful, thanks.
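One way around the circular references, sketched here, is to convert the tree into plain nested objects that follow only the downward links and then compare those. The helper toPlain is hypothetical; it assumes each node exposes type, literal, firstChild, and next, and the nodes below are hand-built stand-ins rather than real parser output.

```javascript
// Hypothetical helper: follow only firstChild/next links, never parent links,
// so the result contains no cycles and can be compared or serialized.
function toPlain(node) {
  const children = [];
  for (let child = node.firstChild; child; child = child.next) {
    children.push(toPlain(child));
  }
  const plain = { type: node.type };
  if (node.literal != null) plain.literal = node.literal;
  if (children.length > 0) plain.children = children;
  return plain;
}

// Hand-built stand-in for a paragraph holding one text node, cycle included.
const text = { type: "text", literal: "Hello", firstChild: null, next: null };
const paragraph = { type: "paragraph", literal: null, firstChild: text, next: null, parent: null };
text.parent = paragraph; // the circular reference that breaks JSON.stringify

const plainTree = toPlain(paragraph);
console.log(JSON.stringify(plainTree));
```

The plain structure can then be passed to deepEqual against an expected literal.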

Escaped backslash at end of link label produces wrong output

If I am reading the spec right, the following sample should be a valid link label, shouldn't it?

[bar\\]: /url "title"

[bar\\]

http://spec.commonmark.org/dingus/?text=%5Bbar%5C%5C%5D%3A%20%2Furl%20%22title%22%0A%0A%5Bbar%5C%5C%5D%0A

However commonmark.js is not recognising it as such, and is instead generating a paragraph.
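The distinction at stake is whether the closing bracket is itself escaped. A character is escaped only when an odd number of backslashes precedes it, so in [bar\\] the bracket should still close the label. The helper below is a hypothetical sketch of that rule, not commonmark.js's actual code.

```javascript
// Hypothetical helper, not taken from commonmark.js.
// A character is escaped only if an odd number of backslashes precedes it.
function isEscapedAt(text, index) {
  let backslashes = 0;
  for (let i = index - 1; i >= 0 && text[i] === "\\"; i--) {
    backslashes++;
  }
  return backslashes % 2 === 1;
}

// String.raw keeps both backslashes exactly as they appear in the sample.
const label = String.raw`[bar\\]`;
console.log(isEscapedAt(label, label.length - 1)); // false: the bracket closes the label
```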

"&amp;" entity doesn't get converted to "&" character

Update: I was wrong about the issue, see comment below. Sorry.

Hello.

Specification says:

With the goal of making this standard as HTML-agnostic as possible, all valid HTML entities (except in code blocks and code spans) are recognized as such and converted into unicode characters before they are stored in the AST.

&amp; is a valid HTML entity, and when it is stored in the AST it should be converted into '&', but it is not.

This can be easily demonstrated with http://try.commonmark.org. If you enter this input:

&amp;
&mu;

You will get the following output in the AST tab:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:4">
  <paragraph sourcepos="1:1-2:4">
    <text>&amp;</text>
    <softbreak />
    <text>μ</text>
  </paragraph>
</document>

The output that I expect is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:4">
  <paragraph sourcepos="1:1-2:4">
    <text>&</text>
    <softbreak />
    <text>μ</text>
  </paragraph>
</document>

Thanks

tab-related regressions

I didn't quite understand the 0.21 spec changes related to tabs. The general idea seems good, but the devil is in the details, as they say. So I checked this implementation, and unfortunately its behavior appears to be buggy.
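For reference, the spec says that where whitespace defines block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters. A rough sketch of that column arithmetic follows; expandTabs is a hypothetical helper, and the real parser does not have to expand tabs this way.

```javascript
// Hypothetical helper: each tab advances to the next multiple of tabStop columns.
function expandTabs(line, tabStop = 4) {
  let out = "";
  for (const ch of line) {
    if (ch === "\t") {
      // out.length is the current column for plain ASCII input.
      out += " ".repeat(tabStop - (out.length % tabStop));
    } else {
      out += ch;
    }
  }
  return out;
}

// The tab sits at column 2, so it only spans two columns.
console.log(JSON.stringify(expandTabs(" -\tlist"))); // " -  list"
```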

A tab immediately after a list item marker used to be allowed; now it is not. Is this intentional?

$ echo -e ' -\tlist' | ./commonmark-0.20
<ul>
<li>list</li>
</ul>

$ echo -e ' -\tlist' | ./commonmark-0.21 
<p>-    list</p>

What happens if the code block indentation uses half a tab?

$ echo -e ' - foo\n\n\t\tbar' | ./commonmark-0.20
<ul>
<li>
<p>foo</p>
<pre><code> bar
</code></pre>
</li>
</ul>

$ echo -e ' - foo\n\n\t\tbar' | ./commonmark-0.21
<ul>
<li>
<p>foo</p>
<p>ar</p>
</li>
</ul>

A variation of the bug above. It may deserve separate attention because it is unclear whether - \t\tcode should be a code block or not (this might actually be a bug in 0.20):

$ echo -e ' - \t\tcode' | ./commonmark-0.20
<ul>
<li>code</li>
</ul>

$ echo -e ' - \t\tcode' | ./commonmark-0.21 
<ul>
<li>
<pre><code>de
</code></pre>
</li>
</ul>