
micromark


The smallest CommonMark compliant markdown parser. With positional info and concrete tokens.

Contents

When should I use this?

  • If you just want to turn markdown into HTML (with maybe a few extensions)
  • If you want to do really complex things with markdown

See § Comparison for more info

What is this?

micromark is an open source markdown parser written in JavaScript. It’s implemented as a state machine that emits concrete tokens, so that every byte is accounted for, with positional info. It then compiles those tokens directly to HTML, but other tools can take the data and for example build an AST, which is easier to work with (mdast-util-from-markdown).

While most markdown parsers work towards compliancy with CommonMark (or GFM), this project goes further by following how the reference parsers (cmark, cmark-gfm) work, which is confirmed with thousands of extra tests.

Other than CommonMark and GFM, micromark also supports common extensions to markdown such as MDX, math, and frontmatter.

These npm packages have a sibling project in Rust: markdown-rs.

Install

This package is ESM only. In Node.js (version 16+), install with npm:

npm install micromark

In Deno with esm.sh:

import {micromark} from 'https://esm.sh/micromark@3'

In browsers with esm.sh:

<script type="module">
  import {micromark} from 'https://esm.sh/micromark@3?bundle'
</script>

Use

Typical use (buffering):

import {micromark} from 'micromark'

console.log(micromark('## Hello, *world*!'))

Yields:

<h2>Hello, <em>world</em>!</h2>

You can pass extensions (in this case micromark-extension-gfm):

import {micromark} from 'micromark'
import {gfm, gfmHtml} from 'micromark-extension-gfm'

const value = '* [x] contact@example.com ~~strikethrough~~'

const result = micromark(value, {
  extensions: [gfm()],
  htmlExtensions: [gfmHtml()]
})

console.log(result)

Yields:

<ul>
<li><input checked="" disabled="" type="checkbox"> <a href="mailto:contact@example.com">contact@example.com</a> <del>strikethrough</del></li>
</ul>

Streaming interface:

import {createReadStream} from 'node:fs'
import {stream} from 'micromark/stream'

createReadStream('example.md')
  .on('error', handleError)
  .pipe(stream())
  .pipe(process.stdout)

function handleError(error) {
  // Handle your error here!
  throw error
}

API

See § API in the micromark readme.

Extensions

micromark supports extensions. There are two types of extensions for micromark: SyntaxExtension, which change how markdown is parsed, and HtmlExtension, which change how it compiles. They can be passed in options.extensions or options.htmlExtensions, respectively.

As a user of extensions, refer to each extension’s readme for more on how to use them. As a (potential) author of extensions, refer to § Extending markdown and § Creating a micromark extension.

List of extensions

Community extensions

SyntaxExtension

A syntax extension is an object whose fields are typically the names of hooks, referring to where constructs “hook” into. The fields of those hook objects are character codes, mapping to constructs as values.

The built in constructs are an example. See them and existing extensions for inspiration.
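
For instance, a minimal (hypothetical) syntax extension hooking into text at U+007B LEFT CURLY BRACE could be shaped like this; the names are illustrative, not a real extension:

```javascript
// Hypothetical sketch of a syntax extension's shape: the `text` hook maps
// character code 123 (`{`) to a construct with a `name` and a `tokenize`
// function that sets up a state machine.
const exampleConstruct = {
  name: 'example',
  tokenize(effects, ok, nok) {
    // A real tokenizer would enter tokens and consume codes here.
    return function start(code) {
      return nok(code)
    }
  }
}

const exampleSyntaxExtension = {
  text: {123: exampleConstruct}
}

console.log(typeof exampleSyntaxExtension.text[123].tokenize) // 'function'
```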

HtmlExtension

An HTML extension is an object whose fields are typically enter or exit (reflecting whether a token is entered or exited). The values at those fields are objects mapping token names to handlers.

See existing extensions for inspiration.
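
For instance, a minimal (hypothetical) HTML extension could be shaped like this, with an illustrative token name:

```javascript
// Hypothetical sketch of an HTML extension's shape: `enter` and `exit` map
// token names to handlers, which run when that token starts or ends.
const exampleHtmlExtension = {
  enter: {
    exampleString() {
      // runs when an `exampleString` token is entered
    }
  },
  exit: {
    exampleString() {
      // runs when an `exampleString` token is exited
    }
  }
}

console.log(Object.keys(exampleHtmlExtension)) // [ 'enter', 'exit' ]
```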

Extending markdown

micromark lets you change markdown syntax, yes, but there are alternatives. The alternatives are often better.

Over the years, many micromark and remark users have asked about their unique goals for markdown. Some exemplary goals are:

  1. I want to add rel="nofollow" to external links
  2. I want to add links from headings to themselves
  3. I want line breaks in paragraphs to become hard breaks
  4. I want to support embedded music sheets
  5. I want authors to add arbitrary attributes
  6. I want authors to mark certain blocks with meaning, such as tip, warning, etc
  7. I want to combine markdown with JS(X)
  8. I want to support our legacy flavor of markdown-like syntax

These can be solved in different ways and which solution is best is both subjective and dependent on unique needs. Often, there is already a solution in the form of an existing remark or rehype plugin. Respectively, their solutions are:

  1. remark-external-links
  2. rehype-autolink-headings
  3. remark-breaks
  4. custom plugin similar to rehype-katex but integrating abcjs
  5. either remark-directive and a custom plugin, or rehype-attr
  6. remark-directive combined with a custom plugin
  7. combining the existing micromark MDX extensions however you please, such as done by mdx-js/mdx or xdm
  8. Writing a micromark extension

Looking at these from a higher level, they can be categorized:

  • Changing the output by transforming syntax trees (1 and 2)

    This category is nice as the format remains plain markdown that authors are already familiar with and which will work with existing tools and platforms.

    Implementations will deal with the syntax tree (mdast) and the ecosystems remark and rehype. There are many existing utilities for working with that tree. Many remark plugins and rehype plugins also exist.

  • Using and abusing markdown to add new meaning (3, 4, potentially 5)

    This category is similar to Changing the output by transforming syntax trees, but adds a new meaning to certain things which already have semantics in markdown.

    Some examples in pseudocode:

    *   **A list item with the first paragraph bold**
    
        And then more content, is turned into `<dl>` / `<dt>` / `<dd>` elements
    
    Or, the title attribute on links or images is [overloaded](/url 'rel:nofollow')
    with a new meaning.
    
    ```csv
    fenced,code,can,include,data
    which,is,turned,into,a,graph
    ```
    
    ```js data can="be" passed=true
    // after the code language name
    ```
    
    HTML, especially comments, could be used as **markers**<!--id="markers"-->
  • Arbitrary extension mechanism (potentially 5; 6)

    This category is nice when content should contain embedded “components”. Often this means it’s required for authors to have some programming experience. There are three good ways to solve arbitrary extensions.

    HTML: Markdown already has an arbitrary extension syntax. It works in most places and authors are already familiar with the syntax, but it’s reasonably hard to implement securely. Certain platforms will remove HTML completely, others sanitize it to varying degrees. HTML also supports custom elements. These could be used and enhanced by client side JavaScript or enhanced when transforming the syntax tree.

    Generic directives: although a proposal and not supported on most platforms, directives do work with many tools already. They’re not the easiest to author compared to, say, a heading, but sometimes that’s okay. They do have potential: they nicely solve the need for an infinite number of potential extensions to markdown in a single markdown-esque way.

    MDX also adds support for components by swapping HTML out for JS(X). JSX is an extension to JavaScript, so MDX is something along the lines of literate programming. This does require knowledge of React (or Vue) and JavaScript, excluding some authors.

  • Extending markdown syntax (7 and 8)

    Extending the syntax of markdown means:

    • Authors won’t be familiar with the syntax
    • Content won’t work in other places (such as on GitHub)
    • Defeating the purpose of markdown: being simple to author and looking like what it means

    …and it’s hard to do as it requires some in-depth knowledge of JavaScript and parsing. But it’s possible and in certain cases very powerful.

Creating a micromark extension

This section shows how to create an extension for micromark that parses “variables” (a way to render some data) and one to turn a default construct off.

Stuck? See support.md.

Prerequisites

  • You should possess an intermediate to high understanding of JavaScript: it’s going to get a bit complex
  • Read the readme of unified (until you hit the API section) to better understand where micromark fits
  • Read the § Architecture section to understand how micromark works
  • Read the § Extending markdown section to understand whether it’s a good idea to extend the syntax of markdown

Extension basics

micromark supports two types of extensions. Syntax extensions change how markdown is parsed. HTML extensions change how it compiles.

HTML extensions are not always needed, as micromark is often used through mdast-util-from-markdown to parse to a markdown syntax tree. So instead of an HTML extension, a from-markdown utility is needed. Then an mdast-util-to-markdown utility, which is responsible for serializing syntax trees to markdown, is also needed.

When developing something for internal use only, you can pick and choose which parts you need. When open sourcing your extension, it should probably contain four parts: a syntax extension, an HTML extension, a from-markdown utility, and a to-markdown utility.

On to our first case!

Case: variables

Let’s first outline what we want to make: render some data, similar to how Liquid and the like work, in our markdown. It could look like this:

Hello, {planet}!

Turned into:

<p>Hello, Venus!</p>

An opening curly brace, followed by one or more characters, and then a closing brace. We’ll then look up planet in some object and replace the variable with its corresponding value, to get something like Venus out.

It looks simple enough, but with markdown there are often a couple more things to think about. For this case, I can see the following:

  • Is there a “block” version too?
  • Are spaces allowed? Line endings? Should initial and final white space be ignored?
  • Balanced nested braces? Superfluous ones such as {{planet}} or meaningful ones such as {a {pla} net}?
  • Character escapes ({pla\}net}) and character references ({pla&#x7d;net})?

To keep things as simple as possible, let’s not support a block syntax, treat spaces as special, allow line endings, or allow nested braces. But to learn interesting things, we will support character escapes and -references.

Note that this particular case is already solved quite nicely by micromark-extension-mdx-expression. It’s a bit more powerful and does more things, but it can be used to solve this case and otherwise serve as inspiration.

Setup

Create a new folder, enter it, and set up a new package:

mkdir example
cd example
npm init -y

In this example we’ll use ESM, so add type: 'module' to package.json:

@@ -2,6 +2,7 @@
   "name": "example",
   "version": "1.0.0",
   "description": "",
+  "type": "module",
   "main": "index.js",
   "scripts": {
     "test": "echo \"Error: no test specified\" && exit 1"

Add a markdown file, example.md, with the following text:

Hello, {planet}!

{pla\}net} and {pla&#x7d;net}.

To check if our extension works, add an example.js module, with the following code:

import fs from 'node:fs/promises'
import {micromark} from 'micromark'
import {variables} from './index.js'

const buf = await fs.readFile('example.md')
const out = micromark(buf, {extensions: [variables]})
console.log(out)

While working on the extension, run node example to see whether things work. Feel free to add more examples of the variables syntax in example.md if needed.

Our extension doesn’t work yet, for one because micromark is not installed:

npm install micromark --save-dev

…and we need to write our extension. Let’s do that in index.js:

export const variables = {}

Although our extension doesn’t do anything, running node example now somewhat works!

Syntax extension

Much in micromark is based on character codes (see § Preprocess). For this extension, the relevant codes are:

  • -5 — M-0005 CARRIAGE RETURN (CR)
  • -4 — M-0004 LINE FEED (LF)
  • -3 — M-0003 CARRIAGE RETURN LINE FEED (CRLF)
  • null — EOF (end of the stream)
  • 92 — U+005C BACKSLASH (\)
  • 123 — U+007B LEFT CURLY BRACE ({)
  • 125 — U+007D RIGHT CURLY BRACE (})
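
The printable codes in this list can be checked directly with String#charCodeAt; the negative codes and null are micromark’s own additions and have no single-character equivalent:

```javascript
// Verify the printable character codes used by the extension.
console.log('\\'.charCodeAt(0)) // 92, U+005C BACKSLASH
console.log('{'.charCodeAt(0)) // 123, U+007B LEFT CURLY BRACE
console.log('}'.charCodeAt(0)) // 125, U+007D RIGHT CURLY BRACE
```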

Also relevant are the content types (see § Content types). This extension is a text construct, as it’s parsed alongside links and such. The content inside it (between the braces) is string, to support character escapes and -references.

Let’s write our extension. Add the following code to index.js:

const variableConstruct = {name: 'variable', tokenize: variableTokenize}

export const variables = {text: {123: variableConstruct}}

function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    console.log('start:', effects, code)
    return nok(code)
  }
}

The above code exports an extension with the identifier variables. The extension defines a text construct for the character code 123. The construct has a name, so that it can be turned off (optional, see next case), and it has a tokenize function that sets up a state machine, which receives effects and the ok and nok states. ok can be used when successful, nok when not, and so constructs are a bit similar to how promises can resolve or reject. tokenize returns the initial state, start, which itself receives the current character code, prints some debugging information, and then returns a call to nok.

Ensure that things work by running node example and see what it prints.

Now we need to define our states and figure out how variables work. Some people prefer sketching a diagram of the flow. I often prefer writing it down in pseudo-code prose. I’ve also found that test driven development works well, where I write unit tests for how it should work, then write the state machine, and finally use a code coverage tool to ensure I’ve thought of everything.

In prose, what we have to code looks like this:

  • start: Receive 123 as code, enter a token for the whole (let’s call it variable), enter a token for the marker (variableMarker), consume code, exit the marker token, enter a token for the contents (variableString), switch to begin
  • begin: If code is 125, reconsume in nok. Else, reconsume in inside
  • inside: If code is -5, -4, -3, or null, reconsume in nok. Else, if code is 125, exit the string token, enter a variableMarker, consume code, exit the marker token, exit the variable token, and switch to ok. Else, consume, and remain in inside.

That should be it! Replace variableTokenize with the following to include the needed states:

function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    effects.enter('variable')
    effects.enter('variableMarker')
    effects.consume(code)
    effects.exit('variableMarker')
    effects.enter('variableString')
    return begin
  }

  function begin(code) {
    return code === 125 ? nok(code) : inside(code)
  }

  function inside(code) {
    if (code === -5 || code === -4 || code === -3 || code === null) {
      return nok(code)
    }

    if (code === 125) {
      effects.exit('variableString')
      effects.enter('variableMarker')
      effects.consume(code)
      effects.exit('variableMarker')
      effects.exit('variable')
      return ok
    }

    effects.consume(code)
    return inside
  }
}

Run node example again and see what it prints! The HTML compiler ignores things it doesn’t know, so variables are now removed.

We have our first syntax extension, and it sort of works, but we don’t handle character escapes and -references yet. We need to do two things to make that work: a) skip over \\ and \} in our algorithm, b) tell micromark to parse them.

Change the code in index.js to support escapes like so:

@@ -23,6 +23,11 @@ function variableTokenize(effects, ok, nok) {
       return nok(code)
     }

+    if (code === 92) {
+      effects.consume(code)
+      return insideEscape
+    }
+
     if (code === 125) {
       effects.exit('variableString')
       effects.enter('variableMarker')
@@ -35,4 +40,13 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     return inside
   }
+
+  function insideEscape(code) {
+    if (code === 92 || code === 125) {
+      effects.consume(code)
+      return inside
+    }
+
+    return inside(code)
+  }
 }

Finally add support for character references and character escapes between braces by adding a special token that defines a content type:

@@ -11,6 +11,7 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     effects.exit('variableMarker')
     effects.enter('variableString')
+    effects.enter('chunkString', {contentType: 'string'})
     return begin
   }

@@ -29,6 +30,7 @@ function variableTokenize(effects, ok, nok) {
     }

     if (code === 125) {
+      effects.exit('chunkString')
       effects.exit('variableString')
       effects.enter('variableMarker')
       effects.consume(code)

Tokens with a contentType will be replaced by postprocess (see § Postprocess) by the tokens belonging to that content type.

HTML extension

Up next is an HTML extension to replace variables with data. Change example.js to use one like so:

@@ -1,11 +1,12 @@
 import fs from 'node:fs/promises'
 import {micromark} from 'micromark'
-import {variables} from './index.js'
+import {variables, variablesHtml} from './index.js'

 const buf = await fs.readFile('example.md')
-const out = micromark(buf, {extensions: [variables]})
+const html = variablesHtml({planet: '1', 'pla}net': '2'})
+const out = micromark(buf, {extensions: [variables], htmlExtensions: [html]})
 console.log(out)

And add the HTML extension, variablesHtml, to index.js like so:

@@ -52,3 +52,19 @@ function variableTokenize(effects, ok, nok) {
     return inside(code)
   }
 }
+
+export function variablesHtml(data = {}) {
+  return {
+    enter: {variableString: enterVariableString},
+    exit: {variableString: exitVariableString},
+  }
+
+  function enterVariableString() {
+    this.buffer()
+  }
+
+  function exitVariableString() {
+    const id = this.resume()
+    if (id in data) {
+      this.raw(this.encode(data[id]))
+    }
+  }
+}

variablesHtml is a function that receives an object mapping “variables” to strings and returns an HTML extension. The extension hooks two functions to variableString, one when it starts, the other when it ends. We don’t need to do anything to handle the other tokens as they’re already ignored by default. enterVariableString calls buffer, which is a function that “stashes” what would otherwise be emitted. exitVariableString calls resume, which is the inverse of buffer and returns the stashed value. If the variable is defined, we ensure it’s made safe (with this.encode) and finally output that (with this.raw).

Further exercises

It works! We’re done! Of course, it can be better, such as with the following potential features:

  • Add support for empty variables
  • Add support for spaces between markers and string
  • Add support for line endings in variables
  • Add support for nested braces
  • Add support for blocks
  • Add warnings on undefined variables
  • Use micromark-build, and use devlop, debug, and micromark-util-symbol (see § Size & debug)
  • Add mdast-util-from-markdown and mdast-util-to-markdown utilities to parse and serialize the AST

Case: turn off constructs

Sometimes it’s needed to turn a default construct off. That’s possible through a syntax extension. Note that not everything can be turned off (such as paragraphs) and even if it’s possible to turn something off, it could break micromark (such as character escapes).

To disable constructs, refer to them by name in an array at the disable.null field of an extension:

import {micromark} from 'micromark'

const extension = {disable: {null: ['codeIndented']}}

console.log(micromark('\ta', {extensions: [extension]}))

Yields:

<p>a</p>

Architecture

micromark is maintained as a monorepo. Many of its internals, which are used in micromark (core) but also useful for developers of extensions or integrations, are available as separate modules. Each module maintained here is available in packages/.

Overview

The naming scheme in packages/ is as follows:

  • micromark-build — Small CLI to build dev code into production code
  • micromark-core-commonmark — CommonMark constructs used in micromark
  • micromark-factory-* — Reusable subroutines used to parse parts of constructs
  • micromark-util-* — Reusable helpers often needed when parsing markdown
  • micromark — Core module

micromark has two interfaces: buffering (maintained in micromark/dev/index.js) and streaming (maintained in micromark/dev/stream.js). The former takes all input at once whereas the latter uses a Node.js stream to take input separately. They thinly wrap how data flows through micromark:

                                            micromark
+------------------------------------------------------------------------------------------------+
|            +------------+         +-------+         +-------------+         +---------+        |
| -markdown->+ preprocess +-chunks->+ parse +-events->+ postprocess +-events->+ compile +-html-> |
|            +------------+         +-------+         +-------------+         +---------+        |
+------------------------------------------------------------------------------------------------+

Preprocess

The preprocessor (micromark/dev/lib/preprocess.js) takes markdown and turns it into chunks.

A chunk is either a character code or a slice of a buffer in the form of a string. Chunks are used because strings are more efficient storage than character codes, but limited in what they can represent. For example, the input ab\ncd is represented as ['ab', -4, 'cd'] in chunks.
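
As an illustrative sketch (not micromark’s actual preprocessor), chunking a string on line feeds could look like this:

```javascript
// Sketch: split input into chunks: string slices for runs of content,
// numeric codes for line endings (-4 is M-0004 LINE FEED).
function toChunks(value) {
  const chunks = []
  let start = 0

  for (let index = 0; index < value.length; index++) {
    if (value[index] === '\n') {
      if (index > start) chunks.push(value.slice(start, index))
      chunks.push(-4)
      start = index + 1
    }
  }

  if (start < value.length) chunks.push(value.slice(start))
  return chunks
}

console.log(toChunks('ab\ncd')) // [ 'ab', -4, 'cd' ]
```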

A character code is often the same as what String#charCodeAt() yields but micromark adds meaning to certain other values.

In micromark, the actual character U+0009 CHARACTER TABULATION (HT) is replaced by one M-0002 HORIZONTAL TAB (HT) and between 0 and 3 M-0001 VIRTUAL SPACE (VS) characters, depending on the column at which the tab occurred. For example, the input \ta is represented as [-2, -1, -1, -1, 97] and a\tb as [97, -2, -1, -1, 98] in character codes.
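
The tab expansion can be sketched as follows, assuming 4-column tab stops (an approximation of what the real preprocessor does):

```javascript
// Sketch: expand a tab occurring at a given 1-based column into one
// M-0002 HORIZONTAL TAB (-2) plus enough M-0001 VIRTUAL SPACE (-1)
// characters to reach the next tab stop (every 4 columns).
function expandTab(column) {
  const codes = [-2] // the tab itself
  let col = column

  while (col % 4 !== 0) {
    codes.push(-1) // virtual space
    col++
  }

  return codes
}

console.log(expandTab(1)) // [ -2, -1, -1, -1 ]: `\ta` → [-2, -1, -1, -1, 97]
console.log(expandTab(2)) // [ -2, -1, -1 ]: `a\tb` → [97, -2, -1, -1, 98]
```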

The characters U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR) are replaced by virtual characters depending on whether they occur together: M-0003 CARRIAGE RETURN LINE FEED (CRLF), M-0004 LINE FEED (LF), and M-0005 CARRIAGE RETURN (CR). For example, the input a\r\nb\nc\rd is represented as [97, -5, 98, -4, 99, -3, 100] in character codes.

The 0 (U+0000 NUL) character code is replaced by U+FFFD REPLACEMENT CHARACTER (�).

The null code represents the end of the input stream (called eof for end of file).

Parse

The parser (micromark/dev/lib/parse.js) takes chunks and turns them into events.

An event is the start or end of a token amongst other events. Tokens can “contain” other tokens, even though they are stored in a flat list, by entering before and exiting after them.
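
As a simplified sketch (micromark’s real events carry token objects and tokenizer context, not bare names), nesting can be recovered from the flat list by counting enters and exits:

```javascript
// Sketch: a flat list of enter/exit events, and a walk that recovers the
// maximum nesting depth. The ['enter'|'exit', name] shape is illustrative.
const events = [
  ['enter', 'paragraph'],
  ['enter', 'emphasis'],
  ['enter', 'data'],
  ['exit', 'data'],
  ['exit', 'emphasis'],
  ['exit', 'paragraph']
]

function maxDepth(list) {
  let depth = 0
  let max = 0

  for (const [kind] of list) {
    if (kind === 'enter') {
      depth++
      if (depth > max) max = depth
    } else {
      depth--
    }
  }

  return max
}

console.log(maxDepth(events)) // 3
```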

A token is a span of one or more codes. Tokens are most of what micromark produces: the built in HTML compiler or other tools can turn them into different things. Tokens are essentially names attached to a slice, such as lineEndingBlank for certain line endings, or codeFenced for a whole fenced code.

Sometimes, more info is attached to tokens, such as _open and _close by attention (strong, emphasis) to signal whether the sequence can open or close an attention run. These fields have to do with how the parser works, which is complex and not always pretty.

Certain fields (previous, next, and contentType) are used in many cases: linked tokens for subcontent. Linked tokens are used because outer constructs are parsed first. Take for example:

- *a
  b*.

  1. The list marker and the space after it is parsed first
  2. The rest of the line is a chunkFlow token
  3. The two spaces on the second line are a linePrefix of the list
  4. The rest of the line is another chunkFlow token

The two chunkFlow tokens are linked together and the chunks they span are passed through the flow tokenizer. There the chunks are seen as chunkContent and passed through the content tokenizer. There the chunks are seen as a paragraph, whose contents are seen as chunkText and passed through the text tokenizer. Finally, the attention (emphasis) and data (“raw” characters) are parsed there, and we’re done!

Content types

The parser starts out with a document tokenizer. Document is the top-most content type, which includes containers such as block quotes and lists. Containers in markdown come from the margin and include more constructs on the lines that define them.

Flow represents the sections (block constructs such as ATX and setext headings, HTML, indented and fenced code, thematic breaks), which like document are also parsed per line. An example is HTML, which has a certain starting condition (such as <script> on its own line), then continues for a while, until an end condition is found (such as </script>). If that line with an end condition is never found, that flow goes until the end.

Content is zero or more definitions, and then zero or one paragraph. It’s a weird one, and needed to make certain edge cases around definitions spec compliant. Definitions are unlike other things in markdown: they behave like text in that they can contain arbitrary line endings, yet they have to end at a line ending. If they end in something else, the whole definition is instead seen as a paragraph.

The content in markdown first needs to be parsed up to this level to figure out which things are defined, for the whole document, before continuing on with text, as whether a link or image reference forms or not depends on whether it’s defined. This unfortunately prevents a true streaming markdown parser.

Text contains phrasing content (rich inline text: autolinks, character escapes and -references, code, hard breaks, HTML, images, links, emphasis, strong).

String is a limited text-like content type which only allows character references and character escapes. It exists in things such as identifiers (media references, definitions), titles, or URLs and such.

Constructs

Constructs are the things that make up markdown. Some examples are lists, thematic breaks, or character references.

Note that, as a general rule of thumb, markdown is really weird. It’s essentially made up of edge cases rather than logical rules. When browsing the built in constructs, or venturing to build your own, you’ll find confusing new things and run into complex custom hooks.

One reasonably simple construct is the thematic break (see code). It’s an object that defines a name and a tokenize function. Most of what constructs do is defined in their required tokenize function, which sets up a state machine to handle character codes streaming in.

Postprocess

The postprocessor (micromark/dev/lib/postprocess.js) is a small step that takes events, ensures all their nested content is parsed, and returns the modified events.

Compile

The compiler (micromark/dev/lib/compile.js) takes events and turns them into HTML. While micromark was created mostly to advance markdown parsing irrespective of compiling to HTML, the common case of doing so is built in. A built in HTML compiler is useful because it allows us to check for compliancy to CommonMark, the de facto norm of markdown, specified in roughly 650 input/output cases. The parsing parts can still be used separately to build ASTs, CSTs, or many other output formats.

The compiler has an interface that accepts lists of events instead of the whole at once, but because markdown can’t truly stream, events are buffered before compiling and outputting the final result.

Examples

GitHub flavored markdown (GFM)

To support GFM (autolink literals, strikethrough, tables, and tasklists) use micromark-extension-gfm. Say we have a file like this:

# GFM

## Autolink literals

www.example.com, https://example.com, and contact@example.com.

## Footnote

A note[^1]

[^1]: Big note.

## Strikethrough

~one~ or ~~two~~ tildes.

## Table

| a | b  |  c |  d  |
| - | :- | -: | :-: |

## Tag filter

<plaintext>

## Tasklist

* [ ] to do
* [x] done

Then do something like this:

import fs from 'node:fs/promises'
import {micromark} from 'micromark'
import {gfm, gfmHtml} from 'micromark-extension-gfm'

const doc = await fs.readFile('example.md')

console.log(micromark(doc, {extensions: [gfm()], htmlExtensions: [gfmHtml()]}))

Yields:

<h1>GFM</h1>
<h2>Autolink literals</h2>
<p><a href="http://www.example.com">www.example.com</a>, <a href="https://example.com">https://example.com</a>, and <a href="mailto:contact@example.com">contact@example.com</a>.</p>
<h2>Footnote</h2>
<p>A note<sup><a href="#user-content-fn-1" id="user-content-fnref-1" data-footnote-ref="" aria-describedby="footnote-label">1</a></sup></p>
<h2>Strikethrough</h2>
<p><del>one</del> or <del>two</del> tildes.</p>
<h2>Table</h2>
<table>
<thead>
<tr>
<th>a</th>
<th align="left">b</th>
<th align="right">c</th>
<th align="center">d</th>
</tr>
</thead>
</table>
<h2>Tag filter</h2>
&lt;plaintext&gt;
<h2>Tasklist</h2>
<ul>
<li><input disabled="" type="checkbox"> to do</li>
<li><input checked="" disabled="" type="checkbox"> done</li>
</ul>
<section data-footnotes="" class="footnotes"><h2 id="footnote-label" class="sr-only">Footnotes</h2>
<ol>
<li id="user-content-fn-1">
<p>Big note. <a href="#user-content-fnref-1" data-footnote-backref="" class="data-footnote-backref" aria-label="Back to content"></a></p>
</li>
</ol>
</section>

Math

To support math use micromark-extension-math. Say we have a file like this:

Lift($L$) can be determined by Lift Coefficient ($C_L$) like the following equation.

$$
L = \frac{1}{2} \rho v^2 S C_L
$$

Then do something like this:

import fs from 'node:fs/promises'
import {micromark} from 'micromark'
import {math, mathHtml} from 'micromark-extension-math'

const doc = await fs.readFile('example.md')

console.log(micromark(doc, {extensions: [math()], htmlExtensions: [mathHtml()]}))

Yields:

<p>Lift(<span class="math math-inline"><span class="katex"></span></span>) can be determined by Lift Coefficient (<span class="math math-inline"><span class="katex"></span></span>) like the following equation.</p>
<div class="math math-display"><span class="katex-display"><span class="katex"></span></span></div>

Syntax tree

A higher level project, mdast-util-from-markdown, can give you an AST.

import fromMarkdown from 'mdast-util-from-markdown' // This wraps micromark.

const result = fromMarkdown('## Hello, *world*!')

console.log(result.children[0])

Yields:

{
  type: 'heading',
  depth: 2,
  children: [
    {type: 'text', value: 'Hello, ', position: [Object]},
    {type: 'emphasis', children: [Array], position: [Object]},
    {type: 'text', value: '!', position: [Object]}
  ],
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 1, column: 19, offset: 18}
  }
}

Another level up is remark, which provides a nice interface and hundreds of plugins.

Markdown

CommonMark

The first definition of “Markdown” gave several examples of how it worked, showing input Markdown and output HTML, and came with a reference implementation (Markdown.pl). When new implementations followed, they mostly followed the first definition, but deviated from the first implementation, and added extensions, thus making the format a family of formats.

Some years later, an attempt was made to standardize the differences between implementations, by specifying how several edge cases should be handled, through more input and output examples. This is known as CommonMark, and many implementations now work towards some degree of CommonMark compliancy. Still, CommonMark describes what the output in HTML should be given some input, which leaves many edge cases up for debate, and does not answer what should happen for other output formats.

micromark passes all tests from CommonMark and has many more tests to match the CommonMark reference parsers. Finally, it comes with CMSM, which describes how to parse markup, instead of documenting input and output examples.

Grammar

The syntax of markdown can be described in Backus–Naur form (BNF) as:

markdown = .*

No, that’s not a typo: markdown has no syntax errors; anything thrown at it renders something.
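In JavaScript terms, that grammar is a regular expression that accepts every string, the empty string included. A quick sketch (`markdownGrammar` and `isValidMarkdown` are illustrative names, not part of micromark):

```javascript
// The BNF rule `markdown = .*` as a regular expression: `[^]` matches any
// character, including line endings, so every input is valid markdown.
const markdownGrammar = /^[^]*$/

function isValidMarkdown(value) {
  return markdownGrammar.test(value)
}
```

Which is to say: a markdown parser can never reject its input; it can only decide what the input means.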

Project

Comparison

There are many other markdown parsers out there and maybe they’re better suited to your use case! Here is a short comparison of a couple in JavaScript. Note that this list is made by the folks who make micromark and remark, so there is some bias.

Note: these are, in fact, not really comparable: micromark (and remark) focus on completely different things than other markdown parsers do. Sure, you can generate HTML from markdown with them, but micromark (and remark) are created for (abstract or concrete) syntax trees—to inspect, transform, and generate content, so that you can make things like MDX, Prettier, or Astro.

micromark

micromark can be used in two different ways. It can either be used, optionally with existing extensions, to get HTML easily. Or, it can give tremendous power, such as access to all tokens with positional info, at the cost of being hard to get into. It’s super small, pretty fast, and has 100% CommonMark compliance. It has syntax extensions, such as supporting 100% GFM compliance (with micromark-extension-gfm), but they’re rather complex to write. It’s the newest parser on the block, which means it’s fresh and well suited for contemporary markdown needs, but it’s also battle-tested, and already the 3rd most popular markdown parser in JavaScript.

If you’re looking for fine grained control, use micromark. If you just want HTML from markdown, use micromark.

remark

remark is the most popular markdown parser. It’s built on top of micromark and boasts syntax trees. For an analogy, it’s like if Babel, ESLint, and more, were one project. It supports the syntax extensions that micromark has (so it’s 100% CM compliant and can be 100% GFM compliant), but most of the work is done in plugins that transform or inspect the tree, and there’s tons of them. Transforming the tree is relatively easy: it’s a JSON object that can be manipulated directly. remark is stable, widely used, and extremely powerful for handling complex data.

You probably should use remark.

marked

marked is the oldest markdown parser on the block. It’s been around for ages, is battle tested, small, popular, and has a bunch of extensions, but doesn’t match CommonMark or GFM, and is unsafe by default.

If you have markdown you trust and want to turn it into HTML without a fuss, and don’t care about perfect compatibility with CommonMark or GFM, but do appreciate a small bundle size and stability, use marked.

markdown-it

markdown-it is a good, stable, and essentially CommonMark compliant markdown parser, with (optional) support for some GFM features as well. It’s used a lot as a direct dependency in packages, but is rather big. It shines at syntax extensions, where you want to support not just markdown, but your (company’s) version of markdown.

If you need a couple of custom syntax extensions to your otherwise CommonMark-compliant markdown, and want to get HTML out, use markdown-it.

Others

There are lots of other markdown parsers! Some say they’re small, or fast, or that they’re CommonMark compliant—but that’s not always true. This list is not supposed to be exhaustive (but it’s the most relevant ones). This list of markdown parsers is a snapshot in time of why (not) to use (alternatives to) micromark: they’re all good choices, depending on what your goals are.

Test

micromark is tested with the ~650 CommonMark tests and more than 1.2k extra tests confirmed with CM reference parsers. These tests reach all branches in the code, which means that this project has 100% code coverage. Finally, we use fuzz testing to ensure micromark is stable, reliable, and secure.

To build, format, and test the codebase, use $ npm test after clone and install. The $ npm run test-api script checks the unit tests; $ npm run test-coverage checks both the unit tests and their coverage.

The $ npm run test-fuzz script does fuzz testing for 30 minutes.

Size & debug

micromark is really small. A ton of time went into making sure it minifies well, by the way code is written but also through custom build scripts to pre-evaluate certain expressions. Furthermore, care went into making it compress well with gzip and brotli.
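The pre-evaluation trick can be sketched like this (the names below are illustrative, not micromark’s actual build internals):

```javascript
// Development source: readable named constants for character codes.
const codes = {asterisk: 42, underscore: 95}

function isEmphasisMarker(code) {
  return code === codes.asterisk || code === codes.underscore
}

// After the build step, member expressions such as `codes.asterisk` are
// replaced by their values, so the minifier only sees plain numbers:
function isEmphasisMarkerBuilt(code) {
  return code === 42 || code === 95
}
```

Both functions behave identically; the built form simply gives the minifier and compressor less to carry.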

Normally, you’ll use the pre-evaluated version of micromark. While developing, debugging, or testing your code, you should switch to use code instrumented with assertions and debug messages:

node --conditions development module.js

To see debug messages, use a DEBUG env variable set to micromark:

DEBUG="*" node --conditions development module.js

Version

micromark adheres to semver since 3.0.0.

Security

The typical security aspect discussed for markdown is cross-site scripting (XSS) attacks. Markdown itself is safe if it does not include embedded HTML or dangerous protocols in links/images (such as javascript: or data:). micromark makes any markdown safe by default, even if HTML is embedded or dangerous protocols are used, as it encodes or drops them. Turning on the allowDangerousHtml or allowDangerousProtocol options for user-provided markdown opens you up to XSS attacks.
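The protocol handling can be approximated in a sketch (an assumption for illustration, not micromark’s actual code; the allowlist here is a guess at a sensible default):

```javascript
// Allow relative URLs and a small allowlist of protocols; everything
// else (javascript:, data:, vbscript:, …) is considered dangerous.
const safeProtocols = new Set(['http', 'https', 'irc', 'ircs', 'mailto', 'xmpp'])

function isSafeUrl(url) {
  const match = /^([a-z][a-z0-9+.-]*):/i.exec(url)
  if (!match) return true // No protocol at all: a relative URL.
  return safeProtocols.has(match[1].toLowerCase())
}
```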

Another security aspect is DDoS attacks. For example, an attacker could throw a 100mb file at micromark, in which case the JavaScript engine will run out of memory and crash. It is also possible to crash micromark with smaller payloads, notably when thousands of links, images, emphasis, or strong are opened but not closed. It is wise to cap the accepted size of input (500kb can hold a big book) and to process content in a different thread or worker so that it can be stopped when needed.
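A minimal input-size guard along those lines (a sketch; the limit and the names are illustrative):

```javascript
const MAX_INPUT_SIZE = 500 * 1024 // ~500kb: enough to hold a big book.

// Reject oversized input before handing it to the parser; `parse` is any
// markdown-to-HTML function, such as micromark.
function safeRender(value, parse) {
  if (value.length > MAX_INPUT_SIZE) {
    throw new Error('Input too large: refusing to parse')
  }
  return parse(value)
}
```

For hard isolation, run the parse in a worker thread as well, so a stuck or crashing parse can be terminated without taking down the main process.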

Using extensions might also be unsafe, refer to their documentation for more information.

For more information on markdown sanitization, see improper-markup-sanitization.md by @chalker.

See security.md in micromark/.github for how to submit a security report.

Contribute

See contributing.md in micromark/.github for ways to get started. See support.md for ways to get help.

This project has a code of conduct. By interacting with this repository, organisation, or community you agree to abide by its terms.

Sponsor

Support this effort and give back by sponsoring on OpenCollective!


Salesforce 🏅

Vercel

Motif

HashiCorp

GitBook

Gatsby

Netlify

Coinbase

ThemeIsle

Expo

Boost Note

Markdown Space

Holloway


You?

Origin story

Over the summer of 2018, micromark was planned, and the idea shared in August with a couple of friends and potential sponsors. The problem I (@wooorm) had was that issues were piling up in remark and other repos, but my day job (teaching) was fun, fulfilling, and deserved time too. It was getting hard to combine the two. The thought was to feed two birds with one scone: fix the issues in remark with a new markdown parser (codename marydown) while being financially supported by sponsors building fancy stuff on top, such as Gatsby, Contentful, and Vercel (ZEIT at the time). @johno was making MDX on top of remark at the time (important historical note: several other folks were working on JSX + markdown too). We bundled our strengths: MDX was getting some traction and we thought together we could perhaps make something sustainable.

In November 2018, we launched with the idea for micromark to solve all existing bugs, sustaining the existing hundreds of projects, and furthering the exciting high-level project MDX. We pushed a single name: unified (which back then was a small but essential part of the chain). Gatsby and Vercel were immediate sponsors. We didn’t know whether it would work, and it worked. But now you have a new problem: you are getting some financial support (much more than other open source projects) but it’s not enough money for rent, and too much money to print stickers with. You still have your job and issues are still piling up.

At the start of summer 2019, after a couple months of saving up donations, I quit my job and worked on unified through fall. That got the number of open issues down significantly and set up a strong governance and maintenance system for the collective. But when the time came to work on micromark, the money was gone again, so I contracted through winter 2019, and in spring 2020 I could do about half open source, half contracting. One of the contracting gigs was to write a new MDX parser, for which I also documented how to do that with a state machine in prose. That gave me the insight into how the same could be done for markdown: I drafted CMSM, which was some of the core ideas for micromark, but in prose.

In May 2020, Salesforce reached out: they saw the bugs in remark, how micromark could help, and the initial work on CMSM. And they had thousands of Markdown files. In a move uncharacteristic for open source, they decided to fund my work on micromark. A large part of what maintaining open source means is putting out fires, triaging issues, and making sure users and sponsors are happy, so it was amazing to get several months to just focus and make something new. I remember feeling that this project would probably be the hardest thing I’d work on: yeah, parsers are pretty difficult, but markdown is on another level. Markdown is such a giant stack of edge cases on edge cases on even more weirdness, what a mess. On August 20, 2020, I released 2.0.0, the first working version of micromark. And it’s hard to describe how that moment felt. It was great.

In 2022, Vercel paid me to make a Rust version: markdown-rs. Super cool that I got to continue this work and bring it to a new language.

License

MIT © Titus Wormer

micromark's People

Contributors

ahacad, bryanph, christianmurphy, davidanson, fazouane-marouane, ocavue, paulbarmstrong, porges, robsimmons, shartte, timshilov, tripodsan, trysound, wooorm


micromark's Issues

Using power-assert causes Webpack builds to fail

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

Use micromark in a Webpack app.

Expected behavior

The app builds without configuration changes.

Actual behavior

The build fails with:

WARNING in ../../node_modules/power-assert-formatter/lib/create.js 30:28-49
Critical dependency: the request of a dependency is an expression
 @ ../../node_modules/power-assert-formatter/index.js 12:0-40
 @ ../../node_modules/power-assert/index.js 15:16-49
 @ ../../node_modules/micromark/dev/lib/create-tokenizer.js 27:0-33 213:4-22 217:4-10 223:4-22 231:4-22 236:4-10 285:4-10 286:4-25 298:4-10 299:4-25 302:4-10 305:4-22 311:4-10 414:10-16 464:8-26 472:8-26 508:4-10 576:4-10 577:4-10 652:
10-16
 @ ../../node_modules/micromark/dev/lib/parse.js 14:0-53 49:13-28

and similar warnings for any package that now has a dependency on power-assert.

There is a solution provided by power-assert here, but it seems like it would hide other warnings that I would want to see and I don't think I should need to modify my Webpack config to get micromark to work.

It would probably be better not to add power-assert as a dependency in a patch release since it's likely to break many people's builds.

Also note that the main micromark package uses power-assert, but I think it's missing from the package's dependencies.

Out of curiosity, what is the advantage of switching to power-assert?

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

control character and punctuation cause extra emphasis to appear

Initial checklist

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-njevp4?file=index.js

Steps to reproduce

import { micromark } from 'micromark';
import { Parser, HtmlRenderer } from 'commonmark';
import rehypeParse from 'rehype-parse';
import { unified } from 'unified';
import { visit } from 'unist-util-visit';
import lodash from 'lodash';

const reader = new Parser();
const writer = new HtmlRenderer();
function scrubber(tree) {
  visit(tree, function (node) {
    node.data = undefined;
    node.value = undefined;
    node.position = undefined;
  });

  return tree;
}

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = `example*�.*example example**`;

const micromarkHtml = micromark(content, {
  allowDangerousHtml: true,
  allowDangerousProtocol: true,
}).trim();
const commonmarkHtml = commonmark(content).trim();

const micromarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(micromarkHtml)
);
const commonmarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(commonmarkHtml)
);

console.log('micromark');
console.log(micromarkHtml);
console.log('');
console.log(JSON.stringify(micromarkHtmlAst, null, 4));
console.log('');
console.log('commonmark');
console.log(commonmark(content));
console.log('');
console.log(JSON.stringify(commonmarkHtmlAst, null, 2));
console.log(lodash.isEqual(micromarkHtmlAst, commonmarkHtmlAst));

Expected behavior

single emphasis in the document

<p>example*�.<em>example example</em>*</p>

with the HTML structure

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "p",
      "properties": {},
      "children": [
        {
          "type": "text"
        },
        {
          "type": "element",
          "tagName": "em",
          "properties": {},
          "children": [
            {
              "type": "text"
            }
          ]
        },
        {
          "type": "text"
        }
      ]
    }
  ]
}

Actual behavior

extra emphasis is added

<p>example<em>�.<em>example example</em></em></p>

changing the structure

{
    "type": "root",
    "children": [
        {
            "type": "element",
            "tagName": "p",
            "properties": {},
            "children": [
                {
                    "type": "text"
                },
                {
                    "type": "element",
                    "tagName": "em",
                    "properties": {},
                    "children": [
                        {
                            "type": "text"
                        },
                        {
                            "type": "element",
                            "tagName": "em",
                            "properties": {},
                            "children": [
                                {
                                    "type": "text"
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

HTML with excess whitespace is not parsed correctly

Initial checklist

Affected packages and versions

micromark 3.0.10, mdast-util-from-markdown 1.2.0

Link to runnable example

https://codesandbox.io/s/awesome-elbakyan-c0gus?file=/src/index.ts

Steps to reproduce

If you remove the blank line between Some HTML and Spanning multiple lines, it does work, but the excess whitespace confuses the parser (it thinks the extra line means a new paragraph starts)

Expected behavior

The HTML should be all combined in a single html node

Actual behavior

The parsing fails

Potential Heap overflow/memory leak

Subject of the issue

Fuzz testing micromark, by itself without plugins (#18 modified)

const fs = require('fs')
const micromark = require('../index')

function fuzz(buf) {
  try {
    // focus on issues in files less than 1Mb
    if (buf.length > 1000000) return

    // write result in temp file in case unrecoverable exception is thrown
    fs.writeFileSync('temp.txt', buf)

    // commonmark buffer without html
    micromark(buf)
  } catch (e) {
    throw e
  }
}

module.exports = {
  fuzz
}

after running through 10-30 files often crashes with:

<--- Last few GCs --->

[16841:0x4e8fc10]    11334 ms: Mark-sweep (reduce) 3664.6 (4118.7) -> 3664.6 (4118.7) MB, 162.9 / 0.0 ms  (average mu = 0.067, current mu = 0.000) last resort GC in old space requested
[16841:0x4e8fc10]    11494 ms: Mark-sweep (reduce) 3664.6 (4115.7) -> 3664.6 (4116.7) MB, 160.5 / 0.0 ms  (average mu = 0.033, current mu = 0.000) last resort GC in old space requested


<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0xa02dd0 node::Abort() [node]
 2: 0x94e471 node::FatalError(char const*, char const*) [node]
 3: 0xb7686e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb76be7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd31485  [node]
 6: 0xd43cf1 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 7: 0xd09562 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [node]
 8: 0xd033e4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [node]
 9: 0xd0b719 v8::internal::Factory::NewInternalizedStringImpl(v8::internal::Handle<v8::internal::String>, int, unsigned int) [node]
10: 0xf3169f v8::internal::StringTable::AddKeyNoResize(v8::internal::Isolate*, v8::internal::StringTableKey*) [node]
11: 0xf3fa16 v8::internal::Handle<v8::internal::String> v8::internal::StringTable::LookupKey<v8::internal::InternalizedStringKey>(v8::internal::Isolate*, v8::internal::InternalizedStringKey*) [node]
12: 0xf3fac6 v8::internal::StringTable::LookupString(v8::internal::Isolate*, v8::internal::Handle<v8::internal::String>) [node]
13: 0xb7644b v8::internal::LookupIterator::LookupIterator(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Name>, unsigned long, v8::internal::Handle<v8::internal::JSReceiver>, v8::internal::LookupIterator::Configuration) [node]
14: 0xee1809 v8::internal::LookupIterator::LookupIterator(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::LookupIterator::Key const&, v8::internal::LookupIterator::Configuration) [node]
15: 0x106d9f9 v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::StoreOrigin, v8::Maybe<v8::internal::ShouldThrow>) [node]
16: 0x106eb07 v8::internal::Runtime_SetKeyedProperty(int, unsigned long*, v8::internal::Isolate*) [node]
17: 0x13fe259  [node]
timeout: the monitored command dumped core
Aborted

on an innocuous-looking file, like

# Foo

| Name | GitHub | Twitter |
| ---- | ------ | ------- |

Your environment

  • OS: Ubuntu
  • Packages: Micromark 2.8.0, including if #21 is applied
  • Env: Node 14

Steps to reproduce

Run fuzzer from #18

Expected behavior

no crash

Actual behavior

The same “FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory” crash and V8 stack trace as shown above.

`assert` is not browser friendly

Initial checklist

Affected packages and versions

latest

Link to runnable example

No response

Steps to reproduce

browserify/node-util#62 (comment)

Expected behavior

Use console.assert instead.

Actual behavior

assert is a node built-in module.

Runtime

Node v14

Package manager

yarn v1

OS

macOS

Build and bundle tools

Vite

micromark crashes on invalid URI

Subject of the issue

Some malformed URL can crash micromark

Your environment

  • OS: Ubuntu 16
  • Packages: micromark 2.6.0
  • Env: Node 14

Steps to reproduce

var micromark = require('micromark')

console.log(micromark('[](<%>)'))

originally detected with #18, credit to @wooorm for a more minimal repro

Expected behavior

<p><a href="%25"></a></p>

Actual behavior

URIError: URI malformed
    at decodeURI (<anonymous>)
    at normalizeUri (micromark/dist/util/normalize-uri.js:1:1040)
    at url (micromark/dist/compile/html.js:1:54303)
    at Object.onexitmedia (micromark/dist/compile/html.js:1:61812)
    at done (micromark/dist/compile/html.js:1:50389)
    at compile (micromark/dist/compile/html.js:1:48534)
    at buffer (micromark/dist/index.js:1:2192)
    at Worker.fuzz [as fn] (micromark/fuzzer.js:1:1781)
    at process.<anonymous> (micromark/node_modules/jsfuzz/build/src/worker.js:63:30)

Incorrect handling of emphasis for Japanese language

Emphasis markup is parsed incorrectly for Japanese language

Your environment

Steps to reproduce

Input (Japanese) :

console.log(micromark("1.  **新規アプリの追加(NEW APP)**を選択します。"));

Output - incorrect:

<ol>
<li>**新規アプリの追加(NEW APP)**を選択します。</li>
</ol>

Expected behavior

Emphasis should be parsed correctly

<ol>
<li><strong>新規アプリの追加(NEW APP)</strong>を選択します。</li>
</ol>

Additional:

The same text in English and Chinese

Input (English):

console.log(micromark("1.  Select **NEW APP** (top-left corner)"));

Output - correct:

<ol>
<li>Select <strong>NEW APP</strong> (top-left corner)</li>
</ol>

Input (Chinese):

console.log(micromark("1.  选择**添加应用**(左上角)"));

Output - correct:

<ol>
<li>选择<strong>添加应用</strong>(左上角)</li>
</ol>

This bug appeared when I switched to [email protected] from [email protected].

The next code works correctly:

import unified from "unified";
import markdown from "remark-parse"; // 8.0.3
import rehype from "remark-rehype"; // 8.0.0
import stringify from "rehype-stringify"; // 8.0.0

unified()
    .use(markdown)
    .use(rehype)
    .use(stringify)
    .process("1.  __新規アプリの追加(NEW APP)__を選択します。", function(err, file) {
        console.log(String(file));
    });

Ordered lists starting with non-1 are not parsed when some content is present before them (micromark 3)

Initial checklist

Affected packages and versions

3.0.0 (via mdast-util-from-markdown 1.0.0)

Please let me know if this issue should be moved to mdast-util-from-markdown, but I think the bug is somewhere in micromark source :)

Link to runnable example

https://codesandbox.io/s/naughty-ptolemy-y62bf?file=/src/index.ts

Steps to reproduce

This is not parsed as list (paragraph before)

Content.

2. Hello
3. world

This is also not parsed as list (empty line before)


2. Hello
3. world

This is parsed as list (when trimmed, start number is correct)

2. Hello
3. world

This is also parsed as list (obviously)

Content 

1. Hello
2. world

Expected behavior

List starting with non-1 numbers are parsed correctly.

Github handles it:

  1. Hello
  2. world

micromark pre-3 also handled it correctly.

Actual behavior

List starting with non-1 numbers are not parsed correctly when some paragraph or even empty line is present before them (in container?) 🤷

Content.

2. Hello
3. world
{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "Content.",
                    "position": {
                        "start": {
                            "line": 2,
                            "column": 1,
                            "offset": 1
                        },
                        "end": {
                            "line": 2,
                            "column": 9,
                            "offset": 9
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 2,
                    "column": 1,
                    "offset": 1
                },
                "end": {
                    "line": 2,
                    "column": 9,
                    "offset": 9
                }
            }
        },
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "2. Hello\n3. world",
                    "position": {
                        "start": {
                            "line": 4,
                            "column": 1,
                            "offset": 11
                        },
                        "end": {
                            "line": 5,
                            "column": 9,
                            "offset": 28
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 4,
                    "column": 1,
                    "offset": 11
                },
                "end": {
                    "line": 5,
                    "column": 9,
                    "offset": 28
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 6,
            "column": 1,
            "offset": 29
        }
    }
}

Runtime etc.

I do not think it's build / runtime dependent - it's some construct issue - but it happens both in browser & node - windows & linux.

unravelLinkedTokens RangeError: Maximum call stack size exceeded

Subject of the issue

#18 fed with https://github.com/remarkjs/remark/blob/8108fe54e04640dda119aad366d70e6edf2602f1/test/fixtures/input/title-attributes.text can trigger a call stack exceeded issue in unravelLinkedTokens.
These files are pretty large, 1mb and around 30k lines apiece; a more minimal example, at 105kb, is also included.

It seems to be related to unterminated links, but more research is needed.

Your environment

  • OS: Ubuntu
  • Packages: micromark 2.6.1
  • Env: node 14

Steps to reproduce

var fs = require('fs')
var micromark = require('./index')

// var doc = fs.readFileSync('crash-395a731d55c510f1338b8c9911c159ab56329d18bc3a12a26b826b750d0b1253.txt')
// var doc = fs.readFileSync('crash-4bf6a4882505b11dea88b5e16e6f0d3766252601ae704e42ebe606d270f9f26f.txt')
var doc = fs.readFileSync('crash-7182fa3e89e1b8fb28bda27b6da6b3769f05b1ce68551d96c46acd0931d95004.txt')

var result = micromark(doc)

console.log(result)

crash-7182fa3e89e1b8fb28bda27b6da6b3769f05b1ce68551d96c46acd0931d95004.txt
crash-4bf6a4882505b11dea88b5e16e6f0d3766252601ae704e42ebe606d270f9f26f.txt
crash-395a731d55c510f1338b8c9911c159ab56329d18bc3a12a26b826b750d0b1253.txt

a more minimal example of what may be the same issue ([]( repeated 35k times in a 105kb file)

repeated-unterminated-links.txt
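For reference, that minimal repro can be generated in a couple of lines (a sketch; `buildRepro` is an illustrative name):

```javascript
// Build the pathological input from the attachment: `[](` repeated
// 35 000 times (roughly 105kb), each opening a link that never closes.
function buildRepro(count = 35000) {
  return '[]('.repeat(count)
}
```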

Expected behavior

If possible no error, alternatively a better error message could help.

Actual behavior

RangeError: Maximum call stack size exceeded
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:16585)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)

`TokenizeContext.sliceSerialize` for `Token.type` of `setextHeading` includes non-heading content from outside the range of [`startLine`, `endLine`]

Initial checklist

Affected packages and versions

4.0.0

Link to runnable example

No response

Steps to reproduce

user@HOST micromark-setext % npm ls micromark
micromark-setext@ /Users/user/Documents/micromark-setext
└── [email protected]

user@HOST micromark-setext % cat issue.mjs 
import { parse } from "micromark";
import { postprocess } from "micromark";
import { preprocess } from "micromark";

const markdown = `
Text

Setext
======

Text
`;

const encoding = undefined;
const end = true;
const options = undefined;
const chunks = preprocess()(markdown, encoding, end);
const parseContext = parse(options).document().write(chunks);
const events = postprocess(parseContext);
for (const event of events) {
  const [ kind, token, context ] = event;
  if (kind === "enter") {
    const { type, start, end } = token;
    const { "line": startLine } = start;
    const { "line": endLine } = end;
    console.dir(`${type} (${startLine}-${endLine}): ${context.sliceSerialize(token)}`);
  }
}
user@HOST micromark-setext % node issue.mjs  
'lineEndingBlank (1-2): \n'
'content (2-2): Text'
'paragraph (2-2): Text'
'data (2-2): Text'
'lineEnding (2-3): \n'
'lineEndingBlank (3-4): \n'
'setextHeading (4-5): Text\n\nSetext\n======'
'setextHeadingText (4-4): Setext'
'data (4-4): Setext'
'lineEnding (4-5): \n'
'setextHeadingLine (5-5): ======'
'setextHeadingLineSequence (5-5): ======'
'lineEnding (5-6): \n'
'lineEndingBlank (6-7): \n'
'content (7-7): Text'
'paragraph (7-7): Text'
'data (7-7): Text'
'lineEnding (7-8): \n'
user@HOST micromark-setext %

Expected behavior

Note specifically this part of the output: 'setextHeading (4-5): Text\n\nSetext\n======'

While the start and end lines are correct, the output of sliceSerialize includes "Text\n\n" from lines 2 and 3 which is not part of the heading (confirmed by the associated setextHeadingText token which contains only "Setext").
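As a workaround until the slice is fixed, a caller holding the original source can slice by the token's own offsets, which are correct per the output above. A sketch (the offsets below are hand-computed for this sample, not taken from a real token):

```javascript
// Slice the original source using a token's start/end offsets instead
// of relying on TokenizeContext.sliceSerialize.
function sliceByToken(source, token) {
  return source.slice(token.start.offset, token.end.offset)
}

const source = '\nText\n\nSetext\n======\n\nText\n'
// Hand-computed for this sample: 'Setext\n======' spans offsets 7..20.
const setextToken = {start: {offset: 7}, end: {offset: 20}}
console.log(sliceByToken(source, setextToken)) // 'Setext\n======'
```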

Actual behavior

See above.

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

hard break at the end of a paragraph is not properly parsed

Initial checklist

Affected packages and versions

micromark

Link to runnable example

https://codesandbox.io/s/thirsty-fire-1jdcgn

Steps to reproduce

parse this:

# Trailing hard-break

This break is properly detected\
yes?

But a trailing break is not\



What's worse, it leaves a stray `\`

Checking the GitHub behaviour, it's the same :-( but of course this is rather unfortunate, and it's difficult to find a workaround.
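One possible preprocessing workaround, sketched under the assumption that a backslash immediately before a blank line or the end of input is never wanted, is to strip it before parsing:

```javascript
// Remove a backslash that sits at the very end of a paragraph, i.e.
// immediately before a blank line or the end of input, where CommonMark
// treats it as a literal `\` rather than a hard break.
function stripTrailingHardBreaks(markdown) {
  return markdown.replace(/\\(?=\n(?:\n|$)|$)/g, '')
}

const input = 'But a trailing break is not\\\n\nNext paragraph'
console.log(stripTrailingHardBreaks(input))
// 'But a trailing break is not\n\nNext paragraph'
```

Note this sketch does not check whether the backslash is itself escaped (`\\`), so restrict the pattern if literal backslashes at line ends matter.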

Expected behavior

<h1>Trailing hard-break</h1>
<p>This break is properly detected<br />
yes?</p>
<p>But a trailing break is not<br /></p>
<p>What's worse, it leaves a stray <code>\</code></p>

Actual behavior

<h1>Trailing hard-break</h1>
<p>This break is properly detected<br />
yes?</p>
<p>But a trailing break is not\</p>
<p>What's worse, it leaves a stray <code>\</code></p>

Runtime

Node v14

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

Attention misnests tokens

I have been wishing to write a (simple and lightweight) spec‐compliant editor for Markdown with syntax highlighting for a while now.

Now that this library has become usable (and it seems to be the first of its kind), I have finally gotten an opportunity to write a simple editor with it! (Thank you! 🎉)

Unfortunately, there appears to be a bug in the library! The issue I’m running into is that emphases marked with *** (both regular and strong) have their tokens misnested.

I have written a simple program to demonstrate what I mean:

simple reproduction example
import parser from "https://dev.jspm.io/[email protected]/lib/parse.js"
import preprocessor from "https://dev.jspm.io/[email protected]/lib/preprocess.js"
import postprocessor from "https://dev.jspm.io/[email protected]/lib/postprocess.js"

let preprocess = txt =>
{
	let write = preprocessor()
	return [...write(txt), ...write(null)]
}

let parse = text => postprocessor()(preprocess(text).flatMap(parser().document().write))

let tokens = parse("hello ***world***")
tokens.pop()

let output = ""

let i = 0
let offset
for (let [kind, {type, start, end}] of tokens)
{
	let char = "→"
	if (kind === "enter") offset = start.offset
	else offset = end.offset, i--, char = "←"
	output += `${" ".repeat(i*3) + char} ${type} at ${offset}\n`
	if (kind === "enter") i++
}

console.log(output)

(Note: I’m using dev.jspm.io for now, as opposed to jspm.dev, because jspm.dev bundles the whole library into its index file, as opposed to separating it into multiple files. See more info on jspm.dev’s announcement post)

Currently, the output is the following:

current output
→ content at 0
   → paragraph at 0
      → data at 0
      ← data at 5
      → data at 5
      ← data at 6
      → emphasis at 8
         → emphasisSequence at 8
         ← emphasisSequence at 9
         → emphasisText at 9
            → strong at 6
               → strongSequence at 6
               ← strongSequence at 8
               → strongText at 8
                  → data at 9
                  ← data at 14
               ← strongText at 15
               → strongSequence at 15
               ← strongSequence at 17
            ← strong at 17
         ← emphasisText at 14
         → emphasisSequence at 14
         ← emphasisSequence at 15
      ← emphasis at 15
   ← paragraph at 17
← content at 17

As you can see, when moving from → emphasisText at 9 to → strong at 6 (as well as in other places), the indices go down, which is unexpected. This causes my highlighter to break! 😱

Thanks in advance for the attention!

Reduce execution time by ~11% with a simple reimplementation of TokenizeContext.now

Initial checklist

Affected packages and versions

latest main branch

Link to runnable example

No response

Steps to reproduce

I ran a profile of micromark and noticed TokenizeContext.now was something like the 4th most time-consuming function. It’s quite a simple function as all it does is return a copy of point. I tried a couple of alternate implementations and found one that reduces runtime by ~11% on an Apple Silicon M1 (mac Mini). I imagine the delta is different on other hardware, so maybe other folks can give this a try on their hardware? That said, I expect this change will be more efficient everywhere because it avoids a call into Object.assign and tells the JIT exactly what needs to be done. All tests pass with this change; you can see the code change and minimal test harness here: main...DavidAnson:micromark:TokenizeContext-now.

I added a scenario to perf.js that reads the content of readme.md and calls micromark 500 times. This input seems fairly representative, but I’m happy if folks want to profile on something else. The numbers below are pretty stable, so I only took three samples before/after.

The 3 readings I did before changing anything: ((17.726 + 17.786 + 17.606) / 3) = 17.706s

The 3 readings I did after making the change: ((15.75 + 15.676 + 15.688) / 3) = 15.705s

By my math, the time eliminated is: ((17.706 - 15.705) / 17.706) = 0.1130 = 11.30%

To be sure, the alternate implementation I propose here violates the encapsulation of Point - but a simple test case could be added to ensure any future changes to Point are accommodated.

I can send a proper PR if folks are open to this change.
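The shape of the change can be sketched like this (field names follow micromark's Point - line, column, offset, plus internal _index and _bufferIndex - treat the exact set as an assumption):

```javascript
// Generic copy: Object.assign cannot tell the JIT which fields exist.
function nowAssign(point) {
  return Object.assign({}, point)
}

// Explicit literal: a fixed object shape the engine can specialize on.
function nowLiteral(point) {
  return {
    line: point.line,
    column: point.column,
    offset: point.offset,
    _index: point._index,
    _bufferIndex: point._bufferIndex
  }
}

const point = {line: 2, column: 3, offset: 9, _index: 1, _bufferIndex: 0}
console.log(nowLiteral(point)) // same fields as nowAssign(point), new object
```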

Expected behavior

N/A

Actual behavior

N/A

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

Error - [webpack] 'dist': ./node_modules/micromark-util-decode-numeric-character-reference/index.js 23:11 Module parse failed: Identifier directly after number

Initial checklist

Affected packages and versions

using node 18.17.1

Link to runnable example

No response

Steps to reproduce

I am using botframework-webchat and when i try to build it, the below error message pop-ups.

Error - [webpack] 'dist':
./node_modules/micromark-util-decode-numeric-character-reference/index.js 23:11
Module parse failed: Identifier directly after number (23:11)
You may need an appropriate loader to handle this file type, currently no loaders are configured to process this file. See https://webpack.js.org/concepts#loaders
| code > 126 && code < 160 ||
| // Lone high surrogates and low surrogates.

code > 55_295 && code < 57_344 ||
| // Noncharacters.
| code > 64_975 && code < 65_008 || /* eslint-disable no-bitwise */
@ ./node_modules/mdast-util-from-markdown/lib/index.js 138:0-97 1061:14-45
@ ./node_modules/mdast-util-from-markdown/index.js
@ ./node_modules/botframework-webchat/lib/markdown/private/iterateLinkDefinitions.js
@ ./node_modules/botframework-webchat/lib/markdown/renderMarkdown.js
@ ./node_modules/botframework-webchat/lib/index.js
@ ./lib/extensions/chatbotExtension/renderer/Chatbot.js
@ ./lib/extensions/chatbotExtension/renderer/ChatbotPanel.js
@ ./lib/extensions/chatbotExtension/ChatbotExtensionApplicationCustomizer.js
./node_modules/micromark-util-sanitize-uri/index.js 86:22
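For context (my reading of the error, not a confirmed diagnosis): `55_295` uses ES2021 numeric separators, where the underscores are purely visual, and bundler parsers that predate the syntax report exactly "Identifier directly after number". The values are ordinary numbers:

```javascript
// Numeric separators change nothing about the value.
console.log(55_295 === 55295) // true
// 55296..57343 (0xD800..0xDFFF) is the surrogate range the guard excludes.
console.log(55_295 === 0xd800 - 1) // true
console.log(57_344 === 0xe000) // true
```

Upgrading the bundler, or transpiling these packages through Babel, would likely resolve it.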

Expected behavior

the package should build successfully

Actual behavior

currently it's giving an error while running npm build

Runtime

Node v16

Package manager

npm v7

OS

Windows

Build and bundle tools

Webpack

Performance improvement: linked lists for events

Subject of the feature

Given that on large markdown files we are dealing with tons (literally, 100k or so) of events, improving performance might be switching from arrays to linked event objects.

Problem

Operations on big arrays can be slow, such as #21.
Switching to linked lists adds complexity (while removing it in certain other cases!), but will probably/hopefully improve perf.

Alternatives

We’re already using really fast array methods. And everything is mutating already. Maybe linked lists won’t net a lot.
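The core of the idea can be sketched in a few lines (illustrative, not micromark's code): replacing a subrange of an array shifts every later element, while a linked list re-links in constant time once the boundary nodes are known.

```javascript
// Replace `removeCount` nodes after `before` with the inserted run.
function listSplice(before, removeCount, insertedHead, insertedTail) {
  let node = before.next
  for (let i = 0; i < removeCount; i++) node = node.next
  before.next = insertedHead
  insertedTail.next = node
}

const third = {value: 3, next: null}
const second = {value: 2, next: third}
const first = {value: 1, next: second}
const replacement = {value: 9, next: null}
listSplice(first, 1, replacement, replacement) // swap node 2 for node 9
console.log(first.value, first.next.value, first.next.next.value) // 1 9 3
```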

nested ordered lists not starting with 1. are not detected

Initial checklist

Affected packages and versions

micromark 3.2

Link to runnable example

https://codesandbox.io/s/trusting-worker-lt4scv

Steps to reproduce

nested ordered lists are not parsed correctly if they don't start with 1.

they work on GitHub:

  1. a
    2. foo
    3. bar
  2. b

Expected behavior

nested lists are parsed correctly on any level.

also see failing test: main...adobe-rnd:micromark:nested-lists-test

Actual behavior

if a nested ordered list doesn't start with 1., it is not parsed as a list

Runtime

Node v16

Package manager

No response

OS

No response

Build and bundle tools

No response

`index.d.ts` is missing in `micromark-util-encode` published files

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

The package.json published for micromark-util-encode v1.0.0 contains "types": "index.d.ts":

https://unpkg.com/browse/[email protected]/package.json

Yet the index.d.ts file is not in the files whitelist:

"files": [
"index.js"
],

Most likely that is the reason index.d.ts isn't published:

https://unpkg.com/browse/[email protected]/

I haven't checked if other micromark packages have a similar issue; this is just what I discovered in my particular project:
Screen Shot 2021-12-25 at 10 41 32 am

Expected behavior

Types declared in the package.json should be published.

Actual behavior

Types declared in the package.json are not published.

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

List items wrapped in <p> tags due to trailing space

Initial checklist

Affected packages and versions

3

Link to runnable example

No response

Steps to reproduce

In chrome's console run:

const mm = await import('https://esm.sh/micromark@3?bundle');
console.log(mm.micromark('List1\n* item1\n* item2\n\n\n\n'));
console.log('------');
console.log(mm.micromark('List1\n* item1\n* item2\n\n\n \n'));

Note the only difference between the two examples is a single space some blank lines away from the list. The two examples return different HTML; the latter has the list elements wrapped in <p>:

<p>List1</p>
<ul>
<li>item1</li>
<li>item2</li>
</ul>
------
<p>List1</p>
<ul>
<li>
<p>item1</p>
</li>
<li>
<p>item2</p>
</li>
</ul>

Expected behavior

I'm not clear enough on the markdown spec to say which case is actually correct. Certainly other markdown parsers I've tried (though that is not a long list) render it like the first example.

Regardless I'd expect it to be the same between the two. In most markdown editors the trailing space is impossible to see and it can take a long time to track down why some list elements render with increased padding.
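Whichever behavior is correct, normalizing input first sidesteps the surprise. A sketch, under the assumption that whitespace-only lines carry no meaning for the author:

```javascript
// Strip spaces/tabs from lines that contain nothing else. Content
// lines are untouched, so two-space hard breaks survive.
function stripBlankLineWhitespace(markdown) {
  return markdown.replace(/^[ \t]+$/gm, '')
}

const input = 'List1\n* item1\n* item2\n\n\n \n'
console.log(JSON.stringify(stripBlankLineWhitespace(input)))
// "List1\n* item1\n* item2\n\n\n\n"
```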

Actual behavior

See repro steps. Two examples output visually different HTML whereas I feel they should render the same.

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

TokenizeContext.sliceSerialize throws in sliceChunks if first chunk of token is Code instead of string

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

user@HOST micromark-issue % npm ls micromark
micromark-issue@ /Users/user/micromark-issue
└── [email protected]

user@HOST micromark-issue % cat issue.mjs 
import { parse } from "micromark/lib/parse";
import { postprocess } from "micromark/lib/postprocess";
import { preprocess } from "micromark/lib/preprocess";

function repro(markdown) {
  console.log("trying...");
  const encoding = undefined;
  const end = true;
  const options = undefined;
  const chunks = preprocess()(markdown, encoding, end);
  const parseContext = parse(options).document().write(chunks);
  const events = postprocess(parseContext);
  for (const event of events) {
    const [ _, token, context ] = event;
    context.sliceSerialize(token);
  }
  console.log("ok");
}

repro("Heading\n=======");
repro("\nHeading\n=======");
user@HOST micromark-issue % node issue.mjs 
trying...
ok
trying...
file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:520
      view[0] = view[0].slice(startBufferIndex)
                        ^

TypeError: view[0].slice is not a function
    at sliceChunks (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:520:25)
    at sliceStream (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:154:12)
    at Object.sliceSerialize (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:149:28)
    at repro (file:///Users/user/micromark-issue/issue.mjs:15:13)
    at file:///Users/user/micromark-issue/issue.mjs:21:1
    at ModuleJob.run (node:internal/modules/esm/module_job:198:25)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:385:24)
    at async loadESM (node:internal/process/esm_loader:88:5)
    at async handleMainPromise (node:internal/modules/run_main:61:12)
user@HOST micromark-issue % 

Expected behavior

sliceSerialize should always be safe to call in a manner like the above and should return a meaningful string. The presence of a leading \n in Markdown (for example) should not need to be guarded against by library users.
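For background on the throw: in micromark's streams a chunk can be a string or a numeric character code (negative codes stand in for specially handled characters such as line endings), and `view[0].slice` assumes a string. A defensive serializer branches on the type; the code-to-string table below is a simplified assumption, not the library's exact mapping:

```javascript
// Serialize a mixed stream of string chunks and numeric character codes.
function serializeChunks(chunks) {
  let result = ''
  for (const chunk of chunks) {
    if (typeof chunk === 'string') result += chunk
    else if (chunk === -5 || chunk === -4 || chunk === -3) result += '\n' // line endings
    else if (chunk === -2) result += '\t' // tab
    else if (chunk === -1) result += ' ' // virtual space
    else if (chunk > 0) result += String.fromCharCode(chunk)
  }
  return result
}

console.log(serializeChunks([-4, 'Heading', -4, '='])) // "\nHeading\n="
```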

Actual behavior

Exception, see above

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

Emphasis and strong when immediately followed by emphasis in the same word causes extra asterisks to appear

Issue from react-markdown: remarkjs/react-markdown#812

But potentially the root of the issue could live in the md parser. Below I have linked the repro links and comments from the other issue:

When processing the MD string

***123****456*

<em>
  <strong>123</strong>
</em>
<em>456</em>

React markdown renders what seems to be some additional asterisks?

Screenshot 2024-01-31 at 1 15 46 PM

Initial checklist

Affected packages and versions

react-markdown

Link to runnable example

No response

Steps to reproduce

Compare the result of:

This is just 1 word, where the first half is both italicized and bolded, and the 2nd half is only italicized.

The MDAST that gets created from unified() => rehypeParse => rehypeRemark looks correct, so to me the issue seems to be either:

  1. The syntax generated from the processing flow is incorrect.
  2. The syntax is correct, and it's React-Markdown's rendering of the syntax that is not correct.

Expected behavior

Screenshot 2024-01-31 at 1 15 38 PM

Actual behavior

Screenshot 2024-01-31 at 1 15 46 PM

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

crash on reference like structure before directive with parenthesis

Subject of the issue

const micromark = require("micromark/lib");
const directive = require("micromark-extension-directive");

micromark( "[!]:)", "utf-8", { extensions: [directive()] });

throws

AssertionError [ERR_ASSERTION]: expected non-empty token (`chunkString`)

Your environment

Steps to reproduce

run:

const micromark = require("micromark/lib");
const directive = require("micromark-extension-directive");

micromark( "[!]:)", "utf-8", { extensions: [directive()] });

Expected behavior

No error, or a more specific markdown syntax related error

Actual behavior

AssertionError [ERR_ASSERTION]: expected non-empty token (`chunkString`)

Custom extensions break in development mode, despite working in production

Initial checklist

Affected packages and versions

3.1.0

Link to runnable example

https://github.com/chudoklates/micromark-error-demo

Steps to reproduce

Use repo provided above.

Generally, for this error to occur, the parser needs to be run through Webpack in development mode. There also needs to be an extension which calls effects.consume() in its syntax before effects.enter() is called

Expected behavior

Actions which are permissible in the production distribution should also be permissible in development mode.

Actual behavior

A TypeError is thrown when the code reaches this assertion:

// at the point of error: code: 123, context.events: []
assert(
      code === null
        ? context.events.length === 0 ||
            context.events[context.events.length - 1][0] === 'exit'
        : context.events[context.events.length - 1][0] === 'enter',
      'expected last token to be open'
    )
Uncaught TypeError: Cannot read properties of undefined (reading '0')
    at Object.consume (create-tokenizer.js:246:52)
    at onStart (extensions.js:45:13)
    at start (create-tokenizer.js:460:12)
    at start (create-tokenizer.js:401:46)
    at start (text.js:49:30)
    at go (create-tokenizer.js:229:13)
    at main (create-tokenizer.js:209:11)
    at Object.write (create-tokenizer.js:135:5)
    at subcontent (index.js:198:17)
    at subtokenize (index.js:90:30)

Runtime

Node v16

Package manager

yarn v1

OS

macOS

Build and bundle tools

Webpack

Configure collapsing newlines into a single paragraph

Initial checklist

Problem

I want to have several paragraphs like this:

I am a paragraph.
I am part of the same paragraph.

But I am a new paragraph.

This is compiled to the following:

I am a paragraph. I am part of the same paragraph. But I am a new paragraph.

Solution

I'd expect the following result:

I am a paragraph. I am part of the same paragraph.

But I am a new paragraph.

Alternatives

I could use the <p> tag manually in the Markdown.

`micromark-util-symbol` can not be imported by typescript

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

try to import micromark-util-symbol in typescript

Expected behavior

no type error reporting

Actual behavior

get error: Cannot find module 'micromark-util-symbol' or its corresponding type declarations.ts(2307)

Possible Solution

add a field in package.json:

  "types": "./lib/default.d.ts",

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Split code into several packages, use export maps and conditions

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)

Subject

Split code in several packages, use export maps and condition

Problem

Given that:

  1. We have instrumented development code (with assertions and more verbose code such as using codes.greaterThan instead of the actual character code) and optimized production code, that is currently split into dist/ or lib/ respectively
  2. Many of the internals (such as constants like codes, values, constants, but also the utilities on detecting characters, the factories in tokenize/factory-, or the tiny things in lib/util/) are useful in micromark extensions (or inverted: many of the extensions currently use the micromark’s internals)

Solution

I propose:

  1. Making micromark/micromark a monorepo that houses a couple of projects
  2. Using micromark-factory-* as a namespace in the ecosystem for factories: some housed in the monorepo (how to parse a label), some in their own repos in this org (how micromark-extension-directive parses HTML attributes or micromark-extension-expression parses JavaScript), yet some others in the ecosystem
  3. As we already have ESM, combine several files into one exported file, that uses named export to expose their functions. For example:
    • micromark-core-character would expose all the ascii*, unicode*, and markdown* functions currently in micromark/lib/character
    • micromark-core-constant would expose codes, constants, values, types, html-block-names, html-raw-names
  4. Create a small rollup config file or wrapper that takes a prod/ folder which houses a micromark extension/factory/core, and builds a dev/ folder from it, copying types, inlining constants, and removing assertions
  5. Use export maps with conditions (see endorsed ones) set to either development / production / default (same as prod I guess)

Alternatives

?

effects.check() modifies events when the construct is for the document and has a resolver

Initial checklist

Affected packages and versions

micromark

Link to runnable example

https://github.com/wataru-chocola/report-micromark-20210827

Steps to reproduce

Run my PoC.

$ git clone https://github.com/wataru-chocola/report-micromark-20210827
$ cd report-micromark-20210827
$ npm install
$ npx node index.js

Expected behavior

document constructs are invoked twice in micromark/lib/initialize/document.js :

  1. from checkNewContainers state

    return effects.check(
      containerConstruct,
      thereIsANewContainer,
      thereIsNoNewContainer
    )(code)
  2. from documentContinued

    return effects.attempt(
      containerConstruct,
      containerContinue,
      flowStart
    )(code)

And I expect that the first invocation, effects.check(...), doesn't make any modifications to events.

 * @property {Attempt} check
 *   Attempt, then revert.

Actual behavior

effects.check() does modify events if the construct is for the document and has a resolver.

My construct in PoC code dumps context.events at the start.
On 1st run (from effects.check), we see the correct events which are generated by previous tokenization.

+ initialize tokenizer (runCount: 1)
+ previous events
[ 'enter', 'chunkFlow', 'term\n' ]
[ 'exit', 'chunkFlow', 'term\n' ]
+ run resolverTo

But on 2nd run (from effects.attempt), events are modified by resolver in the previous check execution.

+ initialize tokenizer (runCount: 2)
+ previous events
[ 'enter', 'defListTerm', 'term\n' ]
[ 'enter', 'chunkFlow', 'term\n' ]
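The documented contract ("Attempt, then revert") suggests a snapshot/restore around the attempt. A sketch of why a length-based snapshot is not enough once a resolver rewrites earlier entries in place (illustrative, not micromark's actual code):

```javascript
function checkWithLengthSnapshot(events, run) {
  const snapshot = events.length
  run(events)
  events.length = snapshot // drops appended events...
  // ...but in-place edits to events[0..snapshot) survive the "revert".
}

const events = [['enter', 'chunkFlow']]
checkWithLengthSnapshot(events, (ev) => {
  ev[0] = ['enter', 'defListTerm'] // a resolver rewriting history
  ev.push(['exit', 'defListTerm'])
})
console.log(events) // [['enter', 'defListTerm']] — the rewrite leaked
```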

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Trade-off between extensibility and performance

Say we take:

  1. Indented code: it ends when there is a bogus line. But there could be infinity blank lines before that bogus line.
  2. HTML blocks of kind 6 or 7: one blank line ends it.

Do we backtrack to before the blank lines, and check all the tokenisers again (blank line is last probably), or is there a knowledge of what other tokenisers are enabled and can we “eat” every blank line directly?

The trade-off here is that either, with knowledge of other tokens, we can be more performant and scan the buffer fewer times, or we are more extensible, allowing blank lines to be turned off, or alternative tokenisers from extensions to deal with them.

micromark handles links with custom protocol different from commonmark

Initial checklist

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-qr2fly?file=index.js

Steps to reproduce

import { micromark } from "micromark";
import { Parser, HtmlRenderer } from "commonmark";

const reader = new Parser();
const writer = new HtmlRenderer();

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = `<test:what>`;

console.log(micromark(content));
console.log(commonmark(content));

Expected behavior

micromark and commonmark should produce the same HTML output

<p><a href="test:what">test:what</a></p>

Actual behavior

micromark produces different HTML

<p><a href="">test:what</a></p>
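What seems to be happening (my reading): micromark's HTML compiler sanitizes URLs against an allow-list of protocols and empties the rest, while commonmark's renderer does not. A sketch of that kind of sanitizer (the protocol list and exact behavior here are assumptions, not micromark's code):

```javascript
// Empty out URLs whose protocol is not on the allow-list; leave
// relative URLs (no protocol) alone.
function sanitizeUri(url, allowedProtocols) {
  const match = /^([^:/?#]+):/.exec(url)
  if (!match) return url
  return allowedProtocols.includes(match[1].toLowerCase()) ? url : ''
}

const allowed = ['http', 'https', 'mailto']
console.log(sanitizeUri('test:what', allowed)) // ''
console.log(sanitizeUri('https://example.com', allowed)) // 'https://example.com'
```

micromark's `allowDangerousProtocol: true` option skips this sanitization, which may serve as a workaround here.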

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

How far to buffer?

Markdown consists of blocks and inlines. Blocks are parsed per line.

Typically, at a certain point in a line, you know you’re right: take this ATX heading:

###### A heading

When standing on the space, you know you’re in a heading: it can’t be anything else. So ATX headings don’t really need to buffer a lot: at most 6 characters.

Other values, need more, like this link definition:

[take]:
https://this-link-definition
'asd
> block quote?
asd
asd
asd
asd

Only at the last character, the line feed without a closing title marker before it, do you know you need to backtrack, and parse the whole thing again. And it isn’t all a paragraph either, take for example the embedded > block quote?

An alternative example that needs to buffer infinity lines is indented code:

␠␠␠␠this is a chunk (a properly indented non-blank line)
␠␠␠
␠␠
␠
␠␠
␠␠␠
␠␠␠␠
␠␠␠␠␠
␠␠␠␠
␠␠␠
␠␠
␠
␠␠␠
<-- And only here do we know the blank lines are not part of the indented code. Note that the line endings, and more than four spaces in a blank line, still show up in the code, so if we had another chunk, all the above line endings and that one extra space would be there.

🤔 So how far does one buffer? These are edge cases, not common in normal Markdown. But it could be interesting to see if we can cap this to reduce a potential memory problem.

Strings ending with `\n-` are compiled into a level 2 heading

Initial checklist

Affected packages and versions

4.0.0

Link to runnable example

https://codesandbox.io/s/trusting-star-wv879z?file=/src/index.mjs

Steps to reproduce

I've created a minimal reproduction of the issue here:

Screenshot 2023-10-03 at 2 52 12 pm

Expected behavior

I'd expect the string to be compiled into a paragraph with the hyphen to be at the start of the 2nd line.

Actual behavior

The string is compiled into a level 2 heading
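Per CommonMark, a `-` line directly under paragraph text is a setext heading underline, which is what the compiler is matching here. If the hyphen is meant literally, escaping it before parsing is one possible workaround (a sketch; note it also escapes `---` thematic breaks, so restrict the pattern if those matter):

```javascript
// Prefix a backslash to lines made only of `-` or `=`, so they are no
// longer read as setext underlines.
function escapeSetextUnderline(markdown) {
  return markdown.replace(/^([-=]+)[ \t]*$/gm, '\\$1')
}

console.log(escapeSetextUnderline('Some text\n-')) // 'Some text\n\-'
```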

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

CMSM

micromark is developed jointly with CMSM: Common Markup State Machine, as it’s sometimes easier to make changes in prose.

If you’re interested in micromark, also definitely check out CMSM!

ES5 Compatibility

Initial checklist

Problem

i moved from react-markdown to micromark for the reasons below:

  1. react-markdown has a dependency that is es6 which breaks my app when it is run on IE11
  2. react-markdown used to have a bug which micromark has no such issue
  3. micromark is the 'smallest'

however, since version 1.x, the build target is es2020; there are a lot of consts, lets, method shorthands, and more, which will definitely break my app...

Solution

esm is a trend, and webpack can deal with it by default; however, the es6 syntax... i may have to configure my babel, and it might cost more compile time.

so might it be possible to set the build target to es5, which is for now the most compatible output? it does not hurt esm modules.

part of my tsconfig.json

"target": "es5",
"lib": [
  "dom",
  "es2015",
  "es2017"
],
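If IE11 is a hard requirement, the usual workaround is to transpile micromark (and its util packages) in your own build rather than asking the library for an ES5 target. A sketch, assuming webpack with babel-loader:

```javascript
// babel.config.js (sketch) — compile modern syntax down to ES5.
// The target below is an example; adjust to your browser matrix.
module.exports = {
  presets: [['@babel/preset-env', {targets: {ie: '11'}}]]
}
```

The matching webpack rule must not exclude node_modules wholesale, or the micromark packages will never reach Babel.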

Alternatives

no..

How to handle “virtual spacing” in the CST?

I’m going to post a couple of problems I foresee as I’m trying to wrap my head around what micromark will be.

Take the following example:

>␉␠indented.code("in a block quote")

It’s a block quote marker, followed by a tab (tabs are forced to be treated as four spaces).
The first “virtual space” of the tab is part of the block quote marker. The second three “virtual spaces” are part of the indent of the indented code.
One extra real space, and you’ve got a code indent of four spaces, making it a proper indented code, in a block quote.

How is that represented as tokens? In a CST?
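The arithmetic behind the example can be sketched directly (tab stops of 4, columns 1-based):

```javascript
// A tab advances to the next multiple-of-4 tab stop; the number of
// "virtual spaces" it contributes depends on the column it starts at.
function virtualSpaces(column, tabSize = 4) {
  return tabSize - ((column - 1) % tabSize)
}

console.log(virtualSpaces(1)) // 4: a tab at column 1 is four spaces
console.log(virtualSpaces(2)) // 3: after `>` at column 1, a tab yields three
```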

images without alt should not generate alt attribute with empty string

Subject of the issue

see

'<p><img src="example.png" alt="" /></p>',

I don't know if this is really a bad thing, but the new behaviour of micromark is to generate an empty string for the alt attribute, whereas the old remark parser used to set the alt property of the mdast node to null.

there is a slight distinction: an image with empty alt text is considered a decorative image and should be ignored by a screen reader, while if the alt attribute is missing, it will just read the src (not a brilliant behaviour, either :-)

https://www.w3.org/WAI/tutorials/images/decorative/

In any case, the new behaviour allows the author to specify decorative images in markdown by default, which wasn't possible before.

Expected behavior

not sure. But for backward compatibility's sake: A markdown image w/o an alt text should not create an alt attribute in HTML (mdast node's property should be null)

micromark preserves control characters where commonmark does not

Initial checklist

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-aaphim?file=index.js

Steps to reproduce

import { micromark } from "micromark";
import { Parser, HtmlRenderer } from "commonmark";
import rehypeParse from "rehype-parse";
import { unified } from "unified";
import { visit } from "unist-util-visit";
import lodash from "lodash";

const reader = new Parser();
const writer = new HtmlRenderer();
function scrubber(tree) {
  visit(tree, function (node) {
    node.data = undefined;
    node.value = undefined;
    node.position = undefined;
  });

  return tree;
}

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = ``;

const micromarkHtml = micromark(content, {
  allowDangerousHtml: true,
  allowDangerousProtocol: true,
}).trim();
const commonmarkHtml = commonmark(content).trim();

const micromarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(micromarkHtml)
);
const commonmarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(commonmarkHtml)
);

console.log("micromark");
console.log(micromarkHtml);
console.log("");
console.log(JSON.stringify(micromarkHtmlAst, null, 4));
console.log("");
console.log("commonmark");
console.log(commonmark(content));
console.log("");
console.log(JSON.stringify(commonmarkHtmlAst, null, 2));
console.log(lodash.isEqual(micromarkHtmlAst, commonmarkHtmlAst));

📓 the character in content is U+000C
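For reference, U+000C is the form feed character. JavaScript's notion of whitespace includes it, while CommonMark's "blank line" definition names only spaces and tabs; that gap is where implementations can plausibly diverge:

```javascript
// JavaScript treats form feed (U+000C) as whitespace...
console.log(/\s/.test('\u000C')) // true
console.log('\u000C'.trim() === '') // true
// ...but a CommonMark blank line may contain only spaces and tabs,
// so a form-feed-only line is not blank by that definition.
```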

Expected behavior

<p></p>

with the structure

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "p",
      "properties": {},
      "children": []
    }
  ]
}

Actual behavior

micromark keeps the space

<p>
</p>

changing the structure of the document

{
    "type": "root",
    "children": [
        {
            "type": "element",
            "tagName": "p",
            "properties": {},
            "children": [
                {
                    "type": "text"
                }
            ]
        }
    ]
}

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

Make `definitions` available to extensions

Initial checklist

Problem

i'm writing an extension where I would need the definitions defined in the document.

Solution

the definitions should be available either via the context this.definitions or via getData('mediaDefinitions').
the latter is probably better.

Alternatives

I could probably overwrite all the definition related enter and exit methods, and track them as well, but this sounds like a wrong approach.

regression: link references w/o definition are ignored

Subject of the issue

With the old remark parser, link references that didn't have a corresponding definition were nonetheless detected and converted to mdast.

For example, the following:

> [!NOTE]
> This is a note. Who'd have noted?

used to generate:

{
  "type": "root",
  "children": [
    {
      "type": "blockquote",
      "children": [
        {
          "type": "paragraph",
          "children": [
            {
              "type": "linkReference",
              "identifier": "!note",
              "label": "!NOTE",
              "referenceType": "shortcut",
              "children": [
                {
                  "type": "text",
                  "value": "!NOTE"
                }
              ]
            },
            {
              "type": "text",
              "value": "\nThis is a note. Who'd have noted?"
            }
          ]
        }
      ]
    }
  ]
}

With micromark, if there is no corresponding definition, no linkReference is inserted and a plain paragraph is generated:

{
  "type": "root",
  "children": [
    {
      "type": "blockquote",
      "children": [
        {
          "type": "paragraph",
          "children": [
            {
              "type": "text",
              "value": "[!NOTE]\nThis is a note. Who’d have noted?"
            }
          ]
        }
        }
      ]
    }
  ]
}

Expected behavior

It should still generate a linkReference node in the mdast, so that the client of the mdast can decide how to handle a missing definition.

Actual behavior

The parser ignores the link reference when no matching definition exists.

uvu shouldn't be set in dependencies

Initial checklist

Affected packages and versions

micromark-core-commonmark@npm:1.0.4, micromark-extension-gfm-autolink-literal@npm:1.0.2, micromark-extension-gfm-footnote@npm:1.0.2, micromark-extension-gfm-strikethrough@npm:1.0.3, micromark-extension-gfm-table@npm:1.0.4, micromark-extension-gfm-task-list-item@npm:1.0.2

Link to runnable example

No response

Steps to reproduce

https://unpkg.com/[email protected]/package.json and you can see that uvu is in the dependencies and not devDeps

Expected behavior

uvu should be listed in the dev deps so that installing any of the packages defined here (like micromark-core-commonmark) does not also install uvu (it's a test runner, not used in the runtime code)

Actual behavior

uvu is listed in the deps, so it gets installed
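The fix amounts to moving uvu in each affected package.json (sketch; the version range is illustrative):

```json
{
  "dependencies": {},
  "devDependencies": {
    "uvu": "^0.5.0"
  }
}
```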

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Reduce coupling by using anylogger

Subject of the feature

Reduce coupling and the footprint of the minified file by using anylogger instead of debug

Problem

Currently, this library has a dependency on debug. Though that is an excellent library, this dependency has 2 major drawbacks:

  • This library is now forcing debug onto all developers that use this library (high coupling)
  • debug is 3.1kB minified and gzipped, directly adding 3.1kB to the minimum footprint of this library

Alternatives

Please have a look at anylogger. It's a logging facade specifically designed for libraries. It achieves these goals:

  • Decouple the library from the underlying logging framework
  • Reduce the minimal bundle footprint. Anylogger is only 370 bytes.

The decoupling is achieved by only including the minimal facade to allow client code to do logging and using adapters to back that facade with an actual logging framework. The minimal footprint follows naturally from this decoupling as the bulk of the code lives in the adapter.
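The facade-plus-adapter idea can be sketched generically in a few lines (an illustration of the pattern, not anylogger's actual API):

```javascript
// The library ships only this indirection; by default logging is a no-op,
// so consumers who never configure a logger pay essentially nothing.
let adapter = () => {}

// An "adapter" package (or the app) wires in a real logger at startup.
function setLogAdapter(fn) {
  adapter = fn
}

// Library code logs through the facade, unaware of the backing framework.
function log(...args) {
  adapter(...args)
}
```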

There are already adapters for some popular logging frameworks, and more adapters can easily be created.

If this library were to switch to anylogger, you could still install debug as a dev-dependency and then require('anylogger-debug') in your tests to have your tests work exactly as they always did, with debug as the logging framework, while still decoupling it from debug for all clients.

Disclaimer: anylogger was written by me, so I'm self-advertising here. However, I honestly believe it is the best solution in this situation. anylogger was written specifically to decrease coupling between libraries and logging frameworks: in any large application, devs typically end up with multiple loggers because some libraries depend on debug, others on loglevel, yet others on log4js, and so on. That hurts bundle size badly, as multiple kilobytes of logging libraries get added to it.

Implementation of autolink and literalAutolink (micromark-extension-gfm-autolink-literal) are inconsistent when handling "@."

Initial checklist

Affected packages and versions

micromark 4.0.0, micromark-extension-gfm-autolink-literal 2.0.0

Link to runnable example

No response

Steps to reproduce

Expected behavior

Consistent treatment of [email protected] by autolink and literalAutolink.

Actual behavior

[email protected] and <[email protected]> are both emitted as literalAutolink. Expected behavior is observed for <[email protected]> which is emitted as autolink.

This is significant for a linter which can be confused by the current behavior into adding infinite <> wrappers attempting to turn [email protected] from literalAutolink into autolink: DavidAnson/markdownlint#1140

I propose that <[email protected]> should be treated as autolink, which is seemingly possible if emailAtSignOrDot behaved differently:

function emailAtSignOrDot(code) {
  return asciiAlphanumeric(code) ? emailLabel(code) : nok(code)
}
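For context on the helpers assumed in that snippet: micromark state functions branch on character codes, so a check like asciiAlphanumeric is roughly the following (a sketch, not the actual micromark-util-character source):

```javascript
// Rough equivalent of an ASCII alphanumeric character-code check:
function asciiAlphanumeric(code) {
  return (
    (code >= 48 && code <= 57) || // '0'..'9'
    (code >= 65 && code <= 90) || // 'A'..'Z'
    (code >= 97 && code <= 122) // 'a'..'z'
  )
}
```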

The micromark tokens (when using micromark-extension-gfm-autolink-literal) for parsing the above Markdown are:

content [email protected]
  paragraph [email protected]
    literalAutolink [email protected]
      literalAutolinkEmail [email protected]
lineEnding \n
lineEndingBlank \n
content <[email protected]>
  paragraph <[email protected]>
    data <
    literalAutolink [email protected]
      literalAutolinkEmail [email protected]
    data >
lineEnding \n
lineEndingBlank \n
content <[email protected]>
  paragraph <[email protected]>
    autolink <[email protected]>
      autolinkMarker <
      autolinkEmail [email protected]
      autolinkMarker >
lineEnding \n

Runtime

Node v16

Package manager

npm v6

OS

macOS

Build and bundle tools

Webpack

Improved Concrete Syntax Trees

Subject of the feature

I'm in the process of migrating my markdown editor to use remark/micromark instead of markdown-it. One of my goals is not to change the formatting style of my users' input files, at least if I can help it.

At the moment, the micromark tokenizers seem not to record information that might help reconstruct the original input Markdown in cases where Markdown has redundancies:

  • _ vs * for emphasis
  • ATX headings vs setext headings
  • * vs - vs + for unordered lists
  • * vs - vs _ for thematic breaks (hrules), as well as the length of the marker run used to indicate the break

I'm not terribly interested in preserving superfluous whitespace the user might have, but it would be nice to at least preserve their preferences for emphasis / heading / list syntax. For instance, I personally like to use * for regular lists and +/- for pro/con lists, and at the moment there's no way to preserve that information.

  • Are there any plans for improved concrete syntax tree support in the future?
  • If not would PRs be welcome to record syntactic information on syntax tree nodes, as well as in the corresponding serializers?
  • Otherwise, any advice for implementing this feature as a set of plugins would be appreciated!

Thanks!
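As a starting point for the plugin route: marker preferences can already be recovered from the source text by position. A crude scan for the dominant bullet marker (a hypothetical helper, not part of micromark):

```javascript
// Count which unordered-list bullet marker a document favors.
function dominantBullet(markdown) {
  const counts = {'*': 0, '-': 0, '+': 0}
  for (const line of markdown.split('\n')) {
    // Up to 3 spaces of indent, a bullet marker, then whitespace (crude).
    const match = /^[ \t]{0,3}([*+-])[ \t]/.exec(line)
    if (match) counts[match[1]] += 1
  }
  return Object.keys(counts).reduce((a, b) => (counts[b] > counts[a] ? b : a))
}
```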

Improving performance by reducing useless parsing

Subject of the feature

There are two main places where parsing is done that is (potentially) useless.

  • content: at the end of a line in content, we parse ahead to figure out whether the paragraph should be closed.
    This work is duplicated, because when the paragraph is closed, we actually do the parsing.
    This was done in remark too, and similar to there, it is somewhat optimized.
    This point in parsing markdown is rather complex because of the interplay between definitions, setext headings, paragraphs, and lazy lines.
    Removing lookaheadConstruct improves performance by 13%. The alternative should be possible and hopefully is not too big.
  • document: to figure out whether containers continue, close flow, start new flow, or have lazy lines, another throwaway inspection is done.
    Removing document completely improves performance by 28% (although lists are complex, so some time spent there is unavoidable).
    (SOLVED IN 939e90d)

3.0.8 seems to introduce a module level dependency on document

Initial checklist

Affected packages and versions

3.0.8

Link to runnable example

No response

Steps to reproduce

I'm using Micromark in an astro project, and ever since installing micromark 3.0.8 I get this error:

[15:04:05] [snowpack] + [email protected]
[build] Unable to render src/pages/renew/checkout.astro

ReferenceError: document is not defined

While this might be snowpack being finicky, [email protected] works totally fine! So just wondering if anything has been introduced that could cause it.

Expected behavior

My project builds.

Actual behavior

[15:04:05] [snowpack] + [email protected]
[build] Unable to render src/pages/renew/checkout.astro

ReferenceError: document is not defined
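A quick probe for this failure mode (plain Node sketch): any module-level access to document throws in Node, while a guarded feature test does not:

```javascript
// In Node (no DOM), the global `document` is undefined; a typeof check is
// safe even when the identifier is not declared at all.
const hasDOM = typeof document !== 'undefined'
console.log(hasDOM) // false in Node
```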

Runtime

Node v16

Package manager

yarn v1

OS

macOS

Build and bundle tools

Snowpack

What would the usage, api surface and extension points look like?

To get more clarity on where this fits in in the @unifiedjs ecosystem, could the assigned folks add some example usages of this library in this issue please?

Examples of how this would be used by @remarkjs and/or @unifiedjs would be helpful as they would clear up the following questions:

  • what would the usage of this library look like?
  • who are the potential consumers of it?
  • would this cause a rewrite or changes in how remark-parse is written?
  • what should the api surface look like?
  • what are potential extension points?
  • does this impact processor.use from the @unifiedjs world?
  • will this stream tokens or eat a file and spit out all the tokens at once? (assuming this is a lexer)

..and any other you folks can come up with.

The idea behind this is to discuss and land on a common understanding of this project's technical goals (e.g., is this a lexer? a parser? I've seen both words around here leading to some confusion), nail the api surface and identify potential extension points. This should help speed up dev, lead to some early "documentation" and prevent misalignment on the goals.

Thanks!

Lack of document and types for turning off constructs

Subject of the issue

micromark doesn't accept {disable: {null: []}} as an extension when using TypeScript.

Your environment

Steps to reproduce

please check https://github.com/issueset/micromark-disable-typescript-issue

Expected behavior

No TypeScript error. It would also be better to have some documentation in the README.md for this feature.

Actual behavior

 Type '{ disable: { null: string[]; }; }' is not assignable to type 'SyntaxExtension[]'.
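For reference, the runtime shape that should type-check is the following (sketch; 'codeIndented' is one example construct name):

```javascript
// The extension object micromark accepts for turning off constructs:
// `disable.null` lists construct names to switch off everywhere.
const disableIndentedCode = {disable: {null: ['codeIndented']}}
// Usage (assumed): micromark(value, {extensions: [disableIndentedCode]})
```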

Including license in NPM packages

Initial checklist

Problem

While scanning my dependencies I found that micromark NPM packages don't include their actual license file. I believe it would make sense for the micromark NPM packages to include the license since the MIT license requires that it be included in all copies or substantial portions of the Software.

Solution

Since there are 22 npm packages in the repo and they would presumably all use the same license from the root repo directory, I propose adding a release script that copies the license file from the root directory into each of the package directories, like this one from vue-router. I think it would then make sense to let git ignore license files in the package directories (while still letting npm include them).

Alternatives

It could also be solved by copy-pasting the license into each of the package directories, but that may not be preferable because it duplicates content in the repo.

& in image URL is encoded to an HTML entity

Initial checklist

Affected packages and versions

micromark 3.1.0

Link to runnable example

No response

Steps to reproduce

import {micromark} from 'micromark'
import {gfm, gfmHtml} from 'micromark-extension-gfm'

const content = '![](/imgs/i1.png?_a=center&_w=300)'

const html = micromark(content, {
    extensions: [gfm()],
    htmlExtensions: [gfmHtml()],
});

console.log(html)

Expected behavior

<p><img src="/imgs/i1.png?_a=center&_w=300" alt="" /></p>

Actual behavior

<p><img src="/imgs/i1.png?_a=center&amp;_w=300" alt="" /></p>
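For context, HTML serializers commonly escape & inside attribute values so the markup stays unambiguous; a minimal attribute encoder looks like this (a sketch, not micromark's actual code):

```javascript
// Escape the characters that are unsafe inside a double-quoted attribute.
function encodeAttribute(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;')
}
```

Note that browsers decode &amp; back to & when reading the attribute, so the two outputs above refer to the same URL.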

Runtime

Node v16

Package manager

pnpm

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)
