Say we take: Indented code: it ends when there is a bogus line

This sounds related to <a class="issue-link js-issue-link" data-error-text="Failed to

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Trade-off between extensibility and performance about micromark HOT 6 CLOSED

micromark commented on May 14, 2024

Trade-off between extensibility and performance

from micromark.

Comments (6)

johno commented on May 14, 2024 2

Micromark is intended as the base for remark and by extension MDX correct? Editor projects like https://github.com/blocks/blocks will be continuously changing the document.

Blocks, in particular, has an intermediary schema (via Slate) for editing, so the parsing really only occurs when deserializing/serializing the document. Serialization would really only occur once and deserialization whenever one wants to output the MD which can be trivially debounced.

You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).

Markdown doesn’t depend on valid HTML, rather, a simplified black box.

Right, not on HTML, but MDX does depend on valid JSX.

I'm not sure I 100% understand what you mean by a "simplified black box" @wooorm. As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?

As @ChristianMurphy states, MDX depends on valid JSX, which essentially will hard fail when not properly written (because babel or JS evaluation will go 💥). It doesn't have the built in "recovery" that browsers have for HTML documents. For MDX, this will require knowledge of the JSX language/grammar, which would be pretty complex to handle.

Would the approach be that MDX extends/replaces micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?

Personally I'd lean towards being more robust/correct than being the fastest thing ever for the first release. Once it's "correct" we can profile bottlenecks and optimize. Not to mention some of these considerations are pretty edge-casey. Also, a lot of end users of unified (like Gatsby) can, and do, implement their own layers of caching.

from micromark.

ChristianMurphy commented on May 14, 2024 1

This sounds related to #8

Do we backtrack to before the blank lines, and check all the tokenisers again

If possible let's avoid backtracking entirely. http://marijnhaverbeke.nl/blog/lezer.html

is there a knowledge of what other tokenisers are enabled and can we “eat” every blank line directly

This may be more complex, but will likely be a better approach for performance. 👍

from micromark.

ChristianMurphy commented on May 14, 2024 1

Am I correct in assuming that with lezer / tree-parser you don‘t backtrack, but instead parse multiple syntax trees?

Sort of, GLR parsers fork whenever an ambiguity is encountered.

For example, parsing

## Hello World ##

When it is parsed

## Hello World ##
--*--------------*--
  |              |
  *--------------*
  ^              ^
  |              ambiguity resolved, when newline is reached
  | 
  ambiguity start, there may or may not be an ATX closing to header,
  parser forks and tries parsing both ways

Or how would it, more practically, work?

GLR parsers work, by implementing the GLR algorithm.
These papers detail the algorithm:

Original Paper https://www.ijcai.org/Proceedings/85-2/Papers/014.pdf
Royal Holloway, University of London follow up paper https://www.cs.rhul.ac.uk/research/languages/publications/tomita_style_1.ps

The document is constantly changing.

typically not true for micromark

The input is often not in a finished, syntactically correct form. But you still have to make some sense of it—nobody wants an editor where most features stop working when you have a syntax error in your document.

typically not true for micromark

Micromark is intended as the base for remark and by extension MDX correct?
Editor projects like https://github.com/blocks/blocks will be continuously changing the document.

You can't do anything expensive. If the parsing works takes too long, it'll introduce latency that makes editing feel slugglish and unresponsive.
micromark needs to be complete, perfect, can take a while to get there (but preferably fast/small)

In what sense "complete" and "perfect"?

The way I look at it, Micromark is a lexer/tokenizer.
Looking at it from that sense, it is by definition incomplete, as an intermediate representation for remark to build on.

You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).

Markdown doesn’t depend on valid HTML, rather, a simplified black box.

Right, not on HTML, but MDX does depend on valid JSX.

from micromark.

wooorm commented on May 14, 2024

Am I correct in assuming that with lezer / tree-parser you don‘t backtrack, but instead parse multiple syntax trees? Or how would it, more practically, work?

I think those two are very interesting, but they have very different roles/goals:

The document is constantly changing.

typically not true for micromark
You can't do anything expensive. If the parsing works takes too long, it'll introduce latency that makes editing feel slugglish and unresponsive.

micromark needs to be complete, perfect, can take a while to get there (but preferably fast/small)
The input is often not in a finished, syntactically correct form. But you still have to make some
sense of it—nobody wants an editor where most features stop working when you have a syntax
error in your document.

typically not true for micromark
You often want to be able to mix several languages/grammars in a single document (think
HTML with JavaScript and CSS embedded in it).

Markdown doesn’t depend on valid HTML, rather, a simplified black box.

from micromark.

wooorm commented on May 14, 2024

@ChristianMurphy Thanks for explaining, even though I write a lot of parsing things I really don’t know these basics!

In what sense "complete" and "perfect"?

[What] way I look at it, Micromark is a lexer/tokenizer. Looking at it from that sense, it is by definition incomplete, as an intermediate representation for remark to build on.
— @ChristianMurphy

That’s true. One example I can think of that CM requires entities to be valid: &foo; is literal, & is &. A simpler tokeniser could treat any alphanumerical value as an entity to be simpler, whereas micromark should probably make the distinction between valid and invalid named character references. (Although, maybe we can have a probablyNamedCharacterReference token and defer this to remark/etc 🤷‍♂️)

As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?
— @johno

Typically, languages have other languages inside them. The two examples here are HTML in Markdown and JSX in MDX. The difference is that Markdown doesn’t parse HTML, it parses some XML-like structures. Whereas MDX indeed seems to parse (in the future?) only valid JSX.

This may be a bit of a problem: HTML/MD don’t have “invalid” content, that throws a parse error and crashes. JSX/MDX do have that.

Would the approach be that MDX extends/replaces micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?
— @johno

Correct! I think there’s two crucial examples of extensions for micromark: 1: GFM, 2: MDX. The first is probably easier and also very ubiquitous and can function as a proof of concept, the second is necessary but can take a bit longer as its probably a bit hard.
The gist of it is that we need to figure out what the “hooks” are for extensions to plug into. Can they add/remove new inlines/blocks? Can they inject into the states of a context?

from micromark.

wooorm commented on May 14, 2024

CMSM does not define backtracking. This therefore removes the possibility of extensions in favour of performance.

I see two possibilities of extensions: a) define useful extensions in CMSM enable them with flags, or b) allow some form of hooks.

I’d like to table this for now: first priority is to get micromark working (keeping extensions in mind), actually supporting extensions comes after that.

from micromark.

Trade-off between extensibility and performance about micromark HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent