
Comments (6)

johno commented on May 14, 2024

Micromark is intended as the base for remark and by extension MDX, correct? Editor projects like https://github.com/blocks/blocks will be continuously changing the document.

Blocks, in particular, has an intermediary schema (via Slate) for editing, so the parsing really only occurs when deserializing/serializing the document. Deserialization would really only occur once and serialization whenever one wants to output the MD, which can be trivially debounced.
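For illustration, a minimal sketch of what "trivially debounced" serialization could look like; `serializeToMarkdown` and the editor value shape here are hypothetical stand-ins, not actual Blocks/Slate APIs.

```js
// Minimal sketch of debounced serialization, assuming a Slate-like editor
// value. `serializeToMarkdown` is a hypothetical stand-in, not a real
// Blocks/Slate API.
const serializeToMarkdown = (value) =>
  value.blocks.map((block) => block.text).join('\n\n')

function debounce(fn, wait) {
  let timer
  return (...args) => {
    clearTimeout(timer)
    timer = setTimeout(() => fn(...args), wait)
  }
}

// Called on every editor change; the (potentially expensive) serialization
// only runs after 300ms without further edits.
const emitMarkdown = debounce((value) => {
  console.log(serializeToMarkdown(value))
}, 300)

emitMarkdown({blocks: [{text: '# Hello'}, {text: 'World'}]})
```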

You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).

Markdown doesn’t depend on valid HTML; rather, it treats HTML as a simplified black box.

Right, not on HTML, but MDX does depend on valid JSX.

I'm not sure I 100% understand what you mean by a "simplified black box" @wooorm. As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?

As @ChristianMurphy states, MDX depends on valid JSX, which will essentially hard fail when not properly written (because Babel or JS evaluation will go 💥). It doesn't have the built-in "recovery" that browsers have for HTML documents. For MDX, this will require knowledge of the JSX language/grammar, which would be pretty complex to handle.

Would the approach be that MDX extends micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?


Personally I'd lean towards being more robust/correct than being the fastest thing ever for the first release. Once it's "correct" we can profile bottlenecks and optimize. Not to mention some of these considerations are pretty edge-casey. Also, a lot of end users of unified (like Gatsby) can, and do, implement their own layers of caching.
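To make the caching point concrete, here is a sketch of memoizing parse results by content hash; this is not how Gatsby actually implements its caching, and `parseMarkdown` is a stand-in for whatever parser a consumer uses.

```js
import crypto from 'node:crypto'

// Sketch of a content-addressed cache around an expensive parse step.
const cache = new Map()

// Hypothetical stand-in for a real Markdown parse.
function parseMarkdown(doc) {
  return {type: 'root', length: doc.length}
}

function cachedParse(doc) {
  const key = crypto.createHash('sha256').update(doc).digest('hex')
  // Only parse documents we haven't seen before; identical content is reused.
  if (!cache.has(key)) cache.set(key, parseMarkdown(doc))
  return cache.get(key)
}

cachedParse('# Hello') // parses
cachedParse('# Hello') // served from the cache
```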


ChristianMurphy commented on May 14, 2024

This sounds related to #8

Do we backtrack to before the blank lines, and check all the tokenisers again

If possible, let's avoid backtracking entirely. http://marijnhaverbeke.nl/blog/lezer.html

is there a knowledge of what other tokenisers are enabled and can we “eat” every blank line directly

This may be more complex, but will likely be a better approach for performance. 👍
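A rough sketch of that second option, with illustrative names only (not micromark's actual internals): a helper that eats every blank line eagerly, so nothing downstream ever needs to backtrack.

```js
// Illustrative sketch only, not micromark's real tokenizer API: instead of
// backtracking past blank lines, a shared step consumes every blank line
// eagerly before handing control to the next tokenizer.
function eatBlankLines(input, position) {
  let index = position
  while (index < input.length) {
    // Skip spaces/tabs on an otherwise blank line.
    let lineEnd = index
    while (input[lineEnd] === ' ' || input[lineEnd] === '\t') lineEnd++
    if (input[lineEnd] !== '\n') break // line has content: stop eating
    index = lineEnd + 1 // consume the newline and continue with the next line
  }
  return index // new position; the caller never has to rewind
}

console.log(eatBlankLines('\n  \n\t\n# Heading', 0)) // → 6 (start of '# Heading')
```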


ChristianMurphy commented on May 14, 2024

Am I correct in assuming that with lezer / tree-parser you don’t backtrack, but instead parse multiple syntax trees?

Sort of: GLR parsers fork whenever an ambiguity is encountered.

For example, parsing

## Hello World ##

When it is parsed

## Hello World ##
--*--------------*--
  |              |
  *--------------*
  ^              ^
  |              ambiguity resolved when a newline is reached
  |
  ambiguity start: there may or may not be an ATX closing sequence for the header,
  so the parser forks and tries parsing both ways
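A rough sketch of that forking idea in plain JavaScript. This is a toy illustration, not lezer's or micromark's actual implementation: both interpretations of the trailing `#` run are kept until the end of the line resolves the ambiguity.

```js
// Toy sketch of "fork on ambiguity": when a trailing `#` run is seen,
// keep both interpretations (closing sequence vs. literal text) and
// let the end of the line decide. Not a real GLR implementation.
function parseAtxHeading(line) {
  const match = /^(#{1,6})\s+(.*)$/.exec(line)
  if (!match) return null
  const rest = match[2]

  // Fork: interpretation A treats a trailing `#` run as a closing sequence,
  // interpretation B treats it as literal heading text.
  const closed = /^(.*?)\s+#+\s*$/.exec(rest)
  const interpretations = [
    closed && {text: closed[1], closingSequence: true},
    {text: rest.trimEnd(), closingSequence: false}
  ].filter(Boolean)

  // At the end of the line ("newline reached") the ambiguity is resolved:
  // if a valid closing sequence exists, that interpretation wins.
  return interpretations[0]
}

console.log(parseAtxHeading('## Hello World ##'))
// → { text: 'Hello World', closingSequence: true }
```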

Or how would it, more practically, work?

GLR parsers work by implementing the GLR algorithm.
These papers detail the algorithm:


The document is constantly changing.

typically not true for micromark

The input is often not in a finished, syntactically correct form. But you still have to make some sense of it—nobody wants an editor where most features stop working when you have a syntax error in your document.

typically not true for micromark

Micromark is intended as the base for remark and by extension MDX, correct?
Editor projects like https://github.com/blocks/blocks will be continuously changing the document.

You can't do anything expensive. If the parsing work takes too long, it'll introduce latency that makes editing feel sluggish and unresponsive.
micromark needs to be complete and perfect; it can take a while to get there (but preferably fast/small)

In what sense "complete" and "perfect"?

The way I look at it, Micromark is a lexer/tokenizer.
Seen that way, it is by definition incomplete: an intermediate representation for remark to build on.

You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).

Markdown doesn’t depend on valid HTML; rather, it treats HTML as a simplified black box.

Right, not on HTML, but MDX does depend on valid JSX.


wooorm commented on May 14, 2024

Am I correct in assuming that with lezer / tree-parser you don’t backtrack, but instead parse multiple syntax trees? Or how would it, more practically, work?

I think those two are very interesting, but they have very different roles/goals:

  • The document is constantly changing.

    typically not true for micromark

  • You can't do anything expensive. If the parsing work takes too long, it'll introduce latency that makes editing feel sluggish and unresponsive.

    micromark needs to be complete and perfect; it can take a while to get there (but preferably fast/small)

  • The input is often not in a finished, syntactically correct form. But you still have to make some
    sense of it—nobody wants an editor where most features stop working when you have a syntax
    error in your document.

    typically not true for micromark

  • You often want to be able to mix several languages/grammars in a single document (think
    HTML with JavaScript and CSS embedded in it).

    Markdown doesn’t depend on valid HTML; rather, it treats HTML as a simplified black box.


wooorm commented on May 14, 2024

@ChristianMurphy Thanks for explaining; even though I write a lot of parsing things, I really don’t know these basics!

In what sense "complete" and "perfect"?

The way I look at it, Micromark is a lexer/tokenizer. Seen that way, it is by definition incomplete: an intermediate representation for remark to build on.
@ChristianMurphy

That’s true. One example I can think of is that CM requires entities to be valid: &foo; is literal, while &amp; is &. A simpler tokeniser could treat any alphanumerical value as an entity, whereas micromark should probably make the distinction between valid and invalid named character references. (Although, maybe we can have a probablyNamedCharacterReference token and defer this to remark/etc 🤷‍♂️)
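A sketch of that distinction; the set of known names below is tiny and purely illustrative (a real tokenizer would use the full HTML named-reference list, for example via the character-entities package):

```js
// Illustration only: distinguishing valid from invalid named character
// references. A real tokenizer would use the full HTML entity list,
// not this tiny set.
const knownNames = new Set(['amp', 'lt', 'gt', 'quot', 'copy'])

function classifyCharacterReference(value) {
  const match = /^&([a-zA-Z][a-zA-Z0-9]*);$/.exec(value)
  if (!match) return 'notAReference'
  // A lenient tokenizer could stop here and emit a
  // `probablyNamedCharacterReference` token, deferring validation.
  return knownNames.has(match[1]) ? 'namedCharacterReference' : 'literalText'
}

console.log(classifyCharacterReference('&amp;')) // 'namedCharacterReference'
console.log(classifyCharacterReference('&foo;')) // 'literalText' (CommonMark keeps it literal)
```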


As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?
@johno

Typically, languages have other languages inside them. The two examples here are HTML in Markdown and JSX in MDX. The difference is that Markdown doesn’t parse HTML, it parses some XML-like structures, whereas MDX indeed seems to parse (in the future?) only valid JSX.

This may be a bit of a problem: HTML/MD don’t have “invalid” content that throws a parse error and crashes; JSX/MDX do.
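For example, feeding invalid JSX to a JS parser throws instead of recovering. The snippet below assumes @babel/parser as one JSX-capable parser; MDX could use another.

```js
// Invalid JSX makes the JS parser throw, unlike HTML, where browsers
// (and Markdown parsers) happily keep going.
import {parse} from '@babel/parser'

const good = '<Box padding={2}>hi</Box>'
const bad = '<Box padding={2}>hi</Box' // missing the final `>`

parse(good, {plugins: ['jsx']}) // fine

try {
  parse(bad, {plugins: ['jsx']})
} catch (error) {
  console.log(error.message) // e.g. an "unterminated JSX contents" syntax error
}
```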

Would the approach be that MDX extends micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?
@johno

Correct! I think there are two crucial examples of extensions for micromark: 1: GFM, 2: MDX. The first is probably easier and also very ubiquitous, and can function as a proof of concept; the second is necessary but can take a bit longer, as it’s probably a bit hard.
The gist of it is that we need to figure out what the “hooks” are for extensions to plug into. Can they add/remove new inlines/blocks? Can they inject into the states of a context?
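Purely as a thought experiment, and not a proposal for micromark's actual extension API, such hooks could look something like this:

```js
// Thought experiment only: what an extension's surface could look like.
// None of these names are micromark's real API.
const gfmStrikethroughExtension = {
  // New inline constructs, keyed by the character that can start them.
  inlines: {
    '~': {
      name: 'strikethrough',
      tokenize(effects, ok, nok) {
        // A real tokenizer would consume `~~`, content, `~~` here,
        // emitting tokens via `effects`; this sketch just succeeds.
        return ok
      }
    }
  },
  // New block (flow) constructs would go here.
  blocks: {},
  // Hooks to inject into existing states of a context.
  hooks: {}
}

// A hypothetical way of combining extensions when creating the parser.
const options = {extensions: [gfmStrikethroughExtension]}
```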


wooorm commented on May 14, 2024

CMSM does not define backtracking. This therefore removes the possibility of extensions in favour of performance.

I see two possibilities for extensions: a) define useful extensions in CMSM and enable them with flags, or b) allow some form of hooks.

I’d like to table this for now: the first priority is to get micromark working (keeping extensions in mind); actually supporting extensions comes after that.

