Comments (6)
Micromark is intended as the base for remark and, by extension, MDX, correct? Editor projects like https://github.com/blocks/blocks will be continuously changing the document.
Blocks, in particular, has an intermediary schema (via Slate) for editing, so the parsing really only occurs when deserializing/serializing the document. Deserialization would really only occur once, and serialization whenever one wants to output the MD, which can be trivially debounced.
You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).
Markdown doesn’t depend on valid HTML, rather, a simplified black box.
Right, not on HTML, but MDX does depend on valid JSX.
I'm not sure I 100% understand what you mean by a "simplified black box" @wooorm. As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?
As @ChristianMurphy states, MDX depends on valid JSX, which essentially will hard fail when not properly written (because Babel or JS evaluation will go 💥). It doesn't have the built-in "recovery" that browsers have for HTML documents. For MDX, this will require knowledge of the JSX language/grammar, which would be pretty complex to handle.
Would the approach be that MDX extends/replaces micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?
Personally I'd lean towards being more robust/correct than being the fastest thing ever for the first release. Once it's "correct" we can profile bottlenecks and optimize. Not to mention some of these considerations are pretty edge-casey. Also, a lot of end users of unified (like Gatsby) can, and do, implement their own layers of caching.
from micromark.
This sounds related to #8
Do we backtrack to before the blank lines, and check all the tokenisers again?
If possible let's avoid backtracking entirely. http://marijnhaverbeke.nl/blog/lezer.html
Is there knowledge of what other tokenisers are enabled, and can we “eat” every blank line directly?
This may be more complex, but will likely be a better approach for performance. 👍
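To make the "eat every blank line directly" idea concrete, here is a minimal TypeScript sketch of a single forward pass that folds a run of blank lines into one token, so nothing has to be rewound. The names `LineToken` and `tokenizeLines` are hypothetical and only for illustration; this is not micromark's API.

```typescript
// Hypothetical token shape for illustration; not micromark's actual tokens.
type LineToken =
  | {type: 'blankLines'; count: number}
  | {type: 'content'; value: string}

// Single forward pass: each line is inspected exactly once, so the scanner
// never has to go back to before a run of blank lines.
function tokenizeLines(input: string): LineToken[] {
  const tokens: LineToken[] = []
  for (const line of input.split('\n')) {
    // A blank line contains only spaces or tabs (or nothing at all).
    const blank = /^[ \t]*$/.test(line)
    const previous = tokens[tokens.length - 1]
    if (blank && previous !== undefined && previous.type === 'blankLines') {
      previous.count++ // extend the current run of blank lines
    } else if (blank) {
      tokens.push({type: 'blankLines', count: 1})
    } else {
      tokens.push({type: 'content', value: line})
    }
  }
  return tokens
}
```

Whether later tokenisers can rely on such a pre-grouped token depends on which constructs are enabled, which is the open question raised above.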
from micromark.
Am I correct in assuming that with lezer / tree-parser you don’t backtrack, but instead parse multiple syntax trees?
Sort of, GLR parsers fork whenever an ambiguity is encountered.
For example, parsing
    ## Hello World ##
When it is parsed
    ## Hello World ##
    --*--------------*--
      |              |
      *--------------*
      ^              ^
      |              ambiguity resolved, when newline is reached
      |
      ambiguity start, there may or may not be an ATX closing to header,
      parser forks and tries parsing both ways
Or how would it, more practically, work?
GLR parsers work by implementing the GLR algorithm.
These papers detail the algorithm:
- Original Paper https://www.ijcai.org/Proceedings/85-2/Papers/014.pdf
- Royal Holloway, University of London follow up paper https://www.cs.rhul.ac.uk/research/languages/publications/tomita_style_1.ps
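The forking described for `## Hello World ##` can be sketched in a few lines of TypeScript. This is not a GLR parser and not how lezer or micromark work; it only illustrates keeping both readings alive ("the `#` run is content" versus "the `#` run is a closing sequence") until the end of the line decides. The names `Branch` and `parseAtxHeading` are hypothetical, and only spaces (not tabs) are handled to keep it short.

```typescript
// Each branch is one possible reading of the text seen so far.
type Branch = {state: 'content' | 'closing' | 'trailing'; text: string}

function parseAtxHeading(
  line: string
): {depth: number; content: string} | undefined {
  const match = /^(#{1,6}) +(.*)$/.exec(line)
  if (!match) return undefined
  const rest = match[2]
  let branches: Branch[] = [{state: 'content', text: ''}]

  for (let index = 0; index < rest.length; index++) {
    const char = rest[index]
    const next: Branch[] = []
    for (const branch of branches) {
      if (branch.state === 'content') {
        // Reading 1: the character is ordinary heading content.
        next.push({state: 'content', text: branch.text + char})
        // Reading 2 (fork): a `#` at the start or after a space may open
        // a closing sequence.
        const atBoundary = branch.text === '' || branch.text.slice(-1) === ' '
        if (char === '#' && atBoundary) {
          next.push({state: 'closing', text: branch.text.trim()})
        }
      } else if (char === '#' && branch.state === 'closing') {
        next.push(branch) // still inside the run of closing `#`s
      } else if (char === ' ') {
        next.push({state: 'trailing', text: branch.text})
      }
      // Any other character ends a closing/trailing branch: it is dropped.
    }
    branches = next
  }

  // End of line: the ambiguity is resolved. Prefer a surviving closing
  // reading; otherwise fall back to the content reading (always first).
  const closed = branches.filter((branch) => branch.state !== 'content')[0]
  const chosen = closed ?? branches[0]
  return {depth: match[1].length, content: chosen.text.trim()}
}
```

With this sketch, `parseAtxHeading('## Hello World ##')` yields depth 2 with content `Hello World`, while `parseAtxHeading('## Hello World #x')` keeps `Hello World #x` as content because the closing branch dies at `x`.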
The document is constantly changing.
typically not true for micromark
The input is often not in a finished, syntactically correct form. But you still have to make some sense of it—nobody wants an editor where most features stop working when you have a syntax error in your document.
typically not true for micromark
Micromark is intended as the base for remark and, by extension, MDX, correct?
Editor projects like https://github.com/blocks/blocks will be continuously changing the document.
You can't do anything expensive. If the parsing work takes too long, it'll introduce latency that makes editing feel sluggish and unresponsive.
micromark needs to be complete and perfect, and can take a while to get there (but should preferably be fast and small)
In what sense "complete" and "perfect"?
The way I look at it, Micromark is a lexer/tokenizer.
Looking at it from that sense, it is by definition incomplete, as an intermediate representation for remark to build on.
You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).
Markdown doesn’t depend on valid HTML, rather, a simplified black box.
Right, not on HTML, but MDX does depend on valid JSX.
from micromark.
Am I correct in assuming that with lezer / tree-parser you don’t backtrack, but instead parse multiple syntax trees? Or how would it, more practically, work?
I think those two are very interesting, but they have very different roles/goals:
- The document is constantly changing.
  typically not true for micromark
- You can't do anything expensive. If the parsing work takes too long, it'll introduce latency that makes editing feel sluggish and unresponsive.
  micromark needs to be complete and perfect, and can take a while to get there (but should preferably be fast and small)
- The input is often not in a finished, syntactically correct form. But you still have to make some sense of it; nobody wants an editor where most features stop working when you have a syntax error in your document.
  typically not true for micromark
- You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).
  Markdown doesn’t depend on valid HTML, rather, a simplified black box.
from micromark.
@ChristianMurphy Thanks for explaining, even though I write a lot of parsing things I really don’t know these basics!
In what sense "complete" and "perfect"?
The way I look at it, Micromark is a lexer/tokenizer. Looking at it from that sense, it is by definition incomplete, as an intermediate representation for remark to build on.
— @ChristianMurphy
That’s true. One example I can think of is that CM requires entities to be valid: &foo; is literal, &amp; is &. A simpler tokeniser could treat any alphanumerical value as an entity, whereas micromark should probably make the distinction between valid and invalid named character references. (Although, maybe we can have a probablyNamedCharacterReference token and defer this to remark/etc 🤷♂️)
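The valid-versus-invalid distinction can be illustrated with a small TypeScript sketch. The table below is a tiny, hand-picked subset used only for the example (the real list of named character references is far longer), and `ReferenceToken` and `tokenizeReferences` are hypothetical names, not micromark's API.

```typescript
// Illustrative subset only; the full list of named references is much longer.
const namedReferences: Record<string, string> = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  copy: '©'
}

type ReferenceToken =
  | {type: 'characterReference'; name: string; value: string}
  | {type: 'text'; value: string}

// Split input into text and *valid* named character references.
// Something that merely looks like a reference (`&foo;`) stays literal text.
function tokenizeReferences(input: string): ReferenceToken[] {
  const tokens: ReferenceToken[] = []
  const pattern = /&([A-Za-z][A-Za-z0-9]*);/g
  let last = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(input)) !== null) {
    const name = match[1]
    // Unknown name: leave it in place as part of the surrounding text.
    if (!Object.prototype.hasOwnProperty.call(namedReferences, name)) continue
    if (match.index > last) {
      tokens.push({type: 'text', value: input.slice(last, match.index)})
    }
    tokens.push({type: 'characterReference', name, value: namedReferences[name]})
    last = pattern.lastIndex
  }
  if (last < input.length) {
    tokens.push({type: 'text', value: input.slice(last)})
  }
  return tokens
}
```

A "probably a reference" token, as floated above, would amount to dropping the table lookup here and leaving the validity check to a later stage such as remark.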
As in the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?
— @johno
Typically, languages have other languages inside them. The two examples here are HTML in Markdown and JSX in MDX. The difference is that Markdown doesn’t parse HTML, it parses some XML-like structures. Whereas MDX indeed seems to parse (in the future?) only valid JSX.
This may be a bit of a problem: HTML/MD don’t have “invalid” content that throws a parse error and crashes. JSX/MDX do have that.
Would the approach be that MDX extends/replaces micromark's HTML tokenizer/parser/compiler and replaces it with its own (likely using Babel)?
— @johno
Correct! I think there are two crucial examples of extensions for micromark: 1: GFM, 2: MDX. The first is probably easier, is also very ubiquitous, and can function as a proof of concept; the second is necessary but can take a bit longer, as it’s probably a bit harder.
The gist of it is that we need to figure out what the “hooks” are for extensions to plug into. Can they add/remove new inlines/blocks? Can they inject into the states of a context?
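As a way to think about the "can they add/remove inlines/blocks?" question, here is a purely hypothetical TypeScript sketch of what such a hook could look like. Nothing here is micromark's API: `Construct`, `Extension`, and `combineConstructs` are invented for illustration, and the MDX-like extension simply swaps the `html` construct for a `jsx` one, along the lines of the question above.

```typescript
// Hypothetical shapes for illustration; not micromark's API.
type Construct = {name: string; kind: 'block' | 'inline'}
type Extension = {add?: Construct[]; remove?: string[]}

// Apply extensions in order: first drop the constructs an extension removes,
// then append the ones it adds.
function combineConstructs(
  base: Construct[],
  extensions: Extension[]
): Construct[] {
  let constructs = base.slice()
  for (const extension of extensions) {
    const removed = extension.remove ?? []
    constructs = constructs.filter(
      (construct) => removed.indexOf(construct.name) === -1
    )
    constructs = constructs.concat(extension.add ?? [])
  }
  return constructs
}

// Assumed core constructs and an MDX-like extension, for the example only.
const core: Construct[] = [
  {name: 'paragraph', kind: 'block'},
  {name: 'html', kind: 'block'},
  {name: 'emphasis', kind: 'inline'}
]
const mdxLike: Extension = {
  remove: ['html'],
  add: [{name: 'jsx', kind: 'block'}]
}

const enabled = combineConstructs(core, [mdxLike]).map(
  (construct) => construct.name
)
```

This covers only the add/remove half of the question; injecting into the states of a context would need a finer-grained hook than a list of named constructs.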
from micromark.
CMSM does not define backtracking. This therefore removes the possibility of extensions in favour of performance.
I see two possibilities for extensions: a) define useful extensions in CMSM and enable them with flags, or b) allow some form of hooks.
I’d like to table this for now: first priority is to get micromark working (keeping extensions in mind), actually supporting extensions comes after that.
from micromark.