Comments (15)
Agreed, let's bike shed this (tomorrow for me!)
from jupyterlab-markup.
I think CodeMirror overlay modes might be the right tool for this job. Each extension that wants to extend the syntax can just add an overlay. If this were the case, we wouldn't need to implement an interface for this; the existing CodeMirror interface should (I believe) suffice.
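To sketch the overlay idea: an overlay mode's `token()` only styles the syntax it adds and returns `null` for everything else, so the base markdown mode's styling shows through. The snippet below is a self-contained illustration (the `MiniStream` class is a tiny stand-in for CM5's `StringStream`, and the `{directive}` token is just an example):

```typescript
// Minimal stand-in for CM5's StringStream, so this sketch runs on its own.
class MiniStream {
  pos = 0;
  constructor(private text: string) {}
  match(re: RegExp): boolean {
    const m = this.text.slice(this.pos).match(re);
    if (m && m.index === 0) {
      this.pos += m[0].length;
      return true;
    }
    return false;
  }
  next(): string | undefined {
    return this.text[this.pos++];
  }
}

// An overlay only returns styles for the syntax it adds; returning null
// lets the underlying mode's styling show through.
function directiveToken(stream: MiniStream): string | null {
  if (stream.match(/^\{[a-zA-Z][\w-]*\}/)) {
    return "keyword"; // style `{directive}` like a keyword
  }
  stream.next();
  return null;
}

// In real CM5 code this would be wired up roughly as:
//   editor.addOverlay({ token: directiveToken });
```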
I had a brief chat with the EBP people, and spent a little bit of time looking into the feasibility of this. AFAICT, with the current markdown-it + CM5 approach, each plugin will need to write two different tokenizers, one for markdown-it and one for a CM mode.
The Double-Implementation Problem
This doesn't sit hugely well with me - it seems crazy that we do effectively the same work twice. The simplest solution here is to use a Markdown library that does include position information, and fit it into a CM Mode. There would be some challenges here:
- performance: CM modes need to be fast, and I'm not sure how well this full-reparse would work with single character edits while typing. Unlike markdown rendering, this needs to be immediate*.
- look-ahead: CM modes can look ahead, but then we'd need to handle the case where we've performed a look-ahead to produce some tokens, and then read that same line in the next `token()` call. I'm sure this is do-able, but a bit messy.
- granularity: CM modes want to style syntactic tokens. I'm not convinced that this abstraction is well enforced with markdown-it-style parsers - markdown-it in particular (despite not producing usable position info) makes it very easy to produce "tokens" that are not syntactic. For example, consider (where I've replaced backticks with `:::`):

```
::: python
some code!
:::
```

Markdown-it produces a single AST-like `Token`. Conversely, Lezer produces:

```
Document(
  FencedCode(
    CodeMark(":::")
    CodeInfo("python")
    CodeText("some code!")
    CodeMark(":::")
  )
)
```
I mentioned Lezer - CM6 standardises language information around a concrete syntax tree, which can either be generated by Lezer's LR runtime, or by another parser that produces the same structures.
The summary here is that CodeMirror (5 and 6) really needs an incremental parser, both wrt. performance and API-matching. So, if one wanted to re-use the parser for Markdown rendering and highlighting, then really we want to satisfy that. I think CM6 has a nicer API here: instead of feeding the parser line-by-line and requiring the formatting each time, CM6 wants the entire parse tree, but can later call in to reparse only a subset. I don't know CM5 well enough to be sure, but I suspect that we would have to handle parse-tree invalidation inside the CM mode ourselves.
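A toy sketch of the invalidation bookkeeping involved (purely illustrative: a real incremental parser like Lezer reuses *tree fragments*, not per-line token arrays, but the idea of reusing work for unchanged regions is the same):

```typescript
// Toy incremental re-parse: tokenize per line, and reuse cached results
// for lines whose text is unchanged since the last parse.
type LineTokens = { text: string; tokens: string[] };

function tokenizeLine(text: string): string[] {
  // stand-in "parser": split on whitespace
  return text.split(/\s+/).filter(Boolean);
}

function reparse(lines: string[], cache: LineTokens[]): LineTokens[] {
  return lines.map((text, i) => {
    const cached = cache[i];
    if (cached && cached.text === text) {
      return cached; // unchanged line: reuse the old result
    }
    return { text, tokens: tokenizeLine(text) }; // changed: re-tokenize
  });
}
```

A single-character edit then only re-tokenizes the touched line, instead of scaling with document size.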
Relatedly, there is discussion about how to move beyond TextMate grammars for VSCode.
Wider Issues
Before I thought about the re-parse cost, I was going to suggest something radical: speaking to the EBP team made me think about the fact that we have two separate implementations of markdown-it and the plugin ecosystem, one in Python and one in JS. We could think ahead and use a Rust/WASM base for markdown parsing, which could then be used by the Python tools too. I think this is where the space is probably headed, but it's a lot of work.
Additionally, LSP markdown-support is something we've talked about, and being able to share some of the implementation here would be nice too.
Conclusions
I think there are two separate issues now being discussed in this post:
- sharing parsing between CM and markdown rendering in the browser
- sharing parsing + rendering with backend for EB / LSP
I am just not familiar enough yet with the problem space to know what the best long-term solution is. If incremental parsing is viable for rendered markup, then it sounds like the best approach - it will also reduce our repaint times (although the DOM/VDOM is ultimately going to be the bottleneck I suspect). However, we would need a second pass IIRC to handle things like link validation which aren't possible in a single forward-pass.
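To make the link-validation point concrete: reference definitions like `[ref]: https://…` may appear *after* the links that use them, so a single forward pass can't know whether a reference resolves. A minimal two-pass sketch (function names are illustrative, not any real library's API):

```typescript
// Pass 1: collect all link reference definitions in the document.
function collectDefinitions(lines: string[]): Set<string> {
  const defs = new Set<string>();
  for (const line of lines) {
    const m = line.match(/^\[([^\]]+)\]:\s+\S+/);
    if (m) defs.add(m[1].toLowerCase());
  }
  return defs;
}

// Pass 2: flag any [text][ref] whose definition was never seen.
function findBrokenRefs(lines: string[]): string[] {
  const defs = collectDefinitions(lines);
  const broken: string[] = [];
  for (const line of lines) {
    for (const m of line.matchAll(/\[[^\]]+\]\[([^\]]+)\]/g)) {
      if (!defs.has(m[1].toLowerCase())) broken.push(m[1]);
    }
  }
  return broken;
}
```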
I've seen a few ideas here:
The only WASM-friendly option is the last one. Toastmark extends commonmark.js to add enough information to be able to build an AST. It seems like they rely on being able to use contextual information to move back to a CST:
Furthermore, the token structure that composes different elements of the Markdown is simple enough that it can be implemented using an abstract syntax tree just by adding a few pieces of information.
I think Toastmark is avoiding the CM mode API by instead using marks. Maybe that would be a good interim, because the Mode API only handles highlighting and indentation (i.e. not folding).
Additionally, rendering / analysis tools might want more than the CST - an AST would be much easier to render. This would warrant a second pass.
I am considering whether it's better to take a longer-term view of the solution here. Rather than investing time into getting highlighting working with CM5 Modes + markdown-it (and writing everything twice), maybe actually moving to lezer (or at least generating a lezer CST) would be a good thing™ in the long run? By dropping markdown-it we would immediately lose the entire ecosystem, which would not be ideal. However, the core Markdown extensions that make it worth using are not too complex. Maybe a community-wide effort here would be sufficient to keep things ticking over?
I can't see a way that we can have our cake and eat it unless we make some bold decisions regarding the future plans here :/
Useful links / recap:
- https://marijnhaverbeke.nl/blog/lezer.html
- https://lezer-sandbox-1gurlbleb-a61.vercel.app/
- http://tree-sitter.github.io/tree-sitter/
- microsoft/vscode#77140
- https://toastui.medium.com/the-need-for-a-new-markdown-parser-and-why-e6a7f1826137
- https://discuss.codemirror.net/t/is-it-feasible-to-create-a-mode-based-on-existing-ast/1684/5
@agoose77 something I think you mentioned to me, that you don't mention here, is https://github.com/syntax-tree/mdast (based on https://github.com/syntax-tree/unist), which is basically a nice language agnostic (JSONable) and extensible AST format for Markdown (and also includes line/column source mapping).
I'm not sure how this would fit in with the incremental parsing (and lezer etc),
but it feels like a nice, standardised, format to centre around, without tying yourself to one "technology", and its particular AST format.
I'm thinking to write the MyST spec basically as an extension on mdast and then, in principle, you can just use any parser, renderer, or LSP that supports it.
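For concreteness, here is roughly what a heading looks like as an mdast node (shapes follow the published mdast/unist specs; the TypeScript interfaces below are simplified stand-ins for illustration, not the official typings):

```typescript
// Simplified unist/mdast shapes: plain JSON nodes with `type`,
// optional `children`, and 1-based line/column source positions.
interface Point { line: number; column: number; offset?: number }
interface Position { start: Point; end: Point }
interface Node {
  type: string;
  value?: string;
  depth?: number;
  children?: Node[];
  position?: Position;
}

// What `## Hello` on line 1 looks like as mdast:
const heading: Node = {
  type: "heading",
  depth: 2,
  children: [{
    type: "text",
    value: "Hello",
    position: { start: { line: 1, column: 4 }, end: { line: 1, column: 9 } },
  }],
  position: { start: { line: 1, column: 1 }, end: { line: 1, column: 9 } },
};
```

Being plain JSON, the same tree can be produced or consumed on either side of the Python/JS divide.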
Yeah: the balance between future architectural correctness and getting software into peoples' hands is elusive.
In the near term...
We probably need to continue making the most pragmatic choices such that we can ship software that folk can use, today, with other tools they like. So for now: we have to deal with CM(5). That first step might be a new `ipythongfm`, to be a little better, not optimal... in my mind, the most important things being non-bog-standard markdown magic tokens e.g. `{directive}`, etc. and the ability to switch into dedicated modes.
So, the really messy option today that would make it possible would be maybe some kinda middleware junk:
```typescript
export interface IMarkdownModeOpts {
  modes: { [key: string]: any }; // initially: gfm, tex
  multiplexingModes: any[];
  config: CodeMirror.EditorConfiguration; // the runtime ones
  modeOptions?: any; // the runtime ones
}

export interface IPluginProvider {
  // ...
  syntaxExtension: (options: IMarkdownModeOpts) => IMarkdownModeOpts;
}
```
...and then we stack everything up when a mode is requested.
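"Stacking everything up" could be as simple as folding each plugin's `syntaxExtension` over a base options object when a mode is requested. A rough sketch (with simplified stand-in types, since the real `IMarkdownModeOpts` above references CodeMirror types not loaded here):

```typescript
// Simplified stand-ins for the interfaces sketched above.
interface ModeOpts {
  modes: { [key: string]: any };
  multiplexingModes: any[];
}

type SyntaxExtension = (opts: ModeOpts) => ModeOpts;

// Fold every registered plugin's extension over the base options.
function buildModeOpts(base: ModeOpts, extensions: SyntaxExtension[]): ModeOpts {
  return extensions.reduce((opts, extend) => extend(opts), base);
}

// e.g. one plugin adds a tex mode, another registers a mermaid multiplexer:
const opts = buildModeOpts({ modes: { gfm: {} }, multiplexingModes: [] }, [
  (o) => ({ ...o, modes: { ...o.modes, tex: {} } }),
  (o) => ({ ...o, multiplexingModes: [...o.multiplexingModes, "mermaid"] }),
]);
```

Each extension returns a new options object, so plugins compose without mutating each other's state.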
But longer term...
I am a pretty big proponent of WASM. Seems like an appropriate thing for a rendering engine, but feels overkill for "just" syntax highlighting. Indeed, we had to deploy some wasm for jupyterlab-simple-syntax because textMate highlighting bundles use a flavor of regex that is... non-trivial. Felt icky. But for full LSP-grade analysis, as has been noted in this thread and elsewhere (e.g. sync scrolling)... yeah, might as well get your syntax highlighting in the same parse.
Moving outside of the text editing/rendering experience: the jupyterlite experiment has been great, showing that a (mostly) familiar interactive computing experience (pyolite on pyodide on emscripten) is workable... but as a new platform, we're still limited in many ways. I'd say stay tuned in 2022 for more composable stuff that people can plug into...
I think WASM is going to be the bottom of a next-level version of reproducible, interactive computing, and jupyter is well suited to be a banner under which it gets into users' hands. I doubt the next generation of users will think so much about what language a particular function is implemented in, and whether code is being run in-loop in the browser per keystroke or being executed in a massively parallel HPC setting. Things like WASM Types, extended to work with Arrow, and wrapped with metadata like real SI units, will make doing real science pretty awesome.
Thanks @bollwyvl, is there any good reading on WASM? It's not something I've looked into much yet.
Does it basically mean you have to implement the parser in C++/Rust?
And @agoose77 did you mention that there is also some possibility of direct integration with Python? (as opposed to having to call it in a subprocess)
Yes, my thoughts on this topic are motivated by the wider landscape of who is using jupyterlab-markup, and who needs markdown rendering more generally. I just don't like the fact that if I want to implement extensions to commonmark that support syntax highlighting + executable books, I'd have to write the same parser/lexer three times!
With respect to solving "delivering solutions now", I am currently in favour of not using the CM5 Mode API, and instead relying solely on the Marks API. I think that is workable, and if so it would allow us to get started on using a high-granularity parser today.
The common problem that we all have is generating a document-aware syntax tree. Whether that is a CST or AST is less important. If we could standardise the parsing of Markdown for "commonmark extensions", then LSP + EB + Jupyter would all get that for free. The rendering again could be shared between EB + Jupyter. I don't know how VSCode would fit into this w.r.t. rendering - they seem currently reluctant to expose the Markdown renderer itself as an extension point. Maybe it wouldn't be so bad to add another editor, which seems to be what they recommend.
`mdast` (based on `unist`) does include position information, and so it should be possible to reconstruct the concrete syntax tree from the AST and the original source. For highlighting purposes, one would probably want to keep the CST around and generate the AST from that. As a general syntax tree, it should be possible to generate mdast from lezer, for example. I think choosing mdast as a specification would not be a bad idea. It would certainly move one step towards unifying the ecosystem.
WASM & Rust (which can compile to WASM) are both accessible from Python. This means it is possible to write the implementation once, and re-use it in Python + JS. Of course, this would mean writing code in the common denominator language, e.g. Rust. One could do this in Python as Python can be compiled to WASM, but right now that involves a lot of work & bloat as @bollwyvl alludes to.
There is also the benefit of standardising the AST for existing tools: the ToC extension IIRC parses the Markdown to identify headings in notebooks / Markdown documents. Having a generated AST / being able to request the AST would mean:
- we only do this parsing once (hopefully)
- syntax extensions are supported out of the box
Another benefit is prosemirror integration: I imagine that is a lot easier if you're working at the level of an AST.
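For the ToC case above, "requesting the AST" turns heading discovery into a tree walk instead of a second Markdown parse. A sketch with simplified mdast-style node shapes (illustrative, not the actual ToC extension code):

```typescript
// Simplified mdast-style node for this sketch.
interface TocNode {
  type: string;
  depth?: number;
  value?: string;
  children?: TocNode[];
}

// Walk the tree once, collecting every heading's depth and text.
function collectHeadings(root: TocNode): { depth: number; text: string }[] {
  const out: { depth: number; text: string }[] = [];
  const walk = (node: TocNode) => {
    if (node.type === "heading" && node.depth !== undefined) {
      const text = (node.children ?? []).map((c) => c.value ?? "").join("");
      out.push({ depth: node.depth, text });
    }
    (node.children ?? []).forEach(walk);
  };
  walk(root);
  return out;
}
```

Because syntax extensions contribute ordinary nodes to the same tree, a walker like this supports them with no extra parsing work.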
write the same parser/lexer three times!
Well... part of that comes from tools hand-writing parsers/lexers in an implementation language in the first place. But markdown is a crazy mess to parse properly, even before adding extensibility. But if starting over... rather than jumping straight to PARSER IN RUST NOW, at least taking a cursory look at a portable stack like antlr or lark, which focus effort on writing declarative specifications and then generating implementations, might be worthwhile.
Indeed: Jupyter would really benefit from a declarative (preferably JSON-compatible) way for e.g. kernels to describe their language grammars (especially dynamic deviances a la jupyter-lsp/jupyterlab-lsp#191). Briefly, on jupyter-lsp: despite its warts, for the larger code editing mission, we can't afford to lose what CM5 already represents to the community. We are excited to get our hands on CM6. Maybe I'll warm `simple-syntax` back up, as those TextMate bundles, supported by half the editors out there, would be even better, but again, see magic-regexes-that-need-wasm! Which brings us to...
any good reading on wasm
Here's a high level site, some specs (including the forthcoming types) as well as some nuts-and-bolts blog posts, like asciinema, and some position pieces.
you have to implement the parser in C++/Rust?
WASM is a target for a number of compiled languages now: c, rust, erlang, go, haskell, etc. There are some higher-level languages, such as the typescript-like assemblyscript. Initially, this grew out of the corpus of tricks in asm.js, and was meant to enable reasonably performant in-browser execution of otherwise-opaque software: in 2022, it's not much of a stretch to say it's easier to run a lot of things in the browser than natively (and well) on windows. More recently, it is proving interesting as a non-browser technology due to its sandboxing; even more weirdly, firefox will soon be shipping some vendored stuff compiled from C to WASM, and then back into C!
direct integration with Python
In JupyterLite, which only cares about (real) browsers, we're using pyodide to deliver the IPython/ipykernel stack, including ipywidgets. Most packages run unmodified! But the biggest win is that you can deploy certain interactive experiences to, theoretically, millions of simultaneous users (willing to maybe download ~100mb of python to their browsers) with just a free/low-cost static web host and a CDN.
work & bloat
Pyodide is basically a CPython distribution, and has a conda-like build chain, to get up a Linux-like system with numpy/pandas with emscripten. Unfortunately, its build chain is just conda-like... there's some work starting soon to see if this can actually be conda(-forge) so that we can start getting automated updates of thousands of packages, instead of one every pyodide release to update/add libraries.
However: the ticket to get in the door for that python integration is ~20mb, per kernel. As such, we have been pushing back against using any python wasm as part of the "web server" that runs in the browser, instead re-implementing key parts of `jupyter_server` and `jupyterlab_server` in typescript.
Meanwhile...
On the "server" there are a number of standalone runtimes, such as wasmer and wasmtime, as well as things that are shooting for even greater security such as enarx. Wasmer, in particular, has many language-specific bindings, such as wasmer-python. The win here for jupyter-adjacent projects would be to not be chasing the moving target of python ABI complexity per-platform-per-python-per-wheel, and just be able to ship a single WASM blob that would execute anywhere, including the browser, but enjoy a performance profile closer (by orders of magnitude) to C-level code than python code.
at least taking a cursory look at a portable stack like antlr or lark,
I'm not sure it's even theoretically possible to parse CommonMark as a context-free grammar? (See e.g. https://roopc.net/posts/2014/markdown-cfg/.) Let alone with any syntax extensions.
Here's a high level site, some specs (including the forthcoming types) as well as some nuts-and-bolts blog posts, like asciinema, and some position pieces.
Cheers, will check it out!
even theoretically possible to parse commonMark as context free grammar
right, I'll grant that even "old high markdown" is basically the social media engrish of markup languages.
But there are grammars and then there are grammars. For syntax highlighting, especially in a narrative language, it just needs to be good enough and fast enough and be really good at handling broken state.
Indeed, having a lenient grammar with terminals like `IDK_MAYBE_BROKEN_LOL` that quickly consume ambiguous conditions until the next block boundary is probably not the worst thing in the world... for syntax highlighting.
And there's no helping things like footnote-style markdown refs.
But even something block level would be a fairly big step up for portability, especially for the case of embedding multiple syntax modes inside other syntaxes.
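The "good enough, fast, broken-state-tolerant" block-level idea can be sketched as a line classifier with a catch-all terminal (token names and patterns here are made up for illustration):

```typescript
// Lenient block-level pass: classify what we recognise, and sweep
// anything ambiguous into a catch-all token until the next block
// boundary. Good enough for highlighting, not for rendering.
type BlockToken = "HEADING" | "FENCE" | "BLANK" | "IDK_MAYBE_BROKEN_LOL";

function classifyBlocks(lines: string[]): BlockToken[] {
  return lines.map((line) => {
    if (/^#{1,6}\s/.test(line)) return "HEADING";
    if (/^(```|~~~|:::)/.test(line)) return "FENCE";
    if (/^\s*$/.test(line)) return "BLANK";
    return "IDK_MAYBE_BROKEN_LOL"; // consume anything ambiguous
  });
}
```

A broken construct never derails the parse: it just becomes a catch-all token, and classification resumes cleanly at the next recognisable boundary.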
syntax extensions
in lark, at least, one can make extensible grammars... but if that particular feature isn't portable to other implementations it wouldn't be as much fun. and i would not wish runtime antlr generation on anyone!
at least taking a cursory look at a portable stack like antlr or lark,
I'm not sure it's even theoretically possible to parse CommonMark as a context-free grammar? (See e.g. https://roopc.net/posts/2014/markdown-cfg/.) Let alone with any syntax extensions.
Right, from the reading that I've done (given that I've not had time to look at it myself yet), writing a formal grammar for Markdown is a very difficult challenge. The author of this link makes a few other comments elsewhere, and essentially their argument is:
- Any text is valid Markdown, so it is not possible to formalise it in a conventional grammar.
- Given this, the best thing for a markdown specification is an algorithm implementation e.g. state machine, that encodes the parse rules unambiguously.
There are definitely a number of different concerns/priorities in this thread. As I see them, we have:
- Performant parsing
- Non-DRY code for plugin authors
- Implementation conformance between Jupyter projects
- Code duplication between ecosystems
- Standardisation of markup
Maybe some of these concerns do not need to be solved any time soon. But, if we allow ourselves the opportunity to consider them, we have:
- Performant parsing (for edits) - use an incremental parser. Ultimately, a parser needs to avoid doing O(doc size) work for any small change.
- Non-DRY code - use the same tokenizer / parser (depending upon how you highlight) for HTML generation and syntax highlighting
- Implementation conformance - define a canonical implementation, or share the implementation itself! (e.g. targeting WASM)
- Code duplication between ecosystems - share the implementation, or interop e.g. a headless browser process.
- Standardisation - see below
I did note that roopc implemented a Markdown specification for a modified Markdown. However, once you start having variations (plugins) on this specification, it would be difficult to resolve how ambiguities should be handled. The easiest and most robust solution that I can see to that is to just have a canonical implementation and decree that that is the right way to parse it. My gut feeling is that the best direction for Jupyter projects as a whole is to:
- Require that plugins do not break commonmark (test) conformance
- Share implementations as much as possible
We are already doing this in part with EB + jupyterlab-markup: both use Markdown-it / markdown-it ports, and (assuming conformance) that is more consistency than the range of Markdown renderers in use by different platforms (JupyterLab/notebook with Markedjs, colab? kaggle? GitHub renderer?)
If we don't consider an implementation-defined spec, then the next best thing is a big test suite defining the implementation.
That's a lot of stuff, and sorry for encouraging wandering off down the wasm path.
I look forward to a future where a user-driven set of choices are documented and honored by the tools (a la #13) but feel like "conformance" is a very big word to use in this use case, and definitely out of scope of a PR that answers the title/description of this issue.
Basically, after said PR was merged, installing a future 1.x release of this extension would extend the existing JupyterLab 3.x editing experience to highlight some of the new syntax it supports rendering e.g. mermaid (now part of GFM), without breaking the experience provided by other extension authors (e.g. LSP, modes from other languages, collaborative editing with presence). Ideally, this would be managed in a way that downstreams of this plugin could also add additional features... but maybe #40 would demand this anyway.
Even having gross block-level modes, as supported by the existing cm5+ipythongfm would be sufficient for today's notebook markdown cell editing experience and markdown documents of reasonable size, like a project's README, and whole (jupyter) books are again a whole other beast.
Just as an additional point of reference, you also now have https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens
I guess this is similar to overlays, in that it is not intended to provide the full highlighting, just to enhance it.
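For reference, the semantic tokens wire format in LSP 3.17 packs each token as five uints `[deltaLine, deltaStartChar, length, tokenType, tokenModifiers]`, with positions relative to the previous token. A sketch of the encoder (ignoring modifiers; the `SemToken` input shape is my own stand-in):

```typescript
// Absolute token positions, as a highlighter would naturally produce them.
interface SemToken {
  line: number;      // 0-based absolute line
  startChar: number; // 0-based absolute column
  length: number;
  tokenType: number; // index into the legend the server advertises
}

// Encode into the LSP relative format: deltaStartChar is only relative
// to the previous token when both sit on the same line.
function encodeSemanticTokens(tokens: SemToken[]): number[] {
  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const t of tokens) {
    const deltaLine = t.line - prevLine;
    const deltaChar = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    data.push(deltaLine, deltaChar, t.length, t.tokenType, 0);
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}
```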