I got an issue reported on my repo that markup is not working in the headlines, which

Markup implementation about tree-sitter-org HOT 8 CLOSED

milisims commented on September 13, 2024

Markup implementation

from tree-sitter-org.

Comments (8)

milisims commented on September 13, 2024

Because of an original implementation detail, this turned out to be a little tricky, so I decided to merge the issue with another, more general one, "Improve the markdown syntax parsing", which involves a pretty substantial _textelement rewrite. I did most of the work on this already, but it has some issues that I still need to spend some time tackling. If I can't figure it out by this weekend, I'm going to throw up the PR for tracking.

In a nutshell, each text element has 4 versions, oneline/multiline and immediate oneline/multiline, where the immediate versions just have the token.immediate on the first node of each text element. The grammar overall mostly just uses oneline/multiline versions where appropriate, and the immediate versions are pretty much just for inside of markdown elements.

milisims commented on September 13, 2024

It's on the TODO list. We can't use the existing markup groups because they allow multi-line items in their definition, and a "simple" addition will not solve conflicts, so the solution requires single-line markup groups and dealing with conflicts appropriately. If you feel like attempting it, please do, but conflicts are a bit tricky to wrap your head around. I'll bump its priority; I should be able to get a first attempt done today or tomorrow. I suspect it'll be "easy", but sometimes weirdness pops up that I don't know how to solve.

kristijanhusak commented on September 13, 2024

Thanks for the info. I'll leave it to you, since I don't have enough knowledge around it. No need to rush on it, since there is a workaround, and it doesn't completely solve the markup issue for me at the current Neovim state (concealing markup markers), which is currently solvable only with Vim syntax.

milisims commented on September 13, 2024

Okay, so I've been struggling with this for a while now. Not just headlines, but markup in general. Basically, the pre-markup and post-markup requirements (listed here) are... difficult to deal with, and hardcoding any markup makes the parser impossible to customize. So as far as I can see, there are three solutions:

1. Implement markup in the grammar as in the syntax

Pros:

libraries are easier to write using this parser.
No ambiguity

Cons:

dramatically more difficult to implement in the grammar to work well.
Substantially larger parser.c (markup has the largest number of possible states, by far)
Slows parsing down.
Minimal configurability

2. Implement the markup symbols at starts and ends

For example, a *b c* would mark the two *s as possible candidates, then have the libraries use queries or subsequent code to determine if they are valid or not.
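
As a rough illustration of the "mark candidates, decide later" idea, here is a minimal Python sketch. It works on a plain string rather than on the real syntax tree. `MARKERS` and `candidate_markers` are hypothetical names, and the marker set is an assumption based on Org's default emphasis characters.

```python
# Hypothetical sketch: flag every emphasis character as a *candidate*
# and leave the validity decision to later code (queries, library logic).
MARKERS = set("*/_=~+")  # assumed: Org's default emphasis characters


def candidate_markers(text):
    """Return (offset, character) for each possible markup symbol."""
    return [(i, ch) for i, ch in enumerate(text) if ch in MARKERS]


print(candidate_markers("a *b c* d"))  # [(2, '*'), (6, '*')]
```

Nothing here decides whether the two candidates actually form bold text; that check is the extra legwork left to the library.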

Pros:

easier to write than (1) in the grammar, somewhat.
Results in a much simpler and faster parser.
"More permissive" parsers are generally favored in the mind of tree-sitter's author, as it's designed to use queries to parse out specifics

Cons:

Library authors need to do some extra legwork.
Still doesn't offer a huge amount of customization.
The biggest difficulty with (1) is handling all of the pre-markup and post-markup possibilities. Post-markup is easier, because of the external scanner, but lookbehind doesn't really exist without a LOT of work. This issue remains unresolved.

3. Don't implement it, and instead expose all words, numbers, and symbols in the `_text` node as anonymous nodes.

Pros:

Easiest to implement in the grammar, by far.
Fastest parser.
Most permissive
Allows the greatest flexibility, the user can use any symbol
Could put together a set of queries that would help the user find possible markup regions
Best: it would be fairly easy to write a tree-sitter-org-markup-generator, which might allow a user to generate a parser to attach to the element parser in headline and paragraph regions, or wherever they want, using whatever symbols they want. This generator wouldn't need an external scanner, so it would even be possible to generate a parser.c that could be modified by script, removing the node.js requirement to generate parsers with different symbols. No comments on simplicity though.

Cons:

If using only queries, more work for the library authors than the other two options.
If using a markup generator parser, then the 'end user' might have additional steps they need to take... but sane defaults and some variants could be provided easily.
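
To make "expose all words, numbers, and symbols" more concrete, here is a minimal Python sketch of that kind of flat token stream. It is only an illustration on a plain string, not the grammar itself. The `TOKEN` pattern and the `text_tokens` helper are hypothetical, and the ASCII-only character classes are a simplifying assumption.

```python
import re

# Hypothetical token pattern: a run of digits, a run of ASCII letters,
# or any single other non-space character (a "symbol").
TOKEN = re.compile(r"(?P<number>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<symbol>\S)")


def text_tokens(text):
    """Split text into (kind, value, start, end) tuples."""
    return [
        (m.lastgroup, m.group(), m.start(), m.end())
        for m in TOKEN.finditer(text)
    ]


for token in text_tokens("a *b2* c"):
    print(token)
```

With every symbol visible as its own token, the choice of which symbols count as markup can be made outside the parser.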

So originally, I started writing this post to get some clarity and opinions. After writing it, it seems clear to me that (3) is the best option, which is really two separate components. Exposing all the information as nodes, and then writing the tree-sitter-org-markup-generator. Part of what makes that a lot easier is that I don't have to worry about conflicts with other element tokens, which get pretty nasty.
What are your thoughts?

kristijanhusak commented on September 13, 2024

I'm perfectly fine with both the 2nd and 3rd solutions, but I agree that the 3rd seems best. Currently I'm using Vim's syntax regions to do this, and most likely I will need Vim syntax alongside TS for some time because of the concealing. I think the 2nd solution might cause some issues here, because I wouldn't be able to do this concealing so easily with it, even if I know the start/end node.
The 3rd solution also improves parsing speed, which is crucial, since I noticed some slowness with highlights enabled on bigger files (>1k lines). This slowness could also be caused by the TS highlights, which are still experimental, or by the Vim syntax that's used on some parts of the code.
The only worry I have with the 3rd is this:

it would be fairly easy to write a tree-sitter-org-markup-generator, which might allow a user to generate a parser to attach to the element parser in headline and paragraph regions, or wherever they want, using whatever symbols they want. This generator wouldn't need an external scanner, so it would even be possible to generate a parser.c that could be modified by script, removing the node.js requirement to generate parsers with different symbols. No comments on simplicity though.

I'm not exactly sure how I would achieve this with Neovim. Is it just setting up the additional parser?

milisims commented on September 13, 2024

I just pushed a new branch, more or less along the lines of what we chatted about here. The markup queries can be used to find where markup should be highlighted, but it won't be plug-and-play with the current TSHighlighter in neovim. I'm mostly there on having the extmarks created properly with some variant of the algorithm below, and maintaining extmarks should be straightforward (testing the markup queries against the changed nodes in an on_bytes callback). I don't think we need to tie it directly into nvim_set_decoration_provider; we can just maintain extmarks instead.

Pythony pseudo algorithm for finding the matches:

MARKERS = {"*", "/", "_", "=", "~", "+"}  # assumed node types for markup symbols

def find_markup(nodes, validate):
    # assumes sorted list of nodes
    # all nodes should be from a single (paragraph), (itemtext), or (item)
    # validate(end_node, start_node) checks the pre/post markup, can be modified

    seeking = []  # open candidates as (type, start_node), oldest first
    found = []
    for node in nodes:
        open_types = [t for t, _ in seeking]
        if node.type in open_types:
            ix = open_types.index(node.type)
            stnode = seeking[ix][1]
            if validate(node, stnode):
                # stop seeking the matched marker and everything opened after it
                seeking = seeking[:ix]
                found.append({'type': node.type, 'range': (stnode.start, node.end)})
        elif node.type in MARKERS:
            seeking.append((node.type, node))
    return found
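
The `validate` hook in the algorithm above is left open. Below is a minimal sketch of what it could look like, assuming nodes carry integer character offsets in `start` and `end` and that the original text is available. The `Node` tuple, `make_validator`, and the `PRE`/`POST` sets are all hypothetical; the sets reflect my reading of the pre/post rules in the Org syntax description and may not match every configuration.

```python
from collections import namedtuple

# Stand-in for a syntax-tree node. Real tree-sitter nodes expose
# positions differently, so these fields are an assumption.
Node = namedtuple("Node", "type start end")

# Assumed from the Org syntax description of emphasis markers.
PRE = set(" \t\n-({'\"")
POST = set(" \t\n-.,;:!?')}[\"\\")


def make_validator(text):
    """Build a validate(end_node, start_node) hook bound to `text`."""
    def validate(node, stnode):
        # Treat the start and end of the text as line boundaries.
        before = text[stnode.start - 1] if stnode.start > 0 else "\n"
        after = text[node.end] if node.end < len(text) else "\n"
        body = text[stnode.end:node.start]
        return (
            before in PRE
            and after in POST
            and body != ""
            and not body[0].isspace()
            and not body[-1].isspace()
        )
    return validate


text = "a *b c* d*e"
validate = make_validator(text)
print(validate(Node("*", 6, 7), Node("*", 2, 3)))   # True: "*b c*" is accepted
print(validate(Node("*", 9, 10), Node("*", 6, 7)))  # False: "c" precedes the opener
```

Keeping the check in a separate function is what makes the pre/post rules swappable without touching the matching loop.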

I have a bunch of notes I'll be adding to a new issue to discuss the changes.

kristijanhusak commented on September 13, 2024

I have a bunch of notes I'll be adding to a new issue to discuss the changes.
I'm looking forward to this!

Do these changes affect how other things are parsed? The change from 4 days ago regarding the plan timestamp caused some breaking changes that I managed to fix today, but I don't know how these changes will affect it. I guess giving it a try will be the best way to figure it out. I covered some portion of the parsing with tests, so I'm hoping that will help me.

milisims commented on September 13, 2024

Do these changes affect how other things are parsed?
Yes, a lot of things are different. I'm not really sure it's sensible to summarize, but start with a look at #13. I'll see if it makes sense to attempt to write a quick list of node differences, but it basically changes everything.
