milisims / tree-sitter-org

Org grammar for tree-sitter

License: MIT License

Scilab 2.34% JavaScript 0.40% C++ 0.03% Python 0.01% Rust 0.10% C 96.92% Scheme 0.19%

tree-sitter-org's Introduction

tree-sitter-org

Org grammar for tree-sitter. Here, the goal is to implement a grammar that can usefully parse org files to be used in any library that uses tree-sitter parsers. It is not meant to implement emacs' orgmode parser exactly, which is inherently more dynamic than tree-sitter easily allows.

Overview

This section is meant to be a quick reference, not a thorough description. Refer to the tests in corpus for examples.

Top level node: (document)
Document contains: (directive)* (body)? (section)*
Section contains: (headline) (plan)? (property_drawer)? (body)?
headline contains: ((stars), (item)?, (tag_list)?)
body contains: (element)+
element contains: (directive)* choose(paragraph, drawer, comment, footnote def, list, block, dynamic block, table) or a bare (directive)
paragraph contains: (expr)+
expr contains: anonymous nodes for 'str', 'num', 'sym', and any ascii symbol that is not letters or numbers. (See top of grammar.js and queries for details)

As in many regex systems, * means "0 or more", + means "1 or more", and ? means "0 or 1".

Example

#+TITLE: Example

Some *marked up* words

* TODO Title
<2020-06-07 Sun>

  - list a
  - [-] list a
    - [ ] list b
    - [x] list b
  - list a

** Subsection :tag:

Text

Parses as:

(document [0, 0] - [16, 0]
  body: (body [0, 0] - [4, 0]
    directive: (directive [0, 0] - [1, 0]
      name: (expr [0, 2] - [0, 7])
      value: (value [0, 9] - [0, 16]
        (expr [0, 9] - [0, 16])))
    (paragraph [2, 0] - [3, 0]
      (expr [2, 0] - [2, 4])
      (expr [2, 5] - [2, 12])
      (expr [2, 13] - [2, 16])
      (expr [2, 17] - [2, 22])))
  subsection: (section [4, 0] - [16, 0]
    headline: (headline [4, 0] - [5, 0]
      stars: (stars [4, 0] - [4, 1])
      item: (item [4, 2] - [4, 12]
        (expr [4, 2] - [4, 6])
        (expr [4, 7] - [4, 12])))
    plan: (plan [5, 0] - [6, 0]
      (entry [5, 0] - [5, 16]
        timestamp: (timestamp [5, 0] - [5, 16]
          date: (date [5, 1] - [5, 11])
          day: (day [5, 12] - [5, 15]))))
    body: (body [6, 0] - [13, 0]
      (list [7, 0] - [12, 0]
        (listitem [7, 2] - [8, 0]
          bullet: (bullet [7, 2] - [7, 3])
          contents: (paragraph [7, 4] - [8, 0]
            (expr [7, 4] - [7, 8])
            (expr [7, 9] - [7, 10])))
        (listitem [8, 2] - [11, 0]
          bullet: (bullet [8, 2] - [8, 3])
          checkbox: (checkbox [8, 4] - [8, 7]
            status: (expr [8, 5] - [8, 6]))
          contents: (paragraph [8, 8] - [9, 0]
            (expr [8, 8] - [8, 12])
            (expr [8, 13] - [8, 14]))
          contents: (list [9, 0] - [11, 0]
            (listitem [9, 4] - [10, 0]
              bullet: (bullet [9, 4] - [9, 5])
              checkbox: (checkbox [9, 6] - [9, 9])
              contents: (paragraph [9, 10] - [10, 0]
                (expr [9, 10] - [9, 14])
                (expr [9, 15] - [9, 16])))
            (listitem [10, 4] - [11, 0]
              bullet: (bullet [10, 4] - [10, 5])
              checkbox: (checkbox [10, 6] - [10, 9]
                status: (expr [10, 7] - [10, 8]))
              contents: (paragraph [10, 10] - [11, 0]
                (expr [10, 10] - [10, 14])
                (expr [10, 15] - [10, 16])))))
        (listitem [11, 2] - [12, 0]
          bullet: (bullet [11, 2] - [11, 3])
          contents: (paragraph [11, 4] - [12, 0]
            (expr [11, 4] - [11, 8])
            (expr [11, 9] - [11, 10])))))
    subsection: (section [13, 0] - [16, 0]
      headline: (headline [13, 0] - [14, 0]
        stars: (stars [13, 0] - [13, 2])
        item: (item [13, 3] - [13, 13]
          (expr [13, 3] - [13, 13]))
        tags: (tag_list [13, 14] - [13, 19]
          tag: (tag [13, 15] - [13, 18])))
      body: (body [14, 0] - [16, 0]
        (paragraph [15, 0] - [16, 0]
          (expr [15, 0] - [15, 4]))))))
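
A tree printed in this format can be loaded back into a small data structure for inspection. The following is a minimal sketch in plain Python, not part of this repository: the `Node` class, `parse_tree`, and `walk` are hypothetical names, and the code assumes the exact `(type [row, col] - [row, col]` layout shown above.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    type: str
    start: Tuple[int, int]  # (row, column), zero-based as printed
    end: Tuple[int, int]
    field_name: Optional[str] = None  # e.g. "headline" in "headline: (headline ...)"
    children: List["Node"] = field(default_factory=list)


def parse_tree(text: str) -> Node:
    """Parse the first S-expression found in text."""
    tokens = re.findall(r"\(|\)|[A-Za-z_]+:?|\[\d+, \d+\] - \[\d+, \d+\]", text)
    pos = 0

    def parse_node(field_name: Optional[str]) -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        node_type = tokens[pos + 1]
        row0, col0, row1, col1 = map(int, re.findall(r"\d+", tokens[pos + 2]))
        pos += 3
        node = Node(node_type, (row0, col0), (row1, col1), field_name)
        pending_field = None
        while tokens[pos] != ")":
            if tokens[pos].endswith(":"):
                # A "name:" token labels the child node that follows it.
                pending_field = tokens[pos][:-1]
                pos += 1
            else:
                node.children.append(parse_node(pending_field))
                pending_field = None
        pos += 1  # consume ")"
        return node

    return parse_node(None)


def walk(node: Node) -> Iterator[Node]:
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


SAMPLE = """
(document [0, 0] - [2, 0]
  subsection: (section [0, 0] - [2, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 6]
        (expr [0, 2] - [0, 6])))))
"""

tree = parse_tree(SAMPLE)
print([n.type for n in walk(tree)])
```

Filtering `walk(tree)` for nodes whose type is `ERROR` gives a quick way to locate parse failures in output like this.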

Install

For manual install, use make.

For neovim, using nvim-treesitter/nvim-treesitter, add to your configuration:

local parser_config = require "nvim-treesitter.parsers".get_parser_configs()
parser_config.org = {
  install_info = {
    url = 'https://github.com/milisims/tree-sitter-org',
    revision = 'main',
    files = { 'src/parser.c', 'src/scanner.c' },
  },
  filetype = 'org',
}

To build the parser using npm and run tests:

Install node.js as described in the tree-sitter documentation
Clone this repository: git clone https://github.com/milisims/tree-sitter-org and cd into it
Install tree-sitter using npm: npm install
Run tests: ./node_modules/.bin/tree-sitter generate && ./node_modules/.bin/tree-sitter test

tree-sitter-org's People


kristijanhusak lukas-reineke tgbugs whythat konstantindjairo schoettl bbigras char gaetgu gagbo vkochan syntacti 0xadk jeroendehaas ali7line amaanq mattmassicotte o-santi nvim-orgmode

tree-sitter-org's Issues

Hyphen in table cell and empty lines between parent/child line items causes error

Hi,

I got a report today with some failed folding here.
Investigation led me to two things:

Tables with - in a cell fail to parse. Example:

* TODO table
|   |     # |
|---+-------|
| - | 42965 |
| D | 41947 |
|   | 43414 |

Output:

(document [0, 0] - [6, 0]
  (section [0, 0] - [6, 0]
    (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 12]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 12])))
    (body [1, 0] - [6, 0]
      (table [1, 0] - [6, 0]
        (row [1, 0] - [2, 0]
          (cell [1, 0] - [1, 1])
          (cell [1, 4] - [1, 11]
            contents: (contents [1, 10] - [1, 11]
              (expr [1, 10] - [1, 11]))))
        (hr [2, 0] - [3, 0])
        (row [3, 0] - [4, 0]
          (cell [3, 0] - [3, 11]
            (ERROR [3, 2] - [3, 5])
            contents: (contents [3, 6] - [3, 11]
              (expr [3, 6] - [3, 11]))))
        (row [4, 0] - [5, 0]
          (cell [4, 0] - [4, 3]
            contents: (contents [4, 2] - [4, 3]
              (expr [4, 2] - [4, 3])))
          (cell [4, 4] - [4, 11]
            contents: (contents [4, 6] - [4, 11]
              (expr [4, 6] - [4, 11]))))
        (row [5, 0] - [6, 0]
          (cell [5, 0] - [5, 1])
          (cell [5, 4] - [5, 11]
            contents: (contents [5, 6] - [5, 11]
              (expr [5, 6] - [5, 11]))))))))
./table.org	0 ms	(ERROR [3, 2] - [3, 5])

Empty lines between a parent list item and its children cause an error. Example:

* TODO list items
- A

  - A1
  - A2

- B

Output:

(document [0, 0] - [7, 0]
  (section [0, 0] - [7, 0]
    (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 17]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 11])
        (expr [0, 12] - [0, 17])))
    (body [1, 0] - [7, 0]
      (list [1, 0] - [5, 0]
        (listitem [1, 0] - [3, 0]
          bullet: (bullet [1, 0] - [1, 1])
          (ERROR [1, 2] - [2, 0]
            (paragraph [1, 2] - [2, 0]
              (expr [1, 2] - [1, 3]))))
        (listitem [3, 2] - [4, 0]
          bullet: (bullet [3, 2] - [3, 3])
          contents: (paragraph [3, 4] - [4, 0]
            (expr [3, 4] - [3, 6])))
        (listitem [4, 2] - [5, 0]
          bullet: (bullet [4, 2] - [4, 3])
          contents: (paragraph [4, 4] - [5, 0]
            (expr [4, 4] - [4, 6]))))
      (paragraph [6, 0] - [7, 0]
        (expr [6, 0] - [6, 1])
        (expr [6, 2] - [6, 3])))))
./list-item.org	0 ms	(ERROR [1, 2] - [2, 0])

Table horizontal rule strangely splits table into multiple nodes

Hi,

Today I started playing with tables and noticed something I can't explain.

Having this content:

  | Test | Foo  | Bar    |
  |------+------+--------|
  | Test |      | Seven  |
  | dda  | test |        |
  |      |      |        |
  | dda  | test |        |

Generates this:

(document [0, 2] - [6, 0]
  (body [0, 2] - [6, 0]
    (table [0, 2] - [2, 0]
      (row [0, 2] - [1, 0]
        (cell [0, 2] - [0, 8])
        (cell [0, 9] - [0, 14])
        (cell [0, 16] - [0, 21])))
    (table [2, 2] - [6, 0]
      (row [2, 2] - [3, 0]
        (cell [2, 2] - [2, 8])
        (cell [2, 9] - [2, 10])
        (cell [2, 16] - [2, 23]))
      (row [3, 2] - [4, 0]
        (cell [3, 2] - [3, 7])
        (cell [3, 9] - [3, 15])
        (cell [3, 16] - [3, 17]))
      (row [4, 2] - [5, 0]
        (cell [4, 2] - [4, 3])
        (cell [4, 9] - [4, 10])
        (cell [4, 16] - [4, 17]))
      (row [5, 2] - [6, 0]
        (cell [5, 2] - [5, 7])
        (cell [5, 9] - [5, 15])
        (cell [5, 16] - [5, 17])))))

Also, if I add one more hr rule to the table like this:

  | Test | Foo  | Bar    |
  |------+------+--------|
  | Test |      | Seven  |
  | dda  | test |        |
  |      |      |        |
  |------+------+--------|
  | dda  | test |        |

Generates this:

(document [0, 2] - [7, 0]
  (body [0, 2] - [7, 0]
    (table [0, 2] - [3, 0]
      (row [0, 2] - [1, 0]
        (cell [0, 2] - [0, 8])
        (cell [0, 9] - [0, 14])
        (cell [0, 16] - [0, 21]))
      (row [2, 2] - [3, 0]
        (cell [2, 2] - [2, 8])
        (cell [2, 9] - [2, 10])
        (cell [2, 16] - [2, 23])))
    (table [3, 2] - [7, 0]
      (row [3, 2] - [4, 0]
        (cell [3, 2] - [3, 7])
        (cell [3, 9] - [3, 15])
        (cell [3, 16] - [3, 17]))
      (row [4, 2] - [5, 0]
        (cell [4, 2] - [4, 3])
        (cell [4, 9] - [4, 10])
        (cell [4, 16] - [4, 17]))
      (row [6, 2] - [7, 0]
        (cell [6, 2] - [6, 7])
        (cell [6, 9] - [6, 15])
        (cell [6, 16] - [6, 17])))))

I would not expect a horizontal rule to split the table in any way. I think this should be a single table with multiple rows.
Looking at the grammar doesn't give me a clue, because there it looks like a line can be either a row with valid cells or a horizontal rule. I can't figure out why it would cause this.

Also, is there a chance to expose the horizontal rule node itself? It's much easier to collect it while parsing the tree instead of figuring it out manually, since I want to introduce reformatting the table on edit, and I also potentially need to shrink/expand the hr rule line.
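
To illustrate why an exposed hr node would help with reformatting, here is a rough sketch of the re-alignment step in plain Python. It works on raw lines rather than on the syntax tree, `format_table` is a hypothetical helper name, and it assumes every line starts with `|` and that a line starting with `|-` is a horizontal rule.

```python
def format_table(lines):
    """Re-align an org table; lines starting with '|-' are horizontal rules."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("|-"):
            rows.append(None)  # placeholder for a horizontal rule
            continue
        # Drop the leading pipe and, if present, the trailing one.
        inner = stripped[1:-1] if stripped.endswith("|") else stripped[1:]
        rows.append([cell.strip() for cell in inner.split("|")])

    ncols = max((len(row) for row in rows if row is not None), default=0)
    widths = [1] * ncols  # keep empty columns at least one character wide
    for row in rows:
        if row is not None:
            for i, cell in enumerate(row):
                widths[i] = max(widths[i], len(cell))

    out = []
    for row in rows:
        if row is None:
            # The rule is rebuilt from the column widths, so it shrinks or grows.
            out.append("|" + "+".join("-" * (w + 2) for w in widths) + "|")
        else:
            cells = row + [""] * (ncols - len(row))
            out.append("| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |")
    return out


table = [
    "| Test | Foo | Bar |",
    "|-+-+-|",
    "| Test | | Seven |",
    "|---|",
    "| dda | test | |",
]
print("\n".join(format_table(table)))
```

With an hr node in the tree, the `startswith("|-")` check could be replaced by a node-type check, and the original indentation could be taken from the node's start column.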

Hyperlink after headline causes error

If there is a hyperlink in the place where plan dates are set (the first line after the headline), the parser fails to parse it.

Example content:

* TODO Go to google
  [[https://google.com]]

Parse output:

(document [0, 0] - [2, 0]
  (section [0, 0] - [2, 0]
    (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 19]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 9])
        (expr [0, 10] - [0, 12])
        (expr [0, 13] - [0, 19])))
    (body [1, 2] - [2, 0]
      (paragraph [1, 2] - [2, 0]
        (expr [1, 2] - [1, 24]
          (ERROR [1, 3] - [1, 22]
            (expr [1, 3] - [1, 22])))))))
./test.org	0 ms	(ERROR [1, 3] - [1, 22])

Parsing errors with latest plan parsing changes

Since the latest commit on main that parses plan dates without checking the exact type (deadline,scheduled,closed), I started getting errors for regular plain content. For example:

* TODO Orgmode
  Docs: [[https://orgmode.org/worg/doc.html][Doc]]

Parses like this:

(document [0, 0] - [2, 0]
  (section [0, 0] - [2, 0]
    (headline [0, 0] - [0, 14]
      (stars [0, 0] - [0, 1])
      (item [0, 2] - [0, 14]))
    (ERROR [1, 2] - [1, 50]
      (name [1, 2] - [1, 7])
      (ERROR [1, 25] - [1, 26])
      (ERROR [1, 38] - [1, 39]))))
test.org	0 ms	(ERROR [1, 2] - [1, 50])

Once I remove the :, it works fine.

Seems like markup-redo branch takes care of that by checking if the right hand side of expression is a date, but I'm not sure what's the release plan for it.

error when expr appears at end of dynamic_block

#+BEGIN: clocktable :maxlevel 2 :emphasize nil :scope file
#+END: clocktable

See the example in 8.4.2 The clock table.
Parses to:

(document [0, 0] - [2, 0]
  body: (body [0, 0] - [2, 0]
    (dynamic_block [0, 0] - [2, 0]
      name: (expr [0, 9] - [0, 19])
      parameter: (expr [0, 20] - [0, 29])
      parameter: (expr [0, 30] - [0, 31])
      parameter: (expr [0, 32] - [0, 42])
      parameter: (expr [0, 43] - [0, 46])
      parameter: (expr [0, 47] - [0, 53])
      parameter: (expr [0, 54] - [0, 58])
      (ERROR [1, 7] - [1, 17]))))

Incomplete date range causes Error

A date range would be something like <2023-01-26 Thu>--<2023-01-27 Fri>. Having an incomplete date range like <2023-01-26 Thu>-- in some parts of the document leads to a parsing error, but not in others:

* headline
  <2023-01-26 Thu>--

leads to

subsection: (section) [1:1-3:0]
 headline: (headline) [1:1-2:0]
  stars: (stars) [1:1-1]
  item: (item) [1:3-10]
   (expr) [1:3-10]
 plan: (plan) [2:3-3:0]
  (entry) [2:3-18]
   timestamp: (timestamp) [2:3-18]
    date: (date) [2:4-13]
    day: (day) [2:15-17]
  (ERROR) [2:19-20]

but

* headline

  <2023-01-26 Thu>--

leads to

subsection: (section) [1:1-4:0]
 headline: (headline) [1:1-2:0]
  stars: (stars) [1:1-1]
  item: (item) [1:3-10]
   (expr) [1:3-10]
 body: (body) [2:1-4:0]
  (paragraph) [3:3-4:0]
   (expr) [3:3-13]
   (expr) [3:15-20]
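
For anyone who wants to flag such dangling ranges outside the parser, a rough pre-check could look like the sketch below. It is plain Python using regular expressions, not the grammar's own rules; the pattern only covers active `<...>` timestamps with an optional day name and time, and `classify` is a hypothetical helper.

```python
import re

# Simplified active timestamp: <YYYY-MM-DD>, optional day name, optional H:MM time.
TIMESTAMP = r"<\d{4}-\d{2}-\d{2}(?: [A-Za-z]+)?(?: \d{1,2}:\d{2})?>"
RANGE = re.compile(rf"{TIMESTAMP}--{TIMESTAMP}")
# A timestamp followed by "--" that is NOT followed by a second timestamp.
DANGLING = re.compile(rf"{TIMESTAMP}--(?!{TIMESTAMP})")


def classify(text):
    """Return 'range', 'incomplete range', 'timestamp' or 'text' for one line."""
    if DANGLING.search(text):
        return "incomplete range"
    if RANGE.search(text):
        return "range"
    if re.search(TIMESTAMP, text):
        return "timestamp"
    return "text"


for line in ["<2023-01-26 Thu>--<2023-01-27 Fri>", "  <2023-01-26 Thu>--", "<2020-06-07 Sun>"]:
    print(repr(line), "->", classify(line))
```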

Sexp diary entries support

I am not really sure but from the first look, this grammar does not implement parsing sexp diary entries.

Would you consider adding support for that (or at least accepting PR with this addition)?

Build fails with no such file or directory, uv_cwd

running npm -g install [email protected]:milisims/tree-sitter-org.git
I get

npm ERR! code 7
npm ERR! path /usr/local/lib/node_modules/tree-sitter-org
npm ERR! command failed
npm ERR! command sh -c prebuild-install || node-gyp rebuild
npm ERR! sh: line 1: prebuild-install: command not found
npm ERR! shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
npm ERR! gyp info it worked if it ends with ok
npm ERR! gyp info using [email protected]
npm ERR! gyp info using [email protected] | linux | x64
npm ERR! gyp ERR! UNCAUGHT EXCEPTION 
npm ERR! gyp ERR! stack Error: ENOENT: no such file or directory, uv_cwd
npm ERR! gyp ERR! stack     at process.wrappedCwd [as cwd] (node:internal/bootstrap/switches/does_own_process_state:126:28)
npm ERR! gyp ERR! stack     at setopts (/usr/lib64/node_modules/npm17/node_modules/glob/common.js:89:21)
npm ERR! gyp ERR! stack     at new Glob (/usr/lib64/node_modules/npm17/node_modules/glob/glob.js:132:3)
npm ERR! gyp ERR! stack     at Function.glob.hasMagic (/usr/lib64/node_modules/npm17/node_modules/glob/glob.js:98:11)
npm ERR! gyp ERR! stack     at rimraf (/usr/lib64/node_modules/npm17/node_modules/rimraf/rimraf.js:106:36)
npm ERR! gyp ERR! stack     at clean (/usr/lib64/node_modules/npm17/node_modules/node-gyp/lib/clean.js:11:3)
npm ERR! gyp ERR! stack     at Object.self.commands.<computed> [as clean] (/usr/lib64/node_modules/npm17/node_modules/node-gyp/lib/node-gyp.js:41:37)
npm ERR! gyp ERR! stack     at run (/usr/lib64/node_modules/npm17/node_modules/node-gyp/bin/node-gyp.js:80:30)
npm ERR! gyp ERR! stack     at processTicksAndRejections (node:internal/process/task_queues:78:11)
npm ERR! gyp ERR! System Linux 5.19.2-1-default
npm ERR! gyp ERR! command "/usr/bin/node17" "/usr/lib64/node_modules/npm17/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
npm ERR! node:internal/bootstrap/switches/does_own_process_state:126
npm ERR!     cachedCwd = rawMethods.cwd();
npm ERR!                            ^
npm ERR! 
npm ERR! Error: ENOENT: no such file or directory, uv_cwd
npm ERR!     at process.wrappedCwd [as cwd] (node:internal/bootstrap/switches/does_own_process_state:126:28)
npm ERR!     at errorMessage (/usr/lib64/node_modules/npm17/node_modules/node-gyp/bin/node-gyp.js:127:28)
npm ERR!     at issueMessage (/usr/lib64/node_modules/npm17/node_modules/node-gyp/bin/node-gyp.js:133:3)
npm ERR!     at process.<anonymous> (/usr/lib64/node_modules/npm17/node_modules/node-gyp/bin/node-gyp.js:117:3)
npm ERR!     at process.emit (node:events:527:28)
npm ERR!     at process._fatalException (node:internal/process/execution:167:25) {
npm ERR!   errno: -2,
npm ERR!   code: 'ENOENT',
npm ERR!   syscall: 'uv_cwd'
npm ERR! }
npm ERR! 
npm ERR! Node.js v17.7.1

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2022-09-13T15_59_20_845Z-debug-0.log

Plan not parsed when line starts with tab

Original issue reported here: nvim-orgmode/orgmode#362.

Org content:

* TODO Test
	 DEADLINE: <2022-09-10 Sat 15:21>

Expected:

(document [0, 0] - [2, 0]
  subsection: (section [0, 0] - [2, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 11]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 11])))
    plan: (plan [1, 2] - [2, 0]
      (entry [1, 2] - [1, 34]
        name: (entry_name [1, 2] - [1, 10])
        timestamp: (timestamp [1, 12] - [1, 34]
          date: (date [1, 13] - [1, 23])
          day: (day [1, 24] - [1, 27])
          time: (time [1, 28] - [1, 33]))))))

Got:

(document [0, 0] - [2, 0]
  subsection: (section [0, 0] - [2, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 11]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 11])))
    body: (body [1, 0] - [2, 0]
      (paragraph [1, 0] - [2, 0]
        (expr [1, 0] - [1, 1])
        (expr [1, 2] - [1, 11])
        (expr [1, 12] - [1, 23])
        (expr [1, 24] - [1, 27])
        (expr [1, 28] - [1, 34])))))

Fixed width areas elements

I have noticed that the grammar implemented here does not support "Fixed width areas" elements which basically are paragraphs that start with the colon character followed by space.

These elements are usually used to denote code evaluation output. For instance:

#+begin_src python
1 + 1
#+end_src

#+results: 
: 2

Is this a deliberate choice? It would be nice to have such support.
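
Until something like this exists in the grammar, the element is simple enough to detect on raw lines. The sketch below is plain Python and only an approximation: it assumes a fixed-width line is optional indentation, a colon, and then either a space or the end of the line. `fixed_width_areas` is a hypothetical name.

```python
import re

# Optional indentation, ":", then either end of line or a space plus the text.
FIXED_WIDTH = re.compile(r"^[ \t]*:(?: (.*))?$")


def fixed_width_areas(lines):
    """Group consecutive fixed-width lines into (first_index, last_index, texts)."""
    areas = []
    current = None
    for i, line in enumerate(lines):
        match = FIXED_WIDTH.match(line)
        if match:
            if current is None:
                current = [i, i, []]
            current[1] = i
            current[2].append(match.group(1) or "")
        elif current is not None:
            areas.append(tuple(current))
            current = None
    if current is not None:
        areas.append(tuple(current))
    return areas


doc = ["#+begin_src python", "1 + 1", "#+end_src", "", "#+results: ", ": 2"]
print(fixed_width_areas(doc))
```

Drawer lines such as `:LOGBOOK:` are not matched, because the colon is not followed by a space.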

Are subscript captures correct?

Never messed with the actual parsers myself so not entirely sure where to start with debugging this, however it seems that subscript items are matched inside of markup blocks, which I'm not entirely sure is correct. For example:

~TEST ITEM~
~TEST_ITEM~

results in the second item being parsed as a subscript. From what I've read, superscripts would be ^{...} and subscripts would be _{...}, right? The above ends up looking like the following (highlighting and concealing via Neovim, but test is as noted above):

Looks like just plain ^ and _ are supported as well, but they both require a space after them to be counted as a super/subscript, so the answer to this seems to just be "yes".

I'll probably just have to change the highlighting on my end so things look the same, although the capture groups are actually different. Will leave this open for a bit and close it otherwise, as I believe they are indeed correct.

Empty check boxes are treated as 3 separate expression unlike all other checkboxes

Hi 👋🏿 ,

I've been refactoring my plugin https://github.com/akinsho/org-bullets.nvim which is similar to the emacs bullets plugin. I recently switched to identifying nodes to conceal using treesitter (your parser) and that's been working really well except for this issue I just hit.

I've noticed that using [X] returns one expr node which I can #match/#eq against, whereas an empty check box, i.e. [ ], returns 3 expr nodes, one per character, so matching fails.

I'm guessing this is a bug or workaround of some sort but would be great to have the whole thing returned as a node so I can conceal it.

Checklist for 1.0.0

With the changes in the current markup-redo branch, I'd like to pause before integration (primarily understanding orgmode.nvim uses the main branch) to have some discussion about the work here, and compile a 1.0 checklist.

Changes to markup

To start, thoughts on changes in the markup-redo branch. The first big one is that all markup parsing is completely removed, and in place of processing text as hidden nodes, whitespace-delimited words are now parsed as (expr) nodes. An expression is parsed by first looking for ASCII symbols as hidden nodes, then letters as "str", then numbers as "num", and finally any remaining symbols as "sym".

The reasoning for this set up is first that anything parsed in the parser is unchangeable, so markup was difficult to customize, and if a user wanted to add another link style (such as markdown-style links), they were unable to do so. Now, they just need to write queries and no modification for the parser is required.

For example, a (*b1 cdef*) is parsed to (expr) (expr) (expr) and when looking at hidden nodes, (expr "str") (expr "(" "*" "str" "num") (expr "str" "*" ")"). Which we can query for in a variety of ways. Unfortunately, a major caveat for this is the fact that tree-sitter queries capture exactly one node (see tree-sitter/tree-sitter#1508), so capturing alone (without writing a directive or callback of some sort) will not be able to highlight markdown as is. Additionally, we need to account for nested expressions.
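
To make the breakdown above concrete, here is a small Python sketch that mimics it on plain strings. It is not the parser: the character classes are assumptions (ASCII punctuation each become their own node, runs of letters become "str", runs of digits become "num", anything else becomes "sym"), and `tokenize` is a hypothetical helper.

```python
import string
from itertools import groupby


def classify(ch):
    """Map one character to the kind of anonymous node it would belong to."""
    if ch in string.punctuation:
        return ch  # each ASCII symbol is its own node
    if ch.isalpha():
        return "str"
    if ch.isdigit():
        return "num"
    return "sym"


def tokenize_expr(word):
    """Return the anonymous node names for one whitespace-delimited word."""
    nodes = []
    for kind, run in groupby(word, key=classify):
        if kind in ("str", "num", "sym"):
            nodes.append(kind)  # a run of letters/digits collapses to one node
        else:
            nodes.extend(kind for _ in run)  # one node per punctuation character
    return nodes


def tokenize(text):
    """Split on whitespace, one list of anonymous nodes per (expr)."""
    return [tokenize_expr(word) for word in text.split()]


print(tokenize("a (*b1 cdef*)"))
```

A markup pass then only needs to look at the symbol nodes such as "*" and at their neighbours.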

The other major caveat is that anonymous nodes are unable to be anchored (tree-sitter/tree-sitter#1461), so we can't make the queries as precise as I would like right now, but single node queries can just use the #match? predicate so I'm not worried about that.

The positive side of this is that because the markup characters will be hidden nodes, they are easily queried for and whatever algorithm is applied will generally only need to look at very few nodes. Any language should be able to do that very quickly. Queries to find possible pairs are in the branch in the markup.scm file, and I'm working the kinks out of a lua implementation of this pseudocode:

def markup(nodes, is_start, validate):
    # assumes sorted list of nodes
    # all nodes should be from a single (paragraph), (itemtext), or (item)
    # is_start(node) and validate(end, start) are supplied by the caller

    seeking = []  # unmatched start markers as (type, node), in document order
    found = []
    for node in nodes:
        if is_start(node):
            seeking.append((node.type, node))
            continue
        types = [t for t, _ in seeking]
        if node.type in types:
            ix = types.index(node.type)
            stnode = seeking[ix][1]
            if validate(node, stnode):  # check the pre/post markup or whatever else, can be modified as desired
                # stop seeking the matched start and everything after it: *a /b*    Like this '/', when we just matched the * after the b
                seeking = seeking[:ix]
                # If we complete a verbatim-style markup, we need to purge everything interior of it
                found.append({'type': node.type, 'range': (stnode.start, node.end)})
    return found

The details of this algorithm will change depending on exactly which node is captured (The symbol vs (expr)) and whether or not we're using match predicate in advance. My example does not do that, but after writing my lua example I'm thinking it makes sense to do so.

Note how easily constructed the markup queries are via some generative code for customization.

Lastly, queries/algorithms are needed for latex fragments, subscript and superscript, and bracketed expressions for the *scripts.

Regex patterns

I still need to work through the list of changes I made, but one change I've made/am still making to the markup branch is I've tried replacing specific patterns as often as possible with some variant of (expr). I think using queries to determine if a proper plan name is used, for example, is much cleaner than throwing a parser error. And this way it's easier to change languages. I think there are limits though, and it might be nice to simply support different languages in the parser directly (for example, END, PROPERTIES, etc.). That could be read from a file, or simply left for others to re-compile to their own language.

Additionally, queries are highlighting, so allowing more things to be specified as queries nicely can be pretty cool. For example, I might only ever use TITLE and FILETAGS directives, so I could consider highlighting those nicely and any other patterns as an error.

A good example here is timestamp contents. Right now, I've hard coded possible regex expressions. Should that just be queriable? That could be nice if people want different formats. On the other hand, having fields and nodes directly in the tree is really nice, and I think in timestamps there are few enough formats that we can just support all of them in the parser.

Fields and nodes/aliases

When writing the parser I didn't add a lot of fields because it was constantly in so much flux. But they're useful even if they link directly to a named node with the same name, because access via a name can be a lot more convenient than named nodes, even if there is a small number of nodes.

I've added a large number of (name) nodes, and a few others, and many fields. I'll try to compile a list later today. When thinking about where I've added nodes, one of the things I was thinking about was incremental selection via nodes, I just want it to make sense. For example, tag -> taglist -> item -> headline -> section.

Versioning

These changes were a lot more than I'd like to do in a single commit or merge in the future, but since I was writing queries as I went, I kept finding small changes that make using the parser a lot more straightforward. I don't work on coding projects that are public really, so I don't think much about this, so this is kind of a "Yeah, why didn't you do that sooner?" section.

So AFTER 1.0.0, I'll be using consistent conventional commits in the future and an auto-updating semantic versioning system based on that. With major.minor.patch versioning: fix changes increment patch, feat changes increment minor, and any breaking change (appending a ! to the scope) will increment major versioning. I just want to make sure anything using the parser has a tag that they can link to so changes can be made to main without breaking dependent projects.
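
The bump rule described above is small enough to sketch. The following Python is only an illustration of that rule, not the hook script planned for this repository: `bump` is a hypothetical name, it only reads commit subject lines, and it ignores `BREAKING CHANGE:` footers.

```python
import re

# Conventional commit subject: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]*)\))?(?P<breaking>!)?: .+")


def bump(version, subjects):
    """Return the next major.minor.patch version for a list of commit subjects."""
    major, minor, patch = map(int, version.split("."))
    level = 0  # 0 = no release, 1 = patch, 2 = minor, 3 = major
    for subject in subjects:
        match = HEADER.match(subject)
        if match is None:
            continue  # not a conventional commit, ignore it
        if match["breaking"]:
            level = max(level, 3)
        elif match["type"] == "feat":
            level = max(level, 2)
        elif match["type"] == "fix":
            level = max(level, 1)
    if level == 3:
        return f"{major + 1}.0.0"
    if level == 2:
        return f"{major}.{minor + 1}.0"
    if level == 1:
        return f"{major}.{minor}.{patch + 1}"
    return version


print(bump("1.0.0", ["fix: table hr", "feat(list): expose bullet"]))
```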

Specific questions

Are there any missing/hidden aliases or fields that would be useful?
Are the names of aliases and fields sensible?
Should newlines be part of (body)? (Newlines before body in (section) and in (document))
So, should an empty section have a body, basically, or should a body exist only if there is an
element?
If we have text that is a paragraph followed by a footnote after a new line, is that parsed in
emacs as a footnote reference or a definition?
Should (_element) be available to listitems and in drawer contents, or should those just be
expressions? I'm under the impression that whether or not elements are in a list and drawer
are customizable options for orgmode, and I prefer to keep the parser simple as possible. So it
makes sense to me to inject an org parser in the itemtext if a user wants it.
Should (taglist) be a named node?

1.0.0 checklist

I want to expand upon this list (some items should be multiple), but really quickly:

Revisit all tests - A lot are out of date
~~[ ] Write tests for queries~~
Fix some regexes to better query for: :block: is the whole name, the goal would be ':', (name), ':'. (easier highlighting ':'s)
~~[ ] Cleanup table precedences. Yikes. (I don't care anymore)~~
~~[ ] Add semantic versioning git hook script~~
Revisit the readme build instructions
Revisit the npm dependencies (could really use some help here, no idea what I'm doing there)
Add plan entries
Add newlines to (contents) and (body)
Fix failing headline test

Thoughts? Anything else I'm missing?

Support for markup: `*bold*`, `/italic/`, `_underlined_`, `=verbatim=`, `~code~` and `+strike-through+`

Is it a non-goal of the project to support markup elements or would you be open to contributions implementing it?

i.e. *bold*, /italic/, _underlined_, =verbatim=, ~code~ and +strike-through+

Technically in org-mode the characters for each type of markup can be configured through the org-emphasis-alist variable [1], but it doesn't seem very common to radically alter it [2]. From a cursory glance, people seem to be mostly altering it to change colors.

Supporting markup would be pretty useful for syntax highlighting.

I'm curious what your take on it is before I start that effort (or maybe you already have)?

LaTeX support

Hi,

I got an issue about LaTeX support, and wanted to check with you what is actually supported? I don't have any LaTeX experience so I can hardly figure it out. From what I tested, only something like this works:

\begin{align}
2x - 5y &= 8 \\
3x + 9y &= -12
\end{align}

This is an example from here https://orgmode.org/worg/dev/org-syntax.html#LaTeX_Environments, but without the * in the align, since that causes an error in parsing.
Other examples seem to parse like a regular paragraph.

Fix issue with latest tree-sitter

Current Neovim nightly uses latest version of tree-sitter. Entering * causes freezing of Neovim, which I reported here nvim-treesitter/nvim-treesitter#3258. Creator of the PR that introduced using HEAD version of tree-sitter says that issue is most likely in scanner.cc file here, since tree-sitter introduced some changes.

Comment that mentions it: nvim-treesitter/nvim-treesitter#3258 (comment)
Tree-sitter change: tree-sitter/tree-sitter#1783

expose bullet of list item

Currently, the bullet of a list item is not exposed.

I want to handle adding an item to a numbered list. It would be much easier to do if I could get all the numbers from TS, but listitem is only the text part after the number.

Activity/Discussion on HackerNews posting

Hey @milisims ! I've been using your parser via @kristijanhusak's nvim-orgmode/orgmode and I really appreciate all the work you've done with this.

I posted a link to the project on HackerNews and it looks like there has been some interesting discussion and side-project/interests worth checking out.

Wanted to give you a poke somewhere to make sure you see the positive reception 👍

Publish to crates.io

Can you publish this to https://crates.io/ ?

Add tree-sitter-org to nvim-treesitter's list of supported languages?

It's not yet mentioned here:
https://github.com/nvim-treesitter/nvim-treesitter#supported-languages

Do you think it's ready to add it?

I couldn't test it yet, it's not so easy to install while still having a big .vimrc not compatible with neovim :/

Markup implementation

I got an issue reported on my repo that markup is not working in the headlines, which works fine in Emacs orgmode. That's currently an issue because Vim's syntax doesn't allow overlapping of two hl groups (Example below):

I tried adding a markup to the headline item here, but it caused some different issues which I'm not sure how to address. My current workaround is just to use Vim's syntax for markup, and let headline level highlights be handled through treesitter highlights, which makes it work ok:

Only downside is that it requires enabling Vim's syntax highlighting (additional_vim_regex_highlighting). What are your thoughts on this? How hard would it be to add it to headline item?

Paragraph starting with LaTeX inline math causes parsing to fail

Paragraph starting with inline math causes error in parsing.

org file content:

\(1 + 1\) foobar

Result:

(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 3])
  body: (body [0, 3] - [1, 0]
    (list [0, 3] - [1, 0]
      (listitem [0, 4] - [1, 0]
        bullet: (bullet [0, 4] - [0, 5])
        contents: (paragraph [0, 6] - [1, 0]
          (expr [0, 6] - [0, 9])
          (expr [0, 10] - [0, 16]))))))
test.org	0 ms	(ERROR [0, 0] - [0, 3])

Original issue: nvim-orgmode/orgmode#427

Not compiling via Neovim

I'm new to using org mode in Neovim (and Neovim itself...) so I could be doing something wrong but I'm getting the following error when trying to install the org treesitter parser:

Neovim version: 0.7.2

I simplified my init.vim down to just the Plug install command for treesitter:

call plug#begin('~/.config/nvim/plugged')
Plug 'nvim-treesitter/nvim-treesitter', { 'do': ':TSUpdate' }
call plug#end()

Command in Neovim: :TSInstall org

This project is awesome, thanks for working on it!

"Escaped" text fails to parse and reports an error

Hey,

I just got an issue report on the orgmode.nvim repo about a problem with parsing things that are "escaped". For example, with this content:

* TODO Test
  This \[is error]

I get this:

(document [0, 0] - [2, 0]
  (ERROR [0, 0] - [2, 0]
    (headline [0, 0] - [0, 11]
      (stars [0, 0] - [0, 1])
      (item [0, 2] - [0, 11]))))
test.org	0 ms	(ERROR [0, 0] - [2, 0])

The backslash before the [ causes the issue.

A similar thing happens with a dot, for example:

* TODO Test
  This \. Test

Output:

(document [0, 0] - [2, 0]
  (section [0, 0] - [2, 0]
    (headline [0, 0] - [0, 11]
      (stars [0, 0] - [0, 1])
      (item [0, 2] - [0, 11]))
    (body [1, 2] - [2, 0]
      (paragraph [1, 2] - [2, 0]
        (ERROR [1, 7] - [1, 9]
          (ERROR [1, 8] - [1, 9]))))))
test.org	0 ms	(ERROR [1, 7] - [1, 9])

The latter error seems more specific than the former. Is this because it looks like a regex?

Unexpected switch to drawer content

Hey,

I've got an issue where a user reported freezing. It is caused by my manual markup highlighter, but the large size of the file is also part of the problem.

I tried to narrow down what's the issue, and I noticed one thing.

Having this file:

* Headline 1
  Text 1
:LOGBOOK:
:LAST_REPEAT: [2023-10-11 Wed 13:46]
:END:
* Headline 2
  Text 2
:LOGBOOK:
:LAST_REPEAT: [2023-10-09 Mon 16:36]
:END:

Parses it correctly:

(document [0, 0] - [10, 0]
  subsection: (section [0, 0] - [5, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 12]
        (expr [0, 2] - [0, 10])
        (expr [0, 11] - [0, 12])))
    body: (body [1, 2] - [5, 0]
      (paragraph [1, 2] - [2, 0]
        (expr [1, 2] - [1, 6])
        (expr [1, 7] - [1, 8]))
      (drawer [2, 0] - [5, 0]
        name: (expr [2, 1] - [2, 8])
        contents: (contents [3, 0] - [4, 0]
          (expr [3, 0] - [3, 13])
          (expr [3, 14] - [3, 25])
          (expr [3, 26] - [3, 29])
          (expr [3, 30] - [3, 36])))))
  subsection: (section [5, 0] - [10, 0]
    headline: (headline [5, 0] - [6, 0]
      stars: (stars [5, 0] - [5, 1])
      item: (item [5, 2] - [5, 12]
        (expr [5, 2] - [5, 10])
        (expr [5, 11] - [5, 12])))
    body: (body [6, 2] - [10, 0]
      (paragraph [6, 2] - [7, 0]
        (expr [6, 2] - [6, 6])
        (expr [6, 7] - [6, 8]))
      (drawer [7, 0] - [10, 0]
        name: (expr [7, 1] - [7, 8])
        contents: (contents [8, 0] - [9, 0]
          (expr [8, 0] - [8, 13])
          (expr [8, 14] - [8, 25])
          (expr [8, 26] - [8, 29])
          (expr [8, 30] - [8, 36]))))))

But when I start adding a headline in between, it turns the first :END: and the last :END: into a drawer and treats everything in between as drawer content:

Updated file:

* Headline 1
  Text 1
:LOGBOOK:
:LAST_REPEAT: [2023-10-11 Wed 13:46]
:END:
*
* Headline 2
  Text 2
:LOGBOOK:
:LAST_REPEAT: [2023-10-09 Mon 16:36]
:END:

Parsed as:

(document [0, 0] - [11, 0]
  subsection: (section [0, 0] - [11, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 12]
        (expr [0, 2] - [0, 10])
        (expr [0, 11] - [0, 12])))
    body: (body [1, 2] - [11, 0]
      (paragraph [1, 2] - [4, 0]
        (expr [1, 2] - [1, 6])
        (expr [1, 7] - [1, 8])
        (expr [2, 0] - [2, 9])
        (expr [3, 0] - [3, 13])
        (expr [3, 14] - [3, 25])
        (expr [3, 26] - [3, 29])
        (expr [3, 30] - [3, 36]))
      (drawer [4, 0] - [11, 0]
        name: (expr [4, 1] - [4, 4])
        contents: (contents [5, 0] - [10, 0]
          (expr [5, 0] - [5, 1])
          (expr [6, 0] - [6, 1])
          (expr [6, 2] - [6, 10])
          (expr [6, 11] - [6, 12])
          (expr [7, 2] - [7, 6])
          (expr [7, 7] - [7, 8])
          (expr [8, 0] - [8, 9])
          (expr [9, 0] - [9, 13])
          (expr [9, 14] - [9, 25])
          (expr [9, 26] - [9, 29])
          (expr [9, 30] - [9, 36]))))))

Once I add a space after the *, it parses correctly.
This is more or less expected, but that one change causes a lot of nodes to be updated. In the reported issue this happens between two headlines, where it ends up putting ~500 lines into the drawer content for a split second and freezes the editor.

I can think of a few solutions here, but I was able to test only one myself:

1. Never allow :END: to be the start of a drawer
2. Consider asterisk(s) at the start of a line a valid node (expr) even if there is no space after them

For the second point, what I mean is this:

Content:

* TODO Test
  Content
*
* TODO Test
  Content

Parsed:

(document [0, 0] - [5, 0]
  (ERROR [0, 0] - [2, 1]
    subsection: (section [0, 0] - [2, 0]
      headline: (headline [0, 0] - [1, 0]
        stars: (stars [0, 0] - [0, 1])
        item: (item [0, 2] - [0, 11]
          (expr [0, 2] - [0, 6])
          (expr [0, 7] - [0, 11])))
      body: (body [1, 2] - [2, 0]
        (paragraph [1, 2] - [2, 0]
          (expr [1, 2] - [1, 9]))))
    (stars [2, 0] - [2, 1]))
  body: (body [2, 1] - [3, 0])
  subsection: (section [3, 0] - [5, 0]
    headline: (headline [3, 0] - [4, 0]
      stars: (stars [3, 0] - [3, 1])
      item: (item [3, 2] - [3, 11]
        (expr [3, 2] - [3, 6])
        (expr [3, 7] - [3, 11])))
    body: (body [4, 2] - [5, 0]
      (paragraph [4, 2] - [5, 0]
        (expr [4, 2] - [4, 9])))))

When something is added to that line that's not a space:

* TODO Test
  Content
*t
* TODO Test
  Content

(document [0, 0] - [5, 0]
  subsection: (section [0, 0] - [3, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 11]
        (expr [0, 2] - [0, 6])
        (expr [0, 7] - [0, 11])))
    body: (body [1, 2] - [3, 0]
      (paragraph [1, 2] - [3, 0]
        (expr [1, 2] - [1, 9])
        (expr [2, 0] - [2, 2]))))
  subsection: (section [3, 0] - [5, 0]
    headline: (headline [3, 0] - [4, 0]
      stars: (stars [3, 0] - [3, 1])
      item: (item [3, 2] - [3, 11]
        (expr [3, 2] - [3, 6])
        (expr [3, 7] - [3, 11])))
    body: (body [4, 2] - [5, 0]
      (paragraph [4, 2] - [5, 0]
        (expr [4, 2] - [4, 9])))))

This change in the parser does the trick for me:

diff --git a/src/scanner.c b/src/scanner.c
index f305612..37be99e 100644
--- a/src/scanner.c
+++ b/src/scanner.c
@@ -267,6 +267,10 @@ bool scan(Scanner *scanner, TSLexer *lexer, const bool *valid_symbols) {
             skip(lexer);
         }
 
+        if (lexer->lookahead == '\n') {
+          return false;
+        }
+
         if (valid_symbols[SECTIONEND] && iswspace(lexer->lookahead) &&
             stars > 0 && stars <= VEC_BACK(scanner->section_stack)) {
             VEC_POP(scanner->section_stack);

I think the 2nd option is simpler and probably more correct. Unless there is a space after the *, there's no need to treat it as a headline.
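
As a rough illustration of the two options (not the actual scanner code), the rules can be written as standalone line checks. This is a minimal sketch: the helper names headline_level and can_open_drawer are hypothetical, they work on plain C strings rather than on the TSLexer interface, and the drawer check assumes an exact, case-sensitive :END: match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (option 2), not part of src/scanner.c:
 * returns the number of leading stars if the line opens a headline,
 * or 0 if the stars are not followed by a space. */
static size_t headline_level(const char *line) {
    size_t stars = 0;
    while (line[stars] == '*') {
        stars++;
    }
    /* "*", "*\n" and "*t" fall through here and count as plain text. */
    if (stars > 0 && line[stars] == ' ') {
        return stars;
    }
    return 0;
}

/* Hypothetical helper (option 1), not part of src/scanner.c:
 * a line of the form ":NAME:" may open a drawer unless NAME is END.
 * Assumption: the name contains no spaces or inner colons. */
static bool can_open_drawer(const char *line) {
    size_t len = strcspn(line, "\n"); /* ignore a trailing newline */
    if (len < 3 || line[0] != ':' || line[len - 1] != ':') {
        return false;
    }
    if (len == 5 && strncmp(line, ":END:", 5) == 0) {
        return false; /* :END: only ever closes a drawer */
    }
    for (size_t i = 1; i + 1 < len; i++) {
        if (line[i] == ':' || line[i] == ' ') {
            return false;
        }
    }
    return true;
}
```

With these checks, a bare * line is treated as ordinary text, so the section above it stays open, and a stray :END: can never swallow the lines that follow it.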

Let me know what you think.

Links not recognized

Hello,
I'm using neovim with the orgmode plugin and treesitter. When I open the treesitter playground, it looks like links are not recognized; they just appear as "expr" inside a paragraph. Is it a problem on my machine, or is it not implemented?
If I posted in the wrong place, then I'm sorry, but it's kind of hard to understand which packages are used...

LaTeX errors out

Hey! So I've had a few problems with using \[ \] and $. When I do this:

\[
T(n) = T(n/2) + 2T(n/4) + n,
T(1) = 0
\]

it errors out, and so does this:

$$
# stuff
$$

Is there something wrong with my syntax?
