Giter VIP home page Giter VIP logo

unist's Introduction

unist

Universal Syntax Tree.


unist is a specification for syntax trees. It has a big ecosystem of utilities in JavaScript for working with these trees. It’s implemented by several other specifications.

This document may not be released. See releases for released documents. The latest released version is 3.0.0.

Contents

Intro

This document defines a general-purpose format for syntax trees. Development of unist started in July 2015. This specification is written in a Web IDL-like grammar.

Syntax tree

Syntax trees are representations of source code or even natural language. These trees are abstractions that make it possible to analyze, transform, and generate code.

Syntax trees come in two flavors:

  • concrete syntax trees: structures that represent every detail (such as white-space in white-space insensitive languages)
  • abstract syntax trees: structures that only represent details relating to the syntactic structure of code (such as ignoring whether a double or single quote was used in languages that support both, such as JavaScript).

This specification can express both abstract and concrete syntax trees.

Where this specification fits

unist is not intended to be self-sufficient. Instead, it is expected that other specifications implement unist and extend it to express language specific nodes. For example, see projects such as hast (for HTML), nlcst (for natural language), mdast (for Markdown), and xast (for XML).

unist relates to JSON in that compliant syntax trees can be expressed completely in JSON. However, unist is not limited to JSON and can be expressed in other data formats, such as XML.

unist relates to JavaScript in that it has a rich ecosystem of utilities for working with compliant syntax trees in JavaScript. The five most used utilities combined are downloaded thirty million times each month. However, unist is not limited to JavaScript and can be used in other programming languages.

unist relates to the unified, remark, rehype, and retext projects in that unist syntax trees are used throughout their ecosystems.

unist relates to the vfile project in that it accepts unist nodes for its message store, and that vfile can be a source file of a syntax tree.

Types

If you are using TypeScript, you can use the unist types by installing them with npm:

npm install @types/unist

Nodes

Syntactic units in unist syntax trees are called nodes, and implement the Node interface.

Node

interface Node {
  type: string
  data: Data?
  position: Position?
}

The type field is a non-empty string representing the variant of a node. This field can be used to determine the type a node implements.

The data field represents information from the ecosystem. The value of the data field implements the Data interface.

The position field represents the location of a node in a source document. The value of the position field implements the Position interface. The position field must not be present if a node is generated.

Specifications implementing unist are encouraged to define more fields. Ecosystems can define fields on Data.

Any value in unist must be expressible in JSON values: string, number, object, array, true, false, or null. This means that the syntax tree should be able to be converted to and from JSON and produce the same tree. For example, in JavaScript, a tree can be passed through JSON.parse(JSON.stringify(tree)) and result in the same tree.

Position

interface Position {
  start: Point
  end: Point
}

Position represents the location of a node in a source file.

The start field of Position represents the place of the first character of the parsed source region. The end field of Position represents the place of the first character after the parsed source region, whether it exists or not. The value of the start and end fields implement the Point interface.

If the syntactic unit represented by a node is not present in the source file at the time of parsing, the node is said to be generated and it must not have positional information.

For example, if the following value was represented as unist:

alpha
bravo

…the first word (alpha) would start at line 1, column 1, offset 0, and end at line 1, column 6, offset 5. The line feed would start at line 1, column 6, offset 5, and end at line 2, column 1, offset 6. The last word (bravo) would start at line 2, column 1, offset 6, and end at line 2, column 6, offset 11.

Point

interface Point {
  line: number >= 1
  column: number >= 1
  offset: number >= 0?
}

Point represents one place in a source file.

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

The term character means a (UTF-16) code unit which is defined in the Web IDL specification.

Data

interface Data { }

Data represents information associated by the ecosystem with the node.

This space is guaranteed to never be specified by unist or specifications implementing unist.

Parent

interface Parent <: Node {
  children: [Node]
}

Nodes containing other nodes (said to be children) extend the abstract interface Parent (Node).

The children field is a list representing the children of a node.

Literal

interface Literal <: Node {
  value: any
}

Nodes containing a value extend the abstract interface Literal (Node).

The value field can contain any value.

Glossary

Tree

A tree is a node and all of its descendants (if any).

Child

Node X is child of node Y, if Y’s children include X.

Parent

Node X is parent of node Y, if Y is a child of X.

Index

The index of a child is its number of preceding siblings, or 0 if it has none.

Sibling

Node X is a sibling of node Y, if X and Y have the same parent (if any).

The previous sibling of a child is its sibling at its index minus 1.

The next sibling of a child is its sibling at its index plus 1.

Root

The root of a node is itself, if without parent, or the root of its parent.

The root of a tree is any node in that tree without parent.

Descendant

Node X is descendant of node Y, if X is a child of Y, or if X is a child of node Z that is a descendant of Y.

An inclusive descendant is a node or one of its descendants.

Ancestor

Node X is an ancestor of node Y, if Y is a descendant of X.

An inclusive ancestor is a node or one of its ancestors.

Head

The head of a node is its first child (if any).

Tail

The tail of a node is its last child (if any).

Leaf

A leaf is a node with no children.

Branch

A branch is a node with one or more children.

Generated

A node is generated if it does not have positional information.

Type

The type of a node is the value of its type field.

Positional information

The positional information of a node is the value of its position field.

File

A file is a source document that represents the original file that was parsed to produce the syntax tree. Positional information represents the place of a node in this file. Files are provided by the host environment and not defined by unist.

For example, see projects such as vfile.

Preorder

In preorder (NLR) is depth-first tree traversal that performs the following steps for each node N:

  1. N: visit N itself
  2. L: traverse head (then its next sibling, recursively moving forward until reaching tail)
  3. R: traverse tail
Postorder

In postorder (LRN) is depth-first tree traversal that performs the following steps for each node N:

  1. L: traverse head (then its next sibling, recursively moving forward until reaching tail)
  2. R: traverse tail
  3. N: visit N itself
Enter

Enter is a step right before other steps performed on a given node N when traversing a tree.

For example, when performing preorder traversal, enter is the first step taken, right before visiting N itself.

Exit

Exit is a step right after other steps performed on a given node N when traversing a tree.

For example, when performing preorder traversal, exit is the last step taken, right after traversing the tail of N.

Tree traversal

Tree traversal is a common task when working with a tree to search it. Tree traversal is typically either breadth-first or depth-first.

In the following examples, we’ll work with this tree:

graph TD
    A-->B-->C
        B-->D
        B-->E
    A-->F-->G
Breadth-first traversal

Breadth-first traversal is visiting a node and all its siblings to broaden the search at that level, before traversing children.

For the syntax tree defined in the diagram, a breadth-first traversal first searches the root of the tree (A), then its children (B and F), then their children (C, D, E, and G).

Depth-first traversal

Alternatively, and more commonly, depth-first traversal is used. The search is first deepened, by traversing children, before traversing siblings.

For the syntax tree defined in the diagram, a depth-first traversal first searches the root of the tree (A), then one of its children (B or F), then their children (C, D, and E, or G).

For a given node N with children, a depth-first traversal performs three steps, simplified to only binary trees (every node has head and tail, but no other children):

  • N: visit N itself
  • L: traverse head
  • R: traverse tail

These steps can be done in any order, but for non-binary trees, L and R occur together. If L is done before R, the traversal is called left-to-right traversal, otherwise it is called right-to-left traversal. In the case of non-binary trees, the other children between head and tail are processed in that order as well, so for left-to-right traversal, first head is traversed (L), then its next sibling is traversed, etcetera, until finally tail (R) is traversed.

Because L and R occur together for non-binary trees, we can produce four types of orders: NLR, NRL, LRN, RLN.

NLR and LRN (the two left-to-right traversal options) are most commonly used and respectively named preorder and postorder.

For the syntax tree defined in the diagram, preorder and postorder traversal thus first search the root of the tree (A), then its head (B), then its children from left-to-right (C, D, and then E). After all descendants of B are traversed, its next sibling (F) is traversed and then finally its only child (G).

Utilities

Utilities are functions that work with nodes.

There are several projects that deal with nodes from specifications implementing unist:

List of utilities

References

Contribute

See contributing.md in syntax-tree/.github for ways to get started. See support.md for ways to get help.

A curated list of awesome syntax-tree, unist, hast, xast, mdast, and nlcst resources can be found in awesome syntax-tree.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.

Acknowledgments

The initial release of this project was authored by @wooorm.

Special thanks to @eush77 for their work, ideas, and incredibly valuable feedback! Thanks to @anandthakker, @anko, @arobase-che, @azu, @BarryThePenguin, @ben-eb, @blahah, @blakeembrey, @brainkim, @ChristianMurphy, @davidtheclark, @denysdovhan, @derhuerst, @dozoisch, @fazouane-marouane, @gibson042, @hrajchert, @ikatyang, @inklesspen, @izumin5210, @jasonLaster, @JDvorak, @jlevy, @justjake, @kmck, @kt3k, @KyleAMathews, @luca3m, @mattdesl, @muraken720, @mrzmmr, @nwtn, @rhysd, @Rokt33r, @Sarah-Seo, @sethvincent, @shawnbot, @simov, @staltz, @TitanSnow, @tmcw, and @vhf, for contributing to unist and related projects!

License

CC-BY-4.0 © Titus Wormer

unist's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

unist's Issues

adast - asciidoc syntax tree

Came across unist & remark via mdast, and I was wondering what would be involved in creating an entirely new flavour of syntax tree within this family?

More concretely, I'm interested in creating adast, an asciidoc syntax tree, and while the input format is rather different from markdown, there's potentially a huge chunk of the implementation that could potentially be same or similar to that of mdast, as the output is going to be quite similar to it.

Thoughts?

Unist vs Loyc trees

How would you say the purpose of Unist is different from Loyc trees?

The primary use of Loyc trees is to represent the core elements of source code - identifiers, literals, and "calls" - and from looking at the readme, it doesn't look like Unist has the same goals, but I'm not quite sure what its goals are.

Add `enter` and `exit` terms to glossary

Subject of the feature

Other that the different ways of traversing a tree, as raised in GH-22 and resolved in GH-23, the terms enter and exit are also often used when discussing tree traversal. These should be added to the glossary as well.

Problem

Some form of state is often mutated when entering or exiting a node by unist utilities. Describing these terms here means they can be linked to from other docs to clarify them in a single place.

Expected behaviour

Both should be added to the glossary.

Alternatives

The alternatives (status quo) are shown in problem above.

What's the unit of character in Point

In Point section, it's mentions:

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

What's the unit of 'character' and 'column'? Is it UTF-16 code unit (used in JavaScript) or Unicode code point? See Wikipedia:

[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units

I tried using remark to parse this markdown piece:

a𠮷b

Here, 𠮷 is one Unicode code point that can not be encoded into one UTF-16 code unit. In JavaScript, because String uses UTF-16, so:

'a𠮷b'.length
//=> 4

But in other languages like Python:

len('a𠮷b')
#=> 3

As for remark, the above markdown piece is parsed into:

{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}

The column of end is 5, while the offset of end is 4, that means remark treat this text four 'chars' long, measured in UTF16 code units.

So what's the unit of character? It's so confused.

Error at C:/web/node_modules/@types/unist/index.d.ts:92:58: ';' expected.

Initial checklist

Affected packages and versions

@types/unist & 2.0.6

Link to runnable example

No response

Steps to reproduce

angular - 4.3.6
node- 10.15.0
react-markdown- 4.0.3
with above version if we try to build angular app, we are getting below error.
MicrosoftTeams-image (7)

Expected behavior

it should build the app.

Actual behavior

due to @types/unist package issue we wont be able to build angular app.

Runtime

Other (please specify in steps to reproduce)

Package manager

yarn v1

OS

Windows, Linux

Build and bundle tools

Webpack

Specify `indent`

node.position.indent isn’t used a lot, but it could be specced better.

Currently, it’s a list of integers, if a node spans multiple lines, where each value refers to the column a line (node.position.start.line + index).
This only supports the start of a line (which is useful in markdown). But not the end of a line.
It’s also awkward to access, as there’s no explicit line access.

It could make sense for indent to be an Array.<{start: Point, end: Point}>.

Would unist make a good programming language AST format?

Hi again @wooorm (and other AST enthusiasts),

I develop an experimental programming language called eslisp, which is basically a JavaScript syntax optimised for code-modifying macros that let users add language features. It might be helpful to think of it as a programming language processing tool.

Eslisp's current AST representation contains exactly the same information as Unist, right down to location data, but currently organised differently. I was writing my own tools for reading and modifying it, then realised I'm basically duplicating Unist utilities.

I thought I'd open a dialogue before I start "hammering on screws" and making it a dependency. Have you considered programming languages as a Unist use-case? Is this a sane thing to be doing, long-term?

No utility for creating a selector from a root and a leaf node.

@wooorm is there a utility to build a selector if you pass it a root and a leaf node?

e.g. buildSelector(root, node); // returns 'html > body > div:nth-child(1) > h1'

I'm creating a tree component that displays the AST and when you click on a node, it selects the selected nodes of the AST. and I'm wondering how to create the selector needed for the select util.

Empty children arrays

This is a question and a suggestion regarding this part of Unist readme (emphasis mine):

Unist nodes:

  • may have either a value property set to a string or a children property set to an array of one or more Unist nodes;

I read it as “Unist nodes may have a children property, in which case it is guaranteed that its length is ≥1”. If this is correct, then it follows that both retext, mdast, and hast violate this specification by producing trees with empty children arrays:

> retext.parse('')
{ type: 'RootNode', children: [] }
> mdast.parse('')
{ type: 'root',
  children: [],
  position: { start: { line: 1, column: 1 }, end: { line: 1, column: 1 } } }
> mdast.parse('#')
{ type: 'root',
  children: [ { type: 'heading', depth: 1, children: [], position: [Object] } ],
  position: { start: { line: 1, column: 1 }, end: { line: 1, column: 2 } } }
> hast.parse('')
{ type: 'root', children: [] }

If I haven't missed anything then I guess this requirement should be relaxed to include empty children (or removed if it doesn't require anything) or, alternatively, retext, mdast, and hast should be fixed to never output nodes with empty children arrays. The latter seems more problematic (the obvious workaround is returning null on empty input but I feel that it's better for parsers to always output a valid syntax tree) so I opened the issue here.

Add tree traversal methods to glossary

Subject of the feature

There are several ways to traverse a tree, typically preorder, postorder, but also inorder and breadth-irst / level-order. See also WikiPedia.

Problem

Projects working with unist typically do this, but either a) don’t document how they do it, or b) document this themselves. Both lead to lacking, incorrect, or incomplete docs.

Expected behaviour

These (most common) types should be documented in the glossary. Maybe with a diagram.

Alternatives

The alternatives (status quo) are shown in problem above.

Rename abstract `text` interface to `literal`

The abstract text interface (nodes with a value) interferes with the type: "text" node provided by hast and mdast (and the type: "TextNode" provided by nlcst).

Another downside is that text implies (and specifies) string values on the value field.
Say unist was used for programming values, the value of value could be specified as number, for example.

I’m open to other names, but I’m searching for something close to “raw”, “leaf”, and whatnot.

/CC @ChristianMurphy What do you think?

Use proper naming for position/location

Currently, location and position is used interchangeably, while they differ.
This confusion derives from node.position, which holds a location, and a location has start and end set to a position.

I propose:

  • position for node.position (aka “location”, “positional info”)
  • point for node.position.start, node.position.end (as it refers to a point in a file)

More strict notion of JSON-stringifiability for all properties

Currently the only requirement for data property is stringifiability which is defined as follows:

Its only limitation being that each property should by stringifyable: not throw when passed to JSON.stringify().

I think this is too broad, in particular because JSON.stringify won't usually throw:

> JSON.stringify({ foo: function () { console.log('foo') }})
'{}'
> JSON.stringify({ foo: undefined })
'{}'

I was worried about this when hacking on unist-builder-blueprint: there is no reliable way to compile functions to source code (with closures and stuff like heap references), so a more strict guarantee like "data and JSON.parse(JSON.stringify(data)) should be equivalent and interchangeable" or even deepEqual(JSON.parse(JSON.stringify(data)), data) would be helpful.

Using hast instead of mdast to describe markdown documents

I like a lot the Unist ecosystem and its AST-oriented approach.

However, I do not understand the motivation behind creating two widely different ASTs for markdown and HTML documents. I would expect HAST to include MDAST (as shown below) and therefore an HTML parser to produce an AST that a markdown compiler would understand, without requiring a transformation step (e.g., with rehype-remark). Conversely, a markdown parser should be able to produce an AST that an HTML compiler would understand.

In other words, would it make sense to build a rehype-stringify-markdown and a rehype-parse-markdown and ignore the MDAST? Is there something that would prevent that?


For example, this MDAST node:

{
  type: "paragraph",
  children: [{
    type: "text",
    value: "Hello!"
  }]
}

…contains at most as much information as this HAST node:

{
  type: "element",
  tagName: "p",
  properties: {},
  children: [{
    type: "text",
    value: "Hello!"
  }]
}

`line`, `column`, and `offset` in `point` underspecified

According to docs:

The end field of Position represents the place of the first character after the parsed source region.

If the last parsed character is a newline, does end have a column of 0 and a line of current line + 1? If we are at the end of the source, does the end position represent an imaginary character after the end of the document?

What is a `column` or `offset`, exactly?

Initial checklist

Problem

…or are they supposed to be vague and should we reflect that?

For ASCII characters, such info is pretty clear.
But once you get to other unicode characters, taking emoji as a well-known
example, it gets complex.
And, unist is designed for other programming languages too, practically
now with Rust, we may need to choose how to represent this across languages.

There are two main ways that positional info like this is used:

  1. to access the source string: if there is something appearing from
    1000 to 1002, a user will want to do doc.slice(1000, 1002)
    to access that thing.
    In Rust, 1000 and 1002, in an example of &doc[1000..1002],
    will yield a different result
  2. to point a code editor to such a thing, for warnings and such, so that
    the squigly lines, or “jump to” are correct

To make the first easy and fast, it makes a lot of sense to use the positions
that are based on how the host language stores strings.
But that means markdown-rs and micromark will yield different results.
Quickly checking VS Code, injecting 👨‍👩‍👧‍👦 into the document, seems to increment
by 7, which equals [...'👨‍👩‍👧‍👦'].length.
So that’s different from what markdown-rs and micromark use.
We can also be vague about this here in the spec, and replace the section added
in 49032b9 to reflect that.
Or we could adhere to that?

Solution

  • a) Make vague
  • b) make everything consistent

Alternatives

?

is it possible to insert a node or remove a node in ast tree?

Subject of the feature

insert a node or remove a node in ast tree

Problem

could not find some method for Node to modify its text, its position.

Expected behaviour

add some api for Node

Alternatives

jquery can modify the dom ele positions and its attrs

A swiss army knife for Unist-util-*.

swiss army knife

Just like lodash, I need an object bound every unist-util-* methods.
It might be a bad practice for the production code. But, for testing code, I think it would be fine.

Also, I need others for hast-util-* and mdast-util-* too...

What do you think?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.