tree-sitter / tree-sitter-graph

Construct graphs from parsed source code

Home Page: https://docs.rs/tree-sitter-graph/*/tree_sitter_graph/

License: Apache License 2.0

Rust 100.00%

tree-sitter-graph's Introduction

tree-sitter

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

General enough to parse any programming language
Fast enough to parse on every keystroke in a text editor
Robust enough to provide useful results even in the presence of syntax errors
Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application

tree-sitter-graph's Issues

C API

Hi. I see that the whole project is written in Rust and so I wanted to ask if there are any plans to add a C API in the future?

Check for duplicate nodes

(x) @x {
  node @x.y
  node @x.y
}

The above is a trivial example of a duplicate node error that we'd catch at runtime, but which we could instead catch statically by a simple analysis approximating the dynamic semantics w.r.t. node assignments. More interesting examples—guarded by interesting patterns, for example—are likely the only ones that we care about in practice (the above doesn't seem to happen very often and is easy enough to discover and resolve when it does), and are pretty much all dependent on exhaustiveness checking (#64).

cf #64 (exhaustiveness checks)

--global rejects empty string values

As of #78, we can pass --global var=value arguments on the CLI to specify globals. However, the parser rejects var= as a value, which should correctly specify a value of the empty string. Either way, there is currently no way to supply an empty-valued string global.
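
A minimal sketch of the parsing behaviour this issue asks for, assuming a hypothetical parse_global helper (this is not the crate's actual CLI code). It splits at the first = and treats an empty right-hand side as the empty string:

```rust
/// Splits a `NAME=VALUE` argument at the first `=`.
/// Hypothetical helper for illustration, not the crate's real CLI parser.
fn parse_global(arg: &str) -> Result<(String, String), String> {
    match arg.split_once('=') {
        // An empty name is rejected, but an empty value is a valid string.
        Some((name, value)) if !name.is_empty() => {
            Ok((name.to_string(), value.to_string()))
        }
        Some(_) => Err(format!("missing global name in {:?}", arg)),
        None => Err(format!("expected NAME=VALUE, got {:?}", arg)),
    }
}
```

With this shape, var= yields the pair ("var", ""), and a value that itself contains = (such as a=b=c) is kept intact after the first split.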

Doc link dead

Hi, I am very interested in this project, however, the doc link seems dead:
https://docs.rs/tree-sitter-graph/

`Globals::nested` should take a non-mutable reference

Globals are not actually mutated, but Globals::nested currently requires a mutable reference anyway. This leads to confusing APIs.
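
To illustrate why a shared reference is enough, here is a sketch using a hypothetical Scope type as a stand-in for Globals (the names and fields are assumptions, not the crate's real definitions). Lookups only read from the parent, so nested can borrow it immutably:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a globals table; not the crate's real type.
struct Scope<'a> {
    parent: Option<&'a Scope<'a>>,
    values: HashMap<String, String>,
}

impl<'a> Scope<'a> {
    fn new() -> Self {
        Scope { parent: None, values: HashMap::new() }
    }

    /// Borrows the parent immutably, so several children can coexist.
    fn nested(parent: &'a Scope<'a>) -> Self {
        Scope { parent: Some(parent), values: HashMap::new() }
    }

    fn add(&mut self, name: &str, value: &str) {
        self.values.insert(name.to_string(), value.to_string());
    }

    /// Looks in this scope first, then walks up the parent chain.
    fn get(&self, name: &str) -> Option<&str> {
        match self.values.get(name) {
            Some(v) => Some(v.as_str()),
            None => self.parent.and_then(|p| p.get(name)),
        }
    }
}
```

With a shared borrow, two nested scopes can be created from the same parent at once, which a mutable borrow would forbid.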

Incorrect Behavior of Quantification Operators

It seems that the quantification operator behaves incorrectly for the following query:

(function_definition parameters:(parameters (identifier)* @name) @parameters)@func

My expectation is that @name would contain a list of identifiers but it only seems to contain the first identifier. My test is done using python with the following source

def start(a, b): x=a

The query correctly grabs each identifier without the quantification operator. I've also done a similar query using (function_definition body:(block (expression_statement)* @expr_stmts)@block)@func which behaves as expected. Since the query matches the function parameters correctly, I assume this is an issue with the graph library. I was using version 0.7 for this test.

Can't distinguish integers, graph node IDs, and syntax node IDs in JSON output format

Right now we just spit out numbers for all three.
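
One possible shape for a fix, sketched under the assumption that each value is wrapped in a small tagged object. The Value enum and the "type"/"value" field names below are illustrative only, not the crate's actual output format:

```rust
/// The three kinds of value that are all printed as bare numbers today.
enum Value {
    Integer(u32),
    GraphNode(u32),
    SyntaxNode(u32),
}

impl Value {
    /// Renders a small tagged JSON object. The field names are illustrative only.
    fn to_json(&self) -> String {
        let (kind, n) = match self {
            Value::Integer(n) => ("integer", n),
            Value::GraphNode(n) => ("graphNode", n),
            Value::SyntaxNode(n) => ("syntaxNode", n),
        };
        format!("{{\"type\":\"{}\",\"value\":{}}}", kind, n)
    }
}
```

A consumer can then dispatch on the tag instead of guessing what a bare number means.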

JSON output format is unversioned

We ought to know better <slaps wrist>

Batch processing to JSON output

It'd be nice to process multiple source files at a time, and output them all to e.g. JSON.

Probably depends on #69.

Indicate node type in duplicate variable errors involving syntax nodes

I've got an error in my rules indicating a duplicate variable @block.after_scope where @block is (of course) a syntax node. It would be very helpful to know what kind of syntax tree node it is, and where it occurs in the source, just like we show for undefined variables.

Trying to understand more about tsg-files and the orchestration between the projects

Hello!

I recently came into contact with the world of stack-graph and everything related. I really like how easy tree-sitter is to use, and I think I've begun to understand the connection between tree-sitter, tree-sitter-graph and stack-graph.
I read that the python-dsl bindings for tree-sitter-graph haven't been released yet, so I thought of creating a very simplistic version to use for myself in the meantime. What I'm trying to build is a simple goto functionality combining the three libs -- my first iteration doesn't need to be perfect.
I have some questions I would love to have answered.

I began looking into creating tsg-files; the only documentation I found on how to write them was in src/reference/mod.rs. My question would be: Can you combine scm-queries with tsg-queries? E.g., is there a way to use scm-queries as the first part to set up identifiers and then use tsg-queries on some of the scm results but not all of them? I'm not sure if that functionality would help or not, or if I just need to understand the DSLs better.
My current thought on the workflow is:
1. Use the tree-sitter docs to help create scm-queries that find all relevant scopes + all function calls and imports in a file.
2. Further expand on the scm-queries from 1., turning each scm-query into a tsg-query. Tag each scope with the parent scope + current scope, and find the source and sink of imports and function calls.
3. Use the output of tsg in stack-graph to create a graph, using the example-code in stack-graphs/tests/it/test_graphs/cyclic_imports_python.rs as aid. Make sure to build up the graph using the scopes obtained in 2.
4. Use stack-graph partial-path functionality to find the path from caller-source to call-location

Does that seem to be correct way of thinking about how to combine the tools?

Can a tsg-file consist of a "tagging-step" to give attributes to each mapping, followed by a matching-step where the attributes that were set are consumed? E.g. tag each found scope with its parent scope + current-scope, then afterwards only use the ones matched against function-calls.

Sorry if the questions have been answered before or if they don't make sense, just trying to wrap my head around everything.
I love the work you're doing, it's a really interesting project!

Outline/jump to symbol support in VS Code plugin

It'd be nice to have the VS Code plugin let you use the outline UI & ⌘R jump-to-symbol UI automatically, populating both with the defined rules.

Improve errors & warnings

Currently we only have errors, and we always return on the first error. This is annoying: every error requires an additional run to discover the next one. Most errors are local to a stanza, and we could easily discover multiple errors in one run.

Improvements for error handling:

Collect errors (at least during checking) and return all of them, instead of only the first.
Introduce warnings and return those even for successful parses.
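
A rough sketch of the first improvement, assuming stanzas can be checked independently. CheckError and check_all are hypothetical names for illustration, and the per-stanza check is passed in as a closure rather than being the crate's real checker:

```rust
/// Hypothetical check result for one stanza.
#[derive(Debug, PartialEq)]
struct CheckError {
    stanza: usize,
    message: String,
}

/// Runs `check` on every stanza and gathers all failures
/// instead of stopping at the first one.
fn check_all<F>(stanzas: &[&str], check: F) -> Result<(), Vec<CheckError>>
where
    F: Fn(&str) -> Result<(), String>,
{
    let errors: Vec<CheckError> = stanzas
        .iter()
        .enumerate()
        .filter_map(|(i, s)| {
            check(*s).err().map(|message| CheckError { stanza: i, message })
        })
        .collect();
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}
```

Warnings could be collected the same way into a second list that is returned even when the error list is empty.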

Show source for rule file locations

When we show an error indicating a given location in a rule file, I'd like to see an excerpt of the source around that location. One such should be shown for each rule file location involved in the error (tho maybe we should filter for duplicates).

I don't know if we should also show an excerpt for non-error locations, e.g. "Executing edge … -> … at (x, y)" entries. Perhaps if they're the last one before a source file (syntax node) location?

More docs and demo

Hi,
it is a good project.

However, can you provide more docs and demo?

Thank you!

DOT support

This would be handy for debugging .tsg rules.

Initially suggested by @p0.
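
For a sense of how small a first version could be, here is a sketch that renders a graph as Graphviz DOT text. It assumes nodes are identified by index and carry a single label, which is a simplification of the real graph model, and to_dot is a hypothetical helper:

```rust
/// Renders nodes and directed edges as Graphviz DOT text.
/// Node labels are assumed not to contain double quotes.
fn to_dot(labels: &[&str], edges: &[(usize, usize)]) -> String {
    let mut out = String::from("digraph tsg {\n");
    for (id, label) in labels.iter().enumerate() {
        out.push_str(&format!("  n{} [label=\"{}\"];\n", id, label));
    }
    for (source, sink) in edges {
        out.push_str(&format!("  n{} -> n{};\n", source, sink));
    }
    out.push_str("}\n");
    out
}
```

The resulting text can be piped to the Graphviz dot tool to get an image of the graph.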

Panic with invalid .tsg

The following .tsg input parses, but can lead to a panic at run time:

(const_item)
(function_item)
{}

To reproduce:

git clone git@github.com:jorendorff/tree-sitter-rust.git
cd tree-sitter-rust
git checkout jorendorff/internal-error
script/setup
script/test

Backtrace:

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /home/jorendorff/.cargo/registry/src/github.com-1ecc6299db9ec823/tree-sitter-graph-0.5.0/src/lazy_execution.rs:62:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
   2: core::panicking::panic_bounds_check
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:75:5
   3: tree_sitter_graph::execution::<impl tree_sitter_graph::ast::File>::execute_into
   4: tree_sitter_stack_graphs::StackGraphLanguage::build_stack_graph_into
   5: tree_sitter_stack_graphs::test::Command::run_test_with_context
   6: tree_sitter_stack_graphs::test::Command::run
   7: tree_sitter_stack_graphs::main

Graph schemas

I'd like to be able to provide some sort of schema for the kinds of nodes I'm constructing within a graph, and where they can be found in relation to the syntax tree.

(This would give us a nice way to describe stack graph or eval rules for example, in a way that's machine-checkable.)

Talked with @dcreager about this, and he suggested testing against such a schema as a midway point on the way to actually verifying it.

In addition to verification/validation, this could also be used to give us constructors for subgraphs, which could be quite convenient when building stack graphs for example.

Query parsing failures are reported relative to the query, not the file

Steps

Write a .tsg file with a query referencing node symbols that don't exist in the target language, e.g.:

; push the query down by a line so line numbers don't coincide accidentally
(symbol_name_which_does_not_exist) @this
{}

Run tree-sitter-graph with this file and some suitable source file, e.g. a Python file.

Expected

I expected it to report that symbol_name_which_does_not_exist is an invalid node type on line 2.

Actual

It reports that symbol_name_which_does_not_exist is an invalid node type on line 1.

Notes

This becomes more obvious if you split a query over multiple lines, e.g. (valid\n(invalid)).
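
The translation needed is small. A sketch, assuming the position where the query starts in the .tsg file is known and that positions are zero-based (row, column) pairs; to_file_position is a hypothetical helper:

```rust
/// Zero-based (row, column) position.
type Position = (usize, usize);

/// Translates a position reported relative to a query's first character
/// into a position within the enclosing .tsg file.
fn to_file_position(query_start: Position, relative: Position) -> Position {
    let (start_row, start_col) = query_start;
    let (row, col) = relative;
    if row == 0 {
        // Still on the query's first line, so the starting column carries over.
        (start_row, start_col + col)
    } else {
        // Later lines start at column 0 of the file.
        (start_row + row, col)
    }
}
```

For the file above, the query starts on the second line, so an error on the query's first line maps back to the second line of the file.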

Errors are imprecise as to syntax node types

This Python file:

import common

def explode_twice():
  explode()
  raise "explode again"

triggers this error:

Caused by:
    0: Executing for child in @children { ... } at (68, 3)
    1: Undefined variable child.node on node [syntax node expression_statement (4, 3)]

because the .tsg file in question doesn't have a rule to handle the function call. (cf #64 for catching this earlier.)

The error describes this as an expression_statement, which is true, but it's more precisely a call, as we can see using tree-sitter parse:


❱ tree-sitter parse semantic-analysis/a.py
(module [0, 0] - [5, 0]
  (import_statement [0, 0] - [0, 13]
    name: (dotted_name [0, 7] - [0, 13]
      (identifier [0, 7] - [0, 13])))
  (function_definition [2, 0] - [4, 23]
    name: (identifier [2, 4] - [2, 17])
    parameters: (parameters [2, 17] - [2, 19])
    body: (block [3, 2] - [4, 23]
      (expression_statement [3, 2] - [3, 11]
        (call [3, 2] - [3, 11]
          function: (identifier [3, 2] - [3, 9])
          arguments: (argument_list [3, 9] - [3, 11])))
      (raise_statement [4, 2] - [4, 23]
        (string [4, 8] - [4, 23])))))

Having to run tree-sitter parse to check which syntax node type I haven't handled is a bit annoying; it'd be nicer if tree-sitter-graph would tell me we're talking about a call rather than an expression_statement, particularly since the two nodes have the same range.
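
One way to pick the more precise type, sketched over a minimal stand-in Node struct rather than the real tree-sitter node API: keep descending while a node has exactly one child covering the same range.

```rust
/// Start and end as (row, column) pairs.
type Range = ((usize, usize), (usize, usize));

/// Minimal stand-in for a syntax node: a kind, a range, and children.
struct Node {
    kind: &'static str,
    range: Range,
    children: Vec<Node>,
}

/// Follows only-children that cover exactly the same range and returns
/// the kind of the innermost one.
fn most_specific_kind(node: &Node) -> &'static str {
    let mut current = node;
    while current.children.len() == 1 && current.children[0].range == current.range {
        current = &current.children[0];
    }
    current.kind
}
```

Applied to the parse above, the expression_statement at [3, 2] - [3, 11] would be reported as a call, while the raise_statement is left alone because its string child covers a smaller range.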

Should the JSON CLI flag take a path to output to instead of always using stdout?

Right now we always output to stdout and require users to redirect or pipe in their shell. It might be convenient to give an output path and use --json - or --json /dev/stdout to replicate the current behaviour.

Allow duplicate edges

Propagators have a neat design whereby each node's value is either unset, set to a specific value, or conflicting. You can set a value by propagating something to it; if two values are propagated to it, and they are equal, then it remains unchanged with that value; and if the values differ, it's in conflict—an error.

I'd like to see something similar for tree-sitter-graph edges: setting the same edge is harmless, and thus accepted. Setting the same attr with the same value on that edge is also harmless, and thus accepted. Distinct values for the attr are an error.

(I'd actually like to see something like this for nodes and their attrs as well, but nodes have identity, which makes things more annoying.)
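
The propagator rule described above can be sketched as a three-state cell. Cell and propagate are illustrative names, not part of tree-sitter-graph:

```rust
/// A propagator-style cell: nothing yet, one agreed value, or a conflict.
#[derive(Debug, Clone, PartialEq)]
enum Cell<T> {
    Unset,
    Set(T),
    Conflict,
}

impl<T: PartialEq> Cell<T> {
    /// Merges a newly propagated value into the cell.
    fn propagate(self, value: T) -> Cell<T> {
        match self {
            Cell::Unset => Cell::Set(value),
            // The same value again is harmless and leaves the cell unchanged.
            Cell::Set(old) if old == value => Cell::Set(old),
            // A different value puts the cell in conflict, which is an error.
            Cell::Set(_) => Cell::Conflict,
            Cell::Conflict => Cell::Conflict,
        }
    }
}
```

An edge attribute stored in such a cell would accept repeated identical assignments and only fail when two rules disagree.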

REPL

I'd like to be able to launch a REPL session where I select a target file, write queries/rules, and interactively see:

parse trees
what gets matched by a query
what graph gets produced by rules
what rules are applied
etc

Undeclared global warnings don't indicate source location

Using an undeclared global warns that this is deprecated, but there's no indication in the warning message as to where it's being used.

"Scoped variable" is a misnomer

In the old DSL, scoped variables were available within the lexical scope of the node they were attached to—its entire subtree.

This is no longer the case, leading me to believe that the name is a misnomer, and that we should try to come up with something more indicative of what they mean.

Show source excerpts for source file locations

Errors involving syntax nodes reference locations in the source file. We should show excerpts from the source file, just like #83 discusses for rules files.

See also #82 re: ensuring we include those locations in more errors, and #81 re: tracking multiple locations for duplicates specifically.

Track attr definition sites for duplicate attribute errors

When I define an attribute that turns out to be a duplicate, it can take substantial effort to track down where the original assignment was, especially if other kinds of nodes use the same attribute names. It'd be nice if tree-sitter-graph were to keep track of where (in the rules file's source) attributes were created so they could be reported for debugging.

Hypothetically we could compute this statically given the grammar (or at least an overapproximation to it), but I think we'd be better off doing it dynamically at least for the time being. If it's too slow to do all the time, maybe we could turn it on with a flag.
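
A sketch of the dynamic approach, assuming each attribute assignment carries its rules-file location. The Attributes type here is hypothetical and much simpler than the crate's real attribute storage:

```rust
use std::collections::HashMap;

/// (line, column) in the rules file.
type Location = (usize, usize);

/// Attribute table that remembers where each attribute was first set.
#[derive(Default)]
struct Attributes {
    values: HashMap<String, (String, Location)>,
}

impl Attributes {
    /// Adds an attribute, or reports both the new and the original site on a duplicate.
    fn add(&mut self, name: &str, value: &str, at: Location) -> Result<(), String> {
        if let Some((_, first)) = self.values.get(name) {
            return Err(format!(
                "duplicate attribute {} at {:?}, first set at {:?}",
                name, at, first
            ));
        }
        self.values.insert(name.to_string(), (value.to_string(), at));
        Ok(())
    }
}
```

The extra cost is one location per attribute, which may be cheap enough to leave on by default.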

Bad error message (and possible execution bug?) with typo

The .tsg below has two bugs:

Typo on line 3 (@ehre should be @here)
Use of @path.context on line 10 which is never defined on that syntax node

[
  (struct_item name: (_) @here)
  (enum_item name: (_) @ehre) ;; oops, typo
] {
  let @here.context = "%pattern"
}

(scoped_type_identifier) @path {
  node s
  attr (s) symbol = @path.context ;; oops, not defined
}

I ran it on this test case.

pub enum A {
    M(std::string::String),
}

Expected behavior: Either some error message for bug 1, or the usual message for bug 2 (Undefined scoped variable [syntax node scoped_identifier (2, 7)].context)

Actual behavior:

$ ./bin/tree-sitter-stack-graphs test --grammar . --tsg ./bug.tsg ./testcase.rs
Error: Error running test ./testcase.rs

Caused by:
    0: Executing attr ((load 1)) symbol = (scoped [syntax node scoped_type_identifier (2, 7)] 'context) at (10, 3)
    1: Expected a syntax node  got #null

This seems like a bug to me because I can't think of any legit way for #null to get into the computation at line 10.

To reproduce:

git clone git@github.com:jorendorff/tree-sitter-rust.git
cd tree-sitter-rust
git checkout jorendorff/internal-error
script/setup
./bin/tree-sitter-stack-graphs test --grammar . --tsg ./bug.tsg ./testcase.rs

Incorrect (?) "scoped variables can only be attached to syntax nodes" error for edges to undefined variables

Steps

Execute this rule:

(_) @this
{
  node @this.a
  var a = @this.a
  edge a -> a.b
}

Expected

I expected it to fail with an "undefined variable" message, similarly to what happens if you comment out the var a = @this.a line.

Actual

It fails with a "scoped variables can only be attached to syntax nodes" message.

Questions

Is it possible that it's attempting to construct a.b on reference?

Predefined + function

We mention this in the docs, it's simple, it's useful, we should include it here.

Support user-defined functions

TSG does not support the definition of functions in the TSG source. As a result, common patterns are often written out in multiple places, making sources harder to read and maintain.

It should be possible to define functions in the source. Something like:

function (foo node names*) {
    for name in names {
        // ...
    }
}

Questions and challenges:

How do we deal with localness of arguments inside a function?
- We can start with requiring all arguments to be local, and try to relax later (e.g. by inferring from the use in the function body).
Should functions have return values?
- This could be useful for e.g. a function introducing a module name, which may introduce multiple nodes, and the final node is returned.
Are functions treated the same as builtin functions, i.e. called as (foo ...), and can they appear in the same places?
- I'd prefer if expressions were pure, and these functions certainly are not. Alternatively we clearly separate them using e.g. an apply foo ... statement. However, it would get a bit awkward if they return values: would we have two forms, let var = expr and apply var = foo ... or something?

Include source context in errors

tree-sitter-graph errors (for when you don't have a rule to cover some piece of syntax) conveniently show you the node type and range in errors:

Error: Cannot execute TSG file ./semantic-analysis/python.tsg

Caused by:
    0: Executing for child in @children { ... } at (68, 3)
    1: Undefined variable child.node on node [syntax node expression_statement (3, 5)]

It would be neat to extend that with a bit of the source, at least the first line or so (and maybe the preceding and following one or two for context?).
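
A sketch of what producing such an excerpt could look like, assuming the source text and a zero-based row are at hand. The excerpt function and its output layout are illustrative, not an existing API:

```rust
/// Returns the zero-based line `row` of `source` plus up to `context` lines
/// on either side, each prefixed with its one-based line number.
/// The line at `row` is marked with `>`.
fn excerpt(source: &str, row: usize, context: usize) -> String {
    let first = row.saturating_sub(context);
    let last = row + context;
    source
        .lines()
        .enumerate()
        .filter(|(i, _)| *i >= first && *i <= last)
        .map(|(i, line)| {
            let marker = if i == row { ">" } else { " " };
            format!("{} {:>3} | {}\n", marker, i + 1, line)
        })
        .collect()
}
```

Calling it with a context of 1 gives the offending line plus one line before and after, which is usually enough to recognise the spot.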

Include source file path in JSON output

There's currently no way to communicate the path we parsed and processed to consumers of JSON, but that's pretty important for use cases processing multiple files (among others).

Add support for automatic attributes for top-level query match and variable scope

Being able to track the matched syntax node for a query, and the scope of a scoped variable, would be very helpful in debugging. This is a sketch for how we can implement this. Hat tip to @BekaValentine for the idea!

I think this is a very good idea! I certainly prefer the node definitions to be grouped by syntactic category (e.g., statements) in one common rule, instead of having them specified in separate rules for each individual statement form. And I think we have the information to add the top-level capture and variable scope as debug info.

Let me explain how debug information ends up in the visualization. Propagation of debug information is set up in such a way that any attribute that starts with debug_ is stored in the node's debug_info in the stack graph, and automatically gets added to the visualization of that node.

Some of these attributes can be added automatically for every node by TSG: the TSG location and TSG variable name. This happens if attribute names are set in ExecutionConfig. (These are set in TSSG code.)

So adding support for these would mean adding, say, variable_scope_attr and match_attr to ExecutionConfig, and setting names for those in TSSG where TSG is called. Setting values for those attributes:

For match_attr it can probably be done directly in the code for CreateGraphNode (strict and lazy). Getting the toplevel match is already supported (strict and lazy).
For variable_scope_attr it requires a little more work. In the strict case, the node will be eagerly computed, so a method on Variable could be called on self.node in the CreateGraphNode code. In the lazy case, the scope will be computed much later when the lazy values are forced. I'd have to look at the code in some more detail to see if we can piggyback on existing debug info that is passed around.
A new release of TSG and bumping the dependency in TSSG should do the trick then.

Perhaps doing the match_attr for the top-level match is a good first step that is already very helpful. (After all, for a rule such as [ (stmt1) (stmt2) (stmt3) ]@stmt { node @stmt.before_scope } the match and the variable scope are the same anyway.)

CMD usage

Hi, guys!
I want to know how to convert an AST file (generated by tree-sitter) to a graph file for visualization.
I see that the tree-sitter-graph command needs a tsg file and a source file, but I do not know what a tsg file is or how to write one.
Can you provide a concrete example? Ideally the output format would be DOT.

Allow duplicate attrs

Like #85, but for attributes: using something like propagators, allow redundant attributes &c. if and only if their values are compatible (for most purposes we can think of this as "equal-valued").

Supply globals from the CLI

I'm working on a .tsg program which is eventually going to be run from some specialized process or other which may or may not already exist. In the meantime, however, I'd like to be able to iterate on it and test it against real sources without having to do that development effort, and enjoy the benefits of tree-sitter-graph's other features (JSON output, error presentation, etc.).

I can do this already, but only up to a point: globals which the eventual host process may provide are of course not available on the CLI—but they could be!

I propose adding a --global NAME=VALUE option which can be supplied zero or more times, taking a global variable name and the value to give it. We'd of course need to do something about types to distinguish (for example) ints from strings, which might become intricate given composites like lists (and lists of lists, and…).

Compile error

When I build the project,

$ cargo install --features cli tree-sitter-graph
error[E0658]: unions with non-`Copy` fields are unstable
   --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/smallvec-1.8.0/src/lib.rs:406:1
    |
406 | / union SmallVecData<A: Array> {
407 | |     inline: core::mem::ManuallyDrop<MaybeUninit<A>>,
408 | |     heap: (*mut A::Item, usize),
409 | | }
    | |_^
    |
    = note: see issue #55149 <https://github.com/rust-lang/rust/issues/55149> for more information

error: aborting due to previous error

How can I fix this?

Thanks!

"Scoped variables can only be attached to syntax nodes" error doesn't indicate variable name

I'm seeing an error where an edge, apparently, is triggering a "scoped variables can only be attached to syntax nodes" error:

    0: Executing scan filepath { ... } at (15, 3)
    1: Matching x with arm "([^/]+)$" { ... }
    2: Executing edge module_def -> module_def.dot at (42, 7)
    3: Scoped variables can only be attached to syntax nodes got [graph node 0]

I'd love to be able to confirm that the edge is in fact responsible. One way would be for the error message to indicate which variable it's talking about. This would also just generally be useful, since it would help you pick up the context for fixing the issue just that little bit faster.

Pass globals of non-string types on CLI

#78 allows us to provide --global var=value arguments on the CLI. This is limited to passing strings, however. It would be nice to support other, non-string globals.

One possibility would be type-specific flags like --global-int. I'm not sure how we'd want to handle collections.

Remove support for undeclared globals

Undeclared globals currently lead to a warning, but not an error.
Support for that should be removed, and using undeclared globals should result in a hard error during checking.

Default values for global variables

Global variables sometimes have sensible default values. It should be possible to provide those in the TSG.

I suggest the following way of doing that:

global VAR = VALUE

The = VALUE part is optional, and globals without it must always be provided. The value is only used if the global variable is not set by the caller.
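
The lookup order this implies can be sketched as follows, assuming string-valued globals only. resolve_globals is a hypothetical helper and the declarations are passed as plain (name, default) pairs rather than parsed from a TSG file:

```rust
use std::collections::HashMap;

/// Resolves each declared global: a caller-provided value wins, then the
/// declared default; a global with neither is an error.
fn resolve_globals(
    declared: &[(&str, Option<&str>)],
    provided: &HashMap<String, String>,
) -> Result<HashMap<String, String>, String> {
    let mut resolved = HashMap::new();
    for (name, default) in declared {
        let value = match (provided.get(*name), default) {
            (Some(v), _) => v.clone(),
            (None, Some(d)) => d.to_string(),
            (None, None) => return Err(format!("missing global variable {}", name)),
        };
        resolved.insert(name.to_string(), value);
    }
    Ok(resolved)
}
```

The default is consulted only after the caller-provided map, which matches the rule that it is used only when the caller does not set the variable.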

Support includes

A simple include mechanism would be a first step towards introducing some reuse and better organization for large TSG files.

Something like:

include "../lib/common"

This would be a pragmatic solution. Obviously, includes are not a great mechanism, but setting up a proper module system, imports, scoping of identifiers, qualified names, etc is not something we have time for at the moment. Supporting includes on the other hand can be relatively easy to implement, I expect.

Exhaustiveness checks

When writing a tsg file I often want to know if some or all of the portions of the rule are exhaustive, i.e. whether we cover all of the alternatives of a given sum. Otherwise, I can only discover this by manual checking, or by running against a variety of sources and seeing whether it dies or not.

The necessary information is available from the node-types.json file, and would therefore allow us to provide warnings either as part of the usual parsing pass or via a separate linter pass. Ideally we'd be able to provide them for the whole file, or for specific parts of the file, presumably via some sort of annotation on rules.
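
The core of such a check is a set difference. A sketch, assuming the subtypes of a supertype have already been read from node-types.json (parsing that file is omitted) and that the handled node types have been collected from the rules; unhandled is a hypothetical helper:

```rust
use std::collections::BTreeSet;

/// Returns the subtypes of a supertype that no rule handles, in sorted order.
/// `subtypes` would come from the grammar's node-types.json.
fn unhandled<'a>(subtypes: &[&'a str], handled: &[&str]) -> Vec<&'a str> {
    let handled: BTreeSet<&str> = handled.iter().copied().collect();
    let all: BTreeSet<&'a str> = subtypes.iter().copied().collect();
    all.into_iter().filter(|kind| !handled.contains(*kind)).collect()
}
```

Each entry in the result would become one warning, either for the whole file or for an annotated rule.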

Example TSG files for languages?

I tried piecing together a python tsg file from https://docs.rs/tree-sitter-graph/0.5.1/tree_sitter_graph/reference/ but I get errors:

Caused by:
    0: Invalid query pattern: [0, "string", 0, #true, @id]
        ^
    1: Query error at 7:2. Invalid syntax:
       [0, "string", 0, #true, @id]

Is there a full tsg file available for any languages to make it a bit easier to get started testing out functionality?

Namespacing

I think we're going to want to be able to run the union of multiple .tsg programs at a time, and to do that successfully and not have them clobber each other I think their variable and attr names will need to be namespaced.

I also think we're going to want to be able to explicitly use another namespace to e.g. extend one tsg program with another.

how to understand from where a function call is made?

I am using TreeSitter to parse Python code. I need to determine which file a called function comes from.

For example, I need to work out that check_files_in_directory comes from GPT4Readability.utils. I have already captured all the function calls.

But now I have to find out which file check_files_in_directory comes from, and I am struggling to work out the logic for doing that. Can anyone please suggest an approach?

import os
from getpass import getpass
from GPT4Readability.utils import *
import importlib.resources as pkg_resources  


def generate_readme(root_dir, output_name, model):
    """Generates a README.md file based on the python files in the provided directory

    Args:
        root_dir (str): The root directory of the python package to parse and generate a readme for
    """

    # prompt_folder_name = os.path.join(os.path.dirname(__file__), "prompts")
    # prompt_path = os.path.join(prompt_folder_name, "readme_prompt.txt")

    with pkg_resources.open_text('GPT4Readability.prompts', 'readme_prompt.txt') as f:
        inb_msg = f.read()

    # with open(prompt_path) as f:
    #     lines = f.readlines()
    # inb_msg = "".join(lines)

    file_check_result = check_files_in_directory(root_dir)

what do the graphs look like?

Show where the rule matched for dynamic errors

I've got a duplicate edge error, which is discovered dynamically when running against some source file. How can this edge already exist? What node is it actually talking about? Who knows!

I'd like the error to include:

Where the match in the source file began, along with an excerpt of (at least) the first line of that source (#84).
Where the bound nodes were within the source, e.g. by picking a terminal-usable colour for each variable, drawing the variable's name in that, and drawing the corresponding span of the excerpt in that as well, with child nodes taking precedence over parent nodes colour-wise.
What variables exist on the various nodes. (Maybe just the ones we've explicitly bound, although a REPL (#77) might allow one to get more.)

Add string construction function

We have scan to take a string apart, but we cannot construct strings at the moment (except for literals). Some kind of string-format or string-join function should be added.
