tree-sitter-graph's Introduction

tree-sitter

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

  • General enough to parse any programming language
  • Fast enough to parse on every keystroke in a text editor
  • Robust enough to provide useful results even in the presence of syntax errors
  • Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application

tree-sitter-graph's People

Contributors

bekavalentine, dcreager, hendrikvanantwerpen, rewinfrey, robrix, tausbn

tree-sitter-graph's Issues

C API

Hi. I see that the whole project is written in Rust and so I wanted to ask if there are any plans to add a C API in the future?

Check for duplicate nodes

(x) @x {
  node @x.y
  node @x.y
}

The above is a trivial example of a duplicate node error that we'd catch at runtime, but which we could instead catch statically by a simple analysis approximating the dynamic semantics w.r.t. node assignments. More interesting examples—guarded by interesting patterns, for example—are likely the only ones that we care about in practice (the above doesn't seem to happen very often and is easy enough to discover and resolve when it does), and are pretty much all dependent on exhaustiveness checking (#64).

cf #64 (exhaustiveness checks)
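
A minimal sketch of the static check described above, assuming a stanza body has already been flattened into (keyword, target) pairs. The `find_duplicate_nodes` name and the pair shape are hypothetical and not part of tree-sitter-graph. The sketch ignores conditionals and loops, which is where exhaustiveness information would be needed.

```python
from collections import Counter

def find_duplicate_nodes(statements):
    """Return scoped variables created by more than one `node` statement.

    `statements` is a hypothetical, simplified stanza body: a list of
    (keyword, target) pairs such as ("node", "@x.y").
    """
    counts = Counter(target for keyword, target in statements if keyword == "node")
    return sorted(target for target, n in counts.items() if n > 1)

# The stanza from this issue, plus one unrelated node.
stanza = [("node", "@x.y"), ("node", "@x.y"), ("node", "@x.z")]
print(find_duplicate_nodes(stanza))
```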

--global rejects empty string values

As of #78, we can pass --global var=value arguments on the CLI to specify globals. However, the parser rejects var= as a value, which should correctly specify a value of the empty string. Either way, there is currently no way to supply an empty-valued string global.
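
As an illustration of the intended behaviour, here is a minimal sketch of NAME=VALUE parsing that accepts an empty value. The `parse_global` helper is hypothetical and written in Python for brevity; it is not the CLI's actual parser.

```python
def parse_global(arg):
    """Split a NAME=VALUE argument at the first '='; an empty VALUE is allowed."""
    name, sep, value = arg.partition("=")
    if not sep or not name:
        # No '=' at all, or nothing before it: not a valid assignment.
        raise ValueError(f"expected NAME=VALUE, got {arg!r}")
    return name, value

print(parse_global("var="))
```

Splitting at the first `=` only means values that themselves contain `=` survive unchanged.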

Incorrect Behavior of Quantification Operators

It seems that the quantification operator behaves incorrectly for the following query:

(function_definition parameters:(parameters (identifier)* @name) @parameters)@func

My expectation is that @name would contain a list of identifiers, but it only seems to contain the first identifier. I tested this using Python with the following source:

def start(a, b): x=a

The query correctly grabs each identifier without the quantification operator. I've also done a similar query using (function_definition body:(block (expression_statement)* @expr_stmts)@block)@func which behaves as expected. Since the query matches the function parameters correctly, I assume this is an issue with the graph library. I was using version 0.7 for this test.

Trying to understand more about tsg-files and the orchestration between the projects

Hello!

I recently came into contact with the world of stack-graph and everything related. I really like how easy tree-sitter is to use, and I think I've begun to understand the connection between tree-sitter, tree-sitter-graph and stack-graph.
I read that the python-dsl bindings for tree-sitter-graph haven't been released yet, so I thought of creating a very simplistic version to use for myself in the meantime. What I'm trying to build is a simple goto functionality combining the three libs -- my first iteration doesn't need to be perfect.
I have some questions I would love to have answered.

  1. I began looking into creating tsg-files; the only documentation I found on how to write them was in src/reference/mod.rs. My question would be: Can you combine scm-queries with tsg-queries? E.g., is there a way to use scm-queries as the first part to set up identifiers and then use tsg-queries on some of the scm but not all of them? I'm not sure if that functionality would help or not, or if I just need to understand the DSLs better.

  2. My current thought on the workflow is:

    1. Use tree-sitter docs to aid creating scm-queries that find all relevant scopes + all function calls and imports in a file.
    2. Further expand on the scm-queries from 1., modifying each scm-query into a tsg-query. Tag each scope with the parent scope + current scope, and find the source and sink of imports and function calls.
    3. Use the output of tsg in stack-graph to create a graph, using the example-code in stack-graphs/tests/it/test_graphs/cyclic_imports_python.rs as aid. Make sure to build up the graph using the scopes obtained in 2.
    4. Use stack-graph partial-path functionality to find the path from caller-source to call-location.

Does that seem to be the correct way of thinking about how to combine the tools?

  3. Can a tsg-file consist of a "tagging-step" to give attributes to each mapping, followed by a matching-step where the attributes set are consumed? E.g. tag each found scope with its parent scope + current-scope, then afterwards only use the ones matched against function-calls.

Sorry if the questions have been answered before or if they don't make sense; I'm just trying to wrap my head around everything.
I love the work you're doing, it's a really interesting project!

Improve errors & warnings

Currently we only have errors, and we always return on the first error. This is annoying: every error requires an additional run to discover the next one. Most errors are local to a stanza, and we could easily discover multiple errors in one run.

Improvements for error handling:

  • Collect errors (at least during checking) and return all of them, instead of only the first.
  • Introduce warnings and return those even for successful parses.
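
The two improvements above could look roughly like the following sketch. It assumes checking can be done one stanza at a time; `Diagnostics`, `check_file` and the `check_stanza` callback are hypothetical names for illustration, not existing tree-sitter-graph APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Diagnostics:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

def check_file(stanzas, check_stanza):
    """Check every stanza and collect all diagnostics instead of stopping early.

    `check_stanza` is a hypothetical callback returning (errors, warnings)
    for a single stanza.
    """
    result = Diagnostics()
    for stanza in stanzas:
        errors, warnings = check_stanza(stanza)
        result.errors.extend(errors)
        result.warnings.extend(warnings)
    return result

def toy_check(stanza):
    # Toy rule for illustration only: flag an unknown capture and empty bodies.
    errors = [f"undefined capture in {stanza!r}"] if "@missing" in stanza else []
    warnings = [f"empty stanza {stanza!r}"] if stanza.endswith("{}") else []
    return errors, warnings

report = check_file(
    ["(a) @a {}", "(b) @b { node @missing.x }", "(c) @c { node @missing.y }"],
    toy_check,
)
print(len(report.errors), len(report.warnings))
```

Warnings are returned alongside errors, so a caller can still print them when the error list is empty.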

Show source for rule file locations

When we show an error indicating a given location in a rule file, I'd like to see an excerpt of the source around that location. One such should be shown for each rule file location involved in the error (tho maybe we should filter for duplicates).

I don't know if we should also show an excerpt for non-error locations, e.g. "Executing edge … -> … at (x, y)" entries. Perhaps if they're the last one before a source file (syntax node) location?
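
A minimal sketch of what such an excerpt could look like, assuming locations are 1-based (line, column) pairs. The `excerpt` helper and its layout are hypothetical, not an existing feature.

```python
def excerpt(source, line, column, context=1):
    """Render the lines around a 1-based (line, column) location with a caret."""
    lines = source.splitlines()
    first = max(1, line - context)
    last = min(len(lines), line + context)
    width = len(str(last))  # pad line numbers to the widest one shown
    out = []
    for number in range(first, last + 1):
        out.append(f"{number:>{width}} | {lines[number - 1]}")
        if number == line:
            # Marker line: blank gutter, then a caret under the column.
            out.append(" " * width + " | " + " " * (column - 1) + "^")
    return "\n".join(out)

rules = "(x) @x {\n  node @x.y\n  node @x.y\n}"
print(excerpt(rules, 3, 3))
```

Filtering duplicates would then amount to rendering each distinct (file, line) pair once.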

More docs and demo

Hi,
It is a good project.

However, could you provide more docs and a demo?

Thank you!

DOT support

This would be handy for debugging .tsg rules.

Initially suggested by @p0.
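
To make the request concrete, here is a small sketch that renders a graph as Graphviz DOT. The input shape (a list of nodes, each with an `id`, an `attrs` mapping and a list of `edges` naming a `sink`) is an assumption made for this example and may not match the tool's real JSON output; `to_dot` is a hypothetical helper.

```python
def quote(text):
    """Quote a string for DOT by escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def to_dot(nodes):
    """Render nodes of the assumed shape
    {"id": int, "attrs": {...}, "edges": [{"sink": int}]} as a DOT digraph."""
    lines = ["digraph tsg {"]
    for node in nodes:
        attrs = node.get("attrs", {})
        label = ", ".join(f"{k}={v}" for k, v in sorted(attrs.items()))
        # Fall back to the node id when there are no attributes to show.
        lines.append(f"  n{node['id']} [label={quote(label or str(node['id']))}];")
        for edge in node.get("edges", []):
            lines.append(f"  n{node['id']} -> n{edge['sink']};")
    lines.append("}")
    return "\n".join(lines)

graph = [
    {"id": 0, "attrs": {"kind": "scope"}, "edges": [{"sink": 1}]},
    {"id": 1, "attrs": {}, "edges": []},
]
print(to_dot(graph))
```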

Panic with invalid .tsg

The following .tsg input parses, but can lead to a panic at run time:

(const_item)
(function_item)
{}

To reproduce:

git clone [email protected]:jorendorff/tree-sitter-rust.git
cd tree-sitter-rust
git checkout jorendorff/internal-error
script/setup
script/test

Backtrace:

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /home/jorendorff/.cargo/registry/src/github.com-1ecc6299db9ec823/tree-sitter-graph-0.5.0/src/lazy_execution.rs:62:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
   2: core::panicking::panic_bounds_check
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:75:5
   3: tree_sitter_graph::execution::<impl tree_sitter_graph::ast::File>::execute_into
   4: tree_sitter_stack_graphs::StackGraphLanguage::build_stack_graph_into
   5: tree_sitter_stack_graphs::test::Command::run_test_with_context
   6: tree_sitter_stack_graphs::test::Command::run
   7: tree_sitter_stack_graphs::main

Graph schemas

I'd like to be able to provide some sort of schema for the kinds of nodes I'm constructing within a graph, and where they can be found in relation to the syntax tree.

(This would give us a nice way to describe stack graph or eval rules for example, in a way that's machine-checkable.)

Talked with @dcreager about this, and he suggested testing against such a schema as a midway point toward actually verifying.

In addition to verification/validation, this could also be used to give us constructors for subgraphs, which could be quite convenient when building stack graphs for example.

Query parsing failures are reported relative to the query, not the file

Steps

  1. Write a .tsg file with a query referencing node symbols that don't exist in the target language, e.g.:

    ; push the query down by a line so line numbers don't coincide accidentally
    (symbol_name_which_does_not_exist) @this
    {}
    
  2. Run tree-sitter-graph with this file and some suitable source file, e.g. a Python file.

Expected

I expected it to report that symbol_name_which_does_not_exist is an invalid node type on line 2.

Actual

It reports that symbol_name_which_does_not_exist is an invalid node type on line 1.

Notes

This becomes more obvious if you split a query over multiple lines, e.g. (valid\n(invalid)).
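
The fix presumably comes down to offsetting the query-relative location by the query's start position in the file. A sketch of that arithmetic, assuming 0-based (row, column) pairs; `to_file_location` is a hypothetical helper:

```python
def to_file_location(query_start, error_in_query):
    """Translate a location relative to a query into one relative to the file.

    Both arguments are 0-based (row, column) pairs.
    """
    start_row, start_col = query_start
    row, col = error_in_query
    if row == 0:
        # Same line as the query start: columns are offset too.
        return (start_row, start_col + col)
    # Later lines of a multi-line query: only the row shifts.
    return (start_row + row, col)

# The query starts on the second line of the file (row 1, column 0).
print(to_file_location((1, 0), (0, 1)))
```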

Errors are imprecise as to syntax node types

This Python file:

import common

def explode_twice():
  explode()
  raise "explode again"

triggers this error:

Caused by:
    0: Executing for child in @children { ... } at (68, 3)
    1: Undefined variable child.node on node [syntax node expression_statement (4, 3)]

because the .tsg file in question doesn't have a rule to handle the function call. (cf #64 for catching this earlier.)

The error describes this as an expression_statement, which is true, but it's more precisely a call, as we can see using tree-sitter parse:

❱ tree-sitter parse semantic-analysis/a.py
(module [0, 0] - [5, 0]
  (import_statement [0, 0] - [0, 13]
    name: (dotted_name [0, 7] - [0, 13]
      (identifier [0, 7] - [0, 13])))
  (function_definition [2, 0] - [4, 23]
    name: (identifier [2, 4] - [2, 17])
    parameters: (parameters [2, 17] - [2, 19])
    body: (block [3, 2] - [4, 23]
      (expression_statement [3, 2] - [3, 11]
        (call [3, 2] - [3, 11]
          function: (identifier [3, 2] - [3, 9])
          arguments: (argument_list [3, 9] - [3, 11])))
      (raise_statement [4, 2] - [4, 23]
        (string [4, 8] - [4, 23])))))

Having to run tree-sitter parse to check which syntax node type I haven't handled is a bit annoying; it'd be nicer if tree-sitter-graph would tell me we're talking about a call rather than an expression_statement, particularly since the two nodes have the same range.
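
One possible approach, sketched with a toy `Node` type rather than the real tree-sitter API: when a node has a single child covering exactly the same range, descend and report the whole chain of node types. All names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    start: tuple
    end: tuple
    children: list = field(default_factory=list)

def describe(node):
    """Follow chains of only-children with an identical range, listing each kind."""
    kinds = [node.kind]
    while len(node.children) == 1:
        child = node.children[0]
        if (child.start, child.end) != (node.start, node.end):
            break
        node = child
        kinds.append(node.kind)
    return " > ".join(kinds)

# The expression_statement from the parse output above.
call = Node("call", (3, 2), (3, 11), [
    Node("identifier", (3, 2), (3, 9)),
    Node("argument_list", (3, 9), (3, 11)),
])
stmt = Node("expression_statement", (3, 2), (3, 11), [call])
print(describe(stmt))
```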

Allow duplicate edges

Propagators have a neat design whereby each node's value is either unset, set to a specific value, or conflicting. You can set a value by propagating something to it; if two values are propagated to it, and they are equal, then it remains unchanged with that value; and if the values differ, it's in conflict—an error.

I'd like to see something similar for tree-sitter-graph edges: setting the same edge is harmless, and thus accepted. Setting the same attr with the same value on that edge is also harmless, and thus accepted. Distinct values for the attr are an error.

(I'd actually like to see something like this for nodes and their attrs as well, but nodes have identity, which makes things more annoying.)
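
A minimal sketch of the propagator-style behaviour described above, applied to edges and their attributes. `Cell`, `EdgeSet` and `Conflict` are hypothetical names for illustration only.

```python
class Conflict(Exception):
    """Raised when two different values are propagated to the same cell."""

_UNSET = object()

class Cell:
    """A propagator-style cell: unset, set to one value, or in conflict."""
    def __init__(self):
        self.value = _UNSET

    def set(self, value):
        if self.value is _UNSET:
            self.value = value
        elif self.value != value:
            raise Conflict(f"{self.value!r} conflicts with {value!r}")
        # Equal value: nothing changes.

class EdgeSet:
    def __init__(self):
        self._edges = {}  # (source, sink) -> {attribute name: Cell}

    def add_edge(self, source, sink):
        # Adding an edge that already exists is harmless.
        self._edges.setdefault((source, sink), {})

    def set_attr(self, source, sink, name, value):
        # Raises KeyError if the edge was never added.
        self._edges[(source, sink)].setdefault(name, Cell()).set(value)

edges = EdgeSet()
edges.add_edge(0, 1)
edges.add_edge(0, 1)                      # accepted: same edge
edges.set_attr(0, 1, "precedence", 1)
edges.set_attr(0, 1, "precedence", 1)     # accepted: same value
```

Setting `precedence` to a different value on the same edge would raise `Conflict`.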

REPL

I'd like to be able to launch a REPL session where I select a target file, write queries/rules, and interactively see:

  • parse trees
  • what gets matched by a query
  • what graph gets produced by rules
  • what rules are applied
  • etc

"Scoped variable" is a misnomer

In the old DSL, scoped variables were available within the lexical scope of the node they were attached to—its entire subtree.

This is no longer the case, leading me to believe that the name is a misnomer, and that we should try to come up with something more indicative of what they mean.

Show source excerpts for source file locations

Errors involving syntax nodes reference locations in the source file. We should show excerpts from the source file, just like #83 discusses for rules files.

See also #82 re: ensuring we include those locations in more errors, and #81 re: tracking multiple locations for duplicates specifically.

Track attr definition sites for duplicate attribute errors

When I define an attribute that turns out to be a duplicate, it can be a substantial amount of effort to track out where the original assignment was, especially if other kinds of nodes use the same attribute names. It'd be nice if tree-sitter-graph were to keep track of where (in the rules file's source) attributes were created so they could be reported for debugging.

Hypothetically we could compute this statically given the grammar (or at least an overapproximation to it), but I think we'd be better off doing it dynamically at least for the time being. If it's too slow to do all the time, maybe we could turn it on with a flag.
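
The dynamic variant could be as simple as storing the definition site next to each value. A sketch, with hypothetical names and (line, column) sites in the rules file:

```python
class DuplicateAttribute(Exception):
    pass

class Attributes:
    def __init__(self):
        self._values = {}  # name -> (value, site of the defining statement)

    def add(self, name, value, site):
        """Record an attribute, reporting the original site on a duplicate."""
        if name in self._values:
            _, original = self._values[name]
            raise DuplicateAttribute(
                f"duplicate attribute {name} at {site}, first defined at {original}"
            )
        self._values[name] = (value, site)

attrs = Attributes()
attrs.add("symbol", "foo", (12, 3))
try:
    attrs.add("symbol", "bar", (40, 3))
except DuplicateAttribute as error:
    print(error)
```

The extra cost is one tuple per attribute, which is why gating it behind a flag may not be necessary.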

Bad error message (and possible execution bug?) with typo

The .tsg below has two bugs:

  1. Typo on line 3 (@ehre should be @here)
  2. Use of @path.context on line 10 which is never defined on that syntax node
[
  (struct_item name: (_) @here)
  (enum_item name: (_) @ehre) ;; oops, typo
] {
  let @here.context = "%pattern"
}

(scoped_type_identifier) @path {
  node s
  attr (s) symbol = @path.context ;; oops, not defined
}

I ran it on this test case.

pub enum A {
    M(std::string::String),
}

Expected behavior: Either some error message for bug 1, or the usual message for bug 2 (Undefined scoped variable [syntax node scoped_identifier (2, 7)].context)

Actual behavior:

$ ./bin/tree-sitter-stack-graphs test --grammar . --tsg ./bug.tsg ./testcase.rs
Error: Error running test ./testcase.rs

Caused by:
    0: Executing attr ((load 1)) symbol = (scoped [syntax node scoped_type_identifier (2, 7)] 'context) at (10, 3)
    1: Expected a syntax node  got #null

This seems like a bug to me because I can't think of any legit way for #null to get into the computation at line 10.

To reproduce:

git clone [email protected]:jorendorff/tree-sitter-rust.git
cd tree-sitter-rust
git checkout jorendorff/internal-error
script/setup
./bin/tree-sitter-stack-graphs test --grammar . --tsg ./bug.tsg ./testcase.rs

Predefined + function

We mention this in the docs, it's simple, it's useful, we should include it here.

Support user-defined functions

TSG does not support the definition of functions in the TSG source. As a result, common patterns are often written out in multiple places, making sources harder to read and maintain.

It should be possible to define functions in the source. Something like:

function (foo node names*) {
    for name in names {
        // ...
    }
}

Questions and challenges:

  • How do we deal with localness of arguments inside a function?

    • We can start with requiring all arguments to be local, and try to relax later (e.g. by inferring from the use in the function body).
  • Should functions have return values?

    • This could be useful for e.g. a function introducing a module name, which may introduce multiple nodes, and the final node is returned.
  • Are functions treated the same as builtin functions, i.e. called as (foo ...), and can they appear in the same places?

    • I'd prefer if expressions were pure, and these functions certainly are not. Alternatively we clearly separate them using e.g. an apply foo ... statement. However, it would get a bit awkward if they return values: would we have two forms, let var = expr and apply var = foo ... or something?

Include source context in errors

tree-sitter-graph errors (for when you don't have a rule to cover some piece of syntax) conveniently show you the node type and range in errors:

Error: Cannot execute TSG file ./semantic-analysis/python.tsg

Caused by:
    0: Executing for child in @children { ... } at (68, 3)
    1: Undefined variable child.node on node [syntax node expression_statement (3, 5)]

It would be neat to extend that with a bit of the source, at least the first line or so (and maybe the preceding and following one or two for context?).

Include source file path in JSON output

There's currently no way to communicate the path we parsed and processed to consumers of JSON, but that's pretty important for use cases processing multiple files (among others).

Add support for automatic attributes for top-level query match and variable scope

Being able to track the matched syntax node for a query, and the scope of a scoped variable would be very helpful in debugging. This is a sketch for how we can implement this. Hat tip to @BekaValentine for the idea!

I think this is a very good idea! I certainly prefer the node definitions to be grouped by syntactic category (e.g., statements) in one common rule, instead of having them specified in separate rules for each individual statement form. And I think we have the information to add the top-level capture and variable scope as debug info.

Let me explain how debug information ends up in the visualization. Propagation of debug information is set up in such a way that any attribute that starts with debug_ is stored in the node's debug_info in the stack graph, and automatically gets added to the visualization of that node.

Some of these attributes can be added automatically for every node by TSG: the TSG location and TSG variable name. This happens if attribute names are set in ExecutionConfig. (These are set in TSSG code.)

So adding support for these would mean adding, say, variable_scope_attr and match_attr to ExecutionConfig, and setting names for those in TSSG where TSG is called. Setting values for those attributes:

  • For match_attr it can probably be done directly in the code for CreateGraphNode (strict and lazy). Getting the toplevel match is already supported (strict and lazy).
  • For variable_scope_attr it requires a little more work. In the strict case, the node will be eagerly computed, so a method on Variable could be called on self.node in the CreateGraphNode code. In the lazy case, the scope will be computed much later when the lazy values are forced. I'd have to look at the code in some more detail to see if we can piggyback on existing debug info that is passed around.
  • A new release of TSG and bumping the dependency in TSSG should do the trick then.

Perhaps doing the match_attr for the top-level match is a good first step that is already very helpful. (After all, for a rule such as [ (stmt1) (stmt2) (stmt3) ]@stmt { node @stmt.before_scope } the match and the variable scope are the same anyway.)

CMD usage

Hi, guys!
I want to know how to convert an AST file (generated by tree-sitter) to a graph file for visualization.
I see that the tree-sitter-graph command needs a tsg file and a source file, but I do not know what a tsg file is or how to write one.
Can you give a concrete example? Ideally the output format would be DOT.

Allow duplicate attrs

Like #85, but for attributes: using something like propagators, allow redundant attributes &c. if and only if their values are compatible (for most purposes we can think of this as "equal-valued").

Supply globals from the CLI

I'm working on a .tsg program which is eventually going to be run from some specialized process or other which may or may not already exist. In the meantime, however, I'd like to be able to iterate on it and test it against real sources without having to do that development effort, and enjoy the benefits of tree-sitter-graph's other features (JSON output, error presentation, etc.).

I can do this already, but only up to a point: globals which the eventual host process may provide are of course not available on the CLI—but they could be!

I propose adding a --global NAME=VALUE option which can be supplied zero or more times, taking a global variable name and the value to give it. We'd of course need to do something about types to distinguish (for example) ints from strings, which might become intricate given composites like lists (and lists of lists, and…).

Compile error

When I build the project,

$ cargo install --features cli tree-sitter-graph
error[E0658]: unions with non-`Copy` fields are unstable
   --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/smallvec-1.8.0/src/lib.rs:406:1
    |
406 | / union SmallVecData<A: Array> {
407 | |     inline: core::mem::ManuallyDrop<MaybeUninit<A>>,
408 | |     heap: (*mut A::Item, usize),
409 | | }
    | |_^
    |
    = note: see issue #55149 <https://github.com/rust-lang/rust/issues/55149> for more information

error: aborting due to previous error

How can I fix this?

Thanks!

"Scoped variables can only be attached to syntax nodes" error doesn't indicate variable name

I'm seeing an error where an edge, apparently, is triggering a "scoped variables can only be attached to syntax nodes" error:

    0: Executing scan filepath { ... } at (15, 3)
    1: Matching x with arm "([^/]+)$" { ... }
    2: Executing edge module_def -> module_def.dot at (42, 7)
    3: Scoped variables can only be attached to syntax nodes got [graph node 0]

I'd love to be able to confirm that the edge is in fact responsible. One way would be for the error message to indicate which variable it's talking about. This would also just generally be useful, since it would help you pick up the context for fixing the issue just that little bit faster.

Pass globals of non-string types on CLI

#78 allows us to provide --global var=value arguments on the CLI. This is limited to passing strings, however. It would be nice to support other, non-string globals.

One possibility would be type-specific flags like --global-int. I'm not sure how we'd want to handle collections.
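
A sketch of the type-specific-flag idea, using Python's `argparse` for illustration only. `--global-int` is the hypothetical flag suggested above, not an existing option, and collections are deliberately left out.

```python
import argparse

def name_value(arg):
    """Split NAME=VALUE at the first '='."""
    name, sep, value = arg.partition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected NAME=VALUE, got {arg!r}")
    return name, value

def int_global(arg):
    """Like name_value, but convert VALUE to an integer."""
    name, value = name_value(arg)
    try:
        return name, int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{name} is not an integer: {value!r}")

def parse_globals(argv):
    parser = argparse.ArgumentParser()
    # `global` is a Python keyword, so each flag stores into an explicit dest.
    parser.add_argument("--global", dest="strings", type=name_value,
                        action="append", default=[])
    parser.add_argument("--global-int", dest="ints", type=int_global,
                        action="append", default=[])
    args = parser.parse_args(argv)
    return dict(args.strings + args.ints)

print(parse_globals(["--global", "file=a.py", "--global-int", "depth=3"]))
```

One flag per scalar type keeps parsing unambiguous, at the cost of a growing flag list.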

Remove support for undeclared globals

Undeclared globals currently lead to a warning, but not an error.
Support for that should be removed, and using undeclared globals should result in a hard error during checking.

Default values for global variables

Global variables sometimes have sensible default values. It should be possible to provide those in the TSG.

I suggest the following way of doing that:

global VAR = VALUE

The = VALUE part is optional, and globals without it must always be provided. The value is only used if the global variable is not set by the caller.
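
The resolution rule can be sketched as follows. `REQUIRED` and `resolve_globals` are hypothetical names, and the declarations are modelled as a plain mapping rather than parsed TSG.

```python
REQUIRED = object()  # marks a global declared without a default

def resolve_globals(declared, provided):
    """Combine caller-provided globals with the defaults declared in the file.

    `declared` maps each global name to its default value, or to REQUIRED.
    """
    resolved = {}
    for name, default in declared.items():
        if name in provided:
            resolved[name] = provided[name]   # the caller's value wins
        elif default is REQUIRED:
            raise KeyError(f"missing global variable {name}")
        else:
            resolved[name] = default
    return resolved

declared = {"FILE_PATH": REQUIRED, "ROOT_NAME": "<root>"}
print(resolve_globals(declared, {"FILE_PATH": "a.py"}))
```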

Support includes

A simple include mechanism would be a first step towards introducing some reuse and better organization for large TSG files.

Something like:

include "../lib/common"

This would be a pragmatic solution. Obviously, includes are not a great mechanism, but setting up a proper module system, imports, scoping of identifiers, qualified names, etc is not something we have time for at the moment. Supporting includes on the other hand can be relatively easy to implement, I expect.
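
A rough sketch of textual include expansion with cycle detection. It assumes one `include "..."` directive per line and paths resolved relative to the including file; both are guesses about a feature that does not exist yet. The `read` callback is a hypothetical stand-in for file access.

```python
import posixpath

PREFIX = 'include "'

def expand_includes(path, read, stack=()):
    """Recursively inline include lines; `read` maps a path to its source text."""
    if path in stack:
        raise ValueError("include cycle: " + " -> ".join(stack + (path,)))
    out = []
    for line in read(path).splitlines():
        stripped = line.strip()
        if stripped.startswith(PREFIX) and stripped.endswith('"'):
            relative = stripped[len(PREFIX):-1]
            # Resolve the target relative to the directory of the including file.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), relative))
            out.append(expand_includes(target, read, stack + (path,)))
        else:
            out.append(line)
    return "\n".join(out)

files = {
    "lang/python.tsg": 'include "../lib/common"\n(module) @m {}',
    "lib/common": "global FILE_PATH",
}
print(expand_includes("lang/python.tsg", files.__getitem__))
```

Plain text substitution like this loses the original file and line of each stanza, so error locations would need a source map on top.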

Exhaustiveness checks

When writing a tsg file I often want to know if some or all of the portions of the rule are exhaustive, i.e. whether we cover all of the alternatives of a given sum. Otherwise, I can only discover this by manual checking, or by running against a variety of sources and seeing whether it dies or not.

The necessary information is available from the node-types.json file, and would therefore allow us to provide warnings either as part of the usual parsing pass or via a separate linter pass. Ideally we'd be able to provide them for the whole file, or for specific parts of the file, presumably via some sort of annotation on rules.
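
As a sketch of the core of such a check: node-types.json lists, for each supertype, its `subtypes`, so the unhandled alternatives are a set difference. How the handled types are extracted from a .tsg file is left out; `missing_subtypes` is a hypothetical helper and the sample data is a made-up fragment.

```python
import json

def missing_subtypes(node_types_json, supertype, handled):
    """Return the named subtypes of `supertype` that no rule handles."""
    for entry in json.loads(node_types_json):
        if entry["type"] == supertype and "subtypes" in entry:
            subtypes = {s["type"] for s in entry["subtypes"] if s["named"]}
            return sorted(subtypes - set(handled))
    raise KeyError(f"{supertype} is not a supertype in node-types.json")

# Made-up fragment in the node-types.json layout.
sample = json.dumps([
    {"type": "_simple_statement", "named": True, "subtypes": [
        {"type": "expression_statement", "named": True},
        {"type": "import_statement", "named": True},
        {"type": "raise_statement", "named": True},
    ]},
    {"type": "identifier", "named": True},
])
print(missing_subtypes(sample, "_simple_statement", ["import_statement"]))
```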

Namespacing

I think we're going to want to be able to run the union of multiple .tsg programs at a time, and to do that successfully and not have them clobber each other I think their variable and attr names will need to be namespaced.

I also think we're going to want to be able to explicitly use another namespace to e.g. extend one tsg program with another.

how to understand from where a function call is made?

I am using Tree-sitter to parse Python code. I need to determine from which file a function is called.

For example, I need to understand check_files_in_directory is invoked from GPT4Readability.utils. I already captured all the function calls.

But now I have to find out from which file check_files_in_directory is called. I am struggling to understand what the logic to do it would be. Can anyone please suggest an approach?

import os
from getpass import getpass
from GPT4Readability.utils import *
import importlib.resources as pkg_resources  


def generate_readme(root_dir, output_name, model):
    """Generates a README.md file based on the python files in the provided directory

    Args:
        root_dir (str): The root directory of the python package to parse and generate a readme for
    """

    # prompt_folder_name = os.path.join(os.path.dirname(__file__), "prompts")
    # prompt_path = os.path.join(prompt_folder_name, "readme_prompt.txt")

    with pkg_resources.open_text('GPT4Readability.prompts','readme_prompt.txt') as f:         
	    inb_msg = f.read()

    # with open(prompt_path) as f:
    #     lines = f.readlines()
    # inb_msg = "".join(lines)

    file_check_result = check_files_in_directory(root_dir)
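
One possible line of attack, sketched with Python's standard `ast` module rather than Tree-sitter (the same logic applies to Tree-sitter captures): record which names each import statement brings in, and treat `from m import *` modules as candidates for any called name that is neither defined locally nor imported by name. The helper names are hypothetical.

```python
import ast

def import_origins(source):
    """Map names imported via `from m import x` to their module, and collect
    the modules that are star-imported."""
    names, star_modules = {}, []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name == "*":
                    star_modules.append(node.module)
                else:
                    names[alias.asname or alias.name] = node.module
    return names, star_modules

def top_level_functions(source):
    """Names of the functions defined at the top level of a module."""
    return {n.name for n in ast.parse(source).body if isinstance(n, ast.FunctionDef)}

caller = "import os\nfrom getpass import getpass\nfrom GPT4Readability.utils import *\n"
names, star_modules = import_origins(caller)
print(names, star_modules)
```

To settle where `check_files_in_directory` lives, one would then parse each star-imported module's source and test whether the name is in `top_level_functions(...)`. A module that defines `__all__` or re-exports names from elsewhere would need extra handling.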

Show where the rule matched for dynamic errors

I've got a duplicate edge error, which is discovered dynamically when running against some source file. How can this edge already exist? What node is it actually talking about? Who knows!

I'd like the error to include:

  1. Where the match in the source file began, along with an excerpt of (at least) the first line of that source (#84).
  2. Where the bound nodes were within the source, e.g. by picking a terminal-usable colour for each variable, drawing the variable's name in that, and drawing the corresponding span of the excerpt in that as well, with child nodes taking precedence over parent nodes colour-wise.
  3. What variables exist on the various nodes. (Maybe just the ones we've explicitly bound, although a REPL (#77) might allow one to get more.)

Add string construction function

We have scan to take a string apart, but we cannot construct strings at the moment (except for literals). Some kind of string-format or string-join function should be added.
