github / semantic
Parsing, analyzing, and comparing source code across many languages
Anonymous (token) nodes are dropped before performing committed choices consisting entirely of non-token symbols (the usual case). You can always match a token by explicitly requesting its symbol, but if you have e.g. multiple optional token nodes separating optional non-token nodes, it becomes difficult and error-prone to ensure that e.g. `symbol AnonToken *> namedRule` doesn’t advance past any tokens after the `namedRule`.
This is not a common situation, but the current strategy for handling it requires some fairly kludgy workarounds.
It would be better if the assignment language included explicit combinators for e.g. dropping anonymous nodes, advancing past the current node, &c., with some mechanism for assigning without the automatic behaviours which otherwise have to be worked around.
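It might look something like this (purely a sketch; the `Assignment` type and the combinator names are hypothetical, not the current API):

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical surface for explicit combinators; Assignment stands in
-- for the real assignment grammar type.
data Assignment a where
  DropAnonymous :: Assignment ()                -- skip anonymous nodes at the current position
  Advance       :: Assignment ()                -- unconditionally advance past the current node
  Raw           :: Assignment a -> Assignment a -- run a rule without the automatic token-dropping behaviour
  Pure          :: a -> Assignment a
```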
The feature vectors for two (sibling) subtrees are computed serially, despite being (partially) independent. Every step (labelling nodes, computing [bags of] p,q-grams, hashing, picking random vectors on the unit d-sphere, and summing unit vectors into feature vectors) is completely parallelizable, depending only on results from the previous step (computing p,q-grams requires the labels of sibling nodes, but every other step requires only the result for the node and the subtree descending from it).
These are CPU-bound, so this should speed things up.
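A minimal sketch of what sparking sibling subtrees in parallel could look like, using the parallel package (the `Tree` type and `featureVector` are stand-ins for the real term type and featurization):

```haskell
import Control.Parallel.Strategies (parTraversable, rdeepseq, using)

-- Stand-ins for the real term type and featurization step.
data Tree = Node String [Tree]

-- Placeholder featurization: the real one sums unit vectors over p,q-grams.
featureVector :: Tree -> [Double]
featureVector (Node label children) =
  map (fromIntegral . fromEnum) label ++ concatMap featureVector children

-- Each sibling subtree's feature vector is independent of the others',
-- so they can be evaluated in parallel.
featureVectors :: [Tree] -> [[Double]]
featureVectors siblings = map featureVector siblings `using` parTraversable rdeepseq
```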
Handling Ruby access control (i.e. `public`, `protected`, and `private`) for method declarations is problematic because `protected` and `private` are methods rather than keywords. This distinction is crucial because it leaves the `protected` and `private` methods open to monkey-patching or aliasing:
```ruby
class Module
  alias_method :barf, :private
end

class Foo
  barf
  def this_should_be_private; end
end
```
Because we can't rely on keywords in Ruby at the parser level to determine the visibility of method declarations, we'll need to potentially handle this during evaluation when constructing the `ScopeGraph`. One strategy we discussed is checking the containing scope of a method declaration to determine whether it is within a class or module, and whether that class or module has applied the `protected` or `private` method, since that state tells us the current method declaration's access control.
shelly is a good library with a well-thought-out API, but in our case its API is causing an unnecessary UTF-8 encode/decode for each blob read from the filesystem. shelly reads from a process as a `ByteString`, then converts that into a `Text` value, which we then convert back into a `ByteString` with `Source.fromUTF8`. We can skip all of this and use streaming-bytestring to read as directly from the process as possible; the only thing I want to be careful about is that we behave the same on erroneous input or if the process exits unexpectedly.
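For illustration, a minimal sketch of reading a process's output as raw bytes with plain bytestring (not yet streaming-bytestring); the exit-code handling here is the part we'd need to match shelly's behaviour on:

```haskell
import qualified Data.ByteString as B
import           System.Exit (ExitCode (..))
import           System.Process

-- Read a process's stdout as raw bytes, skipping the ByteString ->
-- Text -> ByteString round-trip that shelly performs. The error
-- handling is illustrative, not a final design.
readProcessBytes :: FilePath -> [String] -> IO B.ByteString
readProcessBytes cmd args = do
  (_, Just out, _, ph) <- createProcess (proc cmd args) { std_out = CreatePipe }
  bytes <- B.hGetContents out
  code  <- waitForProcess ph
  case code of
    ExitSuccess   -> pure bytes
    ExitFailure n -> fail ("process exited with code " <> show n)
```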
Convention in diffs is for deletions to come before insertions. We don’t guarantee that, or at least, we didn’t at one point in time, and I’d like us to test that we do the right thing before closing this.
We currently assume that two statements in a sequence are completely unrelated, e.g.:
```ruby
def foo
  a
  b
end
```
Here, `a` and `b` are indeed orthogonal, but the result of `foo` isn’t just determined by `b`, as it would be in a pure language, but rather by the effects performed by both `a` and `b`. In other words, treating statement blocks as simply returning their final value means that we’re losing the ability to model important properties using the abstract domain. This has real consequences for us, in that computing the call graph requires us to manually inspect syntax, instead of e.g. representing values as sets of referenced symbols, unioned together.
If our abstract domain had a sequencing operation, akin to the `*>` method for `Applicative`s, we could move a lot of out-of-band inspection of syntax into the representation of the domain, resulting in a simpler, more direct implementation for many analyses.
This will additionally be necessary to correctly model type & effect systems.
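A minimal sketch of such a sequencing operation for a call-graph-style domain (the class and types here are hypothetical):

```haskell
import qualified Data.Set as Set

-- Hypothetical: an abstract domain with an explicit sequencing operation.
class AbstractValue value where
  -- Like (*>): the second value is the "result", but the first's
  -- contribution to the domain (its effects) is retained.
  andThen :: value -> value -> value

-- For a call-graph analysis, a value can be the set of symbols it
-- references; sequencing unions the references of both statements.
newtype Refs = Refs (Set.Set String)

instance AbstractValue Refs where
  Refs a `andThen` Refs b = Refs (Set.union a b)
```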
We compute the feature vectors for terms inside the diffing machinery, which means that work isn’t exposed to the parallelism offered by `Task`’s `distribute…` family of APIs. This means that we compute them serially, despite the fact that decorating a term with its feature vectors is a completely independent operation (i.e. parallelizable).
Note that this only affects the two terms for any one diff, and thus if we’re computing two diffs in a PR we’ll compute the feature vectors and diffs for one in parallel to the feature vectors and diffs for the other.
The hashing & RNG used by RWS act counter to its intention of approximating the tree edit distance with the squared Euclidean distance between the d-dimensional feature vectors summarizing nodes. This is caused by two factors:
These problems primarily affect the unit vectors, because only individual nodes’ p,q-grams are hashed to seed the RNG and compute the d-dimensional unit vector. So while two similar leaf nodes may have dramatically different unit vectors (and therefore feature vectors), two trees which are mostly the same will nevertheless have mostly similar feature vectors. That is, RWS works in aggregate because similarity is considered across (relatively large) subtrees; the discontinuity introduced by hashing & the RNG is effectively amortized.
While that works well at larger scales, it falls down fairly consistently at smaller ones. That’s particularly problematic for our use case where small changes tend to dominate for human factors: smaller changes are easier to review. It’s also a factor in the unpredictability that we’ve noticed: reordering the syntax constructors changes the hash values assigned to each, which in turn can result in different results from our test fixtures.
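To make the discontinuity concrete, here’s a rough sketch of how a single node’s label could determine its unit vector (the hash-seeded RNG matches the description above; uniform-then-normalize sampling is a simplification of picking a point on the unit d-sphere):

```haskell
import Data.Hashable (hash)
import System.Random (mkStdGen, randomRs)

-- A node's label is hashed to seed the RNG that picks its unit vector,
-- so labels differing by a single character get unrelated vectors.
unitVector :: Int -> String -> [Double]
unitVector d label = map (/ norm) components
  where
    components = take d (randomRs (-1, 1) (mkStdGen (hash label)))
    norm       = sqrt (sum (map (^ 2) components))
```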
We should set up some CI to run the test suite for each candidate PR.
(To be clear, we were doing this before! We should just start doing it again now that the library is open sourced.)
We won’t always have all of the information available for a program. A partial (hah) list of conditions which can and will cause this:
Since we’re generally using approximate techniques, we will often be able to provide functionality proportional to the amount of information available to us—partial information should mean partial functionality, not zero functionality.
A brief and incomplete catalogue of strategies we can apply for this purpose:
`ghc`, for example by using unification to guess the types of unknown terms.

I would love to see a `Traversal` of the immediate children of `Core` so that we can use things like `transform` or `cosmos`.
If depending on lens is a problem, it's possible to just write the traversal without it:
https://github.com/ekmett/lens/wiki/How-can-I-write-lenses-without-depending-on-lens%3F#traversal
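For example, a van Laarhoven traversal of a node’s immediate children can be written with no lens dependency (the `Core` constructors here are invented for illustration; the real type has more cases):

```haskell
{-# LANGUAGE RankNTypes #-}

-- The standard lens-compatible traversal type, defined locally.
type Traversal' s a = forall f. Applicative f => (a -> f a) -> s -> f s

data Core = Var String | Lam String Core | App Core Core

-- Visit each immediate child exactly once; leaves have none.
children :: Traversal' Core Core
children _ v@(Var _) = pure v
children f (Lam x b) = Lam x <$> f b
children f (App a b) = App <$> f a <*> f b
```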
The CLI usage info doesn’t mention that you can specify per-file languages with `path/to.file:Lang`, and it also doesn’t mention the format for the language either there or in the usage for `semantic graph` under `--language`.
We should make sure to mention the existence of the former, the format of the language in both cases, and probably also the complete list of supported languages.
Similarly, if you get the format wrong all we say is:
```
option --language: cannot parse value `haskell'
```
We should tell you more about the format here as well.
As pointed out in https://twitter.com/mitchellsalad/status/1135618541376479233
We should modify our scope graph implementation so we can extend an initial scope graph computation with incremental changes.
To do so, we'll need to store partial paths to unknown declarations in the scope graph. Given the initial scope graph and a new set of files to parse, we should be able to generate a new scope graph by filling in the gaps using terms we find in those new files.
@joshvera pointed out that we can parallelize some of our SES implementation using the `Control.Parallel` primitives `par` & `pseq`. Specifically, in an edit graph like this one:
each vertex depends on everything to the right of and below it. That means that the top right and bottom left quarters are independent of each other, tho they both depend on the bottom right quarter, and the top left quarter depends on all three. (And it breaks down further, vertex by vertex, in much the same fashion.)
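A minimal sketch of that quadrant-level structure with `par`/`pseq` (`solveQuadrant` stands in for the real per-vertex SES computation):

```haskell
import Control.Parallel (par, pseq)

-- Bottom-right first; top-right and bottom-left depend only on it and
-- can be evaluated in parallel; top-left depends on all three.
solve :: (Int -> Int -> Int) -> Int
solve solveQuadrant =
  let br = solveQuadrant 1 1
      tr = solveQuadrant 0 1 + br
      bl = solveQuadrant 1 0 + br
      tl = solveQuadrant 0 0 + tr + bl
  in br `pseq` (tr `par` (bl `pseq` tl))
```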
In `Semantic.Git`, we should pull in attoparsec to parse the elements of `git ls-tree` output, rather than matching on `Text` values, which is slow.
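Something like this, assuming the default `<mode> <type> <oid>\t<path>` line format (a sketch, not the real `Semantic.Git` types):

```haskell
import           Data.Attoparsec.Text
import qualified Data.Text as T

-- One line of `git ls-tree` output: "<mode> <type> <oid>\t<path>".
data TreeEntry = TreeEntry
  { entryMode :: T.Text
  , entryType :: T.Text
  , entryOid  :: T.Text
  , entryPath :: T.Text
  }

treeEntry :: Parser TreeEntry
treeEntry = TreeEntry
  <$> takeWhile1 (/= ' ')  <* char ' '
  <*> takeWhile1 (/= ' ')  <* char ' '
  <*> takeWhile1 (/= '\t') <* char '\t'
  <*> takeTill (== '\n')
```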
We have internal documentation that describes how to get up and running with semantic quickly. We should port it to the `docs/` folder.
On the following TypeScript file:

```typescript
export type NoInfer<T> = T & { [K in keyof T]: T[K] };
```
I'm getting the following error:
```
cabal new-run semantic -- parse minimal.ts
Up to date
(Error)
[12:08:07] WARN minimal.ts:1:1-3:1: error: expected end of input nodes, but got ParseError
1: export type NoInfer<T> = T & { [K in keyof T]: T[K] };
^...
task=parse github_request_id=- github_is_public=False path=minimal.ts language=TypeScript
```
This is a minimal example, reduced from trying to parse the file https://github.com/palantir/tslint/blob/master/src/language/rule/abstractRule.ts.
As per this tweet from Neil Mitchell:
> GHC turns on the parallel GC if you have multiple threads. My experience: never faster, always slower, sometimes 5x slower wall clock time. Pass -qg to the runtime to disable. I don't think the default is right…
We should do some benchmarking to see if this, or this, has a positive effect on performance.
In our current scope graph implementation, declarations can optionally contain an associated scope. This has been a good solution, but as we look toward a future in which iterative scope graph calculation is important for supporting the type of indexing in the `alephd` service, we can be more precise about which declaration scopes require additional effort to compute completely and iteratively.
Rather than a `Maybe scope` for a declaration's associated scope, a sum type indicating the status of an associated scope gives us a way to store convenient data needed for later computation:
```haskell
data AssociatedScope scope name
  = AssociatedScope scope
  | NeedToBeCalculated name
  | NoAssociatedScope
```
This is not meant to be a prescriptive solution, but to illustrate the data needed for cases in which we can identify a hole in the scope graph during evaluation of a syntax term. In such cases, if we know an associated scope should exist but is not found, we can construct a placeholder associated scope (e.g. `NeedToBeCalculated`) that contains the `name` we should use for the declaration scope lookup later.
Hypothetically, after we store scope graphs in a database or data structure, the ability to query for declarations with missing associated scopes gives us a concrete list of holes we can attempt to fill in with the help of more recently computed scope graphs.
(Bug report originally written by @rewinfrey!)
Once upon a time, we implemented a strategy to resolve a bias in RWS favouring earlier elements that involved finding the nearest neighbours in both directions at once. This resolved the bias, but unfortunately it added about a fifth to both our runtime and heap usage.
`editDistanceUpTo` is not actually a constant-time approximation of the tree edit distance, because it relies on `cutoff`, which bounds recursion depth, but not traversal count.
The paper refers to their use of “an iterative deepending [sic] top-down matching that stops after a fixed number of compared nodes and is thus in O(1).” It seems likely that this was a typo of iterative deepening. This is the traversal we’d want to place a constant bound on.
À la carte `If` syntax is defined s.t. multiple if/else if/else are represented as a recursive right-chained series of `If`s. This means that diffing can’t consider all of the alternatives at once, and therefore that it has to proceed linearly (i.e. we don’t have a list that we can pass to `byRWS`). That in turn means that if a user inserts or deletes one `else if` from the middle of a chain, we won’t be able to match up anything after it; the whole rest of the chain will end up as a replacement.
This is going to happen any time we have a recursively nested chain of nodes instead of a list; `If` is the only example that came to mind immediately, but we should in general attempt to model syntactic sequences with a list field instead of with chains of nodes.
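A rough sketch of the two shapes (illustrative constructors, not the actual syntax types):

```haskell
-- Right-chained: each alternative hides the rest of the chain, so the
-- differ sees one clause at a time.
data If a = If { condition :: a, consequence :: a, alternative :: a }

-- List-shaped: all clauses are visible at once, so inserting or
-- deleting one clause mid-chain can still match up the remainder.
data Cond a = Cond { clauses :: [(a, a)], fallback :: Maybe a }
```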
We developed our RWS implementation in response to flaws of our SES implementation at the time:

- It performed n² comparisons in every case, not just the worst case of disjoint inputs.
- It had pretty high constant factors due to the long chains of thunks it built up (IIRC).
- It performed a prohibitive full tree diff for each comparison in order to select the lowest-cost path through the results.
Later on, we added a shallow SES pass to RWS to match up identical (and, later, equivalent) trees early, enabling us to lend more weight to human-readable identifiers (e.g. we’ll match two functions named “foo” at a high priority regardless of changes to their bodies). This shallow treatment of SES + the implementation of Myers’ algorithm for SES address those issues while also enjoying much more predictable & stable behaviour than RWS.
Given the problems with RWS, let’s experiment with adding an SES instruction to `Algorithm` and using it instead of RWS in some/all cases.
Our heap usage in invocations of `semantic` in this case is suboptimal: we eagerly load every file in a project into memory, and only relinquish it when tagging is complete. There’s no reason why we can’t do this in a fixed amount of memory, given that in production use cases memory can be very limited. My hypothesis is that if we switch to a streaming model—that is, if we make the `git ls-tree` process asynchronous, fire off a request for an `API.File` per line, then fold these `File` values into a single `Vector File`—we should see better memory performance. We should also be able to have more fine-grained concurrency control.
Streaming in Haskell is… well, it’s hard. There are so many options, so many libraries, so many idioms. I’ve been exploring this, and I’ve gotten a successful proof-of-concept up and running, one that shows significantly less peak heap usage. I think we should add real capabilities for `semantic` to do this.
Currently the infrastructure for `Task` is very specific to batch-oriented CLI use. We read a whole `Project` into memory, then use the `Distribute` effect to parallelize. It’s not particularly straightforward to just switch out the streaming mechanisms, so I plan to add a new command. What I’d like to see is something like this:
```
semantic stream --symbols --gitDir=../heaven --sha=HEAD
```
My proof-of-concept uses the `streaming-process` library to read the lines from `git ls-tree`, the `streaming-bytestring` package to parse those lines into git entries, the `streaming` primitives to turn those entries into `API.File` values, and the `foldl` package to consume those tags into a single `Vector`. Streaming architectures are hard to write and sometimes inflexible, but they’re very elegant when they work. In addition, I’ve gotten the `streaming` ecosystem to work with `fused-effects`. (I originally considered `streamly`, but it doesn’t have good support for reading from processes, and it relies on `MonadBaseControl`, which I consider harmful.)
This would be a nontrivial amount of engineering effort, but I’ve heard reports of people compiling 100k SLOC Haskell projects in two minutes. I think effort spent on this would positively impact our ability to turn around revisions quickly, in the manner that a scripting-language-based platform would.
Is there a plan to do a release of binaries? That would be quite helpful for people who just want to try out the tool.
Each of our various CLI flags uses slightly different mechanisms to yield a `Data.Project` that it can analyze. `parse` uses `Semantic.IO.findFilesInDir`, `parse --gitDir` uses `readBlobsFromGitRepo`, our `Util` functions use `readProjectFromPaths`, `Files` provides `readProject`, and there are probably more. This entails a whole lot of duplicated code and pain. Some of this complexity is essential (for parsing from git repos, we need to use `Git.lsTree`, but we don’t do that for arbitrary directories), but we should provide a single place for blobs, projects, etc. to be read, probably with the `Files` DSL.
We can, and should, bound our recursion, falling back to diffing by replacement once we’ve gone n levels deep into the tree.
Holes are essentially empty, meaning that we lose potentially useful information about what condition caused the hole. The various resuming handlers could store the error(s) in holes, and our abstract domains could be monoidal with respect to these (coalescing sets of error conditions together).
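A minimal sketch of the idea, assuming errors are represented as a set (illustrative types, not the real domain):

```haskell
import qualified Data.Set as Set

-- A hole that remembers the error conditions that produced it.
-- Plain Strings stand in for the real resumable error types.
data Value = Concrete Integer | Hole (Set.Set String)

-- Coalescing is monoidal: combining holes unions their error sets,
-- and a hole absorbs whatever it's combined with.
instance Semigroup Value where
  Hole a <> Hole b = Hole (Set.union a b)
  Hole a <> _      = Hole a
  _      <> v      = v
```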
Are there any plans to put `semantic` on any common system package managers (e.g. apt, brew, etc.)?
It's quite a long from-source install right now, and it would be really useful if there were some sort of distributed binary that could be installed faster.
Is there any plan to publish on hackage and stackage?
If you use both `--language` and a `:Lang` suffix on a path, you get a confusing error message:
```
λ :!semantic graph --language=Haskell ~/Desktop/test.md:Haskell
semantic: /Users/rob/Desktop/test.md:Haskell: openBinaryFile: does not exist (No such file or directory)
```
We should probably tell you that you can’t mix these formats instead.
This dies:
```python
dataset = (tf.data.TextLineDataset(file_path) # Read text file
    .skip(1) # Skip header row
    .map(decode_csv, num_parallel_calls=4) # Decode each line
    .cache() # Warning: Caches entire dataset, can cause out of memory
    .shuffle(shuffle_count) # Randomize elems (1 == no operation)
    .repeat(repeat_count) # Repeats dataset this # times
    .batch(32)
    .prefetch(1) # Make sure you always have 1 batch ready to serve
)
```
with:

```
[15:40:26] WARN ../models/samples/outreach/blogs/blog_custom_estimators.py:71:16-77:15: error: expected DottedName, but got Comment
69: return d
70:
71: dataset = (tf.data.TextLineDataset(file_path) # Read text file
^...
```
This is in tensorflow/models.
Saw this while generating the scope graph for https://github.com/pallets/flask. I believe the file was `flask/app.py`.
```
[08:14:22] WARN Pipeline error: Unhandled internal exception: importing from '"."' is not implemented
CallStack (from HasCallStack):
  error, called at src/Language/Python/Syntax.hs:104:58 in semantic-0.4.0-DkhVMfhhsF0KnFJ7Lg1Wz8:Language.Python.Syntax
```
Nothing about the library or executable is Unix-specific, so we should support Windows. Creating this issue to track it.
We should pull in something like https://github.com/nomeata/gipeda and graph our performance over time.
Ran into this exception when generating a scope graph for tensorflow/models. I think the file in question might be `global_objectives/loss_layers_test.py`.
```
Unhandled internal exception: fromRational has been applied to a repeating decimal which can't be represented as a Scientific! It's better to avoid performing fractional operations on Scientifics and convert them to other fractional types like Double as early as possible.
CallStack (from HasCallStack):
  error, called at src/Data/Scientific.hs:311:23 in scientific-0.3.6.2-KIIlODNgUeO9twnHef2PBn:Data.Scientific
```
https://github.com/jwiegley/gitlib

We’re currently reinventing a number of Git primitives poorly (usually as wrappers around `Text`). For our own sanity’s sake, let’s not maintain a Git ref parser: we should just use these data structures instead of those in `Semantic.Git`.
After spending a lot of time with the various streaming libraries, the best-equipped for our needs is `streaming` and the `streaming-*` ecosystem. We should convert our `machines`-based stuff to `streaming`.
Given a diff of branch nodes each with one child, a & aʹ respectively, RWS always finds a nearest neighbour, even when a and aʹ are substantially different. Note that if they are categorized differently it’ll still avoid performing the comparison, and also that this won’t affect its selection of (approximately) the best comparisons among larger branches.
The RWS-Diff paper uses the squared Euclidean distance of the feature vectors as its approximation of their edit distance, and perhaps we could set an upper bound in terms of that measurement; but we’d probably want to refine this quite a bit.
Found an infinite loop while generating the scope graph for keras. I believe the culprit is `keras/utils/vis_utils.py`.
This is while interpreting in the concrete domain.
We compute labels for every node. Currently, label similarity is computed by comparing hashes, which clamps it to either 0 (different) or 1 (identical). This is counterproductive: leaves that are only slightly distinct are treated identically to disjoint leaves.
This is necessitated by RWS, so a better similarity metric on labels will require better similarity metrics on trees and better subterm matching to have any effect.
See also #25.
@tclem & I found a while back that a property test with four args would use the same value for the latter three on every iteration. It’s like how digits work: you enumerate all the digits in the rightmost place before incrementing the next one, and then enumerate all the digits in the rightmost column again.
If you only do a thousand tests, we have enough `Category` constructors to almost exhaust that right there. (We no longer have those types, but the problem still occurs when we have annotations which can take multiple values.) Multiply that by the size of `Syntax` and so forth and you’d have to do an incredible number of tests before trying a different value for the second arg.
The enumeration we do right now is column-major, but we’d prefer an ordering that explores the space in a more balanced way; the Hilbert curve comes to mind, but Morton ordering may be more appropriate.
Alternatively, maybe we should move back to using QuickCheck or another stochastic property testing framework like hedgehog.
Provide a man page for the deployed tool. A comprehensive operator’s guide, essentially.
Is the whole file always parsed and diffed or can unchanged sections be easily reused?
The use case I have in mind is that LSP clients often sync the whole file or large sections when the user has made small changes, or a number of small changes interspersed in a large section. LSP servers usually do better with small changes. There is an open HIE issue for this.
`Statement.Try` is defined as a pair of a body node and a list of catch/finally/etc. nodes; `Statement.Catch` has an exception and a body; and `Statement.Finally` has a single node, for its body. This mirrors the structure of many surface syntaxes, but isn’t particularly amenable to evaluation: the `Evaluatable` instance for `Try` would have to go through its list of catch/finally nodes, categorize each according to what actions should be taken, and orchestrate them, potentially throwing exceptions if/when unanticipated syntax occurs in those positions. Meanwhile, the `Evaluatable` instances for `Catch` and `Finally` would have to be defined as essentially no-ops, despite them actually being in the ideal circumstances to work with their actual contents.
By contrast, if we structured a `try`/`catch`/`finally` block like so:
```javascript
try {
  …foo…
} catch (e) {
  …bar…
} finally {
  …quux…
}
```
as a left-nested structure like so:
```
Finally (Catch (Try (…foo…)) (e) (…bar…)) (…quux…)
```
then the `Evaluatable` instance for `Try` could simply evaluate its body, while the instances for `Catch` and `Finally` could insert the appropriate machinery to catch guest-language exceptions/run cleanup after exceptions. None of them would require any special knowledge of the semantics of any of the others, and we wouldn’t have to do any kind of meta-error-handling to deal with unexpected syntax (beyond what we already do with `Unspecialized` exceptions). The crucial detail here is that `Catch` and `Finally` both contain the term over whose effects they scope.
This would be ideally structured for analysis, but might be surprising to consumers expecting a simpler tree where `Try` is the root node.
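As a sketch, the left-nested shape could look like this (field names are assumptions based on the description above):

```haskell
-- Each wrapper contains the term over whose effects it scopes.
newtype Try a = Try { tryBody :: a }

data Catch a = Catch
  { catchScoped  :: a  -- the (Try …) term being guarded
  , catchBinding :: a  -- the exception binding, e.g. `e`
  , catchHandler :: a  -- the handler body
  }

data Finally a = Finally
  { finallyScoped  :: a  -- the (Catch …) or (Try …) term
  , finallyCleanup :: a  -- the cleanup body
  }
```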
According to your report here: https://github.com/github/semantic/blob/master/docs/why-haskell.md
Eta might help with some of your concerns about the language: in particular with error handling, due to improved compiler messages, and obviously all the JVM-related parts.