github / semantic
Parsing, analyzing, and comparing source code across many languages
Anonymous (token) nodes are dropped before performing committed choices consisting entirely of non-token symbols (the usual case). You can always match a token by explicitly requesting its symbol, but if you have e.g. multiple optional token nodes separating optional non-token nodes, it becomes difficult and error-prone to ensure that e.g. `symbol AnonToken *> namedRule` doesn’t advance past any tokens after the `namedRule`.
This is not a common situation, but the current strategy for handling it requires some fairly kludgy workarounds.
It would be better if the assignment language included explicit combinators for e.g. dropping anonymous nodes, advancing past the current node, &c., with some mechanism for assigning without the automatic behaviours which otherwise have to be worked around.
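It might look something like this (purely a sketch; the `Assignment` type and the combinator names are hypothetical, not the current API):

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical surface for explicit combinators; Assignment stands in
-- for the real assignment grammar type.
data Assignment a where
  DropAnonymous :: Assignment ()                -- skip anonymous nodes at the current position
  Advance       :: Assignment ()                -- unconditionally advance past the current node
  Raw           :: Assignment a -> Assignment a -- run a rule without the automatic token-dropping behaviour
  Pure          :: a -> Assignment a
```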
The feature vectors for two (sibling) subtrees are computed serially, despite being (partially) independent. Every step (labelling nodes, computing [bags of] p,q-grams, hashing, picking random vectors on the unit d-sphere, and summing unit vectors into feature vectors) is completely parallelizable, depending only on results from the previous step (computing p,q-grams requires the labels of sibling nodes, but every other step requires only the result for the node and the subtree descending from it).
These are CPU-bound, so this should speed things up.
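A minimal sketch of what sparking sibling subtrees in parallel could look like, using the parallel package (the `Tree` type and `featureVector` are stand-ins for the real term type and featurization):

```haskell
import Control.Parallel.Strategies (parTraversable, rdeepseq, using)

-- Stand-ins for the real term type and featurization step.
data Tree = Node String [Tree]

-- Placeholder featurization: the real one sums unit vectors over p,q-grams.
featureVector :: Tree -> [Double]
featureVector (Node label children) =
  map (fromIntegral . fromEnum) label ++ concatMap featureVector children

-- Each sibling subtree's feature vector is independent of the others',
-- so they can be evaluated in parallel.
featureVectors :: [Tree] -> [[Double]]
featureVectors siblings = map featureVector siblings `using` parTraversable rdeepseq
```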
Handling Ruby access control (i.e. `public`, `protected`, and `private`) for method declarations is problematic because `protected` and `private` are methods rather than keywords. This distinction is crucial because it leaves the `protected` and `private` methods open to monkey-patching or aliasing:
```ruby
class Module
  alias_method :barf, :private
end

class Foo
  barf
  def this_should_be_private; end
end
```
Because we can't rely on keywords in Ruby at the parser level to determine the visibility of method declarations, we'll need to potentially handle this during evaluation when constructing the `ScopeGraph`. One strategy we discussed is checking the containing scope of a method declaration to determine whether it is within a class or module, and whether that class or module has applied the `protected` or `private` method, since that state tells us the current method declaration's access control.
shelly is a good library with a well-thought-out API, but in our case its API is causing an unnecessary UTF-8 encode/decode for each blob read from the filesystem. shelly reads from a process as a `ByteString`, then converts that into a `Text` value, which we then convert back into a `ByteString` with `Source.fromUTF8`. We can skip all of this and use streaming-bytestring to read as directly from the process as possible; the only thing I want to be careful about is that we behave the same on erroneous input or if the process exits unexpectedly.
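For illustration, a minimal sketch of reading a process's output as raw bytes with plain bytestring (not yet streaming-bytestring); the exit-code handling here is the part we'd need to match shelly's behaviour on:

```haskell
import qualified Data.ByteString as B
import           System.Exit (ExitCode (..))
import           System.Process

-- Read a process's stdout as raw bytes, skipping the ByteString ->
-- Text -> ByteString round-trip that shelly performs. The error
-- handling is illustrative, not a final design.
readProcessBytes :: FilePath -> [String] -> IO B.ByteString
readProcessBytes cmd args = do
  (_, Just out, _, ph) <- createProcess (proc cmd args) { std_out = CreatePipe }
  bytes <- B.hGetContents out
  code  <- waitForProcess ph
  case code of
    ExitSuccess   -> pure bytes
    ExitFailure n -> fail ("process exited with code " <> show n)
```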
Convention in diffs is for deletions to come before insertions. We don’t guarantee that, or at least, we didn’t at one point in time, and I’d like us to test that we do the right thing before closing this.
We currently assume that two statements in a sequence are completely unrelated, e.g.:
```ruby
def foo
  a
  b
end
```
Here, `a` and `b` are indeed orthogonal, but the result of `foo` isn’t just determined by `b`, as it would be in a pure language, but rather by the effects performed by both `a` and `b`. In other words, treating statement blocks as simply returning their final value means that we’re losing the ability to model important properties using the abstract domain. This has real consequences for us, in that computing the call graph requires us to manually inspect syntax, instead of e.g. representing values as sets of referenced symbols, unioned together.
If our abstract domain had a sequencing operation, akin to the `*>` method for `Applicative`s, we could move a lot of out-of-band inspection of syntax into the representation of the domain, resulting in a simpler, more direct implementation for many analyses.
This will additionally be necessary to correctly model type & effect systems.
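A minimal sketch of such a sequencing operation for a call-graph-style domain (the class and types here are hypothetical):

```haskell
import qualified Data.Set as Set

-- Hypothetical: an abstract domain with an explicit sequencing operation.
class AbstractValue value where
  -- Like (*>): the second value is the "result", but the first's
  -- contribution to the domain (its effects) is retained.
  andThen :: value -> value -> value

-- For a call-graph analysis, a value can be the set of symbols it
-- references; sequencing unions the references of both statements.
newtype Refs = Refs (Set.Set String)

instance AbstractValue Refs where
  Refs a `andThen` Refs b = Refs (Set.union a b)
```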
We compute the feature vectors for terms inside the diffing machinery, which means that work isn’t exposed to the parallelism offered by `Task`’s `distribute…` family of APIs. This means that we compute them serially, despite the fact that decorating a term with its feature vectors is a completely independent operation (i.e. parallelizable).
Note that this only affects the two terms for any one diff, and thus if we’re computing two diffs in a PR we’ll compute the feature vectors and diffs for one in parallel to the feature vectors and diffs for the other.
The hashing & RNG used by RWS act counter to its intention of approximating the tree edit distance with the squared Euclidean distance between the d-dimensional feature vectors summarizing nodes. This is caused by two factors:
These problems primarily affect the unit vectors, because only individual nodes’ p,q-grams are hashed to seed the RNG and compute the d-dimensional unit vector. So while two similar leaf nodes may have dramatically different unit vectors (and therefore feature vectors), two trees which are mostly the same will nevertheless have mostly similar feature vectors. That is, RWS works in aggregate because similarity is considered across (relatively large) subtrees; the discontinuity introduced by hashing & the RNG is effectively amortized.
While that works well at larger scales, it falls down fairly consistently at smaller ones. That’s particularly problematic for our use case where small changes tend to dominate for human factors: smaller changes are easier to review. It’s also a factor in the unpredictability that we’ve noticed: reordering the syntax constructors changes the hash values assigned to each, which in turn can result in different results from our test fixtures.
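To make the discontinuity concrete, here’s a rough sketch of how a single node’s label could determine its unit vector (the hash-seeded RNG matches the description above; uniform-then-normalize sampling is a simplification of picking a point on the unit d-sphere):

```haskell
import Data.Hashable (hash)
import System.Random (mkStdGen, randomRs)

-- A node's label is hashed to seed the RNG that picks its unit vector,
-- so labels differing by a single character get unrelated vectors.
unitVector :: Int -> String -> [Double]
unitVector d label = map (/ norm) components
  where
    components = take d (randomRs (-1, 1) (mkStdGen (hash label)))
    norm       = sqrt (sum (map (^ 2) components))
```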
We should set up some CI to run the test suite for each candidate PR.
(To be clear, we were doing this before! We should just start doing it again now that the library is open sourced.)
We won’t always have all of the information available for a program. A partial (hah) list of conditions which can and will cause this:
Since we’re generally using approximate techniques, we will often be able to provide functionality proportional to the amount of information available to us—partial information should mean partial functionality, not zero functionality.
A brief and incomplete catalogue of strategies we can apply for this purpose:
`ghc`, for example by using unification to guess the types of unknown terms.

I would love to see a `Traversal` of the immediate children of `Core` so that we can use things like `transform` or `cosmos`.
If depending on lens is a problem, it's possible to just write the traversal without it:
https://github.com/ekmett/lens/wiki/How-can-I-write-lenses-without-depending-on-lens%3F#traversal
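For example, a van Laarhoven traversal of a node’s immediate children can be written with no lens dependency (the `Core` constructors here are invented for illustration; the real type has more cases):

```haskell
{-# LANGUAGE RankNTypes #-}

-- The standard lens-compatible traversal type, defined locally.
type Traversal' s a = forall f. Applicative f => (a -> f a) -> s -> f s

data Core = Var String | Lam String Core | App Core Core

-- Visit each immediate child exactly once; leaves have none.
children :: Traversal' Core Core
children _ v@(Var _) = pure v
children f (Lam x b) = Lam x <$> f b
children f (App a b) = App <$> f a <*> f b
```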
The CLI usage info doesn’t mention that you can specify per-file languages with `path/to.file:Lang`, and it also doesn’t mention the format for the language either there or in the usage for `semantic graph` under `--language`.
We should make sure to mention the existence of the former, the format of the language in both cases, and probably also the complete list of supported languages.
Similarly, if you get the format wrong all we say is:
```
option --language: cannot parse value `haskell'
```
We should tell you more about the format here as well.
As pointed out in https://twitter.com/mitchellsalad/status/1135618541376479233
We should modify our scope graph implementation so we can extend an initial scope graph computation with incremental changes.
To do so, we'll need to store partial paths to unknown declarations in the scope graph. Given the initial scope graph and a new set of files to parse, we should be able to generate a new scope graph by filling in the gaps using terms we find in those new files.
@joshvera pointed out that we can parallelize some of our SES implementation using the `Control.Parallel` primitives `par` & `pseq`. Specifically, in an edit graph like this one:
each vertex depends on everything to the right of and below it. That means that the top right and bottom left quarters are independent of each other, tho they both depend on the bottom right quarter, and the top left quarter depends on all three. (And it breaks down further, vertex by vertex, in much the same fashion.)
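A minimal sketch of that quadrant-level structure with `par`/`pseq` (`solveQuadrant` stands in for the real per-vertex SES computation):

```haskell
import Control.Parallel (par, pseq)

-- Bottom-right first; top-right and bottom-left depend only on it and
-- can be evaluated in parallel; top-left depends on all three.
solve :: (Int -> Int -> Int) -> Int
solve solveQuadrant =
  let br = solveQuadrant 1 1
      tr = solveQuadrant 0 1 + br
      bl = solveQuadrant 1 0 + br
      tl = solveQuadrant 0 0 + tr + bl
  in br `pseq` (tr `par` (bl `pseq` tl))
```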
In `Semantic.Git`, we should pull in attoparsec to parse the elements of `git ls-tree` output, rather than matching on `Text` values, which is slow.
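Something like this, assuming the default `<mode> <type> <oid>\t<path>` line format (a sketch, not the real `Semantic.Git` types):

```haskell
import           Data.Attoparsec.Text
import qualified Data.Text as T

-- One line of `git ls-tree` output: "<mode> <type> <oid>\t<path>".
data TreeEntry = TreeEntry
  { entryMode :: T.Text
  , entryType :: T.Text
  , entryOid  :: T.Text
  , entryPath :: T.Text
  }

treeEntry :: Parser TreeEntry
treeEntry = TreeEntry
  <$> takeWhile1 (/= ' ')  <* char ' '
  <*> takeWhile1 (/= ' ')  <* char ' '
  <*> takeWhile1 (/= '\t') <* char '\t'
  <*> takeTill (== '\n')
```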
We have internal documentation that describes how to get up and running with semantic quickly. We should port it to the `docs/` folder.
On the following TypeScript file:

```typescript
export type NoInfer<T> = T & { [K in keyof T]: T[K] };
```
I'm getting the following error:
```
cabal new-run semantic -- parse minimal.ts
Up to date
(Error)
[12:08:07] WARN minimal.ts:1:1-3:1: error: expected end of input nodes, but got ParseError
1: export type NoInfer<T> = T & { [K in keyof T]: T[K] };
^...
task=parse github_request_id=- github_is_public=False path=minimal.ts language=TypeScript
```
This is a minimal example, reduced from trying to parse the file https://github.com/palantir/tslint/blob/master/src/language/rule/abstractRule.ts.
As per this tweet from Neil Mitchell:
> GHC turns on the parallel GC if you have multiple threads. My experience: never faster, always slower, sometimes 5x slower wall clock time. Pass -qg to the runtime to disable. I don't think the default is right…
We should do some benchmarking to see if this, or this, has a positive effect on performance.
In our current scope graph implementation, declarations can optionally contain an associated scope. This has been a good solution, but as we look toward a future in which iterative scope graph calculation is important for supporting the type of indexing in the `alephd` service, we can be more precise about which declaration scopes require additional effort to compute completely and iteratively.
Rather than a `Maybe scope` for a declaration's associated scope, a sum type indicating the status of an associated scope gives us a way to store convenient data needed for later computation:
```haskell
data AssociatedScope scope name
  = AssociatedScope scope
  | NeedToBeCalculated name
  | NoAssociatedScope
```
This is not meant to be a prescriptive solution, but to illustrate the data needed for cases in which we can identify a hole in the scope graph during evaluation of a syntax term. In such cases, if we know an associated scope should exist but is not found, we can construct a placeholder associated scope (e.g. `NeedToBeCalculated`) that contains the `name` we should use for the declaration scope lookup later.
Hypothetically, after we store scope graphs in a database or data structure, the ability to query for declarations with missing associated scopes gives us a concrete list of holes we can attempt to fill in with the help of more recently computed scope graphs.
(Bug report originally written by @rewinfrey!)
Once upon a time, we implemented a strategy to resolve a bias in RWS favouring earlier elements that involved finding the nearest neighbours in both directions at once. This resolved the bias, but unfortunately it added about a fifth to both our runtime and heap usage.
`editDistanceUpTo` is not actually a constant-time approximation of the tree edit distance, because it relies on `cutoff`, which bounds recursion depth, but not traversal count.
The paper refers to their use of “an iterative deepending [sic] top-down matching that stops after a fixed number of compared nodes and is thus in O(1).” It seems likely that this was a typo of iterative deepening. This is the traversal we’d want to place a constant bound on.
À la carte `If` syntax is defined s.t. multiple if/else if/else are represented as a recursive right-chained series of `If`s. This means that diffing can’t consider all of the alternatives at once, and therefore that it has to proceed linearly (i.e. we don’t have a list that we can pass to `byRWS`). That in turn means that if a user inserts or deletes one `else if` from the middle of a chain, we won’t be able to match up anything after it; the whole rest of the chain will end up as a replacement.
This is going to happen any time we have a recursively nested chain of nodes instead of a list; `If` is the only example that came to mind immediately, but we should in general attempt to model syntactic sequences with a list field instead of with chains of nodes.
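A rough sketch of the two shapes (illustrative constructors, not the actual syntax types):

```haskell
-- Right-chained: each alternative hides the rest of the chain, so the
-- differ sees one clause at a time.
data If a = If { condition :: a, consequence :: a, alternative :: a }

-- List-shaped: all clauses are visible at once, so inserting or
-- deleting one clause mid-chain can still match up the remainder.
data Cond a = Cond { clauses :: [(a, a)], fallback :: Maybe a }
```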
We developed our RWS implementation in response to flaws of our SES implementation at the time:

- It performed n² comparisons in every case, not just the worst case of disjoint inputs.
- It had pretty high constant factors due to the long chains of thunks it built up (IIRC).
- It performed a prohibitive full tree diff for each comparison in order to select the lowest-cost path through the results.
Later on, we added a shallow SES pass to RWS to match up identical (and, later, equivalent) trees early, enabling us to lend more weight to human-readable identifiers (e.g. we’ll match two functions named “foo” at a high priority regardless of changes to their bodies). This shallow treatment of SES + the implementation of Myers’ algorithm for SES address those issues while also enjoying much more predictable & stable behaviour than RWS.
Given the problems with RWS, let’s experiment with adding an SES instruction to `Algorithm` and using it instead of RWS in some/all cases.
Our heap usage in invocations of `semantic` in this case is suboptimal: we eagerly load every file in a project into memory, and only relinquish it when tagging is complete. There’s no reason why we can’t do this in a fixed amount of memory, given that in production use cases memory can be very limited. My hypothesis is that if we switch to a streaming model—that is, if we make the `git ls-tree` process asynchronous, fire off a request for an `API.File` per line, then fold these `File` values into a single `Vector File`—we should see better memory performance. We should also be able to have more fine-grained concurrency control.
Streaming in Haskell is… well, it’s hard. There are so many options, so many libraries, so many idioms. I’ve been exploring this, and I’ve gotten a successful proof-of-concept up and running, one that shows significantly less peak heap usage. I think we should add real capabilities for `semantic` to do this.
Currently the infrastructure for `Task` is very specific to batch-oriented CLI use. We read a whole `Project` into memory, then use the `Distribute` effect to parallelize. It’s not particularly straightforward to just switch out the streaming mechanisms, so I plan to add a new command. What I’d like to see is something like this:
```
semantic stream --symbols --gitDir=../heaven --sha=HEAD
```
My proof-of-concept uses the `streaming-process` library to read the lines from `git ls-tree`, the `streaming-bytestring` package to parse those lines into git entries, the `streaming` primitives to turn those entries into `API.File` values, and the `foldl` package to consume those tags into a single `Vector`. Streaming architectures are hard to write and sometimes inflexible, but they’re very elegant when they work. In addition, I’ve gotten the `streaming` ecosystem to work with `fused-effects`. (I originally considered `streamly`, but it doesn’t have good support for reading from processes, and it relies on `MonadBaseControl`, which I consider harmful.)
This would be a nontrivial amount of engineering effort, but I’ve heard reports of people compiling 100k SLOC Haskell projects in two minutes. I think effort spent on this would positively impact our ability to turn around revisions quickly, in the manner that a scripting-language-based platform would.
Is there a plan to do a release of binaries? That would be quite helpful for people who just want to try out the tool.
Each of our various CLI flags uses slightly different mechanisms to yield a `Data.Project` that it can analyze. `parse` uses `Semantic.IO.findFilesInDir`, `parse --gitDir` uses `readBlobsFromGitRepo`, our `Util` functions use `readProjectFromPaths`, `Files` provides `readProject`, and there are probably more. This entails a whole lot of duplicated code and pain. Some of this complexity is essential (for parsing from git repos, we need to use `Git.lsTree`, but we don’t do that for arbitrary directories), but we should provide a single place for blobs, projects, etc. to be read, probably with the `Files` DSL.
We can, and should, bound our recursion, falling back to diffing by replacement once we’ve gone n levels deep into the tree.
Holes are essentially empty, meaning that we lose potentially useful information about what condition caused the hole. The various resuming handlers could store the error(s) in holes, and our abstract domains could be monoidal with respect to these (coalescing sets of error conditions together).
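A minimal sketch of the idea, assuming errors are represented as a set (illustrative types, not the real domain):

```haskell
import qualified Data.Set as Set

-- A hole that remembers the error conditions that produced it.
-- Plain Strings stand in for the real resumable error types.
data Value = Concrete Integer | Hole (Set.Set String)

-- Coalescing is monoidal: combining holes unions their error sets,
-- and a hole absorbs whatever it's combined with.
instance Semigroup Value where
  Hole a <> Hole b = Hole (Set.union a b)
  Hole a <> _      = Hole a
  _      <> v      = v
```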
Are there any plans to put `semantic` on any common system package managers (e.g. apt, brew, etc.)?
It's quite a long from-source install right now, and it would be really useful if there were some sort of distributed binary that could be installed faster.
Is there any plan to publish on hackage and stackage?
If you use both `--language` and a `:Lang` suffix on a path, you get a confusing error message:
```
λ :!semantic graph --language=Haskell ~/Desktop/test.md:Haskell
semantic: /Users/rob/Desktop/test.md:Haskell: openBinaryFile: does not exist (No such file or directory)
```
We should probably tell you that you can’t mix these formats instead.
This dies:
```python
dataset = (tf.data.TextLineDataset(file_path) # Read text file
    .skip(1) # Skip header row
    .map(decode_csv, num_parallel_calls=4) # Decode each line
    .cache() # Warning: Caches entire dataset, can cause out of memory
    .shuffle(shuffle_count) # Randomize elems (1 == no operation)
    .repeat(repeat_count) # Repeats dataset this # times
    .batch(32)
    .prefetch(1) # Make sure you always have 1 batch ready to serve
)
```
with:

```
[15:40:26] WARN ../models/samples/outreach/blogs/blog_custom_estimators.py:71:16-77:15: error: expected DottedName, but got Comment
69: return d
70:
71: dataset = (tf.data.TextLineDataset(file_path) # Read text file
^...
```
This is in tensorflow/models.
Saw this while generating the scope graph for https://github.com/pallets/flask. I believe the file was `flask/app.py`.
```
[08:14:22] WARN Pipeline error: Unhandled internal exception: importing from '"."' is not implemented
CallStack (from HasCallStack):
  error, called at src/Language/Python/Syntax.hs:104:58 in semantic-0.4.0-DkhVMfhhsF0KnFJ7Lg1Wz8:Language.Python.Syntax
```
Nothing about the library or executable is Unix-specific, so we should support Windows. Creating this issue to track it.
We should pull in something like https://github.com/nomeata/gipeda and graph our performance over time.
Ran into this exception when generating a scope graph for tensorflow/models. I think the file in question might be `global_objectives/loss_layers_test.py`.
```
Unhandled internal exception: fromRational has been applied to a repeating decimal which can't be represented as a Scientific! It's better to avoid performing fractional operations on Scientifics and convert them to other fractional types like Double as early as possible.
CallStack (from HasCallStack):
  error, called at src/Data/Scientific.hs:311:23 in scientific-0.3.6.2-KIIlODNgUeO9twnHef2PBn:Data.Scientific
```
https://github.com/jwiegley/gitlib

We’re currently reinventing a number of Git primitives poorly (usually as wrappers around `Text`). For our own sanity’s sake, let’s not maintain a Git ref parser: we should just use these data structures instead of those in `Semantic.Git`.
After spending a lot of time with the various streaming libraries, the best-equipped for our needs is `streaming` and the `streaming-*` ecosystem. We should convert our `machines`-based stuff to `streaming`.
Given a diff of branch nodes each with one child, a & aʹ respectively, RWS always finds a nearest neighbour, even when a and aʹ are substantially different. Note that if they are categorized differently it’ll still avoid performing the comparison, and also that this won’t affect its selection of (approximately) the best comparisons among larger branches.
The RWS-Diff paper uses the squared Euclidean distance of the feature vectors as its approximation of their edit distance, and perhaps we could set an upper bound in terms of that measurement; but we’d probably want to refine this quite a bit.
Found an infinite loop while generating the scope graph for keras. I believe the culprit is `keras/utils/vis_utils.py`.
This is while interpreting in the concrete domain.
We compute labels for every node. Currently, label similarity is computed by comparing hashes, which clamps it to either 0 (different) or 1 (identical). This is counterproductive: leaves that are only slightly distinct are treated identically to disjoint leaves.
This is necessitated by RWS, so a better similarity metric on labels will require better similarity metrics on trees and better subterm matching to have any effect.
See also #25.
@tclem & I found a while back that a property test with four args would use the same value for the latter three on every iteration. It’s like how digits work: you enumerate all the digits in the rightmost place before incrementing the next one, and then enumerate all the digits in the rightmost column again.
If you only do a thousand tests, we have enough `Category` constructors to almost exhaust that right there. (We no longer have those types, but the problem still occurs when we have annotations which can take multiple values.) Multiply that by the size of `Syntax` and so forth and you’d have to do an incredible number of tests before trying a different value for the second arg.
The enumeration we do right now is column-major, but we’d prefer an ordering that explores the space in a more balanced way; the Hilbert curve comes to mind, but Morton ordering may be more appropriate.
Alternatively, maybe we should move back to using QuickCheck or another stochastic property testing framework like hedgehog.
Provide a man page for the deployed tool. A comprehensive operator’s guide, essentially.
Is the whole file always parsed and diffed or can unchanged sections be easily reused?
The use case I have in mind is that LSP clients often sync the whole file or large sections when the user has made small changes, or a number of small changes interspersed in a large section. LSP servers usually do better with small changes. There is an open HIE issue for this.
`Statement.Try` is defined as a pair of a body node and a list of catch/finally/etc. nodes; `Statement.Catch` has an exception and a body; and `Statement.Finally` has a single node, for its body. This mirrors the structure of many surface syntaxes, but isn’t particularly amenable to evaluation: the `Evaluatable` instance for `Try` would have to go through its list of catch/finally nodes, categorize each according to what actions should be taken, and orchestrate them, potentially throwing exceptions if/when unanticipated syntax occurs in those positions. Meanwhile, the `Evaluatable` instances for `Catch` and `Finally` would have to be defined as essentially no-ops, despite them actually being in the ideal circumstances to work with their actual contents.
By contrast, if we structured a `try`/`catch`/`finally` block like so:
```javascript
try {
  …foo…
} catch (e) {
  …bar…
} finally {
  …quux…
}
```
as a left-nested structure like so:
```
Finally (Catch (Try (…foo…)) (e) (…bar…)) (…quux…)
```
then the `Evaluatable` instance for `Try` could simply evaluate its body, while the instances for `Catch` and `Finally` could insert the appropriate machinery to catch guest-language exceptions/run cleanup after exceptions. None of them would require any special knowledge of the semantics of any of the others, and we wouldn’t have to do any kind of meta-error-handling to deal with unexpected syntax (beyond what we already do with `Unspecialized` exceptions). The crucial detail here is that `Catch` and `Finally` both contain the term over whose effects they scope.
This would be ideally structured for analysis, but might be surprising to consumers expecting a simpler tree where `Try` is the root node.
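As a sketch, the left-nested shape could look like this (field names are assumptions based on the description above):

```haskell
-- Each wrapper contains the term over whose effects it scopes.
newtype Try a = Try { tryBody :: a }

data Catch a = Catch
  { catchScoped  :: a  -- the (Try …) term being guarded
  , catchBinding :: a  -- the exception binding, e.g. `e`
  , catchHandler :: a  -- the handler body
  }

data Finally a = Finally
  { finallyScoped  :: a  -- the (Catch …) or (Try …) term
  , finallyCleanup :: a  -- the cleanup body
  }
```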
According to your report here: https://github.com/github/semantic/blob/master/docs/why-haskell.md
Eta might help with some of your concerns about the language: in particular with error handling, due to improved compiler messages, and obviously all the JVM-related parts.