microsoft / lsif-node

Define an index format for Language Servers

License: MIT License

JavaScript 2.34% TypeScript 97.66%

lsif-node's Introduction

Language Server Index Format

The purpose of the Language Server Index Format (LSIF) is to define a standard format for language servers or other programming tools to dump their knowledge about a workspace. This dump can later be used to answer language server LSP requests for the same workspace without running the language server itself. Since much of the information would be invalidated by a change to the workspace, the dumped information typically excludes requests used when mutating a document. So, for example, the result of a code complete request is typically not part of such a dump.

A first draft specification can be found here.

How to Run the tools

> npm install -g lsif to install the LSIF tool chain.
> lsif tsc -p .\tsconfig.json --stdout creates a LSIF dump for the given typescript project. Output format is new line separated JSON.

If the project provides an npm package or depends on other npm modules, the TypeScript monikers can be converted into stable npm monikers. To do so, you can either ask the tsc tool to do that directly using

> lsif tsc -p .\tsconfig.json --package .\package.json --stdout

or you can run the tool separately in case you want to inspect the newly generated npm monikers using

lsif tsc -p .\tsconfig.json --stdout | lsif npm --stdin --package .\package.json --stdout

Please note that the tools are a work in progress and that we have not done any extensive testing so far. Known issues are:

Go to Declaration for function overloads doesn't honor the signature
Go to Type Declaration is not fully implemented
Document link support and go to implementation are completely missing

Both tools support --help to get information about their command line arguments.
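
Because the output is newline-separated JSON, a consumer can process a dump one line at a time. The following is a minimal sketch and not part of the lsif tool chain: `LsifElement` models only the `id`, `type`, and `label` fields visible in the sample dumps on this page, and `parseDump` is a hypothetical helper name.

```typescript
// Minimal shape of one dump line; real LSIF elements carry more fields.
interface LsifElement {
  id: number | string;
  type: "vertex" | "edge";
  label: string;
  [key: string]: unknown;
}

// Parse a newline-separated JSON dump into vertices and edges.
function parseDump(text: string): { vertices: LsifElement[]; edges: LsifElement[] } {
  const vertices: LsifElement[] = [];
  const edges: LsifElement[] = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") continue; // tolerate blank lines
    const element = JSON.parse(line) as LsifElement;
    (element.type === "vertex" ? vertices : edges).push(element);
  }
  return { vertices, edges };
}

const sampleDump = [
  '{"id":1,"type":"vertex","label":"document"}',
  '{"id":2,"type":"vertex","label":"range"}',
  '{"id":3,"type":"edge","label":"contains","outV":1,"inV":2}',
].join("\n");

const parsedDump = parseDump(sampleDump);
console.log(parsedDump.vertices.length, parsedDump.edges.length);
```

For large dumps a streaming reader would be preferable to loading the whole text into memory.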

LSIF utility tools

You can validate LSIF output using the LSIF utility tools.

LSIF extension

There is also an extension for VS Code that can serve the content of an LSIF JSON file. Once you have dumped the content of a workspace into an LSIF JSON file, you can use the extension to serve the supported LSP requests. This works as follows:

follow the steps in 'How to Run the tools' above and create a dump.
> git clone https://github.com/Microsoft/vscode-lsif-extension.git
> cd vscode-lsif-extension
> npm install
> npm run compile
open the workspace on vscode-lsif-extension using code.
switch to the debug viewlet and launch Launch Client
in the launched instance of VS Code open the command palette and execute the command: Open LSIF Database
in the open file picker dialog navigate to a created dump and select it.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Legal Notices

Licensed under the MIT License.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

lsif-node's Issues

Edge emitted before vertex is emitted

Running node .\node_modules\tsc-lsif\lib\main.js -p .\path\to\tsconfig.json --outputFormat=line | node .\node_modules\npm-lsif\lib\main.js .\path\to\package.json emits an edge before a vertex. I've reproduced this on this repo. In this case, an edge referring to vertex "4868" is emitted before the vertex. I've attached the out file for completeness.
out1.txt
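
A check for this ordering problem can be sketched as follows. This is an illustration only, not code from lsif-util: `StreamLine` and `findEarlyEdges` are hypothetical names, and ids are compared as strings because the report mentions vertex "4868" as a string.

```typescript
// Fields needed to check emission order; other fields are ignored.
interface StreamLine {
  id: number | string;
  type: "vertex" | "edge";
  outV?: number | string;
  inV?: number | string;
  inVs?: (number | string)[];
}

// Return the ids of edges that reference a vertex not yet seen in the stream.
function findEarlyEdges(lines: StreamLine[]): (number | string)[] {
  const seen = new Set<string>();
  const early: (number | string)[] = [];
  for (const line of lines) {
    if (line.type === "vertex") {
      seen.add(String(line.id));
      continue;
    }
    const targets = [line.outV, line.inV, ...(line.inVs ?? [])].filter(
      (v): v is number | string => v !== undefined
    );
    if (targets.some((v) => !seen.has(String(v)))) early.push(line.id);
  }
  return early;
}
```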

LSIF TSC emitter high resource usage

I've run the TSC emitter on https://github.com/Microsoft/vscode/ and it aborted with a V8 out-of-memory error when giving it a 4GB memory limit:

node --max-old-space-size=4096 ~/Code/lsif-node/tsc/bin/lsif-tsc -p tsconfig.json --projectRoot /Users/most/Code/vscode/src/ > vscode.lsif

However, it worked when I gave it 8GB limit.

Is that expected for a project of this size? What about really big projects?

The same applies to CPU usage: both TypeScript and the TSC emitter use one thread/process (low utilization) and take a long time to output the LSIF dump (console progress would be nice).

I wonder how that would work for big projects.

Tried running lsif-tsc and it crashed

I was running it over an internal codebase and it printed dots and then crashed with the following stack:

Error: Partition for symbol QKQAbLXLFhLyVyj6S2I7+g== have already been cleared
    at StandardSymbolData.getOrCreatePartition (/node_modules/lsif-tsc/lib/lsif.js:352:19)
    at AliasedSymbolData.addReference (/node_modules/lsif-tsc/lib/lsif.js:454:22)
    at Visitor.handleSymbol (/node_modules/lsif-tsc/lib/lsif.js:1546:20)
    at Visitor.visitIdentifier (/node_modules/lsif-tsc/lib/lsif.js:1527:14)
    at Visitor.visit (/node_modules/lsif-tsc/lib/lsif.js:1321:22)
    at node.forEachChild.child (/node_modules/lsif-tsc/lib/lsif.js:1334:45)
    at visitNode (/node_modules/typescript/lib/typescript.js:16622:24)
    at Object.forEachChild (/node_modules/typescript/lib/typescript.js:16986:24)
    at NodeObject.forEachChild (/node_modules/typescript/lib/typescript.js:121434:23)
    at Visitor.doVisit (/node_modules/lsif-tsc/lib/lsif.js:1334:18)

Allowing equal ranges

The current specification states:

The ranges emitted for a document must follow these rules:
No two ranges can be equal.

This is problematic for tagged ranges which may want to have multiple kinds of tags at the same location due to preprocessing of the source code.

Consider the following C++ example:

#define M int x;
M

The second line is both a reference to M but also a definition of x.
What would be the right way to express this with tagged ranges?
Should it be tagged just as a reference to M because the definition of x is not visible in the code?

Discuss differences between monikers and SymbolDescriptors

SourceGraph introduced the concept of a SymbolDescriptor to identify / locate a symbol (see https://github.com/sourcegraph/language-server-protocol/blob/workspace-dependencies/extension-workspace-references.md)

LSIF defines monikers, which are used to relate symbols without the need to interpret the moniker (at least not the programming-language-specific part). Monikers might need to be interpreted to know which module system and which module they relate to.

We should see if the two concepts can be unified.

Context support

How do we handle languages where the contents of a source file depend on the path from which it is imported? C/C++ is an example, where the content of a header file can vary if imported from two different .c files depending on defines.

One proposal to solve this is to add context information to the request edges using additional properties.

Moniker incorrectly computed for Union and intersection types

Example:

export type PropertyCreator = (
    instance: any,
) => void

export type BabelDescriptor = PropertyDescriptor & { initializer?: () => any }

The moniker for initializer is local since the type literal has no name. However, the type BabelDescriptor is exported and has a property initializer. So the moniker should be BabelDescriptor.initializer.

Should the LSIF only support non-deprecated LSP data types

For example, only MarkupContent

Moniker for Generics / things without a declaration

There are use cases where we need monikers for things that don't necessarily have a declaration. An example is C#, where when hovering over List the hover refers to List and not to List. To share the hover between all usages of List we need to have either a declaration for List or at least a result set.

resultSet not connected to any edge

Steps to reproduce:

Clone and install lsif-util
lsif-tsc -p .\tsconfig.json | node .\lib\main.js validate --stdin

The above line should validate the LSIF output of the lsif-tsc tool using the lsif-util itself as the repository to index.

13/452 (roughly 3%) of the resultSet vertices are not connected to any edge. This is likely a bug that should be investigated.
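
The check that produces such a count can be sketched as below. This is not the lsif-util implementation: `GraphElement` and `findUnconnectedResultSets` are hypothetical names, and only the `outV`, `inV`, and `inVs` properties seen elsewhere on this page are considered.

```typescript
interface GraphElement {
  id: number | string;
  type: "vertex" | "edge";
  label: string;
  outV?: number | string;
  inV?: number | string;
  inVs?: (number | string)[];
}

// Return the ids of resultSet vertices that no edge starts from or points to.
function findUnconnectedResultSets(elements: GraphElement[]): string[] {
  const connected = new Set<string>();
  for (const e of elements) {
    if (e.type !== "edge") continue;
    for (const v of [e.outV, e.inV, ...(e.inVs ?? [])]) {
      if (v !== undefined) connected.add(String(v));
    }
  }
  return elements
    .filter((e) => e.type === "vertex" && e.label === "resultSet")
    .map((e) => String(e.id))
    .filter((id) => !connected.has(id));
}
```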

[lsif-util] Tool skips thorough validation

I'm trying to run the validation tool on stdin, and I see a message saying "skipping thorough validation" because lsif-protocol is not found in the global node modules location. Is this expected? It would be nice if the lsif-util tool could install lsif-protocol as a dependency if it is one.

➜ node out/index.js | lsif-util validate --stdin
Reading input... done
Skipping thorough validation: /Users/arjun/.nvm/versions/node/v9.11.1/node_modules/lsif-protocol/lib/protocol.d.ts was not found
...

//cc @jumattos

Double check correct moniker generation for export statement inside module declaration

declare module 'applicationinsights' {
    export = ApplicationInsights;
}

"Already managed document data" error

@DanTup reported the following error while using lsif-tsc 0.4.1 on the Dart-Code repo. See full logs here: output.txt

Error: There was already a managed document data for file: /Users/dantup/Dev/Dart-Code/src/shared/analysis_server_types.ts
    at DataManager.getDocumentData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:910:19)
    at Visitor.getOrCreateDocumentData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1298:39)
    at Visitor.getOrCreateSymbolData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1333:37)
    at TypeAliasResolver.resolve (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:802:58)
    at result.dataManager.getOrCreateSymbolData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1347:148)
    at DataManager.getOrCreateSymbolData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:942:22)
    at Visitor.getOrCreateSymbolData (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1346:35)
    at Visitor.visitIdentifier (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1265:31)
    at Visitor.visit (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1073:22)
    at node.forEachChild.child (/Users/dantup/.npm-global/lib/node_modules/lsif-tsc/lib/lsif.js:1076:49)

Steps to repro

Verify that lsif-tsc 0.4.1 is installed
Clone the https://github.com/Dart-Code/Dart-Code repo
Run lsif-tsc -p ./tsconfig.json from the repo directory

Unfortunately, when I try to repro with lsif-tsc 0.4.1, I'm running into a different error, full logs here: full-logs.txt.

TypeError: sourceFiles.values(...) is not a function or its return value is not iterable
    at Visitor.getOrCreateSymbolData (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1332:44)
    at Visitor.visitIdentifier (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1265:31)
    at Visitor.visit (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1073:22)
    at node.forEachChild.child (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1076:49)
    at visitNode (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/node_modules/typescript/lib/typescript.js:16622:24)
    at Object.forEachChild (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/node_modules/typescript/lib/typescript.js:16975:24)
    at NodeObject.forEachChild (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/node_modules/typescript/lib/typescript.js:121404:23)
    at Visitor.visit (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1076:22)
    at node.forEachChild.child (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/lib/lsif.js:1076:49)
    at visitNode (/Users/arjun/.nvm/versions/node/v9.11.1/lib/node_modules/lsif-tsc/node_modules/typescript/lib/typescript.js:16622:24)

Hope this helps!

[nit] Property attribute is inconsistent (plural vs singular)

As a disclaimer, I know this issue is very nit-picky, but I thought it could be valuable to bring it up anyway.

The property attribute is inconsistent between different types of vertices. For example, the property for an edge between two reference results is referenceResults, plural. But the property for an edge from a reference result to a definition is simply definition, singular. As per LSIF specification, ReferenceResult looks like this:

export interface ReferenceResult {
  label: 'referenceResult';

  declarations?: (RangeId | lsp.Location)[];

  definitions?: (RangeId | lsp.Location)[];

  references?: (RangeId | lsp.Location)[];

  referenceResults?: ReferenceResultId[];
}

Therefore I would expect the property to be definitions, plural.

This kind of inconsistency happens even within LSIF. For example, the ImplementationResult:

interface ImplementationResult {
  label: 'implementationResult';

  result?: (RangeId | lsp.Location)[];

  implementationResults?: ImplementationResultId[];
}

Both result and implementationResults are array properties that might contain more than one item, but the former is singular and the latter, plural.

It would be less confusing if we kept everything consistent, especially as more languages adopt LSIF. It seems kind of arbitrary whether we use plural or singular for properties.

Remove skip of thorough validation

The lsif-util validation tool was originally written before the release of the lsif-protocol. For that reason, it supports skipping thorough validation, since the user had to clone the lsif-node repo separately to get the protocol file.

Now that lsif-protocol is a dependency, it should never be the case that the protocol file is not available. Skipping thorough validation has turned into a bug; if that happens, something is wrong with the tool.

Not finding the protocol file should be treated as an error and immediately shut down the tool.

Exception when running LSIF tool with npm rewriter

{"id":1,"type":"vertex","label":"metaData","version":"0.1.0"}
C:\src\VSCloudKernel\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\npm-lsif\lib\main.js:89
        this.outDir = outDir.charAt(outDir.length - 1) !== '/' ? outDir + '/' : outDir;
                             ^

TypeError: Cannot read property 'charAt' of undefined
    at ExportLinker.addOutAndRoot (C:\src\VSCloudKernel\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\npm-lsif\lib\main.js:89:30)
    at Interface.rd.on (C:\src\VSCloudKernel\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\npm-lsif\lib\main.js:278:34)
    at emitOne (events.js:116:13)
    at Interface.emit (events.js:211:7)
    at Interface._onLine (readline.js:280:10)
    at Interface._normalWrite (readline.js:422:12)
    at Socket.ondata (readline.js:139:10)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
events.js:183
      throw er; // Unhandled 'error' event
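
The trace shows `charAt` being called on an undefined `outDir`. Why the value is missing is not clear from the report; the sketch below only makes the failure explicit instead of crashing with a TypeError. `normalizeOutDir` is a hypothetical helper, not the actual npm-lsif code.

```typescript
// Ensure the directory ends with '/', and fail with a readable message when
// no directory is available at all.
function normalizeOutDir(outDir: string | undefined): string {
  if (outDir === undefined) {
    throw new Error("No output directory available; cannot rewrite monikers.");
  }
  return outDir.charAt(outDir.length - 1) !== "/" ? outDir + "/" : outDir;
}
```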

Offset encoding in LSIF dumps

The LSP defines that character offsets in lines (e.g. the character property in Positions) are indexes into a string encoded using utf-16. Strings are encoded in memory using utf-16 in programming languages like Java, JavaScript and .Net. However, newer languages use utf-8 as the encoding for strings in memory. This causes some friction in the LSP, mainly because:

implementors of clients and servers are not aware of this
there is no way right now to detect that the client and server are using a different encoding.

There is a push in LSP right now to improve this (no concrete action has been decided yet) but we should try to avoid the same problem in LSIF. So I would like to discuss possible options:

1. we pick one encoding that is supported in LSIF. This doesn't have to be utf-16. However, to my knowledge there is no best pick. utf-32 would be great since it is a fixed-width encoding, but almost no programming language has support for it, so it requires conversion on all sides. utf-16 is bad for new programming languages, utf-8 for more traditional languages. Byte offsets would need to be defined on an encoding as well, since we render characters in the UI, not bytes. So no actual benefit.
2. we allow creating dumps with different encodings and store the chosen encoding in the meta data. This means that someone will need to convert if client and server don't speak the same encoding. The tendency in LSP is currently to make servers do the conversion since they usually have access to the files. A web interface, for example, would have a hard time doing the conversion for a find all references result since it would need to fetch the content of all files mentioned in the references result. Note that this also requires that the file content is part of the dump.
3. we allow storing ranges with n encodings (practically utf-8 and utf-16). This doesn't mean that every range is duplicated since, as long as all characters on the line before the character offset are in the ASCII range (code points 0 to 127), the offsets are the same. So we could attach the converted ranges somehow to the original range.

I tend to go with a combination of 2. and 3. We store the encodings with the dump. The first encoding is the primary encoding the ranges are in. All other encodings are secondary and the value can be reached from the primary range using an edge like utf-8 or utf-16.
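
To make the conversion cost concrete, here is a minimal sketch of turning a utf-16 character offset into a utf-8 byte offset for a single line. It needs the line's text, which is why the file content has to be available to whoever converts. `utf16ToUtf8Offset` is a hypothetical helper, not part of LSP or LSIF.

```typescript
// Convert a utf-16 code unit offset within `line` to a utf-8 byte offset.
function utf16ToUtf8Offset(line: string, utf16Offset: number): number {
  let bytes = 0;
  let i = 0;
  while (i < utf16Offset && i < line.length) {
    const codePoint = line.codePointAt(i) as number;
    if (codePoint <= 0x7f) bytes += 1;
    else if (codePoint <= 0x7ff) bytes += 2;
    else if (codePoint <= 0xffff) bytes += 3;
    else bytes += 4;
    // Code points above 0xffff occupy two utf-16 code units.
    i += codePoint > 0xffff ? 2 : 1;
  }
  return bytes;
}

// For a pure ASCII prefix both offsets are identical.
console.log(utf16ToUtf8Offset("const x = 1;", 6));
```

As option 3 observes, the two offsets only diverge once a non-ASCII character appears before the position, so most ranges would not need a second value.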

Should the indexer be part of a language server or a standalone tool

Or both.

How to efficiently reach a moniker

I see the need for this. Possible options:

1. Assuming that we implement moniker mappings, we could think about inlining the moniker as a value object into the declaration range / result set as an optional property.
2. We spec that before a request-specific edge (e.g. textDocument/*) can be emitted for a range / result set, the corresponding moniker must be emitted if available. This makes sure that the DB importer knows the moniker before any request-specific result comes.
3. We move the whole format away from a vertex / edge model towards an embedded Id model (relational model). So a range or result set would have a monikerId. This way we force the moniker to be emitted before the range / result set.

I am actually in favor of 1. or 2. Option 3 gets complicated, IMO, with something like ranges. We would need to hold back the document until all ranges are known, since we would need to reference them from a contains property in the document. I am of the opinion that such a model will complicate writing these exporters.

Clarify relationships between vertices, especially result set.

We should make clear what the relationships introduced by edges between vertices are. For example, there can be only one DefinitionResult associated with a result set.

[visualization] Distance between vertices is incorrect for 1:N edges

With the advent of 1:N edges, the visualization distance became a little funky. Say we had the following LSIF input (v3):

{ "id": 1, "type": "vertex", "label": "document" }
{ "id": 2, "type": "vertex", "label": "range" }
{ "id": 3, "type": "vertex", "label": "range", "tag": { "text": "foo" } }
{ "id": 4, "type": "edge", "label": "contains", "outV": 1, "inV": 2 }
{ "id": 5, "type": "edge", "label": "contains", "outV": 1, "inV": 3 }

If we asked to visualize range 2 with distance 1, we would only see doc 1 and range 2 (correct behavior). That is because the distance between vertices 2 and 3 is 2.

However, the new LSIF input looks like this:

{ "id": 1, "type": "vertex", "label": "document" }
{ "id": 2, "type": "vertex", "label": "range" }
{ "id": 3, "type": "vertex", "label": "range", "tag": { "text": "foo" } }
{ "id": 4, "type": "edge", "label": "contains", "outV": 1, "inVs": ["2", "3"] }

Unfortunately, the tool cannot distinguish edge 4 as two separate edges. It now calculates the distance between vertices 2 and 3 to be 1, not 2. Therefore, both ranges will be printed out (incorrect behavior).
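
One possible fix is to treat every entry of inVs as its own edge when building the graph used for the distance computation. The sketch below illustrates that idea and is not the visualization tool's code; `GraphEdge`, `buildAdjacency`, and `vertexDistance` are hypothetical names.

```typescript
interface GraphEdge {
  outV: number | string;
  inV?: number | string;
  inVs?: (number | string)[];
}

// Build an undirected adjacency map, expanding each 1:N edge into N links.
function buildAdjacency(edges: GraphEdge[]): Map<string, string[]> {
  const adjacency = new Map<string, string[]>();
  const link = (a: string, b: string): void => {
    const neighbours = adjacency.get(a) ?? [];
    neighbours.push(b);
    adjacency.set(a, neighbours);
  };
  for (const edge of edges) {
    const targets = edge.inVs ?? (edge.inV !== undefined ? [edge.inV] : []);
    for (const target of targets) {
      link(String(edge.outV), String(target));
      link(String(target), String(edge.outV));
    }
  }
  return adjacency;
}

// Breadth-first search; returns -1 when the vertices are not connected.
function vertexDistance(adjacency: Map<string, string[]>, from: string, to: string): number {
  const dist = new Map<string, number>();
  dist.set(from, 0);
  const queue: string[] = [from];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const d = dist.get(current) as number;
    if (current === to) return d;
    for (const next of adjacency.get(current) ?? []) {
      if (!dist.has(next)) {
        dist.set(next, d + 1);
        queue.push(next);
      }
    }
  }
  return -1;
}
```

With the single 1:N contains edge from the example, ranges 2 and 3 are again two steps apart, matching the behavior of the two separate edges.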

Do we need to define the basic structure of the monikers

The current monikers use : as a separation character, which could be a valid file name character (at least under Linux and macOS). So we need to spec this and escape the character in file names.
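
One way such escaping could work is sketched below. Percent-encoding is an assumption made for illustration; the spec does not define an escape scheme, and `escapeMonikerPart`, `unescapeMonikerPart`, and `joinMonikerParts` are hypothetical helpers.

```typescript
// Percent-encode the separator and the escape character itself in one part.
function escapeMonikerPart(part: string): string {
  return part.replace(/%/g, "%25").replace(/:/g, "%3A");
}

// Reverse of escapeMonikerPart.
function unescapeMonikerPart(part: string): string {
  return part.replace(/%3A/g, ":").replace(/%25/g, "%");
}

// Join escaped parts so that ':' only ever appears as a separator.
function joinMonikerParts(parts: string[]): string {
  return parts.map(escapeMonikerPart).join(":");
}

console.log(joinMonikerParts(["tsc", "a:b.ts", "add"]));
```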

Full document request support for external libraries

The indexer dumps only referenced symbols for external libraries. Should we dump folding, document symbol, diagnostic, and document link information for the whole content? Should we support this via a flag?

Do monikers need to be either export or import

From @LukaszMendakiewicz

Is it actually required to differentiate between export and import monikers, or could export/import be just an optional property on a generic moniker vertex, for languages where this information is available unambiguously?

In C++, in the general case, there is no specific syntax to export or import a symbol (think libs, not DLLs), so any declaration may be either – I only know that at link time, based on whether a given project provided a definition (then it is exporting) or not (then it is importing). So 1) I would need to defer providing this information until I finish processing the entire project, and 2) I will never be able to tell whether a moniker is importing intentionally or by mistake (because the user forgot to provide a definition).

Additionally, for certain code elements like inline functions (defined in a header) or types there really is no single domicile project – they are neither exported nor imported in any, they are just one definition common to all. (Note that a header file may legally, and commonly is, not included in a project file.)

A range can be a declaration/definition and a reference in C#

Example:

public static implicit operator string(Program p)
{
    return "hello";
}

The word string is both the definition of the conversion operator and a reference to the type string.

Declarations being treated as definitions

Consider the following TypeScript code snippet:

interface Person {
    walk(): void;
}

class Student implements Person {
    walk(): void {
        console.log("Student walking...");
    }
}

Person.walk (line 2) is being recorded by the tool as a definition, although it doesn't have an implementation:

{
    "id": 22,
    "type": "vertex",
    "label": "range",
    "start": {
        "line": 1,
        "character": 4
    },
    "end": {
        "line": 1,
        "character": 8
    },
    "tag": {
        "type": "definition",
        "text": "walk",
        "kind": 7,
        "fullRange": {
            "start": "@{line=1; character=4}",
            "end": "@{line=1; character=17}"
        }
    }
}

This can be particularly troublesome for goto implementation, since it depends on differentiating between definitions and declarations (a declaration cannot be an implementation).

I believe the issue is related to the initializeDeclarations function and propose a check for implementation body before assigning the type of the vertex.
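
The proposed check could look roughly like this. `DeclarationLike` is a hypothetical stand-in for a compiler AST node and `tagTypeFor` is not a function from lsif-tsc; the sketch only shows the decision itself.

```typescript
// Hypothetical stand-in for an AST node; only the optional body matters here.
interface DeclarationLike {
  text: string;
  body?: unknown;
}

type RangeTagType = "declaration" | "definition";

// A member without an implementation body is a declaration, not a definition.
function tagTypeFor(node: DeclarationLike): RangeTagType {
  return node.body !== undefined ? "definition" : "declaration";
}

// Person.walk has no body, Student.walk does.
console.log(tagTypeFor({ text: "walk" }), tagTypeFor({ text: "walk", body: {} }));
```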

Project Reference Support

HeeJae Chang brought up the subject of local project references and how we should model them in LSIF. The use case is a solution containing n projects which depend on each other. TypeScript supports a comparable feature, so I started to implement project references for TS. A first cut can be found here: https://github.com/Microsoft/lsif-typescript/tree/dbaeumer/projectReferences

To make this work nicely I started to make changes to the moniker. Some of them came up earlier as well. Monikers are currently a string with no defined structure. An example is “npm:myModule:folder/provide:add” which denotes a symbol add exported by the npm package myModule under the path folder/provide. The nice thing about the string is that it is easy to match. However, fuzzy matching becomes complicated, especially when we add version information. Structurally a moniker can have the following properties:

export interface Moniker {
	/**
	 * The actual moniker name.
	 */
	name: string;

	/**
	 * The path of the moniker. Usually this is a relative file path
	 * inside the project.
	 */
	path?: string;

	/**
	 * The package manager through which this moniker is accessible
	 */
	packageManager?: string;

	/**
	 * The name of the package that provides the symbol.
	 */
	package?: string;

	/**
	 * A version identifier
	 */
	version?: string;
}

The question now is how we model this in LSIF. I see the following options:

1. we keep it as a string and define its structure and separator characters
2. we store it as an object literal
3. we break the moniker up and store parts of it with the import, export result and with a new package vertex

Let me clarify the last item a little: a lot of this information is redundant and repeats for every moniker. For example all exported symbols for a file provide.ts will have the path set to provide. And if the moniker comes from a package managed by a package manager all these values are the same for all exported monikers of that package.
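
To illustrate the first option, the structured form can be flattened into the string shape of the example above. This is a sketch under assumptions: the part order is taken from the “npm:myModule:folder/provide:add” example, `monikerToString` is a hypothetical helper, and the version is left out because nothing above says where it would go.

```typescript
interface Moniker {
  name: string;
  path?: string;
  packageManager?: string;
  package?: string;
  version?: string;
}

// Flatten to "packageManager:package:path:name", skipping absent parts.
function monikerToString(m: Moniker): string {
  return [m.packageManager, m.package, m.path, m.name]
    .filter((part): part is string => part !== undefined)
    .join(":");
}

const addMoniker: Moniker = {
  name: "add",
  path: "folder/provide",
  packageManager: "npm",
  package: "myModule",
};
console.log(monikerToString(addMoniker));
```

Skipping absent parts keeps the string short but makes it ambiguous to parse back, which is one reason the structure and separators would need to be specified.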

My questions are now:

can someone think of any other property necessary on a moniker?
do we need to spec the version (semver) or even make this a literal as well to support different version schemes?
do we want to break up the moniker (see 3. above)? I personally would not do this and would leave storage optimization to the DB importer.

Feedback welcome

LSIF TSC emitter crash with TypeScript error

Running LSIF on vscode source code crashed with an error:

TypeError: Cannot read property 'kind' of undefined
    at pipelineEmitWithHint (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:83854:39)
    at print (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:83754:13)
    at Object.writeNode (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:83614:13)
    at /Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:107880:50
    at Object.mapToDisplayParts (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:96944:13)
    at Object.getSymbolDisplayPartsDocumentationAndSymbolKind (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:107878:61)
    at /Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:121297:41
    at Object.runWithCancellationToken (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:31434:28)
    at Object.getQuickInfoAtPosition (/Users/most/Code/lsif-node/tsc/node_modules/typescript/lib/typescript.js:121296:34)
    at Visitor.getHover (/Users/most/Code/lsif-node/tsc/lib/lsif.js:1447:46)

That was reproducible over the code: ...args: TS in https://github.com/Microsoft/vscode/blob/b2cd559eb1481e220c587f044a514b475e3c6a1e/src/vs/platform/instantiation/common/instantiation.ts#L98

I've worked around it by adding a try/catch that returns undefined in lsif.ts Visitor.getHover.

Before I send a PR, I'm wondering if the emitter should be resilient to the underlying language service errors.
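
The workaround pattern can be sketched generically as follows. `tryOrUndefined` is a hypothetical helper, not code from lsif-tsc; it wraps any language-service call, reports the failure, and lets the emitter continue without that piece of information.

```typescript
// Run a call; on failure report the error and return undefined instead.
function tryOrUndefined<T>(call: () => T, onError: (error: unknown) => void): T | undefined {
  try {
    return call();
  } catch (error) {
    onError(error);
    return undefined;
  }
}

// Usage: a failing hover lookup no longer aborts the whole dump.
const hoverErrors: unknown[] = [];
const hoverResult = tryOrUndefined<string>(() => {
  throw new TypeError("Cannot read property 'kind' of undefined");
}, (error) => hoverErrors.push(error));
console.log(hoverResult, hoverErrors.length);
```

Reporting the error rather than swallowing it silently keeps the underlying language-service bug visible.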

Support mode where only the root of a project tree is dumped and not its referenced projects

The vertex in question:

{"id":3,"type":"vertex","label":"document","uri":"file:///c:/Users/dirkb/Projects/mseng/VSCode/lsif-node/samples/typescript-ref/shared/src/provide.ts","languageId":"typescript","contents":"ZXhwb3J0IGZ1bmN0aW9uIGFkZChhOiBudW1iZXIsIGI6IG51bWJlcik6IG51bWJlciB7DQoJcmV0dXJuIGEgKyBiOw0KfQ=="}

Consider reducing redundancy between declaration and references

Consider the following C++ code:

void foo(); // R1
void foo(); // R2

where R1 and R2 indicate two distinct Range vertices.

Following the current LSIF spec, the output should contain the following vertices:

Range for R1
Range for R2
ResultSet
DeclarationResult
ReferenceResult

and the following edges:

a) 1 -> 3; refersTo
b) 2 -> 3; refersTo
c) 3 -> 4; textDocument/declaration
d) 3 -> 5; textDocument/references
e) 5 -> 1; item (property=declaration)
f) 5 -> 2; item (property=declaration)
g) 4 -> 1
h) 4 -> 2

Note that "e" and "f" could be implicit edges (inlined into an array), and "g" and "h" in the current spec are implicit edges (formed by the result field array in DeclarationResult), but that does not affect the crux of the issue.

The crux is that "e" is redundant with "g", and "f" is redundant with "h", and such redundancy would be repeated for every additional declaration in the source code.

Similar redundancy would be present for DefinitionResult and item (property=definition), although on much smaller scale (in general, there is only one definition for a given ResultSet).

I propose that declaration and definition edges going out of ReferenceResult are dropped. They can easily be reconstructed for serving textDocument/references query by augmenting that result set with the results from textDocument/declaration and textDocument/definition from the same ResultSet. Note also that after dropping these edges, "property" field on the "item" edge will no longer be necessary, since "reference" and "referenceResult" could be distinguished by the type of the incoming vertex of the "item" edge.

Are there languages that would make the above not possible?
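
The reconstruction described in the proposal can be sketched like this. `ResultSetResults` and `allReferences` are hypothetical names, and the `includeDeclaration` flag mirrors the field of the same name in the LSP reference request context.

```typescript
// Range ids reachable from one ResultSet, grouped by the result they come from.
interface ResultSetResults {
  declarations: string[]; // from textDocument/declaration
  definitions: string[]; // from textDocument/definition
  references: string[]; // stored on the ReferenceResult itself
}

// Serve textDocument/references by merging the three results.
function allReferences(results: ResultSetResults, includeDeclaration: boolean): string[] {
  const merged = includeDeclaration
    ? [...results.declarations, ...results.definitions, ...results.references]
    : [...results.references];
  // Drop duplicates while keeping first-seen order.
  return merged.filter((id, index) => merged.indexOf(id) === index);
}

// R1 and R2 from the example are declarations; range 7 is a plain reference.
console.log(allReferences({ declarations: ["1", "2"], definitions: [], references: ["7"] }, true));
```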

Should some of the edges be modeled as a back reference to save space

For example, the contains edge between a document and range vertex could be modeled as a belongsTo property on the range.

On 1:N edges and retro compatibility

Version 0.4 of LSIF converted some edges from 1:1 to 1:N. It is not clear, however, if the change is retro compatible. The following contains edge, for example:

{ "id": 4, "type": "edge", "label": "contains", "outV": 1, "inV": 2 }

Would this still be a valid edge? According to the protocol, contains is now 1:N and should have the array property "inVs". The v0.4 equivalent would be:

{ "id": 4, "type": "edge", "label": "contains", "outV": 1, "inVs": ["2"] }

Are both notations valid? If so, the protocol needs to be changed to be retro compatible.
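
If both notations were accepted, a reader could normalize them as sketched below. Whether the protocol should allow this is exactly the open question; `RawEdge` and `targetsOf` are hypothetical names.

```typescript
interface RawEdge {
  id: number | string;
  label: string;
  outV: number | string;
  inV?: number | string;
  inVs?: (number | string)[];
}

// Return the edge targets regardless of whether the 1:1 or 1:N form was used.
function targetsOf(edge: RawEdge): (number | string)[] {
  if (edge.inVs !== undefined) return edge.inVs;
  if (edge.inV !== undefined) return [edge.inV];
  throw new Error(`Edge ${edge.id} has neither inV nor inVs`);
}

console.log(targetsOf({ id: 4, label: "contains", outV: 1, inV: 2 }));
console.log(targetsOf({ id: 4, label: "contains", outV: 1, inVs: ["2"] }));
```

Note that the two examples above also differ in id type (the number 2 versus the string "2"), so a reader would still need to compare ids consistently.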

Monikers for public visible only

Should we only have monikers for publicly visible symbols?

Should we make declarations and definitions an array type only

In LSP, they are defined as Location | Location[].

"scheme" or "schema"

The protocol uses both the term "scheme":

The scheme of the moniker. For example tsc or .Net
scheme: string;

and "schema":

schema owners are allowed to define the structure if they want.

I think the intended term here is "schema", so I suggest the protocol is adjusted to use that as the name of the field while it is still early and can be changed without breaking too many customers.

Optimizing the text content of libraries

The current approach is to embed these into the document#contents property. But they might be the same for most projects. Should we support reusing them?

Moniker schema versioning

As the languages evolve and/or exporter defects are fixed, we may end up in situations where we want to revise the schema used for monikers. LSIF exporters can trivially achieve that by appending a version number to Moniker.scheme value.

Changing the moniker schema would render old LSIF dumps incompatible (unlinkable) with newer LSIF dumps though.

Should the moniker schema version then be a high-level concept that is advertised in the LSIF metadata, so that tools can detect mismatches more effectively?

(Note that #10 discussed a similar but different question about embedding the source code package version in the moniker.)

Add support for default exports

Consider the following minimal project:

exporter.ts

export default function() {
    return 5;
}

index.ts

import five from './exporter'

function main() {
    five();
}

Currently, the referenceResult for five points to the 2 references in index.ts, skipping the default export reference defined in exporter.ts.

Should we support lazy results for declarations and definitions?

This might be useful for languages like C/C++ where declarations and definitions can be spread over many files.

Moniker and PackageInformation edge output in LSIF

@dbaeumer, would it be possible to require a moniker and its package information link to be fully emitted before the moniker is used by any ranges or results in LSIF output? Are there any scenarios where that might not be possible?

That would help us finalize a moniker node and create a hash for it while referencing it from other objects. Otherwise, we would have to wait until the end of the LSIF output to finalize everything, which would require a second pass.

Thanks

What kind of output formats should LSIF support

Currently 'line' and 'json' are supported. Do we need a compressed or binary format to make the dumps smaller, or is a zip enough?

Support to shard the LSIF output per document and emit begin/end events

Currently the LSIF output is a pure graph with no indication of whether more information will be available for a certain document or symbol. This makes it very hard to decide when emitted information can be optimized for a different storage format (for example, to store it in a DB). Currently the whole output must be stored (for example, on disk) and then post-processed. To ease this, and to be able to decide early on whether information differs from a previous dump, we propose the following changes to the LSIF output format:

Emit events

The following events will be emitted:

begin of a project indexing
end of a project indexing
begin of a document indexing
end of a document indexing

After the end event has been emitted, no more information regarding the project or document can be emitted. It is allowed to have n projects/documents open for processing (necessary to support cyclic references). However, the exporting tool should do its best to keep the number of open projects/documents to a minimum.
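A consumer could enforce these rules with a small amount of bookkeeping. The sketch below is one possible shape for that; the LsifEvent layout (kind, scope, data) and the OpenScopeTracker class are assumptions made for this example, since the proposal above does not yet define how events are encoded.

```typescript
// Hypothetical event encoding; the proposal does not define one yet.
type Id = number | string;

interface LsifEvent {
    kind: 'begin' | 'end';
    scope: 'project' | 'document';
    data: Id; // id of the project or document vertex the event refers to
}

class OpenScopeTracker {
    private readonly open = new Set<Id>();
    private readonly closed = new Set<Id>();

    // Record a begin or end event and reject sequences that violate the rules.
    handle(event: LsifEvent): void {
        if (event.kind === 'begin') {
            if (this.open.has(event.data) || this.closed.has(event.data)) {
                throw new Error(`${event.scope} ${event.data} was already begun`);
            }
            this.open.add(event.data);
        } else {
            if (!this.open.delete(event.data)) {
                throw new Error(`end without matching begin for ${event.scope} ${event.data}`);
            }
            this.closed.add(event.data);
        }
    }

    // Call before accepting any element that belongs to the given project or document.
    assertOpen(id: Id): void {
        if (!this.open.has(id)) {
            throw new Error(`No information may be emitted for ${id}: it is not open`);
        }
    }

    // Number of projects/documents currently open; exporters should keep this small.
    openCount(): number {
        return this.open.size;
    }
}
```

Once a document's end event has been handled, a consumer could safely move everything it has collected for that document into its final storage format.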

Shard results per document

Results that are split across documents (for example, reference results) are currently emitted using lazy results (e.g. items are added using an item edge from a reference result to a range). This has two disadvantages:

the output is large since a lot of item edges are emitted
it is hard to determine whether the portion of a result belonging to a document has changed compared to the last LSIF export; however, the begin/end events will already ease this

Since LSIF exporting tools usually process projects on a file-by-file basis, we propose to shard the results per document as well.
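To make the sharding idea concrete, the sketch below groups the ranges of each result by the document that contains them, so that one shard per result and document could be emitted instead of one item edge per range. ItemEdge and shardByDocument are hypothetical names, and the range-to-document map is assumed to have been built from the contains edges.

```typescript
// Hypothetical, simplified shapes; not taken from the LSIF protocol definition.
type Id = number | string;

interface ItemEdge {
    outV: Id; // the result (for example a reference result)
    inV: Id;  // a range that is part of the result
}

// Returns result id -> document id -> range ids belonging to that document.
function shardByDocument(
    items: ItemEdge[],
    documentOfRange: Map<Id, Id>
): Map<Id, Map<Id, Id[]>> {
    const shards = new Map<Id, Map<Id, Id[]>>();
    for (const item of items) {
        const documentId = documentOfRange.get(item.inV);
        if (documentId === undefined) {
            throw new Error(`Range ${item.inV} is not contained in any document`);
        }
        let perDocument = shards.get(item.outV);
        if (perDocument === undefined) {
            perDocument = new Map<Id, Id[]>();
            shards.set(item.outV, perDocument);
        }
        const ranges = perDocument.get(documentId);
        if (ranges === undefined) {
            perDocument.set(documentId, [item.inV]);
        } else {
            ranges.push(item.inV);
        }
    }
    return shards;
}
```

Comparing a single document's shard against the previous export is then a comparison of two small arrays rather than a scan over all item edges.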

Do we need to add version information to the monikers?

Edge before vertex is seen

When I use the tool on vsls-guestbook commit# b93232d95da96cb39744151914f9de869358e763, there's an issue where vertex 422 is referenced by edge 610 before it is output.

610 is the only edge that is output too early; everything else, including other edges to and from 422, works.
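A check like the following could be used to spot such cases in a line-delimited dump. It is only a sketch: GraphElement is a simplified, hypothetical view of an LSIF element, and findForwardReferences is not part of the existing tools.

```typescript
// Simplified, hypothetical view of one element of a line-delimited LSIF dump.
type Id = number | string;

interface GraphElement {
    id: Id;
    type: 'vertex' | 'edge';
    outV?: Id;
    inV?: Id;
    inVs?: Id[];
}

interface ForwardReference {
    edge: Id;    // id of the offending edge
    missing: Id; // id of the vertex that had not been emitted yet
}

// Report every edge that refers to a vertex which appears later in the dump (or never).
function findForwardReferences(lines: string[]): ForwardReference[] {
    const seen = new Set<Id>();
    const problems: ForwardReference[] = [];
    for (const line of lines) {
        if (line.trim() === '') {
            continue;
        }
        const element = JSON.parse(line) as GraphElement;
        if (element.type === 'vertex') {
            seen.add(element.id);
            continue;
        }
        const targets: Id[] = [];
        if (element.outV !== undefined) { targets.push(element.outV); }
        if (element.inV !== undefined) { targets.push(element.inV); }
        if (element.inVs !== undefined) {
            for (const inV of element.inVs) { targets.push(inV); }
        }
        for (const target of targets) {
            if (!seen.has(target)) {
                problems.push({ edge: element.id, missing: target });
            }
        }
    }
    return problems;
}
```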

Do edges really need ids?

Edges are used to connect vertices but don't need to be identified separately.

Write 0.4.x version of the spec.

Now that we have an implementation for TS, we should start writing the changes into a 0.4.x version of the spec.

What command line arguments need to be standardized for an indexer?

npm-lsif fails in version 0.1.0

Running version 0.1.0 of the tsc-lsif and npm-lsif tools fails on certain TypeScript repos with the following error:

..\node_modules\npm-lsif\lib\main.js:104
let outUri = this.mapToOut(document.uri);
^

TypeError: Cannot read property 'uri' of undefined
at ExportLinker.exports (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\npm-lsif\lib\main.js:104:45)
at Interface.rd.on (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\npm-lsif\lib\main.js:255:34)
at emitOne (events.js:116:13)
at Interface.emit (events.js:211:7)
at Interface._onLine (readline.js:280:10)
at Interface._normalWrite (readline.js:422:12)
at Socket.ondata (readline.js:139:10)
at emitOne (events.js:116:13)
at Socket.emit (events.js:211:7)
at addChunk (_stream_readable.js:263:12)
events.js:183
throw er; // Unhandled 'error' event
^

Error: EPIPE: broken pipe, write
at Socket._write (internal/net.js:23:7)
at doWrite (_stream_writable.js:396:12)
at writeOrBuffer (_stream_writable.js:382:5)
at Socket.Writable.write (_stream_writable.js:290:11)
at Socket.write (net.js:711:40)
at Object.emit (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\tsc-lsif\lib\emitters\line.js:14:19)
at Visitor.emit (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\tsc-lsif\lib\lsif.js:1321:22)
at Visitor.endVisitSourceFile (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\tsc-lsif\lib\lsif.js:1070:22)
at Visitor.doVisit (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\tsc-lsif\lib\lsif.js:994:18)
at Visitor.visit (C:\VSCloudKernel.jepetty\bin\LSIFLanguageServer\Debug\netcoreapp2.1\node_modules\tsc-lsif\lib\lsif.js:954:22)

The repo/pull request this repros on is https://github.com/vsls-contrib/pomodoro/pull/3/files and the command used to invoke the tools is node ..\tsc-lsif\lib\main.js -p C:\pomodoro\tsconfig.json --outputFormat=line | node .\lib\main.js -p C:/pomodoro/package.json

Investigate if the TSC LSIF tool should distinguish between declaration and definition

LSIF and not the LSP supports this; however, the TS language server has no concept of a definition. We should investigate whether we gain anything by separating the two.

Moniker placement - range vs resultset

The reason for this is currently that the ResultSet vertex is optional and considered an optimization. LSIF output that links directly from a range to a hover result using the textDocument/hover edge is perfectly valid as well. I am reluctant to force everybody to emit a ResultSet; however, I fully acknowledge that a moniker cannot be emitted late if the partitioning optimization is to be implemented. Alternatively, we could say that a moniker can only be emitted if there is a ResultSet, but I am not sure whether that is too restrictive as well.