microsoft / tolerant-php-parser

An early-stage PHP parser designed for IDE usage scenarios.

License: MIT License

PHP 99.34% TypeScript 0.53% JavaScript 0.03% Dockerfile 0.03% Shell 0.08%

php parser error-tolerant fully-representative ast fast memory-efficient

tolerant-php-parser's Introduction

Tolerant PHP Parser

This is an early-stage PHP parser designed, from the beginning, for IDE usage scenarios (see Design Goals for more details). There is still a ton of work to be done, so at this point, this repo mostly serves as an experiment and the start of a conversation.

This is the v0.1 branch, which changes data structures to support syntax added after the initial 0.0.x release line.

Get Started

After you've configured your machine, you can use the parser to generate and work with the Abstract Syntax Tree (AST) via a friendly API.

<?php
// Autoload required classes
require __DIR__ . "/vendor/autoload.php";

use Microsoft\PhpParser\{DiagnosticsProvider, Node, Parser, PositionUtilities};

// Instantiate new parser instance
$parser = new Parser();

// Return and print an AST from string contents
$astNode = $parser->parseSourceFile('<?php /* comment */ echo "hi!"');
var_dump($astNode);

// Gets and prints errors from AST Node. The parser handles errors gracefully,
// so it can be used in IDE usage scenarios (where code is often incomplete).
$errors = DiagnosticsProvider::getDiagnostics($astNode);
var_dump($errors);

// Traverse all Node descendants of $astNode
foreach ($astNode->getDescendantNodes() as $descendant) {
    if ($descendant instanceof Node\StringLiteral) {
        // Print the Node text (without whitespace or comments)
        var_dump($descendant->getText());

        // All Nodes link back to their parents, so it's easy to navigate the tree.
        $grandParent = $descendant->getParent()->getParent();
        var_dump($grandParent->getNodeKindName());

        // The AST is fully-representative, and round-trippable to the original source.
        // This enables consumers to build reliable formatting and refactoring tools.
        var_dump($grandParent->getLeadingCommentAndWhitespaceText());
    }

    // In addition to retrieving all children or descendants of a Node,
    // Nodes expose properties specific to the Node type.
    if ($descendant instanceof Node\Expression\EchoExpression) {
        $echoKeywordStartPosition = $descendant->echoKeyword->getStartPosition();
        // To cut down on memory consumption, positions are represented as a single integer
        // index into the document, but their line and character positions are easily retrieved.
        $lineCharacterPosition = PositionUtilities::getLineCharacterPositionFromPosition(
            $echoKeywordStartPosition,
            $descendant->getFileContents()
        );
        echo "line: $lineCharacterPosition->line, character: $lineCharacterPosition->character";
    }
}

Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed, and we'll see what we can do! Please also file bugs for any unexpected behavior in the parse tree. We're still in our early stages, and any feedback you have is much appreciated 😃.

Design Goals

Error tolerant design - in IDE scenarios, code is, by definition, incomplete. In the case that invalid code is entered, the parser should still be able to recover and produce a valid + complete tree, as well as relevant diagnostics.
Fast and lightweight (should be able to parse several MB of source code per second, to leave room for other features).
- Memory-efficient data structures
- Allow for incremental parsing in the future
Adheres to PHP language spec, supports both PHP5 and PHP7 grammars
Generated AST provides properties (fully representative, etc.) necessary for semantic and transformational operations, which also need to be performant.
- Fully representative and round-trippable back to the text it was parsed from (all whitespace and comment "trivia" are included in the parse tree)
- Possible to easily traverse the tree through parent/child nodes
- < 100 ms UI response time, so each language server operation should be < 50 ms to leave room for all the other stuff going on in parallel.
Simple and maintainable over time - parsers have a tendency to get really confusing, really fast, so readability and debug-ability is high priority.
Testable - the parser should produce provably valid parse trees. We achieve this by defining and continuously testing a set of invariants about the tree.
Friendly and descriptive API to make it easy for others to build on.
Written in PHP - make it as easy as possible for the PHP community to consume and contribute.

Current Status and Approach

To ensure a sufficient level of correctness at every step of the way, the parser is being developed using the following incremental approach:

Phase 1: Write lexer that does not support PHP grammar, but supports EOF and Unknown tokens. Write tests for all invariants.
Phase 2: Support PHP lexical grammar, lots of tests
Phase 3: Write a parser that does not support PHP grammar, but produces tree of Error Nodes. Write tests for all invariants.
Phase 4: Support PHP syntactic grammar, lots of tests
Phase 5 (in progress 🏃): Real-world validation and optimization
- Correctness: validate that there are no errors produced on sample codebases, benchmark against other parsers (investigate any instance of disagreement), fuzz-testing
- Performance: profile, benchmark against large PHP applications
Phase 6: Finalize API to make it as easy as possible for people to consume.

Additional notes

A few of the PHP grammatical constructs (namely yield-expression and template strings) are not yet supported, and there are also other miscellaneous bugs. However, because the parser is error-tolerant, these errors are handled gracefully, and the resulting tree is otherwise complete. To get a more holistic sense of where we are, you can run the "validation" test suite (see Contributing Guidelines for more info on running tests), or simply take a look at the current validation test results.

Even though we haven't yet begun the performance optimization stage, we have seen promising results so far, and have plenty more room for improvement. See How It Works for details on our current approach, and run the Performance Tests on your own machine to see for yourself.

Learn more

🎯 Design Goals - learn about the design goals of the project (features, performance metrics, and more).

📖 Documentation - learn how to reference the parser from your project, and how to perform operations on the AST to answer questions about your code.

👀 Syntax Visualizer Tool - get a more tangible feel for the AST. Get creative - see if you can break it!

📈 Current Status and Approach - how much of the grammar is supported? Performance? Memory? API stability?

🔧 How it works - learn about the architecture, design decisions, and tradeoffs.

💖 Contribute! - learn how to get involved, check out some pointers to educational commits that'll help you ramp up on the codebase (even if you've never worked on a parser before), and recommended workflows that make it easier to iterate.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

tolerant-php-parser's Issues

Add VS Code launch config

Anything to make it easier for people to get started working with and contributing to the codebase

Improve DelimitedList error tolerance

We should consider creating a separate parse context for delimited list types, so we can handle them as we do in parseList

Enable more concrete rulesets on DelimitedList

Some delimited lists can have empty elements, and some delimited lists cannot. However, we don't specify concrete rulesets for each of these list types, and everything gets parsed in the most lenient way possible. This produces a valid tree, but also means that we're not providing errors where we should. We should revisit all of the classes that extend DelimitedList and define concrete rulesets for each.

Better instructions for building syntax visualizer from source

Right now, the instructions are pretty sparse, and probably don't make much sense for anyone who hasn't built an extension in VS Code.
https://github.com/Microsoft/tolerant-php-parser/tree/master/syntax-visualizer/client#build-from-source

Memory profiling tools?

Right now, it's fairly obvious where the majority of our memory usage stems from (ahem PHP objects...), and we have some ideas on how to cut down on that when it comes to reducing the cost of the token representation, but as we progress, it would be helpful to have something other than raw arithmetic 😉.

Add API for `getFullyQualifiedName`

Named nodes should expose a getFullyQualifiedName API.

Should follow rules defined here:
http://php.net/manual/en/language.namespaces.rules.php

Add basic tests for syntax visualizer extension

The API is shifting around quite a bit right now, so this'll help us ensure things stay in sync.

Open source parser!

How this tool differs to nikic/PHP-Parser

Just wondering, the PHP world now uses https://github.com/nikic/PHP-Parser for parsing.

What was the motivation to create a new tool for parsing instead of extending nikic's one?

For which use cases should I use this one?

Thank you

Refresh string literal representation

The representation of string literals is still a work-in-progress, and the implementation became even more inconsistent when we moved away from the handspun lexer towards PhpTokenizer. We should put some thought into how best to represent strings and template strings.

Adopt new Token representation

As discussed in HowItWorks.md#notes, there are ways that we can significantly reduce memory usage by moving away from objects for Tokens. This issue tracks progress on that work.

Establish statistically significant and consistent performance tests

Currently, there is a lot of variance with performance tests - we need to set up an environment where:

we can understand the impact of our work on multiple machine configurations
we can minimize variance between test runs and have higher confidence that a potential performance optimization is actually adding value rather than unnecessarily complicating the codebase
the tests should run continuously so we can detect issues as soon as possible
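As a rough illustration of the variance point above, a timing harness can discard warm-up runs and report the median instead of a single measurement. This is only a sketch: `benchmark()` is a hypothetical helper, the workload is a stand-in that uses PHP's built-in tokenizer rather than this parser, and `hrtime()` assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helper: runs $workload $runs times and returns timing stats in milliseconds.
function benchmark(callable $workload, int $runs = 15, int $warmups = 3): array
{
    // Warm-up runs are discarded so caches and allocator state settle first.
    for ($i = 0; $i < $warmups; $i++) {
        $workload();
    }

    $samples = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                    // monotonic clock, nanoseconds
        $workload();
        $samples[] = (hrtime(true) - $start) / 1e6;
    }

    // The median is less sensitive to outliers than the mean.
    sort($samples);
    $count = count($samples);
    $middle = intdiv($count, 2);
    $median = $count % 2 === 1
        ? $samples[$middle]
        : ($samples[$middle - 1] + $samples[$middle]) / 2;

    return ['min' => $samples[0], 'median' => $median, 'max' => $samples[$count - 1]];
}

// Stand-in workload (an assumption for illustration): tokenize a synthetic source string.
$source = '<?php ' . str_repeat('echo "hi!"; ', 5000);
$stats = benchmark(function () use ($source) {
    token_get_all($source);
});
printf("min %.2f ms, median %.2f ms, max %.2f ms\n", $stats['min'], $stats['median'], $stats['max']);
```

Comparing medians (and the min/max spread) across runs on several machines gives a first indication of whether a change is a real improvement or just noise.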

Add setting for php parser source folder to syntax visualizer tool

This should make it easier to test and compare changes in the parser.

Add a "vendor" top level namespace

From http://www.php-fig.org/psr/psr-4/

The fully qualified class name MUST have a top-level namespace name, also known as a “vendor namespace”.

Ok, you can say "PhpParser" is the vendor... a way to circumvent PSR-4 (as the project SubNamespaceName is not mandatory.. one reason why I dislike this PSR). And PhpParser could be a bit too generic and may create conflict.

At least, it conflicts with https://packagist.org/packages/nikic/php-parser

Proposal: use Microsoft\PhpParser

I understand this is only doable as part of a new major version, because of BC break.

Inspect and file bugs for failing framework validation tests

We are currently at ~98.5% pass rate on framework validation tests (a test fails if there is an error present in the tree for known valid code). We should inspect remaining issues, create a minimal test case, and file bugs for missing/broken functionality.

Should produce errors on expected scoped property access expressions in trait select clauses

The code below is currently improperly parsed without errors.

class A {
    use \A {
        \a as b;
        \b insteadof C;
    }
}

trait select clauses do not properly detect start of fully qualified name elements

Should also be updated in the language spec.
https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-trait-select-insteadof-clause

Clean up usages of `iterator_to_array`

We are using iterator_to_array during some tree traversals, which is a good tell-tale sign that those operations could be further optimized so we don't have to duplicate every element in memory.
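To make the difference concrete, here is a minimal sketch that contrasts streaming over a generator with materializing it through iterator_to_array. `TreeNode` is a hypothetical stand-in, not the parser's Node class.

```php
<?php
// Hypothetical stand-in for a tree node; not the parser's Node class.
final class TreeNode
{
    public $name;
    public $children;

    public function __construct(string $name, array $children = [])
    {
        $this->name = $name;
        $this->children = $children;
    }

    // Lazily yields descendants in pre-order without building an intermediate array.
    public function getDescendants(): \Generator
    {
        foreach ($this->children as $child) {
            yield $child;
            yield from $child->getDescendants();
        }
    }
}

$root = new TreeNode('root', [
    new TreeNode('a', [new TreeNode('a1'), new TreeNode('a2')]),
    new TreeNode('b'),
]);

// Streaming: nodes are visited one at a time; no second list of elements is built.
$names = [];
foreach ($root->getDescendants() as $node) {
    $names[] = $node->name;
}

// Materializing: copies every yielded element into a new array first.
// Passing false as the second argument matters here: with preserved keys, the
// integer keys produced by nested `yield from` calls collide and overwrite entries.
$all = iterator_to_array($root->getDescendants(), false);
```

In the streaming loop, peak memory is bounded by the tree plus the active generator frames, whereas the materialized array additionally holds a reference to every descendant at once.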

fix "api" testsuite failures

For nodes: avoid defining properties that would be null, arrays with zero or one elements, etc.

We should consider better ways to "squash" these constructs for common scenarios to help improve memory usage.

Better disambiguation between subscript-expression and compound-statement

This error tolerance case needs some special handling.

class A {
    function b() {
        if ($expression {
            $a = 3;
            $b = "hello";
        }
        echo "hi";
    }
    
    public function c() {
    }
}

In this case, the if statement is missing a close paren. However, rather than getting parsed as an if-statement missing a close paren, it gets parsed as a subscript-expression, which is defined as follows, according to the PHP language spec:

subscript-expression:
  dereferencable-expression   [   expression(opt)   ]
  dereferencable-expression   {   expression   }   [Deprecated form]

This results in the first close brace getting treated as a close brace for the method, rather than the if statement. Then the next close brace gets eaten by the Class node (which terminates the class), so c() ends up being a function, rather than a method.

LHS of assignment must be variable

Noticed while looking at example 4:

In PHP an expression like $a == $b = $c is parsed as $a == ($b = $c), because this is the only valid way to parse the code under the constraint that the LHS of an assignment must be a variable. The parser currently treats this as ($a == $b) = $c instead.
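The grouping can be observed at runtime. The snippet below is only an illustration with made-up values: because PHP treats the expression as $a == ($b = $c), the assignment happens first and the comparison sees the new value of $b.

```php
<?php
$a = 2;
$b = 1;
$c = 2;

// PHP groups this as $a == ($b = $c): $b is assigned first, then compared with $a.
$result = ($a == $b = $c);

var_dump($b);      // int(2)
var_dump($result); // bool(true)
```

Had the expression been grouped as ($a == $b) = $c, it would not be valid PHP at all, since a comparison result cannot be assigned to.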

replaces usages of array_push with $array[]

This way we reduce the need for an extra function call, and also remain consistent throughout the codebase
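For reference, the two forms side by side (variable names are illustrative only):

```php
<?php
$tokens = [];

// Function-call form: works, but pays for a function call to append one element.
array_push($tokens, 'echo');

// Bracket form: appends the same way without a function call.
$tokens[] = '"hi!"';

var_dump($tokens);
```

Both statements append to the end of the array, so the replacement is behavior-preserving whenever a single element is pushed.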

Future tooling plans?

Not sure if you're at liberty to discuss at this point, but related to #36 (comment), I was wondering if you had future plans in the PHP tooling space. Will you be building an "official" language server on top of this parser for vscode (or other LSP clients)? Tooling for PHP devs that use vim (or similar) is just abysmal at this point, and it would be awesome to be able to be reasonably productive in vim again!

Add PHPCS for code style validation

Just an idea to slowly start improving the code style

Parser does not detect invalid cast expressions

Cast expressions may not contain anything except spaces and tabs between the parentheses.

In the following example, the parser generates a CastExpression while PHP sees it as a constant named 'int' followed by a variable.

Example:
<?php (/* hello */int)$a;

Actual result:
No error.

Expected result:
syntax error, unexpected $a (T_VARIABLE)

Parser Class Refactoring

Parser class could potentially be split up to make the code more legible.

Support parsing nullable types

PHP 7.1 supports nullable types by prefixing the type with a question mark.
https://wiki.php.net/rfc/nullable_types

This is also currently missing from the type-declaration definition in the language spec, so we should update it there too. https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-type-declaration
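A small example of the syntax the parser needs to accept (the function name is illustrative; requires PHP 7.1 or later):

```php
<?php
// `?string` accepts a string or null; `?int` allows returning an int or null.
function findLength(?string $text): ?int
{
    return $text === null ? null : strlen($text);
}

var_dump(findLength('hi!')); // int(3)
var_dump(findLength(null));  // NULL
```

The question mark can appear in both parameter and return type positions, so both need to be covered by the type-declaration grammar.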

Add comprehensive set of tests for API

The API was somewhat haphazardly strewn together as an exploratory proof-of-concept, so it'll be good to start being more deliberate and increasing our test coverage in this area.

Prefix namespaces: PhpParser -> Microsoft\PhpParser

Currently we clash with https://github.com/nikic/php-parser, which prevents both libraries from being used at the same time. Plus this appears to be consistent w/ general best practices.

API should support getting next or previous tokens

A basic implementation of this would be adding a method Node::getDescendantTokenAtPosition (similar to Node::getDescendantNodeAtPosition), and because the tree is fully representative, we could simply call $rootNode->getTokenAtPosition($token->getEnd()) to get the next token or $rootNode->getTokenAtPosition($token->getFullStart()-1) to get the previous one. Eventually, the method could be optimized further.

Edge cases:

Zero-length error tokens (of type MissingToken) should not be returned by this method. All other tokens are uniquely addressable, as per our defined invariants.
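The idea can be sketched over a flat token list. Everything below is hypothetical and only illustrates the position arithmetic: the array shape, the function names, and the linear scan are assumptions, not the parser's Token class or API (a real implementation would walk the tree instead).

```php
<?php
// Hypothetical tokens for the source `if ($a {`; each covers [fullStart, end),
// where fullStart includes leading whitespace and comments.
$tokens = [
    ['kind' => 'IfKeyword',    'fullStart' => 0, 'end' => 2],
    ['kind' => 'OpenParen',    'fullStart' => 2, 'end' => 4],
    ['kind' => 'VariableName', 'fullStart' => 4, 'end' => 6],
    ['kind' => 'MissingToken', 'fullStart' => 6, 'end' => 6], // zero-length error token
    ['kind' => 'OpenBrace',    'fullStart' => 6, 'end' => 8],
];

function getTokenAtPosition(array $tokens, int $position): ?array
{
    foreach ($tokens as $token) {
        // Zero-length tokens have fullStart === end, so this range check never matches them.
        if ($token['fullStart'] <= $position && $position < $token['end']) {
            return $token;
        }
    }
    return null;
}

function getNextToken(array $tokens, array $token): ?array
{
    return getTokenAtPosition($tokens, $token['end']);
}

function getPreviousToken(array $tokens, array $token): ?array
{
    return getTokenAtPosition($tokens, $token['fullStart'] - 1);
}

$next = getNextToken($tokens, $tokens[2]);         // skips the MissingToken
$previous = getPreviousToken($tokens, $tokens[4]);
echo $next['kind'], ' ', $previous['kind'], "\n";  // OpenBrace VariableName
```

Because every non-empty token owns a distinct range of the document, a half-open range check is enough to exclude zero-length tokens without a special case.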

DelimitedList API should expose "getListElements" function

Currently, consumers have to do the work of ignoring delimiter tokens in order to extract these elements.

Question: Will this be used with standard Visual Studio?

Will this be used with standard Visual Studio, not VS Code?

yield-expression support

Right now "yield" and "yield from" are being treated as skipped tokens - update parser to add proper support.
https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-yield-expression
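For context, these are the two forms in question (function names are illustrative; `yield from` requires PHP 7.0 or later):

```php
<?php
function inner(): \Generator
{
    yield 1;
    yield 2;
}

function outer(): \Generator
{
    yield 0;
    yield from inner(); // delegates to another generator
    yield 3;
}

foreach (outer() as $value) {
    echo $value, "\n"; // prints 0, 1, 2, 3 on separate lines
}
```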

Parser doesn't properly parse namespace-use-clause list

The below code gets parsed fine

<?php
namespace A;
use A;

However, if you include another namespace-use-clause, it does not get parsed properly.

<?php
namespace A;
use A, B;

Make "grammar" testsuite pass

Add any tests we don't intend to fix immediately to skipped.json so we can more easily detect regressions.

What about PSR-1, PSR-2 and PSR-12 codestyle?

Do you plan to format your code according to a single set of PHP-wide recommendations?

If not, why not?

Use `strrpos` instead of `strpos` in `GetLineCharacterPositionFromPosition`

strrpos should be far more efficient than the current strpos + while loop.
https://github.com/Microsoft/tolerant-php-parser/blob/master/src/Utilities.php#L77-L94
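One possible shape for this, as a sketch rather than the project's implementation: count the newlines before the position for the line number, and use strrpos to find the last newline for the character offset. The function name and return shape are hypothetical, and the position is assumed to be a zero-based byte offset within the text.

```php
<?php
// Hypothetical helper; returns zero-based [line, character] for a byte offset into $text.
// Assumes 0 <= $position <= strlen($text).
function lineCharacterFromPosition(int $position, string $text): array
{
    // Note: substr copies the prefix; it is used here to keep the sketch simple.
    $before = substr($text, 0, $position);

    // The number of newlines before the position is the zero-based line number.
    $line = substr_count($before, "\n");

    // Search backwards for the last newline before the position.
    $lastNewline = strrpos($before, "\n");
    $character = $lastNewline === false ? $position : $position - $lastNewline - 1;

    return [$line, $character];
}

$text = "<?php\necho \"hi!\";\n";
var_dump(lineCharacterFromPosition(8, $text)); // line 1, character 2
```

Both substr_count and strrpos run in native code, so the per-character PHP-level loop disappears.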

Set up CI server for OSX and Windows

This will help us detect potential issues earlier and also make it more broadly apparent what the current status of the parser is.

Support parsing multiple exception types in catch clauses

PHP 7.1 introduced support for catching multiple exception types:
https://wiki.php.net/rfc/multiple-catch

We should handle this condition on the parser by parsing as a delimited list of qualified names, where the delimiter is the TokenKind::BarToken, and we should be able to reuse the logic from parseQualifiedNameList
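For reference, this is the syntax to be supported (the function name is illustrative; requires PHP 7.1 or later):

```php
<?php
function classify(callable $operation): string
{
    try {
        $operation();
        return 'ok';
    } catch (InvalidArgumentException | LengthException $e) {
        // One clause handles either exception type.
        return get_class($e);
    }
}

echo classify(function () {
    throw new LengthException('too long');
}), "\n"; // LengthException
```

The type list between the parentheses is what would be parsed as a bar-delimited list of qualified names.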

Thorough code review to ensure we didn't miss anything from the spec

The PHP spec was a bit open to interpretation at times, so it'll be good to have a second pass, and should be a good opportunity to add more tests.

While we're doing this pass, we should submit any remaining PRs to the PHP language spec based on our findings from this project. Already submitted a few PRs, and I've been meaning to submit some more.

Template string support

Related to #11 - the current implementation is incomplete, and we need to think about a proper representation.

issue parsing base class-base-clause in object-creation-expression

See below:

to reproduce:

<?php
$a = new class() extends PHPUnit_Framework_BaseTestListener {};

grammar:
https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-object-creation-expression

Add more frameworks to validation test suite

We continually run tests on:

CodeIgniter
WordPress
cakephp
math-php
symfony

We should add more frameworks to the list. Feel free to make a PR; to add another submodule, run:
git submodule add <git url> validation/frameworks/<framework-name>

API should support getting doc comments

Currently, doc comments are included as part of the leading trivia. We should add an API entrypoint Node::getDocCommentText() and Token::getDocCommentText(string $text) that grabs the part of the leading comment/whitespace trivia corresponding to the Node/Token, and parses it to find the doc comment. A simple regex that looks for the leading /** and ending */ should suffice. Alternatively, we can consider retokenizing the text. The behavior should be consistent with token_get_all.

Edge cases:

if there is no doc comment in the string, the method should return null
if there are multiple doc comments in the leading trivia, the method should return the last occurrence.
the string below should not be counted as a doc comment.

/*/** */

the string below should not be counted as a doc comment:

/***/

the string below should be counted as a doc comment because it includes a whitespace character after /**:

/** */

the string below should not be counted as a doc comment because it doesn't include a whitespace character after the leading /**:

/**d */
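The retokenizing alternative can be sketched with token_get_all itself, which keeps the behavior consistent with it by construction. This is a minimal illustration, not the proposed API: the free function below is hypothetical and assumes its input is a run of leading trivia (whitespace and comments only).

```php
<?php
// Hypothetical helper: returns the last doc comment in a run of leading trivia,
// or null when there is none.
function getDocCommentText(string $leadingTrivia): ?string
{
    $docComment = null;
    // Prefix an open tag so the tokenizer treats the trivia as PHP code.
    foreach (token_get_all('<?php ' . $leadingTrivia) as $token) {
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            $docComment = $token[1]; // keep overwriting so the last occurrence wins
        }
    }
    return $docComment;
}

var_dump(getDocCommentText('/*/** */'));           // NULL
var_dump(getDocCommentText('/***/'));              // NULL
var_dump(getDocCommentText('/**d */'));            // NULL
var_dump(getDocCommentText('/** */'));             // string(6) "/** */"
var_dump(getDocCommentText("/** a */\n/** b */")); // string(8) "/** b */"
```

A regex-based version would avoid the tokenizer call, but it would have to reproduce these edge cases by hand.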

Ensure tree traversal operations (e.g. Node::getDescendantNodesAndTokens) are passing by reference

Otherwise we end up unnecessarily doubling peak memory while iterating through nodes/tokens.

Consider parsing NamespaceDefinition as parent of future statements

Currently the namespace definition only includes a compound statement or semicolon token as one of its children. Parsing future statements as children rather than as separate statements would help us optimize for operations where we want to find the fully qualified name by simply searching for the first ancestor.

We would terminate parsing the namespace when another namespace definition occurs. There may be some edge cases here, though, so it's worth discussing.