Tolerant PHP Parser

This is an early-stage PHP parser designed, from the beginning, for IDE usage scenarios (see Design Goals for more details). There is still a ton of work to be done, so at this point, this repo mostly serves as an experiment and the start of a conversation.

This is the v0.1 branch, which changes data structures to support syntax added after the initial 0.0.x release line.

Get Started

After you've configured your machine, you can use the parser to generate and work with the Abstract Syntax Tree (AST) via a friendly API.

<?php
// Autoload required classes
require __DIR__ . "/vendor/autoload.php";

use Microsoft\PhpParser\{DiagnosticsProvider, Node, Parser, PositionUtilities};

// Instantiate new parser instance
$parser = new Parser();

// Return and print an AST from string contents
$astNode = $parser->parseSourceFile('<?php /* comment */ echo "hi!"');
var_dump($astNode);

// Gets and prints errors from AST Node. The parser handles errors gracefully,
// so it can be used in IDE usage scenarios (where code is often incomplete).
$errors = DiagnosticsProvider::getDiagnostics($astNode);
var_dump($errors);

// Traverse all Node descendants of $astNode
foreach ($astNode->getDescendantNodes() as $descendant) {
    if ($descendant instanceof Node\StringLiteral) {
        // Print the Node text (without whitespace or comments)
        var_dump($descendant->getText());

        // All Nodes link back to their parents, so it's easy to navigate the tree.
        $grandParent = $descendant->getParent()->getParent();
        var_dump($grandParent->getNodeKindName());

        // The AST is fully-representative, and round-trippable to the original source.
        // This enables consumers to build reliable formatting and refactoring tools.
        var_dump($grandParent->getLeadingCommentAndWhitespaceText());
    }

    // In addition to retrieving all children or descendants of a Node,
    // Nodes expose properties specific to the Node type.
    if ($descendant instanceof Node\Expression\EchoExpression) {
        $echoKeywordStartPosition = $descendant->echoKeyword->getStartPosition();
        // To cut down on memory consumption, positions are represented as a single integer
        // index into the document, but their line and character positions are easily retrieved.
        $lineCharacterPosition = PositionUtilities::getLineCharacterPositionFromPosition(
            $echoKeywordStartPosition,
            $descendant->getFileContents()
        );
        echo "line: $lineCharacterPosition->line, character: $lineCharacterPosition->character";
    }
}
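
The single-integer position idea mentioned in the comments above can be illustrated without the library. This is only a sketch, not how PositionUtilities is implemented: offsetToLineCharacter is a hypothetical helper, and it assumes byte offsets, zero-based line and character numbers, and "\n" line endings.

```php
<?php
// Hypothetical helper, not part of Microsoft\PhpParser: converts a byte
// offset into a zero-based [line, character] pair, assuming "\n" line endings.
function offsetToLineCharacter(int $offset, string $text): array
{
    // Clamp so out-of-range offsets map to the start or end of the text.
    $offset = max(0, min($offset, strlen($text)));
    $before = substr($text, 0, $offset);

    // The line number is the count of newlines before the offset.
    $line = substr_count($before, "\n");

    // The character is the distance from the last newline (or from the start).
    $lastNewline = strrpos($before, "\n");
    $character = $lastNewline === false ? $offset : $offset - $lastNewline - 1;

    return [$line, $character];
}

$source = "<?php\n/* comment */ echo \"hi!\";\n";
[$line, $character] = offsetToLineCharacter(strpos($source, 'echo'), $source);
echo "line: $line, character: $character\n";
```

Storing one integer per position and computing the pair on demand trades a little CPU time at lookup for less memory per token.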

Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed, and we'll see what we can do! Also, please file bugs for any unexpected behavior in the parse tree. We're still in our early stages, and any feedback you have is much appreciated 😃.

Design Goals

  • Error tolerant design - in IDE scenarios, code is, by definition, incomplete. In the case that invalid code is entered, the parser should still be able to recover and produce a valid + complete tree, as well as relevant diagnostics.
  • Fast and lightweight (should be able to parse several MB of source code per second, to leave room for other features).
    • Memory-efficient data structures
    • Allow for incremental parsing in the future
  • Adheres to PHP language spec, supports both PHP5 and PHP7 grammars
  • Generated AST provides properties (fully representative, etc.) necessary for semantic and transformational operations, which also need to be performant.
    • Fully representative and round-trippable back to the text it was parsed from (all whitespace and comment "trivia" are included in the parse tree)
    • Possible to easily traverse the tree through parent/child nodes
    • < 100 ms UI response time, so each language server operation should be < 50 ms to leave room for all the other stuff going on in parallel.
  • Simple and maintainable over time - parsers have a tendency to get really confusing, really fast, so readability and debug-ability is high priority.
  • Testable - the parser should produce provably valid parse trees. We achieve this by defining and continuously testing a set of invariants about the tree.
  • Friendly and descriptive API to make it easy for others to build on.
  • Written in PHP - make it as easy as possible for the PHP community to consume and contribute.
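
The "fully representative and round-trippable" goal can be phrased as a testable invariant: concatenating the text of every token, trivia included, reproduces the original source. The sketch below shows the idea using PHP's built-in token_get_all rather than this parser's tree; reassemble is a hypothetical helper name.

```php
<?php
// Illustrative only: uses PHP's built-in tokenizer, not the parser's own tree.
function reassemble(string $source): string
{
    $text = '';
    foreach (token_get_all($source) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $text .= is_array($token) ? $token[1] : $token;
    }
    return $text;
}

$source = "<?php\n/** doc */\nfunction f( \$a ) { return \$a ; } // trailing\n";
var_dump(reassemble($source) === $source);
```

A check of this shape also works on incomplete code, which is what makes it a reasonable invariant for an error-tolerant parser.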

Current Status and Approach

To ensure a sufficient level of correctness at every step of the way, the parser is being developed using the following incremental approach:

  • Phase 1: Write lexer that does not support PHP grammar, but supports EOF and Unknown tokens. Write tests for all invariants.
  • Phase 2: Support PHP lexical grammar, lots of tests
  • Phase 3: Write a parser that does not support PHP grammar, but produces tree of Error Nodes. Write tests for all invariants.
  • Phase 4: Support PHP syntactic grammar, lots of tests
  • Phase 5 (in progress 🏃): Real-world validation and optimization
    • Correctness: validate that there are no errors produced on sample codebases, benchmark against other parsers (investigate any instance of disagreement), fuzz-testing
    • Performance: profile, benchmark against large PHP applications
  • Phase 6: Finalize API to make it as easy as possible for people to consume.
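
As a rough illustration of what Phase 1 describes, the sketch below is a lexer that knows no PHP grammar at all, plus a check of a few invariants. The token shape, kind names, and function names are assumptions made for this example and do not come from the project's source.

```php
<?php
// Hypothetical Phase 1-style lexer: it knows no PHP grammar, so every
// character becomes an Unknown token, followed by one zero-length EOF token.
const KIND_UNKNOWN = 'Unknown';
const KIND_EOF = 'EndOfFile';

function lexUnknown(string $text): array
{
    $tokens = [];
    $length = strlen($text);
    for ($pos = 0; $pos < $length; $pos++) {
        $tokens[] = ['kind' => KIND_UNKNOWN, 'start' => $pos, 'length' => 1];
    }
    $tokens[] = ['kind' => KIND_EOF, 'start' => $length, 'length' => 0];
    return $tokens;
}

// Invariants checked: tokens are contiguous, cover the whole text,
// and the final token is EOF.
function checkInvariants(array $tokens, string $text): bool
{
    $expectedStart = 0;
    foreach ($tokens as $token) {
        if ($token['start'] !== $expectedStart) {
            return false;
        }
        $expectedStart += $token['length'];
    }
    $last = end($tokens);
    return $expectedStart === strlen($text)
        && $last !== false
        && $last['kind'] === KIND_EOF;
}

var_dump(checkInvariants(lexUnknown('<?php echo 1;'), '<?php echo 1;'));
```

Writing the invariant tests against a trivial lexer first means the same tests keep guarding the real lexical grammar as it is added in Phase 2.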

Additional notes

A few PHP grammatical constructs (namely yield-expression and template strings) are not yet supported, and there are other miscellaneous bugs. However, because the parser is error-tolerant, these errors are handled gracefully, and the resulting tree is otherwise complete. To get a more holistic sense of where we are, you can run the "validation" test suite (see Contributing Guidelines for more info on running tests). Or simply take a look at the current validation test results.

Even though we haven't yet begun the performance optimization stage, we have seen promising results so far, and have plenty more room for improvement. See How It Works for details on our current approach, and run the Performance Tests on your own machine to see for yourself.

Learn more

🎯 Design Goals - learn about the design goals of the project (features, performance metrics, and more).

📖 Documentation - learn how to reference the parser from your project, and how to perform operations on the AST to answer questions about your code.

👀 Syntax Visualizer Tool - get a more tangible feel for the AST. Get creative - see if you can break it!

📈 Current Status and Approach - how much of the grammar is supported? Performance? Memory? API stability?

🔧 How it works - learn about the architecture, design decisions, and tradeoffs.

💖 Contribute! - learn how to get involved, check out some pointers to educational commits that'll help you ramp up on the codebase (even if you've never worked on a parser before), and recommended workflows that make it easier to iterate.


This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

tolerant-php-parser's People

Contributors

adamtreineke, andwur, bilge, camilledejoye, carusogabriel, dantleech, decanus, inxilpro, ishan-deepsource, jens1o, jmarolf, kant, mousetraps, ngyuki, nikic, ocramius, petah, roblourens, runz0rd, staabm, sunverwerth, sytranvn, tpunt, tysonandre, tysonandre-tmg, villfa, vinkla, wyrihaximus, yanmii-is, zobo

tolerant-php-parser's Issues

Add VS Code launch config

Anything to make it easier for people to get started working with and contributing to the codebase

Enable more concrete rulesets on DelimitedList

Some delimited lists can have empty elements, and some delimited lists cannot. However, we don't specify concrete rulesets for each of these list types, and everything gets parsed in the most lenient way possible. This produces a valid tree, but also means that we're not providing errors where we should. We should revisit all of the classes that extend DelimitedList and define concrete rulesets for each.

Memory profiling tools?

Right now, it's fairly obvious where the majority of our memory usage stems from (ahem, PHP objects...), and we have some ideas on how to cut down on that when it comes to reducing the cost of the token representation. But as we progress, it would be helpful to have something other than raw arithmetic 😉.

Refresh string literal representation

The representation of string literals is still a work-in-progress, and the implementation became even more inconsistent when we moved away from the handspun lexer towards PhpTokenizer. We should put some thought into how best to represent strings and template strings.

Establish statistically significant and consistent performance tests

Currently, there is a lot of variance with performance tests - we need to set up an environment where:

  • we can understand the impact of our work on multiple machine configurations
  • we can minimize variance between test runs and have higher confidence that a potential performance optimization is actually adding value rather than unnecessarily complicating the codebase
  • the tests should run continuously so we can detect issues as soon as possible
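
One common way to reduce run-to-run variance is to repeat a measurement and compare medians rather than single timings. The sketch below is a minimal harness along those lines, not the project's performance test suite; benchmark and median are hypothetical names, and the workload uses PHP's built-in token_get_all as a stand-in for a parse.

```php
<?php
// Hypothetical helpers for illustration; not part of the project's test suite.
function median(array $samples): float
{
    $count = count($samples);
    if ($count === 0) {
        throw new InvalidArgumentException('median() needs at least one sample');
    }
    sort($samples);
    $middle = intdiv($count, 2);
    return $count % 2 === 1
        ? (float) $samples[$middle]
        : ($samples[$middle - 1] + $samples[$middle]) / 2;
}

// Runs $work repeatedly and reports timings in milliseconds.
function benchmark(callable $work, int $runs = 15): array
{
    $samples = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true); // monotonic clock, nanoseconds
        $work();
        $samples[] = (hrtime(true) - $start) / 1e6;
    }
    return [
        'median_ms' => median($samples),
        'min_ms' => min($samples),
        'max_ms' => max($samples),
    ];
}

$source = "<?php\n" . str_repeat("echo 1;\n", 1000);
$result = benchmark(function () use ($source): void {
    token_get_all($source);
});
printf(
    "median %.3f ms (min %.3f, max %.3f)\n",
    $result['median_ms'],
    $result['min_ms'],
    $result['max_ms']
);
```

The median is less sensitive to a single slow run than the mean, and reporting min and max alongside it makes the spread visible.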

Add a "vendor" top level namespace

From http://www.php-fig.org/psr/psr-4/

The fully qualified class name MUST have a top-level namespace name, also known as a “vendor namespace”.

Ok, you can say "PhpParser" is the vendor... a way to circumvent PSR-4 (as the project SubNamespaceName is not mandatory.. one reason why I dislike this PSR). And PhpParser could be a bit too generic and may create conflict.

At least, it conflicts with https://packagist.org/packages/nikic/php-parser

Proposal: use Microsoft\PhpParser

I understand this is only doable as part of a new major version, because of BC break.

Inspect and file bugs for failing framework validation tests

We are currently at ~98.5% pass rate on framework validation tests (a test fails if there is an error present in the tree for known valid code). We should inspect remaining issues, create a minimal test case, and file bugs for missing/broken functionality.

Clean up usages of `iterator_to_array`

We are using iterator_to_array during some tree traversals, which is a good tell-tale sign that those operations could be further optimized so we don't have to duplicate every element in memory.
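
To illustrate the difference, the sketch below walks a tree lazily with a generator and stops at the first match, so no intermediate array is built. The node shape and the function names are hypothetical and are not the parser's Node API.

```php
<?php
// Hypothetical tree shape: each node is ['kind' => string, 'children' => array].
function descendants(array $node): Generator
{
    foreach ($node['children'] as $child) {
        yield $child;
        yield from descendants($child);
    }
}

// Lazy search: iteration stops at the first match, so later nodes
// are never produced and nothing is copied into an array.
function firstOfKind(array $root, string $kind): ?array
{
    foreach (descendants($root) as $node) {
        if ($node['kind'] === $kind) {
            return $node;
        }
    }
    return null;
}

$tree = ['kind' => 'root', 'children' => [
    ['kind' => 'a', 'children' => [
        ['kind' => 'b', 'children' => []],
    ]],
    ['kind' => 'c', 'children' => []],
]];

var_dump(firstOfKind($tree, 'b')['kind']);
```

When an array really is needed, note that `yield from` keeps the inner generator's keys, so `iterator_to_array` should be called with its second argument set to `false` to avoid entries overwriting each other.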

Better disambiguation between subscript-expression and compound-statement

This error tolerance case needs some special handling.

class A {
    function b() {
        if ($expression {
            $a = 3;
            $b = "hello";
        }
        echo "hi";
    }
    
    public function c() {
    }
}

In this case, the if statement is missing a close paren. However, rather than getting parsed as an if-statement missing a close paren, it gets parsed as a subscript-expression, which is defined as follows, according to the PHP language spec:

subscript-expression:
  dereferencable-expression   [   expression_opt   ]
  dereferencable-expression   {   expression   }   [Deprecated form]

This results in the first close brace getting treated as a close brace for the method, rather than the if statement. Then the next close brace gets eaten by the Class node (which terminates the class), so c() ends up being a function, rather than a method.

LHS of assignment must be variable

Noticed while looking at example 4:

In PHP an expression like $a == $b = $c is parsed as $a == ($b = $c), because this is the only valid way to parse the code under the constraint that the LHS of an assignment must be a variable. The parser currently treats this as ($a == $b) = $c instead.
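
The grouping described above can be observed from plain PHP, independent of this parser. The snippet is only a demonstration of the runtime behavior, with arbitrary values chosen for illustration.

```php
<?php
$a = 5;
$b = 0;
$c = 5;

// PHP groups this as $a == ($b = $c): $b is assigned first, and the
// assigned value is then compared with $a.
$result = $a == $b = $c;

var_dump($b);
var_dump($result);
```

If the expression were grouped as ($a == $b) = $c, the left-hand side of the assignment would not be a variable, which PHP does not allow.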

Future tooling plans?

Not sure if you're at liberty to discuss at this point, but related to #36 (comment), I was wondering if you had future plans in the PHP tooling space. Will you be building an "official" language server on top of this parser for vscode (or other LSP clients)? Tooling for PHP devs that use vim (or similar) is just abysmal at this point, and it would be awesome to be able to be reasonably productive in vim again!

Parser does not detect invalid cast expressions

Cast expressions may not contain anything except spaces and tabs between the parentheses.

In the following example, the parser generates a CastExpression while PHP sees it as a constant named 'int' followed by a variable.

Example:
<?php (/* hello */int)$a;

Actual result:
No error.

Expected result:
syntax error, unexpected $a (T_VARIABLE)
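
PHP's own lexer can be used to check which spellings count as a cast. The sketch below relies only on the built-in token_get_all and the T_INT_CAST constant; lexesAsIntCast is a hypothetical helper name.

```php
<?php
// Returns true when PHP's own lexer produces a T_INT_CAST token for $code.
function lexesAsIntCast(string $code): bool
{
    foreach (token_get_all($code) as $token) {
        if (is_array($token) && $token[0] === T_INT_CAST) {
            return true;
        }
    }
    return false;
}

var_dump(lexesAsIntCast('<?php (int)$a;'));            // plain cast
var_dump(lexesAsIntCast("<?php ( \tint )\$a;"));       // spaces and tabs only
var_dump(lexesAsIntCast('<?php (/* hello */int)$a;')); // comment inside parens
```

A comparison like this could serve as a reference when deciding which token sequences the parser should accept as a CastExpression.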

Add comprehensive set of tests for API

The API was somewhat haphazardly strewn together as an exploratory proof-of-concept, so it'll be good to start being more deliberate and increasing our test coverage in this area.

API should support getting next or previous tokens

A basic implementation of this would be adding a method Node::getDescendantTokenAtPosition (similar to Node::getDescendantNodeAtPosition), and because the tree is fully representative, we could simply $rootNode->getTokenAtPosition($token->getEnd()) or $rootNode->getTokenAtPosition($token->getFullStart()-1). Eventually, the method could be optimized further.

Edge cases:

  • Zero-length error tokens (of type MissingToken) should not be returned by this method. All other tokens are uniquely addressable, as per our defined invariants.
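
A minimal sketch of the basic idea, using a flat list of tokens instead of the real tree. The token shape and all three function names are assumptions for illustration; the linear scan stands in for whatever lookup the tree would eventually provide.

```php
<?php
// Hypothetical token shape: ['fullStart' => int, 'length' => int, 'text' => string].
// Tokens are assumed to be contiguous and in document order.
function tokenAtPosition(array $tokens, int $position): ?array
{
    foreach ($tokens as $token) {
        $end = $token['fullStart'] + $token['length'];
        // Zero-length (missing) tokens never match, because start equals end.
        if ($position >= $token['fullStart'] && $position < $end) {
            return $token;
        }
    }
    return null;
}

function nextToken(array $tokens, array $token): ?array
{
    return tokenAtPosition($tokens, $token['fullStart'] + $token['length']);
}

function previousToken(array $tokens, array $token): ?array
{
    return tokenAtPosition($tokens, $token['fullStart'] - 1);
}

$tokens = [
    ['fullStart' => 0, 'length' => 4, 'text' => 'echo'],
    ['fullStart' => 4, 'length' => 0, 'text' => ''],   // zero-length missing token
    ['fullStart' => 4, 'length' => 2, 'text' => ' 1'], // leading space is trivia
    ['fullStart' => 6, 'length' => 1, 'text' => ';'],
];

var_dump(nextToken($tokens, $tokens[0])['text']);
```

Because the half-open range check excludes empty spans, the zero-length token is skipped without any special casing.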

Thorough code review to ensure we didn't miss anything from the spec

The PHP spec was a bit open to interpretation at times, so it'll be good to have a second pass, and should be a good opportunity to add more tests.

While we're doing this pass, we should submit any remaining PRs to the PHP language spec based on our findings from this project. Already submitted a few PRs, and I've been meaning to submit some more.

Template string support

Related to #11 - the current implementation is incomplete, and we need to think about a proper representation.

Add more frameworks to validation test suite

We continually run tests on:

  • CodeIgniter
  • WordPress
  • cakephp
  • math-php
  • symfony

We should add more frameworks to the list. Feel free to make a PR; to add another submodule, run:
git submodule add <git url> validation/frameworks/<framework-name>

API should support getting doc comments

Currently, doc comments are included as part of the leading trivia. We should add an API entrypoint Node::getDocCommentText() and Token::getDocCommentText(string $text) that grabs the part of the leading comment/whitespace trivia corresponding to the Node/Token, and parses it to find the doc comment. A simple regex that looks for the leading /** and ending */ should suffice. Alternatively, we can consider retokenizing the text. The behavior should be consistent with token_get_all.

Edge cases:

  • if there is no doc comment in the string, the method should return null
  • if there are multiple doc comments in the leading trivia, the method should return the last occurrence.
  • the string below should not be counted as a doc comment.
/*/** */
  • the string below should not be counted as a doc comment:
/***/
  • the string below should be counted as a doc comment because it includes a whitespace character after /**:
/** */
  • the string below should not be counted as a doc comment because it doesn't include a whitespace character after the leading /**:
/**d */
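
The retokenizing alternative can be sketched with token_get_all itself, which keeps the behavior consistent with it by construction. This is a standalone function taking the trivia text, not the proposed Node or Token method, and it assumes the input contains only whitespace and comments.

```php
<?php
// Hypothetical helper: returns the last doc comment found in a run of
// leading trivia (whitespace and comments), or null if there is none.
function getDocCommentText(string $leadingTrivia): ?string
{
    $docComment = null;
    // token_get_all() only recognizes comments in PHP mode, so prefix an open tag.
    foreach (token_get_all('<?php ' . $leadingTrivia) as $token) {
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            $docComment = $token[1]; // keep overwriting so the last one wins
        }
    }
    return $docComment;
}

var_dump(getDocCommentText("/** first */\n/** second */\n"));
```

Delegating to the tokenizer avoids having to encode the whitespace rule after `/**` in a hand-written regex.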

Consider parsing NamespaceDefinition as parent of future statements

Currently the namespace definition only includes a compound statement or semicolon token as one of its children. Parsing future statements as children rather than as separate statements would help us optimize for operations where we want to find the fully qualified name by simply searching for the first ancestor.

We would terminate parsing the namespace when another namespace definition occurs. There may be some edge cases here, though, so it's worth discussing.
