tc39 / proposal-binary-ast Goto Github PK

View Code? Open in Web Editor NEW

946.0 287.0 25.0 426 KB

Binary AST proposal for ECMAScript

javascript ast parser ecmascript ecma proposal standard

proposal-binary-ast's Introduction

Binary AST Proposal Overview

Shu-yu Guo (Bloomberg)

David Teller, Kannan Vijayan (Mozilla)

Vladan Djeric (Facebook)

Ingvar Stepanyan (Cloudflare)

This is the explainer document for a proposed new binary AST format for JS.

Motivation

Performance of applications on the web platform is becoming increasingly bottlenecked by startup (load) time. Larger amounts of JS code are transferred over the wire by more sophisticated web properties. While caching helps, these properties regularly release new code, and cold load times are very important.

A brief survey from July 2017 of the uncompressed JS payload sizes for some popular web applications on desktop and mobile:

Web App (Desktop)	Uncompressed JS Size
LinkedIn	7.2 MB
Facebook	7.1 MB
Google Sheets	5.8 MB
Gmail	3.9 MB
Yahoo	3.4 MB

Web App (Mobile)	Uncompressed JS Size
LinkedIn	6.2 MB
YouTube	1.9 MB
Twitter	1.8 MB
Facebook	1.8 MB
Reddit	1.3 MB

Startup performance degrades with larger JS payloads, even if only a fraction of the code is actually executed. Parsing time is a significant component, taking more CPU time than bytecode / initial JIT code generation. For example, on a powerful laptop, Chrome spends 10% to 15% of CPU time parsing JS while loading facebook.com.

We propose a new over-the-wire format for JS that is a binary encoding of an AST. We believe this new format would allow for drastically faster parsing. Moreover, web developers are well positioned to adopt a new format as they have embraced build tooling.

We have also implemented a prototype encoder and decoder that demonstrates encouraging performance improvements without bloating file size.

Design Philosophy

The overriding design philosophy is to be conservative, so as to be realistic both in implementation and committee. This proposal is intended to be simply an alternative encoding of the surface syntax with the smallest possible delta to enable high performance parsing. By design, this proposal does not attempt any semantics-level encoding (e.g., bytecode, encoding variables instead of identifiers).

That said, this proposal is highly ambitious.

Current Parsing Bottlenecks

1. Information Not Available Where Needed

One fundamental issue making parsing slow is that information needed to make decisions during parsing is oftentimes not yet available at the point in the input stream at which it is needed. This happens when the parser needs information from code it either hasn't parsed yet (like variable hoisting) or code it doesn't want to parse (like inner functions).

One concrete example is the problem of efficiently representing bindings. In order to make decisions about how to represent or allocate a variable, the parser needs to know whether the variable is closed over, so it becomes necessary to parse any nested functions. The specification only states where a declaration like var x needs to be reachable, not how to allocate it -- the information needed for the allocation decision is not encoded where the variable is declared, nor at the spot where it needs to be allocated.

In the following example, due to variable hoisting, use_x closes over x, which is not known to be declared in f at the point of parsing the inner function. An engine cannot emit optimized accesses to x inside use_x until the entirety of f is parsed.

var x;
function f() {
  function use_x() {
    use(x); // Closes over hoisted var x.
  }

  var x;
}

In the following example, an engine's frontend would like to know how to efficiently allocate space for the var declarations as they are parsed. That is not possible without parsing the rest of the function.

function g() {
  var x; // Not closed over, should go on function frame.
  var y; // Closed over, should go on activation object.

  return (function() { use(y); });
}

In the following example, the presence of eval at the bottom of the function is unknown at the time when the engine needs to decide if var x may be accessed dynamically.

function h(input) {
  var x;

  (function() {
    eval(input);
  })();
}

2. Early Error Semantics

JavaScript’s early error semantics requires the entirety of every file be parsed. Engines employ a lazy parsing (also known as pre-parsing) optimization that avoids building a full AST by skipping initial code generation for inner functions until time of first invocation. This is only 50% faster on average, and parsing effort is still proportional to file size. What's worse, this optimization can backfire. If an inner function whose code generation was skipped happens to be called during application startup, then the engine must in effect re-parse the entire function. This happens often enough that developers have resorted to using immediately-invoked function expressions to altogether bypass lazy parsing.

Consider the following. Currently, since innerWithSyntaxError has an early error, outer also must throw the early error.

function outer() {
  function innerWithSyntaxError() {
    var;
  }
}

The proposed behavior change is analogous to wrapping function bodies with eval, as below.

function outer() {
  function innerWithSyntaxError() {
    eval("var");
  }
}

3. Inefficiencies in Using Characters

When parsing, there can be ambiguity at the character-level about what type of expression a JavaScript syntax is encoding. For example, to distinguish between list expressions and the parameter list for an arrow function, parsers need to do additional bookkeeping to account for all valid interpretations until the ambiguity is resolved, or they need to have the ability to backtrack while parsing.

Similarly, lexing is slow due to reasons such as Unicode encoding and having to handle Unicode escape codes. Engines need to check if some text is single-byte or double-byte encoded, for instance.

Proposal

We propose a binary encoding based on an efficient abstract syntax tree representation of JavaScript syntax. This is a new, alternate encoding of the surface syntax, with a fairly close bidirectional mapping to the text representation. We seek to be as conservative as possible in introducing new semantics, which are currently limited to:

deferring early errors
changing Function.prototype.toString behavior
requiring UTF-8

To further speed up parsing, we propose to encode static semantics as implementation hints, and verify them as deferred assertions.

For this AST format, we currently do not propose to automatically derive from the ECMA-262 grammar since it is too verbose to be an efficient tree representation, and as such it would require a lot of flattening and inlining. Additionally, this might restrict the evolution of the JavaScript specification by effectively making the grammar a normative specification of the binary format.

Instead, we propose a separate tree grammar, with annotations on AST nodes that only express information about the syntax. There are several existing tree grammars which we may take as a starting point, such as Babylon or Shift AST.

(Two grammars is likely to be a contentious issue; this decision is not set in stone and feedback is welcome.)

Borrowing from the WebAssembly approach, the binary encoding would be split into 3 layers:

A simple binary encoding of the AST nodes using basic primitives (e.g., strings, numbers, tuples)
Additional structural compression on top of the previous layer, leveraging knowledge about the nature of the format of the file and the AST (e.g., constant tables)
A generic compression algorithm like gzip or Brotli.

We expect the format to be output by existing compilers such as Babel and TypeScript, and by bundlers such as WebPack.

Grammar

The tree grammar consists of the primitives (booleans, strings, numbers), lists, and disjoint unions. Disjoint unions are tagged and have a fixed, expected structure (ordered properties) for a given file. Certain nodes, such as strings and lists, are encoded together with their encoded byte length, allowing parsers to skip past the encoded representation of the node.

Language Evolution and Versioning

The grammar provides the set of all possible node kinds and their ordered properties, which we expect to be monotonically growing. However, not all vendors will support all nodes at any given time. To this end, each file contains a header containing a list of nodes and their properties that the file expects to be used. For example, suppose the grammar currently provides a FunctionExpression node that includes an isAsync property. If a file expects async functions to be supported, it would include a FunctionExpression entry that includes isAsync in the list of properties. If a node or property in the header is not supported by the implementation, an error is thrown.

The header solves the versioning problem, forwards compatibility, and backwards compatibility. It also maintains the expected behavior of JavaScript parsers with respect to new features, insofar as parsing a correctly formatted file fails if a parser encounters a feature that the engine does not implement, rather than relying upon a monolithic version number. Finally, it acts as structural compression: node kinds are to be referred to by their index in the header instead of their name.

To save file space, presets will be supported (e.g., ES2015).

Static Semantics as Annotations

To address concern 1 Information Not Available Where Needed above, the insight is that all the currently known cases of "information not available where needed" are binding-related. Scope nodes, like function or lexical scope nodes, are able to encode additional annotations about the properties of the AST.

A non-exhaustive list of annotations currently considered:

VarDeclaredNames
LexicallyDeclaredNames
ClosedOverNames (new)
AssignedToOutsideOfDeclarationNames (new)
Has CallExpression where the callee expression is IdentifierReference of string literal eval (new)

These annotations, if present, behave as lazy assertions. By the time the scope node is finished being traversed (which may be arbitrarily delayed in case a skipped inner function is never invoked), if the encoded annotation is contrary to the body of the node, a runtime error is thrown. For example, if a function body is annotated with "has VarDeclaredNames x", but does not in fact contain a var declaration with name x, an error would be thrown.

In order to prevent divergence across implementations with regards to timing, such errors are required to be thrown when the nearest enclosing function is invoked or the nearest enclosing global script is evaluated.

It is important to note that these annotations must be checked and are not, strictly speaking, hints that may be ignored. They are assertions on the structure of the syntax tree, and must be verified whether or not an engine chooses to use them as parsing hints.

These annotations would be specified as existing or new static semantics.

Ultimately, this list is ad-hoc and the union of properties that various engines currently derive during parsing. With these annotations, engines may generate code for functions in a single forward pass without parsing the functions in entirety. If the needed hints are not present for an engine to do the single-pass fast path, the fallback is the existing analysis that engines already implement. Should the need for new annotations arise, we expect them to be standardized as new static semantics.

Annotations and scopes are embedded in the AST just as any other node properties and use the same versioning mechanism.

Deferred Early Errors

To address concern 2 Early Error Semantics, Early Errors, like annotation verification, will be deferred to until when the nearest enclosing function is invoked or the nearest enclosing global script is evaluated.

This is a breaking change from textual JS.

Since this format is expected to be compiler output, early errors would be caught at compile time in practice.

UTF-8 with no escape codes

To address concern 3 Inefficiencies in Using Characters, strings and identifiers will be required to be in UTF-8 without support for escape codes. Again, we expect this to be feasible as the format will be compiler output.

Verification

Verification of the header, annotations, and the tree structure itself (e.g., its length-encoded nodes, the allowed kinds for child nodes) is intended to be possible in a single forward pass as constant work per node, as the AST is being decoded.

Function.prototype.toString()

This method would return something like "[sourceless code]".

Exception Offsets

Thrown exceptions, instead of having a line and column number, will have a byte offset into the binary AST file.

Further Improvements

It is straightforward to layer additional improvements on this format in order to further improve parsing speed or binary file size. For example, the format could allow for a string table, or allow function bodies to be stored separately from function declarations, and so on.

Prototype

We implemented an early prototype in Mozilla’s SpiderMonkey engine, by using a grammar based on internal AST format. This was done for speed of implementation, and our next prototype will be based on the Babylon AST.

For the facebook.com static newsfeed benchmark, the binary AST representation was slightly smaller than the original JavaScript. This held true even after both representations were passed through gzip for compression. The size reduction mainly came from the use of a string table and variable-length identifiers for entries in the table. It also used variable-length encodings for representing numbers. Additional size wins are possible by leveraging domain-specific information, such as factoring out common subtrees.

The time required to create a full AST (without verifying annotations) was reduced by ~70-90%, which is a considerable reduction since SpiderMonkey's AST construction time for the plain JavaScript was 500-800 ms for the benchmark.

FAQ

Why not ship bytecode instead?

On the Web, no vendor would agree to ship bytecode.

Engines don’t want to be tied to a bytecode version.
It’s harder to verify bytecode, as bytecode is more expressive than syntax.
It runs the risk of bifurcating the language, as bytecode may be more expressive than structured JavaScript source.
Designing a new bytecode is an even more ambitious undertaking.

Why not use WebAssembly?

There are massive existing untyped JS codebases, and there is no easy way to convert an untyped, garbage collected language like JS to WebAssembly. Given that currently using WebAssembly to ship JavaScript most likely involves bundling a JS runtime as well, it is unclear such a setup would net any load-time performance improvements.

Why not a semantic graph? Or why not go further? Why not types?

We do not want to invent a new language by adding new front-end semantics, nor do we want to encode analysis result and require more sophisticated verification that this information is correct.

Won't this be a massive maintenance burden?

Parsing a binary AST format will not be as complicated as parsing existing JavaScript. It will undoubtedly increase complexity of implementation, but we believe is worth the cost given the trajectory of the complexity and size of web applications.

Won't an AST format bloat file size?

In our prototype described above, we found the AST format to be slightly smaller, even when comparing compressed sizes.

How much of a performance win is there?

For parsing desktop facebook.com, 10-15% of client-side CPU time is spent parsing JavaScript. The prototype we implemented reduced time to build the AST by 70-90%.

Additionally, many complex sites today fetch JavaScript on demand when a user interacts with some secondary functionality (e.g. when a user opens the Notifications panel on Facebook). Long parsing times on these interactive paths reduce the responsiveness of the site.

Would it be possible to write a tool to convert serialized AST back to JS?

Yes, it would be possible to automatically generate human-readable JavaScript from the AST format. By virtue of the syntax tree being abstract and not concrete, the pretty printer will not be a perfect recreation of the input JavaScript but it will be semantically equivalent.

Such serializers may be crucial to providing a compelling devtooling story.

Is this made obsolete by programmable caching APIs?

No. Programmable caching APIs drastically improve warm load times. This proposal improves cold load times. It composes nicely with caching APIs.

Will this work alongside text JS source?

Yes. HTML integration is forthcoming, but an application is free to mix loading text source and binary AST files.

Will this be a new surface for attacks?

Yes, but we believe this is a much smaller and much more controlled and testable surface than alternatives, such as bytecode.

proposal-binary-ast's People

Contributors

Stargazers

Watchers

proposal-binary-ast's Issues

Reconcile streaming interpreter and streaming compiler usecases

As @sebmarkbage and others have pointed out, the deferred-until-invocation-of-function error model is not amenable to streaming interpreters. That model is amenable to streaming compilers, but to be able to interpret a binary AST stream incrementally, errors need to be even more lazy.

It is not clear to me how to reconcile the two use cases. Maximum laziness on errors enables streaming interpreters, but hurts compiled code performance by requiring runtime checks. Per-function laziness means the entire function needs to be inspected before any code can execute.

In the browser world, streaming interpreters is not a compelling use case IMO. However, streaming interpreters is more compelling for an engine specialized only for run-once code (e.g., no JITs).

We should either come up with a technical solution that enables both use cases, or explicitly decide to not support one.

How are the various "enum"s encoded?

They're listed as strings in the spec, but it would seem highly inefficient to encode them that way. Are they in fact encoded as strings? (If not, you could encode them as LEB128 integers.)

Why allow early errors to be serialized?

Curious as to why even allow early errors to become an AST?

Add verification of property names vs identifiers

FunctionBody should be a Node in order to use "or"

Currently FunctionBody is FrozenArray.

typedef FrozenArray<Statement> FunctionBody;

but ArrowExpressionContents contains an attribute with (FunctionBody or Expression) type, which, I think, means FunctionBody and Expression should have common super interface.

interface ArrowExpressionContents : Node {
...
  attribute (FunctionBody or Expression) body;
};

So FunctionBody should be the following definition.

interface FunctionBody : Node {
  attribute FrozenArray<Statement> statements;
};

(Optional) type information

Binary AST could hold additional type information for variables, parameters and return types. This can help VMs do some optimizations ahead of time. Of course types should be statically validated but potentially this static analyse could be done by the same toolchain as used for binary AST creation. Was this idea considered?

What do deferred early errors mean for JSON?

This might be beyond the scope of the current proposal, but I expect that we'll want to apply binary parsing also for JSON data.

For the specific case of JSON, I suspect that we want the data to be checked eagerly. Does this mean that we want to guarantee that some subsets of EcmaScript will be checked eagerly? That we want to be able to be able to specify that some files need to be parsed eagerly?

Introduce LiteralI32Expression

In practice, pretty much all integer literal numbers we encounter are within [|-2^31, 2^31[| (I believe that I have never seen an integer literal outside of this range). Experience shows that introducing the following is a pretty good filesize win:

typedef (LiteralF64Expression or LiteralI32Expression)
  LiteralNumericExpression;

interface LiteralF64Expression {
  attribute double value;
}

interface LiteralI32Expression {
  attribute long value;
}

Any objection to this? The assumption being that the long will be a variable length encoded integer.

Positions (à la source maps)

The current text of the proposal does not mention anything about mapping binary AST offsets back to source positions. If the original input is JavaScript source code, then I understand that debugging tools could show the pretty-printed version of the AST. However, if the input JS or the binary AST itself was produced by a compiler from another language, we need some way to map back to the original source file.

Directly leveraging source maps as they currently exist does not seem possible. They rely on very precise positions in the compiled .js file. If the .js file used for source mapping is the product of pretty-printing the AST, the source maps are very sensitively dependent on the pretty-printing algorithm. This would require to specify that algorithm along with the binary AST specification. IMO that is not a desirable situation.

Instead, I would suggest two possible paths to address this.

The first would be to store original positions of nodes directly in the AST. This has the advantage that it would produce extremely accurate source positions through the compilation pipeline of the VM, eventually resulting in better source mapping for the binary AST than what we enjoy with source maps at the moment. The disadvantage is that the binary AST itself is encumbered by positions, which would probably amount to a significant portion of the file size. Although useful for development "builds", it would unnecessary increase bloat for production files. That could be mitigated by a global flag at the beginning of the file telling whether positions are stored or not.

An alternative would be to define an equivalent to source maps, specifically designed for binary AST. Such source maps would map binary AST offsets to source positions.

What should be put into AssertedParameterScope for duplicate parameters?

from https://bugzilla.mozilla.org/show_bug.cgi?id=1497788

If there's duplicate parameters, what should be in AssertedParameterScope?

The current spec requires AssertedPositionalParameterName for them.
like,

function f(a, a) {}

will have the following data for AssertedParameterScope:

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 0,
      name: "a",
      ...
    },
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

given the purpose of Asserted*Scope is to provide the information about binding, having duplicate entry without any info about duplication won't be nice.

The situation is following:

duplicate parameters are allowed only if there's no destructuring/default/rest parameters
only the last one of the duplicate parameters become the actual binding

what I can think of is the following 2 solutions:

(a) Use yet another `Asserted*Name` with index for non-last duplicate parameters

for example, AssertedPositionalDuplicateParameterName

AssertedParameterScope {
  paramNames: [
    AssertedPositionalDuplicateParameterName {
      index: 0,
      name: "a",
      ...
    },
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

this way, CheckParameterNames and CheckPositionalParameterIndices can be done in almost same way as current ones

(b) Do not put non-last duplicate parameters

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

this is smallest, but CheckParameterNames and CheckPositionalParameterIndices should be modified in order to check duplication

Validating Babylon ASTs

Even if it's well-typed, Babylon ASTs permit things which are not valid JavaScript syntax (for example, module declarations in scripts). What's the idea for how to validate these ASTs? I suppose, with the "delayed early error" idea, you validate at the function granularity, when the function is actually called; this would include include not just what would be JavaScript early errors, but also anything that would be malformed about the Babylon AST--is that what you had in mind?

What is HasMalformedDirectives?

I don't understand what this assertion is meant to accomplish. Naively it looks like it would forbid

"directive";
("directive");

which I don't think it should?

Comparison to WebASM

So I fully understand the differences, but some visitors may not. It might be nice to have a section about the differences between this and WebASM, and maybe (for the rest of us) why this is seen as more suitable or different than WebASM philosophically/practically.

Handling bindings introduced by a CatchClause then captured

Consider the following snippet:

try {
 ...
} catch (e) {
  return function() {
    throw e; // Or, really, do anything with `e`.
  }
}

In this snippet, the anonymous function captures a binding e which was introduced implicitly by a CatchClause. For the moment, we do not have a way of representing this in the AST.

[Discussion] Out-of-band signal for requesting Binary AST

I recently asked on the chat about a planned way to request Binary AST from the server and got the following answer:

@Yoric: We plan to have a mechanism, but we haven't attempted to design it yet. The vague consensus for the moment was to use something like <script src="..." binsrc="...">, which seems like the cheap way to keep it backwards-compatible.

While this is a relatively simple solution, I have a concern about limitations it imposes.

In particular, in an ideal world I think it would be reasonable to support a usecase where e.g. a shared CDN with lots of JavaScript libraries could simply create Binary AST variants of all the assets, and return them instead of regular JavaScript when it knows that 1) browser supports it and 2) that such change would be mostly invisible to the consumer (that is, JS was indeed requested via <script> or import(...) or other means purely for execution, and not with XMLHttpRequest or fetch).

To support usecases like that, signal for Binary AST support should come not from HTML level (as it's much harder to get HTML updated on all the websites where script is inserted), but rather on network level.

One way to do this would be adding binast or similar marker to the Accept-Encoding list for script requests in supported browsers, which would tell the server that Binary AST version can be safely returned with Content-Encoding: binast in the response.

Using encoding headers for this goal feels quite natural, as it's mostly an encoding format for JavaScript, although one might argue that because it's not lossless in terms of debugging information, it doesn't belong to Accept-Encoding/Content-Encoding headers - in that case, I'm open to any proposals.

We should be defining Dictionaries rather than Interfaces

I got an impromptu drive-by review from the Firefox DOM team. One of the suggestions was to turn all our Interfaces into Dictionaries.

Javascript Tooling

TL;DR - this is cool because it unifies the JS ecosystem AST, and can lead to perf when compiling JS in the current toolchain.

We expect the format to be output by existing compilers such as Babel and TypeScript, and by bundlers such as WebPack.

If I or someone else doesn't bring it up at the meeting, I'l just note here that another reason that this proposal is a 👍 from me is that it would help to unify the AST of the javascript ecosystem of tools (especially regarding babel, webpack, uglify, other parsers like acorn, esprima, etc). https://github.com/estree/estree was started as most know to unify this effort but it's diverted in certain ways (babel 5 to babel 6 https://github.com/babel/babylon/#output although we have a compat plugin).

I believe some people wanted to do an ESTree 2 (I believe the reasoning that Sebastian changed the AST in Babel 6 was just that the old AST was difficult to use when trying to create Babel plugins). If this becomes part of JS then it could be a nice forcing function for the tools that need to output the binary AST and everyone else that opts in to this could give perf benefits to developers given we can pass the same AST between all tools (ex: babel -> webpack -> uglify) and not have to re-parse in that toolchain as well as in the browser itself.

Discussion on the actual AST is a bit different but ya just saying the idea is enticing!

Also interesting is that future proposals could propose the AST node as part of the staging process (and Babel can help with that given we can start implementation at Stage 0). So we'll already have progress in that area

Can be another issue, but I had a question about if comments would be encoded in the AST or if that can be an extension? Webpack/uglify use comments to do chunking/dead code elim (it doesn't have to be the in output to the browser of course but if the AST is extensible enough to encode that info in a node property/comment that would be good)

A FunctionExpression/Method/Getter/Setter may capture its own name

The following snippet demonstrate how the name of a FunctionExpression can be captured:

let a = function f() {
  return function g() {
    return f;
  }
};
let b = a()(); // This is `f`

The same technique would probably work for Method, Getter, Setter.

For the moment, we have nowhere to store this information.

question: resource-constrained devices

This question might not yet be answerable.

I'm curious about the impact on this proposal on resource-constrained devices.

Am I correct that a spec-compliant VM would need to implement a second, completely separate parser for the binary AST format? A resource-constrained device may find the added weight of this implementation burdensome.

However, if such a VM could be deployed with only the binary AST parser implementation, is it likely or unlikely to actually reduce the footprint of the VM compared to the way things currently stand (in terms of both memory and persistent storage usage)?

This proposal would certainly improve parsing speed on a puny device, but I'm wondering if there are other costs.

Thanks!

UPDATE: By "persistent storage usage" I mean size of the firmware blob.

Can we put more information into AssertedParameterScope ?

Another extension for https://github.com/binast/ecmascript-binary-ast/issues/50

The requirement on the implementation is following:
https://bugzilla.mozilla.org/show_bug.cgi?id=1475458

To satisfy it without changing the implementation (this is just for reference. I'm also thinking about changing implementation side), the following structure is necessary:

enum ParameterKind {
    "simple",
    "default",
    "destructuring",
    "destructuring default",
    "rest",
    "destructuring rest"
};

interface AssertedParameterName {
  attribute unsigned short index;
  attribute ParameterKind kind;
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

interface AssertedParameterScope {
  attribute FrozenArray<AssertedParameterName> paramNames;
  attribute boolean hasDirectEval;
};

index is where the binding is defined in parameters list.

for example, function f(a, b=10, {c}, [d, e] = [], f, ...g) {} has the following scope data:

AssertedParameterScope {
  paramNames: [
    AssertedParameterName {
      index: 0, kind: "simple", name: "a", isCaptured: false
    },
    AssertedParameterName {
      index: 1, kind: "default", name: "b", isCaptured: false
    },
    AssertedParameterName {
      index: 2, kind: "destructuring", name: "c", isCaptured: false
    },
    AssertedParameterName {
      index: 3, kind: "destructuring default", name: "d", isCaptured: false
    },
    AssertedParameterName {
      index: 3, kind: "destructuring default", name: "e", isCaptured: false
    },
    AssertedParameterName {
      index: 4, kind: "simple", name: "f", isCaptured: false
    },
    AssertedParameterName {
      index: 5, kind: "rest", name: "g", isCaptured: false
    },
  ],
  hasDirectEval: false
}

Also, we need bodyScope before param.

Avoiding AST and inserting the needed information into the source code

In the meeting notes of the July TC39 meeting, it was said that the space savings of the AST are minimal (compared to a minified version) - about 5%, so AST-ing the source code doesn't give a lot of benefit in terms of space-savings.

OTOH, the main (and considerable!) parse-time savings are accomplished from:
* Knowing which variables are hoisted in a scope
* Knowing which variables are "closed over" in a scope
* Marking the presence of an eval or with in a scope
* Early errors can de-optimize lazy parsing
* Efficiencies when using binary instead of text characters.

Then again, defining an AST is a big endeavor, and may be fraught with politics, not to mention having to maintain it forever every time the language changes syntactically.

So given that an AST is problematic, and yet the parse-time savings from the proposal are considerable, I was wondering whether we could get the parse-time savings by using various markers in the source code to add the information that is missing. In other words, the above points that are saved can be encoded into the source code using special comments or string directives. Something like:

//@efficient-parse

var x;
function f() {
  //@efficent-parse: hoisted[x]
  function use_x() {
    use(x);
  }

  var x;
}

(I wouldn't even call the above a strawman proposal, but just an incredibly rough draft to determine where I'm going here...)

This will probably enable parsers to deal with the first of the four points described in the proposal, but will not let them deal with the last point (efficiencies from the use of binary instead of text characters), given that it's still keeping the source code as source code with text characters. Unfortunately, I don't know how much of the performance improvements were gotten from each of the points above. It may be that 90% of the optimizations was because text was discarded, which makes my proposal of keeping the source code as text dead on arrival. :-)

So I can't really know whether my proposal is efficient, but if it is, I believe the benefits of keeping the source code as source code and not going into the endeavor of defining an AST for JS may outweigh the decrease in performance that keeping source code as text will make.

Grammar quirks

It is almost certainly too early to worry about this, but a couple notes while I'm thinking of them:

The tree grammar specified does not allow for (var a = b in c);, which is a legal program (in sloppy mode, assuming Annex B) as of tc39/ecma262#614.
There's a variety of ways that well-typed trees can fail to correspond to real programs, which should all be captured in this project (except that said project hasn't been updated for async/await yet). For example, you can't have an if with an else as the body of an if without an else, even though the tree types can represent that. You also have to make sure that Identifiers are actually identifiers and that sort of thing. These aren't captured by the early error rules because they don't match the lexical grammar, and so presumably will need to be checked explicitly.
The type for TemplateExpression can be made more strict by having something like

interface Interpolation {
  attribute Expression value;
  attribute TemplateElement after;
}
interface TemplateExpression : Node {
  attribute Expression? tag;
  attribute TemplateElement start;
  attribute FrozenArray<Interpolation> elements;
};

instead of the current TemplateExpression definition which just has a list of elements which mixes Expressions and TemplateElements. Shift doesn't currently do this because it's kinda awkward to use (or, well, I think that was the justification, but have now forgotten), but this project might find it to be worth it.

Explicit inclusion of magic number in spec

Can we have a commitment to include a magic number in this new format? A lot of the pain that Node is experiencing around supporting ES Modules is the inability to determine whether a file is ESM or CJS based on the ambiguities that exist in the syntax and the lack of some kind of pragma. The controversy around .mjs vs .js exists almost entirely due to this ambiguity. Since Node can't use mime-types like the browser, a leading pragma or magic number in these new file formats that VMs can load would provide much more optionality than having to simply rely on file extension.

Why feature post-default cases with `SwitchStatementWithDefault`?

Semantically, the labels are meaningless after parsing, so why include them as part of the AST?

In case you're wondering what precedent there is for just ignoring the extra case labels:

ESTree
Babel
UglifyJS just treats them all as one and the same, making no distinction between case and default. (The minifier still strips them as dead code, however.)

Optimizing booleans, numbers, etc.

For the moment, literal numbers and booleans are represented by

interface LiteralInfinityExpression : Node { };

interface LiteralNumericExpression {
  attribute double value;
};

interface LiteralBooleanExpression : Node {
  attribute boolean value;
};

This typically means that a boolean will be stored as 2 bytes and a number other than infinity as 9 bytes. The latter is particularly odd, since the vast majority of numbers are 0 and 1. So, in AST v2, I had success decreasing the size of files by introducing special literal values for 0, true and false.

One way of doing this would be to introduce in the grammar

interface LiteralInfinityExpression : Node { };
typedef (LiteralZeroExpression or
  LiteralOneExpression or
  LiteralDoubleExpression)
 LiteralNumericExpression;

interface LiteralDoubleExpression {
  attribute double value;
};
interface LiteralZeroExpression {};
interface LiteralOneExpression {};

typedef (LiteralTrueBooleanExpression or LiteralFalseBooleanExpression) LiteralBooleanExpression

interface LiteralTrueExpression : Node { }
interface LiteralFalseExpression : Node { }

Similarly, it seems odd to reserve one byte for UpdateExpression to determine whether it's a prefix or a postfix expression, so we could rewrite

typedef (PrefixUpdateExpression or PostfixUpdateExpression) UpdateExpression;

interface PrefixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

interface PostfixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

Admittedly, introducing a change in the AST solely for the sake of compression sounds a bit odd. An alternative would be to either trust compression (but this didn't seem to work that well in AST v2) or somehow make the (de)tokenizers smart enough to introduce the above changes transparently (note to self: the deanonymizer would certainly be smart enough to do this).

Starting the conversation here.

List annotations

For the moment, we have discussed the following annotations:

has direct eval;
let-bindings;
var-bindings;
captured names;
directives.

Parameter binding lists should force parameters bindings to be enumerated in source order.

Right now the reference encoder feels free to sort parameter bindings lexicographically, which doesn't play nicely with engines.

Clarify experiment results

The time required to create a full AST (without verifying annotations) was reduced by ~70-90%, which is a considerable reduction since parsing time in SpiderMonkey for the plain JavaScript was 500-800 ms for the benchmark.

Is "the time required to create a full AST" 500-800 ms? Or is it a subset of that? Maybe stating the actual reduction in milliseconds would be helpful.

.length property should be available for lazy function

Currently FormalParameters params field is inside FunctionExpressionContents interface, which has [Lazy] attribute inside LazyFunctionExpression interface.
Function object's .length property needs that information, even before executing the function.

interface LazyFunctionExpression : Node {
  attribute boolean isAsync;
  attribute boolean isGenerator;
  attribute BindingIdentifier? name;
  attribute FrozenArray<Directive> directives;
  [Lazy] attribute FunctionExpressionContents contents;
};

interface FunctionExpressionContents : Node {
  attribute boolean isFunctionNameCaptured;
  attribute boolean isThisCaptured;
  attribute AssertedParameterScope parameterScope;
  attribute FormalParameters params;
  attribute AssertedVarScope bodyScope;
  attribute FunctionBody body;
};

if we're going to entirely skip parsing FunctionExpressionContents contents field for lazy functions until executing the function, the length of formal parameters should be put into LazyFunctionExpression interface.

Introduce interface attribute [Skippable]

The following is a proposal to make clearer in the specifications which parts may be skipped and which may not.

Current state

In the current state of things, all lists may be skipped. This was decided because it was clear that hardcoding arbitrary interface names in the detokenizer was a bad idea, and because the only alternatives were allowing all interfaces to be skipped or allowing all lists to be skipped.

This is generally a waste of bytes, as we don't care about skipping most lists.

Proposal

Attribute

We add an extended attribute [Skippable] to interfaces. A [Skippable] interface is one in which

the encoder stores the byte-length of the node, so that we can easily jump ahead;
the semantics promise that skipping causes no bad surprises (i.e. captured names, lexically declared variables, etc. are all well-defined).

I'm not entirely clear about clause 2. at this stage, but I imagine that there are (or will be in the future) subsets of Asserted*Scope that are only useful when skipping. If so, this means that we only need them in a [Skippable] interface.

Example

Now, we redefine FunctionDeclaration as follows:

typedef (EagerFunctionDeclaration or SkippableFunctionDeclaration) FunctionDeclaration;

interface EagerFunctionDeclaration {
    // ...
    // Pretty much what we currently have in `FunctionDeclaration`, unless we realize
    // that some of the data is needed only for skipping.
}

[Skippable] interface SkippableFunctionDeclaration {
    // Here, insert any attribute that may be useful for skipping.
    // For instance, we could imagine an instruction for the parser to enqueue
    // the node for background processing, with a given priority.
    attribute EagerFunctionDeclaration content;
}

We apply the same treatment to other functions/method/getter/setter/...

Expected benefits

We give the encoder one more optimization lever, since the encoder may decide between EagerFunctionDeclaration and SkippableFunctionDeclaration.
We have a mechanism to extend laziness to other constructions, for instance large JSON objects.
We have a clearer manner of specifying/checking which parts of the semantics may be affected by the Binary AST.
We have a clearer manner of specifying the (de)tokenizer.
We improve file size.

The binary format doesn't have to directly mirror normal JS.

I'm not convinced it's required to marry the binary format too tightly to the JS source, and there are things you could do to make it simpler. It's not like the binary format is meant to be human-readable, just machine-readable. There are also idioms in compiled-to-JS code (e.g. from Elm, CoffeeScript, and Babel) that could also stand to benefit from a binary format that that's at least aware of some of their needs.

Here's a few of the ideas I have to throw at the wall. Apologies if this comes across as a bit rambly.

Adding a constant for undefined
- This amounts to the vast majority of uses of void 0 (what UglifyJS replaces strict-mode undefined with) and similar.
- UglifyJS already prefers a local permanently undefined variable over void 0 for code size reasons.
- It's already a pseudo-literal in strict mode.
Allowing synthetic, anonymous locals that aren't viewable by direct eval or with
- This would be useful for anything compiling to JS with extra sugar or whatnot, like Scala.js, TypeScript, and Babel.
- This would also be useful for stripping local names from the binary for minification purposes.
Adding built-in source map support.
- Read: nodes can have locations attached.
Adding a built-in description/metadata field.
- Read: you can still include your license and such. It's just not something that can easily be placed as like a comment.
Adding combined type-checking operators for typeof x === "number", etc.
- This amounts to 90% of typeof use cases, and engines already desugar these 100% of the time. (It's an obvious optimization.)
Adding combined coersion operators for x | 0/~~x, !!x, x + "", etc.
- Those are the only uses for any of those productions, and engines never implement these literally.
- In binary asm.js code, one of those could replace the standard x | 0 or at least act as a synonym.
- Elm and TypeScript code could use these quite frequently.
Separating the pragmas from the source/function body.
- This avoids having to parse the source via lookahead to determine what pragmas apply.
- This makes it much simpler to just skip what you don't know, and skip pragmas you do know, but have already matched.
Separating strings from the code, instead storing debug string/name references as {int offset; int length} pairs and putting their values consecutively in a string table before any real code references.
- The string table could be just be slab-allocated from the start, then filled with data as fast as it receives it. (This is very low-allocation and easily optimizable, so parsing this part is I/O-bound.)
- It avoids needing to allocate nearly as much when the string/name is being parsed, needing only a reference to the data.
- The string table is specific to the script/module file, and can be reclaimed after compilation. (String names are copied during bytecode generation post-parsing.)
- There are potential size benefits here in my experience for making this separation: binary data and string data almost always compress differently, and from previous experimentation under similar circumstances (compressing data in a long URL hash), I noticed about a 2%-4% decrease alone just by doing this.
Reducing logical "and"/"or" to corresponding if/else variants: test && other ↔(tmp = test) ? tmp : other, test || other ↔ (tmp = test) ? other : test.
- This could increase code size slightly (no more than a few bytes overhead if it can't reuse an existing synthetic local), but every engine I'm aware of performs this desugaring already, so why make it different.
Using breakable blocks with an expression-based AST instead of a statement-based one.
- Expression-based ASTs are usually smaller than statement-based ones.
- You could optimize certain cases like let x = foo(); try { x = foo(); } catch (e) { if (!(e instanceof SafeError)) throw e }, where it could be simplified to something roughly like let x = try { foo() } catch (e) { e instanceof SafeError ? undefined : throw e }. Edit: Missed a backtick 😄
- Elm's highly expression-oriented nature would love this, too.
Requiring fallthrough to be explicit in switch statements.
- It's not common to actually want the fallthrough mechanism.
- Combined with the previous version, many uses of switch would become much smaller when encoded.
Declaring locals before script/module/function body.
- This makes it easier for engines to type-check which locals are accessible where.
Declaring imports/exports before module data table.
- These would be specified in a separate array, followed by a list of imported bindings from each, for performance reasons.
- This can enable engines to efficiently request and link modules before they even start parsing the meat of them. (In normal JS, you can't do this. At all. Not without some deep wizardry no engine implements support for.)
- Loading the data table is one of the most I/O-bound parts of the compilation process, so it's possible multiple modules could have their data tables being loaded simultaneously, even in single-threaded environments.
Encoding the bytecode as a hybrid register/stack machine.
- This removes most of the need for temporaries.
- This makes the bytecode a little more dense.
- Not 100% sure how this is otherwise helpful, though...

Consider merging LiteralInfinityExpression into LiteralNumericExpression

As discussed on Gitter, we might want to do this to simplify handling and provide optimised representations (like binast/binjs-ref#239) more easily for both kinds of numbers.

Some points from the discussion:

double can store infinity values just fine in any IEEE.754 compatible representation, so this shouldn't be an issue, although one voiced concern is that WebIDL defines double as finite floating 64-bit numbers, but this likely shouldn't be a problem due to how we use it.
Another concern is having to provide a specialised codegen for infinity values in LiteralNumericExpression that would produce something like 1e111111111... (see https://bugzilla.mozilla.org/show_bug.cgi?id=1524302#c2), but implementors would need to do the same for LiteralInfinityExpression as well, so it's only a matter of place where to put the branching.

cc @syg @Yoric @arai-a

Module Target

There seems to be very little mention of modules in this spec and issue queue - I assume a ECMAScript Module target will be supported? Ideally this should likely be the primary use case here.

Consider specializing a subset for JSON

I believe that we should consider offering a subset of the syntax specialized for JSON-style literal expressions.

Expected benefits

combined with Skippable, we could skip over these literals and read the object lazily;
one more optimization lever for the encoder when it meets JSON-style literals;
this can be used straightforwardly to extend JSON.parse() to TypeArray and create a new JSON.binify();
it is possible (but not certain) that we could make this encoding more concise and faster to decode.

Possible spec

typedef (... // Previous stuff
         LiteralExpression)
        Expression;

typedef (LiteralObjectExpression or
         LiteralBooleanExpression or
         LiteralStringExpression or
         LiteralNullExpression or
         LiteralNumericExpression or
         LiteralArrayExpression)
        LiteralExpression;

typedef (EagerLiteralObjectExpression or SkippableLiteralObjectExpression) LiteralObjectExpression;
typedef (EagerLiteralArrayExpression or SkippableLiteralArrayExpression) LiteralArrayExpression;

interface EagerLiteralObjectExpression {
    attribute FrozenArray<LiteralObjectProperty> properties;
}
[Skippable] interface SkippableLiteralObjectExpression {
    attribute EagerLiteralObjectExpression value;
}

interface LiteralObjectProperty {
   attribute LiteralPropertyName name;
   attribute LiteralExpression value;
}
interface EagerLiteralArrayExpression {
   attribute FrozenArray<LiteralExpression> elements;
}
[Skippable] interface SkippableLiteralArrayExpression {
    attribute EagerLiteralArrayExpression value;
}

Add high-level goal to allow ASTs to be inlined into <script> tags

I really think this proposal is great, and I wanted to suggest the addition of one relatively small high-level goal that I didn't see in the proposal.

It can be very useful in some cases to inline critical JavaScript into an HTML page, especially if the browser doesn't support HTTP/2 server push. I would love to see this project have as a high-level goal the ability for ASTs to be embedded into an HTML <script> tag. This could either be as the contents of the tag, or, more likely, as a data URI for the src attribute.

Misleading FAQ on WebAssembly comparison

README states:

Why not use WebAssembly?

There are massive existing untyped codebases, and there is no easy way to convert an untyped, garbage collected language to WebAssembly. And even if there were, there is no guarantee that it would be any faster to transmit/parse/start than what we currently have.

whereas WebAssembly FAQ states:

The kind of binary format being considered for WebAssembly can be natively decoded much faster than JavaScript can be parsed (experiments show more than 20× faster). On mobile, large compiled codes can easily take 20–40 seconds just to parse, so native decoding (especially when combined with other techniques like streaming for better-than-gzip compression) is critical to providing a good cold-load user experience.

Transfer repo to TC39 org

All stage 1+ proposals need to live in the tc39 org on github. Please follow the transfer instructions to move this repo to the appropriate place.

Can we have `AssertedPositionalParameterName` interface which contains index?

At first I thought this is the same issue as https://github.com/binast/ecmascript-binary-ast/issues/30, but I guess it's a bit different.

While creating binding data for parameters, we want a list of positional formal parameters+indices which directly maps to arguments element, so that the binding name maps to arguments slot at the point of reading scope data.
With current spec, it's unknown before parsing FormalParameters.

What I propose is the following:

interface AssertedPositionalParameterName {
  attribute unsigned short index;
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

interface AssertedParameterName {
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

typedef (AssertedPositionalParameterName or
         AssertedParameterName)
        AssertedMaybePositionalParameterName;

interface AssertedParameterScope {
  attribute FrozenArray<AssertedMaybePositionalParameterName> paramNames;
  attribute boolean hasDirectEval;
};

AssertedPositionalParameterName contains the index, which is the index in parameter list, and also the index in arguments. (to be clear, it's not the index in paramNames array).
AssertedParameterName is basically the same thing as current AssertedBoundName.

for example, function f(a, b=10, {c}, [d, e] = [], f, ...g) {} has the following scope data:

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 0, name: "a", isCaptured: false
    },
    AssertedParameterName {
      name: "b", isCaptured: false
    },
    AssertedParameterName {
      name: "c", isCaptured: false
    },
    AssertedParameterName {
      name: "d", isCaptured: false
    },
    AssertedParameterName {
      name: "e", isCaptured: false
    },
    AssertedPositionalParameterName {
      index: 4, name: "f", isCaptured: false
    },
    AssertedParameterName {
      name: "g", isCaptured: false
    },
  ],
  hasDirectEval: false
}

there a and f are positional parameters and their indices are 0 and 4.
isSimpleParameterList field is removed, because it's obvious from paramNames (by, whether there's at least AssertedParameterName).

What about directives?

I haven't attempted to benchmark that, but in the current text source world, when parsing a function, the parser must:

start parsing with the current directives;
if the first child is a string, check the string;
if that string is "use strict", restart from 1 with different options.

That seems awkward. Maybe there is a better way to handle this in the binary world. Or maybe "use strict" doesn't really impact parsing all that much in the binary world, to be checked.

Why do some interfaces not inherit from Node?

Is this a typo or is there a semantics associated to this?

Subresource Integrity integration

How would a binary AST encoded resource be delivered when SRI is involved? How should the hash be calculated on both the encoder and the decoder sides?

Consider adding IfElseStatement for if-then-else, separating from IfStatement for if-then

(moved from binast/binjs-ref#131)

Background

we want to generate bytecode directly from .binjs file, without generating on-memory AST structure
we don't want to seek or lookahead inside .binjs file, for simplicity and performance
we want to reduce the amount of modification to already-generated bytecode (including source note) (patching jump target is necessary anyway tho...)
IfStatement has optional alternate statement, and the existence of the alternate is unknown until we start parsing it, that means, it's unknown when generating branch opcode or generating bytecode for consequent

interface IfStatement : Node {
  attribute Expression test;
  // The first `Statement`.
  attribute Statement consequent;
  // The second `Statement`, if present.
  attribute Statement? alternate;
};

So, with current IfStatement interface, we should modify the branch's kind (source note) when it turns out that there's alternate.
It would be better that kind of information is known at the beginning.

Solution

Separate IfStatement into IfStatement without alternate, and IfElseStatement with alternate

interface IfStatement : Node {
  attribute Expression test;
  attribute Statement consequent;
};

interface IfElseStatement : Node {
  attribute Expression test;
  attribute Statement consequent;
  attribute Statement alternate;
};

Pros

Bytecode generation becomes simpler
IfStatement without alternate becomes smaller in .binjs file

Cons

The number of interfaces increases, that might increase the header size and tree size (depends on file format)
(With multipart encoding, increasing interfaces may result in the number of tuple index exceeding 127, which results in 2-bytes data inside [TREE])

Consideration

This is very implementation specific requirement (at least for my case), and I'm not sure if it's a good idea to modify spec for such purpose
Similar issue (requiring a certain aspect of subtree, before the appearance of the actual child node) would happen to other interfaces as well, and separating all of those interfaces might explode the number of interfaces

Equivalent work

Are you aware of JSZap from Microsoft?
https://www.microsoft.com/en-us/research/publication/jszap-compressing-javascript-code/

Seems very close to the purpose of this proposal

AssertedVarScope needs the actual kind of each name

Currently AssertedVarScope has the following structure:

interface AssertedVarScope {
  // checked eagerly during transformation
  attribute FrozenArray<IdentifierName> lexicallyDeclaredNames;
  attribute FrozenArray<IdentifierName> varDeclaredNames;

  // checked lazily as inner functions are invoked
  attribute FrozenArray<IdentifierName> capturedNames;
  attribute boolean hasDirectEval;
};

So at least it doesn't distinguish between let and const.

This is troublesome for streaming compilation because the type of each name is unknown until we hit VariableDeclaration, which can appear later in the statement list. (there can be many another statements before it), and we cannot create efficient representation of the scope until hitting all of them.

Also, it would be nice if it also distinguish between var and function, which might be too-SpiderMonkey specific tho.

Details on size overhead comparisons

It is unclear to me from the documentation provided whether the size savings specified are the result of the format essentially requiring compression as part of its specification, vs actually being meaningfully smaller; esp. compared to minified code+gzipped code that is currently the de jure deployment.

I ask because this is a significant body of new codegen, and new parsers with new attack surface -- I think it is reasonable to know how the size of the binary AST compares to the size of a standard conforming JS in a normal tool assisted environment (which to be clear, the binary AST would require).

The field order doesn't always match the streaming compilation order

When experimenting streaming compilation from multipart .binjs file to SpiderMonkey bytecode,
the order of the fields often become trouble [1][2][3].

The issue is that, if we don't seek/lookahead, and don't keep on-memory AST converted from .binjs file, we should compile in the same order as the .binjs file.
but SpiderMonkey often emits the bytecode in the different order (or interleaving sub-trees) than the original JS syntax itself,
and sometimes applies optimization depends on the nodes which appears later.

the issue comes from that, we cannot lookahead without extra overhead with current format,
(thus current experimental implementation doesn't support seek),
because non-Skippable nodes doesn't have length property at the beginning of the serialized data.

of course we could emit the different bytecode than original JS, in the same order as .binjs file, but it would be the source of extra .binjs-specific bugs, and I'd like to avoid it as much as possible.

So, it would be nice if we can support (maybe partial-) tree-traversal without reading same field twice, in the file-format level.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1456006#c1
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1456404#c2
[3] https://github.com/binast/ecmascript-binary-ast/issues/33

Consider specifying how browsers tell servers they accept AST files

Thanks for such a cool proposal!

Continuing on from this thread, I wanted to suggest that there should be some sort of structured way for browsers to tell servers whether or not the browser understands Binary AST files, and, if supported, which version(s) are understood.

I think the most likely candidate for implementation is the HTTP Accept header, although it gets a bit complicated when combined with features like server push or inlined scripts.

If browsers don't send an Accept header or something like it, servers will have to use user-agent sniffing to figure out whether to send Binary AST or JavaScript. In my experience with sending different versions of JS to browsers, this is a pretty cumbersome and error-prone solution.

Thanks again for this!

Could the binary AST be...a bit more binary?

I'm revisiting this proposal a few months later, and I'm wondering: could this proposal be better specified in terms of raw bytes? Currently, it seems largely spec'd in terms of a JSON-like format, but IMHO that doesn't really seem like it's as small as it could be. For one, it could leverage LEB128 much like WASM does and in a similar fashion. It also doesn't need to keep type names or even operator names as strings, so I feel being a bit more binary could realize the proposal's intent a little better.

Javascript can represent invalid unicode strings

It's possible for valid Javascript strings to be invalid unicode strings - arising out of the fact that JS strings are specced as arbitrary sequences of 16-bit words. This means that invalid UCS-2 sequences, for example \udc11 (which is a lone surrogate pair component) can show up in our string literals.

The binast encoding needs to handle this - we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. This all relates to situations where 16-bit chars fall into the surrogate pair range.

My suggestion is the following: we translate the 16-bit word sequence as if it was a UTF-16 string. This means that when we see valid surrogate pair sequences, we translate those into unicode codepoints and re-encode as a UTF-8 sequence.

When we see surrogate pair values that occur in invalid circumstances, we encode those directly as codepoints. These 16-bit chars are not valid unicode codepoints, so there is no valid UTF-8 sequence that corresponds to them. Those sequences are thus "free" for us to use to encode invalid 16-bit codepoints.

I'm not 100% sure this needs to be addressed in the spec, but @Yoric suggested I make the issue here because it may need to be addressed here.