Comments (25)
Here are some numbers from the current master branch and Node.js 0.12.
Running `node test/benchmarks.js` for `parse()`:
```
Underscore 1.5.2       42.5 KiB     6.21 ms ± 2.07%
Backbone 1.1.0         58.7 KiB     6.04 ms ± 0.81%
MooTools 1.4.5        156.7 KiB    30.41 ms ± 0.94%
jQuery 1.9.1          262.1 KiB    32.15 ms ± 0.96%
YUI 3.12.0            330.4 KiB    29.39 ms ± 0.80%
jQuery.Mobile 1.4.2   442.2 KiB   109.21 ms ± 9.28%
Angular 1.2.5         701.7 KiB    74.80 ms ± 6.59%
------------------------
                     1994.3 KiB   288.21 ms ± 6.65%
```
To check the tokenizer, apply the following patch:
```diff
diff --git a/test/benchmarks.js b/test/benchmarks.js
index cf7673f..d791c20 100644
--- a/test/benchmarks.js
+++ b/test/benchmarks.js
@@ -322,8 +322,8 @@ if (typeof window !== 'undefined') {
         size = source.length;
         totalSize += size;
         return suite.add(filename, function () {
-            var syntax = esprima.parse(source, { range: true, loc: true });
-            tree.push(syntax.body.length);
+            var tokens = esprima.tokenize(source, { range: true, loc: true });
+            tree.push(tokens.length);
         }, {
             'onComplete': function (event, bench) {
                 if (typeof gc === 'function') {
```
And now running `node test/benchmarks.js` shows the numbers for `tokenize()`:
```
Underscore 1.5.2       42.5 KiB     4.93 ms ± 0.85%
Backbone 1.1.0         58.7 KiB     5.35 ms ± 0.66%
MooTools 1.4.5        156.7 KiB    53.27 ms ± 6.47%
jQuery 1.9.1          262.1 KiB    64.54 ms ± 6.52%
YUI 3.12.0            330.4 KiB    48.70 ms ± 5.46%
jQuery.Mobile 1.4.2   442.2 KiB   103.97 ms ± 8.11%
Angular 1.2.5         701.7 KiB    77.81 ms ± 7.07%
------------------------
                     1994.3 KiB   358.56 ms ± 6.92%
```
from esprima.
@aslushnikov @ikarienator Can you double check with your own setup/environment?
I knew there was a performance problem with the `tokenize` function due to the reconstruction of the output tokens. In fact, the difference is even more pronounced if you turn off `range` and `loc`.
Now with `range` and `loc` both set to `false`.
For `parse()`:
```
Underscore 1.5.2       42.5 KiB     6.28 ms ± 1.77%
Backbone 1.1.0         58.7 KiB     6.15 ms ± 0.76%
MooTools 1.4.5        156.7 KiB    31.12 ms ± 0.91%
jQuery 1.9.1          262.1 KiB    32.82 ms ± 0.86%
YUI 3.12.0            330.4 KiB    29.60 ms ± 0.74%
jQuery.Mobile 1.4.2   442.2 KiB   110.85 ms ± 9.35%
Angular 1.2.5         701.7 KiB    77.49 ms ± 7.43%
------------------------
                     1994.3 KiB   294.31 ms ± 6.91%
```
For `tokenize()`:
```
Underscore 1.5.2       42.5 KiB     4.11 ms ± 0.81%
Backbone 1.1.0         58.7 KiB     4.44 ms ± 0.73%
MooTools 1.4.5        156.7 KiB    47.27 ms ± 7.52%
jQuery 1.9.1          262.1 KiB    54.11 ms ± 4.48%
YUI 3.12.0            330.4 KiB    42.64 ms ± 6.67%
jQuery.Mobile 1.4.2   442.2 KiB    92.78 ms ± 6.55%
Angular 1.2.5         701.7 KiB    69.94 ms ± 4.96%
------------------------
                     1994.3 KiB   315.29 ms ± 6.00%
```
My real-world example is tokenization of one of the Inbox client scripts: roughly 580k tokens of obfuscated JavaScript. Let's generate a sample of this size with the following command:
```sh
cd test/3rdparty
cat underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js angular-1.2.5.js > big.js
```
The command produces a `big.js` file whose tokenization results in 578335 tokens.
I've modified `benchmarks.js` to include `big.js` and to tokenize instead of parse; here are my results for tip-of-tree Esprima:
```
Underscore 1.5.2       42.5 KiB     4.40 ms ± 4.99%
Backbone 1.1.0         58.7 KiB     5.13 ms ± 1.35%
MooTools 1.4.5        156.7 KiB    43.01 ms ± 5.09%
jQuery 1.9.1          262.1 KiB    58.65 ms ± 7.80%
YUI 3.12.0            330.4 KiB    40.88 ms ± 3.26%
jQuery.Mobile 1.4.2   442.2 KiB    95.26 ms ± 8.17%
Angular 1.2.5         701.7 KiB    89.68 ms ± 4.87%
big                  4690.4 KiB  1152.83 ms ± 10.79%
------------------------
                     6684.7 KiB  1489.86 ms ± 9.96%
```
We've spent more than a second tokenizing 4.5 MB of code. This might not look outstandingly bad, but in fact we can do way better.
To show that the performance here is suboptimal, let me run the same benchmark with a modified Esprima branch, taken from here: https://github.com/aslushnikov/esprima/tree/sax
Instead of pushing tokens into an array, the modified branch reports them via a callback. This saves on memory reallocations (I'm not a V8 expert, just guessing) and results in significant performance gains. These are the results for the modified Esprima:
```
Underscore 1.5.2       42.5 KiB     4.51 ms ± 2.09%
Backbone 1.1.0         58.7 KiB     4.68 ms ± 1.35%
MooTools 1.4.5        156.7 KiB    20.17 ms ± 2.10%
jQuery 1.9.1          262.1 KiB    25.25 ms ± 1.45%
YUI 3.12.0            330.4 KiB    20.73 ms ± 1.34%
jQuery.Mobile 1.4.2   442.2 KiB    42.14 ms ± 2.26%
Angular 1.2.5         701.7 KiB    35.53 ms ± 2.24%
big                  4690.4 KiB   324.30 ms ± 7.09%
------------------------
                     6684.7 KiB   477.32 ms ± 5.95%
```
That's 3.5 times faster for `big.js`.
Update: the diff for the modified `benchmarks.js` can be found here: https://gist.github.com/aslushnikov/fbd25ec7a90e8f87d526
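The callback idea can be sketched in miniature like this. A toy scanner stands in for Esprima's real lexer, and the `tokenizeStreaming`/`onToken` names are my own invention for illustration, not necessarily the branch's actual API:

```javascript
// Sketch: report tokens through a callback instead of accumulating
// them into one ever-growing array. The regex scanner is a toy
// stand-in for esprima's real lexer.
function tokenizeStreaming(source, onToken) {
  var re = /[A-Za-z_$][\w$]*|\d+|\S/g;
  var m;
  while ((m = re.exec(source)) !== null) {
    onToken({ value: m[0], range: [m.index, m.index + m[0].length] });
  }
}

// The old array-returning behavior is trivially recovered on top of
// the streaming one, so callers that want an array lose nothing.
function tokenizeToArray(source) {
  var tokens = [];
  tokenizeStreaming(source, function (token) { tokens.push(token); });
  return tokens;
}

console.log(tokenizeToArray('var answer = 42;').length); // 5
```

The point is that the streaming form is the more primitive of the two: building the array is just one possible consumer of the stream.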
@aslushnikov I agree that the tokenizer can still be optimized. At this stage, we are trying to collect as much information as possible before making any modification.
In the `benchmarks.js` that you show, do I understand correctly that the callback is just an empty function? Does the benchmark still produce and save the list of tokens somewhere?
A better way to benchmark would be to clear the list in the `onComplete` callback and call `gc()` immediately. The garbage generated in previous runs can be collected in the next runs, which increases the variation. I'll send a pull request to improve this.
@ariya Yep, that's just an empty function. If the client wants the list of tokens, they can put them in an array on their own. But for some problems a streaming interface is enough, and it will work much faster.
@aslushnikov But that means the benchmark numbers you posted are not a fair comparison. If the callback also adds each token to an array, what will the numbers be?
@ariya, this is not a comparison of "how much time does it take to return an array of tokens to the client". What I want to compare is "how much time does it take to report tokens to the client".
My point here is that there are clients of `esprima.tokenize` that do not need the tokens to be stored in an array; they will be just fine with the tokens being streamed. Consider an example where I just want to count the `for` loops in some source code. I don't need an array of tokens, so why should my performance suffer?
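To make that use case concrete, here is a minimal sketch of such a streaming consumer. The `streamTokens` scanner is a toy stand-in of my own for esprima's lexer; the real branch would report esprima's own token objects:

```javascript
// Count `for` loops from a token stream while keeping O(1) state:
// no token array is ever built.
function streamTokens(source, onToken) {
  var re = /[A-Za-z_$][\w$]*|\d+|\S/g;
  var m;
  while ((m = re.exec(source)) !== null) {
    onToken({ value: m[0] });
  }
}

var forCount = 0;
streamTokens('for (var i = 0; i < 3; i++) { for (;;) { break; } }', function (t) {
  if (t.value === 'for') forCount += 1;
});
console.log(forCount); // 2
```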
Unfortunately, with the current architecture we aren't able to provide a stream of tokens because the tokenizer does not support reentrancy. But it will become possible after we reorganize the code and expose the tokenizer and parser as objects.
@aslushnikov Sounds like a perfectly valid use case. Just to be clear, I'd like to understand (for this particular use case) which one is more suitable for you:
- using the tokenizer as is, but with much better performance?
- having a tokenization callback (memory savings and speed-up as a bonus)?
@ariya As long as the performance of parsing a few megabytes of JS is great, I'm all good! And of course I would prefer that the Esprima API not be changed, if that is even possible.
Any updates on this?
Not yet...
A little bit of a side track, but I think in general running the standalone tokenizer should be slower than running the parser. Also, in theory there is no way to tokenize JS correctly without parsing it: it requires a pushdown automaton (equivalent to a parser of a context-free language) instead of an FSM (equivalent to a regular expression). This is especially true for ES6, where we have template literals. We use several heuristics (an algorithm similar to sweet.js) in the standalone tokenizer to guess the next lexical goal, but it's not 100%. You are not supposed to use it in production.
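A classic illustration of why the lexical goal depends on parse context is the `/` character: after the `)` that closes an `if` condition it begins a regular expression literal, while after the `)` that closes a parenthesized expression it is the division operator. Telling the two apart requires knowing which kind of `)` it was, i.e. matching nested parentheses, which a pure FSM cannot do. This snippet is my own example, not from the thread:

```javascript
// After the `)` closing an if-condition, `/` begins a regex literal:
var s = '';
if (true) /abc/.test('abc') && (s = 'regex');

// After the `)` closing a parenthesized expression, `/` is division:
var half = (10) / 2;

console.log(s, half); // regex 5
```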
@ikarienator thank you for the response. Can we have a tokenization callback then?
@aslushnikov What is your use case for a standalone tokenizer? Why can't you parse the tree and drop the tree information? In fact, I think `esprima.tokenize` should do exactly that. What do you think @ariya?
A pure tokenizer is still a valid use case, particularly since it can consume partially valid JavaScript. It is useful for things like a syntax highlighter or even an autocompleter (e.g. Aulx).
@aslushnikov We should try to understand the performance problem first before adding a new API function. Help is welcome in this matter!
I just ran a quick test using the V8 profiler while tokenizing AngularJS code. The sampling profile looks like:
```
[JavaScript]:
   ticks  total  nonlib   name
     702    9.1%   23.0%  LazyCompile: *collectToken esprima.js:1424:26
     344    4.4%   11.3%  LazyCompile: *skipMultiLineComment esprima.js:412:34
     286    3.7%    9.4%  LazyCompile: *scanIdentifier esprima.js:643:28
     283    3.7%    9.3%  LazyCompile: *filterTokenLocation esprima.js:4101:33
     207    2.7%    6.8%  LazyCompile: *scanPunctuator esprima.js:678:28
     198    2.6%    6.5%  LazyCompile: *skipComment esprima.js:472:25
```
Can you try a minified version?
Minified Esprima or minified AngularJS?
Minified AngularJS
I apologize for being away for some time.
> Why can't you parse the tree and drop the tree information?
@ikarienator The AST lacks semicolons. Also, I would need to know the exact tree structure to walk it, and to have a mapping between node types and JavaScript tokens.
> I just ran a quick test using V8 profiler when tokenizing AngularJS code.
@ariya AFAIK array reallocations are not represented in `v8.log`. Moreover, `test/3rdparty/angular-1.2.5.js` is only 53079 tokens long; you won't be able to feel the destructive effect of array reallocations on arrays of this size.
Performance audit
`esprima.tokenize` running time correlates strongly with the number of JavaScript tokens in the input file (and AFAIU the algorithm is O(N), where N is the size of the input). The concern here is that `esprima.tokenize` performance degrades non-linearly as the input gets significantly larger.
In the following table I additionally show the number of JavaScript tokens in the last column.
```
lushnikov:~/prog/esprima/test(master)$ node benchmarks.js
Doesn't have exposed gc().
Underscore 1.5.2       42.5 KiB     12.71 ms ± 10.20%     7493 tok
Backbone 1.1.0         58.7 KiB     13.00 ms ± 10.07%     8635 tok
YUI 3.12.0            330.4 KiB     45.07 ms ± 11.39%    34996 tok
MooTools 1.4.5        156.7 KiB     53.96 ms ±  9.80%    38672 tok
jQuery 1.9.1          262.1 KiB     68.32 ms ± 15.26%    46112 tok
Angular 1.2.5         701.7 KiB     74.61 ms ± 12.35%    53079 tok
jQuery.Mobile 1.4.2   442.2 KiB     97.53 ms ± 15.63%    73641 tok
big                  4690.4 KiB   1735.33 ms ± 11.79%   578335 tok
```
The `big` input has roughly 7 times more tokens than `jQuery.Mobile`, but takes 17 times longer.
1. Data set
I would insist on using the `big.js` file for profiling purposes. It is 578335 tokens long and effectively reveals the problem (which is hardly observable on smaller data sets):
```sh
cd test/3rdparty
# generate big.js
cat underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js angular-1.2.5.js > big.js
```
2. Benchmark
Now let's modify the benchmark to run tokenization solely on `big.js`. Apply the following patch to `test/benchmarks.js`:
```diff
--- a/test/benchmarks.js
+++ b/test/benchmarks.js
@@ -31,13 +31,7 @@ var setupBenchmarks,
     quickFixture;
 
 fullFixture = [
-    'Underscore 1.5.2',
-    'Backbone 1.1.0',
-    'MooTools 1.4.5',
-    'jQuery 1.9.1',
-    'YUI 3.12.0',
-    'jQuery.Mobile 1.4.2',
-    'Angular 1.2.5'
+    'big'
 ];
 
 quickFixture = [
@@ -322,8 +316,8 @@ if (typeof window !== 'undefined') {
         size = source.length;
         totalSize += size;
         return suite.add(filename, function () {
-            var syntax = esprima.parse(source, { range: true, loc: true });
-            tree.push(syntax.body.length);
+            var syntax = esprima.tokenize(source, { range: true, loc: true });
+            tree.push(syntax.length);
         }, {
             'onComplete': function (event, bench) {
                 if (typeof gc === 'function') {
```
3. Baseline
Let's run the tip-of-tree Esprima version on the benchmark to establish a baseline.
```
lushnikov:~/prog/esprima(master)$ git log --oneline | head -1
bf5c615 fixes #1106: do not accept invalid/incomplete string escape sequences
lushnikov:~/prog/esprima(master)$ node test/benchmarks.js
Doesn't have exposed gc().
big                  4690.4 KiB   1814.81 ms ± 20.40%
------------------------
                     4690.4 KiB   1814.81 ms ± 20.40%
```
Result: 1814ms.
4. Hypothesis
I will verify that the major overhead happens somewhere inside `extra.tokens.push` (`extra.tokens` is the array that holds all the produced tokens). This might be due to the continuous array resizing that happens many times as the array grows.
5. Analysis
I will record a `v8.log` for future reference, though you won't be able to see `array.push` in it.
```
lushnikov:~/prog/esprima(master)$ time node --prof test/benchmarks.js
Doesn't have exposed gc().
big                  4690.4 KiB   1996.43 ms ± 19.37%
------------------------
                     4690.4 KiB   1996.43 ms ± 19.37%

real    0m23.590s
user    0m22.868s
sys     0m2.367s
```
The v8 log shows that the heaviest function is `collectToken`, followed by `filterTokenLocation`. These functions do a similar thing: they iteratively push 538000 objects into arrays.
```
[Bottom up (heavy) profile]:
 Note: percentage shows a share of a particular caller in the total
 amount of its parent calls.
 Callers occupying less than 2.0% are not shown.

  ticks parent  name
  16800  76.4%  .../lushnikov/bin/node
   6202  36.9%    LazyCompile: ~collectToken .../lushnikov/prog/esprima/esprima.js:1427:26
   6199 100.0%      LazyCompile: *lex .../lushnikov/prog/esprima/esprima.js:1463:17
   6199 100.0%        LazyCompile: tokenize .../lushnikov/prog/esprima/esprima.js:4330:22
   6199 100.0%          LazyCompile: ~<anonymous> .../lushnikov/prog/esprima/test/benchmarks.js:318:53
   6199 100.0%            Function: ~<anonymous> :1:10
   3075  18.3%    LazyCompile: ~filterTokenLocation .../lushnikov/prog/esprima/esprima.js:4303:33
   3073  99.9%      LazyCompile: tokenize .../lushnikov/prog/esprima/esprima.js:4330:22
   3073 100.0%        LazyCompile: ~<anonymous> .../lushnikov/prog/esprima/test/benchmarks.js:318:53
   3073 100.0%          Function: ~<anonymous> :1:10
   3073 100.0%            LazyCompile: clock .../lushnikov/prog/esprima/test/3rdparty/benchmark.js:2440:21
```
To indirectly prove that the performance offender is the array and its large size, let's clamp the `extra.tokens` size to 1000. This is safe as long as there are at least a few tokens in the array, because the esprima tokenization code relies on some of the most recently parsed tokens in its logic. Apply the following diff to `esprima.js`:
```diff
diff --git a/esprima.js b/esprima.js
index aebf398..81c6656 100644
--- a/esprima.js
+++ b/esprima.js
@@ -1294,6 +1294,8 @@
             range: [pos, index],
             loc: loc
         });
+        if (extra.tokens.length > 1000)
+            extra.tokens = extra.tokens.slice(-10);
     }
 
     return regex;
@@ -1455,6 +1457,8 @@
         };
     }
     extra.tokens.push(entry);
+    if (extra.tokens.length > 1000)
+        extra.tokens = extra.tokens.slice(-10);
 }
 
 return token;
```
Running the benchmark on the updated version of the code reveals a 5x speedup: 322 ms in this version versus 1814 ms for tip-of-tree.
```
lushnikov:~/prog/esprima(master)$ node test/benchmarks.js
Doesn't have exposed gc().
big                  4690.4 KiB    322.08 ms ±  2.23%
------------------------
                     4690.4 KiB    322.08 ms ±  2.23%
```
Conclusion
Given the above, the main performance offender seems to be the large array. One way to fix this would be to avoid having such a large array at all and to report tokens via a callback (which is what my original pull request was about).
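The delegate shape being discussed can be sketched as follows. This is an illustration of the pattern, not the API Esprima eventually adopted; the toy scanner again stands in for the real lexer, and all names here are mine:

```javascript
// Sketch of a tokenizer delegate: each token is handed to the
// delegate as soon as it is scanned; only what the delegate returns
// is retained, so a purely streaming client keeps memory flat.
function tokenize(source, delegate) {
  var re = /[A-Za-z_$][\w$]*|\d+|\S/g;
  var m, out = [], t;
  while ((m = re.exec(source)) !== null) {
    t = { value: m[0] };
    if (delegate) { t = delegate(t); }
    if (t !== undefined) { out.push(t); }
  }
  return out;
}

// A streaming client: counts semicolons, retains no tokens.
var semicolons = 0;
var kept = tokenize('a = 1; b = 2;', function (token) {
  if (token.value === ';') { semicolons += 1; }
  // returning undefined means nothing is stored
});
console.log(semicolons, kept.length); // 2 0
```

A delegate that simply returns its argument recovers the original array-building behavior, which is why this design subsumes both use cases.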
@aslushnikov Thanks for the excellent and detailed analysis! I think we may want to pursue the path of providing delegates, since there is also another use case for that, see #1113. Just like your original pull request, we then shall implement the tokenizer delegate. There are some implications of that which we may need to take into account, I'll spawn a new issue soon to initiate the discussion.
I believe this is now resolved with the introduction of token delegator support (#1332).