Parsing and unparsing have super-linear runtime, largely because OCaml's list append (@) and string concatenation (^) copy their left operand, so building a result one element at a time costs O(n^2).
Parsing: parse_tilde quadratic (/cubic?) behavior
for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print '~'; print 'a' x $n;" > /tmp/t; time ./parse_to_json.native /tmp/t > /dev/null; done
user 0m0.015s
user 0m0.034s
user 0m0.153s
user 0m0.830s
user 0m4.540s
user 0m27.661s
user 3m4.591s
(real/sys runtime omitted for brevity)
Notes:
- This is likely because of this line in ast.ml:
  | c::s' -> parse_tilde (acc @ [c]) s'
  Each `acc @ [c]` copies the whole accumulator, so the loop alone is quadratic. It's not obvious to me why the measured runtime looks cubic (rather than quadratic), though.
- An alternative would be to prepend to the list, and then reverse the result at the end.
- The parse_to_json.native test application is from https://github.com/andromeda/pash/tree/main/compiler/parser
- To avoid conflating the libdash runtime with the JSON serialization, I disabled the JSON serialization code (print_ast) in the test app
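The prepend-and-reverse idea could look like the following sketch. This is a simplified stand-in, not the real parse_tilde from ast.ml (which has more cases); here the hypothetical function just collects characters up to ':' or '/' to show the shape of the fix:

```ocaml
(* Sketch of the prepend-and-reverse fix. Prepending with :: is O(1),
   and the single List.rev at the end is O(n), so the whole loop is
   linear instead of quadratic. *)
let parse_prefix (s : char list) : char list * char list =
  let rec go acc = function
    | [] -> (List.rev acc, [])
    | (':' | '/') :: _ as rest -> (List.rev acc, rest)
    | c :: s' -> go (c :: acc) s'   (* O(1) instead of acc @ [c] *)
  in
  go [] s

let explode str = List.init (String.length str) (String.get str)

let () =
  let prefix, rest = parse_prefix (explode "user/home") in
  assert (prefix = ['u'; 's'; 'e'; 'r']);
  assert (rest = ['/'; 'h'; 'o'; 'm'; 'e'])
```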
Parsing: to_assign cubic (?) behavior
for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print 'f' x $n; print '=pay_respects;'" > /tmp/t1; time ./parse_to_json.native /tmp/t1 > /dev/null; done
user 0m0.011s
user 0m0.021s
user 0m0.098s
user 0m0.571s
user 0m3.749s
user 0m25.097s
user 2m52.969s
This also has an expensive list append in ast.ml:
| C c :: a -> to_assign (v @ [c]) a
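Here a Buffer avoids the quadratic append entirely. A simplified sketch (the real to_assign also handles escaped/quoted entries; this hypothetical version only models plain characters, splitting at the first '='):

```ocaml
(* Sketch: accumulate the variable name in a Buffer instead of a char
   list built with v @ [c]. Buffer.add_char is amortized O(1), so
   scanning an n-character name is O(n) overall. *)
let to_assign (cs : char list) : string * char list =
  let buf = Buffer.create 16 in
  let rec go = function
    | [] -> failwith "not an assignment: no '=' found"
    | '=' :: rest -> (Buffer.contents buf, rest)
    | c :: rest -> Buffer.add_char buf c; go rest
  in
  go cs

let explode str = List.init (String.length str) (String.get str)

let () =
  let name, rest = to_assign (explode "ffff=pay_respects") in
  assert (name = "ffff");
  assert (rest = explode "pay_respects")
```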
Unparsing: ^ string concatenation considered harmful
OCaml's "^" string operator copies both operands into a fresh string; concatenating n strings one at a time can therefore take O(n^2) runtime (https://discuss.ocaml.org/t/whats-the-fastest-way-to-concatenate-strings/2616/7). This is arguably a compiler issue; e.g., CPython optimizes the common in-place concatenation case (https://mail.python.org/pipermail/python-dev/2013-February/124031.html).
ast.ml's unparsing is essentially a series of "^" operations, so everything has a super-linear worst-case runtime.
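The linear alternative is to collect the pieces and join them once. A minimal sketch contrasting the two idioms (hypothetical `join_*` helpers, not the actual ast.ml code):

```ocaml
(* Quadratic: each ^ copies the entire accumulated string so far. *)
let join_slow pieces = List.fold_left (fun acc p -> acc ^ p) "" pieces

(* Linear: a Buffer (or String.concat) copies each piece exactly once. *)
let join_fast pieces =
  let buf = Buffer.create 256 in
  List.iter (Buffer.add_string buf) pieces;
  Buffer.contents buf

let () =
  let pieces = ["echo 'Hello '"; "| cat "; "| cat "] in
  assert (join_slow pieces = join_fast pieces);
  assert (join_fast pieces = String.concat "" pieces)
```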
Here's an example of a long pipeline, showing quadratic runtime for json_to_shell:
for n in 2000 4000 8000 16000 32000 64000 128000; do (echo -n "echo 'Hello '"; for f in `seq 1 $n`; do echo -n "| cat "; done; echo) > /tmp/s1; cat /tmp/s1 | ./parse_to_json.native > /tmp/j1; time ./json_to_shell.native /tmp/j1 | md5sum; done
user 0m0.035s
user 0m0.112s
user 0m0.508s
user 0m1.920s
user 0m13.909s
user 0m52.966s
user 4m26.274s
I also tried removing the Ast.to_string operation in json_to_shell.ml, to show that the JSON deserialization by itself is fast (linear); thus, the quadratic runtime is due to the libdash core.
Unparsing: fresh_marker for heredocs is slow on adversarial inputs
fresh_marker tries increasingly long variants (EOF, EOFF, EOFFF, ...) until it finds a marker that is not contained in the heredoc. Each candidate check scans the whole heredoc, so adversarial inputs that deliberately use all those markers force quadratic behavior:
cat <<CTHULHU
EOF
EOFF
EOFFF
EOFFFF
EOFFFFF
CTHULHU
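One way to sidestep the repeated scans: a marker one 'F' longer than the longest "EOF"^"F..." run in the body can never occur in it, and that run length can be found in a single pass. A sketch of this idea (not the actual fresh_marker; I'm assuming markers of the form EOFF...F as above):

```ocaml
(* Linear-time marker choice: scan the heredoc once for the longest
   "EOF" ^ F^k substring, then return a marker with one more 'F',
   which by construction cannot appear in the body. *)
let fresh_marker_linear (body : string) : string =
  let n = String.length body in
  let longest = ref 0 in
  for i = 0 to n - 3 do
    if String.sub body i 3 = "EOF" then begin
      let j = ref (i + 3) in
      while !j < n && body.[!j] = 'F' do incr j done;
      let run = !j - i in               (* length of "EOFF...F" at i *)
      if run > !longest then longest := run
    end
  done;
  if !longest = 0 then "EOF"
  else "EOF" ^ String.make (!longest - 2) 'F'

let () =
  assert (fresh_marker_linear "EOF\nEOFF\nEOFFF\n" = "EOFFFF");
  assert (fresh_marker_linear "hello" = "EOF")
```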