
RaptorJIT


RaptorJIT is a Lua implementation suitable for high-performance low-level system programming. If you want to use a simple dynamic language to write a network stack, a hypervisor, a unikernel, a database, etc., then you have come to the right place.

RaptorJIT is a fork of LuaJIT where we aim to provide:

  • Ubiquitous tracing and profiling to make application performance and compiler behaviour transparent to programmers.
  • Interactive tools for inspecting and cross-referencing trace and profiler data (Studio).
  • Collaborative and distributed development based on the Linux kernel fork-and-merge model.

The most notable technical changes since forking LuaJIT are:

  • Added auditlog and vmprofile: low-overhead ("always on") binary tracing and profiler logging features. Removed the obsoleted introspection-based tracing features, including jit.v, jit.dump, and jit.p.
  • Reduced code maintenance footprint by ~50% by removing #ifdef features that are not required for Linux/x86-64, e.g. Windows support, 32-bit heap support, and non-x86 backends. This is a necessary short-term expedient to make the code maintainable while we bootstrap the project.
  • Compiler heuristics tightened to reduce the risk of bytecode blacklisting causing catastrophic performance drops.
  • Started using git merge to accept contributions of both code and development history from other forks.

RaptorJIT is used successfully by the Snabb community to develop high-performance production network equipment. Join us!

RaptorJIT compilation for users

Build using LuaJIT to bootstrap the VM:

$ make  # requires LuaJIT (2.0 or 2.1) to run DynASM

Build without bootstrapping, when not hacking the VM:

$ make reusevm  # Reuse reference copy of the generated VM code
$ make          # Does not require LuaJIT now

Inspecting trace and profiler data interactively

To understand how your program executes you first produce diagnostic data (auditlog and vmprofile files) and then you inspect them interactively with Studio.

You can produce diagnostic data on the command line:

$ raptorjit -a audit.log -p default.vmprofile ...

Or within your Lua code:

jit.auditlog("audit.log")
local vmprofile = require("jit.vmprofile")
vmprofile.open("default.vmprofile")

Then you can copy the files audit.log and *.vmprofile into a directory /somepath and inspect that with the Studio script:

with import <studio>;
raptorjit.inspect /somepath

Studio will then parse, analyze, cross-reference, etc, the diagnostic data and present an interactive user-interface for browsing how the program ran.

Tutorial videos for Studio are also available.

RaptorJIT compilation for VM hackers

RaptorJIT uses Nix to provide a reference build environment. You can use Nix to build/test/benchmark RaptorJIT with suitable versions of all dependencies provided.

Note: Building with nix will be slow the first time because it downloads the exact reference versions of the toolchain (gcc, etc) and all dependencies (glibc, etc). This is all cached for future builds.

Build with nix

Install nix:

$ curl https://nixos.org/nix/install | sh

Build in batch-mode and run the test suite (option 1a):

$ nix-build    # produces result/bin/raptorjit

Build in batch-mode without the test suite (option 1b):

$ nix-build -A raptorjit

Build interactively (option 2):

$ nix-shell    # start sub-shell with pristine build environment in $PATH
[nix-shell]$ make -j    # build manually as many times as you like
[nix-shell]$ exit       # quit when done

Build without nix

$ make

... but make sure you have at least make, gcc, and luajit in your $PATH.

Run the benchmarks

Nix can also run the full benchmark suite and generate visualizations with R/ggplot2.

The simplest incantation tests one branch:

$ nix-build testsuite/bench --arg Asrc ./.   # note: ./. means the current directory

You can also test several branches (A-E), give them names, specify command-line arguments, say how many tests to run, and allow parallel execution:

# Run the benchmarks and create result visualizations in result/
$ nix-build testsuite/bench                     \
            --arg    Asrc ~/git/raptorjit       \
            --argstr Aname master               \
            --arg    Bsrc ~/git/raptorjit-hack  \
            --argstr Bname hacked               \
            --arg    Csrc ~/git/raptorjit-hack2 \
            --argstr Cname hacked-O1            \
            --argstr Cargs -O1                  \
            --arg    runs 100                   \
            -j 5           # Run up to 5 tests in parallel

If you are using a distributed nix environment such as Hydra then the tests can be automatically parallelized and distributed across a suitable build farm.

Optimization resources

These are the authoritative optimization resources for processors supported by RaptorJIT. If you are confused by references to CPU details in discussions then these are the places to look for answers.

The AnandTech review of the Haswell microarchitecture is also excellent lighter reading.

Quotes

Here are some borrowed words to put this branch into context:

I'm outta here in a couple of days. Good luck. You'll need it. Mike Pall

Optimal code is not optimal to maintain. Vyacheslav Egorov

If a programmer is indispensable, get rid of him as quickly as possible. Gerald M. Weinberg

If a system is to serve the creative spirit, it must be entirely comprehensible to a single individual. Dan Ingalls

The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague. E.W. Dijkstra

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. C.A.R. Hoare

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it? Brian Kernighan

raptorjit's People

Contributors

0xflotus, alexandergall, bonidjukic, capsadmin, corsix, darius, dibyendumajumdar, eugeneia, funny-falcon, ladc, lazara5, lukego, ravenslofty, takikawa, wiladams, wingo


raptorjit's Issues

Distribute as a binary

RaptorJIT should be distributed as a binary. The binary should be built automatically by CI.

You can build it yourself, of course, but due to #23 #24 #25 there is not much point because you will end up with the same binary anyway.

Of course you will want to run your own builds if you are hacking on it... :-)

New garbage collector?

Question: does RaptorJIT need a new garbage collector? Why? Why not?

LuaJIT has a design for a new garbage collector. @corsix and @fsfod both have partial implementations of the new design. However, I am missing the background here: what is the high-level problem with the existing collector and how is this solved by the new one? (As an application developer, what can I do with the new collector that I can't do today?)

The rationale in the design document is quite terse:

The current garbage collector is relatively slow compared to implementations for other language runtimes. It's not competitive with top-of-the-line GCs, especially for large workloads. The main causes for this are the independent memory allocator, cache-inefficient data structures and a high number of branch mispredictions. The current GC design is a dead end when it comes to further performance tuning.

This rationale is too general for me to really process.

My feeling is that each GC design has certain advantages and disadvantages that impact different applications in different ways. Some applications worry about the upper bound on GC latency, others worry about the practical maximum heap size, others worry about the maximum rate of new object allocation or old object mutation, and so on. The application developer needs to understand how the GC works in order to design their application in harmony.

Take Snabb for example:

  • We are sensitive to the maximum latency of a GC step. Snabb operates in a soft-realtime environment using network cards that can only buffer about 100us of packets before they overflow.
  • We have a low allocation rate because we take care to sink all allocations in inner loops as much as we can.
  • We keep the Lua heap small and allocate all large data structures with the FFI.

So for Snabb it could be very exciting to have a new collector that would clamp our worst-case latency, or would allow us to relax about allocation sinking, or would allow us to use the Lua heap instead of FFI for large data structures.

Question is, does the new design do any of those things? Or would we somehow benefit from the other things that it does?

(Pardon the Snabb-centric description. Could also be that the new GC benefits other applications than Snabb and is valuable for those reasons. RaptorJIT is absolutely not Snabb-specific. I am only trying to understand the motivation for the new LuaJIT GC in terms of benefits rather than features.)

See also current Snabb investigation of GC performance and behaviour at snabbco/snabb#1121.

Radical idea: Should we rewrite the VM in...

Should we rewrite the VM in GNU assembler?

Elevator pitch: Frees us from the bootstrap dependency on Lua/LuaJIT/minilua, costs moderate porting effort, and has very little impact from a maintenance perspective.

Situation now:

  • RaptorJIT VM is written in DynASM (C variant.)
  • RaptorJIT build depends on Lua/LuaJIT/minilua for DynASM.

Issues:

  • DynASM is just being used as a generic assembler. Chosen for historical reasons.
  • DynASM C variant is being used. Lua would be nicer. (Maybe bad idea with minilua.)
  • Build dependency on Lua/LuaJIT/minilua/nix is inconvenient for RaptorJIT end-users (but reasonable for upstream RaptorJIT hackers.)

So we need a solution so that:

  • End-users can build RaptorJIT easily (conservative build dependencies.)
  • VM hackers can work efficiently (liberal build dependencies if helpful.)
  • RaptorJIT code footprint is kept small (don't overwhelm maintainers.)

Solutions:

  • Status quo: Currently end-users need to install LuaJIT as a build dependency for DynASM. This is inconvenient, but not the end of the world.
  • Short term: We can break the LuaJIT dependency for end-users by checking the generated VM code into the repo. However there is a good chance that this will become unwieldy and kludgy over time if the code is churning often e.g. due to frequent merge conflicts, busy-work of regenerating and checking in, risk of accidentally missing an update and building stale code, etc.
  • Long term: Maybe it does make sense to reformulate the VM in GNU assembler? The code should look and feel basically the same as the current DynASM version but be buildable with standard GNU tools. (It will also be more work to integrate VM changes from other LuaJIT branches, but that ship has sailed: we are not going to live in LuaJIT's shadow.) See jonesforth.S for an example of a virtual machine written in GNU assembler.

Related:

  • DynASM in Lua mode is an extremely valuable piece of software. Likely it should be bundled with RaptorJIT for convenient use by application programmers. Snabb hackers are increasingly using DynASM for key algorithms and data structures, see e.g. siphash.dasl. This doesn't necessarily mean we need to use DynASM for the RaptorJIT VM though.
  • We already decided that rewriting the VM in C is a dubious idea.

See also: #58 "Rewrite the VM bytecode interpreter in C?"

How does stack frame unwinding work?

I have to understand how stack frames and unwinding work. I also want to understand whether this can be simplified by dropping support for some complex scenarios e.g. exceptions propagating through interleaved Lua and C++ stack frames.

This is especially relevant to rewriting the VM in C (#93) and wanting to keep it simple and avoid potential dependencies on the C compiler. How simple can we make everything? What functionality will that cost us?

Anybody already understand this and care to offer some hints for getting started?

HotPenalty low/medium/high watermark?

This is an idea for a simple and systematic approach to making JIT heuristics more conservative about blacklisting (#100).

Problem

The JIT sometimes encounters situations where it is able to create a trace but instead it "aborts" in the hope that a better trace can be found. This leads to messages in the JIT log like "leaving loop in root trace" and "inner loop in root trace" and "blacklisted". This is helpful when the JIT does actually find a better trace, or neutral when the JIT later accepts the trace, or harmful when the aborts cause a bytecode to be blacklisted even though a trace could have been recorded.

There are some special-case heuristics for detecting when the same thing is happening persistently and eventually accepting it, as in lj_record.c:innerloopleft(), but it is not obvious that every case is covered (for example reaching a blacklisted bytecode seems to always abort and accept a trace that ends immediately before the ILOOP.)

Proposed solution

I would like to be confident that the JIT becomes less "fussy" over time and will always accept a "good" or merely "okay" trace in preference to falling back on the interpreter. I propose that a simple and uniform heuristic is the way to achieve this.

Suppose that the "penalty" score for each bytecode (measuring how persistently it is aborting when chosen as the start of a trace) would be divided into low/medium/high watermarks. The JIT would then adapt its "fussiness" based on the penalty category:

  • Low: Accept only "very good" traces e.g. looping root traces.
  • Medium: Accept also "good" traces e.g. root trace that reaches a compiled loop (LINNER).
  • High: Accept "bad" traces e.g. a straight line ending on a non-JITable NYI instruction.

This could be applied to the codebase by guarding every "heuristic" trace abort like:

if (penalty_low_watermark(J))
  lj_trace_err(J, LJ_REASON);

and then it would be clear from inspection that the aborts will not lead to blacklisting.

Thoughts? cc @javierguerragiraldez and @corsix again in case my thinking is all muddled and I need a poke in another direction!

Validating LuaJIT optimizations

I set up a CI job to evaluate the effectiveness of existing LuaJIT optimizations on Linux/x86-64 (Haswell.) Conclusion is that they are all valuable and worth maintaining. (Too bad, it would have made life easy to be able to remove some!)

You can browse the automatic report from the master branch (permalink to latest) but the main summary is this first graph (click to zoom):

optimizations

This compares the master branch to other branches no-fold, no-loop, no-narrow, no-sink that each disable one major optimization. If the optimization is beneficial then we expect these branches to be slower than master.

My take:

  • Disabling fold, loop, or narrow impacts a lot of benchmarks. The optimizations are often providing a 2-4x speedup.

  • Disabling sink does not have much effect on these benchmarks, but as an application developer I know that this optimization is critical when working with pointers or 64-bit integers. So the lack of visual impact here is a sign that the benchmarks need to be expanded to show the benefit of allocation sinking.

  • One benchmark (nsieve-bit-fp) actually improves when the narrowing optimization is disabled. However, this is a rare exception, and may have been written specifically to exercise a bad case (not sure.) The biggest win for narrow is md5 with ~3x difference.

So now it is clear to me that all of these optimizations are worth keeping, understanding, and maintaining.

I hope this issue illustrated the quantitative approach that I want to take to optimizations. Each optimization has a cost in complexity and must have a compensating benefit in performance. The benefit needs to be established with reproducible benchmarks so that we know it is real and we know how it evolves over time (e.g. whether it is ever obsoleted by code and/or hardware changes.)

(If you read the full report you can also see that disabling loop optimization makes performance between runs more consistent. This is because performance will be less sensitive to side-traces effectively inhibiting the loop optimization in non-deterministic ways. I really want to make performance more consistent but I can't live without the loop optimization.)

How to profile and optimize code that runs interpreted?

How should we profile and optimize code that runs in interpreted mode?

VMProfile (LuaJIT/LuaJIT#290) currently only reports the overall % of time spent in the interpreter:

    what percent
  interp    39.9
       c    25.7
      gc    17.9
 foreign     7.6
    head     4.8
    loop     4.1
    exit     0.0
  record     0.0

This tells us that we are spending 39.9% of our time in the interpreter. Fine. But what does this really mean, and what should we do about it?

What it means: I'd say that if you are spending time in the interpreter then you have a performance bug that you need to fix. That's certainly true for Snabb anyway: the application is supposed to run fast and absolutely cannot afford to spend any significant amount of time in interpreted mode. So the first step in any optimization flowchart would be "Are you spending time in the interpreter? If so then stop doing that."

What should we do about it? We need to identify the reason that the code is running interpreted and eliminate that so that it will JIT instead. This is frequently the accidental use of an "NYI" bytecode in some part of the program that causes a loop or function to be "blacklisted." This blacklisting can then propagate to all the callers, and their callers, etc, when their traces reach an IFUNC/ILOOP bytecode representing some code that is blacklisted.

So! I reckon the profiler should break down the % time interpreted according to which bytecode is the root cause of the code being interpreted, i.e. which bytecode is at the end of any chain of IFUNC/ILOOP blacklistings. This is very different from how the jit.p profiler operates: it tells us which code the interpreter is running, but we don't care about that; we only want to know which line of code prevented the JIT from compiling.

Have to think about the best way to provide this information to the profiler. Could e.g. be that the interpreter state needs to include the "reason" that the interpreter is currently running and then the profiler's signal handler needs to sample that.

RFC: New bytecode "HOTCOUNT"

Problem: Tracing is unpredictable due to non-deterministic collisions in the HOTCOUNT hashtable. Local changes in source code can lead to global changes in performance when certain bytecodes become aligned in their lower bits. (This is directly analogous to cache conflict misses.)

Solution: Replace the shared HOTCOUNT table with a separate counter for each loop and function head. This way we will measure when one specific bytecode gets hot rather than when one shared hash bucket gets hot.

Implementation: Each loop and function head will be immediately followed by a HOTCOUNT bytecode that contains the corresponding counter. HOTCOUNT executes as a NOP. The current count is stored in the first operand, initialized by the bytecode compiler, and updated in place. (The second operand could perhaps be used to remember the original value and restore that on jit.flush().)

TODO: Consider whether this design works for detecting hot side-traces. (I have not looked at this code for this... is it done in the VM interpreter or somewhere else?)

Further: This design could permit experimental language extensions for hinting the compiler where a trace should start by specifying the initial value a given HOTCOUNT instruction should have. This could help to write test suites that explicitly test different trace structures.

See also #11 #56.

-Ofuse degrades md5 benchmark

The md5 benchmark from the LuaJIT test suite performs better when the fuse optimization is disabled. This seems like a bug: something fuse is doing is hurting rather than helping performance.

This issue is reproducible with LuaJIT v2.1 (x86-64/GC64).

Example from an E5-2620v3 CPU:

 Performance counter stats for './luajit ../../luajit-test-cleanup/bench/md5.lua 20000':

       1684.486328      task-clock (msec)         #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               179      page-faults               #    0.106 K/sec                  
     4,042,712,768      cycles                    #    2.400 GHz                    
     8,166,814,912      instructions              #    2.02  insn per cycle         
     1,046,206,012      branches                  #  621.083 M/sec                  
           379,100      branch-misses             #    0.04% of all branches        

       1.685012534 seconds time elapsed

 Performance counter stats for './luajit -O-fuse ../../luajit-test-cleanup/bench/md5.lua 20000':

       1460.412241      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               180      page-faults               #    0.123 K/sec                  
     3,504,887,005      cycles                    #    2.400 GHz                    
     8,647,103,342      instructions              #    2.47  insn per cycle         
     1,048,078,728      branches                  #  717.660 M/sec                  
           344,457      branch-misses             #    0.03% of all branches        

       1.461060263 seconds time elapsed

Observations:

  • Takes 1.68 seconds with default settings but only 1.46 seconds with -O-fuse.
  • The default (fused) version executes fewer instructions (8.17G vs 8.65G) but at a lower rate (2.02 vs 2.47 instructions per cycle.)

Could be that the Haswell CPU is choking on the fused instructions?

Blacklist bytecodes conservatively

One of the worst things that can happen in a long-running process is for the JIT to "blacklist" a frequently executed bytecode. RaptorJIT needs to treat blacklisting as "the nuclear option" and to employ it only as a last resort.

Blacklisting a bytecode means that the JIT has given up on recording traces that start on a certain line of code. This is severe because this line of code is blacklisted indefinitely for the lifetime of the process, which can be several years for server applications.

Blacklisting is also extra-severe because it can propagate. If other code reaches the blacklisted bytecodes frequently then it will also be blacklisted, etc, and this cascading effect can easily slow an application down by a factor of 10 or more.

Cases where blacklisting is appropriate:

  • Short pieces of code that can never be JITed successfully. For example a tight loop that executes an NYI operation should be blacklisted because there is no way that the JIT can compile it even partially.

Cases where blacklisting is NOT appropriate:

  • Simple bad luck during tracing e.g. unfortunate control flow on the series of executions that were recorded.
  • Fussiness during tracing e.g. heuristics rejecting a workable trace to continue searching for a better one.
  • Heavy load e.g. being overwhelmed by applications that record thousands of traces and will frequently exercise any "theoretically rare" false-positive problems.
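A hypothetical illustration of the one appropriate case: closure creation (the FNEW bytecode) has historically been NYI in LuaJIT, so a tight loop that creates a closure on every iteration can never be traced successfully, and blacklisting it is the right call. The assumption that FNEW is NYI is the illustrative part here.

```lua
-- Every trace attempt through this loop aborts on the closure
-- creation (assumed-NYI FNEW bytecode), so blacklisting the loop
-- is appropriate: no partial trace through it can ever succeed.
local acc = 0
for i = 1, 1000 do
  local f = function() return i end   -- closure creation each iteration
  acc = acc + f()
end
assert(acc == 500500)
```

Contrast this with the three inappropriate cases above, where a perfectly JITable loop gets blacklisted only because of bad luck or fussy heuristics.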

Create Git workflow for pulling LuaJIT

RaptorJIT should pull fixes and improvements from the LuaJIT v2.1 branch. Git should keep track of what has already been merged and what is still pending. However, RaptorJIT has diverged from LuaJIT significantly and messy/irrelevant changes should be skipped entirely.

So we need to come up with a procedure to pull in the latest changes without making a mess.

Just now I merged all the new LuaJIT commits into my working tree like this:

$ git fetch https://github.com/luajit/luajit v2.1 && git merge FETCH_HEAD

but this pulls in many conflicts such as where new ports have been added to LuaJIT. This is messy to unpick by hand. I need a way to select the commits that seem actually relevant to RaptorJIT and then resolve the conflicts only in those. I could use cherry-pick, but I would prefer for Git to keep track of which commits have already been considered via merge commits.

Switch to Lua-mode DynASM

RaptorJIT should switch from the C-hosted DynASM to the Lua-hosted DynASM.

This requires reformulating the VM source (vm_x64.dasm) and replacing dasm/ with the code from LuaJIT/LuaJIT#108.

Idea: Super Looper Side Trace

This is an idea that builds on Super Side Traces (#108).

Suppose that when we create a side trace we do construct the super-IR that includes all parent instructions. It is likely that this super-IR will form a loop. This is because often a side trace will return to the starting point of its root trace.

So the idea is to create a new looping trace based on the super-IR whenever this is possible. This would allow a source code loop to compile into many looping side traces.

This would partially solve the "side trace Russian Roulette problem" (LuaJIT/LuaJIT#218). If each control path through the loop forms a looping trace then the machine code will be efficient for a larger class of biased loops. Specifically, the loop does not need to be biased to take the same control path as the original root trace, it only needs each iteration to be biased the same way as the previous iteration.

This seems like a potential big win for predictable high performance, i.e. less sensitivity to non-deterministic trace selection heuristics.

Studio 0.1 release: JIT visualization tools

I have made the first "zero-dot" release of Studio, the application that can interactively visualize RaptorJIT traces and profiles. It's the code that creates the screenshots over on the auditlog branch (#63).

It's very early days but brave souls are welcome to try it out! Feedback of all kinds is welcome via issues on the Studio repo.

Example: How a side trace inherits from a parent trace

Here is an example to help explore what is special about a side trace (#99). The idea is to look at the LuaJIT IR dump for a simple side trace and see what we can figure out.

Example code

local array = require("ffi").new('uint64_t[1]')
for i = 1, 1000 do
  -- START OF PARENT TRACE
  global = 0
  local tab = {}
  local u64 = array[0]
  local var = global
  local tmp
  if i < 500 then
    tmp = 0 
  end
  -- START OF SIDE TRACE
  tmp = tab
  tmp = u64
  tmp = var
  global = 1
end

IR

Here is the root trace up to the point where the exit is taken to the side trace:

---- TRACE 1 start test4.lua:2
---- TRACE 1 IR
....              SNAP   #0   [ ---- ---- ]
0001 rbp      int SLOAD  #3    CI
0002 r9       fun SLOAD  #0    R
0003 rdx      tab FLOAD  0002  func.env
0004          int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63
0006 r8       p64 FLOAD  0003  tab.node
0007 rax   >  p64 HREFK  0006  "global" @35
0008          tab FLOAD  0003  tab.meta
0009       >  tab EQ     0008  NULL
0010          num HSTORE 0007  +0
0011          nil TBAR   0003
....              SNAP   #1   [ ---- ---- ---- 0001 ---- ---- 0001 +0   ]
0012  {sink}  tab TNEW   #0    #0
0013 rbx   >  cdt SLOAD  #2    T
0014          u16 FLOAD  0013  cdata.ctypeid
0015       >  int EQ     0014  +96
0016          p64 ADD    0013  +16
0017 rcx      u64 XLOAD  0016
0018  {sink}  cdt CNEWI  +12   0017
....              SNAP   #2   [ ---- ---- ---- 0001 ---- ---- ---- 0012 0018 +0   ---- ]

and here is the setup portion of the side trace:

---- TRACE 2 start 1/2 test4.lua:13
---- TRACE 2 IR
0001 rbp      int SLOAD  #3    PI
0002 rcx      u64 PVAL   #17
0003  {sink}  tab TNEW   #0    #0
0004  {sink}  cdt CNEWI  +12   0002
....              SNAP   #0   [ ---- ---- ---- 0001 ---- ---- ---- 0003 0004 +0   ---- ]
0005       >  nil GCSTEP

and here is the actual body of the side trace:

0006 rbx      fun SLOAD  #0    R
0007 rbx      tab FLOAD  0006  func.env
0008          int FLOAD  0007  tab.hmask
0009       >  int EQ     0008  +63
0010 r15      p64 FLOAD  0007  tab.node
0011       >  p64 HREFK  0010  "global" @35
0012          tab FLOAD  0007  tab.meta
0013       >  tab EQ     0012  NULL
0014          num HSTORE 0011  +1
0015          nil TBAR   0007
0016 rbp      int ADD    0001  +1
....              SNAP   #1   [ ---- ---- ---- ]
0017       >  int LE     0016  +1000
0018 xmm7     num CONV   0016  num.int
....              SNAP   #2   [ ---- ---- ---- 0018 ---- ---- 0018 ]
---- TRACE 2 stop -> 1

Analysis of side trace setup

Here are some initial observations about the setup code for the side trace:

  1. The side trace can import stack values from the parent trace using the SLOAD instruction's P flag (see LuaJIT IR).
  2. The side trace can import SSA values from the parent trace using PVAL. The side trace was able to access the value of variable u64 directly in the register rcx.
  3. Allocation sinking works even when sunk values escape to side traces. The root trace sunk the u64 allocation, the side-trace could still access the value with PVAL, and the side trace also sunk the allocation since it did not escape further.
  4. The side trace setup includes an explicit GCSTEP instruction. This could be gratuitous because no allocation actually happened? (The allocations were all sunk, but maybe this was not known when the setup code was emitted... maybe we need an extra optimization step to eliminate unneeded GCSTEP instructions?)

Analysis of the side trace body

The body of the side trace looks like it might be very redundant with the root trace? The line global = 1 generates a lot of instructions for finding the address of the variable global and checking all the necessary guards on the relevant tables. But it seems like this work was already done in the parent trace when it did global = 0. So could we expect the side trace to have shorter code that skips the guards and reuses the address from the parent trace? i.e. PVAL #7 to get the hashtable slot and then HSTORE +0 to write the new value?

I am confused about whether side traces are doing optimizations like common subexpression elimination with their parent traces, and whether they could, and whether they should...

End

Make any sense? Thoughts welcome!

Specify build dependencies with nix

Create a nix definition of the build environment for RaptorJIT. This will ensure that the exact same version of the C compiler, GNU Make, etc, is used every time and each person compiling RaptorJIT will get exactly the same result.

Validating LuaJIT micro-optimizations

I am running a follow-on benchmark to #46 to validate the benefit of smaller optimizations on RaptorJIT/x86-64. This time I am checking -O-fuse, -O-abc, -Oinstunroll=0, and -Oloopunroll=0. Here is some rife speculation while I wait for the results.

I am curious about the fuse optimization. This is doing quite a lot of work to try and simplify machine code by merging instructions together. The wrinkle is that Intel Core CPUs are doing similar optimizations. The "front-end" of an Intel CPU is actually similar to a tracing JIT: it translates from high-level instructions (x86-64) into low-level IR instructions (micro-operations aka uops), it splits and merges the instruction stream (uop fusion), it identifies dependencies and reuses registers behind the scenes (register renaming), and it even has a "trace cache" of optimized uops that is not unlike the LuaJIT traces.

So the question on my mind is whether some of the instruction fusion optimizations could be mutually redundant between our tracing JIT and the one built into the CPU. If so then we may be able to simplify the implementation by outsourcing the work to the CPU. Or it could be that our JIT does the job better or that there is other value e.g. that we prefer to simplify the mcode ourselves to make it easier to read (since we never get to see the uops.) In any case it will be interesting to look at the benchmarks and decide why the fusion optimization is a good thing and worth maintaining.

Validating LJ_LIKELY and LJ_UNLIKELY

I want to establish how beneficial the macros LJ_LIKELY() and LJ_UNLIKELY() are for RaptorJIT. My impression from other projects like the Linux kernel is that these are often unhelpful or even counter productive.

If they are not clearly helpful then I would like to remove them to reduce the cognitive burden of thinking about how to maintain the existing 127 uses and when to add more.

I ran the standard benchmarks in four different variants (30 runs each):

  • BOTH: Compile with both LJ_LIKELY and LJ_UNLIKELY defined.
  • LIKELY: Compile with only LJ_LIKELY (LJ_UNLIKELY becomes a nop.)
  • UNLIKELY: Compile with only LJ_UNLIKELY (LJ_LIKELY becomes a nop.)
  • NONE: Compile without either.

I decided to visualize the results a little differently this time. I am using a boxplot to summarize the distribution of results instead of only showing the average. I made the figure large to accommodate this, please click to zoom:

[figure: rplot03 (boxplot of benchmark results per variant)]

Observations:

There are a significant number of outliers. Overall the distribution of results is quite interesting and I think it has been much too simplistic to report only the average and relative standard deviation previously. I will have to think about the most appropriate presentation going forward.

The boxplot really shows the non-deterministic performance of the roulette benchmark well. That's the benchmark that I wrote specifically to highlight how side-trace heuristics can make performance highly variable.

More:

  • binary_trees gets -2.25% from LJ_LIKELY().
  • nbody gets +1% from LJ_UNLIKELY() and -1% from LJ_LIKELY() (performance is the same when both are enabled or both are disabled.)
  • life does better with both disabled.

... and so on. I suppose this should be analyzed more thoroughly e.g. with something along the lines of Tukey's Test or a permutation test. Just don't have the time to work that out now.

Bottom line: Overall it does not seem like we are benefitting from the static branch prediction annotations in the benchmark suite. This could justify removing them. Certain instances could be examined closely in order to preserve the specific good cases while avoiding the specific bad cases. More data from real applications would be needed to validate any proposed action.

Vector operations

Some operations would greatly benefit if RaptorJIT had vector operations.
Perhaps there should be built-in vector FFI types, or a built-in function for each kind of vector operation?
E.g.

local vector_ops = require "vector_ops"
local x, y = 1, 2
x, y = vector_ops.add(x, y, 3, 4) -- vector_length = select("#", ...) / 2
assert(x == 4 and y == 6)

or

local dvec4 = require "dvec4" -- vector of 4 doubles
local x = dvec4(1, 2)
x = x + dvec4(3, 4)
assert(x[0] == 4 and x[1] == 6)

I don't know which would be easier to implement, but the idea is that having a generic function for each of the operations would let the user construct the necessary FFI type themselves, and then let the JIT-compiler turn the calls into the proper instructions.

Revisit "narrowing" optimization rationale

The "narrowing numbers to integers" optimization in lj_opt_narrow.c includes an extremely well-written rationale for the approach taken with reference to the CPU microarchitectural details. (Inspiringly good piece of code commenting!) However, this was written circa 2010 in the Nehalem era which has since been superseded by Sandy Bridge and then Haswell and then Skylake. So it makes sense to revisit the design and see if it still makes sense or whether the assumptions that it is based on are no longer true. Could be an opportunity to improve performance by updating the optimization, or to simplify the code by removing optimizations that are no longer beneficial.

Let's use the standard optimization resources for reference and especially this illustration from AnandTech of the Haswell execution engine for reference:

[figure: haswellexec (AnandTech illustration of the Haswell execution engine)]

Here are some parts of the comments that jump out at me from a 2017 perspective:

The total CPU execution bandwidth is the sum of the bandwidth of the FP and the integer units, because they execute in parallel. The FP units have an equal or higher bandwidth than the integer units. Not using them means losing execution bandwidth. Moving work away from them to the already quite busy integer units is a losing proposition.

This does not seem to be true of Haswell. Haswell has two execution ports that can do either float or integer operations (0,1) and two ports that are integer-only (5,6). So it seems like the opposite of the comment is true: integer operations have more bandwidth and have to contend with float operations for use of execution resources.

Always replacing an FP op with an integer op plus an overflow check is counter-productive on a current-generation super-scalar CPU. Although the overflow check branches are highly predictable, they will clog the execution port for the branch unit and tie up reorder buffers.

Is it still true that branch units and reorder entries are a scarce resource? Haswell has upgraded from one branch unit to two (ports 0,6) and it has a generous 192-entry reorder buffer.

So - summary - it seems like it would be worth reviewing the narrowing optimization and at least checking whether there is any complex code that could be retired as non-beneficial.

Relatedly: LuaJIT supports a diverse range of "number modes." I have experimented with these in the past, particularly trying to switch to ones that seem like they should help on Snabb's integer-heavy workloads, but I have never been able to observe a difference. So it seems possible that certain LuaJIT optimizations are only effective on older CPUs and can be removed in RaptorJIT. (I already removed non-canonical number modes in #5 but it could be worth revisiting whether I kept the right one i.e. the simplest of the fast options.)

Idea: Super Side Trace

This is an idea to solve #107 by optimizing side traces more efficiently.

The problem is that side traces only inherit snapshot values from their parents. The solution is for them to inherit the whole IR of their parent chain for the purposes of optimizations. Side traces would then reference their parent instructions in the same way that LOOP code references pre-loop instructions to eliminate duplicate work.

Proposed solution

The proposed solution is to replace the current optimization, which imports snapshot values via lj_snap_replay(), with a new one that constructs a "Super Side Trace" that is more amenable to optimization.

Here is how it works:

  • The Super Side Trace is created exactly like the example on #107. The instructions from each parent are replayed at the top of the trace and then the new instructions for this side trace are added at the end.
  • Standard optimizations are applied (DSE, CSE, etc) with the restriction that instructions in parent traces cannot be modified (they represent code that has already executed when the side trace starts.)
  • Machine code is emitted only for the new instructions.

The end result is a side-trace that is fully optimized with respect to its parents.

Summary

This looks like a straightforward and very high value optimization to me. It could substantially reduce the penalty for taking side traces and this could make performance much more predictable by reducing sensitivity to non-determinism at recording time.

On the other hand it seems too good to be true. What am I missing? Thoughts and criticism welcome.

Problem: Expensive heap-allocated boxes for 64-bit values (integers and pointers)

One of the biggest problems that we have in Snabb development is when 64-bit values (integers or FFI pointers) in local variables are allocated on the heap. This is a problem for two related reasons.

The first problem is that it is very difficult to predict whether a line of code like local x = pointers[i] will cause a heap allocation of a "box" for the pointer, or whether the allocation will be "sunk" so that the raw value is kept in a register. The answer can be complex: it may be that the same line of code is compiled several times in different traces, and that some of those traces will cause allocations while others will not. The answer can also be that the behaviour is non-deterministic and will vary from one execution of the program to the next depending on how JIT heuristics interact with the workload.

The second problem is that there is a wide performance gulf between the two cases. On the one hand a heap allocation is extremely expensive and requires invocation of the garbage collector. On the other hand a sunk allocation is extremely fast and requires basically no instructions at all.

These problems are related. The wide difference in performance means that it is critical to always sink allocations, but the unpredictability of the JIT heuristics makes this requirement extremely difficult for application developers to satisfy.

Some solution is needed. Either the allocation sinking must be made predictable so that all programmers can depend on it, or the box allocation needs to be made efficient enough that the difference is not a big deal. (Or both.)

Example

Here is a simple example program:

local ffi = require("ffi")
local pointers = ffi.new("void *[101]")

-- Loop that references pointers and compiles into multiple traces
for i = 1, 100 do
   local x = pointers[i]  -- Load a pointer into a local variable
   for i = 1, 100 do end  -- This loop will require a separate trace
end

The line local x = pointers[i] looks like it should be cheap. It's only loading a pointer into a local variable and then never even using it. However it is actually very expensive. I believe the issue is that the variable x is still in scope when we transition to a new trace for the inner loop and this causes the pointer to be heap-allocated as a boxed Lua object.

The expense is partly due to requiring calls to the C functions lj_gc_step_jit() and lj_mem_newgco().

This is not satisfactory.

Goal: Predictable performance for IR instructions

RaptorJIT should strive to make the performance of IR instructions predictable. You should be able to predict the machine code for an IR instruction by inspection of the opcode, result type, and operand types.

This way day-to-day optimization work can be done at the IR level without having to refer to the lower-level machine code.

This is analogous to Intel CPUs where developers of compilers and applications mostly work with high-level IR (x86-64) and can predict the low-level code (micro-op) performance without actually having to look at micro-op dumps. Predicting the exact performance of high-level instructions is straightforward using Agner Fog's instruction reference tables.

(Maybe we will have RaptorJIT instruction tables one day...)

Replace minilua with PUC Lua submodule

RaptorJIT should bootstrap using off-the-shelf PUC Lua instead of the stripped-down fork in src/host/minilua.c. This saves ~7800 lines of (obfuscated) C code.

PUC Lua would not be acceptable as a runtime dependency but for build-time it is preferable to a private fork.

PUC Lua will be referenced as a submodule and will in turn depend only on a basic C build environment (so RaptorJIT is still reproducible and does not e.g. depend on a previous binary version of itself.)

sudo make install shows many missing files

This is on Ubuntu latest. make works fine, but sudo make install shows some glitches. Perhaps these are already known.

install: cannot stat 'luajit': No such file or directory
Makefile:120: recipe for target 'install' failed

So I hand-copied src/raptorjit to src/luajit and tried again. A whole lot more files are now not found:

install: cannot stat 'dump.lua': No such file or directory
install: cannot stat 'p.lua': No such file or directory
install: cannot stat 'v.lua': No such file or directory
install: cannot stat 'zone.lua': No such file or directory
install: cannot stat 'dis_x86.lua': No such file or directory
install: cannot stat 'dis_x64.lua': No such file or directory
install: cannot stat 'dis_arm.lua': No such file or directory
install: cannot stat 'dis_arm64.lua': No such file or directory
install: cannot stat 'dis_ppc.lua': No such file or directory
install: cannot stat 'dis_mips.lua': No such file or directory
install: cannot stat 'dis_mipsel.lua': No such file or directory
install: cannot stat 'dis_mips64.lua': No such file or directory
install: cannot stat 'dis_mips64el.lua': No such file or directory
Makefile:120: recipe for target 'install' failed

From the looks of it these files aren't really necessary for RaptorJIT. The executable itself
works fine.

"Unbranch"-ing branches

Something like

local a = {}
for i = 1, 2^10 do
	a[i] = i % 2 == 0 and 5 or 10
end

can produce the following IR:

---- TRACE 1 start x.lua:5
0012  MODVN    6   5   0  ; 2
0013  ISNEN    6   1      ; 0
0014  JMP      6 => 0017
0017  KSHORT   6  10
0018  TSETV    6   1   5
0019  FORL     2 => 0012
---- TRACE 1 IR
....        SNAP   #0   [ ---- ]
0001    int SLOAD  #3    CI
0002    int BAND   0001  +1  
....        SNAP   #1   [ ---- ---- ---- 0001 ---- ---- 0001 ]
0003 >  int NE     0002  +0  
....        SNAP   #2   [ ---- ---- ---- 0001 ---- ---- 0001 ]
0004 >  tab SLOAD  #2    T
0005    int FLOAD  0004  tab.asize
0006 >  int ABC    0005  0001
0007    p32 FLOAD  0004  tab.array
0008    p32 AREF   0007  0001
0009    tab FLOAD  0004  tab.meta
0010 >  tab EQ     0009  NULL
0011    num ASTORE 0008  +10 
0012  + int ADD    0001  +1  
....        SNAP   #3   [ ---- ---- ---- ]
0013 >  int LE     0012  +1024
....        SNAP   #4   [ ---- ---- ---- 0012 ---- ---- 0012 ]
0014 ------ LOOP ------------
0015    int BAND   0012  +1  
....        SNAP   #5   [ ---- ---- ---- 0012 ---- ---- 0012 ]
0016 >  int NE     0015  +0  
....        SNAP   #6   [ ---- ---- ---- 0012 ---- ---- 0012 ]
0017 >  int ABC    0005  0012
0018    p32 AREF   0007  0012
0019    num ASTORE 0018  +10 
0020  + int ADD    0012  +1  
....        SNAP   #7   [ ---- ---- ---- ]
0021 >  int LE     0020  +1024
0022    int PHI    0012  0020
---- TRACE 1 stop -> loop

---- TRACE 2 start 1/5 x.lua:6
0015  KSHORT   6   5
0016  JMP      7 => 0018
0018  TSETV    6   1   5
0019  JFORL    2   1
---- TRACE 2 IR
0001    int SLOAD  #3    PI
....        SNAP   #0   [ ---- ---- ---- 0001 ---- ---- 0001 ]
0002 >  tab SLOAD  #2    T
0003    int FLOAD  0002  tab.asize
0004 >  int ABC    0003  0001
0005    p32 FLOAD  0002  tab.array
0006    p32 AREF   0005  0001
0007    tab FLOAD  0002  tab.meta
0008 >  tab EQ     0007  NULL
0009    num ASTORE 0006  +5  
0010    int ADD    0001  +1  
....        SNAP   #1   [ ---- ---- ---- ]
0011 >  int LE     0010  +1024
0012    num CONV   0010  num.int
....        SNAP   #2   [ ---- ---- ---- 0012 ---- ---- 0012 ]
---- TRACE 2 stop -> 1

Obviously this is not optimal; the optimal output would probably be CMOVs, e.g.

mov rax, rbx
test i, 1
cmovnz rax, rcx
mov [a+i*8], rax

where rbx is 5 in binary64, and rcx is 10 in binary64.

Though there are multiple problems with this:

  1. RaptorJIT has to somehow recognise this branch
  2. It must somehow know that the values need not be kept in xmm registers

For the first problem: maybe a static bytecode analysis?
The current bytecode is: (for it all)

0001    TNEW     0   0
0002    KSHORT   1   1
0003    KSHORT   2 1024
0004    KSHORT   3   1
0005    FORI     1 => 0014
0006 => MODVN    5   4   0  ; 2
0007    ISNEN    5   1      ; 0
0008    JMP      5 => 0011
0009    KSHORT   5   5
0010    JMP      6 => 0012
0011 => KSHORT   5  10
0012 => TSETV    5   0   4
0013    FORL     1 => 0006
0014 => RET0     0   1

Maybe this could become something like:

0001    TNEW     0   0
0002    KSHORT   1   1
0003    KSHORT   2 1024
0004    KSHORT   3   1
0005    FORI     1 => 0014
0006 => KSHORT   5   5
0007    KSHORT   6  10
0008    MODVN    7   4   0  ; 2
0009    ISNEN    7   1      ; 0
0010    CMOV     5   6
0011    TSETV    5   0   4
0012    FORL     1 => 0006
0013 => RET0     0   1

The reason I suggest this instead of having the JIT-compiler optimise it is that I thought this might be easier.

For the second problem, I think #72 would be sufficient.

`make install` fails

make install fails with:

==== Installing RaptorJIT 1.0.0-alpha1 to /usr/local ====
mkdir -p /usr/local/bin /usr/local/lib /usr/local/include/luajit-1.0 /usr/local/share/man/man1 /usr/local/lib/pkgconfig /usr/local/share/luajit-1.0.0-alpha1/jit /usr/local/share/lua/5.1 /usr/local/lib/lua/5.1
cd src && install -m 0755 luajit /usr/local/bin/luajit-1.0.0-alpha1
cd src && test -f libluajit.a && install -m 0644 libluajit.a /usr/local/lib/libluajit-5.1.a || :
rm -f /usr/local/lib/libluajit-5.1.so.1.0.0 /usr/local/lib/libluajit-5.1.so /usr/local/lib/libluajit-5.1.so.1
cd src && test -f libluajit.so && \
  install -m 0755 libluajit.so /usr/local/lib/libluajit-5.1.so.1.0.0 && \
  ldconfig -n /usr/local/lib && \
  ln -sf libluajit-5.1.so.1.0.0 /usr/local/lib/libluajit-5.1.so && \
  ln -sf libluajit-5.1.so.1.0.0 /usr/local/lib/libluajit-5.1.so.1 || :
cd etc && install -m 0644 luajit.1 /usr/local/share/man/man1
cd etc && sed -e "s|^prefix=.*|prefix=/usr/local|" -e "s|^multilib=.*|multilib=lib|" luajit.pc > luajit.pc.tmp && \
  install -m 0644 luajit.pc.tmp /usr/local/lib/pkgconfig/luajit.pc && \
  rm -f luajit.pc.tmp
cd src && install -m 0644 lua.h lualib.h lauxlib.h luaconf.h lua.hpp luajit.h /usr/local/include/luajit-1.0
cd src/jit && install -m 0644 bc.lua bcsave.lua dump.lua p.lua v.lua zone.lua dis_x86.lua dis_x64.lua dis_arm.lua dis_arm64.lua dis_ppc.lua dis_mips.lua dis_mipsel.lua dis_mips64.lua dis_mips64el.lua vmdef.lua /usr/local/share/luajit-1.0.0-alpha1/jit
install: kan ikke udføre stat() på 'dump.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'p.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'v.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'zone.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_x86.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_x64.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_arm.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_arm64.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_ppc.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_mips.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_mipsel.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_mips64.lua': Ingen sådan fil eller filkatalog
install: kan ikke udføre stat() på 'dis_mips64el.lua': Ingen sådan fil eller filkatalog

kan ikke udføre stat() på 'dis_mips64el.lua': Ingen sådan fil eller filkatalog means
cannot perform stat() on 'dis_mips64el.lua': No such file or directory

I am using a fully updated Arch Linux.

New IR "unknown" type?

I think an IR type, that represents an unknown type, would be beneficial in many cases.
It could be valid for only a subset of all IR operations, e.g.:
NE
EQ
All the LOAD and STORE operations

This would be very useful in code where the type does not matter, e.g.:

for i = 1, #some_table do
  if some_table[i] == some_other_table then
    break
  end
end

This is currently compiled to the following IR with LuaJIT 2.1 alpha 3 (I have no idea how to use jit.dump with RaptorJIT), if it's filled with numbers:

....        SNAP   #0   [ ---- ]
0001 >  int SLOAD  #4    CRI
0002 >  int LE     0001  +2147483646
0003    int SLOAD  #3    CI
0004 >  tab SLOAD  #2    T
0005    int FLOAD  0004  tab.asize
0006 >  p32 ABC    0005  0001
0007    p32 FLOAD  0004  tab.array
0008    p32 AREF   0007  0003
0009 >  num ALOAD  0008
0010  + int ADD    0003  +1  
....        SNAP   #1   [ ---- ---- ---- ]
0011 >  int LE     0010  0001
....        SNAP   #2   [ ---- ---- ---- 0010 0001 ---- 0010 ]
0012 ------ LOOP ------------
0013    p32 AREF   0007  0010
0014 >  num ALOAD  0013
0015  + int ADD    0010  +1  
....        SNAP   #3   [ ---- ---- ---- ]
0016 >  int LE     0015  0001
0017    int PHI    0010  0015

The problem here is that 0014 typechecks 0013 and asserts that it's a number.
This isn't such a big problem here, since the typecheck is inexpensive, but if the table contains a bunch of different types, the trace will abort, no?
What it should really compile the inner loop to is:

0013    p32 AREF   0007  0010
0014    unk ALOAD  0013
0015 >  unk NE     0014 <some_other_table>
0016  + int ADD    0010  +1
....        SNAP   #3   [ ---- ---- ---- ]
0017 >  int LE     0016  0001
0018    int PHI    0010  0016

Interface to external JIT analysis tools

One very important problem is the analysis of JIT decision making and performance. RaptorJIT will take a different approach to LuaJIT.

The LuaJIT approach can be summarized thus:

  • Basic analysis tools that run inside the VM (e.g. jit.v and jit.dump.)
  • Synchronous analysis during execution (expensive - not for use in benchmarking or production.)
  • VM provides a moderately large introspection API (jit.util module.)

The RaptorJIT approach, in contrast:

  • No built-in analysis: external tools must be developed instead.
  • Asynchronous analysis: VM efficiently logs binary events for later inspection.
  • VM does not provide introspection API: external tools must understand the raw data.

The intended benefits are:

  • Simplify the VM by moving the complexity to external tools.
  • Support sophisticated tools such as LOOM and more.
  • Support "always on" tracing and profiling that does not need to be disabled in production.
  • Support post-hoc workflows e.g. inventing new analysis to apply to old data.

The implementation outline is:

  • User provides memory where the VM can write opaque trace data (e.g. file-backed shared memory, see VMProfile example in LuaJIT/LuaJIT#290.)
  • VM logs events very efficiently e.g. with memcpy() of internal C objects like GCtrace.
  • External tooling decodes the traces e.g. converts them to JSON or operates on them directly.

I am imagining a semi-automated trace decoder that compiles a dummy C object and parses the DWARF debug information for access to all defines, enums, typedefs, structs, etc. If this tool provided JSON/XML/etc conversion then it could be reused for multiple analysis tools e.g. LOOM and others. (Could alternatively write an FFI definition manually.)

"Eventually consistent" performance

RaptorJIT should provide the invariant that the same program applied to the same workload will quickly converge on the same level of performance every time you run it. The short-term ("warm up") performance may depend on non-deterministic trace heuristics, but after that the performance level will quickly converge for all executions.

See LuaJIT/LuaJIT#218 for one problem that needs to be solved ("Side-trace russian roulette.")

Problem: Side traces re-doing work already done by parent

Problem Statement

Side traces do not take their parent traces into account when performing optimizations like Dead Code Elimination (DCE) and Common Subexpression Elimination (CSE). This causes side traces to duplicate work that has already been done by their parents e.g. checking guards and locating constant hashtable slots.

In a long chain of side traces the work can be duplicated in each trace on the chain. This can lead to pathological behavior such as executing loop-invariant instructions N times per iteration for a loop that follows N side traces (compared with zero times for the root trace with LOOP optimization.)

Example program

Here is a program that exercises a pathological case:

global = 0
local x
for i = 1, 1e6 do
  x = global           -- trace 1 (root)
  if i > 1e3 then
    x = global         -- trace 2
    if i > 1e4 then
      x = global       -- trace 3
      if i > 1e5 then
	x = global     -- trace 4
      end
    end
  end
end

The full JIT dump (from LuaJIT) is here: https://gist.github.com/lukego/88ea63e91e2bbae591d4f5f7e03b7858

Analysis

The outer loop is ultimately compiled into a chain of traces 1->2->3->4->loop. The inefficiency is that each one of these traces does a lookup on the global variable from scratch. This means that each trace executes a lot of instructions that are all the same as the previous trace.

Here is what the IR looks like for the entire chain of traces 1->2->3->4. This is made by pasting the IR traces together and then deleting the instructions that are never reached due to a taken exit.

---- TRACE 1 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #1    CI
0002 rbx      fun SLOAD  #0    R
0003 rdx      tab FLOAD  0002  func.env
0004          int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63 
0006 rcx      p32 FLOAD  0003  tab.node
0007 rax   >  p32 HREFK  0006  "x"  @33
0008 xmm7  >  num HLOAD  0007
0009 xmm7   + num ADD    0008  +1  
0010          num HSTORE 0007  0009
....              SNAP   #1   [ ---- 0001 ---- ---- 0001 ]
---- TRACE 2 IR
0001 rbp      int SLOAD  #1    PI
....              SNAP   #0   [ ---- 0001 ---- ---- 0001 ]
0002 rbx      fun SLOAD  #0    R
0003 rbx      tab FLOAD  0002  func.env
0004          int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63 
0006 rbx      p32 FLOAD  0003  tab.node
0007       >  p32 HREFK  0006  "x"  @33
0008 xmm7  >  num HLOAD  0007
0009 xmm7     num ADD    0008  +1  
0010          num HSTORE 0007  0009
....              SNAP   #1   [ ---- 0001 ---- ---- 0001 ]
---- TRACE 3 IR
0001 rbp      int SLOAD  #1    PI
....              SNAP   #0   [ ---- 0001 ---- ---- 0001 ]
0002 rbx      fun SLOAD  #0    R
0003 rbx      tab FLOAD  0002  func.env
0004          int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63 
0006 rbx      p32 FLOAD  0003  tab.node
0007       >  p32 HREFK  0006  "x"  @33
0008 xmm7  >  num HLOAD  0007
0009 xmm7     num ADD    0008  +1  
0010          num HSTORE 0007  0009
---- TRACE 4 IR
0001 rbp      int SLOAD  #1    PI
....              SNAP   #0   [ ---- 0001 ---- ---- ---- ]
0002 rbx      fun SLOAD  #0    R
0003 rbx      tab FLOAD  0002  func.env
0004          int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63 
0006 rbx      p32 FLOAD  0003  tab.node
0007       >  p32 HREFK  0006  "x"  @33
0008 xmm7  >  num HLOAD  0007
0009 xmm7     num ADD    0008  +1  
0010          num HSTORE 0007  0009
0011 rbp      int ADD    0001  +1  
....              SNAP   #1   [ ---- ]
0012       >  int LE     0011  +1000000
0013 xmm7     num CONV   0011  num.int
....              SNAP   #2   [ ---- 0013 ---- ---- 0013 ]
---- TRACE 4 stop -> 1

Note that:

  • Each trace executes quite a few instructions (mostly constant hashtable reference.)
  • The instructions are all dead code (redundant work already done by parent.)
  • This assembles to a correspondingly large amount of machine code (see gist.)

Conclusion

This seems like a significant inefficiency. Side traces should take their parents into account when performing optimizations like DSE and CSE.

Validating the HOTCOUNT table

Here is an experiment to test an idea from @fsfod: that the outliers we see in performance reports could be due to hash collisions in the HOTCOUNT table.

The JIT only starts tracing a bytecode when it becomes "hot." Hotness is not counted separately for each bytecode: the bytecode address is hashed into a bucket that is shared by all bytecodes with the same hash value. The standard hashtable size is only 64 entries.

It is possible that we see non-deterministic performance because bytecode addresses have entropy and on different runs we see different hash collisions. This could cause the JIT to make different decisions, for example whether to start a trace at the start of a function / on an outer loop / on an inner loop. (Ordinarily a trace would start on the innermost loop but hash collisions could disturb this.)

I ran the simplest experiment that I could think of to test this idea: I made five branches each with a different hashtable size (8, 16, 64, 256, 1024) and I ran a large benchmark suite (300 tests per benchmark per branch.)

Here are the results in a jitter plot:

[figure: hotcount-jitter (jitter plot of benchmark results per table size)]

How to make sense of this? I am not sure. There are definitely patterns there, for example fannkuch runs much faster with a 16-element table, and nbody runs much slower with 8- and 16-element tables. It is less clear whether the table size influences the distribution of results i.e. makes them more predictable or not. (Could it be that hashtable collisions do impact performance but that it is fairly deterministic on these simple benchmarks due to lack of entropy in the addresses being hashed?)

Have to think about this one some more. Meanwhile here is one more graph that I spun up to try and see if the pattern of outliers has changed. It's an ECDF with the Y-axis on a log scale to emphasise the slow outliers.

[figure: hotcount-ecdf (ECDF of benchmark results, log-scale Y axis)]

Possible next step would be to make sure there is entropy in the bytecode addresses from one benchmark run to the next. Just so that the same branch is not always hashing the same instruction to the same bucket (in a way that would be unrealistic for real applications that load code in more complex ways and are also being incrementally modified during development.)

See also #11.

A little idea on naming

It took me tha-a-a-at long to realize why this project was originally called "RaptorJit". Now I get it, and I immediately demand some component or tool named "Claw". Without this, the brand will not be complete.

What does LJ_TNUMX+2 mean?

I am confused by this line in lj_record.c:

if (rc == ~LJ_TNUMX+2)

What is the meaning of value ~LJ_TNUMX+2? How is that assigned?

My understanding is that LJ_TNUMX means integer (in LuaJIT DUALNUM mode) and LJ_TNUMX+1 means double-float but I don't see the definition of LJ_TNUMX+2.

Unique trace abort codes

Idea: Use a unique trace abort error code for each distinct point in the JIT source code where a trace can be aborted. This way the trace abort reason will unambiguously lead to the specific line of C code where the abort was made.

This is to avoid ambiguity e.g. when the abort reason is leaving loop in root trace and there are four different points in the source code that could have generated this error. This will help users to dive into the JIT source and understand why a JIT error occurred.

Could either extend lj_traceerr.h with more codes or (say) automatically log the function/file/line where the abort is invoked.

Decide how to call C from Lua

How should RaptorJIT call C code from Lua?

To answer this we need to understand what mechanisms are available, how LuaJIT uses each one, and whether the same strategy makes sense for RaptorJIT.

Here is my very rough current understanding that is probably at least 65% wrong:

  • FFI is the simplest and most concise. Cannot access VM data structures such as lua_State. Incurs a load/store barrier that applies to FFI memory access but not to Lua objects.
  • LJLIB_CF() is used in lib_*.c files and implements built-in libraries such as jit.*. Incurs a strong barrier because it is able to update the Lua VM state.
  • LUA_API is used in lj_api.c. The same as LJLIB_CF()? The same as the "classic" Lua API?
  • LJ_FASTCALL and LJ_FUNC. These are naked C functions? Called not from Lua but only from within the VM's C/asm/mcode? Using different calling conventions as an optimization when possible e.g. for functions that don't need many registers?

First step is to nail down the details properly. Hints welcome! cc @corsix @javierguerragiraldez.

Next step is to look for potential simplifications. Can RaptorJIT drop certain of these keywords e.g. if the calling conventions are not applicable on Linux/x86-64? Can RaptorJIT use the FFI more widely to reduce boilerplate (since FFI is always available in RaptorJIT but optional in LuaJIT)?

Slack migration to common RaptorJIT/Studio/Snabb Slack

I have created a #raptorjit channel on the Snabb slack. I would like to use this as a common chat server for RaptorJIT, Studio, Snabb, and other related projects that want to participate. Each one has a private channel but a common membership. There are ~65 members today who all have overlapping interests in these projects to various extents.

I will invite everybody who is already on the RaptorJIT slack (#42) and then replace that. Just drop a line to [email protected] with any email addresses that you want invited, no questions asked. If you have a related project then we can create a channel for that too.

Rewrite the VM bytecode interpreter in C?

Should we rewrite the VM bytecode interpreter from assembler to C? Let me make the case for consideration:

The assembler VM has a high maintenance cost and a low benefit.

High maintenance cost because the code is complex and the complexity scales poorly, perhaps O(N log N) when supporting N CPUs (i.e. maintaining a separate assembler VM for each architecture and keeping them all perfectly synchronized). The main reason that all non-essential platforms were removed from RaptorJIT is that we lacked the maintainer resources to take care of all of their VMs.

The benefit is low because RaptorJIT targets server applications that probably spend ~0.1% of their time running interpreted code after initialization and warm-up. The other 99.9% of the time is spent in JIT mcode, the runtime system, and FFI functions none of which benefit from the assembler VM.

So it seems reasonable to me to consider porting the bytecode interpreter to C if this would make maintenance much simpler, change the economics of supporting multiple CPU architectures (maintain the JIT backend only), and cost perhaps <0.1% in total real-world performance.

(Just now this issue is not urgent because we only support one architecture and maintaining one VM is okay. However, if we want to add support for more architectures like ARM64 then we will have to decide whether we will maintain the VMs as parallel assembler programs or write a common VM in C.)

Build with fixed C compiler

RaptorJIT should always be built with the same C compiler. Supporting multiple compilers takes very little work, but it provides even less benefit.

Clang from LLVM 4.0 seems like a reasonable initial choice.

IAF: Where can a trace end? Why?

Here is an Infrequently Answered Question:

Where can a trace end? Why?

Can a trace end on any bytecode at all? Or are there restrictions on, e.g., which bytecode can be the final one, what the control flow looks like at that point, etc.?

This seems relevant to understanding how much flexibility the JIT has in selecting traces. For example, if we encounter an uncompilable bytecode like an NYI or a blacklisted function, do we have the option to successfully end the current trace and then transition to the interpreter? Or do we have to abort because this is not a valid end point for a trace for some reason?

cc @javierguerragiraldez

Expand TValue to 128-bit?

Here is a big idea: How about we expand the basic tagged value type (TValue) from 64-bit to 128-bit?

Initially we could keep all the existing types the same, using only the first 64 bits, and then we could invent new uses for the extra space.

The motivating idea is to add a new KCDATA64 type that contains a 64-bit immutable FFI value. The first 64 bits contain the type information and the second 64 bits contain the value. This way heap allocation and garbage collection would never be needed for pointers or 64-bit integers, even if they are stored in functions or closures. (Specifically, the existing CNEWI IR instruction would return an immediate KCDATA64 value instead of a pointer to a heap-allocated CDATA object.)

Bad? Crazy? Hard? Promising?

RaptorJIT manual?

RaptorJIT is diverging from LuaJIT in significant ways. There is a new profiling API (VMProfile #66 #77), a new audit logging capability (#63), and we have removed the old introspection APIs.

These differences need to be documented. I am imagining an initial manual that simply documents the differences from LuaJIT.

Is asciidoc a reasonable choice? (I haven't tried it, but it looks more or less like Markdown with more structure.)

Strange Lua errors with JIT enabled.

These errors are hard to reproduce, as they go away when I attempt to debug. They seem to always originate from using the FFI, but that may just be coincidental. I'm running Linux.
