Comments (25)
Benchmark results of rv8 (RV32 only; runtime relative to native x86, smaller is better):
Benchmark | interpreter | JIT | qemu-user | native-x86 |
---|---|---|---|---|
primes | 34.33 | 1.75 | 1.55 | 1.00 |
miniz | 40.19 | 1.95 | 2.68 | 1.00 |
SHA-512 | 76.23 | 3.40 | 3.78 | 1.00 |
AES | 88.02 | 2.57 | 4.08 | 1.00 |
qsort | 28.82 | 1.86 | 7.50 | 1.00 |
dhrystone | 128.97 | 2.72 | 12.04 | 1.00 |
from rv32emu.
Consider whether we want to load a 64-bit immediate under the present RISC-V specification. Even though memory access is substantially slower, it is simpler to load the constant from memory, since that requires fewer instructions.
In RISC-V:
lui r1, 0x01234
lui r2, 0x89ABC
addi r1, 0x567
addi r2, 0xDEF
shl r1, 32
or r1, r2
24 bytes, 6 instructions, and 2 registers.
In x86_64:
movabs rax, 0x0123456789ABCDEF
10 bytes (REX.W prefix, opcode, 8-byte immediate), 1 instruction, and 1 register.
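One detail the RISC-V sequence above glosses over is that `addi` sign-extends its 12-bit immediate, so a low part such as 0xDEF (bit 11 set) cannot be used as written. The following is a minimal sketch in C of how a 32-bit constant is usually split into a `lui`/`addi` pair; the function names `split_imm32` and `combine_imm32` are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit constant into the lui immediate (upper 20 bits) and the
 * addi immediate (signed 12 bits). Because addi sign-extends, the upper
 * part is rounded up by 0x800 whenever bit 11 of the value is set. */
static void split_imm32(uint32_t value, uint32_t *hi20, int32_t *lo12)
{
    assert(hi20 && lo12);
    *hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    *lo12 = (int32_t) (value & 0xFFFu);
    if (*lo12 >= 0x800)
        *lo12 -= 0x1000; /* interpret the low 12 bits as signed */
}

/* Recombine the two parts the way an RV32 core would:
 * (hi20 << 12) + sign_extend(lo12), with 32-bit wraparound. */
static uint32_t combine_imm32(uint32_t hi20, int32_t lo12)
{
    return (hi20 << 12) + (uint32_t) lo12;
}
```

For 0x89ABCDEF this yields `lui` 0x89ABD and `addi` -529 rather than 0x89ABC and 0xDEF. On RV64, `lui` additionally sign-extends its 32-bit result, so a full 64-bit constant generally needs more care than the six-instruction listing suggests.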
from rv32emu.
It is not apparent where a particular code block starts and ends. Does this mean hard-coding pseudo patterns (common code block patterns) for the binary to match on?
from rv32emu.
blink is a virtual machine for running x86-64-linux programs on different operating systems and hardware architectures. It recently gained a JIT. Quote from blink/jit.c:
This file implements an abstraction for assembling executable code at runtime. This is intended to be used in cases where it's desirable to have fast "threaded" pathways, between existing functions, which were compiled statically. We need this because virtual machine dispatching isn't very fast, when it's implemented by loops or indirect branches.
Modern CPUs go much faster if glue code without branches is outputted to memory at runtime, i.e. a small function that calls the functions.
from rv32emu.
wasm3 is a fast WebAssembly interpreter without JIT. In general, its strategy seems capable of executing code around 4-15x slower than compiled code on a modern x86 processor.
- Reduce bytecode decoding overhead
- Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
- wasm3 trades some space for time. Opcodes map to up to 3 different operations depending on the number of source operands and commutative-ness.
- Commonly occurring sequences of operations can also be optimized into a "fused" operation. This sometimes results in improved performance.
- In wasm3, the stack machine model is translated into a more direct and efficient "register file" approach.
- Tightly Chained Operations
Tails is a minimal, fast Forth-like interpreter core. It uses no assembly code, only C++, but an elegant tail-recursion technique inspired by Wasm3 makes it nearly as efficient as hand-written assembly.
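The "tightly chained operations" idea can be sketched in C as follows. This is only an illustration of the general technique, not code from wasm3 or Tails: the names `cell_t`, `op_fn`, `NEXT`, and the three operations are hypothetical. Each operation finishes by calling the next one through a pointer stored in the translated program, so there is no central dispatch loop. Standard C does not guarantee tail-call elimination, so this sketch relies on the optimizer for long programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The translated program is a flat array of cells: an operation pointer,
 * followed by that operation's inline operand (if any). */
typedef union cell cell_t;
typedef int64_t (*op_fn)(const cell_t *pc, int64_t acc);
union cell {
    op_fn op;
    int64_t imm;
};

/* Jump to the next operation; pc must point at an operation cell. */
#define NEXT(pc, acc) return (pc)->op((pc) + 1, (acc))

static int64_t op_add_imm(const cell_t *pc, int64_t acc)
{
    acc += pc->imm; /* pc points at this operation's operand */
    pc++;
    NEXT(pc, acc);
}

static int64_t op_mul_imm(const cell_t *pc, int64_t acc)
{
    acc *= pc->imm;
    pc++;
    NEXT(pc, acc);
}

static int64_t op_halt(const cell_t *pc, int64_t acc)
{
    (void) pc;
    return acc; /* end of the chain */
}

/* Enter the chain at the first operation. */
static int64_t run_chain(const cell_t *program, int64_t acc)
{
    assert(program != NULL);
    return program->op(program + 1, acc);
}
```

A program is then an array such as `{op_add_imm, 5, op_mul_imm, 3, op_halt}`; a fused operation would simply be one more function that does the work of two adjacent cells.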
from rv32emu.
Our strategy for developing the JIT is to use Clang to generate optimized target code. The JIT compiler design is shown in Figure 1.
When a block's usage frequency exceeds a predetermined threshold, we trace its taken and untaken branches and use a code generator to convert the instruction sequence into C code. We then compile the C code using Clang and store the resulting target machine code as a function in the code cache for future use.
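The threshold check described above could look roughly like this in C. It is a sketch under stated assumptions, not the rv32emu implementation: `block_t`, `HOT_THRESHOLD`, `block_should_run_native`, and the `demo_compile` stand-in for the "generate C, run Clang, load the result" step are all hypothetical names, and the threshold value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOT_THRESHOLD 4096 /* assumed value; the text gives no number */

typedef struct {
    uint32_t pc_start;      /* guest address of the block's first insn */
    uint32_t exec_count;    /* times the interpreter has entered the block */
    void (*compiled)(void); /* entry in the code cache, or NULL */
} block_t;

typedef void (*compile_fn)(block_t *blk);

/* Decide whether to run cached native code or stay in the interpreter.
 * Compilation is triggered once, when the counter crosses the threshold. */
static bool block_should_run_native(block_t *blk, compile_fn compile)
{
    assert(blk && compile);
    if (blk->compiled)
        return true;
    if (++blk->exec_count > HOT_THRESHOLD) {
        compile(blk); /* may leave blk->compiled NULL on failure */
        return blk->compiled != NULL;
    }
    return false;
}

/* Hypothetical compiler stand-in used only to exercise the logic. */
static void native_stub(void) {}
static void demo_compile(block_t *blk)
{
    blk->compiled = native_stub;
}
```

Keeping the counter in the block descriptor means the check costs one load and one compare per block entry while the block is still cold.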
An example instruction sequence that is a hot spot in Mandelbrot is shown in Figure 2, along with the corresponding extended basic block (EBB). The instruction sequence and the code generated for this EBB are shown below.
10750: lw a5,128(sp)
10754: beq a5,s10,0x10760
10758: add s1,s1,a0
1075c: j 0x10728
10760: mv s8,a0
10764: sub s9,s1,s7
10768: bne s1,s7,0x10a0c
...
- Generated Code
insn_10750:
...
goto insn_10754;
insn_10754:
...
if (...)
goto insn_10760;
goto insn_10758;
...
from rv32emu.
The commit 36f304c implements the JIT strategy described above. The benchmark results, as shown in the statistics below, demonstrate that the JIT has a positive effect on benchmarks with long execution times. However, in the case of Mandelbrot, its short execution time means that the overhead of the JIT outweighs its benefits.
Test | rv32emu (interpreter) | rv32emu (JIT) | Speedup |
---|---|---|---|
CoreMark | 1155.174 (Iterations/Sec) | 1796.483 (Iterations/Sec) | +55.5% |
dhrystone | 1282 DMIPS | 2521 DMIPS | +96.64% |
nqueens | 7766.80 msec | 3893.81 msec | +99.47% |
mandelbrot | 32.28 msec | 77.34 msec | -139.59% |
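For reference, the Speedup column can be reproduced with the arithmetic below; the helper names are hypothetical. Throughput metrics (CoreMark, Dhrystone) use JIT/interpreter - 1, while the nqueens figure corresponds to interpreter time/JIT time - 1. The mandelbrot entry appears to follow a different convention: 77.34/32.28 - 1 is about 139.59% extra run time, whereas the time-based speedup formula would give roughly -58%.

```c
#include <assert.h>

/* Higher-is-better metrics (iterations/sec, DMIPS). */
static double speedup_throughput_pct(double interp, double jit)
{
    assert(interp > 0.0 && jit > 0.0);
    return (jit / interp - 1.0) * 100.0;
}

/* Lower-is-better metrics (elapsed milliseconds). */
static double speedup_time_pct(double interp, double jit)
{
    assert(interp > 0.0 && jit > 0.0);
    return (interp / jit - 1.0) * 100.0;
}

/* Extra run time of the JIT relative to the interpreter, in percent. */
static double slowdown_time_pct(double interp, double jit)
{
    assert(interp > 0.0 && jit > 0.0);
    return (jit / interp - 1.0) * 100.0;
}
```

Calling `speedup_throughput_pct(1155.174, 1796.483)` and `speedup_time_pct(7766.80, 3893.81)` gives values that round to the CoreMark and nqueens rows of the table.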
from rv32emu.
We experimented with the same JIT strategy on top of two different compilers, clang and mir.
clang
Metric | Interpreter-Only | Just-in-time Compiler | Speedup |
---|---|---|---|
CoreMark | 1155.174 (Iterations/Sec) | 1796.483 (Iterations/Sec) | +55.5% |
Dhrystone | 1282 DMIPS | 2521 DMIPS | +96.64% |
mir
Metric | Interpreter-Only | Just-in-time Compiler | Speedup |
---|---|---|---|
CoreMark | 1155.174 (Iterations/Sec) | 2194.907 (Iterations/Sec) | +90% |
Dhrystone | 1282 DMIPS | 2522 DMIPS | +96.7% |
The issue with Clang is that we need to fork a Clang process, which results in a significant overhead. However, its ability to optimize code is strong. In contrast, launching mir has a relatively small overhead, but its ability to optimize code is relatively weak and we cannot determine the code size of the machine code compiled by mir. This limitation prevents us from using a code cache to manage machine code effectively.
from rv32emu.
The preliminary baseline JIT compiler has landed in the wip/jit branch. It comes with some known issues, including Arm64 breakage. The design principle of the baseline JIT is to stick to the C2MIR approach, where we can improve macro-operation fusion and/or RISC-V-oriented optimizations.
from rv32emu.
Ongoing
- As discussed in #142, a related experiment that uses a dominator tree to detect loops has been preliminarily implemented on the branch detect_loop.
- Improving memory I/O.
- Fix the known issue that JIT does not work on macOS on Apple Silicon. JIT compilation works on GNU/Linux for both x86-64 and Arm64.
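Loop detection with dominators, as mentioned in the first item, can be sketched as follows. This is a generic textbook formulation (iterative dominator sets stored as bitmasks, then back-edge detection), not the code on the detect_loop branch; `cfg_t`, `compute_dominators`, and `is_back_edge` are hypothetical names, block 0 is assumed to be the entry, and every block is assumed to be reachable from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 32

typedef struct {
    int n_blocks;
    uint32_t preds[MAX_BLOCKS]; /* bit p of preds[b] set: edge p -> b */
} cfg_t;

static void cfg_add_edge(cfg_t *g, int from, int to)
{
    g->preds[to] |= 1u << from;
}

/* dom[b] gets bit d set when block d dominates block b. */
static void compute_dominators(const cfg_t *g, uint32_t dom[MAX_BLOCKS])
{
    assert(g->n_blocks > 0 && g->n_blocks <= MAX_BLOCKS);
    uint32_t all = g->n_blocks == 32 ? ~0u : (1u << g->n_blocks) - 1;
    dom[0] = 1u; /* the entry is dominated only by itself */
    for (int b = 1; b < g->n_blocks; b++)
        dom[b] = all;
    for (bool changed = true; changed;) {
        changed = false;
        for (int b = 1; b < g->n_blocks; b++) {
            uint32_t meet = all; /* intersect over all predecessors */
            for (int p = 0; p < g->n_blocks; p++)
                if (g->preds[b] & (1u << p))
                    meet &= dom[p];
            uint32_t next = meet | (1u << b);
            if (next != dom[b]) {
                dom[b] = next;
                changed = true;
            }
        }
    }
}

/* An edge from -> to closes a natural loop when `to` dominates `from`. */
static bool is_back_edge(const uint32_t dom[MAX_BLOCKS], int from, int to)
{
    return (dom[from] >> to) & 1u;
}
```

For a graph with edges 0->1, 1->2, 2->1, and 2->3, only 2->1 is reported as a back edge, which identifies block 1 as a loop header.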
from rv32emu.
- As Utilize dominators for constructing extended basic blocks #142, the related experiment of using a dominator tree to detect loops has been preliminarily implemented on the branch detect_loop.
Please describe the details in #142 rather than here. In #142, we care about the feasibility of improving block-based execution by introducing a dominator tree.
from rv32emu.
The author of RVVM discussed the design choices in which it differs substantially from QEMU.
Performance-wise:
- Instead of a static translate-and-run flow like in QEMU, RVVM has an interpret-trace-run execution loop which is remotely similar to JVM, and allows to collect some data like branch probabilities and hot loops, and optimize better
- Using a hardware host FPU instead of softfp emulation. This is like, 10x faster with some synthetic FPU benchmarks
- Conscious decisions for beneficial trade-offs, like fast-path JIT trace cache, JIT IR is more streamlined to "Big ISA Triad" (RISC-V, ARM64, x86-64), etc
Infrastructure-wise:
- A public library API for a lot of things: Machine management (Construct and run 'em in any program), device integration, registering new CPU instructions, userspace emulation
- Subjectively, a more lean and clean codebase, in places where it wasn't harmed by either 1) performance decisions to copy-paste or restructure things 2) complexity of related things like JIT backend arches
- Portability. RVVM officially runs in WASM, runs on Haiku, SerenityOS, KolibriOS, even DOS!
from rv32emu.
lightrec is a MIPS-to-everything dynamic re-compiler (aka JIT compiler or dynrec) for PlayStation emulators, using GNU Lightning as the code emitter. Features:
- High-level optimizations. The MIPS code is first pre-compiled into a form of Intermediate Representation (IR). Basically, just a single-linked list of structures representing the instructions. On that list, several optimization steps are performed: instructions are modified, reordered, tagged; new meta-instructions can also be added.
- Lazy compilation. If Lightrec detects a block of code that would be very hard to compile properly (e.g. a branch with a branch in its delay slot), the block is marked as not compilable, and will always be emulated with the built-in interpreter. This allows to keep the code emitter simple and easy to understand.
- Run-time profiling. The generated code will gather run-time information about the I/O access (whether they hit RAM, or hardware registers). The code generator will then use this information to generate direct read/writes to the emulated memories, instead of jumping to C for every call.
- Threaded compilation. When entering a loading zone, where a lot of code has to be compiled, we don't want the compilation process to slow down the pace of emulation. To avoid that, the code compiler optionally runs on a thread, and the main loop will emulate the blocks that have not been compiled yet with the interpreter. This helps to drastically reduce the stutter that typically happens when a lot of new code is run.
Check optimizer.c, blockcache.c, and TLSF for the implementation.
Test hardware: Desktop PC – Core i7 7700k, Windows 10
Game | Interpreter (No Dithering) | Interpreter (With Dithering) | Dynarec (No Dithering) | Dynarec (With Dithering) |
---|---|---|---|---|
Final Doom | 246fps | 245fps | 621fps | 616fps |
Resident Evil | 250fps | 248fps | 642fps | 639fps |
Tekken 3 | 190fps | 175fps | 279fps | 250fps |
from rv32emu.
The copyjit draws inspiration from the paper "Copy-and-Patch Compilation." However, what if patching could be entirely eliminated?
The core concept revolves around using the compiler to generate 'templates' that can be directly copied into place. This approach heavily relies on continuation passing, which means that all operations defined by the jit library must allow for continuation passing optimizations. In copy-and-patch, the templates are filled in at runtime with user-selected values. Unfortunately, this method relies on parsing ELF relocations, which necessitates porting the library to different platforms. While not a major issue, avoiding runtime patching of relocations could potentially enable the creation of a JIT library that is architecture agnostic and offers very low latencies.
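For contrast, the baseline copy-and-patch step that the paragraph above wants to avoid can be sketched like this. It is an illustration only, not code from copyjit or from the paper: `stencil_t` and `emit_stencil` are hypothetical names, and the template bytes are a hand-written x86-64 `mov eax, imm32; ret` (B8 imm32, C3) instead of compiler-generated output located through ELF relocations. The sketch only fills a buffer; executing it would additionally require executable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A stencil: template machine code plus the position of a 32-bit "hole"
 * that is filled with a runtime value. */
typedef struct {
    const uint8_t *code;
    size_t size;
    size_t hole_offset;
} stencil_t;

/* x86-64: mov eax, imm32 ; ret (immediate zeroed in the template). */
static const uint8_t mov_eax_imm32_ret[] = {0xB8, 0, 0, 0, 0, 0xC3};
static const stencil_t STENCIL_RETURN_CONST = {
    mov_eax_imm32_ret, sizeof(mov_eax_imm32_ret), 1};

/* Copy the stencil into `out`, then patch the hole with `value` in
 * little-endian order. Returns the number of bytes written, or 0 when
 * the output buffer is too small. */
static size_t emit_stencil(uint8_t *out, size_t cap, const stencil_t *s,
                           uint32_t value)
{
    assert(s->hole_offset + 4 <= s->size);
    if (cap < s->size)
        return 0;
    memcpy(out, s->code, s->size);               /* copy */
    for (int i = 0; i < 4; i++)                  /* patch */
        out[s->hole_offset + i] = (uint8_t) (value >> (8 * i));
    return s->size;
}
```

A patch-free design would drop the second loop entirely, which is why it depends on passing values through continuations instead of embedding them in the copied code.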
bcgen generates a number of files in a directory called gen in the working directory. These generated files are included by bcode.c, which you can compile into an object file that provides an interface for compiling and running the bytecode system.
See also: A Template-Based Code Generation Approach for MLIR
from rv32emu.
QEMU employs a two-step process for executing binaries, involving an intermediate representation known as tiny code. This tiny code is interpreted in two ways: first through emulation and second via compilation into native code using a JIT compiler, often leading to enhanced speed.
However, the use of JITs demands the allocation of executable memory to house the compiled code, which is not permitted in iOS. To circumvent this restriction, a technique is employed that involves reusing portions of code that are already in executable memory. This concept takes on various names such as code re-use, ROP (Return-Oriented Programming), and ret2code. It is formalized as "weird machines" due to the differing semantics between the original code and final execution.
This process involves the creation of code gadgets, such as ldr x0, [sp], #16; ret (AArch64), which execute an action and return to their caller upon completion of the first instruction. These gadgets are chained together, forming sequences that achieve specific objectives. The qemu-tcg-tcti project has developed a script that generates pre-compiled code snippets (gadgets) with different functionalities. These gadgets are compiled during the iOS app's build process and are mapped to executable memory, eliminating the need for runtime creation. The final step is the creation of a JIT that constructs a memory segment containing values for use by these gadgets before transitioning to the next one.
This inventive approach allows the creation of complete programs by reusing existing code, a technique historically employed for creating exploits. It provides a creative solution for implementing JIT compilers in architectures that disallow the allocation of executable memory. See commit 4de86e.
UTM has already merged the above qemu-tcg-tcti effort (see the patches directory):
UTM SE ("slow edition") uses a threaded interpreter which performs better than a traditional interpreter but still slower than JIT. This technique is similar to what iSH does for dynamic execution. As a result, UTM SE does not require jailbreaking or any JIT workarounds and can be sideloaded as a regular app.
- TCI (normal interpreter) boots this VM in 130 seconds
- TCTI (threaded interpreter) boots it in only 22 seconds
from rv32emu.
pylbbv is a lazy basic block versioning + copy and patch JIT interpreter for CPython.
The copy-and-patch JIT compiler uses a stencil compiler.
- At runtime, each basic block, except for its branches (exits), is compiled to machine code.
- If compilation is successful, execution jumps into the machine code rather than the interpreter bytecode.
- The branches remain as CPython interpreter bytecode, to facilitate easy branching.
- Upon encountering a branch, the interpreter leaves the machine code to go back into bytecode.
- Execution thus interleaves between machine code and the interpreter.
from rv32emu.
luajit-remake transforms an LLVM function to make it suitable for compilation and back-parsing into a copy-and-patch stencil.
The transformation process involves the following steps:
- The function is split into two parts: the fast path and the slow path. The identification of the slow path logic is done using BlockFrequencyInfo, and proper annotations are added to the LLVM IR to enable the identification and separation of the slow path during assembly generation.
- A simple heuristic is applied to modify the assembly code. This modification ensures that the function falls through to the next bytecode when it might be beneficial.
- The pass comprises two phases: one at the LLVM IR level (IR to IR transformation) and another at the ASM (.s file) level (ASM to ASM transformation).
It is important to note that the IR-level rewrite pass should be executed immediately before the LLVM module is compiled to assembly. Once this pass is applied, no further transformations to the LLVM IR are allowed.
from rv32emu.
Jonathan Müller has an excellent talk, "A deep dive into dispatching techniques". He compared a manual jump table with the one generated by an optimizing compiler.
- Manual jump table
movzx eax, byte ptr [rbx] ; rax := ip->op
jmp qword ptr [r13 + 8*rax] ; goto *execute_table[rax]
- Switch jump table
movzx eax, byte ptr [rbx] ; rax := ip->op
movsxd rax, dword ptr [r13 + 4*rax] ; rax := execute_table[rax]
add rax, r13 ; rax := rax + &execute_table
jmp rax ; goto *rax
The compiler generates a jump table with 4-byte relative offsets rather than 8-byte absolute addresses, resulting in faster execution on an Intel Core i5-1145G7.
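The two dispatch styles compared above correspond roughly to the following C source. This is a minimal sketch with a made-up three-opcode bytecode, not code from the talk; the manual jump table relies on the GNU "labels as values" extension, which GCC and Clang support but standard C does not.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_INC, OP_DOUBLE, OP_HALT, N_OPS };

/* Switch dispatch: the compiler chooses the lowering, typically a range
 * check followed by a jump through a table it generates itself. */
static int64_t run_switch(const uint8_t *ip, int64_t acc)
{
    for (;;) {
        switch (*ip++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return -1; /* unknown opcode */
        }
    }
}

/* Manual jump table: one indirect jump per opcode through a table of
 * absolute label addresses. The assert is only a debug-build guard. */
static int64_t run_goto(const uint8_t *ip, int64_t acc)
{
    static void *const execute_table[N_OPS] = {&&do_inc, &&do_double,
                                               &&do_halt};
#define DISPATCH()                  \
    do {                            \
        assert(*ip < N_OPS);        \
        goto *execute_table[*ip++]; \
    } while (0)

    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}
```

Both functions compute the same result for a valid program; which one is faster depends on the compiler and the CPU, which is the point of the measurement quoted above.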
from rv32emu.
WebAssembly Micro Runtime (WAMR) is a lightweight standalone WebAssembly (Wasm) runtime with a small footprint, high performance, and highly configurable features, for applications spanning embedded devices and beyond.
from rv32emu.
Possible lightweight JIT frameworks:
- TildeBackend (Tilde or TB for short)
- Cwerg
- dstogov/ir used by PHP
from rv32emu.
The core of the security concern lies in the inherent complexity of the system. Even extensively used and battle-tested tools like wasmtime have experienced severe vulnerabilities, such as the recent critical bug that could potentially lead to remote code execution (as seen in Guest-controlled out-of-bounds read/write on x86_64 · bytecodealliance/wasmtime).
The strategy employed here, assuming it progresses beyond the experimental phase, comprises three key elements to ensure robust security:
- Simplicity: A commitment to keeping all components as straightforward as possible. Simplicity enhances security by reducing the attack surface.
- Thorough Testing: The application of exhaustive and comprehensive testing procedures across all aspects of the system, leaving no stone unturned.
- Rigorous Sandboxing: Implementing stringent sandboxing mechanisms akin to a highly secure facility, to ensure that even if an attacker were to break out of the virtual machine and achieve remote execution, their capabilities would be severely restricted, potentially limited to consuming the host's CPU resources at most.
from rv32emu.
- dstogov/ir used by PHP
An experimental JIT for PHP, built upon the dstogov/ir project, has been developed and can be found in the master branch of the php-src repository.
By following the provided build instructions, we can build a development version of PHP, which will display the following:
$ sapi/cli/php --version
PHP 8.4.0-dev (cli) (built: Dec 1 2023 01:59:26) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.0-dev, Copyright (c) Zend Technologies
Check if opcache is loaded:
$ sapi/cli/php -v | grep -i opcache
- Disable opcache.JIT
$ sapi/cli/php -d opcache.jit=off Zend/bench.php
..
Total 0.310
(unit: seconds)
- Enable opcache.JIT
$ sapi/cli/php -d opcache.jit=tracing Zend/bench.php
...
Total 0.089
from rv32emu.
Whose baseline compiler is it anyway? by Ben L. Titzer
We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm.
from rv32emu.
The concept of the delay slot in MIPS was initially a straightforward solution to manage pipeline hazards in five-stage pipelines. However, it became a challenge for processors with longer pipelines and the ability to issue multiple instructions per clock cycle. From a software perspective, the delay slot has drawbacks, making programs harder to read and often less efficient because nop (no operation) instructions are frequently inserted in the delay slot.
Historically, in the 1980s, the idea of branch delay slot made sense for pipelines consisting of 5 or 6 stages, as it helped to mitigate the one-cycle branch penalty inherent in these systems. But with the evolution of processor architectures, this approach has become outdated. For instance, in modern Pentium microarchitectures, the branch penalty can range from 15 to 25 cycles, rendering a single instruction delay slot ineffective. Implementing a delay slot that could accommodate a 15-instruction delay would be impractical and would disrupt the compatibility of instruction sets.
Advancements in technology have introduced more efficient solutions. Branch prediction, now a mature technology, has proven to be more efficient. The rate of misprediction with current branch predictors is significantly lower than the occurrence of branches with a nop delay slot. This holds true even in systems with a relatively short 6-cycle delay, like the Nios II architecture.
Given these considerations, both in terms of hardware and software efficiency, delay slots are less advantageous. Therefore, modern architectures like RISC-V have chosen to omit the delay slot feature, aligning with current technological capabilities and requirements.
lightrec, a MIPS recompiler that employs GNU Lightning for code emission, must handle the delay slot characteristic of MIPS. RISC-V and other more recent RISC designs exclude this feature, so an emulator for them does not need the same handling.
from rv32emu.
rv64_emulator is a RISC-V ISA emulation suite that contains a full-system emulator and an ELF instruction frequency analyzer, with a JIT compiler for Arm64.
from rv32emu.