Use portable JIT compilation for accelerating RISC-V emulation (sysprog21/rv32emu)

Comments (25)

jserv commented on May 20, 2024

Benchmark results of rv8 (RV32 only; runtime normalized to native x86, smaller is better):

benchmark	interpreter	JIT	qemu-user	native-x86
primes	34.33	1.75	1.55	1.00
miniz	40.19	1.95	2.68	1.00
SHA-512	76.23	3.40	3.78	1.00
AES	88.02	2.57	4.08	1.00
qsort	28.82	1.86	7.50	1.00
dhrystone	128.97	2.72	12.04	1.00

from rv32emu.

jserv commented on May 20, 2024

Consider whether we want to build a 64-bit immediate with instructions under the present RISC-V specification. Although memory access is substantially slower, it is simpler to load the constant from memory, since that requires fewer instructions.

In RISC-V

    lui r1, 0x01234
    lui r2, 0x89ABC
    addi r1, 0x567
    addi r2, 0xDEF
    shl r1, 32
    or r1, r2

24 bytes, 6 instructions, and 2 registers.

In x86_64:

    movabs rax, 0x0123456789ABCDEF

10 bytes (REX.W prefix, opcode, 8-byte immediate), 1 instruction, and 1 register.
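The RISC-V sequence above is schematic. In real code, `addi` sign-extends its 12-bit immediate, so the `lui` value has to be bumped by one whenever bit 11 of the constant is set. A minimal sketch of that split for one 32-bit half follows; `imm_split_t`, `split_imm32`, and `join_imm32` are hypothetical names used only for illustration, not taken from rv32emu.

```c
#include <stdint.h>

/* Hypothetical helper type: the two immediates of a LUI + ADDI pair. */
typedef struct {
    uint32_t hi20; /* 20-bit immediate for LUI (upper bits of the value) */
    int32_t lo12;  /* signed 12-bit immediate for ADDI, in [-2048, 2047] */
} imm_split_t;

/* Split a 32-bit constant so that (hi20 << 12) + lo12 rebuilds it. */
imm_split_t split_imm32(uint32_t value)
{
    imm_split_t s;
    /* Adding 0x800 rounds the upper part up when ADDI will subtract. */
    s.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    s.lo12 = (int32_t)(value & 0xFFFu);
    if (s.lo12 >= 0x800)
        s.lo12 -= 0x1000; /* interpret the low 12 bits as signed */
    return s;
}

/* Recombine the pair the way LUI followed by ADDI would on RV32. */
uint32_t join_imm32(imm_split_t s)
{
    return (s.hi20 << 12) + (uint32_t)s.lo12;
}
```

For the low half `0x89ABCDEF`, this yields a `lui` immediate of `0x89ABD` and an `addi` immediate of `-529`, rather than the `0x89ABC` and `0xDEF` written in the schematic listing.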


Ma-Y commented on May 20, 2024

It's not so apparent where a particular code block starts and ends. Does this mean hard-coding pseudo patterns (common code block patterns) for the binary to match on?


jserv commented on May 20, 2024

blink is a virtual machine for running x86-64-linux programs on different operating systems and hardware architectures. It recently gained a JIT. Quoting blink/jit.c:

This file implements an abstraction for assembling executable code at runtime. This is intended to be used in cases where it's desirable to have fast "threaded" pathways, between existing functions, which were compiled statically. We need this because virtual machine dispatching isn't very fast, when it's implemented by loops or indirect branches.
Modern CPUs go much faster if glue code without branches is outputted to memory at runtime, i.e. a small function that calls the functions.


jserv commented on May 20, 2024

wasm3 is a fast WebAssembly interpreter without JIT. In general, its strategy seems capable of executing code around 4-15x slower than compiled code on a modern x86 processor.

Reduce bytecode decoding overhead
- Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
- wasm3 trades some space for time. Opcodes map to up to 3 different operations depending on the number of source operands and commutative-ness.
- Commonly occurring sequences of operations can also be optimized into a "fused" operation. This sometimes results in improved performance.
- In wasm3, the stack machine model is translated into a more direct and efficient "register file" approach.
Tightly Chained Operations

Tails is a minimal, fast Forth-like interpreter core. It uses no assembly code, only C++, but an elegant tail-recursion technique inspired by Wasm3 makes it nearly as efficient as hand-written assembly.
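The chained-operation idea shared by wasm3 and Tails can be sketched in plain C: each operation is a function that does its work and then calls the next operation directly, so there is no central dispatch loop. This is only an illustration under assumptions: `cell_t`, `op_add_imm`, `op_mul_imm`, `op_halt`, and `run_chain` are hypothetical names, and standard C does not guarantee that the final call becomes a jump (the real projects rely on compiler tail-call optimization).

```c
#include <stdint.h>

/* One cell of pre-decoded code: either an operation or an inline operand. */
typedef union cell cell_t;
typedef int64_t (*op_fn)(const cell_t *pc, int64_t acc);
union cell {
    op_fn op;
    int64_t imm;
};

/* Finish an operation by calling the next one. An optimizing compiler
 * may turn this call into a plain jump (tail call). */
#define NEXT(pc, acc) return (pc)->op((pc) + 1, (acc))

/* acc += operand; the operand is stored right after the op cell. */
int64_t op_add_imm(const cell_t *pc, int64_t acc)
{
    acc += pc->imm;
    pc++;
    NEXT(pc, acc);
}

/* acc *= operand */
int64_t op_mul_imm(const cell_t *pc, int64_t acc)
{
    acc *= pc->imm;
    pc++;
    NEXT(pc, acc);
}

/* End of the chain: hand the accumulator back to the caller. */
int64_t op_halt(const cell_t *pc, int64_t acc)
{
    (void)pc;
    return acc;
}

/* Enter the chain at its first operation. */
int64_t run_chain(const cell_t *code, int64_t initial)
{
    return code->op(code + 1, initial);
}
```

A chain such as `{op_add_imm, 4, op_mul_imm, 5, op_halt}` started with 3 computes (3 + 4) * 5. The accumulator passed as an argument plays the role of the "register file" mentioned above, since it tends to stay in a machine register across operations.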


qwe661234 commented on May 20, 2024

Our JIT strategy uses Clang to generate optimized target code. The JIT compiler design is shown in Figure 1.
When a block's usage frequency exceeds a predetermined threshold, we trace its taken and untaken branches and use a code generator to convert the instruction sequence into C code. We then compile the C code using Clang and store the resulting target machine code as a function in the code cache for future use.

An example instruction sequence that is a hot spot in Mandelbrot is shown in Figure 2, along with the corresponding extended basic block (EBB). The generated code for this EBB is shown below.

[Figure 1: JIT compiler design]
[Figure 2: hot-spot instruction sequence in Mandelbrot and its EBB]
Instruction Sequence in Mandelbrot

10750: lw  a5,128(sp)
10754: beq a5,s10,0x10760
10758: add s1,s1,a0
1075c: j   0x10728
10760: mv  s8,a0
10764: sub s9,s1,s7
10768: bne s1,s7,0x10a0c
...

Generated Code

insn_10750:
  ...
  goto insn_10754;
insn_10754:
  ...
  if (...)
    goto insn_10760;
  goto insn_10758;
...
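The shape of the generated code above (one label per guest instruction, with `goto` for control flow) can be produced by a very small emitter. The sketch below is an assumption-laden illustration, not the code generator from the commit: `trace_insn_t`, `insn_kind_t`, and `emit_trace` are hypothetical, instruction semantics are replaced by placeholders, and every instruction is assumed to be 4 bytes long.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { INSN_PLAIN, INSN_BRANCH, INSN_JUMP } insn_kind_t;

typedef struct {
    uint32_t pc;      /* address of the guest instruction */
    uint32_t target;  /* destination, used by branches and jumps only */
    insn_kind_t kind;
} trace_insn_t;

/* Emit one labelled C fragment per traced instruction into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int emit_trace(const trace_insn_t *insns, size_t n, char *buf, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned pc = insns[i].pc;
        unsigned target = insns[i].target;
        unsigned next = pc + 4; /* assumes uncompressed 32-bit encodings */
        int w;
        switch (insns[i].kind) {
        case INSN_BRANCH: /* taken path first, then fall through */
            w = snprintf(buf + used, cap - used,
                         "insn_%x:\n  if (branch_taken)\n    goto insn_%x;\n"
                         "  goto insn_%x;\n", pc, target, next);
            break;
        case INSN_JUMP:
            w = snprintf(buf + used, cap - used,
                         "insn_%x:\n  goto insn_%x;\n", pc, target);
            break;
        default: /* placeholder where the instruction body would go */
            w = snprintf(buf + used, cap - used,
                         "insn_%x:\n  /* instruction body */\n"
                         "  goto insn_%x;\n", pc, next);
            break;
        }
        if (w < 0 || (size_t)w >= cap - used)
            return -1; /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

Feeding it the first instructions of the Mandelbrot listing (`10750` plain, `10754` branching to `10760`) produces text with the same label and `goto` layout as the generated code shown above.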


qwe661234 commented on May 20, 2024

The commit 36f304c implements the JIT strategy described above. The benchmark results, as shown in the statistics below, demonstrate that the JIT has a positive effect on benchmarks with long execution times. However, in the case of Mandelbrot, its short execution time means that the overhead of the JIT outweighs its benefits.

Test	rv32emu (interpreter)	rv32emu (JIT)	Speedup
CoreMark	1155.174 (Iterations/Sec)	1796.483 (Iterations/Sec)	+55.5%
dhrystone	1282 DMIPS	2521 DMIPS	+96.64%
nqueens	7766.80 msec	3893.81 msec	+99.47%
mandelbrot	32.28 msec	77.34 msec	-139.59%


qwe661234 commented on May 20, 2024

We tried the same JIT strategy with two different compilers, Clang and MIR.

clang

Metric	Interpreter-Only	Just-in-time Compiler	Speedup
CoreMark	1155.174 (Iterations/Sec)	1796.483 (Iterations/Sec)	+55.5%
Dhrystone	1282 DMIPS	2521 DMIPS	+96.64%

mir

Metric	Interpreter-Only	Just-in-time Compiler	Speedup
CoreMark	1155.174 (Iterations/Sec)	2194.907 (Iterations/Sec)	+90%
Dhrystone	1282 DMIPS	2522 DMIPS	+96.7%

The issue with Clang is that we need to fork a Clang process, which results in a significant overhead. However, its ability to optimize code is strong. In contrast, launching mir has a relatively small overhead, but its ability to optimize code is relatively weak and we cannot determine the code size of the machine code compiled by mir. This limitation prevents us from using a code cache to manage machine code effectively.
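To see why an unknown code size gets in the way of a code cache, consider a simple bump-allocated cache: it can only place the next block once it knows how many bytes the previous one occupies. The sketch below is a hypothetical illustration, not rv32emu's cache. `code_cache_t` and `cache_reserve` are made-up names, and the backing store is an ordinary buffer rather than executable memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump-allocated code cache over a caller-provided buffer. */
typedef struct {
    uint8_t *base; /* start of the backing buffer */
    size_t cap;    /* total capacity in bytes */
    size_t used;   /* bytes handed out so far */
} code_cache_t;

/* Reserve `size` bytes, 16-byte aligned. The size must be known up front,
 * which is exactly what is missing when the compiler does not report it.
 * Returns NULL when the block does not fit. */
void *cache_reserve(code_cache_t *c, size_t size)
{
    size_t start = (c->used + 15u) & ~(size_t)15u;
    if (start > c->cap || size > c->cap - start)
        return NULL; /* caller would have to evict or flush */
    c->used = start + size;
    return c->base + start;
}
```

When `cache_reserve` returns NULL, a real cache would evict old blocks or flush everything. Both policies again depend on knowing how much space each compiled block takes.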


jserv commented on May 20, 2024

The issue with Clang is that we need to fork a Clang process, which results in a significant overhead. However, its ability to optimize code is strong. In contrast, launching mir has a relatively small overhead, but its ability to optimize code is relatively weak and we cannot determine the code size of the machine code compiled by mir. This limitation prevents us from using a code cache to manage machine code effectively.

The preliminary baseline JIT compiler has landed in the wip/jit branch. It comes with some known issues, including Arm64 breakage. The design principle of the baseline JIT is to stick to the C2MIR approach, where we can improve macro-operation fusion and/or RISC-V oriented optimizations.


qwe661234 commented on May 20, 2024

Ongoing

- As for #142, the related experiment of using a dominator tree to detect loops has been preliminarily implemented on the detect_loop branch.
- Improve memory I/O.
- Fix the known issue that the JIT does not work on macOS on Apple Silicon. JIT compilation works on GNU/Linux for both x86-64 and Arm64.
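As a rough illustration of the first item, loops can be found from dominators by looking for back edges: an edge u -> v is a back edge when v dominates u, and v is then a loop header. The sketch below uses the classic iterative data-flow formulation over bitsets. It is an assumption-based example and not the code on the detect_loop branch: `cfg_t`, `compute_dominators`, and `find_loop_headers` are hypothetical names, the graph is limited to 32 blocks, and every block is assumed reachable from block 0.

```c
#include <stdint.h>

#define MAX_BLOCKS 32

typedef struct {
    int n;                     /* number of blocks; block 0 is the entry */
    uint32_t succ[MAX_BLOCKS]; /* bit s of succ[b] set => edge b -> s */
} cfg_t;

/* dom[b] becomes the bitset of blocks that dominate b. */
void compute_dominators(const cfg_t *g, uint32_t dom[MAX_BLOCKS])
{
    uint32_t all = g->n >= 32 ? 0xFFFFFFFFu : ((1u << g->n) - 1u);
    int changed = 1;

    dom[0] = 1u; /* the entry is dominated only by itself */
    for (int b = 1; b < g->n; b++)
        dom[b] = all;

    while (changed) {
        changed = 0;
        for (int b = 1; b < g->n; b++) {
            uint32_t meet = all;
            /* Intersect the dominator sets of all predecessors of b. */
            for (int p = 0; p < g->n; p++)
                if (g->succ[p] & (1u << b))
                    meet &= dom[p];
            uint32_t next = meet | (1u << b);
            if (next != dom[b]) {
                dom[b] = next;
                changed = 1;
            }
        }
    }
}

/* Return the bitset of loop headers, i.e. targets of back edges. */
uint32_t find_loop_headers(const cfg_t *g)
{
    uint32_t dom[MAX_BLOCKS];
    uint32_t headers = 0;

    compute_dominators(g, dom);
    for (int u = 0; u < g->n; u++)
        headers |= g->succ[u] & dom[u]; /* successors that dominate u */
    return headers;
}
```

For the graph 0 -> 1 -> 2 -> 1 with an exit 2 -> 3, the only back edge is 2 -> 1, so block 1 is reported as the loop header.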


jserv commented on May 20, 2024

As for "Utilize dominators for constructing extended basic blocks" (#142), the related experiment of using a dominator tree to detect loops has been preliminarily implemented on the branch detect_loop.

Please describe the details in #142 rather than here. In #142, we care about the feasibility of improving block-based execution by introducing a dominator tree.


jserv commented on May 20, 2024

The author of RVVM discussed the design choices in which it differs substantially from QEMU.

Performance-wise:

Instead of a static translate-and-run flow like in QEMU, RVVM has an interpret-trace-run execution loop which is remotely similar to JVM, and allows to collect some data like branch probabilities and hot loops, and optimize better

Using a hardware host FPU instead of softfp emulation. This is like, 10x faster with some synthetic FPU benchmarks

Conscious decisions for beneficial trade-offs, like fast-path JIT trace cache, JIT IR is more streamlined to "Big ISA Triad" (RISC-V, ARM64, x86-64), etc
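The interpret-trace-run loop from the first point can be pictured as a per-block hit counter that decides, on each entry, whether to keep interpreting, compile now, or run already compiled code. This is a minimal sketch under assumptions and does not reflect RVVM's actual data structures: `profiler_t`, `profile_block`, `HOT_THRESHOLD`, and the direct-mapped table are all hypothetical.

```c
#include <stdint.h>

#define HOT_THRESHOLD 3u /* hypothetical value; real engines tune this */
#define TABLE_SIZE 256u

typedef enum { RUN_INTERPRET, RUN_COMPILE, RUN_NATIVE } run_mode_t;

typedef struct {
    uint32_t pc;   /* guest address of the block */
    uint32_t hits; /* how often the block was entered */
    int compiled;  /* non-zero once native code exists */
} block_profile_t;

typedef struct {
    block_profile_t slots[TABLE_SIZE]; /* direct-mapped by guest PC */
} profiler_t;

/* Record one entry into the block at `pc` and say how to execute it. */
run_mode_t profile_block(profiler_t *p, uint32_t pc)
{
    block_profile_t *b = &p->slots[(pc >> 2) % TABLE_SIZE];

    if (b->pc != pc) { /* new block or slot collision: start over */
        b->pc = pc;
        b->hits = 0;
        b->compiled = 0;
    }
    if (b->compiled)
        return RUN_NATIVE;
    if (++b->hits >= HOT_THRESHOLD) {
        b->compiled = 1; /* the caller is expected to compile the block */
        return RUN_COMPILE;
    }
    return RUN_INTERPRET;
}
```

With a zero-initialized `profiler_t`, the first two entries into a block are interpreted, the third triggers compilation, and later entries run the compiled code. The same counters are where data such as branch probabilities could be collected.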

Infrastructure-wise:

A public library API for a lot of things: Machine management (Construct and run 'em in any program), device integration, registering new CPU instructions, userspace emulation

Subjectively, a more lean and clean codebase, in places where it wasn't harmed by either 1) performance decisions to copy-paste or restructure things 2) complexity of related things like JIT backend arches

Portability. RVVM officially runs in WASM, runs on Haiku, SerenityOS, KolibriOS, even DOS!


jserv commented on May 20, 2024

lightrec is a MIPS-to-everything dynamic re-compiler (aka JIT compiler or dynrec) for PlayStation emulators, using GNU Lightning as the code emitter. Features:

High-level optimizations. The MIPS code is first pre-compiled into a form of Intermediate Representation (IR). Basically, just a single-linked list of structures representing the instructions. On that list, several optimization steps are performed: instructions are modified, reordered, tagged; new meta-instructions can also be added.
Lazy compilation. If Lightrec detects a block of code that would be very hard to compile properly (e.g. a branch with a branch in its delay slot), the block is marked as not compilable, and will always be emulated with the built-in interpreter. This allows to keep the code emitter simple and easy to understand.
Run-time profiling. The generated code will gather run-time information about the I/O access (whether they hit RAM, or hardware registers). The code generator will then use this information to generate direct read/writes to the emulated memories, instead of jumping to C for every call.
Threaded compilation. When entering a loading zone, where a lot of code has to be compiled, we don't want the compilation process to slow down the pace of emulation. To avoid that, the code compiler optionally runs on a thread, and the main loop will emulate the blocks that have not been compiled yet with the interpreter. This helps to drastically reduce the stutter that typically happens when a lot of new code is run.

Check optimizer.c, blockcache.c, and TLSF for the implementation.

Test hardware: Desktop PC – Core i7 7700k, Windows 10

Game	Interpreter (No Dithering)	Interpreter (With Dithering)	Dynarec (No Dithering)	Dynarec (With Dithering)
Final Doom	246fps	245fps	621fps	616fps
Resident Evil	250fps	248fps	642fps	639fps
Tekken 3	190fps	175fps	279fps	250fps


jserv commented on May 20, 2024

The copyjit draws inspiration from the paper "Copy-and-Patch Compilation." However, what if patching could be entirely eliminated?

The core concept revolves around using the compiler to generate 'templates' that can be directly copied into place. This approach heavily relies on continuation passing, which means that all operations defined by the jit library must allow for continuation passing optimizations. In copy-and-patch, the templates are filled in at runtime with user-selected values. Unfortunately, this method relies on parsing ELF relocations, which necessitates porting the library to different platforms. While not a major issue, avoiding runtime patching of relocations could potentially enable the creation of a JIT library that is architecture agnostic and offers very low latencies.

bcgen generates a number of files in a directory called gen in the working directory. These generated files are included by bcode.c, which you can compile into an object file that provides an interface for compiling and running the bytecode system.


jserv commented on May 20, 2024

QEMU employs a two-step process for executing binaries, involving an intermediate representation known as tiny code. This tiny code is interpreted in two ways: first through emulation and second via compilation into native code using a JIT compiler, often leading to enhanced speed.

However, the use of JITs demands the allocation of executable memory to house the compiled code, which is not permitted in iOS. To circumvent this restriction, a technique is employed that involves reusing portions of code that are already in executable memory. This concept takes on various names such as code re-use, ROP (Return-Oriented Programming), and ret2code. It is formalized as "weird machines" due to the differing semantics between the original code and final execution.

This process involves the creation of code gadgets, such as ldr x0, [sp], #16; ret (Aarch64), which execute an action and return to their caller upon completion of the first instruction. These gadgets are chained together, forming sequences that achieve specific objectives. The qemu-tcg-tcti project has developed a script that generates pre-compiled code snippets (gadgets) with different functionalities. These gadgets are compiled during the iOS app's build process and are mapped to executable memory, eliminating the need for runtime creation. The final step is the creation of a JIT that constructs a memory segment containing values for use by these gadgets before transitioning to the next one.

This inventive approach allows the creation of complete programs by reusing existing code, a technique historically employed for creating exploits. It provides a creative solution for implementing JIT compilers in architectures that disallow the allocation of executable memory. See commit 4de86e.

UTM has already merged the above qemu-tcg-tcti effort (see patches directory):

UTM SE ("slow edition") uses a threaded interpreter which performs better than a traditional interpreter but still slower than JIT. This technique is similar to what iSH does for dynamic execution. As a result, UTM SE does not require jailbreaking or any JIT workarounds and can be sideloaded as a regular app.

Report on Apr 1 2021:

TCI (normal interpreter) boots this VM in 130 seconds
TCTI (threaded interpreter) boots it in only 22 seconds


jserv commented on May 20, 2024

pylbbv is a lazy basic block versioning + copy and patch JIT interpreter for CPython.

The copy-and-patch JIT compiler uses a stencil compiler.

At runtime, each basic block, except for its branches (exits), is compiled to machine code.
If compilation is successful, execution jumps into the machine code rather than the interpreter bytecode.
The branches remain as CPython interpreter bytecode, to facilitate easy branching.
Upon encountering a branch, the interpreter leaves the machine code to go back into bytecode.
Execution thus interleaves between machine code and the interpreter.


jserv commented on May 20, 2024

luajit-remake transforms an LLVM function to make it suitable for compilation and back-parsing into a copy-and-patch stencil.

The transformation process involves the following steps:

The function is split into two parts: the fast path and the slow path. The identification of the slow path logic is done using BlockFrequencyInfo, and proper annotations are added to the LLVM IR to enable the identification and separation of the slow path during assembly generation.

A simple heuristic is applied to modify the assembly code. This modification ensures that the function falls through to the next bytecode when it might be beneficial.
The pass comprises two phases: one at the LLVM IR level (IR to IR transformation) and another at the ASM (.s file) level (ASM to ASM transformation).

It is important to note that the IR-level rewrite pass should be executed immediately before the LLVM module is compiled to assembly. Once this pass is applied, no further transformations to the LLVM IR are allowed.


jserv commented on May 20, 2024

Jonathan Müller gave an excellent talk, "A deep dive into dispatching techniques", in which he compared a manual jump table with the one generated by an optimizing compiler.

Manual jump table

movzx eax, byte ptr [rbx] ; rax := ip->op
jmp qword ptr [r13 + 8*rax] ; goto *execute_table[rax]

Switch jump table

movzx eax, byte ptr [rbx] ; rax := ip->op
movsxd rax, dword ptr [r13 + 4*rax] ; rax := execute_table[rax]
add rax, r13 ; rax := rax + &execute_table
jmp rax ; goto *rax

The compiler generates a jump table with 4-byte relative offsets rather than 8-byte absolute addresses, resulting in faster execution on an Intel Core i5-1145G7.
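At the source level, the two listings correspond roughly to the two interpreter loops below: a `switch`, for which the compiler builds the table, and a hand-written table of label addresses. This is a sketch for a made-up four-opcode stack machine, not code from the talk. `run_switch` and `run_goto` are hypothetical names, the bytecode is assumed to be well formed (no bounds checks), and the second variant relies on the "labels as values" extension of GCC and Clang rather than standard C.

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Switch dispatch: the compiler decides how to build the jump table. */
int64_t run_switch(const uint8_t *ip)
{
    int64_t stack[16] = {0};
    size_t sp = 0;

    for (;;) {
        switch (*ip++) {
        case OP_PUSH: stack[sp++] = (int8_t)*ip++; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1; /* unknown opcode */
        }
    }
}

/* Manual jump table: one absolute label address per opcode
 * (GNU C extension, available in GCC and Clang). */
int64_t run_goto(const uint8_t *ip)
{
    static void *const table[] = {&&do_push, &&do_add, &&do_mul, &&do_halt};
    int64_t stack[16] = {0};
    size_t sp = 0;

#define DISPATCH() goto *table[*ip++]
    DISPATCH();
do_push:
    stack[sp++] = (int8_t)*ip++;
    DISPATCH();
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_mul:
    sp--;
    stack[sp - 1] *= stack[sp];
    DISPATCH();
do_halt:
    return stack[sp - 1];
#undef DISPATCH
}
```

Both functions give the same result for the same bytecode, for example `PUSH 3, PUSH 4, ADD, PUSH 5, MUL, HALT`. They differ only in how the next handler address is found, which is the difference the two assembly listings show.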


jserv commented on May 20, 2024

WebAssembly Micro Runtime (WAMR) is a lightweight standalone WebAssembly (Wasm) runtime with a small footprint, high performance, and highly configurable features, for applications ranging from embedded devices to cloud-native deployments.


jserv commented on May 20, 2024

Possible lightweight JIT frameworks:

TildeBackend (Tilde or TB for short)
Cwerg
dstogov/ir used by PHP


jserv commented on May 20, 2024

The core of the security concern lies in the inherent complexity of the system. Even extensively used and battle-tested tools like wasmtime have experienced severe vulnerabilities, such as the recent critical bug that could potentially lead to remote code execution (as seen in Guest-controlled out-of-bounds read/write on x86_64 · bytecodealliance/wasmtime).

The strategy employed here, assuming it progresses beyond the experimental phase, comprises three key elements to ensure robust security:

Simplicity: A commitment to keeping all components as straightforward as possible. Simplicity enhances security by reducing the attack surface.
Thorough Testing: The application of exhaustive and comprehensive testing procedures across all aspects of the system, leaving no stone unturned.
Rigorous Sandboxing: Implementing stringent sandboxing mechanisms akin to a highly secure facility, to ensure that even if an attacker were to break out of the virtual machine and achieve remote execution, their capabilities would be severely restricted, potentially limited to consuming the host's CPU resources at most.


jserv commented on May 20, 2024

dstogov/ir used by PHP

An experimental JIT for PHP, built upon dstogov/ir project, has been developed and can be found in the master branch of the php-src repository.

By following the provided build instructions, we can build a development version of PHP, which will display the following:

$ sapi/cli/php --version
PHP 8.4.0-dev (cli) (built: Dec  1 2023 01:59:26) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.0-dev, Copyright (c) Zend Technologies

Check if opcache is loaded

$ sapi/cli/php -v | grep -i opcache

Disable opcache.JIT

$ sapi/cli/php -d opcache.jit=off Zend/bench.php
..
Total              0.310

(unit: second)

Enable opcache.JIT

$ sapi/cli/php -d opcache.jit=tracing Zend/bench.php
...
Total              0.089


jserv commented on May 20, 2024

Whose baseline compiler is it anyway? by Ben L. Titzer

We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm.


jserv commented on May 20, 2024

The concept of delay slot in MIPS was initially a straightforward solution to manage pipeline hazards in five-stage pipelines. However, it became a challenge for processors with longer pipelines and the ability to issue multiple instructions per clock cycle. From a software perspective, delay slot has drawbacks, making programs harder to read and often less efficient due to frequently inserting nop (no operation) instructions in the delay slot.

Historically, in the 1980s, the idea of branch delay slot made sense for pipelines consisting of 5 or 6 stages, as it helped to mitigate the one-cycle branch penalty inherent in these systems. But with the evolution of processor architectures, this approach has become outdated. For instance, in modern Pentium microarchitectures, the branch penalty can range from 15 to 25 cycles, rendering a single instruction delay slot ineffective. Implementing a delay slot that could accommodate a 15-instruction delay would be impractical and would disrupt the compatibility of instruction sets.

Advancements in technology have introduced more efficient solutions. Branch prediction, now a mature technology, has proven to be more efficient. The rate of misprediction with current branch predictors is significantly lower than the occurrence of branches with a nop delay slot. This holds true even in systems with a relatively short 6-cycle delay, like the Nios II architecture.

Given these considerations, both in terms of hardware and software efficiency, delay slots are less advantageous. Therefore, modern architectures like RISC-V have chosen to omit the delay slot feature, aligning with current technological capabilities and requirements.

lightrec, a MIPS recompiler that employs GNU Lightning for code emission, must handle the delay slots of MIPS. RISC-V and other more recent RISC designs have no delay slots, which reflects a broader trend in newer RISC architectures to move away from this once-common design element.


jserv commented on May 20, 2024

rv64_emulator is a RISC-V ISA emulation suite which contains a full system emulator and an ELF instruction frequency analyzer, with JIT compiler for Arm64.
