slothy-optimizer / slothy Goto Github PK

View Code? Open in Web Editor NEW

132.0 6.0 10.0 8.97 MB

Assembly super-optimization via constraint solving

Home Page: https://slothy-optimizer.github.io/slothy/

License: Other

Python 8.18% Shell 0.58% Assembly 91.23% Dockerfile 0.02%

assembly-language constraint-programming cryptography-tools superoptimization post-quantum-cryptography

slothy's Introduction

SLOTHY: Assembly optimization via constraint solving

About SLOTHY

SLOTHY - Super (Lazy) Optimization of Tricky Handwritten assemblY - is an assembly-level superoptimizer for:

Instruction scheduling
Register allocation
Software pipelining (= periodic loop interleaving)

SLOTHY is generic in the target architecture and microarchitecture. This repository provides instantiations for:

Armv8.1-M+Helium: Cortex-M55, Cortex-M85
AArch64: Cortex-A55, and experimentally Cortex-A72, Cortex-X/Neoverse-V, Apple M1 (Firestorm, Icestorm)

SLOTHY is discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.

Goal

SLOTHY enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific micro-optimizations. This accelerates development, keeps manually written code artifacts maintainable, and allows to split efforts for formal verification into the separate verification of the clean code and the micro-optimizations.

How it works

SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop interleavings. The model is passed to an external constraint solver and, upon success, a satisfying assignment converted back into the final code. Currently, SLOTHY uses Google OR-Tools as its constraint solver backend.

Performance

As a rough rule of thumb, SLOTHY typically optimizes workloads of <50 instructions in seconds to minutes, workloads up to 150 instructions in minutes to hours, while for larger kernels some heuristics are necessary.

Applications

SLOTHY has been used to provide the fastest known implementations of various cryptographic and DSP primitives: For example, the SLOTHY paper discusses the NTTs underlying ML-KEM and ML-DSA for Cortex-{A55, A72, M55, M85}, the FFT for Cortex-{M55,M85}, and the X25519 scalar multiplication for Cortex-A55. You find the clean and optimized source code for those examples in paper/.

Getting started

Have a look at the SLOTHY tutorial for a hands-on and example-based introduction to SLOTHY.

Real world uses

Arm EndpointAI: SLOTHY-optimized code has been deployed to the CMSIS DSP Library for the radix-4 CFFT routines as part of the Arm EndpointAI project in this commit.
AWS libcrypto (AWS-LC): SLOTHY-optimized X25519 code based on our un-interleaved form of the original code by Emil Lenngren has been formally verified and included in s2n-bignum and merged into AWS-LC.

Installation

Requirements

SLOTHY has been successfully used on

Ubuntu-21.10 and up (64-bit),
macOS Monterey 12.6 and up.

SLOTHY requires Python >= 3.10. See requirements.txt for package requirements, and install via pip install -r requirements.txt.

Note: requirements.txt pins versions for reproducibility. If you already have newer versions of some dependencies installed and don't want them downgraded, consider using a virtual environment:

python3 -m venv venv
./venv/bin/python3 -m pip install -r requirements.txt

Then, enter the virtual environment via source venv/bin/activate prior to running SLOTHY.

Docker

A dockerfile for an Ubuntu-22.04 based Docker image with all dependencies of SLOTHY and the PQMX+PQAX test environments setup can be found in paper/artifact/slothy.dockerfile. See paper/artifact/README.md for instructions.

Quick check

To check that your setup is complete, try the following from the base directory:

% python3 example.py --examples aarch64_simple0_a55

You should see something like the following:

* Example: aarch64_simple0_a55...
INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0721s]: Found 1 solutions so far... objective 19.0, bound 8.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0765s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 509
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18

Examples

The SLOTHY Tutorial and the examples directory contain numerous exemplary assembly snippets. To try them, use python3 example.py --examples={YOUR_EXAMPLE}. See python3 example.py --help for the list of all available examples.

The use of SLOTHY from the command line is illustrated in scripts/ supporting the real-world optimizations for the NTT, FFT and X25519 discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.

slothy's People

Contributors

Stargazers

Watchers

Forkers

fabklein dop-amin mrcodechef ego aqjune-aws bftjoe brugarolas aqjune mkannwischer

slothy's Issues

Document address offset fixup and make it optional

Background: Address offset fixup is an important feature to facilitate software pipelining as it allows the reordering of ldr/str instructions with increment operating on the same address register. Normally, SLOTHY would not allow such reordering because it breaks data dependencies, but commutativity relations such as ldr _, [x0]; str _, [x0], #imm == str _, [x0], #imm; ldr _, [x0, #-imm] makes it possible, assuming a suitable 'address offset fixup' -- hence the name.

Issue: While important, address offset fixup is still somewhat experimental and poorly documented. What's more, SLOTHY's selfcheck is not aware of it and cannot catch bugs it may still have. To facilitate debugging, address offset fixup should be behind a configuration option to begin with.

Acceptance criteria:

SLOTHY supports a configuration option config.address_offset_fixup controlling the use of address offset fixup.
Document what determines whether instructions can participate in address offset fixup.

CMake error when building

I got a trouble on building SLOTHY.

I got following CMake error when I run cmake -S. -Bbuild -DBUILD_PYTHON:BOOL=ON

CMake Error at /usr/local/share/cmake-3.28/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find SWIG (missing: SWIG_EXECUTABLE SWIG_DIR)
Call Stack (most recent call first):
  /usr/local/share/cmake-3.28/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /usr/local/share/cmake-3.28/Modules/FindSWIG.cmake:153 (find_package_handle_standard_args)
  cmake/python.cmake:27 (find_package)
  CMakeLists.txt:333 (include)

I use cmake version 3.28.0-rc5 and GNU Make 4.2.1.
How can I solve this?

Can I preserve SIMD registers using `-c reserved_regs`?

I wonder whether it is possible to use -c reserved_regs=[..] to avoid assignment to v8-v15 registers.
Only low-half parts of these registers are mandatory to be preserved by the callee, but totally avoiding usage of v8-v15 is still fine in my case because my input programs use few SIMD registers.

Used script:

OUTPUTS="[hint_buffer0]"
RESERVED_REGS="[x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,sp,q8,q9,q10,q11,q12,q13,q14,q15,v8,v9,v10,v11,v12,v13,v14,v15]"

${SLOTHY_DIR}/slothy-cli Arm_AArch64 Arm_Neoverse_N1_experimental \
       a.s \
    -o a_opt.s \
    -c outputs=$OUTPUTS -c reserved_regs=$RESERVED_REGS \
    $REDIRECT_OUTPUT

Input:

ldr q27, [x1]                    // @slothy:reads=buffer0
umull v8.2D, v27.2S, v27.2S
mov x6, v8.d[0]
mov x12, v8.d[1]
stp x6, x12, [x1]                    // @slothy:writes=buffer0

The output program still contains usage of v8:

ldr q5, [x1]                   //(omitting the plots here)
umull v8.2D, v5.2S, v5.2S << v8 is used
...

I tweaked SLOTHY a bit to print all input logs to info, and I could see this diagnostic message which does exclude v8-v15:

INFO:slothy-cli.split.0_68:Registers available for renaming
INFO:slothy-cli.split.0_68:- GPR available: ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14',
'x15', 'x16', 'x17']
INFO:slothy-cli.split.0_68:- NEON available: ['v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22
', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'v29', 'v30', 'v31']        <----- This correctly excludes v8 ~ v15..!

... which might hint that it is somewhere in its downstream that omits the renaming constraint.

Problems with optimizing X25519

I was trying to put together an example showing how to use the splitting heuristic on the X25519.
I'm not trying to have optimal code (for that we of course have https://github.com/slothy-optimizer/slothy/blob/main/paper/scripts/slothy_x25519.sh), but just trying to have something that terminates in a reasonable time on a laptop and demonstrates how the splitting heuristic works.

I have the following code (to be placed in a Python script in the root dir):

import logging
import sys

from slothy import Slothy

import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

arch = AArch64_Neon
target = Target_CortexA55

slothy = Slothy(arch, target)

# example
slothy.load_source_from_file("examples/naive/aarch64/X25519-AArch64-simple.s")

slothy.config.inputs_are_outputs=True
slothy.config.outputs=["x0"]
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32
slothy.config.split_heuristic = True
slothy.config.split_heuristic_stepsize = 0.1
slothy.config.split_heuristic_factor = 10
slothy.config.split_heuristic_repeat = 2
slothy.optimize(start="mainloop", end="end_label")
slothy.write_source_to_file("test.s")

Running this very quickly errors with

INFO:slothy:Instructions in body: 958
INFO:slothy.mainloop.split.0_96:Perform internal binary search for minimal number of stalls...
INFO:slothy.mainloop.split.0_96:Attempt optimization with max 32 stalls...
INFO:slothy.mainloop.split.0_96:Objective: minimize number of stalls
INFO:slothy.mainloop.split.0_96:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:slothy.mainloop.split.0_96:[4.5895s]: Found 1 solutions so far... objective 12.0, bound 6.0 (minimize number of stalls)
INFO:slothy.mainloop.split.0_96:OPTIMAL, wall time: 6.632225 s
INFO:slothy.mainloop.split.0_96:Booleans in result: 8991
INFO:slothy.mainloop.split.0_96.selfcheck:OK!
INFO:slothy.mainloop.split.0_96:Minimum number of stalls: 12
INFO:slothy.mainloop.split.96_192:Perform internal binary search for minimal number of stalls...
INFO:slothy.mainloop.split.96_192:Attempt optimization with max 32 stalls...
ERROR:slothy.mainloop.split.96_192:Ran out of registers trying to _statically_ assign architectural registers
                for input and outputs of the snippet. You should consider enabling register
                renaming for input and output, by setting config.rename_{inputs,outputs}.
                Alternatively, you may fix input and output registers by hand yourself.
Traceback (most recent call last):
  File "/home/mjk/git/slothy-tutorial/tst.py", line 34, in <module>
    slothy.optimize(start="mainloop", end="end_label")
  File "/home/mjk/git/slothy-tutorial/slothy/core/slothy.py", line 249, in optimize
    early, core, late, num_exceptional = Heuristics.periodic(body, logger, c)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 310, in periodic
    res = Heuristics.linear( body, logger=logger, conf=conf)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 387, in linear
    return Heuristics._split(body, logger, conf)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 782, in _split
    return Heuristics._split_inner(body, logger, c)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 723, in _split_inner
    cur_body, stalls, _ = optimize_chunks_many(idx_lst, cur_body, stalls,show_stalls=False)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 683, in optimize_chunks_many
    body, stalls, cur_stalls, local_perm = optimize_chunk(start_idx, end_idx, body,
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 659, in optimize_chunk
    result = Heuristics.optimize_binsearch(cur_body,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 132, in optimize_binsearch
    return Heuristics.optimize_binsearch_internal(source, logger, conf, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 234, in optimize_binsearch_internal
    success = core.optimize(source, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1040, in optimize
    self._load_source(source, prefix_len=prefix_len, suffix_len=suffix_len)
  File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1164, in _load_source
    self._restrict_input_output_renaming()
  File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1288, in _restrict_input_output_renaming
    raise e
  File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1281, in _restrict_input_output_renaming
    v.reg = get_fresh_renaming_reg(v.ty)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1273, in get_fresh_renaming_reg
    raise OutOfRegisters
slothy.core.core.SlothyBase._restrict_input_output_renaming.<locals>.OutOfRegisters

If I instead do two calls to optimize to first resolve the symbolic variable names (in the way the other script does it), it works fine:

import logging
import sys

from slothy import Slothy

import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

arch = AArch64_Neon
target = Target_CortexA55

slothy = Slothy(arch, target)

# example
slothy.load_source_from_file("examples/naive/aarch64/X25519-AArch64-simple.s")

slothy.config.inputs_are_outputs=True
slothy.config.outputs=["x0"]
slothy.config.constraints.functional_only = True
slothy.config.constraints.allow_reordering = False
slothy.optimize(start="mainloop", end="end_label")


slothy.config.constraints.functional_only = False
slothy.config.constraints.allow_reordering = True
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32
slothy.config.split_heuristic = True
slothy.config.split_heuristic_stepsize = 0.1
slothy.config.split_heuristic_factor = 10
slothy.config.split_heuristic_repeat = 2
slothy.optimize(start="mainloop", end="end_label")
slothy.write_source_to_file("test.s")

I'm not sure if this is a bug, or if I misunderstand how SLOTHY handles symbolic register names.
@hanno-becker, any ideas?