
maat's People

Contributors

boyan-milanov, ekilmer, integeruser, jordan9001, novafacing, rubonnek


maat's Issues

Unable to build Docker container.

Maat version:

master branch at commit 485b2c6

Issue description:
Attempting to build the Docker container yields the following error:

Step 9/9 : RUN cmake -S . -B /tmp/maat/build -DCMAKE_BUILD_TYPE=RelWithDebInfo "-DCMAKE_INSTALL_PREFIX=$(python3 -m site --user-base)"       -Dmaat_USE_EXTERNAL_SLEIGH=OFF     &&   cmake --build /tmp/maat/build -j $(nproc) &&   cmake --install /tmp/maat/build
 ---> Running in b5eb127d2a17
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found GMP: /usr/lib/x86_64-linux-gnu/libgmp.so  
CMake Error at CMakeLists.txt:108 (add_subdirectory):
  The source directory

    /src/maat/src/third-party/sleigh/sleigh-cmake

  does not contain a CMakeLists.txt file.


-- Found Z3: /usr/lib/x86_64-linux-gnu/libz3.so  
-- Found LIEF: /usr/local/lib/libLIEF.a (found version "0.11.5") 
CMake Error at CMakeLists.txt:150 (sleigh_compile):
  Unknown CMake command "sleigh_compile".
Call Stack (most recent call first):
  CMakeLists.txt:159 (maat_sleigh_compile)


-- Configuring incomplete, errors occurred!

The issue seems to be a missing CMakeLists.txt at /src/maat/src/third-party/sleigh/sleigh-cmake.

I'm using Docker version 20.10.12, build e91ed5707e.

Steps to reproduce:

  1. git clone https://github.com/trailofbits/maat
  2. cd maat
  3. docker build .

Remove vendored sleigh source/files

After merging #37 we should remove the vendored sleigh source files (except for the supported/modified .slaspec architecture files) from the repo, in favor of using the lifting-bits/sleigh CMake project.

We don't do this in #37 because it would blow up the diff.

Expose the `Arch` class in bindings

We should expose the Arch class in the Python bindings so that users can use the generic pc and sp getters, and get the size of an arbitrary register.

Contrary to the native C++ Arch API, the Python wrapper should use strings to identify registers instead of reg_t, since that integrates better with the way the CPU is exposed in bindings (with a dynamic attribute getter that uses register names rather than their reg_t numbers).

Acceptance criteria:

  • Arch class is exposed in bindings with methods:
    • pc(): return the name of the program counter
    • sp(): return the name of the stack pointer
    • reg_size(<reg>): return the size in bits of the register
  • MaatEngine has an arch attribute accessible from python bindings
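
For illustration, the string-keyed interface described above could look like the following minimal Python sketch. The class shape, register names, and sizes here are hypothetical stand-ins, not Maat's actual bindings:

```python
# Hypothetical sketch of a string-keyed Arch interface; not Maat's real API.
class Arch:
    def __init__(self, pc_name, sp_name, reg_sizes):
        self._pc = pc_name
        self._sp = sp_name
        self._reg_sizes = reg_sizes  # register name -> size in bits

    def pc(self):
        """Return the name of the program counter register."""
        return self._pc

    def sp(self):
        """Return the name of the stack pointer register."""
        return self._sp

    def reg_size(self, reg):
        """Return the size in bits of the named register."""
        return self._reg_sizes[reg]

# Illustrative register table for x86-64
x64 = Arch("rip", "rsp", {"rip": 64, "rsp": 64, "rax": 64, "eax": 32})
```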

`ir::CPU` doesn't need to be templated

It turns out that making ir::CPU a templated class doesn't bring any additional value, while making it harder to use. We should make it a regular class and switch from using std::array internally to std::vector, reserving as many slots as the architecture has registers and initialising them to empty values.
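
The non-templated design amounts to a register file sized at construction time. A minimal Python analogue of that idea (names invented here, not Maat's ir::CPU):

```python
# Sketch of a non-templated register file: one slot per architecture
# register, allocated up front and initialised to an "empty" value.
class CPU:
    def __init__(self, nb_regs):
        # Analogous to a vector<> reserved with nb_regs empty slots
        self.regs = [None] * nb_regs

    def set(self, reg, value):
        self.regs[reg] = value

    def get(self, reg):
        return self.regs[reg]

cpu = CPU(nb_regs=16)   # register count comes from the arch, not a template
cpu.set(0, 0x1234)
```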

Improvements needed to run `/bin/id` from binutils

Hi,
I would like to use maat with simple projects but I have trouble running, for instance, id. My code is the following:

from maat import *
m = MaatEngine(ARCH.X64, OS.LINUX)  
m.load("/bin/id", BIN.ELF64, libdirs=["/usr/lib/x86_64-linux-gnu/"])
m.run()

and the output is the following

➜  maat python3 id.py
[Info] Adding object 'ld-linux-x86-64.so.2' to virtual fs at '/usr/lib/ld-linux-x86-64.so.2'
[Info] Adding object 'libc.so.6' to virtual fs at '/usr/lib/libc.so.6'
[Info] Adding object 'libdl.so.2' to virtual fs at '/usr/lib/libdl.so.2'
[Info] Adding object 'libpcre2-8.so.0' to virtual fs at '/usr/lib/libpcre2-8.so.0'
[Info] Adding object 'libpthread.so.0' to virtual fs at '/usr/lib/libpthread.so.0'
[Info] Adding object 'libselinux.so.1' to virtual fs at '/usr/lib/libselinux.so.1'
[Info] Adding object 'id' to virtual fs at '/id'
[Error] Exception in CALLOTHER handler: SYSCALL: EnvEmulator: syscall '218' not supported for emulation
[Error] Unexpected error when processing IR instruction, aborting...
➜  maat 

Would it be hard to add support for the missing system call?

Thanks

Cache symbolic filesystem when serialising the engine

Currently when serialising FileSystem, we dump the entire symbolic filesystem. When the filesystem contains many shared library files it results in a lot of overhead when dumping/loading states.

Since many shared library files will likely only be read by the program, and never modified, we could gain a lot of time and storage by caching their contents in memory and avoiding serialising them when possible.

The best solution will probably be to implement a dynamic cache for env files that keeps track of the PhysicalFile internal buffers with a copy-on-write mechanism.
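
The copy-on-write idea can be sketched as follows: files share a cached read-only buffer until the first write, so unmodified shared-library contents never need serialising. This is a standalone illustration, not Maat's PhysicalFile:

```python
# Copy-on-write file sketch: a private buffer is only materialised on the
# first write; unmodified files keep pointing at the shared cached copy.
class CowFile:
    def __init__(self, shared_buffer):
        self._shared = shared_buffer   # cached, read-only original content
        self._private = None           # created lazily on first write

    @property
    def modified(self):
        return self._private is not None

    def read(self, off, size):
        buf = self._private if self.modified else self._shared
        return buf[off:off + size]

    def write(self, off, data):
        if not self.modified:
            self._private = bytearray(self._shared)  # copy on first write
        self._private[off:off + len(data)] = data

libc = CowFile(b"\x7fELF" + b"\x00" * 12)   # never written: nothing to dump
patched = CowFile(b"abcdef")
patched.write(0, b"X")                       # triggers the private copy
```

Only files with `modified == True` would need to be serialised when dumping a state.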

Document python bindings for `EnvEmulator` and `FileSystem`

I have written partial bindings for the emulated symbolic filesystem API, but those are not yet documented. We should write the corresponding documentation and push it before public release since the examples will likely use the filesystem for stdin input.

MacOS installation

We need to make installation easier on macOS, that includes:

  • Fixing the missing pkgconfig for deps installed with homebrew (z3, ...)
  • Having a portable way to store the compiled sla files needed by Sleigh
  • Adding detailed install instructions for macOS

Opaque crash in `Number::set_overwrite()`

I'm getting a crash simply trying to run a binary with a concolic input buffer, following the tutorial here. I'm running against the binary AIS-Lite from the CGC dataset, and I've attached a .tar.gz of the directory with the binary and libraries. A minimized version of my script is below to reproduce the issue.

Any help figuring out what I'm doing wrong here would be appreciated; it would also be super helpful to have a more detailed error explanation or traceback on errors like this!

Thanks for the hard work folks, this system looks really promising once it matures a bit!

AIS-Lite.tar.gz

from pathlib import Path
from argparse import ArgumentParser
from typing import List, Optional
from maat import MaatEngine, OS, ARCH, BIN
from angr import Project
from logging import basicConfig, getLogger

basicConfig()
logger = getLogger(__name__)


class TestMaat:
    """Can ma'at replace Triton?"""

    BASE = 0x400000

    def __init__(
        self,
        binary: Path,
        input_file: Path,
        args: Optional[List[str]] = None,
        libdirs: Optional[List[str]] = None,
    ) -> None:
        """
        Set up the maat engine with our binary.

        :param binary: Path to the binary to run
        :param args: Arguments to pass to the binary
        """
        self.binary = binary
        self.input_file = input_file
        self.args = args
        self.libdirs = (
            list(map(lambda l: str(l.resolve()), libdirs))
            if libdirs is not None
            else []
        )
        self.engine = MaatEngine(ARCH.X64, OS.LINUX)
        assert binary.is_file(), f"{binary} is not a file or doesn't exist."

        logger.info(f"Loading binary {self.binary} with libdirs {self.libdirs}")

        self.engine.load(
            str(binary.resolve()),
            BIN.ELF64,
            base=self.BASE,
            args=args if args is not None else [],
            libdirs=self.libdirs,
        )

        self.set_input()

    def run(self) -> None:
        """
        Run the binary.
        """
        self.engine.run()

    def set_input(self) -> None:
        """
        Set up the input for the binary.
        """

        stdin = self.engine.env.fs.get_fa_by_handle(0)  # Get stdin simfile
        contents = self.input_file.read_bytes()
        input_buffer = self.engine.vars.new_concolic_buffer(
            "stdin",
            contents,
            len(contents),
        )
        stdin.write_buffer(input_buffer)


if __name__ == "__main__":
    parser = ArgumentParser(prog="repro")
    parser.add_argument(
        "--binary", type=Path, required=True, help="Path to binary to execute."
    )
    parser.add_argument(
        "--input", type=Path, required=True, help="Path to input file to read."
    )
    parser.add_argument("--args", nargs="*", help="Arguments to pass to binary.")
    parser.add_argument(
        "--libdirs", nargs="*", type=Path, help="Library directories to load."
    )
    cli_args = parser.parse_args()

    stage1 = TestMaat(
        cli_args.binary,
        cli_args.input,
        cli_args.args,
        cli_args.libdirs + [Path("/lib/x86_64-linux-gnu/")],
    )
    stage1.run()

Performance improvements

Brain dumping about ways to improve runtime performance. The current bottlenecks are most likely:

  1. Expr creation: they require dynamic allocation and are thus costly to create.
  2. Disassembly/Lifting: it seems that lifting is actually slower than execution, but I'm not sure whether that can be improved since we are using Sleigh for it
  3. Number class initialisation: if the mpz part has a non-trivial constructor it could hinder the perf gains that we expect from using Number for concrete values
  4. Expr canonisation: currently every expression is canonised upon creation. It certainly induces overhead, especially for memory operations that require using Expr. Computing expression hashes also adds overhead!

Some ideas to address them:

  1. Limit Expr usage as much as possible. In particular, we should not enforce the use of Expr for memory operations and should be able to use Number too. Maybe we could consider creating a Value class which would be a std::variant<Expr, Number>
  2. Run some benchmarks/tests to verify this and look into Sleigh to see if perf can be improved. I wonder whether getting the register strings for translating PCODE operands to IR params causes overhead and if there is a way to do the translation without the string comparisons
  3. If mpz has non-trivial initialisation, consider wrapping it in a class that enables skipping the init (std::optional maybe?)
  4. Two options:
    • We could canonise only when needed / when simplifying. The hash could be computed as part of the canonisation.
    • We could also skip canonisation entirely, but that would suppress some simplifications, and we would have to switch from ExprObject::eq() to comparing Expr raw pointers for quick expression equality (deep expr equality can be done with a recursive method on the arguments...)

Improve IR/PCODE caching in the engine

Issue

We should switch from storing PCODE representations of basic blocks to storing PCODE representations of individual instructions. The reasons are mainly:

  • Easier to manage the PCODE cache
  • Make the code for the main PCODE execution loop cleaner
  • Having a 1-1 mapping between ASM and PCODE block could come in handy in later applications

Implementation details

We want to replace the ir::Block class by something like ir::AsmInstruction which holds the PCODE, address, length, ...

SIGABRT on `Value.as_int()`

Maat version:
master branch at commit 1d1c0d3

Issue description:
Same as title. I'm consistently getting a SIGABRT signal when running Value.as_int() from Python.

Steps to reproduce:
Run the following:

from maat import *
m = MaatEngine(ARCH.X64, OS.LINUX)
m.cpu.rax = Var(64, "a") # Variable "a" on 64 bits
m.cpu.rax.as_int()

Update to latest Ghidra sleigh specs

Currently, we have a vendored version of the x86 sleigh specs based on Ghidra 9.2.3 (as evidenced by the download script src/third-party/sleigh/native/sleigh_download.sh that was removed in this PR).

The following is a diff between Maat's x86 sleigh spec files (src/third-party/sleigh/processors/x86/data/languages) and Ghidra tag Ghidra_9.2.3_build. The diff was generated by copying Maat's src/third-party/sleigh/processors/x86/data/languages directory into Ghidra's equivalent and running git diff in the Ghidra repo:

Click to see diff
diff --git a/Ghidra/Processors/x86/data/languages/ia.sinc b/Ghidra/Processors/x86/data/languages/ia.sinc
index 4e7e69d3f..f47be7806 100644
--- a/Ghidra/Processors/x86/data/languages/ia.sinc
+++ b/Ghidra/Processors/x86/data/languages/ia.sinc
@@ -704,7 +704,7 @@ addr64: [Base64 + Index64*ss]			is mod=0 & r_m=4; Index64 & Base64 & ss
 addr64: [Base64]						is mod=0 & r_m=4; rexXprefix=0 & index64=4 & Base64    { export Base64; }
 addr64: [simm32_64 + Index64*ss]		is mod=0 & r_m=4; Index64 & base64=5 & ss; simm32_64   { local tmp=simm32_64+Index64*ss; export tmp; }
 addr64: [Index64*ss]					is mod=0 & r_m=4; Index64 & base64=5 & ss; imm32=0 { local tmp=Index64*ss; export tmp; }
-addr64: [imm32_64]						is mod=0 & r_m=4; rexXprefix=0 & index64=4 & base64=5; imm32_64      { export *[const]:8 imm32_64; }
+addr64: [simm32_64]						is mod=0 & r_m=4; rexXprefix=0 & index64=4 & base64=5; simm32_64      { export *[const]:8 simm32_64; }
 addr64: [Base64 + Index64*ss + simm8_64] is mod=1 & r_m=4; Index64 & Base64 & ss; simm8_64 { local tmp=simm8_64+Base64+Index64*ss; export tmp; }
 addr64: [Base64 + Index64*ss]			is mod=1 & r_m=4; Index64 & Base64 & ss; simm8=0   { local tmp=Base64+Index64*ss; export tmp; }
 addr64: [Base64 + simm8_64]				is mod=1 & r_m=4; rexXprefix=0 & index64=4 & Base64; simm8_64     { local tmp=simm8_64+Base64; export tmp; }
@@ -2737,9 +2737,10 @@ enterFrames: low5 is low5 { tmp:1 = low5; export tmp; }
 :INSD^rep^reptail eseDI4,DX is vexMode=0 & rep & reptail & opsize=1 & byte=0x6d & eseDI4 & DX   { eseDI4 = in(DX); }
 :INSD^rep^reptail eseDI4,DX is vexMode=0 & rep & reptail & opsize=2 & byte=0x6d & eseDI4 & DX   { eseDI4 = in(DX); }
 
-:INT1           is vexMode=0 & byte=0xf1                            { tmp:1 = 0x1; intloc:$(SIZE) = swi(tmp); call [intloc]; return [0:1]; }
-:INT3           is vexMode=0 & byte=0xcc                            { tmp:1 = 0x3; intloc:$(SIZE) = swi(tmp); call [intloc]; return [0:1]; }
-:INT imm8       is vexMode=0 & byte=0xcd; imm8                      { tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); call [intloc]; }
+# Removed the call [intloc]; from INT* instructions to make callother processing easier
+:INT1           is vexMode=0 & byte=0xf1                            { tmp:1 = 0x1; intloc:$(SIZE) = swi(tmp); return [0:1]; }
+:INT3           is vexMode=0 & byte=0xcc                            { tmp:1 = 0x3; intloc:$(SIZE) = swi(tmp); return [0:1]; }
+:INT imm8       is vexMode=0 & byte=0xcd; imm8                      { tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); }
 :INTO           is vexMode=0 & byte=0xce & bit64=0
 {
   tmp:1 = 0x4;
@@ -3155,8 +3156,11 @@ define pcodeop swap_bytes;
 :NEG rm64      is vexMode=0 & opsize=2 & byte=0xf7; rm64 & reg_opcode=3 ... { negflags(rm64); rm64 = -rm64; resultflags(rm64); }
 @endif
 
-:NOP            is vexMode=0 & opsize=0 & byte=0x90                        { }
-:NOP            is vexMode=0 & opsize=1 & byte=0x90                        { }
+# For simple NOPs rexprefix=0 is necessary to avoid the XCHG R8D, EAX and
+# XCHG R8W, AX instructions to be wrongly interpreted as REX-prefixed NOPs
+ 
+:NOP            is vexMode=0 & opsize=0 & byte=0x90 & rexprefix=0         { }
+:NOP            is vexMode=0 & opsize=1 & byte=0x90 &  rexprefix=0         { }
 :NOP rm16       is vexMode=0 & mandover & opsize=0 & byte=0x0f; high5=3; rm16  ...    { }
 :NOP rm32       is vexMode=0 & mandover & opsize=1 & byte=0x0f; high5=3; rm32  ...    { }
 :NOP^"/reserved" rm16 is vexMode=0 & mandover & opsize=0 & byte=0x0f; byte=0x18; rm16 & reg_opcode_hb=1 ...    { }
@@ -3907,6 +3911,8 @@ define pcodeop xend;
 :XCHG  RAX,Rmr64   is vexMode=0 & opsize=2 & row = 9 & page = 0 & RAX & Rmr64       { local tmp = RAX;   RAX = Rmr64;     Rmr64 = tmp; }
 @endif
 
+
+
 :XCHG  rm8,Reg8        is vexMode=0 & byte=0x86; rm8 & Reg8  ...                { local tmp = rm8;   rm8 = Reg8;   Reg8 = tmp; }
 :XCHG rm16,Reg16   is vexMode=0 & opsize=0 & byte=0x87; rm16 & Reg16 ...        { local tmp = rm16; rm16 = Reg16; Reg16 = tmp; }
 :XCHG rm32,Reg32   is vexMode=0 & opsize=1 & byte=0x87; rm32 & check_rm32_dest ... & Reg32 ... & check_Reg32_dest ...        { local tmp = rm32; rm32 = Reg32; build check_rm32_dest; Reg32 = tmp; build check_Reg32_dest;}
@@ -5690,19 +5696,12 @@ define pcodeop movmskps;
 
 :MOVUPS       XmmReg, m128     is vexMode=0 &  byte=0x0F; byte=0x10; m128 & XmmReg ...
 {
-    local m:16 = m128;
-    XmmReg[0,32] = m[0,32];
-    XmmReg[32,32] = m[32,32];
-    XmmReg[64,32] = m[64,32];
-    XmmReg[96,32] = m[96,32];
+    XmmReg = m128;
 }
 
 :MOVUPS       XmmReg1, XmmReg2 is vexMode=0 &  byte=0x0F; byte=0x10; xmmmod = 3 & XmmReg1 & XmmReg2
 {
-    XmmReg1[0,32] = XmmReg2[0,32];
-    XmmReg1[32,32] = XmmReg2[32,32];
-    XmmReg1[64,32] = XmmReg2[64,32];
-    XmmReg1[96,32] = XmmReg2[96,32];
+    XmmReg1 = XmmReg2;
 }
 
 :MOVUPS       m128, XmmReg     is vexMode=0 &  mandover=0 & byte=0x0F; byte=0x11; m128 & XmmReg ...
@@ -5712,10 +5711,7 @@ define pcodeop movmskps;
 
 :MOVUPS       XmmReg2, XmmReg1 is vexMode=0 &  mandover=0 & byte=0x0F; byte=0x11; xmmmod = 3 & XmmReg1 & XmmReg2
 {
-    XmmReg1[0,32] = XmmReg2[0,32];
-    XmmReg1[32,32] = XmmReg2[32,32];
-    XmmReg1[64,32] = XmmReg2[64,32];
-    XmmReg1[96,32] = XmmReg2[96,32];
+    XmmReg2 = XmmReg1;
 }
 
 :MULPD        XmmReg, m128     is vexMode=0 &  $(PRE_66) & byte=0x0F; byte=0x59; m128 & XmmReg ...
@@ -6670,10 +6666,75 @@ define pcodeop pminub;
 :PMINUB        XmmReg1, XmmReg2 is vexMode=0 &  $(PRE_66) & byte=0x0F; byte=0xDA; xmmmod = 3 & XmmReg1 & XmmReg2 { XmmReg1 = pminub(XmmReg1, XmmReg2); }
 
 define pcodeop pmovmskb;
-:PMOVMSKB       Reg32, mmxreg2   is vexMode=0 &   mandover=0 & byte=0x0F; byte=0xD7; Reg32 & mmxreg2 { Reg32 = pmovmskb(Reg32, mmxreg2); }
-:PMOVMSKB       Reg32, XmmReg2   is vexMode=0 &   $(PRE_66) & byte=0x0F; byte=0xD7; Reg32 & XmmReg2 { Reg32 = pmovmskb(Reg32, XmmReg2); }
-@ifdef IA64
-:PMOVMSKB       Reg64, mmxreg2   is vexMode=0 &   opsize=2 & mandover=0 & byte=0x0F; byte=0xD7; Reg64 & mmxreg2 { Reg64 = pmovmskb(Reg64, mmxreg2); }
+:PMOVMSKB       Reg32, mmxreg2   is vexMode=0 &   mandover=0 & byte=0x0F; byte=0xD7; Reg32 & mmxreg2
+{
+    TempA:4 = 0;
+    TempA[0, 1] = mmxreg2[7, 1];
+    TempA[1, 1] = mmxreg2[15, 1];
+    TempA[2, 1] = mmxreg2[23, 1];
+    TempA[3, 1] = mmxreg2[31, 1];
+    TempA[4, 1] = mmxreg2[39, 1];
+    TempA[5, 1] = mmxreg2[47, 1];
+    TempA[6, 1] = mmxreg2[55, 1];
+    TempA[7, 1] = mmxreg2[63, 1];
+    Reg32 = TempA;
+}
+:PMOVMSKB       Reg32, XmmReg2   is vexMode=0 &   $(PRE_66) & byte=0x0F; byte=0xD7; Reg32 & XmmReg2
+{
+    TempA:4 = 0;
+    TempA[0, 1] = XmmReg2[7, 1];
+    TempA[1, 1] = XmmReg2[15, 1];
+    TempA[2, 1] = XmmReg2[23, 1];
+    TempA[3, 1] = XmmReg2[31, 1];
+    TempA[4, 1] = XmmReg2[39, 1];
+    TempA[5, 1] = XmmReg2[47, 1];
+    TempA[6, 1] = XmmReg2[55, 1];
+    TempA[7, 1] = XmmReg2[63, 1];
+    TempA[8, 1] = XmmReg2[71, 1];
+    TempA[9, 1] = XmmReg2[79, 1];
+    TempA[10, 1] = XmmReg2[87, 1];
+    TempA[11, 1] = XmmReg2[95, 1];
+    TempA[12, 1] = XmmReg2[103, 1];
+    TempA[13, 1] = XmmReg2[111, 1];
+    TempA[14, 1] = XmmReg2[119, 1];
+    TempA[15, 1] = XmmReg2[127, 1];
+    Reg32 = TempA;
+}
+@ifdef IA64
+:PMOVMSKB       Reg64, mmxreg2   is vexMode=0 &   opsize=2 & mandover=0 & byte=0x0F; byte=0xD7; Reg64 & mmxreg2
+{
+    TempA:8 = 0;
+    TempA[0, 1] = mmxreg2[7, 1];
+    TempA[1, 1] = mmxreg2[15, 1];
+    TempA[2, 1] = mmxreg2[23, 1];
+    TempA[3, 1] = mmxreg2[31, 1];
+    TempA[4, 1] = mmxreg2[39, 1];
+    TempA[5, 1] = mmxreg2[47, 1];
+    TempA[6, 1] = mmxreg2[55, 1];
+    TempA[7, 1] = mmxreg2[63, 1];
+    Reg64 = TempA;
+}
+:PMOVMSKB       Reg64, XmmReg2   is vexMode=0 &   $(PRE_66) & opsize=2 & byte=0x0F; byte=0xD7; Reg64 & XmmReg2
+{
+    TempA:8 = 0;
+    TempA[0, 1] = XmmReg2[7, 1];
+    TempA[1, 1] = XmmReg2[15, 1];
+    TempA[2, 1] = XmmReg2[23, 1];
+    TempA[3, 1] = XmmReg2[31, 1];
+    TempA[4, 1] = XmmReg2[39, 1];
+    TempA[5, 1] = XmmReg2[47, 1];
+    TempA[6, 1] = XmmReg2[55, 1];
+    TempA[7, 1] = XmmReg2[63, 1];
+    TempA[8, 1] = XmmReg2[71, 1];
+    TempA[9, 1] = XmmReg2[79, 1];
+    TempA[10, 1] = XmmReg2[87, 1];
+    TempA[11, 1] = XmmReg2[95, 1];
+    TempA[12, 1] = XmmReg2[103, 1];
+    TempA[13, 1] = XmmReg2[111, 1];
+    TempA[14, 1] = XmmReg2[119, 1];
+    TempA[15, 1] = XmmReg2[127, 1];
+    Reg64 = TempA;
+}
 @endif
 
 define pcodeop pmulhrsw;
@@ -6851,7 +6912,10 @@ define pcodeop psignd;
 :PSIGND         XmmReg1, XmmReg2 is vexMode=0 &  $(PRE_66) & byte=0x0F; byte=0x38; byte=0x0a; xmmmod = 3 & XmmReg1 & XmmReg2 { XmmReg1=psignd(XmmReg1,XmmReg2); }
 
 define pcodeop pslldq;
-:PSLLDQ         XmmReg2, imm8    is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0x73; xmmmod = 3 & reg_opcode=7 & XmmReg2; imm8 { XmmReg2 = pslldq(XmmReg2, imm8:8); }
+:PSLLDQ         XmmReg2, imm8    is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0x73; xmmmod = 3 & reg_opcode=7 & XmmReg2; imm8
+{
+    XmmReg2 = XmmReg2 << (imm8 * 8); # Boyan: The shift is in bytes, not bits !
+}
 
 define pcodeop psllw;
 :PSLLW          mmxreg, m64      is vexMode=0 &  mandover=0 & byte=0x0F; byte=0xF1; mmxreg ... & m64 ... { mmxreg = psllw(mmxreg, m64); }
@@ -7614,9 +7678,31 @@ define pcodeop rsqrtss;
 :RSQRTSS         XmmReg, m32      is vexMode=0 &  $(PRE_F3) & byte=0x0F; byte=0x52; XmmReg ... & m32 { XmmReg = rsqrtss(XmmReg, m32); }
 :RSQRTSS         XmmReg1, XmmReg2 is vexMode=0 &  $(PRE_F3) & byte=0x0F; byte=0x52; xmmmod = 3 & XmmReg1 & XmmReg2 { XmmReg1 = rsqrtss(XmmReg1, XmmReg2); }
 
+
 define pcodeop shufpd;
-:SHUFPD          XmmReg, m128, imm8     is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0xC6; XmmReg ... & m128; imm8 { XmmReg = shufpd(XmmReg, m128, imm8:8); }
-:SHUFPD          XmmReg1, XmmReg2, imm8  is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0xC6; xmmmod=3 & XmmReg1 & XmmReg2; imm8 { XmmReg1 = shufpd(XmmReg1, XmmReg2, imm8:8); }
+:SHUFPD          XmmReg, m128, imm8     is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0xC6; XmmReg ... & m128; imm8
+{
+    shifted:16 = XmmReg >> ((imm8 & 0x1)*64);
+    tempA:8 = shifted:8;
+
+    shifted = m128 >> ((imm8 & 0x2)*64);
+    tempB:8 = shifted:8;
+
+    XmmReg[0, 64] = tempA;
+    XmmReg[64, 64] = tempB;
+}
+
+:SHUFPD          XmmReg1, XmmReg2, imm8  is vexMode=0 & $(PRE_66) & byte=0x0F; byte=0xC6; xmmmod=3 & XmmReg1 & XmmReg2; imm8
+{
+    shifted:16 = XmmReg1 >> ((imm8 & 0x1)*64);
+    tempA:8 = shifted:8;
+
+    shifted = XmmReg2 >> ((imm8 & 0x2)*64);
+    tempB:8 = shifted:8;
+
+    XmmReg1[0, 64] = tempA;
+    XmmReg1[64, 64] = tempB;
+}
 
 :SHUFPS  XmmReg, m128, imm8  is vexMode=0 & mandover=0 & byte=0x0F; byte=0xC6; (m128 & XmmReg ...); imm8 & Order0 & Order1 & Order2 & Order3
 {

This diff should be re-applied onto a more recent version of Ghidra's x86 sleigh spec, and any fixes should potentially be pushed upstream to Ghidra as a PR.

I don't think there's too much harm in maintaining a vendored version of the sleigh specs for more fixes later, but the patches should be well-documented and include the base version of Ghidra used.

Lifter attempts to lift instructions past end of mapped bounds

from maat import *

m = MaatEngine(ARCH.X64, OS.LINUX)

m.mem.map(0x410000, 0x411000, PERM.RX)

m.mem.write(0x410000, b"\xeb\xfe", ignore_flags=True)

m.run_from(0x410000, 1)

# tries to lift OOB bytes and fails

print(m.info)

The above script results in a failure from Sleigh attempting to decode bytes past the end of the mapped memory.

$ python3 example.py
FATAL: Error in sleigh translate(): Sleigh raised a bad data exception: r0x004120b0: Unable to resolve constructor
[Fatal] Lifter error: MaatEngine::run(): failed to lift instructions

Stop:       fatal error in Maat

I have also seen other sleigh errors, all referencing memory past what should be mapped.

FATAL: Error in sleigh translate(): Can not lift instruction at 0x4110c9: IN AL, 0x3d (unsupported callother occurence)

I experience this consistently if I have only mapped a page or less. It will still happen occasionally if I map a greater amount (e.g. m.mem.map(0x410000, 0x620000)).
I think it is because the call to lift_block specifies an arbitrary code size of 0xfffffff, as seen below.

lifters[_current_cpu_mode]->lift_block(
                *ir_map,
                addr,
                mem->raw_mem_at(addr),
                0xfffffff,
                0xffffffff,
                nullptr, // is_symbolic
                nullptr, // is_tainted
                true
            )

But I am not sure why this issue is not encountered when loading via the lief loader.

Add Python tests

In addition to the unit tests written in C++, we should add a couple of tests in Python. Maintaining comprehensive unit tests in Python is definitely not what we want, but we could add scripts solving small CTF or hand-crafted challenges.

  • Add the possibility to write tests in Python, using pytest for example
  • Add a couple of tests with the challenges I have already solved during testing
  • Integrate Python tests in CTest

Improve snapshot performance for big traces

Currently snapshots record every single memory write event to be able to restore a past state. This won't scale so well on very long traces that are very memory intensive. An alternative would be to snapshot memory on a per-page basis: save the whole page when it gets written for the first time, and don't record subsequent operations affecting this page.

  • The cost for small traces is increased
  • The amount of RAM needed to take snapshots on long traces could be reduced

We should probably allow both memory snapshotting strategies to be selected as a setting for advanced users.
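
The per-page strategy can be sketched as follows: the first write to a page saves a copy of the whole page, and subsequent writes to that page record nothing. Page size and data structures here are illustrative, not Maat's internals:

```python
PAGE = 0x1000  # illustrative page size

# Page-granularity snapshot sketch: save each touched page once, not every
# individual write event.
class PageSnapshot:
    def __init__(self, mem):
        self.mem = mem       # dict: page start address -> bytearray
        self.saved = {}      # page start address -> original page copy

    def record_write(self, addr):
        page = addr & ~(PAGE - 1)
        if page not in self.saved:
            self.saved[page] = bytes(self.mem[page])  # save once per page

    def restore(self):
        for page, content in self.saved.items():
            self.mem[page] = bytearray(content)

mem = {0x400000: bytearray(PAGE)}
snap = PageSnapshot(mem)
snap.record_write(0x400010)        # first write to the page: saved
mem[0x400000][0x10] = 0xFF
snap.record_write(0x400020)        # same page: nothing new recorded
snap.restore()                     # page restored to its original content
```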

Python3 PIP install for Apple Silicon macs

Hello everyone!

Today I tried to install PyMaat and it failed instantly because pip could not find a version to install.

$ python3 -m pip install pymaat
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement pymaat (from versions: none)
ERROR: No matching distribution found for pymaat

I also tried with sudo but got the same errors:

$ sudo -H python3 -m pip install pymaat
ERROR: Could not find a version that satisfies the requirement pymaat (from versions: none)
ERROR: No matching distribution found for pymaat

I'm using pip-22.0.3 and Python 3.8.9

Regards

Refactor the `Lifter` classes

At the moment the only specificity of LifterX86 vs its abstract parent class Lifter is that it initialises the sleigh interface with the correct .sla and .pspec files.

This is a legacy from using a custom lifter, but since we now rely on sleigh for lifting there is no need for specialised Lifter<arch> classes. We could just factorise the logic into Lifter and have it handle all architectures.

Command line arguments: use concrete strings and/or `Expr` buffers instead of `Arg`

Currently, we use the Arg class to specify command line arguments to pass to the symbolically executed program. This allows creating concrete, concolic, and symbolic arguments. However, it is not possible to mix concrete and symbolic/concolic bytes in the same argument string, which could sometimes be useful.

We should drop Arg and allow passing either string/uint8_t* or vector<Expr> to the loader for command line arguments. This way we can still easily create fully concrete and symbolic arguments (using the VarContext::new_concolic_buffer and VarContext::new_symbolic_buffer methods), but also mix them if needed.
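
For illustration, a mixed argument could be built as a flat buffer of concrete bytes and symbolic slots. `Sym` and `make_arg` below are invented stand-ins (a real implementation would use Expr and the VarContext buffer methods):

```python
# Stand-in for a symbolic byte (would be an Expr in Maat)
class Sym:
    def __init__(self, name):
        self.name = name

def make_arg(*parts):
    """Build one argument as a flat list mixing concrete ints and Sym slots."""
    out = []
    for p in parts:
        if isinstance(p, bytes):
            out.extend(p)      # concrete prefix/suffix, byte by byte
        else:
            out.append(p)      # symbolic byte slot
    return out

# Concrete "--key=" prefix followed by two symbolic bytes
arg = make_arg(b"--key=", Sym("k0"), Sym("k1"))
```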

Maat aborts when restoring a snapshot from before a mapping existed.

from maat import *

engine = MaatEngine(ARCH.X86, OS.LINUX)

snap = engine.take_snapshot()

engine.mem.map(0x10000, 0x11000)
engine.mem.write(0x10000, b'\x90'*0x1000)

engine.restore_snapshot(snap)

The above script will result in an abort with the error:

terminate called after throwing an instance of 'maat::runtime_exception'
  what():  Trying to restore from concrete-snapshot at address 0x10ff8 not mapped int memory
Aborted (core dumped)

I think that when calling record_mem_write you may want to ensure the snapshot's map has a place for the memory, or skip writes to pages not in the maps when restoring.
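
The skip-on-restore behaviour suggested above could look like the following standalone sketch (the Memory class and the write-log format are invented here, not Maat's types):

```python
# Toy memory with a set of mapped 4 KiB pages
class Memory:
    def __init__(self):
        self.maps = set()    # mapped page start addresses
        self.data = {}       # addr -> byte value

    def is_mapped(self, addr):
        return (addr & ~0xFFF) in self.maps

def restore(mem, recorded_writes):
    """Undo recorded writes, skipping targets that are no longer mapped."""
    skipped = []
    for addr, old_value in recorded_writes:
        if not mem.is_mapped(addr):
            skipped.append(addr)     # mapping gone: ignore instead of aborting
            continue
        mem.data[addr] = old_value
    return skipped

mem = Memory()
mem.maps.add(0x20000)
# One write targets an unmapped page, as in the repro above
skipped = restore(mem, [(0x20008, 0x90), (0x10FF8, 0x00)])
```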

Expose memory mappings in Python API

Discussed in #48

Originally posted by novafacing February 24, 2022
Is there a way to either explicitly set or access the address libraries are loaded at? I'm using angr to extract some PLT information and trying to set a callback in maat on the PLT stub address in the loaded library, but the addresses don't match up.

I'm using the Python interface. Thanks!

#48 explains that memory mappings are currently accessible only from the C++ API. While @novafacing posted a workaround script to get the mappings programmatically, we should eventually add Python bindings for the MemEngine.mappings attribute and the MemMapManager class.

Add tutorials for path exploration

We should write tutorials on how to properly do path exploration with Maat. This will greatly help people trying to do more advanced analyses, as the exploration strategy is a bit different than with other tools.

We can re-use the Python test scripts that we have for x86, with a different target program.

  • Tutorial that uses snapshots and DFS exploration
  • Tutorial that uses state serialization and BFS exploration

Add statistics/introspection capabilities

We should re-add introspection stats in the engine, especially:

  • total time spent solving symbolic pointer reads/writes
  • average ranges for symbolic pointers
  • total number of symbolic pointer reads/writes
  • total number of instructions executed

Some interesting things to add:

  • total number of symbolic expressions created
  • total number of Value instances created
  • total number of lifted instructions
    ...

Support `exit_group` syscall (#231)

The emulator prints out:

[Error] Exception in CALLOTHER handler: SYSCALL: EnvEmulator: syscall '231' not supported for emulation
[Error] Unexpected error when processing IR instruction, aborting...

upon executing an exit_group syscall, which is somewhat opaque given that it happens quite frequently and is not really an error. Is there a nicer way we could implement exit-like syscalls, to make it clearer to the user that execution has ended (possibly with the exit code) instead of showing an error?

Build using CMake

Maybe it doesn't make sense right now since Maat builds via Make. But if you have future plans to move to CMake, using our SLEIGH wrapper here would clean things up a bit. I'd be happy to help with that.

Amazing work by the way. I'm really excited about your project. 🎉

`MaatEngine::load()`: Don't allow to use `base` argument on non-relocatable binaries

In MaatEngine::load() users can specify a load address using the base argument. The requested base is added to the binary's base address (which is 0 for most relocatable binaries). That can be confusing and can result in the binary being loaded at the wrong address if base is non-zero for a non-relocatable binary. See this comment in #49.

We should simply detect and throw an error when users try to use base on a non-relocatable binary.
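The proposed check is straightforward; a sketch (the function name and parameters are illustrative, not Maat's actual API):

```python
def check_load_base(is_relocatable: bool, base: int) -> None:
    """Reject a user-supplied load base for non-relocatable binaries."""
    if base != 0 and not is_relocatable:
        raise ValueError(
            "load(): the 'base' argument can only be used with relocatable binaries"
        )
```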

Allow to serialise/deserialise the `MaatEngine`

Description
It would be a really nice feature to be able to dump the engine state to disk and later reload it into an existing or new engine. It would allow us to:

  • save an intermediate state during execution and use it as a basis from where to start subsequent executions
  • save "interesting" states during automatic analysis so that users can manually inspect them later
  • implement new exploration strategies based on state queuing in a BFS fashion instead of DFS
  • give flexibility on which method to use to take snapshots depending on the use-case: either rewind operations or reload a full state

Implementation
Implementing engine serialisation/deserialisation requires a huge engineering effort. In particular, it requires serialising many different object types, many of which share one or more of the following characteristics:

  • multiple fields including other complex objects
  • part of an inheritance hierarchy
  • pointers and references to other fields (tree-like structures, no cyclic graphs IIRC)

We will most likely need to write serialisation methods for many of Maat's classes and objects. We should see to what extent we could take advantage of existing C++ serialisation libraries, the main criteria being:

  1. Speed
  2. Features to serialise complex objects (inheritance, pointers, references, ... see the bullet-list above)
  3. Support for built-in types (vector, list, optional, map, ...)
  4. Size of serialised data on disk
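One common pattern for criterion 2 (serialising objects in an inheritance hierarchy) is a class-uid registry, so the deserialiser knows which subclass to reconstruct. A toy Python sketch of the idea (all names are illustrative, not Maat's implementation):

```python
import json

# Registry mapping a unique class id to the class to instantiate on load
REGISTRY = {}

def register(uid):
    def wrap(cls):
        cls.uid = uid
        REGISTRY[uid] = cls
        return cls
    return wrap

@register(1)
class Node:
    def __init__(self, value=0):
        self.value = value

    def dump(self):
        # Store the class uid alongside the fields
        return json.dumps({"uid": self.uid, "value": self.value})

    @staticmethod
    def load(data):
        d = json.loads(data)
        cls = REGISTRY.get(d["uid"])
        if cls is None:
            raise ValueError(f"unsupported class uid: {d['uid']}")
        obj = cls()
        obj.value = d["value"]
        return obj
```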

Deserializer fails with opaque error when directory does not exist

The state manager fails while trying to deserialize if the directory passed to the SimpleStateManager does not exist. I'd expect either an error when creating the SimpleStateManager, or for the directory to be created. At minimum, I think Deserializer's constructor should check if the stream is valid.

import os
from maat import *

eng = MaatEngine(ARCH.X64, OS.LINUX)

path = "/tmp/directory_does_not_exist"
state_manager = SimpleStateManager(path)
state_manager.enqueue_state(eng)

print(f"Enqueued state, exists = {os.path.exists(path)}")

# abort() during dequeue_state
if not state_manager.dequeue_state(eng):
    print("Did not dequeue state")

print("Done")
/home/jay/libs/maat/build/maat.cpython-39-x86_64-linux-gnu.so
Enqueued state, exists = False
terminate called after throwing an instance of 'maat::serialize_exception'
  what():  Deserializer::Factory::new_object: unsupported class uid: 0
Aborted (core dumped)

Not a big deal, but just figured I'd post an issue for it to have it on record.
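A sketch of the proposed fail-fast behaviour in illustrative Python (not Maat's API): either create the missing directory up front or raise immediately in the constructor, instead of aborting later in dequeue_state:

```python
from pathlib import Path

class SimpleStateManagerSketch:
    """Fail fast on a missing state directory (or create it on demand)."""
    def __init__(self, directory: str, create: bool = True):
        path = Path(directory)
        if not path.is_dir():
            if create:
                path.mkdir(parents=True)
            else:
                raise FileNotFoundError(
                    f"state directory does not exist: {directory}"
                )
        self.directory = path
```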

Improve `MemEngine` mapping/segmentation API

At the moment the API of MemEngine mixes methods that map memory (map, allocate, unmap) with lower-level methods that manage MemSegments (new_segment, allocate_segment, delete_segment). The latter should be made protected, or at least be given some _ prefix indicating that they are not the preferred way to manage memory mappings.

  • Use map, allocate, unmap, instead of 'segment' methods
  • Split is_free into two methods:
    • public is_free(start,end): that checks if a range is not mapped (!= no overlapping MemSegment)
    • private is_segment_intersecting_with(start,end): that checks whether an existing MemSegment intersects with a range
  • Manage a list of mappings with names/permissions distinct from the actual MemSegment underlying list
    • Make the MemSegment an internal class
    • Add a MemMap class (which holds just start, end, perms, and name), and a MemMapManager that updates them on map/unmap
  • Save map manager in snapshots
  • In python bindings replace new_segment with map/alloc/unmap
  • Edit documentation
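The distinction between the two proposed checks can be sketched as follows (illustrative Python with inclusive segment bounds, as in MemSegment):

```python
def intersects(seg_start, seg_end, start, end):
    """Private-style check: does [start, end] overlap [seg_start, seg_end]?"""
    return start <= seg_end and end >= seg_start

def is_free(segments, start, end):
    """Public check: the range is free iff no existing segment overlaps it."""
    return not any(intersects(s, e, start, end) for s, e in segments)
```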

Fatal Error with register translation for eflags reg

Hello there!

I have been playing with pymaat, and when I try to execute the pushf / popf instructions I get the following:

FATAL: Error in sleigh translate(): X86: Register translation from SLEIGH to MAAT missing for register eflags
[Fatal] Lifter error: MaatEngine::run(): failed to lift instructions

Crash in `maat::ExprITE::hash()` due to recursive call stack exhaustion

Looks like maat::ExprITE::hash() can get into infinite recursion and crash here. Here is a backtrace:

(gdb) where
#0  0x00007f8f62c8d9cb in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#1  0x00007f8f62c8d9ed in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
<...snip...>
#7475 0x00007f8f62c8d9ed in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7476 0x00007f8f62c8d9ed in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7477 0x00007f8f62c8d9ed in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7478 0x00007f8f62c8d9ed in maat::ExprITE::hash() () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7479 0x00007f8f62c8c32b in maat::ExprObject::eq(std::shared_ptr<maat::ExprObject>) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7480 0x00007f8f62cd1ebf in maat::MemSegment::symbolic_ptr_read(maat::Value&, std::shared_ptr<maat::ExprObject> const&, maat::ValueSet&, unsigned int, std::shared_ptr<maat::ExprObject> const&) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7481 0x00007f8f62cd24c4 in maat::MemEngine::symbolic_ptr_read(maat::Value&, std::shared_ptr<maat::ExprObject>, maat::ValueSet const&, unsigned int, maat::Settings const&) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7482 0x00007f8f62c39927 in maat::MaatEngine::resolve_addr_param(maat::ir::Param const&, maat::ir::ProcessedInst::Param&) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7483 0x00007f8f62c39d3a in maat::MaatEngine::process_load(maat::ir::Inst const&, maat::ir::ProcessedInst&) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7484 0x00007f8f62c3c2ac in maat::MaatEngine::run(int) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7485 0x00007f8f62bf8468 in maat::py::MaatEngine_run(_object*, _object*) () from /root/.cache/pypoetry/virtualenvs/reface-mFqyHumy-py3.9/lib/python3.9/site-packages/maat.cpython-39-x86_64-linux-gnu.so
#7486 0x00000000005310fd in ?? ()
#7487 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#7488 0x0000000000528b63 in _PyFunction_Vectorcall ()
#7489 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#7490 0x00000000005106ed in ?? ()
#7491 0x0000000000510497 in _PyEval_EvalCodeWithName ()
#7492 0x00000000005f5be3 in PyEval_EvalCode ()
#7493 0x0000000000619de7 in ?? ()
#7494 0x0000000000615610 in ?? ()
#7495 0x0000000000619d79 in ?? ()
#7496 0x0000000000619816 in PyRun_SimpleFileExFlags ()
#7497 0x000000000060d4e3 in Py_RunMain ()
#7498 0x00000000005ea6e9 in Py_BytesMain ()
#7499 0x00007f8f63881d0a in __libc_start_main (main=0x5ea6b0, argc=6, argv=0x7fff4abfac48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff4abfac38) at ../csu/libc-start.c:308
#7500 0x00000000005ea5ea in _start ()

I'm not sure what's causing the issue. I can email a test script and binaries if needed to reproduce, but I would prefer not to post them publicly.

Ability to ignore failure to lift

While ideally SLEIGH would be able to lift all instructions, there are many occasions where it cannot.

When attempting to use MAAT to follow a debugger trace, it would be nice if there were a way to ignore failure to lift an instruction. I would prefer being able to set an option for an engine that will let the lifter just lift as far as it can, and only return an error if no instructions could be lifted at all.

Maybe a simpler alternative to the above would be to cap the number of instructions lifted at the number of instructions passed to run?

That way it would be possible to do something like:

eng, dbg = init()
while True:
    dbg.step()
    try:
        stop = eng.run(1)
        if stop != STOP.INST_COUNT:
            # handle stops ...
    except LifterError:
        # fix up registers if maat gets out of sync with dbg
        # possible loss of fidelity and symbolic info if we can't lift an instruction
        fixup_registers(dbg, eng)

Truncating a `MemSegment`

In order to implement the munmap system call, we need the ability to arbitrarily truncate existing segments since some of their contents might get unmapped.

Just as we have extend_before and extend_after, we should implement a truncate(start, end) that truncates the MemSegment. Function signature for truncate will likely be similar to vector<MemSegment*> MemSegment::truncate(addr_t start, addr_t end), and it will be called by a higher level MemManager::clear_segments(addr_t start, addr_t end). There are 3 main cases to consider:

  • start == 0, then we can truncate easily and return a single segment
  • end == segment_end, then truncating is a bit more tricky but it's still possible to return a single segment
  • [start:end] is in the middle of the segment, then truncating shall return two segments (the upper and lower parts of the original one)

clear_segments() will first go through all existing segments and call truncate(), and get the new segment pointers. Then it will replace its old segments list by the new ones. Also, page permissions shall be updated accordingly.
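The three cases can be sketched as follows (illustrative Python with inclusive bounds; a real implementation would also copy segment contents and update page permissions):

```python
def truncate(seg_start, seg_end, start, end):
    """Return the segment ranges left after unmapping [start, end]
    from [seg_start, seg_end]. Covers the three cases above."""
    if end < seg_start or start > seg_end:
        return [(seg_start, seg_end)]  # no overlap: segment untouched
    res = []
    if start > seg_start:
        res.append((seg_start, start - 1))  # lower remainder
    if end < seg_end:
        res.append((end + 1, seg_end))      # upper remainder
    return res  # empty list: segment fully unmapped
```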

Test GMP impact on performance

Following comments about improving performance in #5:

If using GNU MP turns out to be a bottleneck, here are some alternatives to consider

Random thoughts:

  • Using uint512_t instead of true multiprecision numbers could be useful if it allows the library to use stack-allocated objects instead of dynamically allocating them
  • The most important point is not how fast the multi-precision computation is but rather how efficient the object creation/copy is. MP numbers are part of Maat's Number class, and Number objects are created all the time, so the cost of creation of an MP object is by far the most important
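A toy illustration of the fixed-width fast-path idea: keep values that fit in a machine word inline and only flag wider values as needing the multiprecision fallback (Python ints stand in for GMP here; all names are illustrative):

```python
class SmallNumber:
    """Sketch: values up to 64 bits stay in a plain word; wider ones
    would take the multiprecision (GMP-like) path."""
    def __init__(self, size, value):
        self.size = size            # width in bits
        self.is_mp = size > 64      # needs multiprecision fallback?
        self.value = value & ((1 << size) - 1)

    def add(self, other):
        # in-place add, wrapping modulo 2**size (no new object created)
        self.value = (self.value + other.value) & ((1 << self.size) - 1)
        return self
```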

Restrain symbolic pointers to certain memory areas

We should have a setting that lets us provide a list of address ranges and force the possible values of symbolic pointers to be within those ranges. This setting should be compatible with symptr_refine_range and symptr_limit_range.

The rationale behind such an option is to allow the symbolic pointer analysis to target only specific memory areas, just as was needed for the zehn challenge from hxp 2021 CTF.

Implementation idea: just compute the possible value range as before, then refine it using the allowed ranges. If there are intersections, keep them as possible values for the pointer; if there are none, fall back to the pointer's concrete value.
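The refinement step amounts to an interval intersection. A sketch in illustrative Python (ranges are inclusive `(min, max)` pairs; the stride is ignored for simplicity):

```python
def refine_range(ptr_min, ptr_max, concrete, allowed):
    """Intersect the computed pointer range with the user-allowed ranges;
    fall back to the concrete value if the intersection is empty."""
    hits = [(max(ptr_min, lo), min(ptr_max, hi))
            for lo, hi in allowed
            if max(ptr_min, lo) <= min(ptr_max, hi)]
    return hits if hits else [(concrete, concrete)]
```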

Expose symbolic pointer `ValueSet` information in `info::MemAccess`

We should provide the value set computed for symbolic pointers in the memory access information. Moreover, we should allow the user to tamper with it in an event callback, and take the modifications into account before performing the memory access.

  • Add mem_access.range, with attributes
    • range.min
    • range.max
    • range.stride
  • Let the user modify mem_access.range and use the modified value set to perform the mem access
  • Add python bindings for mem_access.range

Setup a CI that runs the unit tests

Currently the unit-tests are compiled as two binaries, unit-tests and adv-tests.
We should set up a very basic CI job that runs those binaries.

Virtual Filesystem moves libraries to incompatible paths

As discussed in #49, the issue of libraries not being recursively loaded has (I think!) been fixed, but there is still the problem of libraries being placed in the virtual filesystem at paths different from their original location. For example using the same setup as in #49 with AIS-Lite and modifying the code a bit:

from collections import defaultdict
from pathlib import Path
from argparse import ArgumentParser
from typing import Dict, List, Optional, Tuple
from maat import MaatEngine, OS, ARCH, BIN, EVENT, WHEN
from angr import Project
from bisect import insort


class Stage1Maat:
    """Test using maat to run stage 1 instead of triton..."""

    BASE = 0x400000

    def __init__(
        self,
        binary: Path,
        input_file: Path,
        args: Optional[List[str]] = None,
        libdirs: Optional[List[str]] = None,
    ) -> None:
        """
        Set up the maat engine with our binary.

        :param binary: Path to the binary to run
        :param args: Arguments to pass to the binary
        """
        self.binary = binary
        self.input_file = input_file
        self.args = args
        self.libdirs = (
            list(map(lambda l: str(l.resolve()), libdirs))
            if libdirs is not None
            else []
        )
        self.project = Project(
            str(self.binary.resolve()),
            load_options={
                "auto_load_libs": True,
                "ld_path": self.libdirs,
                "use_system_libs": False,
                "skip_libs": ["libc.so.6", "libm.so.6"],
            },
        )
        self.cfg = self.project.analyses.CFGFast(normalize=True)
        self.engine = MaatEngine(ARCH.X64, OS.LINUX)
        assert binary.is_file(), f"{binary} is not a file or doesn't exist."

        print(f"Loading binary {self.binary} with libdirs {self.libdirs}")

        self.engine.load(
            str(binary.resolve()),
            BIN.ELF64,
            args=args if args is not None else [],
            libdirs=self.libdirs,
        )

        self.set_callbacks()
        self.maps = {}

    def run(self) -> None:
        """
        Run the binary.
        """
        self.engine.run()

    def get_mappings(self, engine: MaatEngine) -> Dict[str, List[Tuple[int]]]:
        """
        Get the mappings of the binary.

        :param engine: Maat engine
        """
        mappings = defaultdict(list)
        rawmaps = list(map(lambda l: l.strip(), str(engine.mem).splitlines()))
        maps = map(
            lambda l: l.split(),
            filter(
                lambda l: l and l.startswith("0x"),
                rawmaps[: rawmaps.index("Page permissions:")],
            ),
        )

        for mp in maps:
            insort(mappings[mp[2]], (int(mp[0], 16), int(mp[1], 16)))

        return dict(mappings)

    def loader_callback(self, engine: MaatEngine) -> None:
        """
        Callback when the loader is running.

        :param engine: Maat engine
        """
        maps = self.get_mappings(engine)
        if maps != self.maps:
            print(f"Maps changed: {maps}")
            self.maps = maps

    def got_to_main(self, _: MaatEngine) -> None:
        """
        Callback when we get to main

        :param engine: Maat engine
        """
        print("Got to main!")

    def set_callbacks(self) -> None:
        """
        Set a callback to...set the rest of our callbacks (lol) once the program
        is loaded.
        """
        main_addr = self.project.loader.find_symbol("main").rebased_addr
        print(f"Setting callback at {main_addr:#016x}")
        maps = self.get_mappings(self.engine)
        print(f"Initial maps {maps}")
        self.engine.hooks.add(
            EVENT.EXEC,
            WHEN.BEFORE,
            filter=(
                maps["ld-linux-x86-64.so.2"][0][0],
                maps["ld-linux-x86-64.so.2"][-1][1],
            ),
            callbacks=[self.loader_callback],
            group="loader",
        )

        self.engine.hooks.add(
            EVENT.EXEC,
            WHEN.BEFORE,
            filter=main_addr,
            callbacks=[self.got_to_main],
            group="setup",
        )


if __name__ == "__main__":
    parser = ArgumentParser(prog="repro")
    parser.add_argument(
        "--binary", type=Path, required=True, help="Path to binary to execute."
    )
    parser.add_argument(
        "--input", type=Path, required=True, help="Path to input file to read."
    )
    parser.add_argument("--args", nargs="*", help="Arguments to pass to binary.")
    parser.add_argument(
        "--libdirs", nargs="*", type=Path, help="Library directories to load."
    )
    cli_args = parser.parse_args()

    repro = Stage1Maat(
        cli_args.binary,
        cli_args.input,
        cli_args.args,
        cli_args.libdirs + [Path("/lib/x86_64-linux-gnu"), Path("/lib64")],
    )
    repro.run()

We end up getting:

Maps changed: {'ld-linux-x86-64.so.2': [(4096, 8191), (8192, 139263), (139264, 172031), (176128, 188415)], 'Interp.': [(188416, 4382719)], 'Heap': [(4382720, 8417279)], 'map_anon': [(67108864, 67117055)], '/usr/lib/libcgc.so': [(67117056, 67125247), (67125248, 67133439), (67133440, 67137535), (67137536, 67145727)], 'Stack': [(8796090925056, 8796093022207)]}
AIS-Lite: /usr/lib/libcgc.so: no version information available (required by AIS-Lite)
AIS-Lite: /usr/lib/libcgc.so: no version information available (required by /usr/lib/libcgc.so)
AIS-Lite: symbol lookup error: AIS-Lite: undefined symbol: __libc_start_main, version GLIBC_2.2.5
[Error] Exception in CALLOTHER handler: SYSCALL: EnvEmulator: syscall '231' not supported for emulation
[Error] Unexpected error when processing IR instruction, aborting...

Syscall 231 is exit_group, and we're exiting partway through loading, but the thing that stands out to me is that libcgc.so is placed in /usr/lib instead of being emulated in the same directory it is in on the host machine. Of course we can solve this with patchelf, but is there another way, e.g. using the same virtual_path mechanism as in MaatEngine.load to also specify virtual paths for libraries?

I've repackaged the CGC binaries, so the .tar.gz attached to #49 can be compiled on a tester's machine with cmake -DCMAKE_EXE_LINKER_FLAGS='-no-pie -fno-pie' . && make to duplicate the configuration I have on my host machine.

Create a generic `Value` class that seamlessly wraps both `Expr` and `Number`

Related to #5

More thoughts on having a Value class equivalent to std::variant<Expr, Number>:

  • could be used everywhere in the API, instead of having to duplicate things between Expr and Number/cst_t. That includes:
    • In IRContext and TmpContext instead of maintaining two lists
    • In ProcessedInst::Param
    • In CPU's preprocess_inst and postprocess_inst methods
    • In MemEngine read/write API
    • In Info, RegAccess, MemAccess, ...
  • Should have is_abstract() and is_concrete() methods
  • Should also have is_concrete, is_concolic, ..., wrapper methods for convenience
  • Should have the same in-place operators as the Number class, except that they automatically use the internal Number or Expr depending on whether the Value is abstract or concrete. Using in-place operators is a must for performance
  • Should have operators to create Constraints, again mostly just wrappers around Expr operators

Basically we should start off from the current ProcessedInst::Param implementation and build a fully functional Value class on top of it, then progressively start to use Value everywhere.

Implementation notes

  • The class should have near-zero overhead for concrete values: just set nullptr for the expression field
  • The class should have efficient creation, including assignment operators for both copies and rvalue references
  • Since the class contents will exceed the size of a native integer, and since it will have non-trivial constructors, it should be passed by reference as often as possible
  • The class should be used in-place as much as possible
  • About usage of Value vs Expr in the API:
    • Use Expr when we are sure we are dealing with abstract expressions (like new_symbolic_buffer, ...)
    • Use Value when we are unsure whether data is concrete or abstract
    • Typically, we will allow using both Expr and Value when the user inputs symbolic data to the engine (assigning registers, writing memory, creating symbolic buffers, ...), but mainly use Value when returning information back to the user (reading registers and memory, the info field in event callbacks, ...)
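A toy Python model of the proposed class (a string stands in for Expr, and a plain int for Number; this is only a sketch of the tagged-union idea, not Maat's API):

```python
class Value:
    """Tagged union of an abstract expression and a concrete number,
    modelling std::variant<Expr, Number>."""
    def __init__(self, number=None, expr=None):
        # Exactly one of the two fields must be set
        assert (number is None) != (expr is None)
        self._number = number
        self._expr = expr

    def is_concrete(self):
        # Concrete values carry no expression (near-zero overhead path)
        return self._expr is None

    def is_abstract(self):
        return self._expr is not None
```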

Snapshotting needs to save the IR state more accurately

At the moment, taking a snapshot saves the instruction pointer. After restoring a snapshot, execution restarts from the beginning of the instruction pointed to by the saved instruction pointer. If we take a snapshot in a callback in the middle of an instruction, this might result in restoring a corrupted IR state.

We should save the InstLocation information instead of just the PC.

Add a licence

Decide on what licence we will use before releasing publicly

  • Choose licence
  • Update repo with licence
  • Update doc with licence

Support Big Endian

The MemEngine currently supports only Little Endian. We should add an endianness setting to memory, and make the code work for both Big and Little endian. Places where changes are needed include (but aren't limited to):

  • MemEngine
  • SymbolicMemEngine
  • VarContext (when creating new concolic/symbolic buffers from raw bytes)
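For illustration, the same bytes decode to different integers depending on the configured byte order (plain Python, not Maat's API):

```python
def read_int(mem: bytes, offset: int, size: int, big_endian: bool) -> int:
    """Endianness-aware memory read of `size` bytes at `offset`."""
    chunk = mem[offset:offset + size]
    return int.from_bytes(chunk, "big" if big_endian else "little")
```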

Recursively get library dependencies

Currently the loader will add direct dependencies for the loaded binary in the emulated filesystem, so that they can later be loaded when executing the loader. However, it doesn't look for dependencies of the dependencies, resulting in missing library files when the loader recursively loads dependencies. An example of this problem can be found in #49.

We should get the dependency library list in a recursive fashion when loading a binary and add them all to the emulated file system.
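The recursive collection can be sketched as a simple worklist traversal (illustrative Python; `direct_deps` stands in for parsing each library's DT_NEEDED entries):

```python
def collect_dependencies(binary, direct_deps):
    """Transitively collect all library dependencies of `binary`."""
    seen, worklist = set(), [binary]
    while worklist:
        lib = worklist.pop()
        for dep in direct_deps.get(lib, []):
            if dep not in seen:
                seen.add(dep)
                worklist.append(dep)
    return seen
```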

Prefix preprocessor definitions

Maat currently uses the following defines:

  • HAS_SOLVER_BACKEND
  • Z3_BACKEND
  • HAS_LOADER_BACKEND
  • LIEF_BACKEND
  • PYTHON_BINDINGS

They should be prefixed with MAAT_ to allow other projects to compile and link against Maat without name collisions.

Publish Maat on PyPI

We could allow users to install Maat from PyPI for more convenience.
