fkie-cad / dewolf Goto Github PK

A research decompiler implemented as a Binary Ninja plugin.

License: GNU Lesser General Public License v2.1

Dockerfile 0.03% Makefile 0.12% Python 97.52% C 2.33%

reverse-engineering binaryninja binaryninja-plugin software-analysis decompilation binary-analysis dewolf

dewolf's Introduction

dewolf

dewolf is a research decompiler we developed during a research cooperation from 2019 to 2021 between Germany (Fraunhofer FKIE) and Singapore (DSO National Laboratories).

The restructuring of dewolf is based on the former DREAM/DREAM++ approach [Yakdan et al. NDSS 2015, IEEE (SP) 2016].

The decompiler dewolf is implemented as a plugin for Binary Ninja and uses their Medium-Level intermediate language as the starting point. Although we consider dewolf to be pretty stable, it is still a research prototype and not extensively optimized for production use. Consequently, you will likely observe a few bugs or even decompilation failures when applying dewolf on real-world binaries.

If you encounter any bugs, please report them to us so that we can further improve dewolf. :)

Installation

Dependencies

Before we start, please make sure you have the following dependencies installed and available on your computer:

At least Python 3.10
Latest stable release of Binary Ninja (>=3.4)
astyle for proper indentation of the decompiled code
libgraph-easy-perl only required for printing ASCII graphs

Under Linux (Ubuntu / Debian), you can use the following command to install both astyle and libgraph-easy-perl:

sudo apt install astyle libgraph-easy-perl

Under Windows, please make sure the astyle-binary has been added to the environment Path.

Binary Ninja Plugin

Follow the steps below to setup dewolf as a GUI plugin for Binary Ninja. Afterwards, you will be able to inspect decompiled code from a Binary Ninja dock.

Step 1:

Clone the dewolf repository into the Binary Ninja plugin folder which is located in one of the following paths corresponding to your operating system:

Linux: ~/.binaryninja/plugins MacOS: ~/Library/Application Support/Binary Ninja Windows: %APPDATA%\Binary Ninja

Attention: If you want to use a python virtual environment, make sure it is enabled for the next steps and also when starting Binary Ninja.

Step 2:

Install dewolf's python dependencies with:

pip install -r requirements.txt

Step 3:

Install Binary Ninja python API with:

python <binaryninja_path>/scripts/install_api.py [--install-on-pyenv if using virtualenv]

Warning: Changes made to the dewolf plugin only comes into effect after restarting the Binary Ninja GUI.

Usage

The dewolf decompiler can be used from both the command line and within Binary Ninja.

GUI

After enabling the dewolf decompilation widget via Tools > dewolf decompiler, the decompiled code for the currently active symbol will be displayed. In the dewolf widget, it is possible to navigate through functions by double-clicking them.

The automatic decompilation of selected functions can be toggled with the follow button. Decompiled code is cached and can be generated again with the decompile button, e.g. after patching instructions in the binary view.

CLI

For batch decompilation, it may be more convenient to utilize dewolf as a command line program. If you would like to use dewolf from the command line, you can invoke decompilation of an entire binary with the following command:

python decompile.py <path/to/binary>

If you wish to decompile a specific function, the function name can be provided as the second parameter:

python decompile.py <path/to/binary> <function_name>

By default, the generated code is displayed on the console. If you want to write the output to a file instead, you can specify it with the --output/-o flag. Please use the --help flag for more information.

Configuration

dewolf has multiple configuration options of which some are configurable via the GUI.

via GUI

You can configure dewolf from the Binary Ninja GUI by navigating to Edit > Preferences > Settings or by pressing Ctrl + ,. Search for dewolf in the search bar and all dewolf related settings will be displayed.

Warning: Configurations made through Binary Ninja will not be taken into account when dewolf is started via command line interface. To configure dewolf when started via CLI, do as described in the following section.

via CLI

To apply settings for command line mode or using advanced settings not shown in the GUI, you can provide a config.json file in the decompiler root folder. The format of such a config file has to be as follows:

{
  "section.key": value,
  "expression-propagation.maximum_instruction_complexity": 5
}

All available settings can be found in decompiler/util/default.json.

Support

If you have any suggestions, or bug reports, please create an issue in the Issue Tracker.

In case you have any questions or other problems, feel free to send an email to:

[email protected].

dewolf's People

Contributors

Stargazers

Watchers

Forkers

monkeywave ron1987 bar2023 mm4rks mewbak steffenenders jamestiotio

dewolf's Issues

Rename dewolf folder

Problem Statement

Since the repository is called dewolf and is also a package, we have problems with the imports.

Solution

Rename the dewolf-folder to decompiler.

[Restructuring] Constructing initial switch

Proposal

When constructing the initial switch we face two problems:

We only consider the symbol of the condition and not the actual condition
When two switch nodes belong to the same case constant we randomly choose one.

Consider test18 in sample test_condition.zip
During the restructuring we have the following AST:

Nodes 3, 11, and 12 are possible cases for a switch with variable var_0. However, node 3 and node 12 would belong to case constant 1.
Nevertheless, we do not remove one of them, because the conditions belong to two different symbols.
The first task would be to change this, such that we already check that these conditions are not the same.
When these conditions are considered the same, then we would randomly remove one of them. However, only nodes 11 and 12 can be cases of a switch node due to the reachability of the nodes.
One can see this by having a look at the sibling reachability graph of the sequence node (node 1), the green marked nodes belong to the possible switch cases:

Now, the second task would be to decide which case to choose when two possible cases have the same condition and only one can be picked.*

Approach

The two problems should be handled in _remove_case_candidates_with_same_condition in initial_switch_node_constructer.

For checking for the same condition, there are two options, either directly compute the possible cases or transform them into the "real" conditions and check for equality.

The second task is a bit more difficult. One could compute the sibling reachability without these nodes and check which could be added or the one that is only a code-node and not a condition node if this is the case. But this part requires a few more considerations.

[Array access detection] Consider address-of operation as a valid base candidate

Proposal

Currently, we consider expressions *(base+offset) as an array element access base[offset/type_size] if the base variable has type Pointer and offset satisfies valid offset requirements.

For instance, here:

void func(long * arr, size_t size){
for (int i=0; i<size; ++i)
     *(arr + 8*i) = ...;
...
}

we recognize *(arr+8*i) as arr[i].

However, we do not recognize &STUDENTS[var_2] here:
gradebook.zip

$ python decompile.py gradebook add_student

extern void STUDENTS = 0;
...
        *(var_2 * 8 + &STUDENTS) = var_4;
...

Here address-operation (or reference) (&STUDENTS) is not recognized as valid base (pointer to the first element), as it is not a variable of type Pointer.

Since reference to a variable and a pointer to that variable are semantically equivalent, we should consider address operation as a valid base candidate.

Approach

Update and rename _is_pointer_variable function in ArrayAccessDetection to not only return True, if an expression is a variable of type Pointer but also if it is address UnaryOperation.

Also update getting array type -> in case as above type of unary operation operand can be used.

[CodeGenerator] Use Hungarian notation for unnamed variables

Proposal

We should use the hungarian notation instead of generic viable names
https://en.wikipedia.org/wiki/Hungarian_notation

We could also introduce a config option for this.

Approach

We could implement it as an AST pipeline stage after InstructionLengthHandling (which introduces additional variables)

[Lifter] Lift expression as None

What happened?

After updating the Lifter, we lift instructions as None leading to a ValueError in type propagation:

Traceback (most recent call last):
  File "decompile.py", line 80, in <module>
    main(Decompiler)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/util/commandline.py", line 65, in main
    task = decompiler.decompile(function_name, options)
  File "decompile.py", line 55, in decompile
    pipeline.run(task)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/pipeline/pipeline.py", line 97, in run
    instance.run(task)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/pipeline/dataflowanalysis/type_propagation.py", line 102, in run
    graph = TypeGraph.from_cfg(task.graph)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/pipeline/dataflowanalysis/type_propagation.py", line 38, in from_cfg
    graph.add_instruction(instruction)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/pipeline/dataflowanalysis/type_propagation.py", line 44, in add_instruction
    self.add_expression(top_level_expression, instruction)
  File "/home/eva/Projects/dewolf-decompiler/decompiler/pipeline/dataflowanalysis/type_propagation.py", line 63, in add_expression
    self.add_edge(self._make_node(sub_expression), self._make_node(head), label=self.EdgeType.subexpression)
  File "/home/eva/Projects/dewolf-decompiler/.venv/lib/python3.8/site-packages/networkx/classes/digraph.py", line 622, in add_edge
    raise ValueError("None cannot be a node")
ValueError: None cannot be a node

The problem occurs in this block:

We lift these instructions as follows:

How to reproduce?

test_1.zip main

Affected Binary Ninja Version(s)

3.0.3233 (Build ID e250d0a3)

[Makefile] Adjust to encounter requirements of compiler idioms

What happened?

When running make locally, make has no idea, where user's dewolf-idioms repo lies and therefore fails to install its requirements.

Details:
In the VM, Dockerfile copies dewolf-idioms requirements.txt as dewolf-idioms-requirements.txt to the root folder and make than installs them. Since, when running locally, Dockerfile is not being used, the dewolf-idioms-requirements.txt does not exist.

Suggestion:
I would suggest to adjust make.venv to search for dewolf-idioms in .plugins folder and in project root folder. Or use environment variable with path to dewolf-idioms. Or all three options combined.

How to reproduce?

Clone this repo
Create symlimnk to install_api.py in the project root:

ln -s BINARY_NINJA_INSTALL_FOLDER/scripts/install_api.py install_api.py

run make.

Affected Binary Ninja Version(s)

Version 3.0.3233 (Build ID e250d0a3)

z3 Exception: True, False or Z3 Boolean expression expected. Received 1 of type <class 'int'>

What happened?

Using the old lifter, the decompiler crashes with:

Traceback (most recent call last):
  File "/home/user/0xff/decompiler2/decompile.py", line 78, in <module>
    main(Decompiler)
  File "/home/user/0xff/decompiler2/dewolf/util/commandline.py", line 59, in main
    undecorated_code = decompiler.decompile_all(options)
  File "/home/user/0xff/decompiler2/decompile.py", line 68, in decompile_all
    pipeline.run(task)
  File "/home/user/0xff/decompiler2/dewolf/pipeline/pipeline.py", line 97, in run
    instance.run(task)
  File "/home/user/0xff/decompiler2/dewolf/pipeline/dataflowanalysis/dead_path_elimination.py", line 30, in run
    if not (dead_edges := set(self.find_unsatisfyable_edges(task.graph))):
  File "/home/user/0xff/decompiler2/dewolf/pipeline/dataflowanalysis/dead_path_elimination.py", line 60, in find_unsatisfyable_edges
    if dead_edge := self._get_invalid_branch_edge(graph, branch_block, branch_instruction):
  File "/home/user/0xff/decompiler2/dewolf/pipeline/dataflowanalysis/dead_path_elimination.py", line 66, in _get_invalid_branch_edge
    condition = self._logic_converter.convert(instruction)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 46, in convert
    return self._convert_branch(expression)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 71, in _convert_branch
    return self._convert_condition(branch.condition)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 80, in _convert_condition
    _operation = self._get_operation(condition)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 94, in _get_operation
    operands = self._ensure_same_sort([self.convert(operand) for operand in operation.operands])
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 94, in <listcomp>
    operands = self._ensure_same_sort([self.convert(operand) for operand in operation.operands])
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 50, in convert
    return self._convert_operation(expression)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 89, in _convert_operation
    _operation = self._get_operation(operation)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 101, in _get_operation
    return converter(*operands)
  File "/home/user/0xff/decompiler2/dewolf/structures/pseudo/logic.py", line 156, in <lambda>
    OperationType.multiply: lambda a, b: a * b,
  File "/home/user/.local/lib/python3.9/site-packages/z3/z3.py", line 1435, in __mul__
    if other == 1:
  File "/home/user/.local/lib/python3.9/site-packages/z3/z3.py", line 946, in __eq__
    a, b = _coerce_exprs(self, other)
  File "/home/user/.local/lib/python3.9/site-packages/z3/z3.py", line 1114, in _coerce_exprs
    b = s.cast(b)
  File "/home/user/.local/lib/python3.9/site-packages/z3/z3.py", line 1409, in cast
    _z3_assert(is_expr(val), "True, False or Z3 Boolean expression expected. Received %s of type %s" % (val, type(val)))
  File "/home/user/.local/lib/python3.9/site-packages/z3/z3.py", line 96, in _z3_assert
    raise Z3Exception(msg)
z3.z3types.Z3Exception: True, False or Z3 Boolean expression expected. Received 1 of type <class 'int'>

Using the new lifter, the decompiler crashes with:

Traceback (most recent call last):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompile.py", line 80, in <module>
    main(Decompiler)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/util/commandline.py", line 65, in main
    task = decompiler.decompile(function_name, options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompile.py", line 54, in decompile
    task = self._frontend.create_task(function, task_options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/frontend.py", line 57, in create_task
    cfg = self._extract_cfg(function, options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/frontend.py", line 83, in _extract_cfg
    return parser.parse(function)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/parser.py", line 34, in parse
    index_to_BasicBlock[basic_block.index] = BasicBlock(basic_block.index, instructions=list(self._lift_instructions(basic_block)))
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/parser.py", line 77, in _lift_instructions
    if lifted_instruction := self._lifter.lift(instruction):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/lifter.py", line 28, in lift
    if pseudo_expression := handler(expression, **kwargs):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/handlers/calls.py", line 71, in lift_intrinsic
    writes_memory=call.output_dest_memory if ssa else None,
AttributeError: 'MediumLevelILIntrinsicSsa' object has no attribute 'output_dest_memory'

(which appears to be the same lifter error as in #69)

How to reproduce?

Decompile main from the following sample:
742b2356-eda6-44ac-b01c-67f24f892251.bin.zip

Affected Binary Ninja Version(s)

Version 3.0.3233 (Build ID e250d0a3)

[Switch Restructuring] Insert CaseNodes

Proposal

The MissingCaseFinder searches for possible cases that were not found/there when constructing the initial switch. Now, we insert such a potential case node only if the corresponding case constant does not exist. However, there are cases where we still can insert the missing case.

Consider the following decompiled code (test_switch.zip test7_b):

int test7_b() {
    int var_0;
    int * var_1;
    printf("Enter week number(1-7): ");
    var_1 = &var_0;
    __isoc99_scanf(0x804c025, var_1);
    if (var_0 == 0x190) {
        printf("Thursday");  // <---------- missing case
    }
    switch(var_0) {
    case 0x0:
        printf("Monday");
        break;
    case 0x6:
        printf("Saturday");
        break;
    case 0x9:
        printf("Sunday");
        break;
    case 0xc:
        printf("Tuesday");
        break;
    case 0x22:
        printf("Wednesday");
        break;
    case 0x190:
    case 0x1f4:
        printf("Friday");
        break;
    default:
        printf("Invalid input! Please enter week number between 1-7.");
    }
    return 0;
}

The preferred output would be:

int test7_b() {
    int var_0;
    int * var_1;
    printf("Enter week number(1-7): ");
    var_1 = &var_0;
    __isoc99_scanf(0x804c025, var_1);
    switch(var_0) {
    case 0x0:
        printf("Monday");
        break;
    case 0x6:
        printf("Saturday");
        break;
    case 0x9:
        printf("Sunday");
        break;
    case 0xc:
        printf("Tuesday");
        break;
    case 0x22:
        printf("Wednesday");
        break;
    case 0x190:
        printf("Thursday");
    case 0x1f4:
        printf("Friday");
        break;
    default:
        printf("Invalid input! Please enter week number between 1-7.");
    }
    return 0;
}

A similar example is test18 in the same executable, we currently have:

int test18() {
    int var_0;
    int * var_1;
    printf("Enter week number(1-7): ");
    var_1 = &var_0;
    __isoc99_scanf(0x804c025, var_1);
    switch(var_0) {
    case 1:
        printf("Monday");
        var_0 += 0x1f4;
        break;
    case 0x1f4:
        printf("Friday");
        break;
    }
    if ((var_0 != 1) && (var_0 != 12)) {
        printf("Invalid input! Please enter week number between 1-7."); // <---------- do not find default do due missing case
    }
    else {
        printf("Tuesday"); // <---------- missing case
    }
    printf("the number is %d", var_0);
    return 0;
}

But we would like to have the following output:

int test18() {
    int var_0;
    int * var_1;
    printf("Enter week number(1-7): ");
    var_1 = &var_0;
    __isoc99_scanf(0x804c025, var_1);
    switch(var_0) {
    case 1:
        printf("Monday");
        var_0 += 0x1f4;
    case 12:
        printf("Tuesday");
        break;
    case 0x1f4:
        printf("Friday");
        break;
    default:
        printf("Invalid input! Please enter week number between 1-7.");
    }
    printf("the number is %d", var_0);
    return 0;
}

Remark: test18 also has another problem, see Bug #19

Approach

Find a good way to check whether it is possible to insert a case node whose case constant already exists.
It is important to consider the reachability of the nodes.

Eliminate dereferences with nested address operators

Proposal

The decompiler emits statements such as *&x.
We could simplify them by omitting both operations.

sample.zip

The sample in the archive currently emits the following line of code:

*&total_students = 0x0;

Approach

Add a function to simplify dereference operations to the ExpressionSimplification pipeline stage in decompiler/pipeline/controlflowanalysis/expression_simplification.py

[Expression Propagation] remove propagation limit

Proposal

We have decided to propagate expressions so far it is possible. Afterwards, subsequent stages should control resulting expression length. Hence, we don't need propagation limits in the user config.

Approach

Remove all configuration options for expression-propagation, expression-propagation-memory and expression-propagation-function-call from util/default.json.
Remove the rule that controls expression length (_resulting_instruction_is_too_long(self, target, definition){...}) from ExpressionPropagationBase.
Remove calls to _resulting_instruction_is_too_long from ExpressionPropagation, ExpressionPropagationMemory, ExpressionPropagationFunctionCall.

[Expression Propagation Memory] Implement propagation of global variables

Proposal

Currently, we do not propagate (re)definitions of global variables, as there are cases where it could lead to incorrect decompilation. Consider the following example test_global.zip:

00004010  uint32_t glob = 0x4

propagate_globals:
   0 @ 0000118a  [glob].d = 0xa @ mem#0 -> mem#1
   1 @ 00001194  rax#1 = [glob].d @ mem#1 //<---------it is OK to propagate glob = 0xa here, then rax#1 = 0xa
   2 @ 0000119a  rax_1#2 = rax#1 + 0xa 
...

00004010  uint32_t glob = 0x4

do_not_propagate_globals:
   0 @ 00001159  [glob].d = 2 @ mem#0 -> mem#1
   1 @ 00001163  rax#1 = 0
   2 @ 00001168  mem#2 = change_glob() @ mem#1 //<-------------------- potentially changes glob
   3 @ 0000116d  rax_1#2 = [glob].d @ mem#2 //<-----------------------------cannot propagate glob = 2 here, since function call may change value of glob
....

Our current strategy was not to propagate global variables. However, such propagation would lead to cleaner decompiled code - in cases where it is possible (correct) ;)

Approach

Modify EPM (ExpressionPropagationMemory) to handle also global variables.
It can be done in the same fashion as by variables that have pointers to it:
check if between definition and use on any path exist instruction that potentially changes the global variable - for global variables, it would be any (non-library?) function call; or, in case pointer on global variable exist, write-uses of this pointer.
There could be more corner cases here; they should be handled correspondingly.

Meaningful Named Constants

Proposal

Specific constants can be very important during binary/malware analysis.
For example, such constants may be able to give away the crypto routine used.
Since in source code, those numbers have a textual representation, we could also try to identify and define them similarly.
We should consider both Library API constants and file magic numbers.

Approach

No response

[Array Access Detection] Implement extra checks

Proposal

There are a couple a checks that could be made in order to get type of array if we have void * - since we need type of array in non-aggressive mode (default) in order to mark expression as array element access.

E.g. non-agressive: (a+4i) where type of a is int* would be recognized as a[i] // int a[]; whereas in case type of a is void*, we generate something like this: *(a + i * 4)/*a[i]*/ since a can be basically anything.

When the source code contains some index arithmetic, it is possible to infer original type.
Consider the following example:
array_checks.zip

unsigned long func3(void * arg1, int arg2, long arg3, long arg4) {
    int i;
    int var_1;
    for (i = 0; i < (unsigned int)(arg2 / 2 - 1); i++) {
        var_1 = i * 2;
        printf(/* format */ "%l\n", *(arg1 + var_1 * 8)/*arg1[var_1]*/, var_1 * 8, arg4);
        printf(/* format */ "%l\n", *(arg1 + (var_1 + 1L) * 8), (var_1 + 1L) * 8);
        arg4 = var_1 + 1L << 3;
    }
    return (unsigned int)(arg2 / 2 - 1);
}

Here multiplier 8 can hint us, that arg1 points on array of long elements. This holds for expressions a[2*i] and a[2*i+1]. I am not sure if it holds also for other expression types.

Approach

Experiment a bit if this assumption is general enough - experiment with different types of arrays and different index expressions. If the trend remains - we can reliably recognize type for different index expressions - then update ArrayAccessDetection to enrich array type infos.

If such inference is possible only for limited subset of expressions or/and to simultaneously requires complex and time intensive heuristics, I suggest to close this issue.

[Makefile] Create environmental variable for Binary Ninja install folder

Proposal

Currently, to run make locally, we need to create symlink to install_api.py

ln -s BINARY_NINJA_INSTALL_FOLDER/scripts/install_api.py install_api.py

And then copy binaryninja.pth to the .venv site-packages.

make fails in this case when copying binaryninja.pth if Binary Ninja installation folder in not /opt/binaryninja

Approach

Check /opt/binaryninja/ and if not exist, check corresponding environment variable. Use this info both for install_api.py and binaryninja.pth paths derivation, so that the user needs does not need to create the symlink and to copy the binaryninja.pth file.

[C2AST] Generate AST from C code

Proposal

Writing AST tests can be quite verbose, therefore I propose to implement a converter which translates c code to dewolfs AST implementation.

Approach

Use pycparser to generate an intermediate AST representation
Translate each AST node to dewolfs AST representation

[Redundant Cast Elimination] The non-redundant cast is being removed

What happened?

python decompile.py div_by_2 div_by_2_dec

Expected output:

long div_by_2_dec(int arg1) {
    return arg1 + ((long)arg1 >> 32U) >> 1;
}

Actual output:

long div_by_2_dec(int arg1) {
    return arg1 + (arg1 >> 32U) >> 1;
}

Cast to long is erroneously removed during redundant cast elimination, so the actual function when recompile does not divide by 2 (floor division), so that when recompiled for input -5 produces output other than -3.

How to reproduce?

Original source code:

#include <stdint.h>
#include <stdio.h>

int div_by_2(int i) {
  return i / 2;
}

unsigned long div_by_2_dec(int i) {
  return i + (i >> 31) >> 1;
}

int main() {
  int i;
  scanf("%d", &i);
  printf("%d\n", div_by_2(i));
  printf("%d\n", div_by_2_dec(i));
}

Binary:
div_by_2.zip
For -5 as input , should output -2 and -3.

Affected Binary Ninja Version(s)

Version 2.4.2846 (Build ID df7c027e)

[Widget / GUI] Add timeout for GUI

Proposal

Currently a long running decompilation can not be terminated by the user.
We should probably add a timeout or interrupt to decompilation in widget.Worker(QRunnable).
With regards to difficulties met previous issues regarding timeouts, this widget timeout could maybe be soft by setting an exit flag or exhausting a timing budget.

Approach

No response

Implement support for tailcalls

What happened?

Right now, we handle Tailcalls similar to normal call instructions.

long rpl_fseeko(FILE * arg1, off_t arg2, int arg3) {
    off64_t var_0;
    long var_1;
    if ((*((char *)arg1 + 0x10) == *((char *)arg1 + 0x8)) && (*((char *)arg1 + 0x48) == 0L) && (*((char *)arg1 + 0x28) == *((char *)arg1 + 0x20))) {
        var_0 = lseek(/* fd */ fileno(/* fp */ arg1), /* __arg2 */ arg2, /* whence */ arg3);
        if (var_0 == -1L) {
            var_1 = 0xffffffff;
        }
        else {
            *arg1 = *arg1 & -17;
            *(arg1 + 144L) = var_0;
            var_1 = 0L;
        }
        return var_1;
    }
    fseeko(/* fp */ arg1, /* offset */ arg2, /* whence */ arg3);
}

The Last line should actually be a tailcall

return fseeko(/* fp */ arg1, /* offset */ arg2, /* whence */ arg3);

How to reproduce?

decompile.py sha224sum rpl_fseeko sha224sum.zip

Affected Binary Ninja Version(s)

3.0

Common Subexpression Elimination has side effects on Restructuring

What happened?

The restructuring phase expects all expressions to be fully propagated.
However, CommonSubexpressionElimination runs first (due to it being implemented on the ssa cfg).
This results in undesired side effects.

We could either try to implement CSE on the AST after the resturcturing process, or make the restructuring process follow definitions of values.

How to reproduce?

Bug can be observed in the sample test_goto.zip in the function test1_b

Before CSE is run, the CFG looks like this:

However, during CSE we add the definition c#0 = var_10#1 + 0x1between the phi-function and the if-condition.
This results in problems during restructuring:

After out of SSA, this line has been converted to var_1 = var_0 + 1
As a result, we currently can not restructure the loop as. The resulting code looks like this:

int test1_b() {
    int var_0;
    int var_1;
    var_0 = 10;
    while (true) {
        var_1 = var_0 + 1;
        if (var_0 <= 15) {
            printf("value is: %d\n", var_0);
            var_0 = var_1;
            continue;
        }
        printf("value of a: %d\n", var_0);
        if (var_1 > 19) {
            break;
        }
        var_0 = var_1;
    }
    printf("value of a final: %d\n", var_1);
    return 0;
}

Affected Binary Ninja Version(s)

2.4

AttributeError: 'MediumLevelILIntrinsicSsa' object has no attribute 'output_dest_memory'

What happened?

The decompiler crashes with the following stack trace:

Traceback (most recent call last):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompile.py", line 80, in <module>
    main(Decompiler)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/util/commandline.py", line 65, in main
    task = decompiler.decompile(function_name, options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompile.py", line 54, in decompile
    task = self._frontend.create_task(function, task_options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/frontend.py", line 57, in create_task
    cfg = self._extract_cfg(function, options)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/frontend.py", line 83, in _extract_cfg
    return parser.parse(function)
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/parser.py", line 34, in parse
    index_to_BasicBlock[basic_block.index] = BasicBlock(basic_block.index, instructions=list(self._lift_instructions(basic_block)))
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/parser.py", line 77, in _lift_instructions
    if lifted_instruction := self._lifter.lift(instruction):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/lifter.py", line 28, in lift
    if pseudo_expression := handler(expression, **kwargs):
  File "/Users/steffenenders/Repositories/Work/dewolf/decompiler/frontend/binaryninja/handlers/calls.py", line 71, in lift_intrinsic
    writes_memory=call.output_dest_memory if ssa else None,
AttributeError: 'MediumLevelILIntrinsicSsa' object has no attribute 'output_dest_memory'

How to reproduce?

Decompile the function main of the following sample:
65c3b302-a243-470e-b4a2-b74b5cfda206.bin.zip

Affected Binary Ninja Version(s)

Version 3.0.3233 (Build ID e250d0a3)

[Restructuring] Resolve unresolved reaching conditions

Proposal

During the restructuring, when we restructure a region, each node gets a reaching condition. Using these reaching conditions we try to find if-else and switch constructs. Nevertheless, not all reaching conditions are resolved by these algorithms, i.e., at the end of the restructuring, there may be still some nodes whose reaching condition is not true. Thus, these reaching conditions have to be transformed into if-constructs.

Currently, we simply add a condition node before a node with a reaching condition that has as condition the reaching condition of the node. But sometimes, this does not lead to the best possible output.

Here are some examples where another approach would be better:

If we have two nodes with reaching conditions (!x1 & x2) and (!x1 & !x2), then we first restructure the if-else with the condition x2, resulting in the following AST (nodes 6, 7, 8):

After resolving the unresolved reaching conditions it results in the following AST:

A better solution would be to add the condition before node 6:

This example belongs to test15 in test_condition.zip

If we have already a condition node with the condition x1 and we have a node with reaching condition x1 & cond that we want to resolve, we do not add this node to the existing condition node (if it is possible due to reachability), where cond is an arbitrary condition that could also be true.

Consider, for example, the following AST, where we have a condition node with the condition x1 and a node with reaching condition !x1 & !x2.

When resolving the reaching condition for code node 6, we get the following AST:

However, we could also add this node to the false-branch of the condition node, i.e., we would like to have the following AST:

This example belongs to test16 in test_condition.zip

Approach

No response

add to plugin manager?

Is there interest in having this plugin be added to the plugin manager? It definitely can help folks find the plugin more easily and also simplifies installation. The latest builds even automatically install the pip dependencies based on the existing requirements.txt.

Add support for BN 3.0 API (WidgetPane)

Proposal

Change current DockWidget to WidgetPane

Approach

Take a look at https://github.com/Vector35/binaryninja-api/blob/dev/python/examples/hellopane.py

[Lifter] Addresses representation in the decompiled code

Proposal

Currently we represent addresses within the binary as hexadecimal:
extractme.zip main

...
var_3 = i + 0x804c020;
...

For instance, Binary Ninja prepends in such cases data_ prefix for data.
It would be nice to somehow integrate section information into decompiled code.

Current suggestions were:
data_0x804c020
0x804c020 \* data *\
0x804c020 \* data + 12 *\ - with offset from the section start.
???

Approach

Collect couple of ideas, note them, chose the most optimal 😈 and implement.

[Switch] Add configuration options

Proposal

Not everybody likes switch-constructs, or switch in switch-constructs or only switches with a certain number of cases. Thus, it would be beneficial if it is configurable.

Remarks:
Since we start with restructuring smaller regions, we do not know whether a switch-construct may be a case node of another switch-construct when we restructure it.
Additionally, we sometimes have to construct an initial switch with two cases, to get a switch with more cases, depending on the control flow graph.

Approach

Add decompiler options for

reconstruct switch at all (yes/no , default: yes)
minimum number of cases to reconstruct as a switch node (any integer > 2, default: 2)
nested switches (yes/no, default: yes)
Optional: additionally case limit for nested switches, perhaps also an upper bound here?

The ConditionAwareRefinement, that reconstructs switch nodes, should be untouched. For the overall option switch yes or no, simply modify the function construct_refined_ast in acyclic_restructuring.py.
For all other options, this should be done after the restructuring. Either in a separate stage that only runs if these options are set or it could be done in the code generator. Simply rewrite the switch as if-else constructs if they do not fulfill any of the configuration options.

Loop Config Options

Proposal

Add configuration options for (for-loops). Right now, I can think of the following options that we probably should add to dewolf:

Loop Variable Names (specify which kind of variable names should be used for for-loops)
Conditions for for-loop recovery:
- Maximum complexity for a condition to be allowed in a for-loop header.
- Maximum complexity for a modification statement to be allowed in a for-loop header.
- Some option like: force for-loops which always transforms a loop to a for-loop, if possible.

Approach

No response

[Dead Path Elimination] Crash when transforming conditions into z3

What happened?

Currently, function main in the sample test.zip leads to a crash because we convert both operands of an ULE operation into bools. This is not possible in z3.
More precisely, we want to transform the condition (rax_3#9 == -0x1) u< 0x1 into a z3 condition. Since (rax_3#9 == -0x1) is a BoolRef function _ensure_same_sort also transforms 0x1 into a BoolRef resulting in the two operands (rax_3#9 == -0x1) and 1 != 1.
Now, the problem is that z3 requires one operand of an ULE to be a bit vector.

Possible Solution:
Update the function _ensure_same_sort as follows:

    def _ensure_same_sort(self, operands: List[ExprRef]) -> List[ExprRef]:
        """
        Complicated function ensuring the given operand list has a common sort.

        Converts bv and bool into each other and tries to even out size differences for bit vectors.
        """
        if any(is_bv(op) for op in operands): # <-------- changed line
            operands = [self._ensure_bitvec_sort(operand) for operand in operands]
            operands = list(self._ensure_bv_size(operands))
        elif any(is_bool(op) for op in operands): # <-------- changed line
            operands = [self._ensure_bool_sort(operand) for operand in operands]
        return operands

How to reproduce?

function main in the sample test.zip
function main in the sample test_1.zip

Affected Binary Ninja Version(s)

2.4.2846 (Build ID df7c027e)

Integrate dewolf-logic

Problem Statement

We implemented our own logic engine (https://github.com/fkie-cad/dewolf-logic), which we want to integrate into dewolf.

Solution

Implement the logic interface and make sure that all needed simplifications are implemented in dewolf-logic.

[GUI] Make addresses in code clickable

Proposal

Is it possible to make addresses in code clickable, i.e. if we have address in the decompiled code that points to some location within the binary, we could click it and it navigates to the corresponding location in the main view of the Binary Ninja?

E.g.
extractme.zip

int main(int argc, char ** argv, char ** envp) {
    int var_0;
    char var_1;
    void * i;
    void * var_3;
    var_0, var_1 = mmap(0, 71, 7, 34, -1, 0, 0);
    if (var_0 == 0) {
        var_0 = 1;
    }
    else {
        for (i = 0U; i < 71; i++) {
            var_3 = i + 0x804c020;
            *var_3 = *var_3 ^ 19;
        }
        memcpy(var_0, 0x804c020, 71);
        var_0();
        var_0 = 0;
    }
    return var_0;
}

So when we click on 0x804c020, we land on the corresponding address in the data section of the binary (marked gray):

Approach

Implement if possible :D

[Restructuring] Transform Guarded do-while loops

Proposal

Depending on the optimization level, easy while loops are sometimes transformed into an if-condition that has only a true branch consisting of a do-while loop with the same condition, i.e., while(cond){...} is transformed into if(cond){do{...}while(cond)}.

An example of this problem is function main in a.zip.
The decompiled code is

int main(int arg1, char ** arg2, char ** arg3) {
    int var_0;
    int var_1;
    unsigned long var_2;
    int * var_3;
    var_3 = &var_0;
    __isoc99_scanf(/* format */ "%d", var_3);
    var_3 = &var_1;
    __isoc99_scanf(/* format */ "%d", var_3);
    printf(/* format */ "b = %d\n", (unsigned int)var_1);
    var_2 = (unsigned int)var_0;
    if ((unsigned int)var_0 < var_1) { // <---------------- guarded condition
        do {
            printf(/* format */ "a = %d\n", (unsigned int)var_0, var_2);
            var_0++;
            var_2 = (unsigned int)var_0;
        }
        while ((unsigned int)var_0 < var_1); // <---------------- same condition as guarded if-condition
    }
    puts(/* str */ "done");
    return 0;
}

Approach

The task could be done by the loop structure. It has to be done before the for-loop restructuring such that we can transform these loops to for-loops if we needed.

Return value is a variable instead of 0

Proposal

For this sample, the decompiler currently produces the following output:

int main(int argc, char ** argv, char ** envp) {
    int var_0;
    char var_1;
    void * i;
    void * var_3;
    var_0, var_1 = mmap(/* addr */ 0U, /* len */ 71U, /* prot */ 7, /* flags */ 34, /* fd */ -1, /* offset */ 0);
    if (var_0 == 0) {
        var_0 = 1;
    }
    else {
        for (i = 0x0; i < 71; i++) {
            var_3 = i + 0x804c020;
            *var_3 = *var_3 ^ 19;
        }
        memcpy(var_0, 0x804c020, 71U);
        var_0();
        var_0 = 0;
    }
    return var_0;
}

However, the original source code simply contains return 0.
Also, the assembly looks like the following:

08049235  c745fc00000000     mov     dword [ebp-0x4 {var_8_1}], 0x0  ; here

0804923c  8b45fc             mov     eax, dword [ebp-0x4 {var_8_1}]  ; here
0804923f  83c428             add     esp, 0x28
08049242  5d                 pop     ebp {__saved_ebp}
08049243  c3                 retn     {__return_addr}

Maybe we can try to compact the code mov dword [ebp-0x4 {var_8_1}], 0x0 and mov eax, dword [ebp-0x4 {var_8_1}] to return 0
However, return varX is already present in Binary Ninjas MLIL.
So I don't know if we can or want to fix this.

Approach

No response

[Code Generator] Simplify Expressions

Proposal

During the restructuring, we simplify conditions using their old SSA name. Furthermore, we never simplify any expression that is not part of a condition.

More precisely, we would like to be able to simplify expression as

a+a to 2a or 1+ a + 3 to a + 4
arg2 = arg1 + -1L to arg2 = arg1 - 1L
a + 1 < 5 to a < 4 or a < 5 & a < 3 to a < 3

Approach

In the code generator, translate every pseudo-condition into a logic condition (either z3 or custom logic) and simplify it.
Do the same with the logic conditions, which are saved using symbols and a condition map.

The benefit of doing it in the code generator is that we do not have to translate it back to our pseudo-language.

Attention:
When simplifying unsigned comparison, like ULE(BitVec('x', 32), BitVecVal(5, 32)), with z3, the result is And(Extract(31, 3, x) == 0, ULE(Extract(2, 0, x), 5)) which is less readable. So one has to be careful how to simplify.

[Array access detection] Refactor function that collects candidates.

Proposal

Pure free-time refactoring issue, originated from some merge request review. One could make the function that collects candidates prettier.

Approach

Something like this could potentially work:

if candidate := self.get_array_access_candidate(candidate):
    self._candidates[base].append(candidate)
    self._update_candidate_offsets(base, offset_class, element_size)

def get_array_access_candidate(candidate) -> Optional[Candidate]:
  operand = candidate.operand
  if not self._is_addition(operand):
    return None
  base, offset = self._get_base_and_offset(operand)
  if not (base and offset):
    return None
  offset_class, index, element_size = self._parse_offset(offset)
  return Candidate(candidate, index, base)

Should be tested, if it works like this.

[Restructuring] Side effects

What happened?

When we restructure a CFG into an AST, we transform each condition into a symbol. Now, each node of a region gets its reaching condition assigned, thus a reaching condition can occur multiple times.

Consider the following example:

The putout is incorrect because the value of a changes and afterward, we have a check with the old value.

An correct output would, for example be:

cond_b1 = (a == 0)
if(cond_b1){
    a = 1;
}
if(! cond_b1 && b < 10){
    b = b*a;
}else{
    b = b-a;
}
return b;

How to reproduce?

A binary where this problem also occurs is test_switch.zip test18.
The C-code is:

int test18()
{
    int week;
    //Non sequential case constants
    
    /* Input week number from user */
    printf("Enter week number(1-7): ");
    scanf("%d", &week);
    
    switch(week)
    {
        case 1: 
            printf("Monday");
            week +=500 ;
        case 12: 
            printf("Tuesday");
            break;
        case 500: 
            printf("Friday");
            // break;
        default: 
            printf("Invalid input! Please enter week number between 1-7.");
    }
    printf("the number is %d", week);
    return 0;

}

and the output is:

int test18() {
    int var_0;
    int * var_1;
    printf("Enter week number(1-7): ");
    var_1 = &var_0;
    __isoc99_scanf(0x804c025, var_1);
    switch(var_0) {
    case 1:
        printf("Monday");
        var_0 += 0x1f4; // <-------- var_0 is changed here, but the old value should be used for the comparison in the if-statement
        break;
    case 0x1f4:
        printf("Friday");
        break;
    }
    if ((var_0 != 1) && (var_0 != 12)) { // <----------- If var_0 was 1 when reaching the switch, it is now != 1
        printf("Invalid input! Please enter week number between 1-7.");
    }
    else {
        printf("Tuesday");
    }
    printf("the number is %d", var_0);
    return 0;
}

When entering the number 1, the function prints "Monday", "Tuesday", and "the number is 501".
But the decompiled function, with input 1 prints "Monday", "Invalid input! Please enter week number between 1-7.", and "the number is 501".

Affected Binary Ninja Version(s)

Version 2.4.2846

[Lifter] Update to class based MLIL structure

Proposal

The 2.5-dev release of binaryninja introduced a class-based structure for binaryninja based on frozen dataclasses.
We should update the lifter accordingly.

Approach

Implement handling functions for the various classes introduced for the Medium Level Intermediate Language.

[Codegenerator] Invalid Condition

What happened?

Currently, when decompiling function main of fmt.zip the decompilation process crashes due to an invalid condition.

More precisely, the reaching condition of an AST node is evaluated as false because the type of a constant is wrong. (The problem with type propagation is tracked in Issue #42.)
Nevertheless, the program should not crash. The program occurs because the function _condition_string does not assume that a condition is true or false.

Proposal
There are multiple options for this problem.

In general, if a condition is false, then we can simply remove the true branch and execute always the false branch, i.e., we can substitute the condition node with the false branch. Symmetrically, one would replace each condition node with condition true by its true branch. But, in cases like this where the condition should not be false, we may lose content we do not want to lose.
We simply have a condition that is false or true and add this scenario to the function _condition_string.

In both cases, a warning could be nice.

How to reproduce?

Try to decompile main in fmt.zip

Affected Binary Ninja Version(s)

2.4.2846

[For-Loop] Properties of Condition

Proposal

Currently, we transform a while-loop into a for-loop independent of the condition. However, for-loops where the condition is a comparison like a == 10 or a != 10 are rather rare (except for pointers).

In the following example (test1_c in test_goto.zip), we have the condition var_0 == 15 in a for-loop with is rather unusual.

int test1_c() {
    int var_0;
    var_0 = 10;
    do {
        for (var_0; var_0 == 15; var_0 += 2) {
            printf("We go to the loop head.", var_0);
        }
        printf("value of a: %d\n", var_0);
        var_0++;
    }
    while (var_0 <= 19);
    return 0;
}

Here, a while loop would look more natural, i.e.,

int test1_c() {
    int var_0;
    var_0 = 10;
    do {
        while (var_0 == 15) {
            printf("We go to the loop head.", var_0);
            var_0 += 2;
        }
        printf("value of a: %d\n", var_0);
        var_0++;
    }
    while (var_0 <= 19);
    return 0;
}

Approach

Add a configuration option for the condition of for-loops.

which comparisons are allowed
how long/complex can the condition be

[Decoration] Update that it is possible to print aAbstract Syntax Forest

Proposal

Currently, we have ways to print the abstract syntax tree. However, when debugging the restructuring, it is helpful to be able to also print the abstract syntax forest.

Approach

Update the decoration, such one can also print syntaxforests. This can be easily done by using
condition_handler.get_condition_map() instead of the condition map.

[Restructuring] Insert missing cases for switch

What happened?

When inserting missing cases to a switch node and two possible cases would belong to the same case, then we insert two cases with the same constant. This is not valid.

More detailed, when considering test17 in the sample test_condition.zip, we have the following AST during the restructuring:

Here, node 4 and node 20 are possible case candidates for the switch node. But, we can not insert both because they both would be a case corresponding to constant 1. However, we insert both. After inserting missing cases we have the following AST:

The expectation would be that we only add one of the two cases to the switch or none. In this case, node 20 would be the preferable one, because here we have no additional condition.

Proposal
To solve this problem one has to update _add_new_case_nodes_to_switch_node in missing_case_finder.py. First, one has to add the case constant that we add during this function to the set of existing functions. Second, if two possible cases belong to the same constant one has to find a way to choose the "better" one.

How to reproduce?

test_switch.zip test17

Affected Binary Ninja Version(s)

Version 2.4.2846 (Build ID df7c027e)

[Restructuring] Copy return statements to reduce region size

Proposal

Sometimes, the decompiler does not terminate because the regions we restructure are too large. Furthermore, large regions often have more complicated structures.
One way to reduce the region size and to improve the readability is to copy return statements. (This is also done by IDA and Ghidra.)

Consider function lgetline in exercise2.zip

Before the restructuring the cfg looks at follows:

First, the restructuring restructures the loop region consisting of the nodes 1, 2, 4, 8.
During the loop-successor refinement, we add the nodes 3, 5, 6, 9, 7, 11, 10 to the loop region.
Thus, the loop region consists of all nodes except node 0.

Currently, this leads to the following output:

int lgetline(void * arg1, int arg2) {
    int var_1;
    int var_2;
    int var_3;
    void * var_0;
    var_2 = 0;
    while (true) {
        var_0 = arg1 + var_2;
        var_3 = var_2 + 1;
        if (var_2 < arg2 - 1) {
            var_1 = getchar();
            if (var_1 == 10) {
                *var_0 = (void *)var_1;
                *(arg1 + var_3)/*arg1[var_3]*/ = 0x0;
                arg2 = var_3;
            }
            else {
                if (var_1 != -1) {
                    *var_0 = (void *)var_1;
                    var_2 = var_3;
                    continue;
                }
                *var_0 = 0x0;
                arg2 = var_2;
            }
        }
        else {
            if (var_1 == 10) {
                *var_0 = (char *)var_1;
                var_2 = var_3;
            }
            *(arg1 + var_2)/*arg1[var_2]*/ = 0x0;
            arg2 = var_2;
        }
        return arg2;
    }
}

Other decompilers like Ghidra produce the following output by coping the return statement:

int lgetline(char *s,int lim)

{
  int state;
  int i;
  int c;
  
  i = 0;
  while( true ) {
    if (lim + -1 <= i) {
      if (c == 10) {
        s[i] = '\n';
        i = i + 1;
      }
      s[i] = '\0';
      return i;
    }
    c = getchar();
    if (c == 10) break;
    if (c == -1) {
      s[i] = '\0';
      return i;
    }
    s[i] = (char)c;
    i = i + 1;
  }
  s[i] = (char)c;
  s[i + 1] = '\0';
  return i + 1;
}

Approach

Try to implement a better version of finding loop successors and copy return statements if this helps to add fewer nodes to the loop region.

In the above example, even if we add the return statement to the basic blocks 5 and 9, we have to find a way to figure out where we start with the loop refinement. If we first consider the successor 3, then we still have the whole region.

The optimal outcome for the above example would be to move the return statement in basic block 10 to the basic blocks 9 and 5, remove the edges to basic block 10, and have basic block 3 as a unique successor. This would lead to a much nicer result.

Meaningful Variable Names

Proposal

We could leverage API calls to assign meaningful names to variables because API functions have known signatures including the types and names of parameters. If a variable is used in an API call, we could give it the name of the corresponding parameter if that name is not already used.
If Binary Ninja provides us the parameter names for a given API call, we can maybe use these names to rename the variable.

Approach

No response

Exception UnknownExpression

What happened?

Decompile main:

Traceback (most recent call last):
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/util/widget.py", line 51, in run
    code = self.decompile_for_widget(self.binary_view, self.function)
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/util/widget.py", line 67, in decompile_for_widget
    task = decompiler.decompile(function, options)
  File "/home/user/.binaryninja/plugins/dewolf/decompile.py", line 55, in decompile
    pipeline.run(task)
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/pipeline/pipeline.py", line 97, in run
    instance.run(task)
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/pipeline/commons/expressionpropagationcommons.py", line 47, in run
    while self.perform(task.graph, iteration):
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/pipeline/commons/expressionpropagationcommons.py", line 62, in perform
    self._initialize_maps(graph)
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/pipeline/commons/expressionpropagationcommons.py", line 107, in _initialize_maps
    self._blocks_map[str(instruction)].add((basic_block, index))
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/structures/pseudo/instructions.py", line 170, in __str__
    return f"{self.destination} = {self.value}"
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/structures/pseudo/operations.py", line 429, in __str__
    operands = ", ".join(map(str, self._operands))
  File "/home/user/.binaryninja/plugins/dewolf/decompiler/structures/pseudo/expressions.py", line 190, in __str__
    if self._type.is_boolean:
AttributeError: 'UnknownExpression' object has no attribute 'is_boolean'

How to reproduce?

Use the provided file in forkingserver.zip and decompile main. Tested with

main commit d39078bd2ad51a0eabb80f25c62bd1b93f55f644
57-global-variable commit bbb9290171a8aeb653e44b7f5ee9fd50b24c93e5

Affected Binary Ninja Version(s)

Version 3.0.3233 Personal (Build ID e250d0a3)

[Lifter] MLIL_MUL lifted without expression type

What happened?

When lifting MLIL_MUL expressions, no expr_type it set to account for the type of the result.

How to reproduce?

#include <stdio.h>

int main(){
  long b;
  int c;
  scanf("%lld %d", &b, &c);
  b *= c;
  printf("multiplied: %lld\n", b);
  return 0;
}

Result:

int main(int arg1, char ** arg2, char ** arg3) {
    void var_1;
    int var_0;
    int * var_3;
    long * var_2;
    var_3 = &var_0;
    var_2 = &var_1;
    __isoc99_scanf(/* format */ "%lld %d", var_2, var_3);
    printf(/* format */ "multiplied: %lld\n", var_1 * var_0, var_0);
    return 0;
}

The type of var_1 should be long.

Affected Binary Ninja Version(s)

2.4

[Types] Implement type confidence

Proposal

Implement a confidence value for types similar to the implementations in IDA, binaryninja and Ghidra.
This would allow to resolve conflicts of contradictory type information during type propagation and in the coherence stage.

Approach

Implement a confidence field
Lift confidence values from binaryninja
Utilize confidence in the Coherence stage
Utilize confidence in the TypePropagation stage

[Lifter] Integrate new global variable handling

Proposal

The new lifter (#18) does not reflect the most recent changes to the handling of global variables tested in tests/test_sample_binaries.py.

At the moment, we skipped three tests in this file:

test_global_strings_and_tables
test_global_indirect_ptrs
test_global_import_address_symbol

Approach

No response

Revive TypePropagation

Proposal

As of now, the TypePropagation stage hardly does anything. For example, we do not propagate the type of a pointer to the type it points on.

We should fix this and make this pipelinestage more functional.

Approach

Code from previous approaches:

"""Module implementing horizontal type propagation as a pipeline stage."""
from __future__ import annotations

from collections import Counter
from typing import Iterator, List, Tuple, Optional

from dewolf.pipeline.stage import PipelineStage
from dewolf.structures.graphs.nxgraph import NetworkXGraph
from dewolf.structures.graphs.basic import BasicEdge, BasicNode
from dewolf.structures.graphs.cfg import ControlFlowGraph
from dewolf.structures.pseudo.expressions import Expression, DataflowObject
from dewolf.structures.pseudo.operations import Condition, UnaryOperation, OperationType
from dewolf.structures.pseudo.instructions import Instruction
from dewolf.structures.pseudo.typing import CustomType, Float, Integer, Pointer, Type, UnknownType
from dewolf.task import DecompilerTask


class TypeNode(BasicNode):
    def __init__(self, references: List[Expression]):
        super().__init__(id(self))
        self.references = references

    def __contains__(self, item: Expression) -> bool:
        return item in self.references

    def __len__(self) -> int:
        return len(self.references)


class TypeRelation(BasicEdge):
    def __init__(self, source: TypeNode, sink: TypeNode, level: int):
        super().__init__(source, sink)
        self._level = level

    @property
    def level(self) -> int:
        return self._level


class SubExpressionGraph(NetworkXGraph):
    """Graph class modeling type-relations between expressions."""

    @classmethod
    def from_cfg(cls, cfg: ControlFlowGraph) -> SubExpressionGraph:
        """Generate a TypeGraph by parsing the given ControlFlowGraph."""
        graph = cls()
        for instruction in cfg.instructions:
            graph.add_instruction(instruction)
        return graph

    def add_instruction(self, instruction: Instruction):
        worklist: List[Tuple[DataflowObject, DataflowObject]] = [(instruction, subexpression) for subexpression in instruction]
        while worklist:
            parent, expression = worklist.pop()
            self.add_edge(BasicEdge(parent, expression))
            for child in expression:
                worklist.append((expression, child))


class TypeGraph(NetworkXGraph):

    REFERENCE_TYPE = {
        OperationType.dereference: -1,
        OperationType.address: 1
    }

    def propagate_type(self):
        type = self.get_type()
        node_data = list(self.iterate_node_levels())
        level_types = {level: self._apply_reference_level(type, level) for node, level in node_data}
        for node, level in node_data:
            print(f'type: {type}, level: {level} - {", ".join((str(x) for x in node.references))}')
            for expression in node.references:
                expression._type = level_types[level]

    def get_type(self) -> Type:
        types = list(self._collect_types())
        return TypePropagation.find_common_type(types)

    def get_node_containing(self, expression: Expression) -> Optional[TypeNode]:
        for node in self:
            if expression in node:
                return node

    def iterate_node_levels(self):
        assert len(self.get_roots()) == 1
        todo = [(0, self.get_roots()[0])]
        while todo:
            level, node = todo.pop()
            yield node, level
            for edge in self.get_out_edges(node):
                todo.append((level + edge.reference_offset, edge.sink))

    def _collect_types(self):
        for node, level in self.iterate_node_levels():
            for reference in node.references:
                yield self._apply_reference_level(reference.type, -level)

    def _apply_reference_level(self, type: Type, level: int) -> Type:
        if level == 0:
            return type
        deref = type.copy()
        if level < 0:
            for i in range(-level):
                assert isinstance(deref, Pointer), "Can only dereference pointer types"
                deref = deref.type
        else:
            for i in range(level):
                deref = Pointer(deref.type)
        return deref

    @classmethod
    def from_expression_graph(cls, graph: SubExpressionGraph):
        reference_edges = list(cls.find_reference_edges(graph))
        graph.remove_edges_from(reference_edges)
        typegraph = cls()
        for i, component in enumerate(graph.iter_components()):
            typegraph.add_node(TypeNode((node for node in component if not isinstance(node, Instruction))))
        for edge in reference_edges:
            source = typegraph.get_node_containing(edge.source)
            sink = typegraph.get_node_containing(edge.sink)
            reference_level = cls.REFERENCE_TYPE[edge.source.operation]
            inverted_edge = typegraph.get_edge(sink, source)
            if not inverted_edge or inverted_edge.reference < reference_level:
                typegraph.add_edge(TypeRelation(source, sink, reference_level))
                if inverted_edge:
                    typegraph.remove_edge(inverted_edge)
        return typegraph

    @staticmethod
    def find_reference_edges(graph: SubExpressionGraph) -> Iterator[BasicEdge]:
        return filter(
            lambda edge: isinstance(edge.source, UnaryOperation) and edge.source in [OperationType.dereference, OperationType.address],
            graph.edges
        )


class TypePropagation(PipelineStage):
    """Implements type propagation based on a set of heuristics."""

    name = "type-propagation"

    def run(self, task: DecompilerTask):
        """
        Run type propagation on the given task object.

        We assume there are two types of propagation: Assignment (horizontal) and Operation (vertical).
        Operation-Level propagation is directly implemented into pseudo.Operands through a recursive lookup.
        """
        subexpression_graph = SubExpressionGraph.from_cfg(task.graph)
        for type_graph in self.split_graph(subexpression_graph):
            print(f'TypeGraph: {len(type_graph)}')
            type_graph.propagate_type(task.graph)

    def split_graph(self, graph: SubExpressionGraph) -> Iterator[TypeGraph]:
        self.cut_non_type_edges(graph)
        for component in list(graph.iter_components()):
            print(", ".join((str(x) for x in component)))
            subgraph = graph.subgraph(component)
            yield TypeGraph.from_expression_graph(subgraph)

    def cut_non_type_edges(self, graph: SubExpressionGraph):
        for edge in list(graph.edges):
            if self.blocks_type_propagation(edge.source):
                graph.remove_edge(edge)

    @staticmethod
    def blocks_type_propagation(expression: Expression) -> bool:
        return isinstance(expression, Condition) or (
                isinstance(expression, UnaryOperation) and expression.operation == OperationType.cast)

    @staticmethod
    def find_common_type(types) -> Type:
        histogram = Counter(types)
        most_common_types = sorted(histogram.keys(), reverse=True, key=lambda x: (histogram[x], str(x)))
        if UnknownType() in most_common_types:
            most_common_types.remove(UnknownType())
        for filtered_type in filter(TypePropagation.is_non_primitive_type, most_common_types):
            return filtered_type
        if most_common_types:
            return most_common_types[0]
        return UnknownType()

    @staticmethod
    def is_non_primitive_type(type: Type) -> bool:
        """Check if the given type is primitive, so ew can ignore it."""
        if isinstance(type, Integer) and not isinstance(type, Float):
            return False
        if isinstance(type, Pointer) and type.type == CustomType.void():
            return False
        return True

[Types] Create function pointer type

Proposal

Add support to function pointers.

If a variable is called at some point, the type of it could be set to a function pointer.

Consider the following example:

$ python decompile.py extractme main

...
      int var_0;
...
      memcpy(var_0, 0x804c020, 71);
      var_0();
      var_0 = 0;
...

here, since the var_0 is being called after initialization, we can set its type to function pointer instead of integer.
extractme.zip

Approach

add new type FunctionPointer with the following fields
- name
- return type
- parameter / argument types
update TypePropagation to consider variables that are being called.
update CodeGenerator to generate declarations for the corresponding function pointers if needed.

[Expression Propagation] Propagate address operation into its usages if possible.

Proposal

Currently we avoid propagating of address of a variable into its usages.

Consider the following example:
whatami.zip

$ python decompile.py whatami 0x8048c5c

...
        var_1 = &var_2;
        if (setsockopt(var_0, 1, 2, var_1, 4) < 0) {
            perror("Unable to set socket option REUSEADDR.");
            exit(/* __noreturn */ 1);
...

We could propagate var_1 = &var_2 into setsockopt(var_0, 1, 2, var_1, 4) , so that we have shorter and cleaner code:

...
       if (setsockopt(var_0, 1, 2, &var_2;, 4) < 0) { //<------------------------------ &var_2 is now an argument
           perror("Unable to set socket option REUSEADDR.");
           exit(/* __noreturn */ 1);
...

Allow propagation of variable address into usages of that address if it does not lead to incorrect decompilation.

Approach

Check if it is always safe to propagate addresses or there are any corner cases there. Implement this propagation in ExpressionPropagationMemory.

Some other samples to test with:
say_my_name.zip welcomeMe
test_memory.zip test21

Be careful with
test_memory.zip test26
since it currently produces out-of-ssa error when propagating addresses, which presumably caused by incorrect propagation.

The implemented solution should decompile test26 correctly.

[Codegenerator] Define large/long constants globally

Proposal

At the moment, wedo common subexpression elimination for Constants that have a certain length (configurable).
However, in C, these constants are often defined globally.
We should extend the Code Generator to define constants of a certain length (add a config option for this) globally.

Approach

No response

Consolidate subexpression iteration

Proposal

There are about 8 dedicated functions throughout the current codebase dedicated to iterate all subexpressions of an expression.
I propose to remove this function in favor of the newly implemented Expression.subexpressions() iterator.

    def subexpressions(self) -> Iterator[DataflowObject]:
        """Yield all subexpressions in a depth-first manner."""
        worklist: List[DataflowObject] = [self]
        while worklist and (head := worklist.pop()):
            yield head
            worklist.extend(head)

Approach

Functions like these can be found by searching for worklistor todo.