trailofbits / tourniquet Goto Github PK

View Code? Open in Web Editor NEW

15.0 28.0 2.0 175 KB

A Python library for easy and fast program transformation/repair

Home Page: https://pypi.org/project/tourniquet/

License: Apache License 2.0

CMake 3.86% Dockerfile 0.94% Python 69.21% C 1.78% C++ 22.80% Makefile 1.41%

llvm program-repair program-transformation clang

tourniquet's Introduction

Tourniquet

A Python library for easy C/C++ syntax guided program transformation/repair. This is still very early in development.

Installation

PyPI

Tourniquet is available via PyPI:

$ pip install tourniquet

You'll need a relatively recent CMake and LLVM (9, 10, and 11 are supported).

Docker

Tourniquet can also be built and run within a Docker container:

$ docker build -t trailofbits/tourniquet .
$ docker run -it trailofbits/tourniquet

Enter an ipython instance and import tourniquet

Syntax Guided Program Repair TL;DR

Fixing programs is hard, and one of the most popular techniques for automated repair is search. If you're familiar with fuzzing, it is basically the reverse. In fuzzing you are mutating inputs to cause a crash, but in search based program repair you are mutating the program until you pass some specific tests.

One huge drawback of this search based approach is that the search space is huge and most mutations in it are useless. The idea behind syntax guided program repair is to come up with some generally useful syntax patterns and to use those for your search space. This means your search space is smaller (restricted by the syntax patterns) and you are focusing on patch candidates that might actually fix whatever bug is in front of you.

So What Even Is Tourniquet?

Tourniquet is a library and domain specific language for syntax guided program repair. Current tools have hard coded fix patterns within them, making it hard for humans to interact and tweak them. The goal of Tourniquet is to make it quick and easy to create repair templates that can immediately be used to try and repair bugs. We plan on using Tourniquet alongside program analysis tools to allow humans to create fix patterns that have semantic meaning.

Domain Specific Language

Rather than writing individual tree transform passes or other types of source manipulation, Tourniquet makes it easy to describe part of the syntax and part of the semantics of a repair and lets the computer do the rest. Here is a simple example template:

class YourSemanticAnalysis(Expression):
    def concretize(self, _db, _location):
        yield "SOME_ERROR_CONSTANT"


def your_matcher_func(line, col):
    return True


demo_template = PatchTemplate(
    FixPattern(  # Location 1
        IfStmt(
            LessThanExpr(Variable(), Variable()),  # Location 2
            NodeStmt()  # Location 3
        ),
        ElseStmt(
            ReturnStmt(YourSemanticAnalysis())  # Location 4
        )
    ),
    your_matcher_func  # Location 5
)

Location 1 is the beginning of the FixPattern. The FixPattern describes the overall shape of the repair. This means the human provides part of the syntax, and part of the semantics of the repair.

Location 2 shows some of the different types in the DSL. What this line is describing is a less than statement with two variables, all the variable information is automatically extracted from the Clang AST.

Location 3 is whatever source was matched by your matcher function, also extracted from the Clang AST.

Location 4 is an example of how you could integrate program analysis tools with Tourniquet. The FixPattern is trying to do a basic if...else statement where the else case returns some value. Return values have semantic properties, returning some arbitrary integer isn't usually a good idea. This means you can use some program analysis technique to infer what an appropriate return code might actually be, or simply ask a human to intervene.

Location 5 is for a matcher. The matcher is a callable that is supposed to take source line and column information and return True or False if the FixPatern is applicable to that source location. The idea here is that we couple specific types of fixes with specific types of bugs. We intend to use some other tools (such as Manticore) to help determine bug classes.

Using Tourniquet

# Create a new Tourniquet instance
demo = Tourniquet("test.db")

# Extract info from its AST into the database
demo.collect_info("demo_prog.c")

# Create a new patch template
demo_template = PatchTemplate(
    FixPattern(
        IfStmt(
            LessThanExpr(Variable(), Variable()),
            NodeStmt()
        )
    ),
    lambda x, y: True,
)

# Add the template to the tourniquet instance
demo.register_template("demo_template", demo_template)

# Tell Tourniquet you want to see results from this program, with this template,
# matching against some location
location = Location("demo_prog.c", SourceCoordinate(44, 3))
samples = demo.concretize_template("demo_template", location)

# Look at all the patch candidates!
print(list(samples))

# Attempt to automatically repair the program using that template
# Specify the file, some testcases, and the location information again
demo.auto_patch(
    "demo_template"
    [
        ("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 1),
        ("password", 0)
    ],
    location
)

Auto patch will return True or False depending on if you successfully found a patch to fix all testcases. Eventually we will support having a test case directory etc, this is still early in development.

Check out tourniquet's API documentation for more details.

Development

Install venv to be able to run make commands

$ docker build -t trailofbits/tourniquet .
$ docker run -it trailofbits/tourniquet
root@b9f3a28655b6:/tourniquet# apt-get install -y python3-venv
root@b9f3a28655b6:/tourniquet# python3 -m venv env
root@b9f3a28655b6:/tourniquet# make test

Contributors

Carson Harmon ([email protected])
Evan Sultanik ([email protected])
William Woodruff ([email protected])

Acknowledgements

The project or effort depicted is sponsored by the Air Force Research Laboratory (AFRL) and DARPA under contract FA8750-19-C-0004. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Air Force Research Laboratory (AFRL) and DARPA.

Distribution Statement: Approved for Public Release, Distribution Unlimited

tourniquet's People

Contributors

Stargazers

Watchers

Forkers

manas-embold stjordanis

tourniquet's Issues

Remove debugging statements and add clang format

Some code cleaning, some std::cout <<, turn that into logging and add a .clang-format file for linting etc

Abstract location information

Lots of bits of the tourniquet API currently take at least three parameters: a module_name, line, and col uniquely locating some source code feature. These should be abstracted behind a Location or similar dataclass, allowing us to replace three parameters with one.

Even more aggressively: We could have a Locator class, which can be derived from and supports concretize to produce a concrete Location.

Tourniquet currently treats all inputs as C++

This is wrong; it should intelligently default to C for inputs that end with .c, and C++ for inputs that end with the common C++ suffixes.

Distribute via PyPI

We should build binary wheels for macOS and Linux and distribute them via PyPI.

The dynamic linkage to clang might pose a problem, although auditwheel should be able to vendor the appropriate clang libraries. We'll see.

Use SQLAlchemy for all backend DB operations

We should use SQLAlchemy to abstract the process of table creation, as well as replace the raw queries currently used to access the DB.

LLVM 10 support

We should make sure that Tourniquet builds and functions correctly with Clang and LLVM 10.

PatchLang: More expressions

Some other expressions we could consider:

Function(): Like Variable(). The constructor could take additional concretization constraints, like an optional return type or parameter types.
Call(Function(), Variable(), ..., Variable()) and/or Call(Lit("..."), Variable(), ..., Variable()).
Asmt(Lhs, Expression()) for lhs = <expression>;. Lhs can be either a preexisting variable or a fresh variable; we should think about ways to express the latter.

Add a stub file for `tourniquet.extractor`

Once #34 is merged, we should remove it from mypy's ignores and add a proper stub file for its functions.

In particular, after #25, those interfaces are:

def extract_ast(filename: os.PathLike, is_cxx: bool) -> Dict[str, Any]:
    ...

def transform(filename: os.PathLike, is_cxx: bool, replacement: str, start_line: int, start_col: int, end_line: int, end_col: int) -> bool:
    ...

Extract module name

When running a new file create a new key from module_name --> filename

Add `tourniquet.exceptions`

We should have custom exceptions instead of ValueError, where appropriate.

Auto patch using DB information

Auto patch should be better now with raw stmt information stored into the DB. This is because when we are doing search, as long as the types match what is in the fix patterns it's just guess and check.

In MATE we only had access to variable level source info, now in the DB we have matched nodes with their source level strings so we can just pull from the DB to have much better templates.

Port over and update transform

Port over the C++ code actually responsible for patching, maybe use stringstreams instead of using the actual rewriter this time.

Change setup.py so extractor isn't a standalone module

Export extractor only as tourniquet.extractor rather than having extractor and tourniquet be distinct python modules

Release builds

Don't always build the Python extension in debug mode.

Build process/patch testing

We need to talk, generally, about tourniquet's approach to building source files and testing patches.

As of #37, that approach looks like the following:

Patching: Patches are applied to the original source file (not a copy) within a context manager, and then rolled back when the context manager ends its scope.
Testing: Programs are assumed to be single-file, standalone executables with no special flags. They're passed directly into clang; builds that error are skipped. The auto-patcher assumes that the target program can be tested using inputs on stdin and exits with with an exit code on various different error conditions.

By and large, I think this approach to patching is fine for now: we'll probably want to not patch the original source file further down the line, but hiding it inside a context manager keeps things clean for the time being.

On the other hand, the approach to testing needs to be rethought and redesigned. In particular, we should think about the following:

How are we going to compile source files that belong to a larger (multi-file) program? Are we going to recompile the entire program each time, or attempt to recompile only the particular translation unit with the help of compilation records from blight? How are we going to ensure that we faithfully re-link the modified translation unit?
How are we going to test patched programs? We should probably implement a set of methodologies, with "throw stuff at stdin and see how it exits" being one of them. We should probably come up with an abstract "test" interface that each methodology refines; we can then make the process of selecting the appropriate methodology for a program something that requires human feedback.

Fix the directory structure

We should use a conventional src structure, with src/tourniquet/__init__.py to provide the proper package/module structure instead of tourniquet/tourniquet.py and tourniquet/__init__.py to fix the former up.

Tests for SQL extraction backend

Write tests to confirm that the proper tables are made and that the proper info is in the tables

Add patch view

For different fix patterns, have a view mode as well to see what a non concretized version looks like

Port over PatchLang and create binding to send matcher string

Exporting the AST info gives us some good information to reason about, but there is still a lot of information left in the clang hierarchy. It will be hard to match syntactic patterns without using the clang API.

I think we should still support some type of syntactic matcher expressions like before, but instead of making ASTMatchers bindings we can just have another C++ extension that takes a string and constructs a matcher expression. The code for this should already be in clang-query, so we can take that.

This is still better than what we had before, because

You can actually specify fix patterns etc from the user facing component
You are not limited by syntactic patterns, they are just a suggestion now.
Before the match API call would also collect source info etc about your match, we don't have to do that now
We don't have to send patch-lang type information back and forth, its nice.

Check out what we could do if we had syntactic patterns (str --> match --> True/False) + regular python
I imagine Node.matches actually invokes the C++ extension

This is an example of matching a syntactic pattern (arithmetic operations) and trying to relate it a math poi
that says "yeah thats something we should check out". The replacement pattern says lets replace that node with
some arbitrary BinaryOperator. This means during search, we will try searching for other potential operators.

new_t = template(
     Node.matches(binaryOperator(anyOf(
                             hasOperatorName("<<"),
                             hasOperatorName("+"),
                             hasOperatorName("-"),
                             hasOperatorName("*"),
                             hasOperatorName("/"),
                            ))) and cpg.run_math_poi(Node) == True, 
     pattern(
        Node, 
        BinaryOperator()
      )
    )

Use a template language for Expression and Statement templates

Right now, subclasses of Expression and Statement hardcode their abstract syntax features in the view and concretize methods. To avoid duplication and to make addition of new subclasses easier, these should be lifted to a _template (or similar) class attribute that contains a Handlebar (or other) template.

For example:

class IfStmt(Statement):
    _template = """
    if ({{ cond }}) {
      {{ expr }}
    }
    """

view and concretize could then use _template as appropriate.

Build and auto-publish API docs

We should do almost exactly what blight does: https://github.com/trailofbits/blight/blob/master/.github/workflows/docs-deploy.yml

trailofbits / tourniquet Goto Github PK

tourniquet's Introduction

Tourniquet

Installation

PyPI

Docker

Syntax Guided Program Repair TL;DR

So What Even Is Tourniquet?

Domain Specific Language

Using Tourniquet

Development

Contributors

Acknowledgements

tourniquet's People

Contributors

Stargazers

Watchers

Forkers

tourniquet's Issues

Recommend Projects

Recommend Topics

Recommend Org