The Bitcode Database (bcdb)

A database and infrastructure for distributed processing of LLVM bitcode.

The Bitcode Database (BCDB) is a research tool being developed as part of the ALLVM Project at UIUC. It has the following subprojects:

  • MemoDB: A content-addressable store and a memoizing distributed processing framework backed by SQLite or RocksDB. It can cache the results of various analyses and optimizations.
  • BCDB proper: Builds on MemoDB to store massive amounts of LLVM bitcode and automatically deduplicate the bitcode at the function level. All the other subprojects are built on BCDB.
  • Guided Linking: A tool that can optimize dynamically linked code as though it were statically linked.
  • Outlining: A work-in-progress optimization to reduce code size using outlining.
  • SLLIM: An easy-to-use tool to apply various code size optimizations (including our outliner) to existing software without messing around with build systems.
  • Nix bitcode overlay: Nix expressions to automatically build lots of Linux packages in the form of LLVM bitcode.
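
The content-addressed storage, function-level deduplication, and memoization described above can be illustrated with a toy model. This is only a sketch, not MemoDB's or BCDB's actual API: the `ContentStore` class, its method names, the use of SHA-256, and the placeholder function bodies are all assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: values are keyed by their SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.calls = {}  # (analysis name, input digest) -> result digest

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

    def memoized(self, name, func, input_digest):
        """Run func on the stored input unless a cached result already exists."""
        key = (name, input_digest)
        if key not in self.calls:
            self.calls[key] = self.put(func(self.get(input_digest)))
        return self.get(self.calls[key])


store = ContentStore()

# Two hypothetical "modules" that share one identical function body.
module_a = [b"define i32 @0() { ret i32 1 }", b"define i32 @0() { ret i32 2 }"]
module_b = [b"define i32 @0() { ret i32 1 }"]
refs_a = [store.put(body) for body in module_a]
refs_b = [store.put(body) for body in module_b]
unique_functions = len(store.blobs)  # the shared body is stored once

runs = []


def byte_count(data: bytes) -> bytes:
    """Stand-in for an expensive analysis; records each real execution."""
    runs.append(1)
    return str(len(data)).encode()


first = store.memoized("byte_count", byte_count, refs_a[0])
second = store.memoized("byte_count", byte_count, refs_b[0])  # served from cache
```

Because both modules refer to the shared function by the same digest, the second `memoized` call finds the cached result and the analysis runs only once.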

Background

The BCDB has been developed primarily by Sean Bartell, to support his PhD research on code size optimization.

This project initially grew out of the ALLVM Project, started by Will Dietz, which aims to explore the new possibilities that would be enabled if all software on a computer system were shipped in the form of LLVM IR. The BCDB was originally designed as a way to store massive amounts of LLVM IR more efficiently by using deduplication.

Install

mkdir build
cd build
cmake ..
make
make check

Dependencies

Building BCDB without Nix is not officially supported. If you want to try it anyway, you'll need to install these dependencies first:

  • C++ compiler with C++17 support.
  • LLVM version 11 through 14 (development versions up to 15 may work, but this is not guaranteed)
    • LLVM must be built with exception handling support. Official packages have this disabled, so you'll need to build LLVM yourself with cmake -DLLVM_ENABLE_EH=ON (or use Nix).
    • LLVM's "FileCheck" and "not" programs must be installed as well. Some packages (including some of LLVM's official packages) exclude these programs or split them off into a separate package.
    • When working on the BCDB code, you should make sure LLVM is built with assertions enabled (cmake -DLLVM_ENABLE_ASSERTIONS=ON).
  • Clang, same version as LLVM.
    • Clang doesn't need exception handling or assertions, so you can use an official Clang package.
  • CMake, at least version 3.13.
  • Libsodium
  • SQLite
  • Python, at least 3.6.
  • Boost, at least 1.75.
  • Optional dependencies:
    • RocksDB, preferably at least 6.19, with LZ4 and Zstandard support (ROCKSDB_LITE is not supported).

Building dependencies automatically with Nix

If you have Nix installed, it can automatically build BCDB along with known-working versions of its dependencies. See default.nix for the list of attributes you can build. For example:

nix-build -A bcdb
result/bin/bcdb -help

If you want to modify the BCDB code, you can instead build just the dependencies with Nix, and enter a shell that has them installed:

nix-shell -A bcdb
mkdir build
cd build
cmake ..
make

If you install and enable direnv, it will effectively set up the Nix shell every time you enter the bcdb directory.

In any case, you can speed up Nix by using our Cachix cache, which includes prebuilt versions of LLVM. Simply install Cachix and run cachix use bcdb.

Usage

See the subproject subdirectories for usage instructions and more documentation.

Maintainer

Sean Bartell.

Contributing

There's no formal process for contributing. You're welcome to submit a PR or contact [email protected].

License

Apache License 2.0 with LLVM Exceptions, copyright 2018–2022 Sean Bartell and other contributors. See license in LICENSE.TXT.

bcdb's Issues

Handle DICompileUnits properly

Currently, splitting and joining causes DICompileUnits to be duplicated so each function gets its own copy of the compile unit. To fix this, we need to use the remainder module to keep track of which compile units are actually the same.

Outlining code bugs

There are various cases in the outlining code that I haven't fully thought through, and some of them are probably handled incorrectly. If we try to perform actual outlining, this will lead to incorrect code being generated.

  • various TODOs and FIXMEs in lib/Outlining
  • tests should be much more thorough
  • if the original function has an address space, section, comdat, or garbage collector, how should we handle them?
  • parameter attributes may be handled incorrectly
  • function attributes may be handled incorrectly
    • see constructFunction in llvm/lib/Transforms/Utils/CodeExtractor.cpp
  • metadata may be handled incorrectly
    • debug
    • TBAA
    • noalias
    • callback
    • llvm.loop
    • prof

Splitting debug info is too inefficient

When applied to a module with megabytes of debug metadata, the splitter is extremely slow and uses way too much memory.

Backtrace:

llvm::MDNode::operator new
llvm::DISubprogram::getImpl
llvm::DISubprogram::cloneImpl
llvm::MDNode::clone
MDNodeMapper::mapTopLevelUniquedNode
Mapper::mapMetadata [clone .part.318]
Mapper::mapMetadata
Mapper::remapInstruction
Mapper::remapFunction
llvm::ValueMapper::remapFunction
llvm::RemapFunction
ExtractFunction
bcdb::SplitModule

Detect violated constraints

For debugging purposes, we should add checks at run time and raise an error if any of the constraints are violated. How to do this is explained in the paper.

Security implications of guided linking

One concern I have, in looking at guided linking, is that it potentially causes a huge increase in the ROP surface of all programs linked using it. If it were possible to specify portions of the optimized set which, in the optimized output, must not have any (transitive) dependency between them, this could significantly alleviate the issue.

To illustrate, let's start with the simple optimized set given in Fig. 1 of the paper: program1 IR needs library IR; program2 IR needs library IR. If we consider the case where program1 is some normal program expected to be run by unprivileged users, and program2 is a tiny helper program that is SetUID in order to obtain specific resources it then hands off to program1, but both use the same set of libraries, then guided linking may significantly reduce the overall security of the system.

If, however, it was possible to state that no dependency relationship may be created from code in program2 IR to code in program1 IR (with being in the same merged library counting as a dependency relationship in both directions), this problem could be avoided.
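
One way to picture the requested constraint is as a reachability check over a dependency graph. The following is a rough sketch under stated assumptions, not something the guided linker provides: the function names, the graph representation, and the idea of modelling a merged library as a group of mutually dependent nodes are all hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start by following dependency edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def find_violations(needs, merged_groups, forbidden):
    """List the forbidden (source, target) pairs that end up connected.

    needs: mapping from a unit to the units it depends on.
    merged_groups: sets of units whose code shares one merged library;
        membership counts as a dependency in both directions.
    forbidden: (source, target) pairs that must stay unconnected.
    """
    graph = {unit: set(deps) for unit, deps in needs.items()}
    for group in merged_groups:
        for unit in group:
            graph.setdefault(unit, set()).update(group - {unit})
    return [(src, dst) for src, dst in forbidden if dst in reachable(graph, src)]


needs = {"program1": {"library"}, "program2": {"library"}}
forbidden = [("program2", "program1")]

# Without merging, program2 only reaches the library.
separate = find_violations(needs, [], forbidden)
# If program1 code is placed in the same merged library, the constraint is violated.
merged = find_violations(needs, [{"program1", "library"}], forbidden)
```

In this model, the SetUID helper stays isolated only as long as no code from program1 is placed in a merged library that program2 loads.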

Upgrade FalseMemorySSA

lib/outlining/FalseMemorySSA.cpp is based on MemorySSA.cpp from LLVM 12. LLVM 13 has a few improvements to this file, which are probably worth copying over.

bcdb invalidate is too slow

When invalidating a function with >10,000 cached results, bcdb invalidate is extremely slow (many minutes, CPU-bound). It's much faster to just run sqlite3 /path/to/bcdb 'DELETE FROM call WHERE fid = -1;', even though that should be equivalent. Probably the BCDB's connection setting pragmas (maybe the write-ahead log?) are making it slow.
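
To test the pragma hypothesis in isolation, the same DELETE can be timed against a scratch database under different journal modes. This is a minimal sketch, not BCDB's code: only the table name `call` and the column `fid` come from the command above, while the other columns, the row count, and the helper name `time_bulk_delete` are assumptions. It does not reproduce the pragmas BCDB actually sets.

```python
import os
import sqlite3
import tempfile
import time


def time_bulk_delete(journal_mode, rows=20000):
    """Fill a scratch database, then time one bulk DELETE under a journal mode.

    Returns (seconds elapsed, number of rows left in the table).
    """
    with tempfile.TemporaryDirectory() as scratch_dir:
        conn = sqlite3.connect(os.path.join(scratch_dir, "scratch.sqlite"))
        try:
            conn.execute(f"PRAGMA journal_mode = {journal_mode}")
            # Hypothetical schema: only `call` and `fid` appear in the issue.
            conn.execute("CREATE TABLE call (fid INTEGER, args BLOB, result BLOB)")
            conn.executemany(
                "INSERT INTO call VALUES (?, ?, ?)",
                ((-1 if i % 2 else i, b"args", b"result") for i in range(rows)),
            )
            conn.commit()

            start = time.perf_counter()
            conn.execute("DELETE FROM call WHERE fid = ?", (-1,))
            conn.commit()
            elapsed = time.perf_counter() - start

            remaining = conn.execute("SELECT COUNT(*) FROM call").fetchone()[0]
            return elapsed, remaining
        finally:
            conn.close()


# Compare the rollback journal ("DELETE" mode) with the write-ahead log.
results = {mode: time_bulk_delete(mode) for mode in ("DELETE", "WAL")}
```

If both modes finish quickly at a realistic row count, the slowdown more likely comes from how bcdb invalidate issues its deletions than from the journal mode itself.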

Consider giving names to split functions

Because split functions have no name, the globalopt pass (included in opt -O1) deletes them. If we give all the split functions a standardized name (like f) this won't be a problem. However, any name we choose could potentially conflict with other names used by the program.

Another option: store split functions without a name, but give users the option to add a name when retrieving a function from the BCDB.
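
For the second option, the name supplied at retrieval time would still need to avoid symbols already present in the destination program. A minimal sketch of that check follows; the helper name `pick_function_name` and the numeric-suffix scheme are assumptions for illustration, not existing BCDB behavior.

```python
def pick_function_name(preferred, used_names):
    """Return preferred, or preferred plus a numeric suffix if it is taken."""
    if preferred not in used_names:
        return preferred
    suffix = 1
    while f"{preferred}.{suffix}" in used_names:
        suffix += 1
    return f"{preferred}.{suffix}"


# Names already defined by the program the function is being added to.
names_in_program = {"main", "f", "f.1"}
chosen = pick_function_name("f", names_in_program)  # "f.2"
```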

Split functions which use blockaddresses normally

In the normal case, function @f may have blockaddresses stored in global constant @g, which is only used by function @f. The obvious solution in this case is to put @g in the split module along with @f.

In the general case, blockaddresses for function @f may be used by other functions, but this should be extremely rare and we don't need to handle it well.

Support ThinLTO

Currently the guided linker combines the entire merged library into a single module. For large sets of software, optimizing and compiling this module is very slow (e.g., LLVM+Clang takes several hours). We should add ThinLTO support so the merged library can be optimized faster.
