Typed TreeBuilder
This piece of work aims to solve three problems, in this priority order:
- Level up the output from Object Introspection as a Library (OIL) to the same level as OI Debugger, enabling new use cases.
- Unify the code path that OIL and OID take in our generated code, increasing maintainability.
- Prevent desynchronisation between the logic that writes to the data segment (generated code) and the logic that reads from it (TreeBuilder/OIL consumers), similarly increasing maintainability.
While these may be a lofty set of goals, I believe by taking this approach we can turn them into one cohesive piece of work.
Glossary
| Term | Definition |
| --- | --- |
| OI | Object Introspection. A system of debug analysis and code generation that enables traversal and information gathering from arbitrary C++ types. |
| OID | Object Introspection Debugger. The current main deployment of OI, attached externally to a running process as a debugger. Produces a rich output detailing the captured object in RocksDB or JSON format. |
| OIL | Object Introspection as a Library. A C++ interface to the OI technology, available in a JIT mode or a compile-time mode. Produces a limited output describing the size of the object. |
| Statically Typed | A type of an object known entirely at compile time, and erased before runtime. |
| Dynamically Typed | A type of an object unknown at compile time, and instead explored at runtime. |
Motivation
OI's traversal of arbitrary type hierarchies presents many opportunities, including compression, computing object differentials at the primitive level, and analysis based on captured memory.
OIL in its current form uses an accumulator to track each element which contributes to the total size. OID instead utilises a data segment to send information about the captured object from the target to the attached debugger. It would be very easy to have OIL instead return a `std::vector<uint8_t>` containing the contents of this data segment, providing access to this already rich data in the end-user application. This is feasible because OIL runs on request in the target process and, unlike OID, can utilise the memory allocator for a dynamic array.

While an array of bytes is a great way to store the data efficiently and share it between processes, it provides an incredibly poor interface for the consumer. Our current consumer is TreeBuilder, a behemoth which infers the contents of the data segment in an entirely different way to the producer. That is, the generated code writes to the data segment based on dynamic logic for containers in `.toml` files, while TreeBuilder reads from the data segment using a statically compiled set of logic.
An ideal system maintains the efficiency of the data segment while presenting a friendly interface to the end user. I present Typed TreeBuilder, which should achieve both of these interface goals.
Approach
Provide statically typed objects which occupy the same amount of space as a `uint8_t*` (8 bytes) while having a rich type in the JIT code. Rely on the optimising compiler to inline the rich types into pointer operations, exactly as they were before in most cases. Provide a second, dynamically typed representation to be used outside of the JIT code which describes the same data and does not need to be so efficiency constrained.
The plan is to implement this in two phases. Phase One will implement types in the data segment and provide a representation of each type to TreeBuilder at runtime, ensuring that we put into and take out of the data segment the same elements. Phase Two will provide the additional information that TreeBuilder requires (type knowledge) through this runtime representation, vastly simplifying TreeBuilder and providing this data through the OIL interface as well. This gives an opportunity to test and evaluate the new system in production before it becomes required for OIL and TreeBuilder, as Phase Two won't be easily gateable.
Type System
The static type system here is simple, based on the Simply Typed Lambda Calculus. Specifically, we require some very primitive types: `Unit` (no data, 0 bytes), `Int`, and `Bool`. Then two composite types exist: `Product` (pair), two data types both present; and `Sum`, a tagged union with only one of the two types present.
$$ T \Coloneqq \texttt{Unit} \mid \texttt{Int} \mid \texttt{Bool} \mid T_1 * T_2 \mid T_1 + T_2 $$
With these simple types we can represent everything in the data segment; this is a sufficient set of types to store everything we store, by combining them to create richer ones. However, we also need a couple more basic types to store values efficiently. `Pointer` makes sense as a number that isn't implemented as a Variable Length Integer (`VarInt`), as pointers will effectively always use most of their bits. Secondly, we'll need a `List` primitive for lists of values of known length: implementation constraints make fully recursive data types impossible (e.g. $[T] \Coloneqq \texttt{Unit} + (T * [T])$), and lists of that form are spatially inefficient anyway. Finally, we'll need to optimise `Sum` to accept $N$ values rather than 2, limiting the number of tag bits required, again for space efficiency.
Specifically this extends our grammar:

$$ T \Coloneqq ... \mid \texttt{Pointer} \mid [T] $$

Which should be sufficient. I'm no mathematician, so we'll leave it at that.
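The tag-bit saving from an $N$-ary `Sum` can be sketched numerically: nesting binary sums spends one tag bit per level (up to $N-1$ bits), while a flat $N$-ary sum needs only $\lceil \log_2 N \rceil$. A minimal sketch, where the `tagBitsFor` helper is hypothetical rather than part of OI:

```cpp
#include <cstdint>

// Hypothetical helper: smallest number of tag bits able to distinguish
// `alternatives` arms of a Sum, i.e. ceil(log2(alternatives)).
constexpr uint8_t tagBitsFor(uint64_t alternatives) {
  uint8_t bits = 0;
  // Smallest `bits` such that (1 << bits) >= alternatives.
  while ((uint64_t{1} << bits) < alternatives)
    ++bits;
  return bits;
}

static_assert(tagBitsFor(2) == 1);    // a plain binary Sum
static_assert(tagBitsFor(5) == 3);    // flat: 3 bits vs 4 nested binary tags
static_assert(tagBitsFor(256) == 8);  // flat: 8 bits vs 255 nested tags
```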
Generated Code Typing Examples
In our generated code we can perform static typing via template metaprogramming. By describing our Data Buffer as a rich type we can describe the possible operations at each stage.
```cpp
namespace StaticTypes {
using String = DataSegment::Containers::Tuple<
    StaticTypes::Bool,
    StaticTypes::VarInt,
    StaticTypes::VarInt>;
}  // namespace StaticTypes

StaticTypes::Unit getSizeType(
    const std::string& t,
    StaticTypes::String out) {
  // SSO iff the data pointer lands inside the std::string object itself.
  bool sso = ((uintptr_t)t.data() < (uintptr_t)&t + sizeof(std::string)) &&
             ((uintptr_t)t.data() >= (uintptr_t)&t);
  return out.write(sso, t.capacity(), t.size());
}
```
A `String` is represented as a `Tuple` because a fixed number of fields of different types are used. In this case we use a `Bool` to describe whether Short-String Optimisation (SSO) is occurring, a `VarInt` for the capacity of the string, and a `VarInt` for the size. A possible implementation of a string `getSizeType` function is also shown, which confirms SSO with a pointer check before writing the three elements into the `Tuple` and returning the new offset as an unwritable `Unit` type.
In the current implementation on `main`, primitives are completely ignored, as only information that can't be recreated from the type information is stored in the data segment. We'll keep that for these examples, representing the underlying C++ primitive types as our static `Unit` type.
The class `Foo` would be represented in static types as follows:
```cpp
class Foo {
  std::string name;
  uint64_t age;
  bool isEnabled;
  std::string nickname;
};

using Foo_DS = StaticTypes::Tuple<
    StaticTypes::String,
    StaticTypes::Unit,
    StaticTypes::Unit,
    StaticTypes::String>;
```
Following the philosophy of the data segment, we store the (relevant) fields of a struct as a tuple, as all that matters is that there are a fixed number of fields of potentially different types. A specific `Struct` type is unnecessary, as a struct is simply a collection of fields - the metadata can be re-added later. This `Tuple` will need to be automatically generated by codegen for arbitrary structs, but hopefully it is clear that this can be achieved with some quite simple recursion.
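To make that recursion concrete, here is a minimal sketch of a codegen-style fold over tuple members, used here for space bookkeeping. The names (`kMaxSize`, the 10-byte `VarInt` bound) are illustrative assumptions, not the real OI implementation:

```cpp
#include <cstddef>

// Sketch: each static type reports the maximum bytes it may write to the
// data segment; Tuple folds over its members. Names are hypothetical.
namespace StaticTypes {
struct Unit   { static constexpr size_t kMaxSize = 0; };
struct Bool   { static constexpr size_t kMaxSize = 1; };
struct VarInt { static constexpr size_t kMaxSize = 10; };  // 64 bits, 7/byte

template <typename... Members>
struct Tuple {
  // C++17 fold expression sums the members' worst-case footprints.
  static constexpr size_t kMaxSize = (size_t{0} + ... + Members::kMaxSize);
};

using String = Tuple<Bool, VarInt, VarInt>;
}  // namespace StaticTypes

// Foo {string, uint64_t, bool, string} maps to Tuple<String, Unit, Unit, String>.
using Foo_DS = StaticTypes::Tuple<
    StaticTypes::String,
    StaticTypes::Unit,
    StaticTypes::Unit,
    StaticTypes::String>;

static_assert(StaticTypes::String::kMaxSize == 21);  // 1 + 10 + 10
static_assert(Foo_DS::kMaxSize == 42);               // 21 + 0 + 0 + 21
```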
OIL Interface Typing Examples
For OIL we'll provide a new interface, alongside the existing ObjectIntrospection::getObjectSize
, which produces the full OI interface to the type. This will look something like:
```cpp
template <class T>
ObjectIntrospection::DataReader
ObjectIntrospection::introspect(
    const T& object,
    std::vector<uint8_t>& buffer);
```
The `DataReader` provides a compromise between the underlying byte array and a high-level representation. While it's still a seeking data structure, it provides rich information about the underlying type and can compute values where necessary. Further, we can cache values as in the current Type Graph work.

Important here is the correlation between the type described in the previous section (Generated Code Typing Examples) and the C++ `DataReader` object. I propose a function on each of the statically typed generated-code types that produces the equivalent dynamic type.
For example:
```cpp
using String = StaticTypes::Tuple<
    StaticTypes::Bool,
    StaticTypes::VarInt,
    StaticTypes::VarInt>;

namespace RuntimeTypes {
using Dynamic = std::variant<
    RuntimeTypes::Tuple,
    RuntimeTypes::Bool,
    RuntimeTypes::VarInt,
    ...>;
}  // namespace RuntimeTypes

RuntimeTypes::Dynamic String::describe() {
  return Tuple(
      Bool::describe(), VarInt::describe(), VarInt::describe());
}
```
Here `String::describe()` won't be explicitly implemented; it will instead be provided automatically by the `Tuple::describe` logic. By implementing this we can insert into and extract from the data segment the same information as we currently can, making this the dividing line of Phase One. At the end of Phase One we have type-safe serialisation/deserialisation between the JIT code and TreeBuilder.
Phase Two moves on to providing all of the output TreeBuilder currently produces, derived from a `RuntimeTypes::Dynamic` object. This is what will bring OIL up to parity.
```cpp
class RuntimeTypes::Struct : public Tuple {
 public:
  Struct(const char* name, std::span<const char*> fieldNames, ...)
      : Tuple(...), name_(name), fieldNames_(fieldNames) {
    assert(Tuple::size() == fieldNames_.size());
  }

 private:
  const char* name_;
  std::span<const char*> fieldNames_;
};

class Foo_DS : public StaticTypes::Tuple<
                   StaticTypes::String,
                   StaticTypes::Unit,
                   StaticTypes::Unit,
                   StaticTypes::String> {
  static RuntimeTypes::Dynamic describe() {
    return Struct("Foo", {"name", "age", "isEnabled", "nickname"},
                  Tuple::describe());
  }
};
```
It's plausible, though skipped here for the sake of brevity (whoops), to make these describe functions and constructors `constexpr`. They could then be computed at compile time and stored in the text segment. Particularly for compile-time OIL, which may have these values available but seldom used, keeping the information in the binary instead of generating it dynamically could reduce the resident memory burden. This can be explored in the future. Similar features, like creating a static string pool in the JIT code, can keep things memory efficient.
TreeBuilder Typing Examples
With some rearchitecting, TreeBuilder becomes incredibly simple. If we use the OIL APIs (through a bit of fiddling) in OID, we can access all of the information we need for TreeBuilder in a ready-to-go form.
Pseudocode for an implementation:
```cpp
std::function<ObjectIntrospection::DataReader(uint8_t*, size_t)> getDataReader =
    lookupInTextSegment(mangle(ObjectIntrospection::getDataSegmentReader<A, B, C>));

uint8_t* dataBlob = getDataSegmentStart();
size_t dataSize = getDataSegmentSize();

ObjectIntrospection::DataReader dataReader = getDataReader(dataBlob, dataSize);

recurse: {
  node.write("typePath", dataReader.typePath); // etc...
  for (ObjectIntrospection::DataReader& child : dataReader.children()) {
    recurse(child);
  }
}
```
The complexity here is mapping the text segment in both the target and the debugger. Once that is complete, TreeBuilder in OID becomes a small shim around the functionality already provided to OIL. That is, we can share a TreeBuilder implementation between OIL (the programmer's context) and OID (the debugger), allowing any programmer to exercise the full power of OI. This is particularly useful for uploading the OIL output to a database in a similar way to OID, but with significantly more control over where it is called.
Extensions
So far has been on gaining parity between OIL and what OID currently does in production. There are several experimental features that haven't been described, but I think fit well into this design. Although this won't be completed in the same piece of work, it is important to have enough of a design now that we know it won't become impossible after this work.
Part of this will likely involve a further explosion of template parameters. For example, we might start seeing `bool MemoryCapture = false` in the templates of `getSizeType` functions, followed by `if constexpr (MemoryCapture) { ... }` within the function. An alternative would be to do something like:
```cpp
#include <cstdint>
#include <iostream>

namespace FeatureFlags {
enum FeatureFlags : uint8_t {
  None = 0,
  CaptureMemory = 1 << 1,
  ChaseRawPointers = 1 << 2,
};
}  // namespace FeatureFlags

template <uint8_t Flags>
void foo() {
  if constexpr ((Flags & FeatureFlags::CaptureMemory) != 0) {
    std::cout << "CaptureMemory" << std::endl;
  }
  if constexpr ((Flags & FeatureFlags::ChaseRawPointers) != 0) {
    std::cout << "ChaseRawPointers" << std::endl;
  }
  std::cout << "foo" << std::endl;
}

int main() {
  foo<FeatureFlags::CaptureMemory>();
  foo<FeatureFlags::ChaseRawPointers>();
  foo<FeatureFlags::None>();
  foo<FeatureFlags::CaptureMemory | FeatureFlags::ChaseRawPointers>();
}
```
This will be quick to resolve at compile (JIT) time as there is no template recursion, while allowing a good amount of flexibility and fast runtime execution with features turned off. It will also prevent adding a new template parameter every time an on-or-off feature is added, and allow us to reuse the same JIT code for multiple types with different feature sets - important for compile-time OIL. More so, it'll mean we can use the same lump of code whether we have features turned on or off at a given callsite, keeping codegen relatively simple. This compares to using `ifdef`s or the like, which would increase the complexity of codegen, particularly when multiple probes are codegenned as one blob.
To fit this piece in entirely with our type system, we need the ability to know at compile time what our output will look like in the face of feature flags (memory capture being the specific example). This is achievable in C++ using `std::conditional`, enabling us to decide with the same template arguments what the output will look like.
```cpp
template <uint8_t Flags>
typename std::conditional<
    (Flags & FeatureFlags::CaptureMemory) != 0,
    uintptr_t,
    void>::type
foo() { ... }
```
As we can change the output type by modifying the call, this type framework shouldn't impede any new features. You will need to establish how your new feature affects the output types, but really you should know this anyway...
Memory Capture
Memory capture in OIL can exist in two forms:
- Providing pointers into the type that can be dereferenced in ordinary C++ code based on the information captured.
- Dumping the entire type hierarchy's primitives for post-processing and analysis.

Both of these fit cleanly with the above. Pointers involve adding a `std::conditional`-gated `Pointer` everywhere it's relevant. Dumping primitives involves changing the `Unit` type of the primitive `getSizeType` to `Int` or `Bool`, again conditionally gated.
### OIL Pointer Chasing
My recommended way to implement this would be a provided function which validates pointers. In the current two cases, no pointer chasing would be implemented as `bool check(uintptr_t p) { return false; }` and `--chase-raw-pointers` would be implemented as `bool check(uintptr_t p) { return true; }`.
A pointer validation function would further allow us to have a dynamically determined set of pointers which are valid to follow, or anything else the user deems appropriate. We would pass this through as an additional argument to `getSizeType`, allowing it to be implemented either statically or dynamically. In all likelihood it would always be static, but the option would be nice. The type system helps significantly here, as we'd have to represent a pointer that may or may not be chased as some sort of `Optional<T> = Unit + T`, which can easily be implemented in the type system and removes that risk of desync.
Risk
Recursive templates like the $T_1 * T_2$ logic can be very slow to compile with a large number of parameters, and for large C++ types we may have dozens of fields, perhaps hundreds. To alleviate this risk I will make large types one of the first test cases; if compilation is too slow I can stop early and take a different approach.
We also risk breakage in changing existing behaviour that perhaps hasn't been touched in a long time. However, our (relatively) comprehensive test suite reduces this risk. Further, the risk after this is landed of such breakages in the future is much lower, so this should be worth it longer term.