osandov / drgn

Programmable debugger

License: Other

Python 50.35% C 48.40% Makefile 0.21% Shell 0.14% M4 0.43% Awk 0.47%

drgn's Introduction

drgn

drgn (pronounced "dragon") is a debugger with an emphasis on programmability. drgn exposes the types and variables in a program for easy, expressive scripting in Python. For example, you can debug the Linux kernel:

>>> from drgn.helpers.linux import list_for_each_entry
>>> for mod in list_for_each_entry('struct module',
...                                prog['modules'].address_of_(),
...                                'list'):
...    if mod.refcnt.counter > 10:
...        print(mod.name)
...
(char [56])"snd"
(char [56])"evdev"
(char [56])"i915"

Although other debuggers like GDB have scripting support, drgn aims to make scripting as natural as possible so that debugging feels like coding. This makes it well-suited for introspecting the complex, inter-connected state in large programs.

Additionally, drgn is designed as a library that can be used to build debugging and introspection tools; see the official tools.

drgn was developed at Meta for debugging the Linux kernel (as an alternative to the crash utility), but it can also debug userspace programs written in C. C++ support is in progress.

Documentation can be found at drgn.readthedocs.io.

Installation

Package Manager

drgn can be installed using the package manager on some Linux distributions.

Fedora >= 32
```
$ sudo dnf install drgn
```
RHEL/CentOS >= 8

Enable EPEL. Then:
```
$ sudo dnf install drgn
```
Arch Linux

Install the drgn package from the AUR.

Debian >= 12 (Bookworm)
```
$ sudo apt install python3-drgn
```
openSUSE
```
$ sudo zypper install python3-drgn
```
Ubuntu

Enable the michel-slm/kernel-utils PPA. Then:
```
$ sudo apt install python3-drgn
```

pip

If your Linux distribution doesn't package the latest release of drgn, you can install it with pip.

First, install pip. Then, run:

$ sudo pip3 install drgn

This will install a binary wheel by default. If you get a build error, then pip wasn't able to use the binary wheel. Install the dependencies listed below and try again.

Note that RHEL/CentOS 6, Debian Stretch, Ubuntu Trusty, and Ubuntu Xenial (and older) ship Python versions which are too old. Python 3.6 or newer must be installed.

From Source

To get the development version of drgn, you will need to build it from source. First, install dependencies:

Fedora

$ sudo dnf install autoconf automake check-devel elfutils-devel gcc git libkdumpfile-devel libtool make pkgconf python3 python3-devel python3-pip python3-setuptools

RHEL/CentOS
```
$ sudo dnf install autoconf automake check-devel elfutils-devel gcc git libtool make pkgconf python3 python3-devel python3-pip python3-setuptools
```
Optionally, install libkdumpfile-devel from EPEL on RHEL/CentOS >= 8 or install libkdumpfile from source if you want support for the makedumpfile format.

Replace dnf with yum for RHEL/CentOS < 8.

Debian/Ubuntu

$ sudo apt install autoconf automake check gcc git liblzma-dev libelf-dev libdw-dev libtool make pkgconf python3 python3-dev python3-pip python3-setuptools zlib1g-dev

Optionally, install libkdumpfile from source if you want support for the makedumpfile format.

Arch Linux

$ sudo pacman -S --needed autoconf automake check gcc git libelf libtool make pkgconf python python-pip python-setuptools

Optionally, install libkdumpfile from the AUR or from source if you want support for the makedumpfile format.

openSUSE

$ sudo zypper install autoconf automake check-devel gcc git libdw-devel libelf-devel libkdumpfile-devel libtool make pkgconf python3 python3-devel python3-pip python3-setuptools

Then, run:

$ git clone https://github.com/osandov/drgn.git
$ cd drgn
$ python3 setup.py build
$ sudo python3 setup.py install

See the installation documentation for more options.

Quick Start

drgn debugs the running kernel by default; run sudo drgn. To debug a running program, run sudo drgn -p $PID. To debug a core dump (either a kernel vmcore or a userspace core dump), run drgn -c $PATH. Make sure to install debugging symbols for whatever you are debugging.

Then, you can access variables in the program with prog['name'] and access structure members with .:

$ sudo drgn
>>> prog['init_task'].comm
(char [16])"swapper/0"

You can use various predefined helpers:

>>> len(list(bpf_prog_for_each()))
11
>>> task = find_task(115)
>>> cmdline(task)
[b'findmnt', b'-p']

You can get stack traces with stack_trace() and access parameters or local variables with trace['name']:

>>> trace = stack_trace(task)
>>> trace[5]
#5 at 0xffffffff8a5a32d0 (do_sys_poll+0x400/0x578) in do_poll at ./fs/select.c:961:8 (inlined)
>>> poll_list = trace[5]['list']
>>> file = fget(task, poll_list.entries[0].fd)
>>> d_path(file.f_path.address_of_())
b'/proc/115/mountinfo'

See the user guide for more details and features.

Getting Help

The GitHub issue tracker is the preferred method to report issues.
There is also a Linux Kernel Debuggers Matrix room.

License

drgn is licensed under the LGPLv2.1 or later.

drgn's People


jeffmahoney sdimitro codesun minwooim shushen nathandialpad mishuang2017 naota htejun mkarrman delphix prakashsurya jwadams machshev amlannayak ofaaland iatapps kuan-li gobenji albertygu rdna jgkamat ethercflow zeta1999 dennisszhou sirspudd liu-song-6 keithglidewell balakrishnasai shahraaz-cn brenns10 jianlin-lv kamalesh-babulal pythoncheatsheet ssun3 alastor-erinyes b-xiang shiloong davide125 luv2c0d3 danieljordan10 keyingliu x-lugoo santakd slapurmoma5 achievezheng arch-zheng peilin-ye chenyingk svetlitski duanjiong mrcodechef nobitanobi alakesh mic92 chengyuanlicy ssahgal gbaf yydzhou xwlan dipeshchouhan iuriimattos2 linecode mu-l mbrukman liguang-li alexlzhu qmonnet mykola-lysenko prozak ammarfaizi2 byte-lab yibit linux-kern ajor alviroiskandar bergwolf imran-kn mrvan paulz-98 weber-wenbo-wang lsgunth var52yt doytsujin rtadepalli fengjixuchui nhatsmrt shunghsiyu chenshanpei gkmccready michel-slm inspirationhello hitmoon svens-s390 wengang-oracle boryas hbcbh1999 longjohncoder jakehillion soez

drgn's Issues

Can't compile on Fedora30

Hi,
I have installed python3-devel on Fedora and still get the error with both command lines:

#sudo yum install python3-devel
Package python3-devel-3.7.4-1.fc30.x86_64 is already installed.
Dependencies resolved.

#python3 setup.py build
#python3.7m setup.py build
.............................................................................
In file included from ../../libdrgn/python/error.c:4:
../../libdrgn/python/drgnpy.h:9:10: fatal error: Python.h: No such file or directory
.............................................................................

Debugging the kernel doesn't start: invalid `Elf' handle

I followed the install-guide. After that I started it via sudo drgn -k and got:

Traceback (most recent call last):
  File "/usr/bin/drgn", line 11, in <module>
    load_entry_point('drgn==0.0.1', 'console_scripts', 'drgn')()
  File "/usr/lib/python3.7/site-packages/drgn-0.0.1-py3.7-linux-x86_64.egg/drgn/internal/cli.py", line 90, in main
    prog.load_default_debug_info()
_drgn.FileFormatError: libelf error: invalid `Elf' handle

Any idea what I'm doing wrong?

Support DW_FORM_ref_addr in DWARF index

dwz uses DW_FORM_ref_addr to deduplicate between compilation units, but our DWARF index code doesn't support DW_FORM_ref_addr. It'll take a little bit of refactoring to support that, as we'll need to look up the new CU for a ref_addr.

Make missing debug info warnings noisier

I've gotten a few reports that drgn wasn't able to find a variable that were actually due to missing debug info. Clearly the warnings aren't noticeable enough. Let's make them noisier somehow (color them red? print them after the automatic imports?).

radix_tree_for_each doesn't work with flag XA_FLAGS_ALLOC1

We have the following data structure:

struct mlx5_tc_ct_priv {
...
struct idr fte_ids;
struct xarray tuple_ids;
...
};

I can print fte_ids without problem. But I can't print tuple_ids. The error message is:

Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    load_entry_point('drgn==0.0.4', 'console_scripts', 'drgn')()
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.4-py3.7-linux-x86_64.egg/drgn/internal/cli.py", line 129, in main
    runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
  File "/usr/lib64/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib64/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./test.py", line 53, in <module>
    for node in radix_tree_for_each(tuple_ids.address_of_()):
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.4-py3.7-linux-x86_64.egg/drgn/helpers/linux/radixtree.py", line 73, in radix_tree_for_each
    yield from aux(node, 0)
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.4-py3.7-linux-x86_64.egg/drgn/helpers/linux/radixtree.py", line 68, in aux
    index + (i << parent.shift.value_()),
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.4-py3.7-linux-x86_64.egg/drgn/helpers/linux/radixtree.py", line 67, in aux
    cast(parent.type_, slot).read_(),
_drgn.FaultError: address is not mapped: 0x42c

When initializing the xarray with flag XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1, it works:

(1, Object(prog, 'void *', value=0xffff9c884caf9628))
(2, Object(prog, 'void *', value=0xffff9c884caf9710))
(3, Object(prog, 'void *', value=0xffff9c884caf8228))
(4, Object(prog, 'void *', value=0xffff9c884caf8310))

Support C++ inheritance

At a glance, we need to:

Add a list of base classes to drgn.Type/struct drgn_type. This needs to store the base class type, the location of the base class in the derived class, and the access and virtual specifiers of the inheritance.
Include inheritance when pretty printing types.
Parse the DWARF information for inheritance. See section 5.7.3 (Derived or Extended Structures, Classes and Interfaces) in the DWARF 5 spec.
Handle members in derived classes. I think we want Type.members to only contain members added in the derived class itself; the user can access all members by also looking at the list of base classes (which is thankfully how it's represented in DWARF). However, Type.member() and Object.<member> need to check the members of the base class(es). This needs care to be consistent with name lookups in C++.
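
The lookup rule in the last item can be sketched in plain Python. `SketchType` and its fields are hypothetical stand-ins, not drgn's API, and the sketch deliberately ignores virtual inheritance, access control, and ambiguous names; it only shows `members` holding what the class itself declares while lookup falls back to the base classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchType:
    """Hypothetical stand-in for a class type; this is not drgn.Type."""
    name: str
    # Only the members declared in this class itself (member name -> type name).
    members: Dict[str, str] = field(default_factory=dict)
    # Direct base classes, in declaration order.
    bases: List["SketchType"] = field(default_factory=list)


def find_member(type_: SketchType, name: str) -> Optional[str]:
    """Return the type name of member `name`, searching base classes too."""
    # A member declared in the derived class hides same-named base members.
    if name in type_.members:
        return type_.members[name]
    # Otherwise fall back to the base classes, depth first.
    for base in type_.bases:
        found = find_member(base, name)
        if found is not None:
            return found
    return None


base = SketchType("Base", {"id": "int"})
derived = SketchType("Derived", {"name": "char *"}, [base])
print(find_member(derived, "id"))  # int
```

Keeping `members` limited to the derived class mirrors how DWARF represents it, at the cost of a recursive walk on every lookup.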

drgn doesn't handle XZ compressed kernel modules

Built and installed from source on fedora 30:

$ cd /usr/lib/modules/5.1.18-300.fc30.x86_64/kernel 
$ ls
arch  crypto  drivers  fs  kernel  lib  mm  net  security  sound  virt
$ rpm -qf arch
kernel-core-5.1.18-300.fc30.x86_64
$ sudo drgn
could not get debugging information for:
kernel (could not find vmlinux)
/usr/lib/modules/5.1.18-300.fc30.x86_64/kernel/drivers/net/usb/ipheth.ko.xz (could not get section addresses: libelf error: invalid `Elf' handle)
/usr/lib/modules/5.1.18-300.fc30.x86_64/kernel/crypto/ccm.ko.xz (could not get section addresses: libelf error: invalid `Elf' handle)
/usr/lib/modules/5.1.18-300.fc30.x86_64/kernel/drivers/input/misc/uinput.ko.xz (could not get section addresses: libelf error: invalid `Elf' handle)
/usr/lib/modules/5.1.18-300.fc30.x86_64/kernel/fs/fuse/fuse.ko.xz (could not get section addresses: libelf error: invalid `Elf' handle)
... 167 more
drgn 0.0.1 (using Python 3.7.4)
For help, type help(drgn).
>>> import drgn
>>> from drgn import cast, container_of, execscript, NULL, Object, reinterpret
>>> from drgn.helpers.linux import *
>>>
>>> find_task(prog, 1).cred.user.locked_vm
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.1-py3.7-linux-x86_64.egg/drgn/helpers/linux/pid.py", line 117, in find_task
    return pid_task(find_pid(prog_or_ns, pid), prog['PIDTYPE_PID'])
  File "/usr/local/lib64/python3.7/site-packages/drgn-0.0.1-py3.7-linux-x86_64.egg/drgn/helpers/linux/pid.py", line 35, in find_pid
    ns = prog_or_ns['init_pid_ns'].address_of_()
KeyError: 'init_pid_ns'

prog.symbol() finds static functions by address but not by name

Example case - the static function slab_bug cannot be found by name in prog.symbol()

>>> prog.symbol('slab_bug')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
LookupError: could not find symbol with name 'slab_bug'

But if we take the address of its Object and pass that to the same function, the symbol is returned:

>>> prog.symbol(prog['slab_bug'].address_of_())
Symbol(name='slab_bug', address=0xfffffffface933c0, size=0xb5)

Not sure if this is on purpose but I did find it surprising.

Add accessibility (public, private, protected) to drgn.TypeMember

We need to:

Add the accessibility to struct drgn_type_member and drgn.TypeMember (probably as an enum).
Include it when pretty printing a type.
Parse it from DWARF.

From section 5.7.6 in the DWARF 5 spec:

A data member entry may have a DW_AT_accessibility attribute. If no accessibility attribute is present, private access is assumed for an member of a class and public access is assumed for an member of a structure, union, or interface.

Section 2.8 defines the accessibility codes (DW_ACCESS_public, DW_ACCESS_private, and DW_ACCESS_protected).
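
As a rough sketch of the defaulting rule quoted above, the following Python models the accessibility as an enum. The enum and function names are hypothetical (nothing here is part of drgn's API); the numeric values are the DW_ACCESS_* codes from the DWARF spec.

```python
import enum
from typing import Optional


class Accessibility(enum.Enum):
    """Hypothetical enum; values match the DW_ACCESS_* codes in DWARF."""
    PUBLIC = 1     # DW_ACCESS_public
    PROTECTED = 2  # DW_ACCESS_protected
    PRIVATE = 3    # DW_ACCESS_private


def member_accessibility(parent_kind: str,
                         dw_at_accessibility: Optional[int]) -> Accessibility:
    """Resolve a member's accessibility, applying the DWARF defaults."""
    # An explicit DW_AT_accessibility attribute always wins.
    if dw_at_accessibility is not None:
        return Accessibility(dw_at_accessibility)
    # Without the attribute: private for classes, public otherwise.
    if parent_kind == "class":
        return Accessibility.PRIVATE
    return Accessibility.PUBLIC


print(member_accessibility("class", None))   # Accessibility.PRIVATE
print(member_accessibility("struct", None))  # Accessibility.PUBLIC
```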

Parse and format default function arguments

d35243b added a representation for default function arguments in the drgn API, but we don't actually parse them from DWARF. As of this writing, neither GCC nor Clang actually emits the required DWARF, but we should prepare for it if we can.

Parsing

Section 4.1 in the DWARF 5 spec defines how default arguments are represented:

A DW_AT_default_value attribute for a formal parameter entry. The value of this attribute may be a constant, or a reference to the debugging information entry for a variable, or a reference to a debugging information entry containing a DWARF procedure. If the attribute form is of class constant, that constant is interpreted as a value whose type is the same as the type of the formal parameter. If the attribute form is of class reference, and the referenced entry is for a variable, the default value of the parameter is the value of the referenced variable. If the reference value is 0, no default value has been specified. Otherwise, the attribute represents an implicit DW_OP_call_ref to the referenced debugging information entry, and the default value of the parameter is the value returned by that DWARF procedure, interpreted as a value of the type of the formal parameter.

For a constant form there is no way to express the absence of a default value.

The constant form is easy. The variable reference form is also relatively easy, except it doesn't specify how the type of the variable is related to the type of the parameter: are they supposed to be the same type, or are we supposed to convert it to the parameter type if the variable type differs? It's hard to tell without real producers. A reference to a DWARF procedure is obviously going to require DWARF expression support.

Formatting

We also currently don't include the default arguments when printing a function type. We can probably reuse the struct/array initializer formatting code to format parameters with default arguments.

Better document add_type_finder and add_object_finder

What's the purpose of these functions? Please add examples to the docs.

Support stack traces for kdump files

drgn uses the NT_PRSTATUS ELF note in core dumps to get the initial registers for stack unwinding. (I think?) kdump files include this note as well, but libkdumpfile doesn't have an interface to get it. It should suffice to add kdump_prstatus_raw() to libkdumpfile similar to kdump_vmcoreinfo_raw(), although we might want a generic kdump_note_raw() instead.

Support static members

Static members are basically variables defined within the scope of a class. We should include them in drgn.Type/struct drgn_type. The question is, should they be in Type.members or in a separate list? If the former, how do we distinguish them from normal members? In libdrgn, we'd probably need a bool is_static. In Python, we could probably make Type.bit_offset and Type.offset be None for static members.

In DWARF, static members are represented by DW_TAG_variable DIE children of a DW_TAG_{struct,union,class} DIE. See section 5.7.7 (Class Variable Entries) of the DWARF 5 spec.

Support looking for kernel modules/vmlinux in a separate base directory

Currently, we hardcode a few standard paths where vmlinux and kernel modules can be found (vmlinux_paths and module_paths in linux_kernel.c). If the files are not available at their standard locations (e.g., because we're debugging a vmcore from another machine), then the current workaround is to specify each file with -s. However, it would be more convenient to be able to specify a base directory to use instead of /lib/modules/$(uname -r).

OverflowError thrown by prog.symbol()

We were hitting an issue with sdb that looked like this:

sdb> stacks
TASK_STRUCT        STATE             COUNT
==========================================
0xffffffff862423c0
sdb encountered an internal error due to a bug. Here's the
information you need to file the bug:
----------------------------------------------------------
Target Info:
        ProgramFlags.IS_LIVE|IS_LINUX_KERNEL
        Platform(<Architecture.X86_64: 1>, <PlatformFlags.IS_LITTLE_ENDIAN|IS_64_BIT: 3>)

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/internal/repl.py", line 85, in eval_cmd
    for obj in invoke(self.target, [], input_):
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/pipeline.py", line 165, in invoke
    yield from execute_pipeline(first_input, pipeline)
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/pipeline.py", line 83, in execute_pipeline
    yield from massage_input_and_call(pipeline[-1], this_input)
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/pipeline.py", line 43, in massage_input_and_call
    yield from cmd.call(objs)
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/command.py", line 169, in call
    result = self._call(objs)
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/commands/stacks.py", line 285, in _call
    sym = sdb.get_symbol(frame_pc)
  File "/usr/local/lib/python3.6/dist-packages/sdb-0.1.0-py3.6.egg/sdb/target.py", line 85, in get_symbol
    return prog.symbol(sym)
OverflowError: int too big to convert
----------------------------------------------------------
Link: https://github.com/delphix/sdb/issues/new

The issue wouldn't always be there which I found very weird.
Doing some printf-debugging though I found a pointer that
seemed to cause the issue consistently in the drgn REPL:

>>> prog.symbol(0xffffffff862423c0)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
OverflowError: int too big to convert

Looking at the code and the recent commits the error seemed
to stem from here:

~/drgn$ git diff
diff --git a/libdrgn/python/util.c b/libdrgn/python/util.c
index d0dfb2c..6c7ff2b 100644
--- a/libdrgn/python/util.c
+++ b/libdrgn/python/util.c
@@ -92,9 +92,11 @@ int index_converter(PyObject *o, void *p)
        index_obj = PyNumber_Index(o);
        if (!index_obj)
                return 0;
+
        if (arg->is_signed) {
                arg->svalue = PyLong_AsLongLong(index_obj);
                Py_DECREF(index_obj);
+               printf("serapheim test -> signed exit!\n");
                return (arg->svalue != -1LL || !PyErr_Occurred());
        } else {
                arg->uvalue = PyLong_AsUnsignedLongLong(index_obj);

>>> prog.symbol(0xffffffff862423c0)
serapheim test -> signed exit!
Traceback (most recent call last):
  File "<console>", line 1, in <module>
OverflowError: int too big to convert

The above was probably caused by this commit:

commit 2561226918555ad0d2d12ea6d3ed4cc9026c91d7
Author: Omar Sandoval <[email protected]>
Date:   Fri Nov 29 20:40:40 2019 -0800

    libdrgn: python: add signed integer support to index_converter

    This is preparation for the next change.

Applying the following change deals with the issue:

~/drgn$ git diff
diff --git a/libdrgn/python/program.c b/libdrgn/python/program.c
index fec3d2f..caea315 100644
--- a/libdrgn/python/program.c
+++ b/libdrgn/python/program.c
@@ -722,7 +722,7 @@ static PyObject *Program_symbol(Program *self, PyObject *args, PyObject *kwds)
 {
        static char *keywords[] = {"address", NULL};
        struct drgn_error *err;
-       struct index_arg address;
+       struct index_arg address = {};
        struct drgn_symbol *sym;
        PyObject *ret;

>>> prog.symbol(0xffffffff862423c0)
Symbol(name='__schedule', address=0xffffffff86242100, size=0x870)

The reason is that since address is not explicitly initialized,
it gets a garbage value from the previous stack frame.
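
To see why the signed path fails for this particular address, here is a small Python check (an illustration only, not drgn code). Kernel addresses like the one above have the top bit set, so they fit in an unsigned 64-bit integer but not in a signed one; that is consistent with PyLong_AsLongLong raising OverflowError when the is_signed branch is taken by accident.

```python
ADDRESS = 0xFFFFFFFF862423C0

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_signed_64(value: int) -> bool:
    """True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def fits_unsigned_64(value: int) -> bool:
    """True if value is representable as an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


print(fits_signed_64(ADDRESS))    # False
print(fits_unsigned_64(ADDRESS))  # True
```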

Remove template arguments from drgn.Type.tag

As noted by 352c31e, drgn.Type.tag includes the template arguments because the DW_AT_name attribute in DWARF includes them. tag should just be what the C standard calls tag and what the C++ standard calls class-head-name: the identifier after the struct, union, or class keyword.

To do this, we need to strip everything starting at the < character when parsing DWARF for a structure, union, or class. But this means that we can't use the string directly in the mmap'd DWARF anymore; we'll have to allocate it and track it to deallocate it (maybe interning it in a set, maybe just keeping them all in a vector). Or, we can edit the mmap, but that might get awkward with the DWARF index.

Then, when formatting a type with template parameters, we need to include those after the tag.
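
The string handling itself is simple; a minimal Python sketch follows. The function name is hypothetical, and the real change would live in libdrgn's C DWARF parser, where the allocation concerns above apply.

```python
from typing import Tuple


def split_template_args(dw_at_name: str) -> Tuple[str, str]:
    """Split a DW_AT_name string into (tag, template arguments)."""
    # Everything from the first '<' onward is the template argument list.
    tag, sep, rest = dw_at_name.partition("<")
    return tag, sep + rest


print(split_template_args("vector<int, std::allocator<int> >"))
# ('vector', '<int, std::allocator<int> >')
print(split_template_args("list_head"))
# ('list_head', '')
```

Returning the stripped suffix as well keeps the information needed to print the template parameters after the tag when formatting the type.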

Document, install, and add tests for tools/

#49 added the first drgn-based tool. There's some followup work to be done there:

We should document the tools directory (e.g., what belongs there).
We should install these tools somewhere when installing drgn.
We should add test cases that can run in vmtest to make sure the script works on all kernel versions.

drgn -c doesn't work: not an ELF file

drgn -k works, but I also want to debug a core dump, and that doesn't work:

$ file vmcore
vmcore: Kdump compressed dump v6, system Linux, node dev-r630-04, release 4.19.36+, version #96 SMP Thu Aug 15 14:50:12 CST 2019, machine x86_64, domain lab.mtl.com
$ drgn -c ./vmcore
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    load_entry_point('drgn==0.0.1', 'console_scripts', 'drgn')()
  File "/usr/local/lib64/python3.6/site-packages/drgn-0.0.1-py3.6-linux-x86_64.egg/drgn/internal/cli.py", line 83, in main
    prog.set_core_dump(args.core)
_drgn.FileFormatError: not an ELF file

Maybe this is because the vmcore is compressed. If that is the reason, is there a way to uncompress it?
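
One way to check which kind of file a vmcore is, independent of drgn, is to look at its first bytes. The sketch below is an assumption-laden illustration: the ELF magic is standard, but the 8-byte `KDUMP   ` signature for makedumpfile's compressed format is taken from memory of that format rather than from this page, and the function names are made up. A file in the makedumpfile format needs drgn built with libkdumpfile, as the installation notes mention.

```python
ELF_MAGIC = b"\x7fELF"
# Assumption: makedumpfile's compressed format starts with this signature.
KDUMP_SIGNATURE = b"KDUMP   "


def classify_dump_header(header: bytes) -> str:
    """Classify a dump by its leading bytes: 'elf', 'kdump', or 'unknown'."""
    if header.startswith(ELF_MAGIC):
        return "elf"
    if header.startswith(KDUMP_SIGNATURE):
        return "kdump"
    return "unknown"


def classify_dump(path: str) -> str:
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return classify_dump_header(f.read(8))


print(classify_dump_header(b"KDUMP   "))  # kdump
```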

ValueError: not an ELF core file

Hi all,
I installed libkdumpfile and drgn in a Debian Buster Docker container, but when I run drgn -k, the following error is shown:
root@f2871d330b6d:~# drgn -k
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/drgn/internal/cli.py", line 115, in main
    prog.set_kernel()
ValueError: not an ELF core file

BTW, this is my first time using drgn to debug the running kernel. Is it not possible to use drgn in a container?

API: Equality Operator for StackFrame

Two distinct StackFrame objects that seem to have the same fields (but originated from different stack traces) are not equal.

Example:

>>> a
<_drgn.StackFrame object at 0x7fa6551968f0>
>>> a.pc
18446744071656044545
>>> a.symbol()
Symbol(name='__schedule', address=0xffffffff8599f570, size=0x891)
>>> b
<_drgn.StackFrame object at 0x7fa655196b10>
>>> b.pc
18446744071656044545
>>> b.symbol()
Symbol(name='__schedule', address=0xffffffff8599f570, size=0x891)
>>> a.pc == b.pc
True
>>> a.symbol() == b.symbol()
True
>>> a == b
False

libkdumpfile doesn't work with drgn

I am using gcc version 7.3.1, and whenever I build libkdumpfile together with the latest drgn, it is never detected.

I am currently using CentOS 7 with a 4.19 kernel.

[root@bl-vsnap-100 drgn]# drgn -c /var/crash/127.0.0.1-2020-06-18-17\:18\:49/vmcore
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    load_entry_point('drgn==0.0.5', 'console_scripts', 'drgn')()
  File "/usr/local/lib64/python3.6/site-packages/drgn-0.0.5-py3.6-linux-x86_64.egg/drgn/internal/cli.py", line 113, in main
    prog.set_core_dump(args.core)
ValueError: drgn was built without libkdumpfile support

[root@bl-vsnap-100 ~]# sdb /usr/lib/debug/lib/modules/4.19.119-2c.el7.x86_64/vmlinux /var/crash/127.0.0.1-2020-06-18-17\:18\:49/vmcore
Traceback (most recent call last):
  File "/usr/local/bin/sdb", line 11, in <module>
    load_entry_point('sdb==0.1.0', 'console_scripts', 'sdb')()
  File "/usr/local/lib/python3.6/site-packages/sdb-0.1.0-py3.6.egg/sdb/internal/cli.py", line 213, in main
    prog = setup_target(args)
  File "/usr/local/lib/python3.6/site-packages/sdb-0.1.0-py3.6.egg/sdb/internal/cli.py", line 158, in setup_target
    prog.set_core_dump(args.core)
ValueError: drgn was built without libkdumpfile support

manylinux wheels

Hello again,

I've written some scripts which generate manylinux2010 wheels for drgn. These wheels require no compilation to install, and they bundle the dependencies (such as libkdumpfile) so that it is quite easy to install on any reasonably recent Linux distro (tested on RHEL/OL6+, Ubuntu 20.04).

Among my coworkers, the difficulty to build and install drgn plus libkdumpfile hinders adoption. Especially on older systems such as RHEL/OL 6/7, due to older compiler versions, etc. I think that distributing wheels might simplify installation enough to encourage new users to try out drgn and lower the bar to entry from Crash. Both libkdumpfile and drgn are GPLv3, so there shouldn't be any license concerns with wheel distribution.

Would you be interested in a PR containing these scripts?

Make object pretty printing more flexible

The current Object.__str__() implementation has nice defaults for interactive use (e.g., it automatically dereferences pointers, includes the type name, etc.). However, for scripted use, it'd be nice to have more flexibility.

The following is a list of formatting options that I could come up with (* indicates new functionality):

Include member names when printing structs/unions/classes
Include indices when printing arrays *
Dereference pointers
Format char [] and char * as string literals
Format char as character literals *
Include type names
Include zero-initialized members/elements
"Symbolize" pointers (i.e., include symbol+offset) *

It should be possible to specify these options separately for the top-level object, members, and elements, where applicable (e.g., whether to include type names).

We should define default options (possibly user-defined?) to use for Object.__str__().

There are a couple of options for the interface. One is an Object.format_() method that takes flags. E.g.,

print(obj.format_(FormatObjectFlags.DEREFERENCE |
                  FormatObjectFlags.STRING |
                  FormatObjectFlags.STRING_MEMBERS |
                  FormatObjectFlags.STRING_ELEMENTS |
                  FormatObjectFlags.TYPE_NAME |
                  FormatObjectFlags.TYPE_NAME_MEMBERS))

print(obj.format_(prog.default_format_object_flags &
                  ~FormatObjectFlags.DEREFERENCE))

Pros:

Clear semantics

Cons:

Very verbose
Not convenient for using default options with some minor modifications

Another option is defining our own format specification syntax via Object.__format__() that could then be used via f-strings and str.format(). E.g., with a syntax inspired by chmod:

# Initial = resets all options, remaining options enable [d]ereference,
# [s]tring, and [t]ype name for the [o]bject, [m]embers, and/or [e]lements.
print(f'{obj:=,d=o,s=ome,t=om}')

# Use the defaults sans the [d]ereference [o]bject flag.
print(f'{obj:d-o}')
# Equivalent:
print('{:d-o}'.format(obj))
print(format(obj, 'd-o'))

Pros:

Much more concise
Easy to make minor adjustments to default flags

Cons:

Somewhat opaque syntax
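
To make the second option more concrete, here is a rough Python sketch of how such a spec could be parsed into per-flag target sets. Everything in it is an assumption for illustration: the function name, the default values, and the `+` operator (the examples above only show `=` and `-`). It is not an implemented drgn feature.

```python
import re

FLAGS = "dst"    # [d]ereference, [s]tring, [t]ype name
TARGETS = "ome"  # [o]bject, [m]embers, [e]lements
_CLAUSE = re.compile(r"([dst]?)([=+-])([ome]*)")


def apply_format_spec(spec, defaults):
    """Return {flag: set of targets} after applying `spec` to `defaults`."""
    options = {flag: set(defaults.get(flag, "")) for flag in FLAGS}
    for clause in spec.split(","):
        if not clause:
            continue
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"invalid format clause: {clause!r}")
        flag, op, targets = match.groups()
        if not flag:
            # A bare "=" resets every option; other bare operators are invalid.
            if op != "=" or targets:
                raise ValueError(f"invalid format clause: {clause!r}")
            for name in FLAGS:
                options[name] = set()
        elif op == "=":
            options[flag] = set(targets)   # set exactly these targets
        elif op == "+":
            options[flag] |= set(targets)  # assumed operator: add targets
        else:
            options[flag] -= set(targets)  # remove targets
    return options


# Hypothetical defaults, chosen only for the demonstration.
DEFAULTS = {"d": "o", "s": "ome", "t": "om"}
print(apply_format_spec("d-o", DEFAULTS)["d"])  # set()
```

An `Object.__format__()` implementation could call a parser like this with the spec string it receives and then format according to the resulting sets.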

context of bad reference object is not present when FaultError is thrown

Context

When the value of a reference-type object is determined, we generate a FaultError if the object is stored at an address which is not mapped. This happens when calling .read_() on it, which happens implicitly in most cases (e.g. printing, doing math). This may happen via a significantly different code path than when the reference-type object was instantiated.

Example problem

For example, one class may call some_int = pointer_to_struct_obj.some_member, to "dereference" an integer-type member. However, since a reference-type object is created, drgn does not read from the target address space at this time. The pointer_to_struct_obj's value is determined (e.g. it points to memory address 0x1234), but some_int's value is not known (e.g. we only know that it is at memory address 0x1235). Therefore, some_int, a reference-type object to an integer, may be stored at an invalid memory address (i.e. one that is not mapped by the target kernel, dump, or process).

The class mentioned above may pass this "integer" (actually a reference object to an integer at target memory address 0x1235) to a different, unrelated class, which then uses the "integer", by doing math with it, printing it, etc. All of these implicitly call some_int.read_(), which attempts to read from the referenced address in the target. If the address is not mapped, a FaultError will be thrown. At best, this FaultError can identify that we tried to read an object of type X (e.g. long long int) from address Y (e.g. 0x1235). However, from the programmer's point of view, the problem was that pointer_to_struct_obj was a bad pointer, and we dereferenced it by doing pointer_to_struct_obj.some_member. Unfortunately, this information is not available at the time the FaultError is generated (specifically, the type and value of pointer_to_struct_obj, and the name of the member being accessed some_member).

An even more confusing example involves reference pointers. When doing some_int = pointer_to_struct_obj.some_member, it may be that pointer_to_struct_obj is a reference object, i.e. we do not know the value of the pointer, only where it is stored. We need to determine its value now, so we attempt to read from the target, generating a FaultError if it is not mapped. This is confusing because the problem is not that the struct we are trying to dereference is at a bad address, but rather that the pointer to the struct is stored at a bad address.

Proposed Solution

I propose that we change the semantics of drgn.Object.member_() such that if the memory address referenced by the newly-created reference object is not mapped by the target, we generate a FaultError at this time. The FaultError should include the type of the struct and name, type and address of the member that's being accessed.

This should probably also be extended to the creation of reference objects in general. In the more general case, we may not be able to include much additional information in the FaultError, but at least the exception will be thrown when the reference object is created, rather than by the likely unrelated code that uses the reference object.
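
The proposed behavior can be sketched with stand-ins in plain Python. `SketchFaultError`, `member_address`, and the list of mapped ranges are all hypothetical; drgn's real objects and address translation are more involved. The point is only that the check happens when the reference is created, so the error can name the struct and member.

```python
class SketchFaultError(Exception):
    """Hypothetical stand-in for drgn's FaultError."""


def member_address(mapped_ranges, struct_type, struct_address,
                   member_name, member_type, offset, size):
    """Return the member's address, failing early if it is not mapped."""
    address = struct_address + offset
    # The whole member must lie inside one mapped [start, end) range.
    if not any(start <= address and address + size <= end
               for start, end in mapped_ranges):
        raise SketchFaultError(
            f"cannot access {struct_type}.{member_name} ({member_type}) "
            f"at {address:#x}: address is not mapped"
        )
    return address


mapped = [(0xFFFF888000000000, 0xFFFF888000001000)]
try:
    member_address(mapped, "struct foo", 0x1234, "some_member", "int", 1, 4)
except SketchFaultError as e:
    print(e)
```

The trade-off is an extra mapping check on every member access, in exchange for an error raised next to the code that followed the bad pointer.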

eBPF maybe?

Hello, just found your project while browsing LKML on new memory slab manager.

After looking into drgn, I am curious: have you heard anything about eBPF? I just want to understand what drgn can do that eBPF can't. Wouldn't it be a better option to put your priceless effort into the bcc project?

It's just a suggestion, not any kind of criticism.

drgn could use libkdumpfile for more crash dumps

drgn has issues with regular kernel core files; things like module data, etc. aren't readable. By default, drgn only uses libkdumpfile for "KDUMP" files. I was able to work around this by:

--- a/libdrgn/program.c
+++ b/libdrgn/program.c
@@ -195,7 +195,7 @@ drgn_program_set_core_dump(struct drgn_program *prog, const char *path)
        err = has_kdump_signature(path, prog->core_fd, &is_kdump);
        if (err)
                goto out_fd;
-       if (is_kdump) {
+       if (is_kdump || 1) {
                err = drgn_program_set_kdump(prog);
                if (err)
                        goto out_fd;

which just forces the use of kdumpfile, but that prevents user-space core file handling and is a bit of a hack. Some way to use kdumpfile for kernel cores would be nice, since it handles all of the complicated address space stuff.

Find load addresses of manually-reported debug information for userspace programs

Currently, debug information reported for a userspace program is reported at the dummy address range of [0, 0). It should be possible to determine the load address by correlating the build ID of the reported file to the build ID in the core dump/live program memory, similar to how libdwfl does it in dwfl_segment_report_module(). We could either implement similar logic in userspace_report_debug_info(), or preferably factor out the functionality in libdwfl so that we could use it from userspace_report_debug_info().

Support getting local variables from stack traces

In my opinion, this is the biggest feature missing from drgn. I'd love to be able to do something like:

trace = prog.stack_trace(...)
for frame in trace:
    if frame.symbol().name == 'vfs_read':
        print(d_path(frame['file'].f_path))

Additionally, @htejun requested an interface to list all of the local variables available in a given frame, which is helpful in places where you're fighting the compiler's optimizations.

With the libdwfl stack trace patches, the low-level pieces are all there (unless we end up needing to extend libdwfl's DWARF expression support). We just need to map the program counter to the scope in DWARF and look up the variable in each containing scope.
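The scope lookup described above can be sketched with a toy model. This is not drgn or libdwfl code: `Scope`, `scopes_containing`, `find_local`, and `locals_at` are hypothetical names, and a real implementation would walk DWARF DIEs (subprograms and lexical blocks) rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Scope:
    """Toy stand-in for a DWARF scope covering [low_pc, high_pc)."""
    low_pc: int
    high_pc: int
    variables: Dict[str, str] = field(default_factory=dict)
    children: List["Scope"] = field(default_factory=list)


def scopes_containing(root: Scope, pc: int) -> List[Scope]:
    """Return the chain of nested scopes containing pc, outermost first."""
    chain = []
    scope: Optional[Scope] = root
    while scope is not None and scope.low_pc <= pc < scope.high_pc:
        chain.append(scope)
        scope = next((child for child in scope.children
                      if child.low_pc <= pc < child.high_pc), None)
    return chain


def find_local(root: Scope, pc: int, name: str) -> str:
    """Look up name starting from the innermost scope containing pc."""
    for scope in reversed(scopes_containing(root, pc)):
        if name in scope.variables:
            return scope.variables[name]
    raise KeyError(name)


def locals_at(root: Scope, pc: int) -> Dict[str, str]:
    """List every variable visible at pc; inner scopes shadow outer ones."""
    visible: Dict[str, str] = {}
    for scope in scopes_containing(root, pc):
        visible.update(scope.variables)
    return visible
```

`locals_at` corresponds to the requested "list all locals in a frame" interface; `find_local` corresponds to the `frame['file']` lookup.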

Cc: @dennisszhou

Use drgn-specific exceptions

@prakashsurya pointed out that drgn throwing generic Python exceptions makes it hard to differentiate between a Python programming error and drgn-specific errors, like out of date helper code. We might want to add drgn-specific exceptions.

One concern with, e.g., changing ValueError to drgn.ValueError is that existing scripts out there may already be depending on the existing exceptions. One way to handle this would be to make the drgn-specific exceptions inherit from the generic exception; existing code will continue working, and newer code can catch the more specific exception.
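A minimal sketch of the inheritance approach. The class names `DrgnError` and `DrgnValueError` are hypothetical and are not part of drgn's API; `parse_type_name` is an illustrative function only.

```python
class DrgnError(Exception):
    """Hypothetical common base class for drgn-specific errors."""


class DrgnValueError(DrgnError, ValueError):
    """Hypothetical drgn-specific error that is still a ValueError."""


def parse_type_name(name: str) -> str:
    """Illustrative function that raises the more specific exception."""
    if not name.strip():
        raise DrgnValueError("empty type name")
    return name.strip()


def old_style_caller(name: str) -> str:
    # Existing scripts that catch the generic exception keep working.
    try:
        return parse_type_name(name)
    except ValueError:
        return "<invalid>"


def new_style_caller(name: str) -> str:
    # Newer code can tell drgn errors apart from ordinary Python bugs.
    try:
        return parse_type_name(name)
    except DrgnError:
        return "<drgn error>"
```

Multiple inheritance is what lets one exception satisfy both `except ValueError` and `except DrgnError`.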

missing docs in sdist

When building the docs from the sdist on PyPI:

+ sphinx-build-3 docs html
Running Sphinx v3.5.2
making output directory... done
WARNING: html_static_path entry '_static' does not exist
WARNING: favicon file 'favicon.ico' does not exist
[...]

These are indeed missing from the tarball and should be included.

Awkward error messages in certain inputs of prog.type()

This isn't a huge deal but I figured I'd report it for future reference.

>>> prog.type('struct')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<string>", line None
SyntaxError: expected identifier after '(null)'

The input to the command may be stupid, but I think the error message is wrong. Did we mean to say expected identifier after 'struct'?

Furthermore, if we give an identifier of a struct type that doesn't exist we get the standard LookupError:

>>> prog.type('struct bogus')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
LookupError: could not find 'struct bogus'

But if we provide another keyword after this the syntax error gets confusing again:

>>> prog.type('struct int')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<string>", line None
SyntaxError: expected identifier after 'int'

>>> prog.type('struct int bogus')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<string>", line None
SyntaxError: expected identifier after 'int'

Again not a big deal as the above are extreme edge cases.

Display discovered variables, constants, and functions

It would be helpful to be able to see the variables, constants, and functions drgn discovered for a particular core file, i.e., which names are valid inside prog[...].

Recent libkdumpfile changes break drgn for kdumps

The latest successful SDB nightly was 2 days ago: https://github.com/delphix/sdb/actions/runs/307613188
Yesterday the nightly failed (https://github.com/delphix/sdb/runs/1265111694?check_suite_focus=true) because when drgn tried to open our regression kdump to run the tests, we got back the error "VMCOREINFO does not contain valid OSRELEASE".

Since it seems like libkdumpfile was the only repo that had changes during that time I've started looking into some of these changes that look suspicious. This could be a regression on the drgn kdump code if the API changed but it could also be a bug in the new libkdumpfile code. In any case I wanted to create this issue here for reference.

Handle gaps in vmemmap in for_each_page()

/proc/kcore has some code for adding vmemmap to /proc/kcore via walk_system_ram_range(). However, that excludes ranges of pages that aren't used for RAM (e.g., ranges that are used for video ROM). I've run into this in drgn. Let's figure out if this is correct, and if not, send a fix to include all of vmemmap.

test_set_pid fails on 32 bit architectures

___________________________ TestProgram.test_set_pid ___________________________
self = <tests.test_program.TestProgram testMethod=test_set_pid>
    def test_set_pid(self):
        # Debug the running Python interpreter itself.
        prog = Program()
        self.assertIsNone(prog.platform)
        self.assertFalse(prog.flags & ProgramFlags.IS_LIVE)
        prog.set_pid(os.getpid())
        self.assertEqual(prog.platform, host_platform)
        self.assertTrue(prog.flags & ProgramFlags.IS_LIVE)
        data = b"hello, world!"
        buf = ctypes.create_string_buffer(data)
>       self.assertEqual(prog.read(ctypes.addressof(buf), len(data)), data)
E       OSError: [Errno 22] Invalid argument
tests/test_program.py:52: OSError

armv7hl failure: https://kojipkgs.fedoraproject.org//work/tasks/7502/65317502/build.log
i686 failure: https://kojipkgs.fedoraproject.org//work/tasks/7503/65317503/build.log

Change in module structure in the upstream kernel breaks drgn

Root Cause

Before version 5.8 the module structure looked like this:

struct module_sect_attr {
	struct module_attribute mattr;
	char *name;
	unsigned long address;
};

From that version onwards, it looks like this:

struct module_sect_attr {
	struct bin_attribute battr;
	unsigned long address;
};

Unfortunately drgn makes the assumption that the name field is part of the module_sect_attr structure within kernel_module_section_iterator_next_offline() when caching the kernel module sections:

static struct drgn_error *
kernel_module_section_iterator_next_offline(struct kernel_module_section_iterator *it,
					    const char **name_ret,
					    uint64_t *address_ret)
{
...
	err = drgn_object_member(&kmod_it->tmp3, &kmod_it->tmp2, "address");
	if (err)
		return err;
...
	err = drgn_object_member(&kmod_it->tmp3, &kmod_it->tmp2, "name"); // <---- Assumption here
	if (err)
		return err;
...
	return NULL;
}

An alternative way of getting the name field in these newer kernels starting from the same struct could be the following:

struct module_sect_attr -> battr -> attr -> name

There should be some conditional logic somewhere to deal with this, ideally based on the fields within the struct and not based on the version of the kernel - this way older kernels that had this patch backported can still avoid this bug.
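A small sketch of field-based (rather than version-based) detection, using plain dictionaries as stand-ins for the two struct layouts. `section_name` is a hypothetical helper; the real fix would live in libdrgn's C code and use its own member lookup.

```python
def section_name(sect_attr: dict) -> str:
    """Return the section name from either layout of module_sect_attr.

    The layout is chosen by checking which fields exist, so kernels with
    the change backported are handled the same way as 5.8 and later.
    """
    if "name" in sect_attr:
        # Older layout: the name is a direct member.
        return sect_attr["name"]
    if "battr" in sect_attr:
        # Newer layout: module_sect_attr -> battr -> attr -> name.
        return sect_attr["battr"]["attr"]["name"]
    raise LookupError("module_sect_attr has neither 'name' nor 'battr'")


# Toy instances of the two layouts shown above.
old_layout = {"mattr": {}, "name": ".text", "address": 0xFFFFFFFFC0000000}
new_layout = {"battr": {"attr": {"name": ".text"}},
              "address": 0xFFFFFFFFC0000000}
```

Both `section_name(old_layout)` and `section_name(new_layout)` resolve to the same section name without consulting the kernel version.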

Symptoms

In crash dumps from a 5.8 or later kernel (and sometimes earlier for distributions like Ubuntu; we hit this on their 5.4 kernel, so it seems like they have backported that patch), we experience an infinite loop because of the above logic.

The specifics of the infinite loop are the following:

1] report_loaded_kernel_module() calls cache_kernel_module_sections() which in turn calls kernel_module_section_iterator_next{,_offline}.
2] The latter returns DRGN_ERROR_LOOKUP here:

static struct drgn_error *
kernel_module_section_iterator_next_offline(struct kernel_module_section_iterator *it,
					    const char **name_ret,
					    uint64_t *address_ret)
{
...
	err = drgn_object_member(&kmod_it->tmp3, &kmod_it->tmp2, "name"); // <---- DRGN_ERROR_LOOKUP
	if (err)
		return err;
...
}

3] This gets us all the way back to report_loaded_kernel_module(), where we try to report that error:

report_loaded_kernel_module(struct drgn_debug_info_load_state *load,
			    struct kernel_module_iterator *kmod_it,
			    struct kernel_module_table *kmod_table)
{
...
	do {
		uint64_t start, end;
		err = cache_kernel_module_sections(kmod_it, kmod->elf, &start, // <---- DRGN_ERROR_LOOKUP returned
						   &end);
		if (err) {
			err = drgn_debug_info_report_error(load, kmod->path, // <---- err variable set to 0 here
							   "could not get section addresses",
							   err);
			if (err)
				return err;
			continue;          // <---- since err is now 0, we repeat steps 1 to 3 over and over on the *same* module - there is no progress
		}
...

This infinite loop may hint at a separate bug, but nevertheless I wanted to report the two problems together.

Add mypy stubs

The bulk of the core drgn Python library is in C extensions, and we don't currently have mypy stubs to do type checking.

Ideally, in order to avoid duplicating the information in two places, we should be able to generate it from the documentation. We might be able to use existing tools like stubgen.

Check that vmlinux and kernel modules match core dump/running system

Right now, we'll blindly accept the debug info passed by the user. We should sanity check that they match the program we're debugging. This issue is specifically for the kernel; userspace is going to need a different implementation.

Ideally, we should be able to check by build ID. Unfortunately, as far as I can tell, there is no easy way to get the build ID of vmlinux from either /proc/kcore or a vmcore. We probably want to add the build ID to the VMCOREINFO note, but in the meantime we can check OSRELEASE. See the discussion in delphix/sdb#41.

I think we can already get the build ID for kernel modules via the sections we get from sysfs or the modules variable in the kernel.

As mentioned in the sdb issue, there should also be a way to override these sanity checks and load the debug information anyway.
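As a rough sketch of the interim OSRELEASE check with an override, assuming the VMCOREINFO note has already been read as text (it consists of KEY=value lines, one of which is OSRELEASE). The function names and the `force` parameter are hypothetical, not drgn's interface.

```python
from typing import Optional


def vmcoreinfo_osrelease(vmcoreinfo: str) -> Optional[str]:
    """Extract the OSRELEASE value from VMCOREINFO text, if present."""
    for line in vmcoreinfo.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "OSRELEASE":
            return value
    return None


def check_debug_info_release(vmcoreinfo: str, debug_info_release: str,
                             force: bool = False) -> None:
    """Raise ValueError if the debug info does not match the core dump.

    force=True skips the check, mirroring the requested override.
    """
    core_release = vmcoreinfo_osrelease(vmcoreinfo)
    if force or core_release == debug_info_release:
        return
    raise ValueError(
        f"debug info is for {debug_info_release!r}, "
        f"but the core dump reports OSRELEASE {core_release!r}"
    )
```

Comparing release strings is weaker than comparing build IDs, since two different builds can share a release string.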

Add drgn.Thread API

Problem Statement

drgn currently has no representation of a thread. The only thread-specific operation at the moment is getting a stack trace, which takes a TID to identify the thread:

>>> prog.stack_trace(1)
#0  __schedule+0x231/0x654
#1  schedule+0x29/0xa5
#2  schedule_hrtimeout_range_clock+0x18d/0x19c
#3  ep_poll+0x3cd/0x3ec
#4  do_epoll_wait+0xb0/0xc6
#5  __x64_sys_epoll_wait+0x1a/0x1d
#6  do_syscall_64+0x48/0x106
#7  entry_SYSCALL_64+0x7c/0x156

When debugging the kernel, it can also take a struct task_struct *:

>>> task = find_task(prog, 1)
>>> task.type_
struct task_struct *
>>> prog.stack_trace(task)
#0  __schedule+0x231/0x654
#1  schedule+0x29/0xa5
#2  schedule_hrtimeout_range_clock+0x18d/0x19c
#3  ep_poll+0x3cd/0x3ec
#4  do_epoll_wait+0xb0/0xc6
#5  __x64_sys_epoll_wait+0x1a/0x1d
#6  do_syscall_64+0x48/0x106
#7  entry_SYSCALL_64+0x7c/0x156

This is fine if you know what TID you want. But, in a lot of cases you probably want/expect the debugger to figure out the interesting TID for you. For example, if you're debugging a core dump, you probably want to look at the thread that segfaulted. We can do that for the kernel with something like:

>>> task = per_cpu(prog["runqueues"], prog["crashing_cpu"]).curr
>>> task.type_
struct task_struct *
>>> prog.stack_trace(task)
#0  panic+0x194/0x408
#1  sysrq_handle_crash+0x28/0x2c
#2  __handle_sysrq+0xd0/0x218
#3  write_sysrq_trigger+0xbc/0x108
#4  proc_reg_write+0x90/0xd8
#5  __vfs_write+0x38/0x68
#6  vfs_write+0x11c/0x298
#7  ksys_write+0x84/0x13c
#8  system_call+0x5c/0x0

But there's currently no way to do this for userspace core dumps, and even for kernel core dumps it'd be more reliable to get it from the core dump metadata instead.

The goal of this issue is to add generic APIs to represent and find threads in drgn.

API

This is currently what I have in mind as the MVP (which we may of course want to tweak):

class Program:
    def thread(self, tid: int) -> Thread:
        """Get the thread with the given thread ID."""
        ...

    def threads(self) -> Iterable[Thread]:
        """Get all threads in the program."""
        ...

    def crashed_thread(self) -> Thread:
        """Get the thread that caused the crash."""
        ...

class Thread:
    tid: int
    """Thread ID."""

    object: Object
    """
    If debugging the kernel, ``struct task_struct *`` object for this thread.
    Otherwise, not defined.
    """

    def stack_trace(self) -> StackTrace:
        """Get the stack trace for this thread."""
        ...

And the underlying libdrgn APIs, something like:

struct drgn_thread;
struct drgn_thread_iterator;

struct drgn_error *drgn_program_find_thread(struct drgn_program *prog,
					    uint32_t tid,
					    struct drgn_thread **ret);

struct drgn_error *
drgn_thread_iterator_create(struct drgn_program *prog,
			    struct drgn_thread_iterator **ret);
void drgn_thread_iterator_destroy(struct drgn_thread_iterator *it);
struct drgn_error *drgn_thread_iterator_next(struct drgn_thread_iterator *it,
					     struct drgn_thread **ret);

struct drgn_error *drgn_program_crashed_thread(struct drgn_program *prog,
					       struct drgn_thread **ret);

void drgn_thread_destroy(struct drgn_thread *thread);
uint32_t drgn_thread_tid(struct drgn_thread *thread);
struct drgn_error *drgn_thread_object(struct drgn_thread *thread,
				      struct drgn_object *ret);
struct drgn_error *drgn_thread_stack_trace(struct drgn_thread *thread,
					   struct drgn_stack_trace **ret);

drgn doesn't support stack traces of live userspace processes yet, so for now we only need this new interface to support the live kernel, kernel core dumps, and userspace core dumps.
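To illustrate how the proposed Python interface might feel in use, here is a toy in-memory stand-in. `ToyProgram` and `ToyThread` are hypothetical classes that only mirror the method names proposed above; they do not touch a real program, and the choice of exceptions for missing threads is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ToyThread:
    tid: int
    frames: List[str]

    def stack_trace(self) -> List[str]:
        """Return the (pre-recorded) stack trace for this thread."""
        return list(self.frames)


class ToyProgram:
    def __init__(self, threads: Iterable[ToyThread],
                 crashed_tid: Optional[int] = None) -> None:
        self._threads = {thread.tid: thread for thread in threads}
        self._crashed_tid = crashed_tid

    def thread(self, tid: int) -> ToyThread:
        """Get the thread with the given thread ID."""
        try:
            return self._threads[tid]
        except KeyError:
            raise LookupError(f"no thread with TID {tid}") from None

    def threads(self) -> Iterable[ToyThread]:
        """Get all threads in the program."""
        return self._threads.values()

    def crashed_thread(self) -> ToyThread:
        """Get the thread that caused the crash."""
        if self._crashed_tid is None:
            raise ValueError("program has not crashed")
        return self.thread(self._crashed_tid)


prog = ToyProgram(
    [ToyThread(1, ["__schedule", "schedule"]),
     ToyThread(42, ["panic", "sysrq_handle_crash"])],
    crashed_tid=42,
)
```

A caller could then write `prog.crashed_thread().stack_trace()` without knowing the TID in advance, which is the main convenience the proposal is after.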

Implementation

The implementation of these APIs will be different depending on what we're debugging.

Background

Linux core dumps use the ELF format, which can contain metadata in "note" segments. For a core dump, this metadata includes what threads the process had and what the state of the registers was, among other things:

$ eu-readelf --notes core

Note segment of 3356 bytes at offset 0x510:
  Owner          Data size  Type
  CORE                 336  PRSTATUS
    info.si_signo: 11, info.si_code: 0, info.si_errno: 0, cursig: 11
    sigpend: <>
    sighold: <>
    pid: 1120070, ppid: 1120030, pgrp: 1120070, sid: 1120030
    utime: 0.000000, stime: 0.003308, cutime: 0.000000, cstime: 0.000000
    orig_rax: -1, fpvalid: 1
    r15:                       0  r14:                       0
    r13:                       0  r12:                 4198480
    rbp:      0x00007ffebac7ae20  rbx:                 4198816
    r11:                     582  r10:                 1120070
    r9:          140732032060353  r8:                        0
    rax:                    4660  rcx:                       0
    rdx:                       0  rsi:         140732032060720
    rdi:         140281380410560  rip:      0x000000000040113f
    rflags:   0x0000000000010206  rsp:      0x00007ffebac7ae20
    fs.base:   0x00007f95cdd9b580  gs.base:   0x0000000000000000
    cs: 0x0033  ss: 0x002b  ds: 0x0000  es: 0x0000  fs: 0x0000  gs: 0x0000
  CORE                 136  PRPSINFO
    state: 0, sname: R, zomb: 0, nice: 0, flag: 0x0000000000400600
    uid: 1000, gid: 985, pid: 1120070, ppid: 1120030, pgrp: 1120070
    sid: 1120030
    fname: crashme, psargs: ./crashme 
  CORE                 128  SIGINFO
    si_signo: 11, si_errno: 0, si_code: 1
    fault address: 0x1234
  CORE                 320  AUXV
    SYSINFO_EHDR: 0x7ffebacce000
    HWCAP: 0xbfebfbff  <fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe>
    PAGESZ: 4096
    CLKTCK: 100
    PHDR: 0x400040
    PHENT: 56
    PHNUM: 13
    BASE: 0x7f95cddbb000
    FLAGS: 0
    ENTRY: 0x401050
    UID: 1000
    EUID: 1000
    GID: 985
    EGID: 985
    SECURE: 0
    RANDOM: 0x7ffebac7b219
    26: 0x2
    EXECFN: 0x7ffebac7bfee
    PLATFORM: 0x7ffebac7b229
    NULL
  CORE                 696  FILE
    15 files:
      00400000-00401000 00000000 4096                /home/osandov/crashme
      00401000-00402000 00001000 4096                /home/osandov/crashme
      00402000-00403000 00002000 4096                /home/osandov/crashme
      00403000-00404000 00002000 4096                /home/osandov/crashme
      00404000-00405000 00003000 4096                /home/osandov/crashme
      7f95cdbcd000-7f95cdbf3000 00000000 155648      /usr/lib/libc-2.33.so
      7f95cdbf3000-7f95cdd3f000 00026000 1359872     /usr/lib/libc-2.33.so
      7f95cdd3f000-7f95cdd8b000 00172000 311296      /usr/lib/libc-2.33.so
      7f95cdd8b000-7f95cdd8e000 001bd000 12288       /usr/lib/libc-2.33.so
      7f95cdd8e000-7f95cdd91000 001c0000 12288       /usr/lib/libc-2.33.so
      7f95cddbb000-7f95cddbc000 00000000 4096        /usr/lib/ld-2.33.so
      7f95cddbc000-7f95cdde0000 00001000 147456      /usr/lib/ld-2.33.so
      7f95cdde0000-7f95cdde9000 00025000 36864       /usr/lib/ld-2.33.so
      7f95cddea000-7f95cddec000 0002e000 8192        /usr/lib/ld-2.33.so
      7f95cddec000-7f95cddee000 00030000 8192        /usr/lib/ld-2.33.so
  CORE                 512  FPREGSET
    xmm0:  0x25252525252525252525252525252525
    xmm1:  0xffff00ffffff0000000000ffffffff00
    xmm2:  0xffff00ffffff0000000000ffffffff00
    xmm3:  0x0000000000000000000000000000ff00
    xmm4:  0x2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f
    xmm5:  0x00000000000000000000000000000000
    xmm6:  0x00000000000000000000000000000000
    xmm7:  0x00000000000000000000000000000000
    xmm8:  0x000009000000543b031b01000000000a
    xmm9:  0x00000000000000000000000000000000
    xmm10: 0x00000000000000000000000000000000
    xmm11: 0x00000000000000000000000000000000
    xmm12: 0x00000000000000000000000000000000
    xmm13: 0x00000000000000000000000000000000
    xmm14: 0x00000000000000000000000000000000
    xmm15: 0x00000000000000000000000000000000
    st0: 0x00000000000000000000  st1: 0x00000000000000000000
    st2: 0x00000000000000000000  st3: 0x00000000000000000000
    st4: 0x00000000000000000000  st5: 0x00000000000000000000
    st6: 0x00000000000000000000  st7: 0x00000000000000000000
    mxcsr:   0x0000ffff00001f80
    fcw: 0x037f  fsw: 0x0000
  LINUX               1088  X86_XSTATE

The most relevant note for this project is the PRSTATUS note, which is what drgn uses to start a stack trace. struct drgn_thread will probably want to store the PRSTATUS note for the thread.

Kernel

For the kernel, a struct drgn_thread also needs the struct task_struct * object.

drgn_program_find_thread() needs to find the struct task_struct * (with linux_helper_find_task()) and wrap it in a struct drgn_thread.
The thread iterator needs to iterate over all tasks and wrap each one in a struct drgn_thread. We have a for_each_task() helper implemented in Python, but that will need to be translated to C so that it can be used in libdrgn (including its dependencies: for_each_pid(), idr_for_each(), radix_tree_for_each()).
drgn_program_crashed_thread() needs to get the value of crashing_cpu and find the corresponding PRSTATUS note and TID with drgn_program_find_prstatus_by_cpu(), then it's pretty much equivalent to drgn_program_find_thread(). Obviously, this won't apply when debugging the live kernel.

Userspace Core Dumps

drgn_program_find_thread() needs to find the PRSTATUS note with drgn_program_find_prstatus_by_tid() and save it in a struct drgn_thread.
The thread iterator needs to iterate over all PRSTATUS notes and save each one in a struct drgn_thread.
drgn_program_crashed_thread() needs to determine which thread crashed and wrap its PRSTATUS note. From what I can tell, Linux puts the crashed PRSTATUS first in the core dump, but this needs to be verified.
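The userspace lookups above might look roughly like this, given a list of PRSTATUS descriptors in file order. Two assumptions are baked in and labeled in the code: that the TID is the 32-bit pr_pid field at byte offset 32 of a little-endian, 64-bit struct elf_prstatus, and (unverified, as noted above) that the crashed thread's note comes first. The function names are hypothetical.

```python
import struct
from typing import List

# Assumption: offset of pr_pid in a 64-bit struct elf_prstatus.
PR_PID_OFFSET_64 = 32


def prstatus_tid(desc: bytes) -> int:
    """Read the thread ID (pr_pid) from a PRSTATUS descriptor."""
    return struct.unpack_from("<i", desc, PR_PID_OFFSET_64)[0]


def find_prstatus_by_tid(prstatus_descs: List[bytes], tid: int) -> bytes:
    """Return the PRSTATUS descriptor for the given thread ID."""
    for desc in prstatus_descs:
        if prstatus_tid(desc) == tid:
            return desc
    raise LookupError(f"no PRSTATUS note for TID {tid}")


def crashed_prstatus(prstatus_descs: List[bytes]) -> bytes:
    """Return the PRSTATUS of the crashed thread.

    Assumption (needs verification): Linux writes the crashed thread's
    PRSTATUS note first.
    """
    if not prstatus_descs:
        raise LookupError("core dump has no PRSTATUS notes")
    return prstatus_descs[0]
```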

Python Bindings

The bulk of the implementation will be in libdrgn, but there will need to be some boilerplate Python bindings.

Future Directions

A Thread.name attribute containing the comm of the thread.
A Program method to get all threads that were running at the time of the crash (or better yet, a mapping from CPU number to the thread that was running on that CPU). This would probably have to be kernel-only, as I don't know of a way to get that information from a userspace core dump.
Support for live userspace processes, including Thread.pause() and Thread.resume() methods.
Thread.object for userspace programs (maybe as the pthread_t if using pthreads?).

Add helpers for page table walking and userspace memory access

For example, is there a way to read from addresses in task.mm?

# drgn -k -s /usr/lib/debug/boot/vmlinux-5.3.0-42-generic --no-default-symbols
>>> for task in for_each_task(prog):
...   prog.read(task.mm.arg_start, task.mm.arg_end - task.mm.arg_start)
...
Traceback (most recent call last):
  File "/usr/lib/python3.7/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<console>", line 2, in <module>
_drgn.FaultError: could not find memory segment: 0x7ffdc3780ef7
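For context, a page table walking helper would need to do something like the following. This is a simplified sketch of an x86-64 4-level walk only (9 index bits per level, 12 offset bits, with 1 GiB and 2 MiB huge pages); `translate` and the `read_u64` callback are hypothetical, and a real helper would read physical memory from the target and start from the task's mm->pgd.

```python
PRESENT = 1 << 0                 # entry maps something
HUGE = 1 << 7                    # PS bit: entry maps a huge page
ADDR_MASK = 0x000FFFFFFFFFF000   # physical address bits 12..51


def translate(read_u64, top_table_phys: int, virt: int) -> int:
    """Translate a virtual address to a physical address.

    read_u64(phys) must return the 64-bit value stored at a physical
    address. Raises LookupError if the address is not mapped.
    """
    table = top_table_phys
    for shift in (39, 30, 21, 12):
        index = (virt >> shift) & 0x1FF
        entry = read_u64(table + index * 8)
        if not entry & PRESENT:
            raise LookupError(f"address is not mapped: {virt:#x}")
        if shift in (30, 21) and entry & HUGE:
            # 1 GiB or 2 MiB page: the remaining bits are the page offset.
            page_size = 1 << shift
            base = entry & ADDR_MASK & ~(page_size - 1)
            return base | (virt & (page_size - 1))
        table = entry & ADDR_MASK
    # After the last level, table holds the 4 KiB page frame address.
    return table | (virt & 0xFFF)
```

Reading task.mm.arg_start would then be a matter of translating the user address with that task's page tables and reading the resulting physical address.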

Support C++ friendship

Friendship isn't very important to debugging use cases as far as I can tell, but it'd be good to include this information for completeness. We need to:

Add a list of friends to drgn.Type/struct drgn_type. Friends can be other classes or functions, so I'm not sure yet how to represent these.
Parse the DWARF information for friendship. See section 5.7.5 (Friends) in the DWARF 5 spec.
Include friendship when pretty printing types.

Enable networking in vmtest kernels

The new TCP tests from #33 are skipped on vmtest because we build with CONFIG_NET=n. If it doesn't grow the kernel too much to enable it, we should. This will require a rebuild of the existing kernels, so manage.py is also going to need some changes to support adding an EXTRAVERSION (or LOCALVERSION?).

Get libdwfl stack trace patches upstream

I'd prefer not to keep carrying the libdwfl stack trace patches in drgn's elfutils fork, so I still need to get these upstream. The maintainer wasn't completely happy with the interface, so it'll need some iteration.

OverflowError thrown for empty array variable with unspecified size

(example taken from the zfs kernel module)
There is a global in the source code that looks like this:

static const zfs_ioc_key_t zfs_keys_remap[] = {
    /* no nvl keys */
};

While inspecting it with drgn, we get an OverflowError (which is not documented for users):

>>> prog['zfs_keys_remap']
Traceback (most recent call last):
  File "<console>", line 1, in <module>
OverflowError: DW_AT_count is too large

Looking at the relevant DWARF info with eu-readelf we see the following:

... <cropped> ...
 [bd8797]    variable
             name                 (strp) "zfs_keys_create"
             decl_file            (data1) 12
             decl_line            (data2) 3257
             type                 (ref4) [bd8792]
             location             (exprloc)
              [ 0] addr .rodata+0xfc40 <zfs_keys_create>
 [bd87ad]    variable
             name                 (strp) "zfs_keys_clone"
             decl_file            (data1) 12
             decl_line            (data2) 3396
             type                 (ref4) [bd8792]
             location             (exprloc)
              [ 0] addr .rodata+0xfc00 <zfs_keys_clone>
 [bd87c3]    array_type
             type                 (ref4) [bd864c]
             sibling              (ref4) [bd87d4]
 [bd87cc]      subrange_type
               type                 (ref4) [bd87d9]
               upper_bound          (sdata) -1      // <----- This is what trips our code
 [bd87d4]    const_type
             type                 (ref4) [bd87c3]
 [bd87d9]    base_type
             byte_size            (data1) 8
             encoding             (data1) signed (5)
             name                 (strp) "ssizetype"
 [bd87e0]    variable
             name                 (strp) "zfs_keys_remap"
             decl_file            (data1) 12
             decl_line            (data2) 3433
             type                 (ref4) [bd87d4]
             location             (exprloc)
              [ 0] addr .rodata+0xfc00 <zfs_keys_clone>
... <cropped> ...

According to Omar this should be easy to work around in our DWARF parsing code.
For anyone new picking this up, commit 77253db highlights a relevant code path.
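The arithmetic behind the failure can be sketched as follows. This is an illustration of one plausible workaround, not drgn's actual fix: if the upper bound is read as an unsigned 64-bit value, -1 becomes 2^64 - 1 and the element count overflows, whereas reading it as signed gives an upper bound of -1 and a length of zero. The function names are hypothetical.

```python
def to_signed(raw: int, bits: int = 64) -> int:
    """Reinterpret an unsigned integer of the given width as signed."""
    if raw & (1 << (bits - 1)):
        return raw - (1 << bits)
    return raw


def array_length(upper_bound: int, lower_bound: int = 0) -> int:
    """Number of elements in a subrange with inclusive bounds."""
    length = upper_bound - lower_bound + 1
    if length < 0:
        raise ValueError("subrange upper bound is below lower bound - 1")
    return length


# The DW_FORM_sdata value -1, as it looks when misread as unsigned.
raw_upper_bound = 0xFFFFFFFFFFFFFFFF
length = array_length(to_signed(raw_upper_bound))
```

Here `length` is 0, matching the empty initializer in the source.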

reserved identifier violation

I would like to point out that an identifier like “___PASTE” does not comply with the C++ language standard, which reserves identifiers containing a double underscore for the implementation.
Would you like to choose different unique names?

Add kernel modules to vmtest

#73 demonstrated that we're not testing kernel module loading enough. We need to:

Enable module support for vmtest kernels (it's disabled right now for simplicity, but it shouldn't be too hard to make, e.g., a tarball of modules).
Enable some arbitrary, ubiquitous module in the vmtest kernel config.
Test that it's properly loaded, both when using /proc/modules and /sys/module and when using the module list via the core dump.

accessing global percpu variables fails with "address is not mapped"

Hello,

I'm trying to access the global percpu variable 'runqueues' on a VM running a 5.9 kernel. The kernel is configured with the things drgn expects, at least according to vmtest/config.

>>> prog['runqueues']
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/dmjordan/src/drgn/drgn/internal/cli.py", line 25, in displayhook
    text = value.format_(columns=shutil.get_terminal_size((0, 0)).columns)
_drgn.FaultError: address is not mapped: 0x1eb440

Using the per_cpu_ptr helper fails in a similar way:

>>> drgn.helpers.linux.percpu.per_cpu_ptr(prog['runqueues'], 0)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/dmjordan/src/drgn/drgn/helpers/linux/percpu.py", line 31, in per_cpu_ptr
    return Object(ptr.prog_, ptr.type_, value=ptr.value_() + offset)
_drgn.FaultError: address is not mapped: 0x1eb440

Using plain gdb confirms that 0x1eb440 is the address of runqueues.

I've tried a few other global percpu variables with the same results.

drgn is ok with percpu vars when they're hanging off a structure (i.e. from __alloc_percpu):

>>> prog['async_pf_cache'].cpu_slab
(struct kmem_cache_cpu *)0x1f3a00

>>> drgn.helpers.linux.percpu.per_cpu_ptr(prog['async_pf_cache'].cpu_slab, 0)
*(struct kmem_cache_cpu *)0xffff88803dff3a00 = {
.freelist = (void **)0x0,
.tid = (unsigned long)0,
.page = (struct page *)0x0,
.partial = (struct page *)0x0,
}

I wonder whether there's something wrong with my syntax, or if this is actually a problem with drgn. Thanks for looking.
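As a worked example using only the numbers in this report: the value 0x1eb440 is a per-CPU offset rather than a mapped address, and the per-CPU base for CPU 0 can be inferred from the cpu_slab output above. The calculation assumes that statically defined per-CPU variables use the same per-CPU base as the dynamically allocated one shown; this has not been checked on the reporter's system.

```python
# Values copied from the output above.
CPU_SLAB_PERCPU = 0x1F3A00           # prog['async_pf_cache'].cpu_slab
CPU_SLAB_CPU0 = 0xFFFF88803DFF3A00   # per_cpu_ptr(..., 0) result
RUNQUEUES_PERCPU = 0x1EB440          # address drgn reports for runqueues


def per_cpu_address(percpu_address: int, cpu_offset: int) -> int:
    """Add a CPU's per-CPU base to a per-CPU address (64-bit wraparound)."""
    return (percpu_address + cpu_offset) & 0xFFFFFFFFFFFFFFFF


# Per-CPU base for CPU 0, inferred from the cpu_slab example.
cpu0_offset = CPU_SLAB_CPU0 - CPU_SLAB_PERCPU

# Where CPU 0's copy of runqueues would live under the assumption above.
runqueues_cpu0 = per_cpu_address(RUNQUEUES_PERCPU, cpu0_offset)

print(hex(cpu0_offset), hex(runqueues_cpu0))
```

In other words, the FaultError is consistent with reading the variable at its raw per-CPU address, before any per-CPU base has been added.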

Support C++ references

E.g., type& foo. This should be a matter of adding the boilerplate for a new "reference" type kind and parsing it from DWARF. It will be almost identical to a pointer type. The only difference that comes to mind is that a reference type may not refer to a void type (in C++, at least, but maybe we can be more permissive).

Add helper for getting kernel CONFIG options

If the kernel was compiled with CONFIG_IKCONFIG=y, then the gzipped kernel configuration exists in kernel memory (see here). We can't access it currently:

>>> prog.variable('kernel_config_data')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
LookupError: could not find variable 'kernel_config_data'

This is because kernel_config_data and kernel_config_end are defined in assembly. We'll need to add an interface for looking up ELF symbols by name.
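Once the bytes are reachable, decoding them is straightforward. The sketch below assumes a buffer that contains the gzipped configuration between the IKCFG_ST and IKCFG_ED magic markers that the kernel places around it; `extract_ikconfig` is a hypothetical helper, and obtaining the buffer from kernel memory (the part that needs ELF symbol lookup) is not shown.

```python
import gzip
from typing import Dict

IKCONFIG_START = b"IKCFG_ST"
IKCONFIG_END = b"IKCFG_ED"


def extract_ikconfig(image: bytes) -> Dict[str, str]:
    """Find, decompress, and parse the embedded kernel configuration."""
    start = image.find(IKCONFIG_START)
    if start == -1:
        raise LookupError("IKCFG_ST marker not found")
    start += len(IKCONFIG_START)
    end = image.find(IKCONFIG_END, start)
    if end == -1:
        raise LookupError("IKCFG_ED marker not found")
    text = gzip.decompress(image[start:end]).decode()

    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options
```

Options that are not set only appear as comments in the configuration text, so they are simply absent from the returned dictionary.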

does drgn support aarch64 platform vmcore

From the snippet below, it seems drgn can only handle vmcores from the current host machine.

static struct drgn_error *
open_vmlinux_debug_info(struct drgn_program *prog,
                        struct string_builder *missing_debug_info)
{
        static const char * const vmlinux_paths[] = {
                /*
                 * The files under /usr/lib/debug should always have debug information,
                 * so check those first.
                 */
                "/usr/lib/debug/boot/vmlinux-%s",
                "/usr/lib/debug/lib/modules/%s/vmlinux",
                "/boot/vmlinux-%s",
                "/lib/modules/%s/build/vmlinux",
        };

Is it possible to analyze a vmcore generated on a different machine (arm64 platform)?