embedshim

Update: containerd removes the shim.v1 interface and introduces the sandbox API. This project will be migrated to the standalone sandbox API after containerd 2.0 is released.

embedshim is a task runtime implementation that can be used as a plugin in containerd.

In the current shim design, the shim manages the lifecycle of the container process and can be reconnected after containerd restarts. One of the key design elements of a small shim is container process monitoring, which matters at least for containers created by a runC-like runtime.

Without pidfd and the eBPF tracepoint feature, a non-parent process is unlikely to receive the exit notification in time or to read the exit code correctly after the shim dies. And in Kubernetes infrastructure, even if the containers in a pod share one shim, the VmRSS of the shim (Go runtime) is still about 8MB.

So this plugin aims to provide a task runtime implementation that uses pidfd and the eBPF sched_process_exit tracepoint to manage daemonless containers with low overhead.

Build/Install

Building embedshim requires clang/llvm to compile the BPF program, so install clang/llvm first.

$ echo "deb http://apt.llvm.org/focal/ llvm-toolchain-focal main" | sudo tee -a /etc/apt/sources.lis
$ wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
$ sudo apt-get update -y
$ sudo apt-get install -y g++ libelf-dev clang lld llvm

Then clone the repo and build it.

$ git clone https://github.com/fuweid/embedshim.git
$ cd embedshim
$ git submodule update --init --recursive
$ make
$ sudo make install

The binary is named embedshim-containerd and has full functionality on Linux. You can simply replace your local containerd with it.

$ sudo install bin/embedshim-containerd $(command -v containerd)
$ sudo systemctl restart containerd

Then check the plugin with ctr:

$ ctr plugin ls | grep embed
io.containerd.runtime.v1        embed                    linux/amd64    ok

Status

embedshim supports running containers headless or with input. It is still a work in progress, so do not use it in production.

  • Task Event(Create/Start/Exit/Delete/OOM) support

Requirements

  • raw tracepoint bpf >= kernel v4.18
  • CO-RE BTF vmlinux support >= kernel v5.4
  • pidfd polling >= kernel v5.3
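Taken together, the three minimums above mean a mainline kernel of at least v5.4. A minimal sketch for checking a host follows; `kernelAtLeast` is a hypothetical helper written for this example, and it compares only the version number, so it cannot detect features that a distribution has backported or disabled.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "5.15.0-91-generic" is at least major.minor. Only the first two numeric
// components are compared.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	var gotMajor, gotMinor int
	if _, err := fmt.Sscanf(strings.TrimSpace(release), "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false, fmt.Errorf("parse kernel release %q: %w", release, err)
	}
	if gotMajor != major {
		return gotMajor > major, nil
	}
	return gotMinor >= minor, nil
}

func main() {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read kernel release:", err)
		os.Exit(1)
	}
	// Minimum versions copied from the requirements list above.
	requirements := []struct {
		feature      string
		major, minor int
	}{
		{"raw tracepoint bpf", 4, 18},
		{"CO-RE BTF vmlinux support", 5, 4},
		{"pidfd polling", 5, 3},
	}
	for _, r := range requirements {
		ok, err := kernelAtLeast(string(raw), r.major, r.minor)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%-26s >= %d.%d: %v\n", r.feature, r.major, r.minor, ok)
	}
}
```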

License

Contributors

fuweid, linxiulei

embedshim's Issues

program handle_sched_process_exit: apply CO-RE relocations: bad magic number '[159 235 1 0]' in record at byte 0x0

Background

When loading the shim plugin, the following error is reported:

time="2024-01-21T11:07:44.731463684+08:00" level=warning msg="failed to load plugin io.containerd.runtime.v1.embed" error="program handle_sched_process_exit: apply CO-RE relocations: bad magic number '[159 235 1 0]' in record at byte 0x0"

Adding a print statement in exitsnoop.bpf.c seems to circumvent the issue:

 rt = (struct task_info *)bpf_map_lookup_elem(&tracing_tasks, &pid);
    if (!rt)
            // todo Adding this print avoids the 'apply CO-RE relocations: bad magic number' error, specific cause to be located
            bpf_printk("rt is: %p\n",rt);
        return 0;

Fix error in plugin initialization

failed to load plugin io.containerd.runtime.v1.embed  error="program handle_sched_process_exit: CO-RE relocations: can't read types: type id 5995: unknown kind: Unknown (19)"
could not load runtime instance due to initialization error  error="program handle_sched_process_exit: CO-RE relocations: can't read types: type id 5995: unknown kind: Unknown (19)"

update: key too big for map: argument list too long: unknown

After running for a while, the containerd logs are continuously reporting an error:

level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-scheduler-master1,Uid:4bba31f5bbd08c1ecb43f3eeca03effb,Namespace:kube-system,Attempt:221,} failed, error" error="failed to create containerd task: failed to create init process: failed to insert taskinfo for init process(id=5585c9eb3702e459fb2c73b0314e2d77670df6af8b23b0662c4032e7e328af1a, namespace=k8s.io): update: key too big for map: argument list too long: unknown"

It appears that the error is occurring during the update of an eBPF map. The following Go code seems to be involved in the issue:

// traceInitProcess checks init process is alive and starts to trace it's exit
// event by exitsnoop bpf tracepoint.
func (m *monitor) traceInitProcess(init *initProcess) (retErr error) {
	m.Lock()
	defer m.Unlock()

	fd, err := pidfd.Open(uint32(init.Pid()), 0)
	if err != nil {
		return fmt.Errorf("failed to open pidfd for %s: %w", init, err)
	}
	defer func() {
		if retErr != nil {
			unix.Close(int(fd))
		}
	}()

	// NOTE: The pid might be reused before pidfd.Open(like oom-killer or
	// manually kill), so that we need to check the runc-init's exec.fifo
	// file descriptor which is the "identity" of runc-init. :)
	//
	// Why we don't use runc-state commandline?
	//
	// The runc-state command only checks /proc/$pid/status's starttime,
	// which is not reliable. And then it only checks exec.fifo exist in
	// disk, but the runc-init has been killed. So we can't just use it.
	if err := checkRuncInitAlive(init); err != nil {
		return err
	}

	nsInfo, err := getPidnsInfo(uint32(init.Pid()))
	if err != nil {
		return fmt.Errorf("failed to get pidns info: %w", err)
	}

	if err := m.initStore.Trace(uint32(init.Pid()), &exitsnoop.TaskInfo{
		TraceID:   init.traceEventID,
		PidnsInfo: nsInfo,
	}); err != nil {
		return fmt.Errorf("failed to insert taskinfo for %s: %w", init, err)
	}
	defer func() {
		if retErr != nil {
			m.initStore.DeleteTracingTask(uint32(init.Pid()))
			m.initStore.DeleteExitedEvent(init.traceEventID)
		}
	}()

	// Before trace it, the init-process might be killed and the exitsnoop
	// tracepoint will not work, we need to check it alive again by pidfd.
	if err := fd.SendSignal(0, 0); err != nil {
		return err
	}

	if err := m.pidPoller.Add(fd, func() error {
		// TODO(fuweid): do we need to check the pid value in event?
		status, err := m.initStore.GetExitedEvent(init.traceEventID)
		if err != nil {
			init.SetExited(unexpectedExitCode)
			return fmt.Errorf("failed to get exited status: %w", err)
		}

		init.SetExited(int(status.ExitCode))
		return nil
	}); err != nil {
		return err
	}
	return nil
}

It seems that the key is not being validated properly. The key 5585c9eb3702e459fb2c73b0314e2d77670df6af8b23b0662c4032e7e328af1a is just an example, and there are other keys that also fail, such as 1ea7f8369914d19bda8da29673e4f4e037c1b39e185f6f4da0dc167539754ca2, 578193dfea54c854054abdea0a7bea11ab99e35a8d89c6469ed28084d5ab5080.

Feature: support basic task events

  • TaskCreateEventTopic for task create "/tasks/create"
  • TaskStartEventTopic for task start "/tasks/start"
  • TaskOOMEventTopic for task oom "/tasks/oom"
  • TaskExitEventTopic for task exit "/tasks/exit"
  • TaskDeleteEventTopic for task delete "/tasks/delete"
  • TaskExecAddedEventTopic for task exec create "/tasks/exec-added"
  • TaskExecStartedEventTopic for task exec start "/tasks/exec-started"

execing command in container: error stream protocol error: invalid exit code value 4294967295

When using crictl to execute the exec command, there is an intermittent error reported:

execing command in container: error stream protocol error: invalid exit code value 4294967295

The relevant Go code snippet is as follows:

		return pidMonitor.pidPoller.Add(pidFD, func() error {
			execPid := e.Pid()

			status := 256

			event, err := execExitStore.GetExitedEvent(e.traceEventID)

			if err == nil && event.Pid == uint32(execPid) {
				status = int(event.ExitCode)
			}
			execExitStore.DeleteExitedEvent(e.traceEventID)

			e.SetExited(status)
			return nil
		})

When event, err := execExitStore.GetExitedEvent(e.traceEventID) encounters an error, the status is set to 256. The setExited method then calls unix.WaitStatus(status).ExitStatus(), as shown here:

func (e *execProcess) setExited(status int) {
	e.status = unix.WaitStatus(status).ExitStatus()
	e.exited = time.Now()

	if e.parent.platform != nil {
		e.parent.platform.ShutdownConsole(context.Background(), e.console)
	}
	close(e.waitBlock)
}

According to the source code, ExitStatus returns -1 whenever the wait status does not describe a normal exit:

func (w WaitStatus) ExitStatus() int {
	if !w.Exited() {
		return -1
	}
	return int(w>>shift) & 0xFF
}
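The decoding can be checked directly. The sketch below uses the standard library's `syscall.WaitStatus` as a stand-in for `unix.WaitStatus`, on the assumption that both decode a status the same way on Linux; `decode` is a helper written only for this example. It suggests that the fallback value 256 decodes to exit status 1, so the -1 more likely comes from a status that records termination by a signal.

```go
package main

import (
	"fmt"
	"syscall"
)

// decode mirrors the conversion that setExited applies to a raw wait status.
func decode(raw int) int {
	return syscall.WaitStatus(raw).ExitStatus()
}

func main() {
	// 256: the low 7 bits are zero, so this counts as a normal exit and the
	// exit status is (256 >> 8) & 0xFF.
	fmt.Println(decode(256)) // 1

	// 9: the low 7 bits hold a signal number (SIGKILL), so this is not a
	// normal exit.
	killed := decode(9)
	fmt.Println(killed) // -1

	// Converting -1 to an unsigned 32-bit value gives the number in the
	// error message.
	fmt.Println(uint32(killed)) // 4294967295
}
```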

Converted to an unsigned 32-bit value, -1 becomes 4294967295, which produces the error message:

execing command in container: error stream protocol error: invalid exit code value 4294967295

When the command does not exist, `kubectl exec` hangs.

Background

When bash is absent, `kubectl exec` hangs for containers managed by embedshim, as illustrated by the following command:

[root@master1 ~]#  kubectl -it exec  xxx-bqwbh -n manager -- bash
Defaulted container "xxx" out of: xxx, init-rights (init)

Under normal circumstances, the output is as follows:

[root@master1 ~]#  kubectl -it exec  xxx-h4vvp -n manager -- bash
Defaulted container "xxx" out of: xxx, permission-change (init),
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "54071ac1a5e33002e26bd979bb5f94d8ba6072f0cf125966e721b0ddf2129889": OCI runtime exec failed: exec failed: unable to start container process: exec: "bash": executable file not found in $PATH: unknown

bug: fd leak when deleting a created container

critest calls CreateContainer and then deletes the container, after which the fifo is leaked.

The case name is runtime should support removing created container [Conformance].

reproduce:

critest -runtime-endpoint /run/containerd/containerd.sock -ginkgo.focus 'runtime should support removing created container'

The result is from containerd v1.5.11 (using the runc-v2 shim).
It is an upstream issue, but it blocks the v0.1.0 release.

➜  testing sudo lsof -p $(pidof containerd)
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
      Output information may be incomplete.
COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
container 155110 root  cwd       DIR              259,2     4096        2 /
container 155110 root  rtd       DIR              259,2     4096        2 /
container 155110 root  txt       REG              259,2 47675128  8398013 /usr/bin/containerd
container 155110 root  mem-W     REG              259,2   524288 17566340 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/metadata.db
container 155110 root  mem-W     REG              259,2  8388608 17575408 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
container 155110 root  mem       REG              259,2  1983576  8391102 /usr/lib/x86_64-linux-gnu/libc-2.33.so
container 155110 root  mem       REG              259,2   150720  8391620 /usr/lib/x86_64-linux-gnu/libpthread-2.33.so
container 155110 root  mem       REG              259,2    22912  8391104 /usr/lib/x86_64-linux-gnu/libdl-2.33.so
container 155110 root  mem       REG              259,2   216192  8391094 /usr/lib/x86_64-linux-gnu/ld-2.33.so
container 155110 root    0r      CHR                1,3      0t0        5 /dev/null
container 155110 root    1u     unix 0xffff9202e6745940      0t0  1957280 type=STREAM
container 155110 root    2u     unix 0xffff9202e6745940      0t0  1957280 type=STREAM
container 155110 root    3u  a_inode               0,14        0    12472 [eventpoll]
container 155110 root    4r     FIFO               0,13      0t0  1955442 pipe
container 155110 root    5w     FIFO               0,13      0t0  1955442 pipe
container 155110 root    6uW     REG              259,2  8388608 17575408 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
container 155110 root    7u  a_inode               0,14        0    12472 [eventpoll]
container 155110 root    8r  a_inode               0,14        0    12472 inotify
container 155110 root    9u  a_inode               0,14        0    12472 [eventpoll]
container 155110 root   10r     FIFO               0,13      0t0  1949683 pipe
container 155110 root   11w     FIFO               0,13      0t0  1949683 pipe
container 155110 root   12u     unix 0xffff920220e92a80      0t0  1949684 /run/containerd/debug.sock type=STREAM
container 155110 root   13u     unix 0xffff920220e96a40      0t0  1949685 /run/containerd/containerd.sock.ttrpc type=STREAM
container 155110 root   14u     unix 0xffff920220e91980      0t0  1949686 /run/containerd/containerd.sock type=STREAM
container 155110 root   15uW     REG              259,2   524288 17566340 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/metadata.db
container 155110 root   16u     IPv4            1952557      0t0      TCP localhost:41601 (LISTEN)
container 155110 root   23u     FIFO               0,25      0t0    10010 /run/containerd/io.containerd.grpc.v1.cri/containers/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885/io/2766993513/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885-stdout (deleted)
container 155110 root   24u     FIFO               0,25      0t0    10011 /run/containerd/io.containerd.grpc.v1.cri/containers/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885/io/2766993513/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885-stderr (deleted)

Dependency version warning

Dependency line:

github.com/fuweid/embedshim --> github.com/containerd/containerd --> github.com/urfave/cli

github.com/containerd/containerd v1.5.13 --> github.com/urfave/cli v1.22.1

https://github.com/containerd/containerd/blob/v1.5.13/go.mod#L119

Background

Repo github.com/containerd/containerd at version v1.5.13 uses a replace directive to pin the dependency github.com/urfave/cli to version v1.22.1.

According to the Go Modules wiki, replace directives in modules other than the main module are ignored when building the main module.
This means a replace directive in a dependency's go.mod is not inherited when building the main module. As a result, fuweid/embedshim depends on urfave/cli@v1.22.2, which differs from the version that containerd/containerd pins.

https://github.com/fuweid/embedshim/blob/unstable/go.mod(Line 19)

github.com/urfave/cli v1.22.2

https://github.com/containerd/containerd/blob/v1.5.13/go.mod(line 52&119)

github.com/urfave/cli v1.22.2
github.com/urfave/cli => github.com/urfave/cli v1.22.1

So this is just a reminder in the hope that you can notice such an inconsistency.

Solution

1. Bump the version of dependency github.com/containerd/containerd

You can try upgrading dependency github.com/containerd/containerd to a newer version, which may have eliminated the use of this directive.

2. Add the same replace rule to your go.mod

replace github.com/urfave/cli => github.com/urfave/cli v1.22.1

Feature: support exec API

Unlike init, the runC-like exec command doesn't support separate create and start steps.
A wrapper is needed to support exec with pidfd and exitsnoop.

It may also be worth drafting a proposal for the two-step exec in the runc community.
