coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics

Home Page: https://coroot.com/docs/metrics/node-agent

License: Apache License 2.0

Dockerfile 0.01% Go 99.73% Makefile 0.01% C 0.23% Shell 0.02%
ebpf logs monitoring prometheus prometheus-exporter prometheus-metrics network-metrics node-metrics observability

coroot-node-agent's Introduction

Coroot-node-agent


The agent gathers metrics related to a node and the containers running on it, and it exposes them in the Prometheus format.

It uses eBPF to track container-related events such as TCP connects, so the minimum supported Linux kernel version is 4.16.
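The 4.16 requirement can be checked up front. Below is a minimal sketch (not the agent's actual startup code; function names are made up) that parses a uname-style release string and compares it against a minimum version:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelVersion extracts major and minor from a release string
// such as "5.15.0-60-generic". Everything after the minor part is ignored.
func parseKernelVersion(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected release string: %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	// the minor part may carry a suffix like "16-rc1"; keep leading digits only
	minorStr := parts[1]
	if i := strings.IndexFunc(minorStr, func(r rune) bool { return r < '0' || r > '9' }); i >= 0 {
		minorStr = minorStr[:i]
	}
	if minor, err = strconv.Atoi(minorStr); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// kernelAtLeast reports whether release satisfies the minimum major.minor.
func kernelAtLeast(release string, minMajor, minMinor int) bool {
	major, minor, err := parseKernelVersion(release)
	if err != nil {
		return false
	}
	return major > minMajor || (major == minMajor && minor >= minMinor)
}

func main() {
	fmt.Println(kernelAtLeast("5.15.0-60-generic", 4, 16))
	fmt.Println(kernelAtLeast("4.15.0-112-generic", 4, 16))
}
```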

Features

TCP connection tracing

To provide visibility into the relationships between services, the agent traces containers' TCP events, such as connect() and listen().

Exported metrics are useful for:

  • Obtaining an actual map of inter-service communications without integrating distributed tracing frameworks into your code.
  • Detecting connection errors from one service to another.
  • Measuring network latency between containers, nodes, and availability zones.
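To make the exported data concrete, here is a minimal, hypothetical sketch (not the agent's actual code; all type and field names are invented) of how TCP connect events could be rolled up into per-(container, destination) counters of the shape a Prometheus exporter would serve:

```go
package main

import (
	"fmt"
	"sync"
)

// connEvent is a simplified stand-in for a TCP event reported by eBPF.
type connEvent struct {
	containerID string // e.g. cgroup-derived container id
	destination string // "ip:port" of the peer
	failed      bool   // connect() returned an error
}

// connCounters aggregates events into counters, the shape a
// *_connections_total-style metric family would be built from.
type connCounters struct {
	mu         sync.Mutex
	successful map[[2]string]int
	failed     map[[2]string]int
}

func newConnCounters() *connCounters {
	return &connCounters{
		successful: map[[2]string]int{},
		failed:     map[[2]string]int{},
	}
}

// record increments the counter matching the event's container,
// destination, and outcome.
func (c *connCounters) record(e connEvent) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := [2]string{e.containerID, e.destination}
	if e.failed {
		c.failed[key]++
	} else {
		c.successful[key]++
	}
}

func main() {
	c := newConnCounters()
	c.record(connEvent{"app", "10.0.0.5:5432", false})
	c.record(connEvent{"app", "10.0.0.5:5432", false})
	c.record(connEvent{"app", "10.0.0.9:6379", true})
	for k, v := range c.successful {
		fmt.Printf("successful_connects{container=%q,dest=%q} %d\n", k[0], k[1], v)
	}
}
```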


Log patterns extraction

Log management is usually quite expensive. In most cases, you do not need to analyze each event individually: it is enough to extract recurring patterns and count the related events.

This approach drastically reduces the amount of data required for quick log analysis.

The agent discovers container logs and parses them right on the node.

At the moment the following sources are supported:

  • Direct logging to files in /var/log/
  • Journald
  • Dockerd (JSON file driver)
  • Containerd (CRI logs)

To learn more about automated log clustering, check out the blog post "Mining metrics from unstructured logs".
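The clustering idea can be illustrated with a toy normalizer (an assumption for illustration, not the agent's actual algorithm): replace variable tokens such as numbers and long hex identifiers with placeholders, then count lines per resulting pattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// Variable tokens: long hex identifiers and decimal numbers.
var (
	hexRe = regexp.MustCompile(`\b[0-9a-f]{8,}\b`)
	numRe = regexp.MustCompile(`\b\d+\b`)
)

// patternOf collapses variable parts of a log line into placeholders,
// so recurring messages map to the same pattern.
func patternOf(line string) string {
	line = hexRe.ReplaceAllString(line, "<id>")
	line = numRe.ReplaceAllString(line, "<num>")
	return line
}

func main() {
	lines := []string{
		"request 42 from user 1001 failed",
		"request 43 from user 2002 failed",
		"cache miss for key deadbeefcafe1234",
	}
	counts := map[string]int{}
	for _, l := range lines {
		counts[patternOf(l)]++
	}
	for p, n := range counts {
		fmt.Printf("%d  %s\n", n, p)
	}
}
```

The two "request … failed" lines collapse into one pattern with a count of 2, which is the effect the README describes.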

Delay accounting

Delay accounting allows engineers to accurately identify situations where a container is experiencing a lack of CPU time or waiting for I/O.

The agent gathers per-process counters through Netlink and aggregates them into per-container metrics.


Out-of-memory events tracing

The container_oom_kills_total metric counts how many times a container's processes have been terminated by the OOM killer.

Instance meta information

If a node is a cloud instance, the agent identifies the cloud provider and collects additional information from the provider's metadata service.

Supported cloud providers: AWS, GCP, Azure, Hetzner

Collected info:

  • AccountID
  • InstanceID
  • Instance/machine type
  • Region
  • AvailabilityZone
  • AvailabilityZoneId (AWS only)
  • LifeCycle: on-demand/spot (AWS and GCP only)
  • Private & Public IP addresses
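For illustration only, here is a hypothetical struct holding the fields listed above and a helper that turns it into Prometheus-style labels (the agent's real types and label names may differ):

```go
package main

import "fmt"

// instanceMetadata mirrors the fields listed above (names are illustrative).
type instanceMetadata struct {
	AccountID          string
	InstanceID         string
	InstanceType       string
	Region             string
	AvailabilityZone   string
	AvailabilityZoneID string // AWS only
	LifeCycle          string // "on-demand" or "spot" (AWS and GCP only)
	PrivateIP          string
	PublicIP           string
}

// toLabels converts the metadata into a label map, dropping empty fields
// so providers that lack a field simply omit the label.
func toLabels(m instanceMetadata) map[string]string {
	all := map[string]string{
		"account_id":           m.AccountID,
		"instance_id":          m.InstanceID,
		"instance_type":        m.InstanceType,
		"region":               m.Region,
		"availability_zone":    m.AvailabilityZone,
		"availability_zone_id": m.AvailabilityZoneID,
		"instance_life_cycle":  m.LifeCycle,
		"private_ip":           m.PrivateIP,
		"public_ip":            m.PublicIP,
	}
	labels := map[string]string{}
	for k, v := range all {
		if v != "" {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	m := instanceMetadata{InstanceID: "i-0abc", Region: "us-east-1", LifeCycle: "spot"}
	fmt.Println(toLabels(m))
}
```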


Installation

The documentation is available at coroot.com/docs/metric-exporters/node-agent.

Metrics

The collected metrics are described in the documentation.

Coroot

The best way to turn metrics into answers about app issues is to use Coroot, a zero-instrumentation observability tool for microservice architectures.

A live demo of Coroot is available at community-demo.coroot.com

Contributing

To start contributing, check out our Contributing Guide.

License

Coroot-node-agent is licensed under the Apache License, Version 2.0.

The BPF code is licensed under the GNU General Public License, Version 2.0.

coroot-node-agent's People

Contributors

apetruhin, blue-troy, def, dependabot[bot], keisku, tombokombo, wenhuwang


coroot-node-agent's Issues

Agent does not run inside a KinD cluster

Running a KinD cluster on Linux for testing. The coroot agent fails to start with this message:

netlink receive: no such file or directory

Is there any workaround? It would be great to be able to test on KinD rather than spin up a real cluster on GKE.

Only containerD is shown in application

As per the documentation, after installing Coroot and the agent, Prometheus was attached properly, but the only visible application was containerd. Any help appreciated.

How to disable profiling

Hello!

I see in the code

profiling.Start()
defer profiling.Stop()

Is it possible to add a condition that specifies whether or not to enable profiling?

k0s v1.26.1+k0s.0 failed to inspect container

I'm having trouble getting service maps working. I installed Coroot into a cluster with 1 master node and 2 worker nodes; all applications show external endpoints, and no CPU/memory data is picked up.

I used Helm, which installed the following Coroot versions:

$ helm install --namespace coroot --create-namespace coroot coroot/coroot

image: ghcr.io/coroot/coroot-node-agent:1.6.4
image: ghcr.io/coroot/coroot:0.13.1
$ k0s version
v1.26.1+k0s.0

$ uname -a
Linux kmaster 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ sudo sysctl -a | grep bpf
kernel.bpf_stats_enabled = 1
kernel.unprivileged_bpf_disabled = 0
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 1
net.core.bpf_jit_limit = 264241152

The node-agent logs show "failed to get container metadata for pid" and "failed to inspect container" errors:

I0226 11:40:37.256313   13256 registry.go:191] TCP connection from unknown container {connection-open none 14216 10.244.0.218:46266 10.103.101.131:80 25 5635465385867 <nil>}
W0226 11:40:37.256387   13256 registry.go:244] failed to get container metadata for pid 14216 -> /kubepods/burstable/podd785437d-e85d-40f0-b13f-52a66f1dda5d/cc5347fb09a49ab8a1017960f8c70e4e765dedb561cb0e2eb7325196fc4efcf4: failed to interact with dockerd (%!s(<nil>)) or with containerd (%!s(<nil>))

It seems that I could not push my code and create a pull request; I got this error message:
Permission to coroot/coroot-node-agent.git denied to irvanmohamad

I would like to add support for the k0s distribution in the containerd.go file, so that it would look like this:

sockets := []string{"/var/snap/microk8s/common/run/containerd.sock", "/run/k3s/containerd/containerd.sock", "/run/containerd/containerd.sock", "/run/k0s/containerd.sock"}

failed to dial "/proc/1/root/run/containerd/containerd.sock": context deadline exceeded

The coroot agent can't inspect instances on a k3s cluster.

➜ k -n coroot logs -f coroot-node-agent-6tgvf
I0331 02:15:45.065649  521709 cilium.go:31] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
I0331 02:15:45.065778  521709 cilium.go:37] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I0331 02:15:45.065784  521709 cilium.go:44] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I0331 02:15:45.065788  521709 cilium.go:44] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I0331 02:15:45.065792  521709 cilium.go:54] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I0331 02:15:45.065795  521709 cilium.go:54] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I0331 02:15:45.066085  521709 main.go:76] agent version: 1.7.4
I0331 02:15:45.066112  521709 main.go:82] hostname: huwl-QiTianM455-N000
I0331 02:15:45.066114  521709 main.go:83] kernel version: 5.19.0-35-generic
I0331 02:15:45.066141  521709 main.go:69] machine-id:  d4a6c200f13211ec8299c0898d1e2c00
I0331 02:15:45.066224  521709 metadata.go:66] cloud provider:
I0331 02:15:45.066227  521709 collector.go:157] instance metadata: <nil>
W0331 02:15:45.066887  521709 registry.go:65] Cannot connect to the Docker daemon at unix:///proc/1/root/run/docker.sock. Is the docker daemon running?
W0331 02:15:49.069727  521709 registry.go:68] couldn't connect to containerd through the following UNIX sockets [/var/snap/microk8s/common/run/containerd.sock,/run/k0s/containerd.sock,/run/k3s/containerd/containerd.sock,/run/containerd/containerd.sock]: failed to dial "/proc/1/root/run/containerd/containerd.sock": context deadline exceeded
I0331 02:15:49.165799  521709 registry.go:262] calculated container id 1 -> /init.scope ->
I0331 02:15:49.165856  521709 registry.go:267] "ignoring" cg="/init.scope" pid=1
I0331 02:15:49.165896  521709 registry.go:262] calculated container id 2 -> / ->
I0331 02:15:49.165905  521709 registry.go:267] "ignoring" cg="/" pid=2
➜ sudo ls -al /proc/1/root/run/containerd/containerd.sock
[sudo] password for huwl:
lrwxrwxrwx 1 root root 35  3月 30 19:50 /proc/1/root/run/containerd/containerd.sock -> /run/k3s/containerd/containerd.sock

➜ sudo ls /run/k3s/containerd/containerd.sock
/run/k3s/containerd/containerd.sock
✖ k -n coroot describe pod coroot-node-agent-6tgvf
Name:             coroot-node-agent-6tgvf
Namespace:        coroot
Priority:         0
Service Account:  default
Node:             --------
Start Time:       Fri, 31 Mar 2023 10:15:44 +0800
Labels:           app=coroot-node-agent
                  app.kubernetes.io/instance=coroot
                  app.kubernetes.io/name=node-agent
                  controller-revision-hash=c74fb5cf8
                  pod-template-generation=1
Annotations:      prometheus.io/port: 80
                  prometheus.io/scrape: true
Status:           Running
IP:               10.42.0.113
IPs:
  IP:           10.42.0.113
Controlled By:  DaemonSet/coroot-node-agent
Containers:
  node-agent:
    Container ID:  containerd://9faf71324df7cf6e4b3fbb587acda1b66efbc07142e518a914b4a1ab77e25eb8
    Image:         ghcr.io/coroot/coroot-node-agent:1.7.4
    Image ID:      ghcr.io/coroot/coroot-node-agent@sha256:a0572c1cc25b16f1625e760c893b20ee0d42263d3f8a98eda7cdeca88a8fd935
    Port:          80/TCP
    Host Port:     0/TCP
    Command:
      coroot-node-agent
      --cgroupfs-root
      /host/sys/fs/cgroup
    State:          Running
      Started:      Fri, 31 Mar 2023 10:15:45 +0800
    Ready:          True
    Restart Count:  0

kube-state-metrics is missing despite being deployed and running and shows in Prometheus

Hi, I have the main Coroot deployed in one cluster and am working on adding other clusters to it by using their already-deployed Prometheus and kube-state-metrics and just deploying coroot-node-agent, but I can't see kube-state-metrics or the service map.
So I started investigating and found these failures on the coroot-node-agent pods: "failed to get container metadata for pid 16843 -> /kubepods/burstable/pod6f222fb5-3d0e-425e-899c-e5495124a057/ea64d45c2a6338bb0f9aae2f05ec4a77e323915d25ed11b19cb2504cbf2113d0: failed to interact with dockerd (%!s()) or with containerd (%!s())"

kubernetes version : v1.25.16+vmware.1
OS: Ubuntu 22.04.4 LTS
kernel : 6.5.0-21-generic
container runtime : containerd://1.6.28
coroot node agent tag : 1.18.9

Some apps are not recognized

I have some apps that fail to be recognized. IPs and endpoints are there, but no details on the app.
Screenshot 2022-10-25 at 17 56 20
Apps are descheduler, ???, drone CI (server), bitwarden, authelia.

Is there any way to debug why they fail to parse? Maybe they're missing some critical labels?

Segfault on MongoDB eBPF tracing

Hello,

Running node-agent 1.8.1 with the new eBPF tracing, we ran into this problem on one of our worker nodes:

I0518 13:43:59.861092   13864 registry.go:194] TCP connection from unknown container {connection-open none 3730 10.192.6.192:53932 10.192.6.192:9100 40 592331228739525 <nil>}
panic: runtime error: slice bounds out of range [4:0]

goroutine 78 [running]:
go.mongodb.org/mongo-driver/x/bsonx/bsoncore.newBufferFromReader({0x1ef17c0, 0xc000b12600})
	/go/pkg/mod/go.mongodb.org/[email protected]/x/bsonx/bsoncore/document.go:125 +0x1f4
go.mongodb.org/mongo-driver/x/bsonx/bsoncore.NewDocumentFromReader(...)
	/go/pkg/mod/go.mongodb.org/[email protected]/x/bsonx/bsoncore/document.go:101
go.mongodb.org/mongo-driver/bson.NewFromIOReader({0x1ef17c0?, 0xc000b12600?})
	/go/pkg/mod/go.mongodb.org/[email protected]/bson/raw.go:27 +0x25
github.com/coroot/coroot-node-agent/tracing.bsonToString({0x1ef17c0?, 0xc000b12600?})
	/tmp/src/tracing/mongo.go:60 +0x25
github.com/coroot/coroot-node-agent/tracing.parseMongo({0xc0009ea260, 0x200, 0x200})
	/tmp/src/tracing/mongo.go:56 +0x225
github.com/coroot/coroot-node-agent/tracing.handleMongoQuery({0xfffffffffff43bd5?, 0x203000?, 0x29be0a0?}, {0x38?, 0x8?, 0x29be0a0?}, 0xc0009ea240, {0xc000513440, 0x3, 0x3})
	/tmp/src/tracing/mongo.go:22 +0x7c
github.com/coroot/coroot-node-agent/tracing.HandleL7Request({0xc0004bda90, 0x41}, {{{0x0?, 0x0?}, 0xc000012078?}, 0x0?}, 0xc0009ea240, 0x0?)
	/tmp/src/tracing/tracing.go:76 +0x69d
github.com/coroot/coroot-node-agent/containers.(*Container).onL7Request(0xc000241560, 0x4c6d, 0x13, 0x0, 0xc0009ea240)
	/tmp/src/containers/container.go:523 +0x312
github.com/coroot/coroot-node-agent/containers.(*Registry).handleEvents(0xc000228280, 0xc000172120)
	/tmp/src/containers/registry.go:221 +0xa2d
created by github.com/coroot/coroot-node-agent/containers.NewRegistry
	/tmp/src/containers/registry.go:89 +0x4db
  • It only happens on this one node out of several dozen nodes
  • It only happens with MongoDB

I'll try to track down some more details if I can about the MongoDB pod that it's trying to trace.

k3s compatibility

I'm not sure if it's related to k3s, but my test lab is running it, so here is a report:

Installed Coroot from the YAMLs provided in the docs.
Looks like everything started and works correctly: pods are in the ready state, and the UI opens.
Specified Prometheus and it works. Also added a PodMonitor, since I use prometheus-operator. Coroot node-agent targets are green and scraping as expected.

However, I get a lot of these errors in the node-agent logs:

I0917 01:17:31.021970  589105 registry.go:222] got cgroup by pid 603314 -> /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod62c5feb2_0cb6_4693_876c_4b54014e085e.slice/cri-containerd-22ebc6fd6681fe43a65f392647adf215c94c4e263bbbd7cbd4cce0980022bfb7.scope
W0917 01:17:31.022457  589105 registry.go:230] failed to interact with dockerd (%!s(<nil>)) or with containerd (%!s(<nil>))

Additionally, my map looks like this:
Screenshot 2022-09-17 at 04 20 42
10.251.x.x is my pod network, backed by cilium.

And if I go to the app details, I receive this:
Screenshot 2022-09-17 at 04 21 22

One thing that comes to mind is that k3s has a non-standard containerd socket, /run/k3s/containerd/containerd.sock (the default is /run/containerd/containerd.sock), but I can't find any hardcoded path in the node-agent code.

restart node-agent

Deployed to Kubernetes 1.20 (Rancher 2.5) with Docker 20.10.17.

coroot-node-agent always restarts; logs:

I0726 10:54:35.984709 3655059 cilium.go:35] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I0726 10:54:35.984723 3655059 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I0726 10:54:35.984738 3655059 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I0726 10:54:35.984750 3655059 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I0726 10:54:35.984763 3655059 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I0726 10:54:35.984964 3655059 main.go:81] agent version: 1.8.8
I0726 10:54:35.985040 3655059 main.go:87] hostname: umt-k8s-mts-datapro-c1.sovcombank.group
I0726 10:54:35.985045 3655059 main.go:88] kernel version: 5.4.17-2136.304.4.1.el8uek.x86_64
I0726 10:54:35.985121 3655059 main.go:71] machine-id:  98710f42f40b2be32d22d80e35a5e0c4
I0726 10:54:35.985145 3655059 tracing.go:29] no OpenTelemetry collector endpoint configured
I0726 10:54:35.985291 3655059 metadata.go:66] cloud provider:
I0726 10:54:35.985299 3655059 collector.go:157] instance metadata: <nil>
I0726 10:54:38.990940 3655059 containerd.go:37] using /run/containerd/containerd.sock
W0726 10:54:38.991003 3655059 registry.go:72] stat /proc/1/root/var/run/crio/crio.sock: no such file or directory
F0726 10:54:38.994821 3655059 main.go:112] kernel tracing is not available: stat /sys/kernel/debug/tracing: no such file or directory

Checking directories on the parent node:

root@umt-k8s-mts-datapro-c1:/home/kulishovkm # ls -la /proc/1/root/var/run/cri/
ls: cannot access '/proc/1/root/var/run/cri/': No such file or directory
root@umt-k8s-mts-datapro-c1:/home/kulishovkm # ls -la /sys/kernel/debug/tracing
ls: cannot access '/sys/kernel/debug/tracing': No such file or directory
root@umt-k8s-mts-datapro-c1:/home/kulishovkm # ls -la /sys/kernel/debug/tracing/
ls: cannot access '/sys/kernel/debug/tracing/': No such file or directory

Linux kernel 5.4 on the parent node.

Image version: coroot-node-agent 1.8.8

This is my DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coroot-agent-node-agent
  labels:
    helm.sh/chart: node-agent-0.1.34
    app.kubernetes.io/name: node-agent
    app.kubernetes.io/instance: coroot-agent
    app.kubernetes.io/version: "1.8.8"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-agent
      app.kubernetes.io/instance: coroot-agent
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-agent
        app.kubernetes.io/instance: coroot-agent
        app: coroot-node-agent
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '80'
    spec:
      tolerations:
        - operator: Exists
      priorityClassName:
      hostPID: true
      containers:
        - name: node-agent
          image: "registry.sovcombank.group/s-devops/ghcr.io/coroot/coroot-node-agent:1.8.8"
          command: ["coroot-node-agent", "--cgroupfs-root", "/host/sys/fs/cgroup"]
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 50Mi
          env:
          ports:
            - containerPort: 80
              name: http
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /host/sys/fs/cgroup
              name: cgroupfs
              readOnly: true
            - mountPath: /sys/kernel/debug
              name: debugfs
              readOnly: false
      volumes:
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroupfs
        - hostPath:
            path: /sys/kernel/debug
          name: debugfs

Can you tell me what I'm doing wrong?

Too many errors at launch of node

Hi,

We have been experiencing a situation where, upon starting a k8s node, the node-agent produces tens of thousands of error logs for several seconds. This sudden surge in log generation disrupts our logging system.

I have looked through the available configuration options, including log-level management for the node agent, but found nothing: there seems to be no configuration option to mitigate the excessive logging or to change the node-agent's log level.

This issue has been negatively impacting our system's performance and hampering our ability to effectively monitor and analyze logs. We would greatly appreciate any assistance or guidance you can provide.


RSS memory calculation overflow for some low in resource usage processes

For some nodes in my setup, in the node view's "Memory consumers, bytes" graph, I see some processes with 18.45E memory consumption, so the graph becomes unusable.

My nodes in question run on Debian 11, using cgroups v2.

It seems to be a consequence of the current RSS calculation in the agent, where RSS is computed as current - vars["file"].

For example, for system-getty.slice cgroup I see:

cat /sys/fs/cgroup/system.slice/system-getty.slice/memory.current
249856
cat /sys/fs/cgroup/system.slice/system-getty.slice/memory.stat | grep ^file\
file 270336

So the subtraction gives a negative result, which wraps around when stored as an unsigned value.

RX timestamp: no timestamp found

E0528 22:45:11.952218 2630039 pinger.go:94] failed to get RX timestamp: no timestamp found
(the line above is repeated a dozen more times with nearby timestamps)
W0528 22:45:11.953642 2630039 container.go:851] failed to send packet to ::1: write ip4 0.0.0.0->::1: address ::1: non-IPv4 address
(the line above is repeated several more times with nearby timestamps)

Are the above logs normal? I haven't found the reason why these logs are printed. Is it related to the Docker containers being monitored or to the kernel?

No support for GKE?

Tried to install coroot on GKE - node-agents are all crashing with panics like this:

panic: interface conversion: bpf.MapValue is *lbmap.Backend4ValueV3, not *lbmap.Backend4Value

goroutine 101 [running]:
github.com/coroot/coroot-node-agent/containers.lookupCilium4({{{0xc000a51a78?, 0xc000216340?}, 0xc000122078?}, 0x1940?}, {{{0xb07965?, 0xc000216340?}, 0xc000122078?}, 0xd31b?})
        /tmp/src/containers/cilium.go:107 +0x4d8
github.com/coroot/coroot-node-agent/containers.lookupCiliumConntrackTable({{{0xc000216340?, 0xc000056000?}, 0xc000122078?}, 0x1970?}, {{{0xc000a519d0?, 0xbbbddc?}, 0xc000122078?}, 0x1?})
        /tmp/src/containers/cilium.go:66 +0x54
github.com/coroot/coroot-node-agent/containers.(*Container).getActualDestination(0xc0008ab9e0, 0x12fb, {{{0x0?, 0x1?}, 0xc000122078?}, 0x1b78?}, {{{0x0, 0xffff0a0c0001}, 0xc000122078}, 0x1bb})
        /tmp/src/containers/container.go:477 +0xa7
github.com/coroot/coroot-node-agent/containers.(*Container).onConnectionOpen(0xc0008ab9e0, 0x12fb, 0x8, {{{0x0?, 0x0?}, 0xc000122078?}, 0x0?}, {{{0x0, 0xffff0a0c0001}, 0xc000122078}, ...}, ...)
        /tmp/src/containers/container.go:451 +0x27b
github.com/coroot/coroot-node-agent/containers.(*Registry).handleEvents(0xc00047e040, 0xc0005233e0)
        /tmp/src/containers/registry.go:192 +0x3df
created by github.com/coroot/coroot-node-agent/containers.NewRegistry
        /tmp/src/containers/registry.go:89 +0x4db

`proc.ReadFds()` hangs

Description

This function hangs, specifically at dest, err := os.Readlink(path.Join(fdDir, entry.Name())).

func ReadFds(pid uint32) ([]Fd, error) {
	fdDir := Path(pid, "fd")
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		return nil, err
	}
	res := make([]Fd, 0, len(entries))
	for _, entry := range entries {
		fd, err := strconv.ParseUint(entry.Name(), 10, 64)
		if err != nil {
			continue
		}
		dest, err := os.Readlink(path.Join(fdDir, entry.Name()))
		if err != nil {
			continue
		}
		var socketInode string
		if strings.HasPrefix(dest, "socket:[") && strings.HasSuffix(dest, "]") {
			socketInode = dest[len("socket:[") : len(dest)-1]
		}
		res = append(res, Fd{Fd: fd, Dest: dest, SocketInode: socketInode})
	}
	return res, nil
}

Reproduction

$ git log -1
commit 8e1fa825ad97ce88d587e8991cd8357c19f90dd4 (HEAD -> main, origin/main, origin/HEAD, dd-trace)
Author: Nikolay Sivko <[email protected]>
Date:   Wed Dec 20 17:19:07 2023 +0300

    CRI-O: fix container log discovery

$ uname -a
Linux ip-10-0-133-150 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ docker version
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:07:41 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:07:41 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.26
  GitCommit:        3dd1e886e55dd695541fdcd67420c2888645a495
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ pwd
/home/ubuntu/workspace/coroot-node-agent

$ docker build . -t coroot-node-agent-dev

[+] Building 58.9s (18/18) FINISHED    docker:default
 => [internal] load .dockerignore    0.0s
 => => transferring context: 59B    0.0s
 => [internal] load build definition from Dockerfile    0.0s
 => => transferring dockerfile: 553B    0.0s
 => [internal] load metadata for docker.io/library/debian:bullseye    0.6s
 => [internal] load metadata for docker.io/library/golang:1.19-bullseye    0.6s
 => [builder 1/9] FROM docker.io/library/golang:1.19-bullseye@sha256:2fdfcb03b1445f06f1cf8a342516bfd34026b527fef8427f40ea7b140168fda2    0.0s
 => [stage-1 1/3] FROM docker.io/library/debian:bullseye@sha256:71f0e09d55a4042ddee1f114a0838d99266e185bf33e71f15c15bf6b9545a9a0    0.0s
 => [internal] load build context    0.0s
 => => transferring context: 22.23kB    0.0s
 => CACHED [builder 2/9] RUN apt update && apt install -y libsystemd-dev    0.0s
 => CACHED [builder 3/9] COPY go.mod /tmp/src/    0.0s
 => CACHED [builder 4/9] COPY go.sum /tmp/src/                                                                                                                                                                                                                                                                                                                                                                                                                                       0.0s
 => CACHED [builder 5/9] WORKDIR /tmp/src/                                                                                                                                                                                                                                                                                                                                                                                                                                           0.0s
 => CACHED [builder 6/9] RUN go mod download                                                                                                                                                                                                                                                                                                                                                                                                                                         0.0s
 => [builder 7/9] COPY . /tmp/src/                                                                                                                                                                                                                                                                                                                                                                                                                                                   0.1s
 => [builder 8/9] RUN CGO_ENABLED=1 go test ./...                                                                                                                                                                                                                                                                                                                                                                                                                                   51.7s
 => [builder 9/9] RUN CGO_ENABLED=1 go install -mod=readonly -ldflags "-X main.version=unknown" /tmp/src                                                                                                                                                                                                                                                                                                                                                                             5.7s
 => CACHED [stage-1 2/3] RUN apt update && apt install -y ca-certificates && apt clean                                                                                                                                                                                                                                                                                                                                                                                               0.0s
 => [stage-1 3/3] COPY --from=builder /go/bin/coroot-node-agent /usr/bin/coroot-node-agent                                                                                                                                                                                                                                                                                                                                                                                           0.2s
 => exporting to image                                                                                                                                                                                                                                                                                                                                                                                                                                                               0.3s
 => => exporting layers                                                                                                                                                                                                                                                                                                                                                                                                                                                              0.2s
 => => writing image sha256:52fd0dd6da8116dae22bee78bb8c62f24917e5332f5d3ec880b0b68a2fc35f27                                                                                                                                                                                                                                                                                                                                                                                         0.0s
 => => naming to docker.io/library/coroot-node-agent-dev                                                                                                                                                                                                                                                                                                                                                                                                                             0.0s

$ docker run --detach --name coroot-node-agent-dev --privileged --pid host -p 8080:80 -v /sys/kernel/debug:/sys/kernel/debug:rw -v /sys/fs/cgroup:/host/sys/fs/cgroup:ro coroot-node-agent-dev --cgroupfs-root=/host/sys/fs/cgroup

I've inserted additional trace logs to pinpoint the code segment responsible for this issue, and also followed this doc.

See the result of git diff.

trace-log.patch

Logs

See the attachment, docker logs coroot-node-agent-dev result.

docker.log

Docker for Desktop MacOS: kernel tracing is not available: stat /sys/kernel/debug/tracing: no such file or directory

I am trying to run coroot locally in my Docker for Desktop on Mac, to experiment a bit. I have the following error:

$ kubectl -n coroot logs -f coroot-node-agent-sgg44
I1226 11:46:02.098046   14086 net.go:29] ephemeral-port-range: 32768-60999
I1226 11:46:02.120372   14086 cilium.go:29] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
I1226 11:46:02.120501   14086 cilium.go:35] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I1226 11:46:02.120511   14086 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I1226 11:46:02.120518   14086 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I1226 11:46:02.120524   14086 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I1226 11:46:02.120598   14086 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I1226 11:46:02.121085   14086 main.go:100] agent version: 1.15.1
I1226 11:46:02.121188   14086 main.go:106] hostname: linuxkit-e6ee0407aa7e
I1226 11:46:02.121193   14086 main.go:107] kernel version: 6.4.16-linuxkit
W1226 11:46:02.121674   14086 main.go:69] failed to read machine-id: open /proc/1/root/sys/devices/virtual/dmi/id/product_uuid: no such file or directory
W1226 11:46:02.121723   14086 main.go:69] failed to read machine-id: open /proc/1/root/etc/machine-id: no such file or directory
W1226 11:46:02.121731   14086 main.go:69] failed to read machine-id: open /proc/1/root/var/lib/dbus/machine-id: no such file or directory
I1226 11:46:02.121747   14086 tracing.go:36] OpenTelemetry traces collector endpoint: http://coroot-opentelemetry-collector:4318/v1/traces
I1226 11:46:02.121841   14086 otel.go:28] OpenTelemetry logs collector endpoint: http://coroot-opentelemetry-collector:4318/v1/logs
I1226 11:46:02.121967   14086 metadata.go:66] cloud provider:
I1226 11:46:02.121979   14086 collector.go:157] instance metadata: <nil>
I1226 11:46:02.122107   14086 profiling.go:51] profiles endpoint: http://coroot-pyroscope:4040/ingest
linkKProbes
linkKProbes end
W1226 11:46:02.309859   14086 registry.go:74] Cannot connect to the Docker daemon at unix:///proc/1/root/run/docker.sock. Is the docker daemon running?
I1226 11:46:05.318197   14086 containerd.go:37] using /run/containerd/containerd.sock
W1226 11:46:05.318317   14086 registry.go:80] stat /proc/1/root/var/run/crio/crio.sock: no such file or directory
W1226 11:46:05.320744   14086 registry.go:83] systemd journal not found in /proc/1/root/run/log/journal,/proc/1/root/var/log/journal
F1226 11:46:05.321518   14086 main.go:136] kernel tracing is not available: stat /sys/kernel/debug/tracing: no such file or directory

I followed the https://coroot.com/docs/coroot-community-edition/getting-started/installation installation steps. All is working/running:

$ kubectl -n coroot get pods
NAME                                             READY   STATUS    RESTARTS        AGE
coroot-78dc5f6597-l2v9b                          1/1     Running   0               17m
coroot-clickhouse-shard0-0                       1/1     Running   0               17m
coroot-kube-state-metrics-78c5649759-xbk8r       1/1     Running   0               17m
coroot-node-agent-sgg44                          0/1     Error     8 (5m23s ago)   17m
coroot-opentelemetry-collector-68db4f574-26zbc   1/1     Running   4 (16m ago)     17m
coroot-prometheus-server-6bc948d5dc-rhzts        2/2     Running   0               17m
coroot-pyroscope-85ffd498b7-v4stb                1/1     Running   0               17m

Is there a way to change the DaemonSet so that the agent can run, so I can at least experiment with it a bit?
Thanks in advance!
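The fatal error above comes down to whether the node's kernel exposes ftrace at all. A minimal probe (a sketch to run on the node/VM, not agent-specific) might look like:

```shell
# Probe for the ftrace tracing directory the agent needs (classic debugfs
# path or the newer tracefs mount point). On Docker Desktop's LinuxKit VM
# this is often absent: either debugfs isn't mounted, or ftrace support
# isn't compiled into the kernel at all.
if [ -d /sys/kernel/debug/tracing ] || [ -d /sys/kernel/tracing ]; then
    echo "kernel tracing available"
else
    echo "kernel tracing NOT available"
fi
```

If the directory is merely unmounted, `mount -t debugfs debugfs /sys/kernel/debug` inside the VM may help; if the LinuxKit kernel genuinely lacks ftrace support, no DaemonSet change can make the agent start there.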

The metrics interface of some nodes cannot respond normally

Env

# k get node -owide
NAME          STATUS   ROLES    AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.165.6.25   Ready    node     584d     v1.19.4   10.165.6.25   <none>        Ubuntu 18.04.6 LTS      5.4.187-0504187-generic       docker://19.3.13
10.165.6.26   Ready    node     581d     v1.19.4   10.165.6.26   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.6.27   Ready    node     581d     v1.19.4   10.165.6.27   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.8.23   Ready    node     109d     v1.19.4   10.165.8.23   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
....

# helm -n coroot list
NAME  	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART       	APP VERSION
coroot	coroot   	1       	2023-10-25 15:48:12.919318 +0800 CST	deployed	coroot-0.5.1	0.21.0

# k -n coroot get pods -owide | grep coroot-node-agent
coroot-node-agent-249ws                           1/1     Running   0          36m     10.165.208.69    10.165.6.27   <none>           <none>
coroot-node-agent-6bxlb                           1/1     Running   0          4h27m   10.165.204.252   10.165.8.23   <none>           <none>
coroot-node-agent-tfhdw                           1/1     Running   6          4h27m   10.165.210.2     10.165.6.26   <none>           <none>
coroot-node-agent-89xqp                           1/1     Running   7          4h26m   10.165.202.98    10.165.6.25   <none>           <none>

Description

The coroot-node-agent on some nodes is down: its metrics interface cannot respond normally.

The CPU profile shows that the netlink.AddrList function takes up more than 70% of the CPU time (screenshots attached).

The abnormal coroot-node-agent pods' CPU usage is about 2.5 cores, while a normal pod's CPU usage is about 0.2 cores.

All node configurations and pod counts are similar. Please help me troubleshoot the problem.
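Since netlink.AddrList enumerates every address on every interface, one hypothesis worth checking (an assumption, not a confirmed cause) is that the slow nodes simply have far more network interfaces, e.g. leaked veth devices. A quick comparison between a healthy node and an affected node:

```shell
# Count network interfaces without needing iproute2; compare this number
# between a healthy node and an affected node.
ls /sys/class/net | wc -l
```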

Error running coroot-node-agent on docker and WSL2

Hi,
Thanks for such a great tool.
I'm just running coroot-node-agent on Docker using the command provided in the readme file and I'm getting the following error:
I0926 05:12:35.317767 95622 main.go:76] agent version: 1.0.21
I0926 05:12:35.317978 95622 main.go:82] hostname: docker-desktop
I0926 05:12:35.317983 95622 main.go:83] kernel version: 5.10.102.1-microsoft-standard-WSL2
W0926 05:12:35.318069 95622 main.go:65] failed to read machine-id: open /proc/1/root/sys/devices/virtual/dmi/id/product_uuid: no such file or directory
W0926 05:12:35.319393 95622 main.go:65] failed to read machine-id: open /proc/1/root/etc/machine-id: no such file or directory
W0926 05:12:35.319610 95622 main.go:65] failed to read machine-id: open /proc/1/root/var/lib/dbus/machine-id: no such file or directory
I0926 05:12:35.319757 95622 metadata.go:63] cloud provider:
I0926 05:12:35.319781 95622 collector.go:152] instance metadata:
F0926 05:12:35.320025 95622 main.go:103] netlink receive: no such file or directory

Not sure if I have missed something; I'm running Windows 11 and WSL2.

Thanks for your help.
Best Regards,
Alex

[Advice] Providing performance testing documents and data

May I ask what additional L4 and L7 network latency coroot-node-agent introduces?
What is the impact of the eBPF application topology, tracing, etc. on the application?
Could the official documentation provide performance/stress test data?

After our testing, with coroot-node-agent deployed the p90 network latency increased by 6460us and QPS decreased by 50%, as shown in the flame graphs below (screenshots attached).

Why is MAX_PAYLOAD_SIZE defined as 1024 (#define MAX_PAYLOAD_SIZE 1024)? Is the L7 processing logic here quite time-consuming, and why was 1024 bytes chosen?

The performance test process:

  1. Deploy coroot according to the documentation on the coroot website.
➜  ebpf-performance kubectl -n coroot get pod -owide
NAME                                              READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
coroot-68d887b548-4fhkn                           1/1     Running   0          16d   10.2.2.10    192.168.1.14   <none>           <none>
coroot-clickhouse-shard0-0                        1/1     Running   0          16d   10.2.2.54    192.168.1.14   <none>           <none>
coroot-kube-state-metrics-597cfdc9f5-pjvxm        1/1     Running   0          16d   10.2.2.209   192.168.1.14   <none>           <none>
coroot-node-agent-6wshb                           1/1     Running   0          16d   10.2.2.219   192.168.1.14   <none>           <none>
coroot-node-agent-cfsfx                           1/1     Running   0          16d   10.2.1.124   192.168.1.20   <none>           <none>
coroot-node-agent-rt8hk                           1/1     Running   0          16d   10.2.0.110   192.168.1.24   <none>           <none>
coroot-opentelemetry-collector-6659857566-nw4m4   1/1     Running   0          40h   10.2.2.160   192.168.1.14   <none>           <none>
coroot-prometheus-server-669b7ccbb6-jfvzn         2/2     Running   0          16d   10.2.2.216   192.168.1.14   <none>           <none>
coroot-pyroscope-6fb8fc4db-l5df5                  1/1     Running   0          16d   10.2.2.102   192.168.1.14   <none>           <none>
coroot-pyroscope-ebpf-6c6wx                       1/1     Running   0          16d   10.2.0.54    192.168.1.24   <none>           <none>
coroot-pyroscope-ebpf-dj6c6                       1/1     Running   0          16d   10.2.2.61    192.168.1.14   <none>           <none>
coroot-pyroscope-ebpf-tjkcq                       1/1     Running   0          16d   10.2.1.59    192.168.1.20   <none>           <none>
  2. Deploy the performance-test client and server; the test command is taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://ip --latency.
➜  ebpf-performance kubectl -n cilium-test get pod -owide
NAME                    READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-b89648f96-2bz7r   1/1     Running   0          11m   10.2.2.57    192.168.1.14   <none>           <none>
wrk-58fb8c49ff-d7p2c    1/1     Running   0          33m   10.2.1.161   192.168.1.20   <none>           <none>
  3. Start the performance test with the client and server.
    The result without coroot:
[root@wrk-58fb8c49ff-s4g8b /]# taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://172.16.2.191 --latency
Running 1m test @ http://172.16.2.191
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   286.99us  357.21us  12.03ms   96.71%
    Req/Sec     8.22k     1.90k   16.70k    89.92%
  Latency Distribution
     50%  235.00us
     75%  252.00us
     90%  297.00us
     99%    2.23ms
  982111 requests in 1.00m, 796.08MB read
Requests/sec:  16366.99
Transfer/sec:     13.27MB

The result with coroot:

[root@wrk-58fb8c49ff-d7p2c /]# taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://10.2.2.57 --latency
Running 1m test @ http://10.2.2.57
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.78ms    7.83ms 182.95ms   92.91%
    Req/Sec     3.22k     1.58k    8.56k    60.55%
  Latency Distribution
     50%  394.00us
     75%    1.43ms
     90%    7.29ms
     99%   33.57ms
  384280 requests in 1.00m, 311.49MB read
Requests/sec:   6396.37
Transfer/sec:      5.18MB
  4. The test environment:
os:Ubuntu / 20.04 LTS amd64 (64bit)   
cri:containerd 1.6.20
Kubernetes version:1.24.4
kernel version:5.4.0-139-generic
  5. The YAML of the client and server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wrk
spec:
  selector:
    matchLabels:
      run: wrk
  replicas: 1
  template:
    metadata:
      labels:
        run: wrk
    spec:
      initContainers:
      - name: setsysctl
        image: xxx/busybox:latest
        securityContext:
          privileged: true
        command:
        - sh
        - -c
        - |
          sysctl -w net.core.somaxconn=65535
          sysctl -w net.ipv4.ip_local_port_range="1024 65535"
          sysctl -w net.ipv4.tcp_tw_reuse=1
          sysctl -w fs.file-max=1048576
      containers:
      - name: wrk
        image: xxx/wrk:4.2.0
        ports:
        - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  minReadySeconds: 0
  strategy:
    type: RollingUpdate # strategy: rolling update
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "false"
        app: nginx
    spec:
      restartPolicy: Always
      initContainers:
        - name: setsysctl
          image: xxx/busybox:latest
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              sysctl -w net.core.somaxconn=65535
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w fs.file-max=1048576
      containers:
        - name: nginx
          image: xxx/nginx:1.14.2
          imagePullPolicy: Always
          ports:
            - containerPort: 80
          command:
            - /bin/sh
            - -c
            - "cd /usr/share/nginx/html/ && dd if=/dev/zero of=1k bs=1k count=1 && dd if=/dev/zero of=100k bs=1k count=100 && nginx -g \"daemon off;\""
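For reference, the throughput change implied by the two wrk runs above can be computed directly from the reported Requests/sec figures (this is just arithmetic on the numbers quoted earlier, not an independent measurement):

```shell
# Relative QPS drop between the baseline run (16366.99 req/s) and the run
# with coroot-node-agent deployed (6396.37 req/s), from the wrk output above.
awk 'BEGIN { base = 16366.99; agent = 6396.37;
             printf "QPS drop: %.1f%%\n", (base - agent) / base * 100 }'
# prints: QPS drop: 60.9%
```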

Failed to start in docker agent version: 1.19.8

kernel:
Linux localhost 4.19.90-52.22.v2207.ky10.aarch64 #1 SMP Tue Mar 14 11:52:45 CST 2023 aarch64 aarch64 aarch64 GNU/Linux
coroot agent version: 1.19.8
docker version:
Client:
Version: 23.0.6
API version: 1.42
Go version: go1.19.9
Git commit: ef23cbc
Built: Fri May 5 21:16:16 2023
OS/Arch: linux/arm64
Context: default

Server: Docker Engine - Community
Engine:
Version: 23.0.6
API version: 1.42 (minimum version 1.12)
Go version: go1.19.9
Git commit: 9dbdbd4
Built: Fri May 5 21:17:31 2023
OS/Arch: linux/arm64
Experimental: false
containerd:
Version: v1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0

It failed to start; logs:

I0604 04:14:47.431304 1274943 main.go:111] agent version: 1.19.8
I0604 04:14:47.431370 1274943 main.go:117] hostname: localhost
I0604 04:14:47.431377 1274943 main.go:118] kernel version: 4.19.90-52.22.v2207.ky10.aarch64
I0604 04:14:47.434007 1274943 main.go:75] machine-id:  1d7b4ad8f42243509ec578c69b2b0c9d
I0604 04:14:47.434066 1274943 tracing.go:37] OpenTelemetry traces collector endpoint:
I0604 04:14:47.434118 1274943 otel.go:29] OpenTelemetry logs collector endpoint:
I0604 04:14:47.434292 1274943 metadata.go:67] cloud provider:
I0604 04:14:47.434303 1274943 collector.go:157] instance metadata: <nil>
I0604 04:14:47.434435 1274943 profiling.go:52] profiles endpoint:
W0604 04:14:52.105964 1274943 registry.go:82] couldn't connect to containerd through the following UNIX sockets [/var/snap/microk8s/common/run/containerd.sock,/run/k0s/containerd.sock,/run/k3s/containerd/containerd.sock,/run/containerd/containerd.sock]: failed to dial "/proc/1/root/run/containerd/containerd.sock": context deadline exceeded
W0604 04:14:52.106002 1274943 registry.go:85] stat /proc/1/root/var/run/crio/crio.sock: no such file or directory
I0604 04:14:52.107340 1274943 tracer.go:79] L7 tracing is disabled
F0604 04:14:52.534310 1274943 main.go:149] failed to link program: reading file "/sys/kernel/debug/tracing/events/syscalls/sys_enter_connect/id": open /sys/kernel/debug/tracing/events/syscalls/sys_enter_connect/id: no such file or directory
[sw@localhost docker_compose]$ ls /sys/kernel/debug/tracing/events/
ls: cannot access '/sys/kernel/debug/tracing/events/': Permission denied
[sw@localhost docker_compose]$ sudo ls /sys/kernel/debug/tracing/events/
alarmtimer  cma		      drm	filemap       hns3	   io_uring  libata   net	      percpu	    rcu      sched   spi      thermal  workqueue
block	    compaction	      enable	fs_dax	      huge_memory  ipi	     mdio     nvme	      power	    regmap   scsi    sunrpc   timer    writeback
bpf_trace   context_tracking  ext4	ftrace	      i2c	   irq	     migrate  oom	      printk	    rpcrdma  signal  swiotlb  ucsi     xdp
bridge	    cpuhp	      fib	gpio	      ib_mad	   jbd2      module   page_isolation  qdisc	    rpm      skb     target   udp      xfs
cgroup	    devlink	      fib6	header_event  initcall	   kmem      napi     pagemap	      ras	    rseq     smbus   task     vmscan   xhci-hcd
clk	    dma_fence	      filelock	header_page   iommu	   kvm	     neigh    page_pool       raw_syscalls  rtc      sock    tcp      wbt
[sw@localhost docker_compose]$ sudo ls /sys/kernel/debug/tracing/events/syscalls


I guess some tracing feature is not supported by this kernel. I disabled tracing with `--disable-l7-tracing`, but it still cannot start.
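The fatal error above names a specific missing tracepoint file. Checking for it directly shows whether the kernel exposes syscall tracepoints at all (its absence usually means CONFIG_FTRACE_SYSCALLS is not enabled, which `--disable-l7-tracing` cannot work around, since the log shows the agent linking sys_enter_connect even with L7 tracing disabled):

```shell
# The agent fails linking its eBPF program because this tracepoint file is
# missing; checking for it tells syscall-tracepoint support apart from other
# startup problems. (Needs root, since debugfs is root-only by default.)
f=/sys/kernel/debug/tracing/events/syscalls/sys_enter_connect/id
if [ -r "$f" ]; then
    echo "syscall tracepoints available"
else
    echo "syscall tracepoints missing (CONFIG_FTRACE_SYSCALLS not enabled?)"
fi
```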





OOM kill agent

Hello coroot team, I have another issue, now centered on the node-agent.
Attached are metrics screenshots from Lens.

We are reaching the 1GB default memory limit on 3 nodes. We have coroot running in 3-4 different clusters and this behaviour appears only in one. We have tried looking into node differences but we didn't find anything. Do you have any idea why these spikes might be happening and how it could be mitigated?

Questions about EPHEMERAL_PORT_RANGE

Hi, I found that the EPHEMERAL_PORT_RANGE environment variable limits the detection of listening ports. The value is set to 32768-60999. Can I remove the limit entirely, e.g. by setting it to 0-0? Would that have any bad effects? Thanks.
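For context, the agent's default of 32768-60999 matches the Linux kernel's own default ephemeral port range, which you can inspect on a node with:

```shell
# Print the kernel's ephemeral (local) port range; the agent presumably
# uses this range to tell client-side ports apart from listening services,
# so widening it would make more ports look like ephemeral client ports.
cat /proc/sys/net/ipv4/ip_local_port_range
```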

crashloop on panic with labels

Hello, since today we have been getting the following error (screenshot attached):

agent version: 1.8.8
kernel version: 5.10.184-175.731.amzn2.x86_64
EKS 1.26

I can send you the contents of the blurred area if needed, but I need to be sure there is no sensitive data about our system in it.

Bug in documentation

Hello!

You have an error in the node-agent documentation.

It lists the flag:
--no-parse-logs Disable container logs parsing

but if this flag is set, node-agent errors with:

coroot-node-agent: error: unknown long flag '--no-parse-logs', try --help

but in your code the flag is defined as:

DisableLogParsing = kingpin.Flag("disable-log-parsing", "Disable container log parsing").Default("false").Bool()

Please correct the node-agent flags documentation.

you have a great project!

On EKS 1.22 missing some node_cloud_info needed for cost analysis

Hello,
Using the latest node-agent Helm chart, I can't get cost analysis. After some research in Prometheus, it seems some labels are missing, based on your documentation.

Here are the labels missing from node_cloud_info:

  • provider
  • instance_type
  • instance_life_cycle
  • region
  • availability_zone_id

My current labels:

{
    account_id="xxxxx",
    availability_zone_id="xxxxxx",
    instance="xxxxxxxx:80",
    instance_id="i-00fxxxxxxxxxxx",
    local_ipv4="10.xxxxxxxx",
    machine_id="ec2708xxxxxxxxxxxxxxxx",
}

Some complementary information is attached as a screenshot.

Fails to start with docker run

Given the documented docker run line

docker run -it --name coroot-node-agent --privileged --pid host -v /sys/kernel/debug:/sys/kernel/debug:rw -v /sys/fs/cgroup:/host/sys/fs/cgroup:ro ghcr.io/coroot/coroot-node-agent --cgroupfs-root=/host/sys/fs/cgroup

on an Ubuntu 20.04 host (not running in k8s), the container starts and immediately exits.

I0609 20:47:54.413530  101556 cilium.go:29] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
I0609 20:47:54.413651  101556 cilium.go:35] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I0609 20:47:54.413676  101556 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I0609 20:47:54.413704  101556 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I0609 20:47:54.413730  101556 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I0609 20:47:54.413764  101556 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I0609 20:47:54.414233  101556 main.go:81] agent version: 1.8.6
I0609 20:47:54.414328  101556 main.go:87] hostname: alpha-pg-1
I0609 20:47:54.414356  101556 main.go:88] kernel version: 5.4.0-1092-kvm
I0609 20:47:54.414461  101556 main.go:71] machine-id:  5d42852a98ec471c9d4c9ee29536a7f6
I0609 20:47:54.414509  101556 tracing.go:29] no OpenTelemetry collector endpoint configured
I0609 20:47:54.414945  101556 metadata.go:66] cloud provider:
I0609 20:47:54.415018  101556 collector.go:157] instance metadata: <nil>
I0609 20:47:57.420953  101556 containerd.go:37] using /run/containerd/containerd.sock
F0609 20:47:57.531383  101556 main.go:112] failed to link program: trace event syscalls/sys_enter_read: file does not exist

I've verified that the paths are correct, and that /sys/kernel/debug and /sys/fs/cgroup do exist on the host.

On the host, the /proc/1/root/sys/fs/bpf/ directory is empty (root@db:~# ls /proc/1/root/sys/fs/bpf/ returns nothing).

I'm using the latest Docker image.
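The fatal "trace event syscalls/sys_enter_read: file does not exist" error usually means the syscall tracepoints the agent attaches to are not exposed by the kernel, e.g. tracefs is not mounted or the kernel was built without CONFIG_FTRACE_SYSCALLS (possible on minimal KVM kernels such as 5.4.0-1092-kvm). A diagnostic sketch to run on the host; the paths below are the conventional ones, adjust if yours differ:

```shell
# Check that debugfs/tracefs is mounted and that the syscall tracepoint exists
mount | grep -E 'tracefs|debugfs'
if [ -d /sys/kernel/debug/tracing/events/syscalls/sys_enter_read ]; then
    echo "syscall tracepoints available"
else
    echo "tracepoint missing: mount tracefs or use a kernel with CONFIG_FTRACE_SYSCALLS"
fi
```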

Error in Docker Desktop WSL2 while installing coroot-node-agent

Hello! I installed Coroot via the Helm chart: helm install --namespace coroot --create-namespace coroot coroot/coroot

My environment: Docker Desktop v4.22 on Windows 11 (WSL2)

After the install, the coroot-node-agent pods don't start, with these errors in the log:

Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory


Could you tell me what the problem is? Perhaps it cannot start on WSL2 by design (eBPF support and so on)?

Fails to start with SELinux

Hello!
Please add the SELinux permission below to the installation script:

semanage fcontext -a -t bin_t "/usr/bin/coroot-node-agent" && restorecon -Rv /usr/bin/coroot-node-agent
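After running the commands above, you can confirm that the new context was applied (a verification sketch using the same path as the script; on hosts without SELinux the context column shows "?"):

```shell
# Print the security context of the binary; after restorecon it should include bin_t
ls -Z /usr/bin/coroot-node-agent
```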
