kindling's Introduction

Kindling


Visit our Kindling website for more information.

What is Kindling

Kindling is an eBPF-based cloud-native monitoring tool that aims to help users understand application behavior from the kernel up to the code stack. With trace profiling, we hope users can understand an application's behavior easily and find the root cause in seconds. Besides trace profiling, Kindling provides an easy way to get an overview of network flows in a Kubernetes environment, along with many built-in network monitoring dashboards covering TCP retransmits, DNS, throughput, and TPS. Kindling is not only a network monitor: it also analyzes individual RPC calls and breaks them down into the same stages you would see in Chrome's network explorer, so users can tell which part to dig into to find the root cause of a problem in the production environment.

img

What is Kindling Trace-profiling

Even with traces, metrics, and logs, many issues still can't be understood easily. Trace-profiling tries to integrate the OnCPU and OffCPU events into the traces and collect the logs emitted during the trace's execution timeframe.

OnCPU events are similar to a flame graph, except that the stacks are collected at the thread level instead of the process level. Since a trace is executed by one thread, users can see exactly how that trace was executing on the CPU.

OffCPU events are the opposite of OnCPU events. From the trace-analysis point of view, most traces spend the majority of their lifetime waiting for locks, database queries, remote procedure calls, or file reads and writes. All of these put the thread into a waiting state and are considered OffCPU events.

So with trace profiling, how every thread executed is recorded and can be replayed:

  • The exact thread which executed the trace span is highlighted.
  • The logs printed by each thread are collected and correlated to the relative thread with its timestamp.
  • The code execution flame graph is correlated to the time series where the CPU is busy.
  • The network-related metrics are correlated to the time series where the network syscalls are executing.
  • The file-related metrics are correlated to the time series where the file syscalls are executing.

Architecture

From a high-level view, the agent runs as a DaemonSet in Kubernetes. It collects all syscalls and some other tracepoints. We use different exporters for different distributions; for example, we build a Prometheus exporter so the data can be stored in Prometheus and displayed in the Grafana plugin. The trace-profiling module, however, has its own UI and runs as a standalone module.

image.png

Linux kernel version support

The Kindling eBPF module requires a kernel version newer than 4.14, while trace-profiling requires a kernel newer than 4.17; with more work, we hope trace-profiling can support older kernels. Because of eBPF constraints, the eBPF module can't work on older kernel versions. For users who want to try the functionality on an old kernel, we provide a kernel module derived from the Sysdig open-source project, with our own enhancements and verification. The basic idea is to use a kernel module to hook the kernel tracepoints. Thanks to the Sysdig open-source project for providing a tracepoint instrumentation framework for older kernel versions.

For now, the kernel module behaves the same as the eBPF module in our tests, except for trace-profiling, but we recommend using the eBPF module in production because it is safer than a kernel module. We support the kernel module so that users on older kernels can still experience the eBPF magic, and you are welcome to report issues with it. Functionally, the kernel module and the eBPF module capture the same data and behave identically.

Why do we build Kindling?

When we talk about observability, we already have plenty of tools to use, like Skywalking for tracing, ELK for logging, and Prometheus for metrics. Why do we need to build an eBPF-based monitoring tool?

The major barrier to Kubernetes adoption is its complexity. For applications running on Kubernetes, we don't know the network flows between services until we instrument the apps, and we can't tell which part to blame when a production issue arises. Is Kubernetes configured correctly? Is there a bug in the virtual network, such as Calico or Flannel, causing the problem? Or is the application code at fault?

We are a company based in Hangzhou, China, and we used to provide a Kubernetes distribution to our customers. Our customers kept asking these questions, and we didn't have proper answers for them.

APM (Application Performance Monitoring) works well for Java applications, which can be instrumented automatically, while Go programs have to be modified to add instrumentation. Even if we adopt an APM solution, we still can't tell whether an issue is caused by a network problem, and for many issues the root cause can't be pinpointed easily.

We found it may be helpful to triage the issue first by checking it from the network view, to roughly classify it: "oh, it's a network problem, the code works fine, and we should dig into the configuration of Calico", or "the infrastructure works fine, so the app code should be blamed, let's dig into the logs or the APM dashboard for further information".

After we triage the issue, we need to pinpoint the root cause of the issue. That's why we need the trace-profiling module.

Why eBPF?

Analyzing flows in a Kubernetes environment with libpcap is too expensive in CPU and network overhead; capturing data with eBPF costs much less. eBPF is also the most popular technology for tracing the Linux kernel, which is where the virtual network built from veth pairs and iptables lives. That makes eBPF a proper technique for tracking how the kernel responds to application requests.

Core Features

With the trace-profiling module, we can understand how ElasticSearch works easily. The following image shows how ElasticSearch is executing the bulk insert operation.

trace-profiling

The next image shows a dependency map in Kubernetes.

img

Kindling integrates easily with Prometheus, and we use PromQL to query the data in the frontend, so it should be easy to adopt. However, due to Prometheus's cardinality constraints, we group the detailed data into buckets, which discards the fine-grained information.
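To illustrate the bucketing idea, here is a small sketch; it is not Kindling's exporter code, and the bucket bounds and names are made up for the example.

package export

// Hypothetical bucket upper bounds in milliseconds; the real boundaries depend on
// configuration and are not taken from the Kindling code base.
var bucketBoundsMs = []float64{10, 50, 100, 500, 1000, 5000}

// bucketFor maps an exact latency to the smallest bucket bound that contains it,
// so Prometheus only ever sees a fixed, low-cardinality set of values.
func bucketFor(latencyMs float64) float64 {
    for _, bound := range bucketBoundsMs {
        if latencyMs <= bound {
            return bound
        }
    }
    return -1 // stands for the +Inf bucket
}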

Get started

You can deploy Kindling easily; check out the Installation Guide for details.

Documentation

The Kindling documentation is available on our Kindling website.

Contributing

Contributions are welcome, and there are many ways to contribute: report issues, help us reproduce them, fix bugs, add features, give us advice in GitHub Discussions, and so on. If you are interested in joining us to unveil eBPF in the Kubernetes area, you can start by reading the Contributing Guide.

Contact

If you have questions or ideas, please feel free to reach out to us in the following ways:

img

License

Kindling is distributed under the Apache License, Version 2.0.


kindling's Issues

Lower kernel versions do not support the srtt metric

Describe the bug
Lower kernel versions do not support the srtt metric.

How to reproduce?
Compile the probe on 3.10.0-229.el7.x86_64 and 3.10.0-123.el7.x86_64.

Logs
Please attach the logs by running the following command:

kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-probe
kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-collector

Compile probe for 3.10.0-123.el7.x86_64
make -C /usr/src/kernels/3.10.0-123.el7.x86_64/ M=/source/driver modules
make[1]: Entering directory '/host/usr/src/kernels/3.10.0-123.el7.x86_64'
  CC [M]  /source/driver/main.o
  CC [M]  /source/driver/dynamic_params_table.o
  CC [M]  /source/driver/fillers_table.o
  CC [M]  /source/driver/flags_table.o
  CC [M]  /source/driver/ppm_events.o
  CC [M]  /source/driver/ppm_fillers.o
/source/driver/ppm_fillers.c: In function 'f_tcp_rcv_established_e':
/source/driver/ppm_fillers.c:5224:15: error: 'struct tcp_sock' has no member named 'srtt_us'
  u32 srtt = ts->srtt_us >> 3;
               ^
make[2]: *** [/source/driver/ppm_fillers.o] Error 1
make[1]: *** [_module_/source/driver] Error 2
make[1]: Leaving directory '/host/usr/src/kernels/3.10.0-123.el7.x86_64'
make: *** [all] Error 2
mv: cannot stat 'kindling-falcolib-probe.ko': No such file or directory
make -C /usr/src/kernels/3.10.0-123.el7.x86_64/ M=/source/driver clean
make[1]: Entering directory '/host/usr/src/kernels/3.10.0-123.el7.x86_64'
  CLEAN   /source/driver/.tmp_versions
make[1]: Leaving directory '/host/usr/src/kernels/3.10.0-123.el7.x86_64'

Discuss on list_max_size

In commit 115cb2f, I added a control, list_max_size, that limits the size of the conversion list for kindling events. If the current list size exceeds the max size, the converter stops converting and drops incoming events until the list is emptied after sending.
The list_max_size argument is used to control CPU and memory usage, so a recommended or default value should be decided: a larger list_max_size means more CPU for marshaling the list in one batch and more memory for storing it.
Besides, we should record the number of events dropped because the list was full, perhaps as a self-monitoring metric.
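A minimal sketch of the idea, with hypothetical names (this is not the code from commit 115cb2f):

package convert

import "sync"

// boundedList sketches the list_max_size behavior: once the pending list reaches its
// cap, incoming events are dropped and counted until the list is drained after sending.
type boundedList struct {
    mu      sync.Mutex
    events  []interface{}
    maxSize int
    dropped uint64 // candidate self-monitoring metric: events dropped because the list was full
}

// Push stores an event unless the list is full; it reports whether the event was kept.
func (l *boundedList) Push(e interface{}) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if len(l.events) >= l.maxSize {
        l.dropped++
        return false
    }
    l.events = append(l.events, e)
    return true
}

// Drain hands the pending events to the sender and empties the list, after which
// Push starts accepting events again.
func (l *boundedList) Drain() []interface{} {
    l.mu.Lock()
    defer l.mu.Unlock()
    out := l.events
    l.events = nil
    return out
}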

Failed to compile the agent-libs when the kernel version is below 3.10.0-693

Describe the bug
Failed to compile the agent-libs when the kernel version is below 3.10.0-693. The compiler complains with the error shown in the logs below.

How to reproduce?
Just compile the agent-libs repository following the instructions at Installation on the kernel version of 3.10.0-514 or below.

Logs

Compile probe for 3.10.0-514.el7.x86_64
make -C /usr/src/kernels/3.10.0-514.el7.x86_64/ M=/source/driver modules
make[1]: Entering directory '/host/usr/src/kernels/3.10.0-514.el7.x86_64'
  CC [M]  /source/driver/main.o
In file included from include/linux/compiler.h:54,
                 from include/linux/kprobes.h:32,
                 from /source/driver/main.c:13:
include/linux/compiler-gcc.h:106:1: fatal error: linux/compiler-gcc8.h: No such file or directory
 #include gcc_header(__GNUC__)
 ^~~~
compilation terminated.
make[2]: *** [scripts/Makefile.build:342: /source/driver/main.o] Error 1
make[1]: *** [Makefile:1300: _module_/source/driver] Error 2
make[1]: Leaving directory '/host/usr/src/kernels/3.10.0-514.el7.x86_64'
make: *** [Makefile:17: all] Error 2
mv: cannot stat 'kindling-falcolib-probe.ko': No such file or directory
make -C /usr/src/kernels/3.10.0-514.el7.x86_64/ M=/source/driver clean
make[1]: Entering directory '/host/usr/src/kernels/3.10.0-514.el7.x86_64'
  CLEAN   /source/driver/.tmp_versions
make[1]: Leaving directory '/host/usr/src/kernels/3.10.0-514.el7.x86_64'
make -C /usr/src/kernels/3.10.0-514.el7.x86_64/ M=$PWD
make[1]: Entering directory '/host/usr/src/kernels/3.10.0-514.el7.x86_64'
clang -I./arch/x86/include -Iarch/x86/include/generated  -Iinclude -I./arch/x86/include/uapi -Iarch/x86/include/generated/uapi -I./include/uapi -Iinclude/generated/uapi -include ./include/linux/kconfig.h \

Environment (please complete the following information)

  • Kindling-falcon-lib version: Based on the latest codes on the branch of kindling-dev. The commit is 8c130f7.
  • Node Kernel version: 3.10.0-514.el7.x86_64 or below
  • Node OS version: CentOS7

Additional context
By the way, 3.10.0-693 is ok.

Add a real-time circuit breaker and downgrade mechanism to avoid high cardinality metrics data and excessive trace data

Is your feature request related to a problem? Please describe.
Scenario 1: A service running on the server depends on an external database. When the database is broken and the server keeps receiving a steady stream of requests, a large amount of meaningless abnormal trace data is generated and recorded on a specific topology.
Scenario 2: In a large-scale k8s cluster, unmergeable URLs and SQL statements may lead to excessive metric data. When downstream components such as the Receiver or Prometheus cannot bear the pressure, unforeseen data loss occurs.

Describe the solution you'd like
For Scenario 1, we need a mechanism that turns on automatically under high pressure to ensure that meaningless, repetitive traces are not saved in large numbers, so that more valuable data (such as topology) gets enough resources for processing. This consists of two parts: judging whether the current trace pressure affects the operation of other parts, and judging whether a newly generated trace is worth processing and recording (a minimal sketch of this follows below).
For Scenario 2, we need a controllable service-degradation logic that gradually reduces the impact of a divergent dimension (such as URL or SQL) on the entire system. The order of degradation can range from one divergent dimension of a single service to one or more dimensions of the entire monitored cluster. We also need to determine which dimensions of data are more valuable in order to decide the demotion order.
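As mentioned above, here is a minimal sketch of the Scenario 1 backlog check; the type, field names, and threshold are hypothetical and only illustrate the idea, not the proposed design.

package breaker

import "sync/atomic"

// traceBreaker sheds repetitive trace data when the processing backlog is high, so
// that aggregated data such as topology keeps enough resources for processing.
type traceBreaker struct {
    pending   int64 // traces currently waiting to be processed
    highWater int64 // backlog size above which new traces are dropped
}

// Allow reports whether a newly generated trace is worth recording right now.
func (b *traceBreaker) Allow() bool {
    return atomic.LoadInt64(&b.pending) < b.highWater
}

// Enqueue and Done keep the backlog counter up to date around trace processing.
func (b *traceBreaker) Enqueue() { atomic.AddInt64(&b.pending, 1) }
func (b *traceBreaker) Done()    { atomic.AddInt64(&b.pending, -1) }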

Describe alternatives you've considered
For Scenario 2, we could also add some logic to converge dimensions that have already diverged.

conntracker panic: interface {} is nil, not *simplelru.entry

Logs

panic: interface conversion: interface {} is nil, not *simplelru.entry

goroutine 292 [running]:
github.com/hashicorp/golang-lru/simplelru.(*LRU).removeElement(0xc000dae4a0, 0xc00105b230)
	/root/go/pkg/mod/github.com/hashicorp/[email protected]/simplelru/lru.go:172 +0x152
github.com/hashicorp/golang-lru/simplelru.(*LRU).removeOldest(0xc000dae4a0)
	/root/go/pkg/mod/github.com/hashicorp/[email protected]/simplelru/lru.go:165 +0x4c
github.com/hashicorp/golang-lru/simplelru.(*LRU).Add(0xc000dae4a0, 0x1be7fa0, 0xc000a6db60, 0x1942260, 0xc002cd7340, 0xc000099e00)
	/root/go/pkg/mod/github.com/hashicorp/[email protected]/simplelru/lru.go:67 +0x346
github.com/Kindling-project/kindling/collector/metadata/conntracker.(*conntrackCache).Add.func2(0xc004cac508, 0xc004cac4c8, 0xa6d301)
	/source/kindling/collector/metadata/conntracker/conntracker_cache.go:73 +0x15d
github.com/Kindling-project/kindling/collector/metadata/conntracker.(*conntrackCache).Add(0xc00105b200, 0xc004cac400, 0xc00089a100)
	/source/kindling/collector/metadata/conntracker/conntracker_cache.go:76 +0xe4
github.com/Kindling-project/kindling/collector/metadata/conntracker.(*Conntracker).updateCache(0xc00105b290, 0xc004cac400, 0x0)
	/source/kindling/collector/metadata/conntracker/conntracker.go:152 +0x8e
github.com/Kindling-project/kindling/collector/metadata/conntracker.(*Conntracker).poll.func2(0xc00089a120, 0xc00105b290)
	/source/kindling/collector/metadata/conntracker/conntracker.go:127 +0x51
created by github.com/Kindling-project/kindling/collector/metadata/conntracker.(*Conntracker).poll
	/source/kindling/collector/metadata/conntracker/conntracker.go:125 +0x13e

Kprobe and kretprobe cannot be mounted on the same function

Describe the bug
Kprobe and kretprobe cannot be mounted on the same function

How to reproduce?
Add this code to probe.c:

BPF_KPROBE(tcp_connect)
{
        struct sysdig_bpf_settings *settings;
        enum ppm_event_type evt_type;
        settings = get_bpf_settings();
        if (!settings)
                return 0;

        struct sock *sk = (struct sock *)_READ(ctx->di);
        const struct inet_sock *inet = inet_sk(sk);
        u16 sport = 0;
        u16 dport = 0;
        bpf_probe_read(&sport, sizeof(sport), (void *)&inet->inet_sport);
        bpf_probe_read(&dport, sizeof(dport), (void *)&inet->inet_dport);
        bpf_printk("tcp connect happened, sport:%d, dport:%d\n", ntohs(sport), ntohs(dport));
        return 0;
}

BPF_KRET_PROBE(tcp_connect)
{
        struct sysdig_bpf_settings *settings;
        enum ppm_event_type evt_type;
        settings = get_bpf_settings();
        if (!settings)
                return 0;

        struct sock *sk = (struct sock *)_READ(ctx->di);
        const struct inet_sock *inet = inet_sk(sk);
        u16 sport = 0;
        u16 dport = 0;
        bpf_probe_read(&sport, sizeof(sport), (void *)&inet->inet_sport);
        bpf_probe_read(&dport, sizeof(dport), (void *)&inet->inet_dport);
        bpf_printk("tcp_connect_ret happened, sport:%d, dport:%d\n", ntohs(sport), ntohs(dport));
        return 0;
}

Logs

terminate called after throwing an instance of 'scap_open_exception'
  what():  failed to create kprobe 'tcp_connect' error 'Device or resource busy'

Aborted

Environment (please complete the following information)

  • Kindling agent version:all

Can't receive events after the probe is restarted

Describe the bug
Can't receive events after the probe is restarted.
What you expected to happen
The collector is supposed to receive events from the probe.
How to reproduce
Start kindling and restart the probe, then you will see the collector can't receive any events.
Additional context
The collector starts to receive events only after it sends the subscription events. The root cause of this bug is that after the probe is restarted, it loses the subscription information, in which case it won't send any events.
Therefore, to fix this bug, the collector needs the ability to resend the subscription events.
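A rough sketch of the requested behavior; the function signature and event names are hypothetical, not existing collector code.

package subscribe

import "time"

// resendSubscription periodically resends the collector's subscription so that a
// restarted probe, which has lost its subscription state, starts sending events
// again without a collector restart. Resending is assumed to be harmless when the
// probe already holds the subscription.
func resendSubscription(send func(events []string) error, events []string, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            _ = send(events) // errors could be logged and retried on the next tick
        case <-stop:
            return
        }
    }
}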

Incorrect workload values in Topology Dashboard's workload combo box

Describe the bug
The options in the workload combo box are incomplete.

What did you expect to see?
There are four options: bookdemo, bookdemo2, bookdemo3, and jmeterv2.

What did you see instead?
There are only three options: bookdemo, bookdemo2, and bookdemo3.

Screenshots
image

In my Kubernetes, the resources are:
image

Cannot get topology metrics from a Pod restarted after the kindling-agent started

Describe the bug

I was trying to use JMeter, which runs as a Deployment, to test my application.

Kindling-agent succeeded in showing the topology from JMeter to my application at first. However, after DELETING THE POD of JMeter and waiting for it to restart, kindling-agent didn't show the topology from the new Pod to my application as expected.

How to reproduce?

  1. Start a Server Workload in your cluster
  2. Start a Client Workload in your cluster
  3. Start the Kindling agent in your cluster and you will get the topology metrics
  4. Delete the Client Pod and wait for its restart
  5. Check the Kindling agent metrics and there is no topology from Client to Server even after a long time
  6. Delete the Kindling agent on the node of the Client and check the metrics again; the topology from Client to Server shows up quickly.

What did you expect to see?

Get the topology from kindling-agent for traffic from the new Client Pod to the Server Pod without restarting the agent.

What did you see instead?

No expected topology

Additional context
I checked the collector's log; it seems that only the TCP-RTT event is sent to the collector for the new Client Pod, which means this problem also affects entity metrics.

What are the mounts listed in run_docker.sh used for?

Recently I compiled the probe following the instructions but found something that confused me.

The first thing is that I don't have the directories listed as follows:
https://github.com/Kindling-project/kindling/blob/b1a367330c84c633145484f139d514d017591983/probe/scripts/run_docker.sh#L8-L12

And the second thing is that I was using podman to build the container instead of docker, so there is no /var/run/docker.sock file:
https://github.com/Kindling-project/kindling/blob/b1a367330c84c633145484f139d514d017591983/probe/scripts/run_docker.sh#L26

So I had to remove these mounts to make the command stop complaining about errors, and I worried that I was breaking something. Fortunately, it still worked and nothing bad happened: I compiled the probe, built the container, pushed it to my repository (oh, right, it is a bit inconvenient to push; we'll talk about that later), and ran it successfully in my cluster. The building process was very smooth following the instructions.


As I said above, it seems these mounts are useless or at least not always necessary. I want to know what these mounts are used for. Is there anything I missed? IMO, if they are not always necessary, we had better simplify them to suit most scenarios. If some scenarios do require these mounts, we should write that down and tell users in a clearer way.


The last thing I have to complain about is that nothing tells me where I can find and modify the address of the container repository. I rummaged through the repository and finally found it in kindling/probe/src/probe/BUILD.bazel. I think we should write this down explicitly to make the documentation more friendly and reduce the burden on users.

collector is killed without prompt after the probe exits

The lifecycle section in the deployment yaml we provide runs a shell script once the probe restarts.
https://github.com/Kindling-project/kindling/blob/0d8c821f1593f0908c35e32dccf5908a32335a19/deploy/kindling-deploy.yml#L22-L27

The script forces the collector to restart by executing kill -9. This is needed to temporarily avoid #34.
https://github.com/Kindling-project/kindling/blob/0d8c821f1593f0908c35e32dccf5908a32335a19/probe/deploy/post_start.sh#L1-L5

But the problem here is that there isn't any prompt about when and why the collector is terminated, because it is killed forcibly with kill -9. Whether or not we keep using this method to work around #34, the collector should be (and be able to be) shut down gracefully, so that the stateful data in the collector can be dealt with properly and users can know why and when the collector was shut down.

So there are two parts of work to do about this issue.

  1. The collector should be able to shut down gracefully and properly deal with its stateful data. Logs are needed to tell users what happened (a minimal sketch of this signal handling follows the list).
  2. The script in the lifecycle section should execute kill instead of kill -9.
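A minimal sketch of the graceful-shutdown handling for point 1; flushState is a placeholder, not an existing function in the collector.

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // A plain `kill` sends SIGTERM, which can be trapped, unlike `kill -9`.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    // ... start receivers, analyzers, and exporters here ...

    sig := <-sigCh
    log.Printf("received signal %v, shutting down the collector gracefully", sig)
    flushState()
}

// flushState stands in for the real cleanup: exporting buffered metrics,
// closing sockets, and persisting any stateful data.
func flushState() {}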

Modify kindling_trace_request_duration_nanoseconds labels to make querying more convenient

Hi,
Can you change the values of request_reqxfer_status, request_processing_status, and response_rspxfer_status in kindling_trace_request_duration_nanoseconds from green, yellow, red to the numbers 1, 2, and 3, instead of requiring label_replace to map them one by one? This would make it more convenient to drive the related color distinctions in Grafana dashboards.

Change the repository of the bpf-compiler image

Describe the bug
The image reference "registry.us-west-1.aliyuncs.com/arms-docker-repo/bpf-compiler:kindling-without-extra" in run_docker.sh has not been updated synchronously.
Replace this image with "kindlingproject/kindling-compiler".

Installation documentation improvement

This issue tracks the improvements needed for our installation documentation.

  • #91
  • #77
  • Simplify the Installation steps of the Grafana plugin
  • How to build the container

We also need to talk about how to keep the documentation on the official website in sync with the docs in this repository.

Collector needs more valuable logs

Is your feature request related to a problem? Please describe.
There are too few useful logs in the collector, and almost no logs are printed at all.
image

Describe alternatives you've considered
Add some logs. For example, count the number of traces.

New function: TCP connection monitoring

Describe the solution you'd like
Monitor whether a TCP connection succeeds.
Monitor the cause of TCP connection failures.
Monitor the TCP connection time.
Describe alternatives you've considered
I will provide two kinds of events from the bottom layer:
First, TCP_connect. The occurrence of this event indicates that a TCP connection attempt has happened. It can be used together with syscall-connect to determine the starting point of analysis, and the event also carries some reasons for the connection failure.
Second, TCP_finish_connect. This event only appears when the connection is successfully established, so it can be used to judge whether the connection succeeded.
Additional context
If anyone is interested, you can add the corresponding code in the collector and grafana-dashboard.
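If it helps the discussion, here is an illustrative sketch of how the two proposed events could be combined on the collector side; the struct shapes and names are hypothetical and not part of the proposal.

package tcpmonitor

import "time"

// connectStart corresponds to the proposed TCP_connect event: a connection attempt
// started, possibly with an immediate failure reason attached.
type connectStart struct {
    Start      time.Time
    FailReason string // empty when no failure reason was reported
}

// connectFinish corresponds to the proposed TCP_finish_connect event, which only
// appears when the connection is successfully established.
type connectFinish struct {
    End time.Time
}

// connectResult derives the two requested metrics: whether the connection succeeded
// and how long it took. A missing finish event means the connection never completed.
func connectResult(start connectStart, finish *connectFinish) (ok bool, d time.Duration) {
    if start.FailReason != "" || finish == nil {
        return false, 0
    }
    return true, finish.End.Sub(start.Start)
}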

What are the minimal set of Linux capabilities the agent requests?

There are some security concerns about why the agent must run with root privileges. Users typically want to grant only the minimal set of privileges possible. We need to explain which privileges are necessary and why; then we can set Linux capabilities to restrict the range of privileges.

https://github.com/Kindling-project/kindling/blob/fe3467ba0192895762ec2d9eeb89c5bd712de160/deploy/kindling-deploy.yml#L56

https://github.com/Kindling-project/kindling/blob/fe3467ba0192895762ec2d9eeb89c5bd712de160/deploy/kindling-deploy.yml#L102

When compiling the probe, eBPF modules should not be compiled on lower kernel versions

Describe the bug
When compiling the probe on a low-version kernel, only the kernel module builds successfully; the eBPF build fails, although this does not affect usage. We need to optimize this so that eBPF is not compiled at all on lower kernel versions.

What did you expect to see?
The compilation succeeds on lower versions without so many error logs.

Logs
In file included from include/linux/sched.h:33:
include/linux/signal.h:220:10: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
case 2: set->sig[1] = 0;
^ ~
./arch/x86/include/asm/signal.h:23:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from /source/driver/bpf/probe.c:16:
In file included from include/linux/sched.h:33:
include/linux/signal.h:232:10: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
case 2: set->sig[1] = -1;
^ ~
./arch/x86/include/asm/signal.h:23:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from /source/driver/bpf/probe.c:20:
/source/driver/bpf/bpf_helpers.h:13:10: error: use of undeclared identifier 'BPF_FUNC_map_lookup_elem'
(void *)BPF_FUNC_map_lookup_elem;
^
/source/driver/bpf/bpf_helpers.h:16:10: error: use of undeclared identifier 'BPF_FUNC_map_update_elem'
(void *)BPF_FUNC_map_update_elem;
^
/source/driver/bpf/bpf_helpers.h:18:10: error: use of undeclared identifier 'BPF_FUNC_map_delete_elem'
(void *)BPF_FUNC_map_delete_elem;
^
/source/driver/bpf/bpf_helpers.h:20:10: error: use of undeclared identifier 'BPF_FUNC_probe_read'
(void *)BPF_FUNC_probe_read;
^
/source/driver/bpf/bpf_helpers.h:22:10: error: use of undeclared identifier 'BPF_FUNC_ktime_get_ns'
(void *)BPF_FUNC_ktime_get_ns;
^
/source/driver/bpf/bpf_helpers.h:24:10: error: use of undeclared identifier 'BPF_FUNC_trace_printk'
(void *)BPF_FUNC_trace_printk;
^
/source/driver/bpf/bpf_helpers.h:26:10: error: use of undeclared identifier 'BPF_FUNC_tail_call'
(void *)BPF_FUNC_tail_call;
^
/source/driver/bpf/bpf_helpers.h:28:10: error: use of undeclared identifier 'BPF_FUNC_get_smp_processor_id'
(void *)BPF_FUNC_get_smp_processor_id;
^
/source/driver/bpf/bpf_helpers.h:30:10: error: use of undeclared identifier 'BPF_FUNC_get_current_pid_tgid'
(void *)BPF_FUNC_get_current_pid_tgid;
^
/source/driver/bpf/bpf_helpers.h:32:10: error: use of undeclared identifier 'BPF_FUNC_get_current_uid_gid'
(void *)BPF_FUNC_get_current_uid_gid;
^
/source/driver/bpf/bpf_helpers.h:34:10: error: use of undeclared identifier 'BPF_FUNC_get_current_comm'
(void *)BPF_FUNC_get_current_comm;
^
/source/driver/bpf/bpf_helpers.h:36:10: error: use of undeclared identifier 'BPF_FUNC_perf_event_read'
(void *)BPF_FUNC_perf_event_read;
kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-probe
kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-collector

Environment (please complete the following information)

  • Kindling agent version: all
  • Kindling-falcon-lib version: all
  • Node OS version: below CentOS 7.6
  • Node Kernel version: below 3.10.0-1062
  • Kubernetes version: all
  • Prometheus version: all
  • Grafana version: all


Sometimes Pod IP can't be found querying conntrack module with Service ClusterIP and Port

Describe the bug
Sometimes Pod IP can't be found querying conntrack module with Service ClusterIP and Port.
Expected behavior
Pod IP should be found.
How to reproduce
Just run Kindling and you will see some metrics whose dst_pod_ip is missing.
Screenshots
Metrics
Logs
No logs needed I think.
Environment (please complete the following information)

  • Kindling agent version: 0.1.0
  • Kindling-falconlib version: Integrated with the probe
  • Node OS version: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  • K8s cluster version: v1.20.4-aliyun.1
  • Node Kernel version: 4.19.91-23.al7.x86_64

Can't get network destination pod name from the network dashboard

Describe the bug
Hi, Kindling guys. In the network details dashboard I found that my pods' connections encounter packet retransmits, but when I want to know the connection's pod details, I can get the source pod name while the destination pod name is missing. By the way, the network destination must be a pod, because I can get the destination workload name.

What did you expect to see?
I want to get both the network connection's source pod name and destination pod name.

What did you see instead?
A clear and concise description of what you saw instead.

Screenshots
image

Resolving container id costs too much cpu

We found that when the kindling-probe runs inside a container, the CPU cost increases to 50%+, compared with about 20% on a VM. After analyzing with perf, we located the problem:
The function matchContainer->matches_cgroups() costs too much CPU resolving the container id. It is called every time we convert to a kindling event.
Besides, uprobe_data fails to fetch the container id because the thread info no longer contains it after kindling switched to the minimal falcolib.

The collector was OOMKilled when receiving massive events

As we all know, the collector consumes events from the probe over a UNIX domain socket using ZeroMQ. I have found that if the probe sends too many events and the collector is unable to keep up with the rate of incoming messages, the collector uses more and more memory until it is OOMKilled. Even after I stopped the test load on the node, so that the collector could keep up with the events again, the memory was not released. I know there is an option, ZMQ_HWM, which should have limited the memory allocated, but after I set this option the problem was still there.

This seems to be what ZeroMQ is expected to do, so we should adapt our usage of ZeroMQ to prevent this from happening. See more at zeromq/libzmq#4218.

Improve logs of the probe

The logs the probe prints are rather casual now, especially when an error happens. Have a look at the following logs.
probe-logs.txt

There are several problems that have been pointed out.

What does it mean and why is it necessary?

There is no information about what the probe is doing.

kindling-falcolib-probe/
kindling-falcolib-probe/4.18.0-147.el8.x86_64.ko
kindling-falcolib-probe/4.18.0-147.el8.x86_64.o
kindling-falcolib-probe/4.18.0-193.el8.x86_64.ko
kindling-falcolib-probe/4.18.0-193.el8.x86_64.o
kindling-falcolib-probe/4.18.0-240.el8.x86_64.ko
kindling-falcolib-probe/4.18.0-240.el8.x86_64.o
kindling-falcolib-probe/4.18.0-305.3.1.el8.x86_64.ko
kindling-falcolib-probe/4.18.0-305.3.1.el8.x86_64.o
kindling-falcolib-probe/4.18.0-80.el8.x86_64.ko
kindling-falcolib-probe/4.18.0-80.el8.x86_64.o
kindling-falcolib-probe/3.10.0-1062.9.1.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1062.9.1.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1062.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1062.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1127.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1127.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1160.11.1.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1160.11.1.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1160.15.2.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1160.15.2.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1160.24.1.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1160.24.1.el7.x86_64.o
kindling-falcolib-probe/3.10.0-1160.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-1160.el7.x86_64.o
kindling-falcolib-probe/3.10.0-693.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-862.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-957.21.3.el7.x86_64.ko
kindling-falcolib-probe/3.10.0-957.21.3.el7.x86_64.o
kindling-falcolib-probe/4.19.159.mizar.ko
kindling-falcolib-probe/4.19.159.mizar.o
kindling-falcolib-probe/4.19.91-23.al7.x86_64.ko
kindling-falcolib-probe/4.19.91-23.al7.x86_64.o
kindling-falcolib-probe/4.19.91-24.1.al7.x86_64.ko
kindling-falcolib-probe/4.19.91-24.1.al7.x86_64.o

What is driver?

Actually, we don't mention the term driver anywhere in our documentation.

Unable to load the driver

What is the device? Why does it fail to initialize? What should I do next?

This means the kernel version is not supported by default, and the users should compile the drivers by themselves. We should state this information clearly.

kindling probe init err: error opening device /host/dev/kindling-falcolib0. Make sure you have root credentials and that the kindling-falcolib-probe module is loaded.

The probe terminated with the error "CPU x configuration change detected."

Describe the bug
The probe terminated with an error twice on the same node, which was not expected.

How to reproduce?
I can't reproduce the error every time.

What did you expect to see?
The probe can run without termination.

What did you see instead?
Last time I just started the probe on my node and it terminated unexpectedly. After restarting twice, it has run stably for a long time.

Screenshots
image

Logs
probe.txt

Environment (please complete the following information)

  • Kindling agent version: v0.1.0
  • Node OS version: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  • Node Kernel version: 4.19.91-23.al7.x86_64
  • Kubernetes version: v1.20.4-aliyun.1

External network traffic is not being traced

Describe the bug
Hi, I want to identify the specific source that causes the heavy traffic on my ingress workload. But in the topology dashboard I can't see the external IP to my ingress workload; it seems that external network traffic is not being traced.
How to reproduce?
Visit your ingress from outside the Kubernetes cluster

What did you expect to see?
External IP to ingress must be displayed

What did you see instead?
No node sends traffic to my ingress workload
Screenshots
image

"controller already started" reported frequently

The following logs have been reported recently.

2022/03/11 01:43:36 controller already started
2022/03/11 01:43:50 controller already started
2022/03/11 01:43:51 controller already started
2022/03/11 01:44:05 controller already started
2022/03/11 01:44:06 controller already started
2022/03/11 01:44:20 controller already started
2022/03/11 01:44:21 controller already started
2022/03/11 01:44:35 controller already started
2022/03/11 01:44:36 controller already started
2022/03/11 01:44:50 controller already started
2022/03/11 01:44:51 controller already started
2022/03/11 01:45:05 controller already started
2022/03/11 01:45:06 controller already started
2022/03/11 01:45:20 controller already started
2022/03/11 01:45:21 controller already started
2022/03/11 01:45:35 controller already started
2022/03/11 01:45:36 controller already started

This is printed by opentelemetry-go only if both the Start and Collect methods are used.

// ErrControllerStarted indicates that a controller was started more
// than once.
var ErrControllerStarted = fmt.Errorf("controller already started")
...
// Note that it is not necessary to Start a controller when only
// pulling data; use the Collect() and ForEach() methods directly in
// this case.
func (c *Controller) Start(ctx context.Context) error {
...
}

I have confirmed that the opentelemetry-go prometheus exporter uses the Collect method, so it is unnecessary to call the Start method. The following code causes this bug.
https://github.com/Kindling-project/kindling/blob/c55ce70bb8ab968c9da5873227ef82e6bdd7ef79/collector/observability/telemetry.go#L65-L68

Besides, as we can see, these logs are not printed in the same format as ours. Opentelemetry-go provides a global method to set a user-defined error handler, which we should use to print more information.
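A sketch of wiring this into our zap logger, assuming the go.opentelemetry.io/otel package's SetErrorHandler; the handler type and field names here are illustrative.

package observability

import (
    "go.opentelemetry.io/otel"
    "go.uber.org/zap"
)

// zapErrorHandler routes opentelemetry-go errors (such as "controller already started")
// through the collector's zap logger so that they share our log format.
type zapErrorHandler struct {
    logger *zap.Logger
}

func (h zapErrorHandler) Handle(err error) {
    h.logger.Warn("opentelemetry-go reported an error", zap.Error(err))
}

// installErrorHandler registers the handler globally; call it once at startup.
func installErrorHandler(logger *zap.Logger) {
    otel.SetErrorHandler(zapErrorHandler{logger: logger})
}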

Bind a less common port instead of 8080

Is your feature request related to a problem? Please describe.
The collector binds port 8080, but 8080 is a commonly used port.

Describe the solution you'd like
Bind a higher, less commonly used port.

Performance optimization

Background

Hi, folks. Congratulations on our first milestone. A version that implements the basic requirements is available now, but there are still a lot of things to do before Kindling is ready for production. The first barrier is performance. We have conducted a series of tests to find out where the bottlenecks are, and fortunately we now know clearly which parts need further profiling. Please refer to the following list for more information.

Probe

  • Emitting events via Sysdig decreases the QPS of the application beyond our expectation when QPS is higher than 30k+.
  • Marshalling the event model costs too much CPU.
  • Processing gRPC requests from uProbe events costs too much CPU and memory; OOMKilled happens frequently when the memory limit is set to 200MB.

Collector

  • Unmarshalling the event model costs too much CPU.
  • The exporter performs badly when aggregating metrics.

The probe terminated with the error: libprotobuf CHECK failed

Describe the bug
The probe terminated with an error.

How to reproduce?
I can't reproduce the bug.

Screenshots
untitled

Logs
The log file is attached. The important part of it is as follows.
probe-crash.txt

[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/message_lite.cc:360] CHECK failed: target + size == res:
terminate called after throwing an instance of 'google::protobuf::FatalException'

Environment (please complete the following information)

  • Kindling agent version: v0.1.0
  • Node OS version: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  • Node Kernel version: 4.19.91-23.al7.x86_64
  • Kubernetes version: v1.20.4-aliyun.1

Missing syscall events for network analysis

Describe the bug
The syscall events writev, readv, pwritev, preadv, pwrite, and pread are missing; they may be needed for network analysis.

What did you expect to see?
Please subscribe to these syscall_exit events in the kindling-collector config. However, I haven't met the last four yet, so they can be subscribed to later depending on need.

Wrong usage of zap.logger.debug

Describe the bug
In the performance test, we found a huge cost from String(), which was caused by the line below.

p.telemetry.Logger.Debug("dstNodeIp or srcNodeIp is empty which is not expected, skip: ", zap.String("gaugeGroup", gaugeGroup.String()))

Here is the flame graph.

image

To avoid this cost, when we use zap.Logger.Debug to add logs at the debug level, we should use the zap.Logger.Check() function, like below:

if ce := p.telemetry.Logger.Check(zapcore.DebugLevel, "dstNodeIp or srcNodeIp is empty which is not expected, skip: "); ce != nil {
	ce.Write(zap.String("gaugeGroup", gaugeGroup.String()))
}

I will add a PR to fix this kind of problem in our project now.

Error happened when compiling eBPF modules on kernel 4.19.24-9.al7.x86_64

Describe the bug
Error happened when compiling eBPF modules on the kernel of 4.19.24-9.al7.x86_64.

How to reproduce?
Just compile the agent-libs repository following the instructions at Installation on the kernel version of 4.19.24-9.al7.x86_64. This kernel could be downloaded at Aliyun Mirrors.

What did you expect to see?
No errors happened.

What did you see instead?
Several errors happened and no eBPF product was compiled. But the kernel module is OK.

Screenshots
image

Environment (please complete the following information)

  • Kindling-falcon-lib version: Based on the latest codes on the branch of kindling-dev. The commit is 8c130f7.
  • Node Kernel version: 4.19.24-9.al7.x86_64

Kprobe initialization failed when starting two probes

Describe the bug
Kprobe initialization failed when starting two probes; I use the eBPF module.

How to reproduce?
Start two probes.

What did you expect to see?
Both probes should start successfully.

Screenshots
If applicable, add screenshots to help explain your problem.

What config did you use?
Config: (e.g. the yaml config file)

Logs
Please attach the logs by running the following command:

kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-probe
kubectl logs -f kindling-agent-xxx(replace with your podname) -n kindling -c kindling-collector

KINDLING_PROBE_VERSION: v0.1-2021-1221
kindling probe init err: failed to create kprobe 'tcp_drop' error 'Device or resource busy'

Environment (please complete the following information)

  • Kindling agent version 0.2.0
  • Kindling-falcon-lib version 0.2.0
  • Node OS version centos7.8
  • Node Kernel version 3.10.0-1023

agent-libs supports the uprobe framework

Is your feature request related to a problem? Please describe.
Support uprobe natively instead of through Pixie.

Describe alternatives you've considered
The introduction of Pixie has brought us many problems in both performance and functionality. Therefore, we have to spend effort making agent-libs support uprobe. The eBPF implementation will be written first, and the kernel module can be done later.

cannot attach uprobe, probe entry may not exist

Describe the bug
cannot attach uprobe, probe entry may not exist (the same startup log is attached in full under Logs below)
How to reproduce?
Use your own compiled kernel probe package (kindling-falcolib-probe.tar.gz) to start kindling-probe.

Logs
Please attach the logs by running the following command:

[root@wc-testk8s-0-131 deploy]# kubectl logs -f kindling-agent-7vfj6 -c kindling-probe -n kindling
kindling-falcolib-probe/
kindling-falcolib-probe/4.14.105-19-0012.o
kindling-falcolib-probe/4.14.105-19-0012.ko
kindling-falcolib-probe/4.14.105-19-0021.o
kindling-falcolib-probe/4.14.105-19-0021.ko
* Mounting debugfs
* BPF probe located, it's now possible to start kindling
* Load probe succeeded, and will create /opt/kernel-support for kubernetes
Start kindling probe...
KINDLING_PROBE_VERSION: v0.1-2021-1221
kernel version is 4.14.105
cannot attach uprobe, probe entry may not exist
F0309 06:40:45.162115 188024 socket_trace_bpf_tables.cc:61] Check failed: _s.ok() Bad Status: Internal : Unable to attach uprobe for binary /host/data/docker/overlay2/e8bbbb6c17e84d88e6722bca0e12c54735a827dbaede1a050ed9ceff264d2c63/merged/pl/kindling_probe symbol  addr 18391e0 offset 0 using conn_cleanup_uprobe
*** Check failure stack trace: ***
    @          0x5291d4d  google::LogMessage::Fail()
    @          0x5291187  google::LogMessage::SendToLog()
    @          0x5291a2e  google::LogMessage::Flush()
    @          0x5294b8c  google::LogMessageFatal::~LogMessageFatal()
    @          0x18395f0  px::stirling::ConnInfoMapManager::ConnInfoMapManager()
    @          0x18c030c  __gnu_cxx::new_allocator<>::construct<>()
    @          0x18c009d  std::allocator_traits<>::construct<>()
    @          0x18bfe1e  std::_Sp_counted_ptr_inplace<>::_Sp_counted_ptr_inplace<>()
    @          0x18bfbee  std::__shared_count<>::__shared_count<>()
    @          0x18bfb36  std::__shared_ptr<>::__shared_ptr<>()
    @          0x18bfacd  std::shared_ptr<>::shared_ptr<>()
    @          0x18bfa36  std::allocate_shared<>()
    @          0x186f4a4  std::make_shared<>()
    @          0x1842953  px::stirling::SocketTraceConnector::InitImpl()
    @          0x1c81a85  px::stirling::SourceConnector::Init()
    @          0x15ff1e4  px::stirling::StirlingImpl::AddSource()
    @          0x15fedc3  px::stirling::StirlingImpl::Init()
    @          0x160344f  px::stirling::Stirling::Create()
    @          0x15a2159  main
    @     0x7feb4f471e0b  __libc_start_main
seen by driver: 2499
seen by driver: 148878
seen by driver: 225847
crash signum:6 si_code:-6
    @          0x15a0b0e  _start
[New LWP 188327]
Couldn't get CS register: No such process.
Couldn't get registers: No such process.

Environment (please complete the following information)

  • KINDLING_PROBE_VERSION: v0.1-2021-1221
  • kernel version is 4.14.105

listenWorkers of Netlink exited with errors when under high load

Describe the bug
The listenWorkers of Netlink exited with the error recvmsg: no buffer space available, and no conntrack flows are received afterwards.

How to reproduce?

  1. Start the agent.
  2. Start a load under 10k QPS and it should create thousands of conntrack flows.
  3. The error will occur immediately.

What did you expect to see?
No errors were complained about.

What did you see instead?

error netlink.Receive error in listenWorker 0, exiting: netlink receive: recvmsg: no buffer space available occured during receiving message from conntracker socket
error netlink.Receive error in listenWorker 3, exiting: netlink receive: recvmsg: no buffer space available occured during receiving message from conntracker socket
error netlink.Receive error in listenWorker 2, exiting: netlink receive: recvmsg: no buffer space available occured during receiving message from conntracker socket
error netlink.Receive error in listenWorker 1, exiting: netlink receive: recvmsg: no buffer space available occured during receiving message from conntracker socket

What config did you use?
Config: (e.g. the yaml config file)

analyzers:
  mockanalyzer:
    num: 10
  networkanalyzer:
    connect_timeout: 100
    request_timeout: 1
    response_slow_threshold: 500
    enable_conntrack: true
    conntrack_max_state_size: 131072
    conntrack_rate_limit: 500
    proc_root: /proc
    protocol_parser: [ http, mysql, dns, redis, kafka ]
    protocol_config:
      - key: "mysql"
        slow_threshold: 100
        disable_discern: false
      - key: "kafka"
        slow_threshold: 100
      - key: "cassandra"
        ports: [ 9042 ]
        slow_threshold: 100
      - key: "s3"
        ports: [ 9190 ]
        slow_threshold: 100
      - key: "dns"
        ports: [ 53 ]
        slow_threshold: 100

Environment (please complete the following information)

  • Kindling agent version: d5f04e5
  • Node OS version: CentOS 7
  • Node Kernel version: 3.10.0-1127.el7.x86_64

Performance tuning for the collector

We have run many benchmarks for the collector and found several essential points that have a great impact on performance. This issue is created to record the tuning process.

Print version information when the program starts

Is your feature request related to a problem? Please describe.
Whenever an error occurs, we want to know exactly which version of the code the agent is running. But currently there is no such information printed by the collector, and only an outdated KINDLING_PROBE_VERSION: v0.1-2021-1221 is printed by the probe. We can't know their exact versions now.

Describe the solution you'd like
The best version information is the combination of the release version and the commit revision number, IMO. We could insert these values as environment variables when we build the containers and print them periodically while the agent is running (a common Go sketch is included at the end of this issue).

Additional context
Any ideas about how we could get this information are appreciated.
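One common way to carry this information in Go (a sketch, not necessarily the environment-variable approach suggested above) is to stamp the values at build time with -ldflags; the package path and variable names below are placeholders.

package version

import "fmt"

// Release and Commit are meant to be stamped at build time, for example:
//   go build -ldflags "-X <module-path>/version.Release=v0.2.0 -X <module-path>/version.Commit=$(git rev-parse --short HEAD)"
// where <module-path> is the real module path of the collector.
var (
    Release = "unknown"
    Commit  = "unknown"
)

// Info returns a printable version string, e.g. "kindling-collector v0.2.0 (8c130f7)",
// which could be logged at startup and periodically while the agent runs.
func Info(component string) string {
    return fmt.Sprintf("%s %s (%s)", component, Release, Commit)
}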

Make the event processing chain configurable

Currently, we build the processing chain with hard-coded wiring, which makes changes inconvenient. Every time we add a new component to the chain, we have to modify the source code, which makes the building procedure more verbose and harder to read. It looks like this:

// buildPipeline builds a event processing pipeline based on hard-code.
func (a *Application) buildPipeline() error {
	// TODO: Build pipeline via configuration to implement dependency injection
	// Initialize exporters
	otelExporterFactory := a.componentsFactory.Exporters[otelexporter.Otel]
	otelExporter := otelExporterFactory.NewFunc(otelExporterFactory.Config, a.telemetry.Telemetry)
	// Initialize all processors
	// 1. Kindling Metric Format Processor
	formatProcessorFactory := a.componentsFactory.Processors[kindlingformatprocessor.ProcessorName]
	formatProcessor := formatProcessorFactory.NewFunc(formatProcessorFactory.Config, a.telemetry.Telemetry, otelExporter)
	// 2. Kubernetes metadata processor
	k8sProcessorFactory := a.componentsFactory.Processors[k8sprocessor.K8sMetadata]
	k8sMetadataProcessor := k8sProcessorFactory.NewFunc(k8sProcessorFactory.Config, a.telemetry.Telemetry, formatProcessor)
        // other initialization
        ...
	a.analyzerManager = analyzerManager
	udsReceiverFactory := a.componentsFactory.Receivers[udsreceiver.Uds]
	udsReceiver := udsReceiverFactory.NewFunc(udsReceiverFactory.Config, a.telemetry.Telemetry, analyzerManager)
	a.receiver = udsReceiver
	return nil
}

Although our processing chain is inspired by the pipeline of Opentelemetry-Collector, its building procedure can't be applied to ours directly: the pipeline is single-direction for every processor, while the processing chain is more like a directed acyclic graph in which every node can have multiple fan-out edges. A rough sketch of a configuration-driven builder is shown below.
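The sketch below only restates the fan-out idea described above; the types, config keys, and ordering assumption are hypothetical, not the final design.

package pipeline

import "fmt"

// Consumer is the minimal contract every component (processor or exporter) satisfies.
type Consumer interface {
    Consume(event interface{}) error
}

// NodeConfig describes one component and the components it fans out to, so the
// resulting graph can be a DAG instead of a single-direction pipeline.
type NodeConfig struct {
    Name   string
    SendTo []string
}

// Factory builds a component from its own config plus its downstream consumers.
type Factory func(cfg interface{}, downstream []Consumer) Consumer

// Build wires the graph, assuming nodes are listed in reverse topological order
// (exporters first), mirroring what buildPipeline currently does by hand.
func Build(nodes []NodeConfig, factories map[string]Factory, configs map[string]interface{}) (map[string]Consumer, error) {
    built := make(map[string]Consumer)
    for _, n := range nodes {
        downstream := make([]Consumer, 0, len(n.SendTo))
        for _, target := range n.SendTo {
            c, ok := built[target]
            if !ok {
                return nil, fmt.Errorf("component %q sends to unknown component %q", n.Name, target)
            }
            downstream = append(downstream, c)
        }
        factory, ok := factories[n.Name]
        if !ok {
            return nil, fmt.Errorf("no factory registered for component %q", n.Name)
        }
        built[n.Name] = factory(configs[n.Name], downstream)
    }
    return built, nil
}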

I will work on this issue.

Unit test fails for networkAnalyzer due to getRecord()

Describe the bug
The unit test fails for networkAnalyzer due to the method getRecord().

=== RUN   TestHttpProtocol
=== RUN   TestHttpProtocol/slowData
=== RUN   TestHttpProtocol/errorData
=== RUN   TestHttpProtocol/normal
--- PASS: TestHttpProtocol (0.01s)
    --- PASS: TestHttpProtocol/slowData (0.00s)
    --- PASS: TestHttpProtocol/errorData (0.00s)
    --- PASS: TestHttpProtocol/normal (0.00s)
=== RUN   TestMySqlProtocol
=== RUN   TestMySqlProtocol/query-split
=== RUN   TestMySqlProtocol/query
--- PASS: TestMySqlProtocol (0.00s)
    --- PASS: TestMySqlProtocol/query-split (0.00s)
    --- PASS: TestMySqlProtocol/query (0.00s)
=== RUN   TestRedisProtocol
=== RUN   TestRedisProtocol/get
--- PASS: TestRedisProtocol (0.00s)
    --- PASS: TestRedisProtocol/get (0.00s)
=== RUN   TestDnsProtocol
=== RUN   TestDnsProtocol/multi
    network_analyzer_test.go:258: [Check request_sent_time] want=5000, got=405000
    network_analyzer_test.go:258: [Check waiting_ttfb_time] want=970000, got=570000
    network_analyzer_test.go:258: [Check content_download_time] want=30000, got=530000
    network_analyzer_test.go:258: [Check request_total_time] want=1005000, got=1505000
    network_analyzer_test.go:258: [Check request_io] want=42, got=84
    network_analyzer_test.go:258: [Check response_io] want=89, got=162
    network_analyzer_test.go:258: [Check request_sent_time] want=4000, got=405000
    network_analyzer_test.go:258: [Check waiting_ttfb_time] want=1020000, got=570000
    network_analyzer_test.go:258: [Check content_download_time] want=80000, got=530000
    network_analyzer_test.go:258: [Check request_total_time] want=1104000, got=1505000
    network_analyzer_test.go:258: [Check request_io] want=42, got=84
    network_analyzer_test.go:258: [Check response_io] want=73, got=162
--- FAIL: TestDnsProtocol (0.00s)
    --- FAIL: TestDnsProtocol/multi (0.00s)
=== RUN   TestKafkaProtocol
=== RUN   TestKafkaProtocol/produce-split
=== RUN   TestKafkaProtocol/fetch-split
--- PASS: TestKafkaProtocol (0.00s)
    --- PASS: TestKafkaProtocol/produce-split (0.00s)
    --- PASS: TestKafkaProtocol/fetch-split (0.00s)
FAIL
exit status 1
FAIL    github.com/Kindling-project/kindling/collector/analyzer/network 0.037s

How to reproduce?
Just run go test -v under the directory of collector/analyzer/network/.

Environment (please complete the following information)
Based on the latest commit 4e6c9d9.

What causes this issue
This is caused by typos in the function getRecord(). This function is used to get the final GaugeGroup from a single messagePair instead of the whole messagePairs. It is different from the function getRecords(), which does that from messagePairs. The difference exists because getRecord() is used for protocols, like DNS, which can send multiple real requests before receiving the responses, while getRecords() is used for protocols, like HTTP, which can only send one real request before receiving the response. See the following diagrams for details.

The protocols, like DNS, could send multiple real requests before receiving the responses.
image

While the protocols, like HTTP, only send one real request before receiving a response.
image

But in our data structure messagePairs, there can be multiple request events and multiple response events in one messagePairs. For HTTP, that means one real request split into multiple parts, while for DNS it means multiple real requests.
image

Therefore, the difference between DNS and HTTP requires different parsing methods, getRecord and getRecords. This issue was caused by a misunderstanding of what getRecord does.
