microsoft / retina
eBPF distributed networking observability tool for Kubernetes
Home Page: https://retina.sh
License: MIT License
kubectl-retina is currently distributed as a binary in the release artifacts.
It would significantly improve the UX to distribute it via Krew.
Describe the bug
On this large-scale cluster, all latencies are in the +Inf category. Discussed increasing the max bucket so that we have better information when latency is larger than 4.5ms (the current highest bucket).
Discussed making the bucket width 1ms, starting at 0.5ms, so that we don't get a lot of 0ms counts.
Is your feature request related to a problem? Please describe.
Images pushed to GHCR are not signed.
Describe the solution you'd like
Images pushed to GHCR should be signed to verify integrity and establish chain of trust.
Additional context
GitHub recommends https://github.com/sigstore/cosign-installer
per https://github.blog/2021-12-06-safeguard-container-signing-capability-actions/
so this does not seem like it would be very complicated to enable.
Open to alternatives from anyone with experience signing images in GHA.
Describe the bug
I may be doing something wrong here, but the capture isn't uploaded to the SAS URL. I followed the documentation, which seems a little limited in its description.
The messages are "Failed to validate blob url" and "Failed to output network traffic", related to `net/url: invalid control character in URL`.
I tried many things; below are the anonymised logs.
ts=2024-03-24T11:28:37.948Z level=error caller=outputlocation/blob.go:55 msg="Failed to validate blob url" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-agentpool-35551448-vmss000000 podname=my-first-capture-99wqd-wxpx5 error="parse \"https://1234cap.blob.core.windows.net/captures?sp=racwdli&st=2024-03-24T11:25:58Z&se=2024-03-24T19:25:58Z&spr=https&sv=2022-11-02&sr=c&sig=0vksBBdje4XlXxxjOJdztOZN%2FTfiMWf16D53VxyzPHs%3D\\n\": net/url: invalid control character in URL"
ts=2024-03-24T11:28:37.948Z level=error caller=captureworkload/main.go:57 msg="Failed to output network traffic" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-agentpool-35551448-vmss000000 podname=my-first-capture-99wqd-wxpx5 error="location \"BlobUpload\" output error: parse \"https://1234cap.blob.core.windows.net/captures?sp=racwdli&st=2024-03-24T11:25:58Z&se=2024-03-24T19:25:58Z&spr=https&sv=2022-11-02&sr=c&sig=0vksBBdje4XlXxxjOJdztOZN%!F(MISSING)TfiMgf16D53VxyzPHs%!D(MISSING)\\n\": net/url: invalid control character in URL\n"
To Reproduce
Steps to reproduce the behavior:
kubectl create secret generic capture-blob-storage --from-file=blob-upload-url=./blob-upload-url.txt
# Getting the first available node
if [[ -z $1 ]]; then
target=`kubectl get nodes -o 'jsonpath={.items[0].metadata.name}'`
else
target=$1
fi
cat <<EOF | kubectl create -f -
apiVersion: retina.sh/v1alpha1
kind: Capture
metadata:
  name: my-first-capture
spec:
  captureConfiguration:
    captureOption:
      duration: 30s
    captureTarget:
      nodeSelector:
        matchLabels:
          kubernetes.io/hostname: ${target}
  outputConfiguration:
    hostPath: "/tmp/retina"
    blobUpload: capture-blob-storage
EOF
Expected behavior
Upload and store my capture file.
Today Retina only watches for events from either a tc program or some drop-reason kprobes; Retina should watch for Unix domain socket events as well. This will need additional work to understand how to distinguish the source and destination pod/container/process.
For starters, attach to the kprobes below:
kprobe/unix_stream_sendmsg
kprobe/unix_dgram_sendmsg
fentry/unix_stream_sendmsg
fentry/unix_dgram_sendmsg
Example:
https://github.com/Asphaltt/sockdump
Describe the bug
Screenshots on https://retina.sh/docs/troubleshooting/basic-metrics refer to "kappie." These should be redone to reference the project's actual name.
Contributing documentation needs to be updated to explain that this project requires:
Describe the bug
Can't build the binary as described in the documentation. The `install-kubectl-retina` target seems to be missing from the Makefile.
To Reproduce
Steps to reproduce the behavior:
git clone https://github.com/microsoft/retina.git
make install-kubectl-retina
make: *** No rule to make target 'install-kubectl-retina'. Stop.
Expected behavior
Should build the main repo binary.
Is your feature request related to a problem? Please describe.
Retina currently lacks sufficient options to control how many events we generate from the plugins. This impacts the scale at which Retina can operate.
Describe the solution you'd like
List of ways to reduce events:
All Drops, All DNS, All TCP/UDP for annotated NS/pods (both at stack and network)
Describe the bug
# make retina-binary
package command-line-arguments
imports github.com/microsoft/retina/pkg/plugin/packetforward
imports github.com/microsoft/retina/pkg/plugin/packetforward/_cprog: C source files not allowed when not using cgo or SWIG: packetforward.c
package command-line-arguments
imports github.com/microsoft/retina/pkg/plugin/packetforward
imports github.com/microsoft/retina/pkg/plugin/packetforward/_cprog: C source files not allowed when not using cgo or SWIG: packetforward.c
prog.go:12:2: no required module provides package github.com/golang/mock/mockgen/model: go.mod file not found in current directory or any parent directory; see 'go help modules'
prog.go:14:2: no required module provides package github.com/microsoft/retina/pkg/plugin/packetforward: go.mod file not found in current directory or any parent directory; see 'go help modules'
2024/03/23 12:35:28 Loading input failed: exit status 1
exit status 1
pkg/plugin/packetforward/types_linux.go:31: running "go": exit status 1
To Reproduce
Just run `make retina-binary` on the main branch.
Expected behavior
An output directory is created and the retina binary builds successfully.
Is your feature request related to a problem? Please describe.
We currently track DNS request/response. Add support to measure DNS drops as well.
Maybe:
create a new plugin that attaches itself to specific kernel hook points, or
keep track of DNS requests in user space using a TTL cache
make helm-install-advanced-local-context
Logs:
ts=2024-03-21T20:58:50.234Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=16 hostname=backstage-worker podname=retina-agent-88dzr version=v0.0.1 apiserver=https://10.96.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 138 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc003192120?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc00318e020?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0031941a0, {0xc003190380, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000493640?, {0x2b48afa?, 0x0?}, {0xc003190380, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000d01cc0, {0x2f057d0?, 0xc000836320?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825
Describe the bug
I would like to install Retina with 'Basic Mode (with Capture support)'. However, the option is not defined within the Makefile.
https://retina.sh/docs/installation/setup
Basic Mode (with Capture support)
make helm-install-with-operator
Expected behavior
A successful make.
Describe the bug
While setting up Retina in our K8s infra, I am facing the below error.
ts=2024-03-23T16:55:10.369Z level=info caller=server/server.go:79 msg="gracefully shutting down HTTP server..." goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward
ts=2024-03-23T16:55:10.369Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward
ts=2024-03-23T16:55:10.369Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward error="failed to reconcile plugin packetforward: fork/exec /bin/clang: no such file or directory" errorVerbose="fork/exec /bin/clang: no such file or directory\nfailed to reconcile plugin packetforward\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:169\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 102 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x1e8
panic({0x1eb6b00?, 0x4000334180?})
/usr/local/go/src/runtime/panic.go:914 +0x218
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x1?, {0x4000aaace8?, 0x0?, 0x0?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40002341a0, {0x400007ff00, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x2c0
go.uber.org/zap.(*Logger).Panic(0x40005f2080?, {0x25cf876?, 0x0?}, {0x400007ff00, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x54
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0x4000cbf450, {0x298b488?, 0x400046f450?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x22c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2190
I am not running any operator; for now I am running only the DaemonSet, and in the ConfigMap I am adding plugins:
config.yaml: |-
  apiServer:
    host: 0.0.0.0
    port: 10093
  logLevel: debug
  enabledPlugin: ["packetforward"]
  metricsInterval: 10
  enableTelemetry: false
  enablePodLevel: false
  remoteContext: false
  enableAnnotations: false
Describe the bug
`networkobservability_windows_hns_stats` is a copy of `networkobservability_forward_count`:
# HELP networkobservability_forward_count Total forwarded packets
# TYPE networkobservability_forward_count gauge
networkobservability_forward_count{direction="egress"} 176730
networkobservability_forward_count{direction="ingress"} 520660
# HELP networkobservability_windows_hns_stats Include many different metrics from packets sent/received to closed connections
# TYPE networkobservability_windows_hns_stats gauge
networkobservability_windows_hns_stats{direction="win_packets_recv_count"} 520660
networkobservability_windows_hns_stats{direction="win_packets_sent_count"} 176730
If my understanding is correct, these labels seem irrelevant to these metrics.
Seems like DNS request metrics don't need `num_response`, `response`, or `return_code`.
networkobservability_dns_request_count{num_response="0",query="wpad.svc.cluster.local.",query_type="A",response="",return_code=""} 354
Remove the `count` label from `adv_node_apiserver_no_response`. The only possible time series for this metric right now is:
adv_node_apiserver_no_response{count="no_response"}
See latency.go
Is your feature request related to a problem? Please describe.
Currently the AzBlob OutputLocation is provided via a URL with an embedded SAS token. This URL is used as the Volume name in the Capture Pods that are created, which conveniently distributes the access credentials.
This is technically a credential leak: the SAS token is readable to anyone with `Pod:read` instead of `Secret:read`.
Describe the solution you'd like
The AzBlob config and secrets should be provided via the corresponding Kubernetes objects (ConfigMaps and Secrets).
Additional context
The rework done here should be portable to other (future) OutputLocation implementations, such as S3 (#201).
Describe the bug
make helm-install-advanced-remote-context
To Reproduce
Steps to reproduce the behavior:
make helm-install-advanced-remote-context
on the cluster.
ts=2024-03-22T06:49:25.754Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-nodepool###redacted###003 podname=retina-agent-m2tll version=v0.0.1 apiserver=https://myakscluster-###redacted###.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start HTTP server: context canceled" errorVerbose="context canceled\nfailed to start HTTP server\ngithub.com/microsoft/retina/pkg/server.(*Server).Start\n\t/go/src/github.com/microsoft/retina/pkg/server/server.go:91\ngithub.com/microsoft/retina/pkg/managers/servermanager.(*HTTPServer).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/servermanager/servermanager.go:43\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func2\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:111\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 87 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc000d2e6e0?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc000775540?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0007d3380, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000bb91c0?, {0x2b48afa?, 0x0?}, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000a637c0, {0x2f057d0?, 0xc000447f90?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825
Is your feature request related to a problem? Please describe.
We would love to expose some pod metadata in the metrics reported by Retina pods, but we are limited to working with the (IP, port, direction, pod_name) set. Can we make it possible to specify any pod label name in `additionalLabels` to instruct Retina to populate that label as a metric label?
Describe the solution you'd like
For example, if my pod has a label `importantData: 123` that I want to report alongside the pod name, I would add a `customLabel_importantData` entry under the `additionalLabels` key in the MetricsConfiguration CRD and would in return start receiving Prometheus metrics annotated with `{importantData="123"}` labels.
Describe alternatives you've considered
N/A
Additional context
Sometimes a pod name does not contain all information needed to attribute a time series to a source.
Describe the bug
installation commands: make helm-install-with-operator
retina-agent pod status as follows:
# k -n kube-system get pods retina-agent-5lwhj
NAME READY STATUS RESTARTS AGE
retina-agent-5lwhj 0/1 Init:CrashLoopBackOff 6 (3m37s ago) 11m
The init-retina container logs are:
# k -n kube-system logs retina-agent-5lwhj init-retina
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T01:19:37.004Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T01:19:37.005Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
panic: Failed to initialize filter map
goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000219130?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc00013bb60?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00024e0d0, {0xc00023a9c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00023a9c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000593ec8)
/go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
/go/src/github.com/microsoft/retina/init/retina/main_linux.go:33 +0x214
Expected behavior
retina-agent pod status is normal.
There are usages of unsafe which could be made safe.
retina/pkg/utils/utils_linux.go
Line 41 in 078d15b
may be refactored (without unsafe) as:
func htons(i uint16) uint16 {
	// Write i in network (big-endian) byte order...
	b := make([]byte, 2)
	binary.BigEndian.PutUint16(b, i)
	// ...and read the bytes back in host order (binary.NativeEndian,
	// Go 1.21+), which swaps on little-endian hosts. Reading back with
	// BigEndian, as originally suggested, would be a no-op.
	return binary.NativeEndian.Uint16(b)
}
Is your feature request related to a problem? Please describe.
Metric | Windows | Linux |
---|---|---|
forward_count (packets) | ✔️ | ✔️ |
forward_bytes | ✔️ | ✔️ |
drop_count (packets) | ✔️ | ✔️ |
drop_bytes | ❌ | ✔️ |
Is it possible to get packet size for drops like we do forwards?
From HN thread:
Speaking of observability tools. Anybody here know how to gather more in-depth metrics on mTLS requests? Have an internal (self signed) CA and just want to know which issued certs are presented to nodes. Would be nice to get cert serial number and other metadata as well
This is an interesting ask that can help with basic visibility into mTLS issues. Explore how this can be solved and which TLS functions eBPF can hook into to solve it.
2023-11-01T16:51:15.739Z info hnsstats hnsstats/hnsstats_windows.go:138 emitting label win_bytes_recv_count for value 866111686
2023-11-01T16:51:25.464Z fatal main controller/main.go:284 unable to start manager{error 26 0 failed to wait for metricsconfiguration caches to sync: timed out waiting for cache to be synced}
Describe the bug
With the https://github.com/microsoft/retina/releases/download/v0.0.2/kubectl-retina-darwin-arm64-v0.0.2.tar.gz binary, the `kubectl-retina version` command returns "undefined".
This doesn't happen if I run the CLI directly from the branch (e.g. `go run`).
It also might be related to the fact that when running `kubectl-retina capture`, the created job uses an incorrect image (`undefined` tag), e.g. ghcr.io/microsoft/retina/retina-agent:undefined
Describe the bug
The `retina-operator` pod shows up as unhealthy in the Prometheus targets list.
To Reproduce
Using an AKS cluster with 2 nodes.
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
--version v0.0.2 \
--namespace kube-system \
--set image.tag=v0.0.2 \
--set operator.tag=v0.0.2 \
--set image.pullPolicy=Always \
--set logLevel=info \
--set os.windows=true \
--set operator.enabled=true \
--set operator.enableRetinaEndpoint=true \
--skip-crds \
--set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
--set enablePodLevel=true \
--set enableAnnotations=true
helm install prometheus -n kube-system -f deploy/prometheus/values.yaml prometheus-community/kube-prometheus-stack
localhost:9090/targets
Expected behavior
`retina-pods` should be all green. The retina operator pod either shouldn't be included in the targets list, or the endpoint/port should be fixed (in case the operator is serving metrics as well). (Note that the operator pod is up and running.)
Actual behavior
The two retina agent pods are up and running; however, the `retina-operator` pod shows up in red (unhealthy).
Is your feature request related to a problem? Please describe.
Network issues are seen during peak hours, and running a packet capture at a specific time is not easy, as it is sometimes the middle of the night for us.
Having the ability to schedule TCP packet captures would be a huge help.
Describe the bug
The latest release is currently v0.0.2, but `make helm-install` with v0.0.2 code will install the v0.0.1 image.
To Reproduce
Steps to reproduce the behavior:
$ git clone https://github.com/microsoft/retina.git
$ make helm-install
$ k get ds -n kube-system retina-agent -o yaml | grep image:
image: ghcr.io/microsoft/retina/retina-agent:v0.0.1
- image: ghcr.io/microsoft/retina/retina-init:v0.0.1
Expected behavior
Install latest image of current repo.
Platform (please complete the following information):
I don't think this information is needed for this issue.
Describe the bug
Consider the following scenario:
Server is running in a pod on node-1
We annotate the server to observe dropped packets
Apply a network policy for ingress
When a pod on other nodes tries to connect to the server, the IPTABLES rule on that node will drop the packet. However, for local context, our filter_map won't have the IP of the server, hence we will not generate any event. Thus, we will see drop_count increase at the node level, but no pod-level labels for the drops.
This is true for external connections as well.
This is due to how Azure NPM works today. May need to add disclaimer to account for this behavior.
When removing a namespace annotation, the corresponding IP is not removed from the filtermap, leading to the continuous generation of metrics.
Upon removal of the namespace annotation, the associated IP should be removed from the filtermap, and metrics generation should cease.
Metrics continue to be generated after removing the namespace annotation. Reconciliation has been observed in the namespace controller, with no apparent errors.
Here is some logs of an automated test. Manual test on single ns or pod should produce same results.
Annotated a ns
Annotating namespace {"namespace": "test-drops-annotation-metrics-1696500004"}
Confirmed it was annotated
Annotated namespaces before removal {"annotatedns": [{"metadata":{"name":"test-drops-annotation-metrics-1696500004","uid":"643a065c-3915-4b2d-9636-a2e8f624ff6c","resourceVersion":"5093791","creationTimestamp":"2023-10-06T21:23:08Z","labels":{"e2e":"true","kubernetes.io/metadata.name":"test-drops-annotation-metrics-1696500004"},"annotations":{"retina.io/v1alpha1":"observe"},"managedFields":[{"manager":"dropreason.test","operation":"Update","apiVersion":"v1","time":"2023-10-06T21:23:27Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:retina.io/v1alpha1":{}},"f:labels":{".":{},"f:e2e":{},"f:kubernetes.io/metadata.name":{}}}}}]},"spec":{"finalizers":["kubernetes"]},"status":{"phase":"Active"}}]}
Removed annotation and confirmed it was removed.
Annotated namespaces after removal {"annotatedns": null}
Metrics still being generated.
drop packet {"labels": {"Metric":"networkobservability_adv_drop_count","Labels":["direction","egress","reason","IPTABLE_RULE_DROP","ip","","namespace","test-drops-annotation-metrics-","podname","client","workloadKind","","workloadName",""]}, "value": 33}
After removing the annotation, the log from the namespace reconciler is present, showing that the annotation has been removed.
2023-10-06T21:23:47.630Z info NamespaceReconciler namespace/namespace_controller.go:60 Namespace does not have annotation {"namespace": "test-drops-annotation-metrics-1696500004", "annotations": null}
Metrics are still being generated:
2023-10-06T21:23:47.860Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
2023-10-06T21:23:47.860Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
2023-10-06T21:23:47.862Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
The cache finds the pod IP and enriches it:
Cache cache/cache.go:155 pod found for IP {"ip": "10.224.0.32", "pod Name": "test-drops-annotation-metrics-1696500004/client"}
2023-10-06T21:23:53.403Z debug enricher enricher/enricher.go:132 enriched flow {"flow": "time:{seconds:965422847940441} verdict:DROPPED IP:{source:\"10.224.0.32\" destination:\"10.224.0.62\" ipVersion:IPv4} l4:{TCP:{source_port:61582 destination_port:20480}} source:{namespace:\"test-drops-annotation-metrics-1696500004\" labels:\"pod=client\" pod_name:\"client\"} traffic_direction:INGRESS trace_observation_point:TO_HOST extensions:{[type.googleapis.com/utils.RetinaMetadata]:{bytes:60}}"}
2023-10-06T21:23:53.403Z debug MetricModule metrics/forward.go:160 forward count metric in EGRESS in local ctx {"labels": ["egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
Describe the bug
When I follow the installation setup for basic mode.
To Reproduce
Clone the repo and run make helm-install
Expected behavior
No error to create the package of helm
Platform (please complete the following information):
Additional context
The error:
seilor retina main ≡ ~2 make helm-install ﳑ in bash at 11:31:08
cd crd && make manifests && make generate
make[1]: Entering directory '/home/seilor/retina/crd'
make[2]: Entering directory '/home/seilor/retina'
cd /home/seilor/retina/hack/tools; go mod download; go build -tags=tools -o bin/controller-gen sigs.k8s.io/controller-tools/cmd/controller-gen
build sigs.k8s.io/controller-tools/cmd/controller-gen: cannot load io/fs: malformed module path "io/fs": missing dot in first path element
make[2]: *** [Makefile:90: /home/seilor/retina/hack/tools/bin/controller-gen] Error 1
make[2]: Leaving directory '/home/seilor/retina'
make[1]: *** [Makefile:21: /home/seilor/retina/hack/tools/bin/controller-gen] Error 2
make[1]: Leaving directory '/home/seilor/retina/crd'
make: *** [Makefile:391: manifests] Error 2
Retina has CAP_NET_ADMIN, SYS_ADMIN, and others.
Evaluate the caps and make sure we are adding the minimum required permissions.
If `enableAnnotations=false`, then advanced metrics won't show up unless a MetricConfiguration CRD exists (metric modules are initialized via metricModule's `Reconcile()`). We should initialize with default context options instead.
Also, should we support the MetricConfiguration CRD when `enableAnnotations=true`? Right now, we do not:
Lines 254 to 268 in 9248f0d
Windows CodeQL runs take almost 10x longer than the Linux variant and are the biggest chunk of CI time by far.
Why? Can this be improved?
Duplicate import -
retina/pkg/module/metrics/latency.go
Line 15 in debc188
Is your feature request related to a problem? Please describe.
Currently, setting up an installation in a cluster involves several steps to obtain the Helm manifests. This process includes acquiring various tools locally and executing multiple commands, leading to complexity and potential errors.
Describe the solution you'd like
I envision a straightforward installation process for Retina. Simplifying the setup to just a few steps would greatly enhance user experience.
Describe alternatives you've considered
One feasible alternative is to establish a dedicated repository for Retina manifests. By adding this repository to Helm, users could easily access and install Retina with minimal effort:
helm repo add retinarepository
helm install retina
Additional context
Implementing this solution could significantly reduce the occurrence of issues that need to be investigated, streamlining the deployment process for Retina.
Describe the bug
Only the Clusters dashboard is included in the https://grafana.com/grafana/dashboards/18814-kubernetes-networking-clusters/ dashboard mentioned in the docs.
To Reproduce
Install Retina and import the Grafana dashboard.
Expected behavior
DNS, Pods and Clusters dashboard should be imported.
Actual behavior
Only the Clusters dashboard is included.
Platform (please complete the following information):
Fix the type for Drop.
Ref:
retina/pkg/utils/flow_utils.go
Line 106 in 078d15b
Also, fix the dropreason number - https://github.com/cilium/cilium/blob/d13b89dc5d91b674272ded11104372e16fe937aa/api/v1/flow/flow.pb.go#L430
Describe the bug
The metrics in the pod-level.json file (/deploy/grafana/dashboards) are prefixed with retina, and none of the graphs display values. The actual metrics from Prometheus are prefixed with networkobservability.
For example:
"sum(irate(**retina**_adv_forward_count{source_podname=~\"$pod\"}[1m]))",
Is your feature request related to a problem? Please describe.
In some environments, blob storage can be difficult to use. It would be nice to support s3 upload for capture.
Describe the solution you'd like
Add an s3Upload spec to the following existing outputconfiguration.
spec.outputConfiguration: Indicates where the captured data will be stored. It includes the following properties:
blobUpload: Specifies a secret containing the blob SAS URL for storing the capture data.
hostPath: Stores the capture files into the specified host filesystem.
persistentVolumeClaim: Mounts a PersistentVolumeClaim into the Pod to store capture files.
s3Upload: Specifies an S3 upload URL for storing the capture data.
The s3Upload spec might require the following additional fields:
spec:
outputConfiguration:
s3Upload:
endpoint: {{ s3-url }}
bucket: {{ bucket_name }}
accessKey: {{ access_key }}
secretKey: {{ secret_key }}
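Put together, a full Capture manifest using the proposed field might look like the sketch below. The s3Upload fields are the proposal above, not an implemented API, and all values (endpoint, bucket, duration) are illustrative:

```yaml
apiVersion: retina.sh/v1alpha1
kind: Capture
metadata:
  name: example-capture
spec:
  captureConfiguration:
    captureOption:
      duration: 30s               # illustrative capture duration
  outputConfiguration:
    s3Upload:                     # proposed field, not yet implemented
      endpoint: https://s3.us-east-1.amazonaws.com   # hypothetical endpoint
      bucket: my-captures                            # hypothetical bucket
      accessKey: <access-key>     # in practice these should come from a Secret
      secretKey: <secret-key>
```

In a real design the credentials would likely be referenced via a Kubernetes Secret, mirroring how blobUpload references a secret containing the SAS URL.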
Describe alternatives you've considered
Additional context
If this feature makes sense and I can be assigned to it, I'd like to work on implementing it.
Describe the bug
Currently, Retina checks host endianness when enriching port/IP addresses. That is not correct: network byte order is always big-endian. Handle the conversion in BPF code.
To Reproduce
Ref:
retina/pkg/utils/utils_linux.go
Line 78 in 61e1152
Expected behavior
Network byte order is big-endian; the conversion should be handled accordingly in the BPF code.
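Since the wire format is always big-endian, raw packet bytes can be interpreted with encoding/binary's BigEndian directly, with no host-endianness check at all. A minimal Go sketch (the helper names and slices are illustrative, not Retina's actual code):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// parsePort interprets the raw two bytes of a TCP/UDP port field as
// network (big-endian) byte order. No host-endianness check is needed:
// the wire format is big-endian regardless of the host CPU.
func parsePort(raw []byte) uint16 {
	return binary.BigEndian.Uint16(raw)
}

// parseIPv4 interprets four raw bytes from a packet as an IPv4 address,
// in the order they appear on the wire.
func parseIPv4(raw []byte) netip.Addr {
	var b [4]byte
	copy(b[:], raw)
	return netip.AddrFrom4(b)
}

func main() {
	// Bytes as they appear on the wire for port 443 and 10.224.0.32.
	fmt.Println(parsePort([]byte{0x01, 0xBB}))     // 443
	fmt.Println(parseIPv4([]byte{10, 224, 0, 32})) // 10.224.0.32
}
```

The same idea applies in BPF code: treat multi-byte header fields as big-endian at the point of parsing (e.g. with bpf_ntohs-style helpers) so userspace never has to guess.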
Platform (please complete the following information):
Describe the bug
Error: failed to start container "init-retina": Error response from daemon: path /sys/fs/bpf is mounted on /sys but it is not a shared mount
Getting the above error when I tried to install on a Mac.
Mac details
Darwin XXX 22.6.0 Darwin Kernel Version 22.6.0: Mon Feb 19 19:45:09 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6000 arm64
To Reproduce
Install on Docker Desktop:
VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
--version $VERSION \
--namespace kube-system \
--set image.tag=$VERSION \
--set operator.tag=$VERSION \
--set image.pullPolicy=Always \
--set logLevel=info \
--set os.windows=true \
--set operator.enabled=true \
--set operator.enableRetinaEndpoint=true \
--skip-crds \
--set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
--set enablePodLevel=true \
--set enableAnnotations=true
After that the target is not up
Is your feature request related to a problem? Please describe.
Currently, we are unable to run our e2e tests on PRs opened from forks because GitHub does not share secrets with forks, for security reasons. For this reason, we only run e2e when a PR is added to the merge queue.
Our current implementation uses AKS for our e2e, but it can be implemented in any Kubernetes cluster.
Describe the solution you'd like
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
Recently, we saw an issue where, on a particular node, packets were dropped on a particular CPU. However, kubectl top did not show the CPU running hot (because only a few CPUs out of 32 were overburdened). Add the CPU number to the flow extension for drops.
cc: @anubhabMajumdar
Describe the bug
Currently, the timestamps of Packetparser- and Dropreason-generated flows are wrong and not current. They need to be set in userspace rather than parsed from BPF events.
Ref:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Timestamp should be current.
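One way to do this is to stamp wall-clock time in userspace at the moment the event is read from the ring buffer, since the kernel-side bpf_ktime_get_ns() value is nanoseconds since boot, not wall-clock time. A hedged Go sketch (the event struct is illustrative, not Retina's actual type):

```go
package main

import (
	"fmt"
	"time"
)

// event mirrors a parsed BPF ring-buffer record. KtimeNs is the raw
// kernel timestamp (monotonic, ns since boot); ReadTime is the
// wall-clock time assigned in userspace when the record is consumed.
type event struct {
	KtimeNs  uint64
	ReadTime time.Time
}

// stampEvent attaches the current wall-clock time to a freshly read
// BPF event, so downstream flows carry a correct timestamp.
func stampEvent(ktimeNs uint64) event {
	return event{KtimeNs: ktimeNs, ReadTime: time.Now()}
}

func main() {
	e := stampEvent(123456789)
	fmt.Println(e.ReadTime.IsZero()) // false: stamped at read time
}
```

The kernel ktime can still be kept alongside for ordering events within a batch; only the flow's user-visible timestamp needs to come from userspace.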
Platform (please complete the following information):
Is your feature request related to a problem? Please describe.
We are using Retina's packetparser
plugin to collect information about packets. We love the fact that under the hood, packets are treated as "events" that are sent to userspace and are annotated with Kubernetes information in enricher
. However, the following step of using the forward
(or some other) module does not work well for us – we don't want to expose and collect Prometheus metrics and instead want to continue treating packets as "events" and insert them into a ClickHouse instance with a SQL query.
Retina provides wonderful infrastructure to capture and annotate packet data, but the data ingestion pipeline, which is typically the part you need to customise the most, can't be modified without forking Retina, unless I am missing something obvious.
Would you consider making it possible to create your own modules with custom ProcessFlow
implementations?
Describe the solution you'd like
Make it easy to write a custom module that implements the AdvMetricsInterface
interface and is loaded into Retina.
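A minimal sketch of what a pluggable module could look like. The interface shape below is an assumption for illustration (the real AdvMetricsInterface and flow types live in Retina's code); it only shows the idea of a custom ProcessFlow implementation that batches events for a store like ClickHouse instead of exporting Prometheus metrics:

```go
package main

import "fmt"

// flow is a stand-in for Retina's enriched flow type (the real one is
// the Hubble/Cilium flow protobuf); used here only for illustration.
type flow struct {
	SrcIP, DstIP string
	Verdict      string
}

// flowProcessor is an assumed module shape: anything that can consume
// enriched flows from the enricher.
type flowProcessor interface {
	ProcessFlow(f *flow)
}

// clickhouseExporter sketches a custom module that batches flows for
// a SQL INSERT instead of incrementing Prometheus counters.
type clickhouseExporter struct {
	batch []*flow
}

func (c *clickhouseExporter) ProcessFlow(f *flow) {
	// Real code would flush batches to ClickHouse on size/time triggers.
	c.batch = append(c.batch, f)
}

func main() {
	var p flowProcessor = &clickhouseExporter{}
	p.ProcessFlow(&flow{SrcIP: "10.224.0.32", DstIP: "10.224.0.62", Verdict: "DROPPED"})
	fmt.Println("batched flows:", len(p.(*clickhouseExporter).batch))
}
```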
Describe alternatives you've considered
N/A
Additional context
Please let me know if there is a simpler way to write a custom data exporter. If there isn't, let me know what a good solution would be, and I'd be happy to contribute to the project.
Is your feature request related to a problem? Please describe.
There are 3 pipelines that build the images and use the same GitHub Actions code.
Describe the solution you'd like
A good solution will be to create a template that will reuse the code and pass parameters based on what is needed in each pipeline.
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
Add support for reading InfiniBand port counters and stats from Mellanox (mlx) drivers.
We can extend the Linux util plugin to read port counters and status parameters located under /sys/class/infiniband/ and /sys/class/net, and expose them as metrics.
Ideally we should introduce a new plugin to read and convert these metrics.
Some references:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_infiniband_and_rdma_networks/understanding-infiniband-and-rdma_configuring-infiniband-and-rdma-networks
https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
Is your feature request related to a problem? Please describe.
Currently, Retina only enriches flows based on the primary IP of the Pod, since we only add the primary IP to the ipToEpKey map in the cache. We should support adding all IPs so we can look up the endpoint for secondary IPs as well.
retina/pkg/controllers/cache/cache.go
Lines 204 to 208 in 5cfb7ef
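Conceptually the fix is to index every Pod IP to the same endpoint key, not just the primary one. A simplified sketch (the key type and helper are stand-ins for illustration, not the real cache API):

```go
package main

import "fmt"

// epKey identifies an endpoint; the real cache uses its own key type,
// so this struct is a stand-in for illustration.
type epKey struct{ Namespace, Name string }

// addPodIPs maps a Pod's primary and secondary IPs to the same endpoint
// key, so flows carrying any of the Pod's IPs can be enriched.
func addPodIPs(ipToEpKey map[string]epKey, key epKey, primaryIP string, secondaryIPs []string) {
	ipToEpKey[primaryIP] = key
	for _, ip := range secondaryIPs {
		ipToEpKey[ip] = key
	}
}

func main() {
	m := map[string]epKey{}
	addPodIPs(m, epKey{"default", "client"}, "10.224.0.32", []string{"10.224.0.62"})
	fmt.Println(len(m)) // 2: both IPs resolve to the same endpoint
}
```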