microsoft / retina
eBPF distributed networking observability tool for Kubernetes
Home Page: https://retina.sh
License: MIT License
kubectl-retina is currently distributed as a binary in the release artifacts.
It would significantly improve the UX to distribute it via Krew.
Describe the bug
On this large-scale cluster, all latencies are in the +Inf category. Discussed increasing the max bucket so that we have better information when latency is larger than 4.5ms (the current highest bucket).
Discussed making the bucket width 1ms, starting at 0.5ms, so that we don't get a lot of 0ms counts.
Is your feature request related to a problem? Please describe.
Images pushed to GHCR are not signed.
Describe the solution you'd like
Images pushed to GHCR should be signed to verify integrity and establish chain of trust.
Additional context
GitHub recommends https://github.com/sigstore/cosign-installer
per https://github.blog/2021-12-06-safeguard-container-signing-capability-actions/
so this does not seem like it would be very complicated to enable.
Open to alternatives from anyone with experience signing images in GHA.
Describe the bug
I may be doing something wrong here, but the capture isn't uploaded to the SAS URL. I followed the documentation, which seems a little limited in its description.
The messages are "Failed to validate blob url" and "Failed to output network traffic", related to `net/url: invalid control character in URL`.
I tried many things; below are the anonymised logs.
ts=2024-03-24T11:28:37.948Z level=error caller=outputlocation/blob.go:55 msg="Failed to validate blob url" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-agentpool-35551448-vmss000000 podname=my-first-capture-99wqd-wxpx5 error="parse \"https://1234cap.blob.core.windows.net/captures?sp=racwdli&st=2024-03-24T11:25:58Z&se=2024-03-24T19:25:58Z&spr=https&sv=2022-11-02&sr=c&sig=0vksBBdje4XlXxxjOJdztOZN%2FTfiMWf16D53VxyzPHs%3D\\n\": net/url: invalid control character in URL"
ts=2024-03-24T11:28:37.948Z level=error caller=captureworkload/main.go:57 msg="Failed to output network traffic" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-agentpool-35551448-vmss000000 podname=my-first-capture-99wqd-wxpx5 error="location \"BlobUpload\" output error: parse \"https://1234cap.blob.core.windows.net/captures?sp=racwdli&st=2024-03-24T11:25:58Z&se=2024-03-24T19:25:58Z&spr=https&sv=2022-11-02&sr=c&sig=0vksBBdje4XlXxxjOJdztOZN%!F(MISSING)TfiMgf16D53VxyzPHs%!D(MISSING)\\n\": net/url: invalid control character in URL\n"
To Reproduce
Steps to reproduce the behavior:
kubectl create secret generic capture-blob-storage --from-file=blob-upload-url=./blob-upload-url.txt
# Getting the first available node
if [[ -z $1 ]]; then
target=`kubectl get nodes -o 'jsonpath={.items[0].metadata.name}'`
else
target=$1
fi
cat <<EOF | kubectl create -f -
apiVersion: retina.sh/v1alpha1
kind: Capture
metadata:
  name: my-first-capture
spec:
  captureConfiguration:
    captureOption:
      duration: 30s
    captureTarget:
      nodeSelector:
        matchLabels:
          kubernetes.io/hostname: ${target}
  outputConfiguration:
    hostPath: "/tmp/retina"
    blobUpload: capture-blob-storage
EOF
Expected behavior
Upload and store my capture file.
Today Retina only watches for events from either a tc program or some drop-reason kprobes; Retina should watch for Unix domain socket events as well. This will need additional work to understand how to distinguish the source and destination pod/container/process.
For starters, attach to the kprobes below:
kprobe/unix_stream_sendmsg
kprobe/unix_dgram_sendmsg
fentry/unix_stream_sendmsg
fentry/unix_dgram_sendmsg
Example:
https://github.com/Asphaltt/sockdump
Describe the bug
Screenshots on https://retina.sh/docs/troubleshooting/basic-metrics refer to "kappie." These should be redone to reference the project's actual name.
Contributing documentation needs to be updated to explain that this project requires:
Describe the bug
Can't build the binary as described in the documentation. The `install-kubectl-retina` target seems to be missing from the Makefile.
To Reproduce
Steps to reproduce the behavior:
git clone https://github.com/microsoft/retina.git
make install-kubectl-retina
make: *** No rule to make target 'install-kubectl-retina'. Stop.
Expected behavior
Should build the main repo binary.
Is your feature request related to a problem? Please describe.
Retina currently lacks sufficient options to control how many events we generate from the plugins. This impacts the scale at which Retina can operate.
Describe the solution you'd like
List of ways to reduce events:
All Drops, All DNS, All TCP/UDP for annotated NS/pods (both at stack and network)
Describe the bug
# make retina-binary
package command-line-arguments
imports github.com/microsoft/retina/pkg/plugin/packetforward
imports github.com/microsoft/retina/pkg/plugin/packetforward/_cprog: C source files not allowed when not using cgo or SWIG: packetforward.c
package command-line-arguments
imports github.com/microsoft/retina/pkg/plugin/packetforward
imports github.com/microsoft/retina/pkg/plugin/packetforward/_cprog: C source files not allowed when not using cgo or SWIG: packetforward.c
prog.go:12:2: no required module provides package github.com/golang/mock/mockgen/model: go.mod file not found in current directory or any parent directory; see 'go help modules'
prog.go:14:2: no required module provides package github.com/microsoft/retina/pkg/plugin/packetforward: go.mod file not found in current directory or any parent directory; see 'go help modules'
2024/03/23 12:35:28 Loading input failed: exit status 1
exit status 1
pkg/plugin/packetforward/types_linux.go:31: running "go": exit status 1
To Reproduce
Just run `make retina-binary` on the main branch.
Expected behavior
An output directory is created and the retina binary builds successfully.
Is your feature request related to a problem? Please describe.
We currently track DNS request/response. Add support to measure DNS drops as well.
Maybe:
create a new plugin that attaches itself to specific kernel hook points, or
keep track of DNS requests in user space using a TTL cache
make helm-install-advanced-local-context
Logs:
ts=2024-03-21T20:58:50.234Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=16 hostname=backstage-worker podname=retina-agent-88dzr version=v0.0.1 apiserver=https://10.96.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 138 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc003192120?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc00318e020?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0031941a0, {0xc003190380, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000493640?, {0x2b48afa?, 0x0?}, {0xc003190380, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000d01cc0, {0x2f057d0?, 0xc000836320?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825
Describe the bug
I would like to install Retina with 'Basic Mode (with Capture support)'. However, the option is not defined within the Makefile.
https://retina.sh/docs/installation/setup
Basic Mode (with Capture support)
make helm-install-with-operator
Expected behavior
A successful make.
Describe the bug
While setting up Retina in our K8s infra, I am facing the below error.
ts=2024-03-23T16:55:10.369Z level=info caller=server/server.go:79 msg="gracefully shutting down HTTP server..." goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward
ts=2024-03-23T16:55:10.369Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward
ts=2024-03-23T16:55:10.369Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=arm64 numcores=8 hostname=ip-10-149-82-88.ec2.internal podname=retina-agent-4hldd version=v0.0.1 apiserver=https://172.20.0.1:443 plugins=packetforward error="failed to reconcile plugin packetforward: fork/exec /bin/clang: no such file or directory" errorVerbose="fork/exec /bin/clang: no such file or directory\nfailed to reconcile plugin packetforward\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:169\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 102 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x1e8
panic({0x1eb6b00?, 0x4000334180?})
/usr/local/go/src/runtime/panic.go:914 +0x218
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x1?, {0x4000aaace8?, 0x0?, 0x0?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40002341a0, {0x400007ff00, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x2c0
go.uber.org/zap.(*Logger).Panic(0x40005f2080?, {0x25cf876?, 0x0?}, {0x400007ff00, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x54
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0x4000cbf450, {0x298b488?, 0x400046f450?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x22c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2190
I am not running any operator; for now I am running only the DaemonSet, and in the ConfigMap I am adding plugins:
config.yaml: |-
  apiServer:
    host: 0.0.0.0
    port: 10093
  logLevel: debug
  enabledPlugin: ["packetforward"]
  metricsInterval: 10
  enableTelemetry: false
  enablePodLevel: false
  remoteContext: false
  enableAnnotations: false
Describe the bug
`networkobservability_windows_hns_stats` is a copy of `networkobservability_forward_count`:
# HELP networkobservability_forward_count Total forwarded packets
# TYPE networkobservability_forward_count gauge
networkobservability_forward_count{direction="egress"} 176730
networkobservability_forward_count{direction="ingress"} 520660
# HELP networkobservability_windows_hns_stats Include many different metrics from packets sent/received to closed connections
# TYPE networkobservability_windows_hns_stats gauge
networkobservability_windows_hns_stats{direction="win_packets_recv_count"} 520660
networkobservability_windows_hns_stats{direction="win_packets_sent_count"} 176730
If my understanding is correct, these labels seem irrelevant to these metrics.
Seems like DNS request metrics don't need `num_response`, `response`, or `return_code`.
networkobservability_dns_request_count{num_response="0",query="wpad.svc.cluster.local.",query_type="A",response="",return_code=""} 354
Remove the `count` label from `adv_node_apiserver_no_response`. The only possible time series for this metric right now is:
adv_node_apiserver_no_response{count="no_response"}
See latency.go
Is your feature request related to a problem? Please describe.
Currently the AzBlob OutputLocation is provided via a URL with an embedded SAS token. This URL is used as the Volume name in the Capture Pods that are created, which conveniently distributes the access credentials.
This is technically a credential leak: the SAS token is readable to anyone with `Pod:read` instead of `Secret:read`.
Describe the solution you'd like
The AzBlob config and secrets should be provided via the corresponding Kubernetes objects (ConfigMaps and Secrets).
Additional context
The rework done here should be portable to other (future) OutputLocation implementations, such as S3 (#201).
Describe the bug
make helm-install-advanced-remote-context
To Reproduce
Steps to reproduce the behavior:
make helm-install-advanced-remote-context
on the cluster.
ts=2024-03-22T06:49:25.754Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-nodepool###redacted###003 podname=retina-agent-m2tll version=v0.0.1 apiserver=https://myakscluster-###redacted###.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start HTTP server: context canceled" errorVerbose="context canceled\nfailed to start HTTP server\ngithub.com/microsoft/retina/pkg/server.(*Server).Start\n\t/go/src/github.com/microsoft/retina/pkg/server/server.go:91\ngithub.com/microsoft/retina/pkg/managers/servermanager.(*HTTPServer).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/servermanager/servermanager.go:43\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func2\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:111\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 87 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc000d2e6e0?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc000775540?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0007d3380, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000bb91c0?, {0x2b48afa?, 0x0?}, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000a637c0, {0x2f057d0?, 0xc000447f90?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825
Is your feature request related to a problem? Please describe.
We would love to expose some pod metadata in the metrics reported by Retina pods, but we are limited to working with the (IP, port, direction, pod_name) set. Can we make it possible to specify any pod label name in `additionalLabels` to instruct Retina to populate that label as a metric label?
Describe the solution you'd like
For example, if my pod has a label `importantData: 123` that I want to report alongside the pod name, I would add a `customLabel_importantData` entry under the `additionalLabels` key in the MetricsConfiguration CRD and would in return start receiving Prometheus metrics annotated with `{importantData="123"}` labels.
Describe alternatives you've considered
N/A
Additional context
Sometimes a pod name does not contain all information needed to attribute a time series to a source.
Describe the bug
installation commands: make helm-install-with-operator
retina-agent pod status as follows:
# k -n kube-system get pods retina-agent-5lwhj
NAME READY STATUS RESTARTS AGE
retina-agent-5lwhj 0/1 Init:CrashLoopBackOff 6 (3m37s ago) 11m
The init-retina container logs are:
# k -n kube-system logs retina-agent-5lwhj init-retina
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T01:19:37.004Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T01:19:37.005Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
panic: Failed to initialize filter map
goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000219130?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc00013bb60?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00024e0d0, {0xc00023a9c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00023a9c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000593ec8)
/go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
/go/src/github.com/microsoft/retina/init/retina/main_linux.go:33 +0x214
Expected behavior
retina-agent pod status is normal.
There are usages of unsafe which could be made safe.
retina/pkg/utils/utils_linux.go
Line 41 in 078d15b
may be refactored (without unsafe) as:
func htons(i uint16) uint16 {
	// Write i in network (big-endian) byte order...
	b := make([]byte, 2)
	binary.BigEndian.PutUint16(b, i)
	// ...and read the bytes back in host order (binary.NativeEndian,
	// Go 1.21+), which swaps on little-endian hosts. Reading back with
	// BigEndian, as originally suggested, would be a no-op.
	return binary.NativeEndian.Uint16(b)
}
Is your feature request related to a problem? Please describe.
Metric | Windows | Linux |
---|---|---|
forward_count (packets) | ✔️ | ✔️ |
forward_bytes | ✔️ | ✔️ |
drop_count (packets) | ✔️ | ✔️ |
drop_bytes | ❌ | ✔️ |
Is it possible to get packet size for drops like we do forwards?
From HN thread:
Speaking of observability tools. Anybody here know how to gather more in-depth metrics on mTLS requests? Have an internal (self signed) CA and just want to know which issued certs are presented to nodes. Would be nice to get cert serial number and other metadata as well
This is an interesting ask that can help with basic visibility into mTLS issues. Explore how this can be solved and which TLS functions eBPF can hook into to solve it.
2023-11-01T16:51:15.739Z info hnsstats hnsstats/hnsstats_windows.go:138 emitting label win_bytes_recv_count for value 866111686
2023-11-01T16:51:25.464Z fatal main controller/main.go:284 unable to start manager{error 26 0 failed to wait for metricsconfiguration caches to sync: timed out waiting for cache to be synced}
Describe the bug
With the https://github.com/microsoft/retina/releases/download/v0.0.2/kubectl-retina-darwin-arm64-v0.0.2.tar.gz binary, the `kubectl-retina version` command returns "undefined".
This doesn't happen if I run the CLI directly from the branch (e.g. `go run`).
It also might be related to the fact that when running `kubectl-retina capture`, the created job uses an incorrect image (`undefined` tag), e.g. ghcr.io/microsoft/retina/retina-agent:undefined
Describe the bug
The `retina-operator` pod shows up as unhealthy in the Prometheus targets list.
To Reproduce
Using an AKS cluster with 2 nodes.
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
--version v0.0.2 \
--namespace kube-system \
--set image.tag=v0.0.2 \
--set operator.tag=v0.0.2 \
--set image.pullPolicy=Always \
--set logLevel=info \
--set os.windows=true \
--set operator.enabled=true \
--set operator.enableRetinaEndpoint=true \
--skip-crds \
--set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
--set enablePodLevel=true \
--set enableAnnotations=true
helm install prometheus -n kube-system -f deploy/prometheus/values.yaml prometheus-community/kube-prometheus-stack
localhost:9090/targets
Expected behavior
`retina-pods` should be all green. The retina operator pod either shouldn't be included in the targets list, or the endpoint/port should be fixed (in case the operator is serving metrics as well). (Note that the operator pod is up and running.)
Actual behavior
The two retina agent pods are up and running; however, the `retina-operator` pod shows up in red (unhealthy).
Is your feature request related to a problem? Please describe.
Network issues are seen during peak hours, and running a packet capture at a specific time is not easy, as it is sometimes the middle of the night for us.
Having the ability to schedule TCP packet captures would be a huge help.
Describe the bug
The latest release is currently v0.0.2, but `make helm-install` with v0.0.2 code will install the v0.0.1 image.
To Reproduce
Steps to reproduce the behavior:
$ git clone https://github.com/microsoft/retina.git
$ make helm-install
$ k get ds -n kube-system retina-agent -o yaml | grep image:
image: ghcr.io/microsoft/retina/retina-agent:v0.0.1
- image: ghcr.io/microsoft/retina/retina-init:v0.0.1
Expected behavior
Install latest image of current repo.
Platform (please complete the following information):
I don't think this information is needed for this issue.
Describe the bug
Consider the following scenario:
Server is running in a pod on node-1
We annotate the server to observe dropped packets
Apply a network policy for ingress
When a pod on other nodes tries to connect to the server, the IPTABLES rule on that node will drop the packet. However, for local context, our filter_map won't have the IP of the server, hence we will not generate any event. Thus, we will see drop_count increase at the node level, but no pod-level labels for the drops.
This is true for external connections as well.
This is due to how Azure NPM works today. May need to add disclaimer to account for this behavior.
When removing a namespace annotation, the corresponding IP is not removed from the filtermap, leading to the continuous generation of metrics.
Upon removal of the namespace annotation, the associated IP should be removed from the filtermap, and metrics generation should cease.
Metrics continue to be generated after removing the namespace annotation. Reconciliation has been observed in the namespace controller, with no apparent errors.
Here is some logs of an automated test. Manual test on single ns or pod should produce same results.
Annotated a ns
Annotating namespace {"namespace": "test-drops-annotation-metrics-1696500004"}
Confirmed it was annotated
Annotated namespaces before removal {"annotatedns": [{"metadata":{"name":"test-drops-annotation-metrics-1696500004","uid":"643a065c-3915-4b2d-9636-a2e8f624ff6c","resourceVersion":"5093791","creationTimestamp":"2023-10-06T21:23:08Z","labels":{"e2e":"true","kubernetes.io/metadata.name":"test-drops-annotation-metrics-1696500004"},"annotations":{"retina.io/v1alpha1":"observe"},"managedFields":[{"manager":"dropreason.test","operation":"Update","apiVersion":"v1","time":"2023-10-06T21:23:27Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:retina.io/v1alpha1":{}},"f:labels":{".":{},"f:e2e":{},"f:kubernetes.io/metadata.name":{}}}}}]},"spec":{"finalizers":["kubernetes"]},"status":{"phase":"Active"}}]}
Removed annotation and confirmed it was removed.
Annotated namespaces after removal {"annotatedns": null}
Metrics still being generated.
drop packet {"labels": {"Metric":"networkobservability_adv_drop_count","Labels":["direction","egress","reason","IPTABLE_RULE_DROP","ip","","namespace","test-drops-annotation-metrics-","podname","client","workloadKind","","workloadName",""]}, "value": 33}
After removing the annotation, the log from the namespace reconciler is present, showing that the annotation has been removed.
2023-10-06T21:23:47.630Z info NamespaceReconciler namespace/namespace_controller.go:60 Namespace does not have annotation {"namespace": "test-drops-annotation-metrics-1696500004", "annotations": null}
Metrics are still being generated:
2023-10-06T21:23:47.860Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
2023-10-06T21:23:47.860Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
2023-10-06T21:23:47.862Z debug MetricModule.dropreason-metricsmodule metrics/drops.go:160 drop count metric is added in EGRESS in local ctx {"labels": ["IPTABLE_RULE_DROP", "egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
The cache finds the pod IP and enriches it:
Cache cache/cache.go:155 pod found for IP {"ip": "10.224.0.32", "pod Name": "test-drops-annotation-metrics-1696500004/client"}
2023-10-06T21:23:53.403Z debug enricher enricher/enricher.go:132 enriched flow {"flow": "time:{seconds:965422847940441} verdict:DROPPED IP:{source:\"10.224.0.32\" destination:\"10.224.0.62\" ipVersion:IPv4} l4:{TCP:{source_port:61582 destination_port:20480}} source:{namespace:\"test-drops-annotation-metrics-1696500004\" labels:\"pod=client\" pod_name:\"client\"} traffic_direction:INGRESS trace_observation_point:TO_HOST extensions:{[type.googleapis.com/utils.RetinaMetadata]:{bytes:60}}"}
2023-10-06T21:23:53.403Z debug MetricModule metrics/forward.go:160 forward count metric in EGRESS in local ctx {"labels": ["egress", "10.224.0.32", "test-drops-annotation-metrics-1696500004", "client", "unknown", "unknown"]}
Describe the bug
When I follow the installation setup for basic mode.
To Reproduce
Clone the repo and run make helm-install
Expected behavior
No error to create the package of helm
Platform (please complete the following information):
Additional context
The error:
seilor retina main ≡ ~2 make helm-install ﳑ in bash at 11:31:08
cd crd && make manifests && make generate
make[1]: Entering directory '/home/seilor/retina/crd'
make[2]: Entering directory '/home/seilor/retina'
cd /home/seilor/retina/hack/tools; go mod download; go build -tags=tools -o bin/controller-gen sigs.k8s.io/controller-tools/cmd/controller-gen
build sigs.k8s.io/controller-tools/cmd/controller-gen: cannot load io/fs: malformed module path "io/fs": missing dot in first path element
make[2]: *** [Makefile:90: /home/seilor/retina/hack/tools/bin/controller-gen] Error 1
make[2]: Leaving directory '/home/seilor/retina'
make[1]: *** [Makefile:21: /home/seilor/retina/hack/tools/bin/controller-gen] Error 2
make[1]: Leaving directory '/home/seilor/retina/crd'
make: *** [Makefile:391: manifests] Error 2
Retina has CAP_NET_ADMIN, SYS_ADMIN, and others.
Evaluate the caps and make sure we are adding the minimum required permissions.
If `enableAnnotations=false`, then advanced metrics won't show up unless a MetricConfiguration CRD exists (metric modules are initialized via metricModule's `Reconcile()`). We should initialize with default context options instead.
Also, should we support the MetricConfiguration CRD when `enableAnnotations=true`? Right now, we do not:
Lines 254 to 268 in 9248f0d
Windows CodeQL runs take almost 10x longer than the Linux variant and are the biggest chunk of CI time by far.
Why? Can this be improved?
Duplicate import -
retina/pkg/module/metrics/latency.go
Line 15 in debc188
Is your feature request related to a problem? Please describe.
Currently, setting up an installation in a cluster involves several steps to obtain the Helm manifests. This process includes acquiring various tools locally and executing multiple commands, leading to complexity and potential errors.
Describe the solution you'd like
I envision a straightforward installation process for Retina. Simplifying the setup to just a few steps would greatly enhance user experience.
Describe alternatives you've considered
One feasible alternative is to establish a dedicated repository for Retina manifests. By adding this repository to Helm, users could easily access and install Retina with minimal effort:
helm repo add retinarepository
helm install retina
Additional context
Implementing this solution could significantly reduce the occurrence of issues that need to be investigated, streamlining the deployment process for Retina.
Describe the bug
Only the Clusters dashboard is included in the https://grafana.com/grafana/dashboards/18814-kubernetes-networking-clusters/ dashboard mentioned in the docs.
To Reproduce
Install Retina and import the Grafana dashboard.
Expected behavior
DNS, Pods and Clusters dashboard should be imported.
Actual behavior
Only the Clusters dashboard is included.
Platform (please complete the following information):
Fix the type for Drop.
Ref:
retina/pkg/utils/flow_utils.go
Line 106 in 078d15b
Also, fix the dropreason number - https://github.com/cilium/cilium/blob/d13b89dc5d91b674272ded11104372e16fe937aa/api/v1/flow/flow.pb.go#L430
Describe the bug
The metrics in the pod-level.json file (/deploy/grafana/dashboards) are prefixed with retina, and none of the graphs display values. The actual metrics from Prometheus are prefixed with networkobservability.
For example:
"sum(irate(**retina**_adv_forward_count{source_podname=~\"$pod\"}[1m]))",
Is your feature request related to a problem? Please describe.
In some environments, blob storage can be difficult to use. It would be nice to support s3 upload for capture.
Describe the solution you'd like
Add an s3Upload spec to the following existing outputconfiguration.
spec.outputConfiguration: Indicates where the captured data will be stored. It includes the following properties:
blobUpload: Specifies a secret containing the blob SAS URL for storing the capture data.
hostPath: Stores the capture files into the specified host filesystem.
persistentVolumeClaim: Mounts a PersistentVolumeClaim into the Pod to store capture files.
s3Upload: Specifies an S3 upload URL for storing the capture data.
The s3Upload spec might require the following additional fields:
spec:
outputConfiguration:
s3Upload:
endpoint: {{ s3-url }}
bucket: {{ bucket_name }}
accessKey: {{ access_key }}
secretKey: {{ secret_key }}
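Put together, a full Capture manifest using the proposed field might look like the sketch below. The s3Upload fields are the proposal above, not an implemented API, and all values (endpoint, bucket, duration) are illustrative:

```yaml
apiVersion: retina.sh/v1alpha1
kind: Capture
metadata:
  name: example-capture
spec:
  captureConfiguration:
    captureOption:
      duration: 30s               # illustrative capture duration
  outputConfiguration:
    s3Upload:                     # proposed field, not yet implemented
      endpoint: https://s3.us-east-1.amazonaws.com   # hypothetical endpoint
      bucket: my-captures                            # hypothetical bucket
      accessKey: <access-key>     # in practice these should come from a Secret
      secretKey: <secret-key>
```

In a real design the credentials would likely be referenced via a Kubernetes Secret, mirroring how blobUpload references a secret containing the SAS URL.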
Describe alternatives you've considered
Additional context
If this feature makes sense and I can be assigned to it, I'd like to work on implementing it.
Describe the bug
Currently, Retina checks host endianness when enriching port/IP addresses. That is not correct: network byte order is always big-endian. Handle the conversion in BPF code.
To Reproduce
Ref:
retina/pkg/utils/utils_linux.go
Line 78 in 61e1152
Expected behavior
Network byte order is big-endian; the conversion should be handled accordingly in the BPF code.
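Since the wire format is always big-endian, raw packet bytes can be interpreted with encoding/binary's BigEndian directly, with no host-endianness check at all. A minimal Go sketch (the helper names and slices are illustrative, not Retina's actual code):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// parsePort interprets the raw two bytes of a TCP/UDP port field as
// network (big-endian) byte order. No host-endianness check is needed:
// the wire format is big-endian regardless of the host CPU.
func parsePort(raw []byte) uint16 {
	return binary.BigEndian.Uint16(raw)
}

// parseIPv4 interprets four raw bytes from a packet as an IPv4 address,
// in the order they appear on the wire.
func parseIPv4(raw []byte) netip.Addr {
	var b [4]byte
	copy(b[:], raw)
	return netip.AddrFrom4(b)
}

func main() {
	// Bytes as they appear on the wire for port 443 and 10.224.0.32.
	fmt.Println(parsePort([]byte{0x01, 0xBB}))     // 443
	fmt.Println(parseIPv4([]byte{10, 224, 0, 32})) // 10.224.0.32
}
```

The same idea applies in BPF code: treat multi-byte header fields as big-endian at the point of parsing (e.g. with bpf_ntohs-style helpers) so userspace never has to guess.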
Platform (please complete the following information):
Describe the bug
Error: failed to start container "init-retina": Error response from daemon: path /sys/fs/bpf is mounted on /sys but it is not a shared mount
Getting the above error when I tried to install on a Mac.
Mac details
Darwin XXX 22.6.0 Darwin Kernel Version 22.6.0: Mon Feb 19 19:45:09 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6000 arm64
To Reproduce
Install on Docker Desktop:
VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
--version $VERSION \
--namespace kube-system \
--set image.tag=$VERSION \
--set operator.tag=$VERSION \
--set image.pullPolicy=Always \
--set logLevel=info \
--set os.windows=true \
--set operator.enabled=true \
--set operator.enableRetinaEndpoint=true \
--skip-crds \
--set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
--set enablePodLevel=true \
--set enableAnnotations=true
After that the target is not up
Is your feature request related to a problem? Please describe.
Currently, we are unable to run our e2e tests on PRs opened from forks because GitHub does not share secrets with forks, for security reasons. For this reason, we only run e2e when a PR is added to the merge queue.
Our current implementation uses AKS for our e2e, but it can be implemented in any Kubernetes cluster.
Describe the solution you'd like
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
Recently, we saw an issue where, on a particular node, packets were dropped on a particular CPU. However, kubectl top did not show the CPU running hot (because only a few CPUs out of 32 were overburdened). Add the CPU number to the flow extension for drops.
cc: @anubhabMajumdar
Describe the bug
Currently, the timestamps of Packetparser- and Dropreason-generated flows are wrong and not current. They need to be set in userspace rather than parsed from BPF events.
Ref:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Timestamp should be current.
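One way to do this is to stamp wall-clock time in userspace at the moment the event is read from the ring buffer, since the kernel-side bpf_ktime_get_ns() value is nanoseconds since boot, not wall-clock time. A hedged Go sketch (the event struct is illustrative, not Retina's actual type):

```go
package main

import (
	"fmt"
	"time"
)

// event mirrors a parsed BPF ring-buffer record. KtimeNs is the raw
// kernel timestamp (monotonic, ns since boot); ReadTime is the
// wall-clock time assigned in userspace when the record is consumed.
type event struct {
	KtimeNs  uint64
	ReadTime time.Time
}

// stampEvent attaches the current wall-clock time to a freshly read
// BPF event, so downstream flows carry a correct timestamp.
func stampEvent(ktimeNs uint64) event {
	return event{KtimeNs: ktimeNs, ReadTime: time.Now()}
}

func main() {
	e := stampEvent(123456789)
	fmt.Println(e.ReadTime.IsZero()) // false: stamped at read time
}
```

The kernel ktime can still be kept alongside for ordering events within a batch; only the flow's user-visible timestamp needs to come from userspace.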
Platform (please complete the following information):
Is your feature request related to a problem? Please describe.
We are using Retina's packetparser
plugin to collect information about packets. We love the fact that under the hood, packets are treated as "events" that are sent to userspace and are annotated with Kubernetes information in enricher
. However, the following step of using the forward
(or some other) module does not work well for us – we don't want to expose and collect Prometheus metrics and instead want to continue treating packets as "events" and insert them into a ClickHouse instance with a SQL query.
Retina provides wonderful infrastructure to capture and annotate packet data, but the data ingestion pipeline, which is typically the part you need to customise the most, can't be modified without forking Retina, unless I am missing something obvious.
Would you consider making it possible to create your own modules with custom ProcessFlow
implementations?
Describe the solution you'd like
Make it easy to write a custom module that implements the AdvMetricsInterface
interface and is loaded into Retina.
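A minimal sketch of what a pluggable module could look like. The interface shape below is an assumption for illustration (the real AdvMetricsInterface and flow types live in Retina's code); it only shows the idea of a custom ProcessFlow implementation that batches events for a store like ClickHouse instead of exporting Prometheus metrics:

```go
package main

import "fmt"

// flow is a stand-in for Retina's enriched flow type (the real one is
// the Hubble/Cilium flow protobuf); used here only for illustration.
type flow struct {
	SrcIP, DstIP string
	Verdict      string
}

// flowProcessor is an assumed module shape: anything that can consume
// enriched flows from the enricher.
type flowProcessor interface {
	ProcessFlow(f *flow)
}

// clickhouseExporter sketches a custom module that batches flows for
// a SQL INSERT instead of incrementing Prometheus counters.
type clickhouseExporter struct {
	batch []*flow
}

func (c *clickhouseExporter) ProcessFlow(f *flow) {
	// Real code would flush batches to ClickHouse on size/time triggers.
	c.batch = append(c.batch, f)
}

func main() {
	var p flowProcessor = &clickhouseExporter{}
	p.ProcessFlow(&flow{SrcIP: "10.224.0.32", DstIP: "10.224.0.62", Verdict: "DROPPED"})
	fmt.Println("batched flows:", len(p.(*clickhouseExporter).batch))
}
```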
Describe alternatives you've considered
N/A
Additional context
Please let me know if there is a simpler way to write a custom data exporter. If there isn't, let me know what a good solution would be, and I'd be happy to contribute to the project.
Is your feature request related to a problem? Please describe.
There are 3 pipelines that build the images and use the same GitHub Actions code.
Describe the solution you'd like
A good solution will be to create a template that will reuse the code and pass parameters based on what is needed in each pipeline.
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
Add support for reading InfiniBand port counters and stats from Mellanox (mlx) drivers.
We can extend the Linux util plugin to read port counters and status parameters located under /sys/class/infiniband/ and /sys/class/net, and expose them as metrics.
Ideally we should introduce a new plugin to read and convert these metrics.
Some references:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_infiniband_and_rdma_networks/understanding-infiniband-and-rdma_configuring-infiniband-and-rdma-networks
https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
Is your feature request related to a problem? Please describe.
Currently, Retina only enriches flows based on the primary IP of the Pod, since we only add the primary IP to the ipToEpKey map in the cache. We should support adding all IPs so we can look up the endpoint for secondary IPs as well.
retina/pkg/controllers/cache/cache.go
Lines 204 to 208 in 5cfb7ef
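Conceptually the fix is to index every Pod IP to the same endpoint key, not just the primary one. A simplified sketch (the key type and helper are stand-ins for illustration, not the real cache API):

```go
package main

import "fmt"

// epKey identifies an endpoint; the real cache uses its own key type,
// so this struct is a stand-in for illustration.
type epKey struct{ Namespace, Name string }

// addPodIPs maps a Pod's primary and secondary IPs to the same endpoint
// key, so flows carrying any of the Pod's IPs can be enriched.
func addPodIPs(ipToEpKey map[string]epKey, key epKey, primaryIP string, secondaryIPs []string) {
	ipToEpKey[primaryIP] = key
	for _, ip := range secondaryIPs {
		ipToEpKey[ip] = key
	}
}

func main() {
	m := map[string]epKey{}
	addPodIPs(m, epKey{"default", "client"}, "10.224.0.32", []string{"10.224.0.62"})
	fmt.Println(len(m)) // 2: both IPs resolve to the same endpoint
}
```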