cloudworkz / kube-eagle
A Prometheus exporter created to provide a better overview of your resource allocation and utilization in a Kubernetes cluster.
License: MIT License
Hey,
I'm running a cluster on EKS (so no node pools) and have metrics-server and kube-eagle installed; both expose their metrics correctly.
However, after importing the dashboard into Grafana (we're on 6.1.2), all the panels are full of random-walk data:
Drilling into a panel, I only see it filled with the random-walk queries:
Looking at the JSON model again, the targets are set correctly:
"targets": [
  {
    "expr": "sum(eagle_node_resource_usage_memory_bytes{node=~\"$node_pool.*\", node=~\"$node.*\"}) / sum(eagle_node_resource_allocatable_memory_bytes{node=~\"$node_pool.*\", node=~\"$node.*\"})",
    "format": "time_series",
    "instant": true,
    "intervalFactor": 1,
    "refId": "A"
  }
],
Do you have any clue what's going on? I also tried the dance with the node pool variable, but that has no effect on the random data.
EDIT: okay, it seems that the datasource is missing in each of the panels:
"datasource": "${DS_PROMETHEUS}",
I manually added it there for each panel and it imports just fine :)
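Rather than editing every panel by hand, the missing datasource field can be patched in with a short script. A minimal sketch, assuming the dashboard JSON keeps panels either in a flat `panels` list or nested under `rows` (as older Grafana dashboards do):

```python
import json

def add_datasource(dashboard, datasource="${DS_PROMETHEUS}"):
    """Set the datasource on every panel that doesn't already have one."""
    # Older dashboards nest panels under "rows"; newer ones use a flat "panels" list.
    panels = list(dashboard.get("panels", []))
    for row in dashboard.get("rows", []):
        panels.extend(row.get("panels", []))
    for panel in panels:
        panel.setdefault("datasource", datasource)
    return dashboard

# Demo on a tiny in-memory dashboard:
dash = json.loads('{"rows": [{"panels": [{"id": 1}]}]}')
add_datasource(dash)
print(dash["rows"][0]["panels"][0]["datasource"])  # -> ${DS_PROMETHEUS}
```

For a real dashboard, load the exported JSON with `json.load`, run the function, and dump it back before importing into Grafana.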
My Prometheus reports metrics like this as 0 for all nodes.
Any reasons?
eagle_node_resource_usage_cpu_cores{endpoint="http",instance="100.126.64.40:8080",job="kube-eagle",namespace="monitoring",node="ip-172-20-95-140.ec2.internal",pod="kube-eagle-d4c4bbf9f-4vgtx",service="kube-eagle"}
can you tell me if kube-eagle is usable with aws eks ?
regards
Currently there is no way to shard metrics. It could be possible to implement something similar to this PR. Static sharding would be enough for starters.
What do you think? Could I raise a PR for this feature?
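To illustrate the idea of static sharding: each exporter replica gets a shard index and only exports metrics for the nodes whose name hashes into its shard. This is only a sketch of the concept, not kube-eagle code; the function and names are hypothetical:

```python
import zlib

def assigned_to_shard(node_name: str, shard_index: int, total_shards: int) -> bool:
    """Deterministically assign a node to exactly one of `total_shards` shards."""
    # crc32 is stable across processes and restarts, unlike Python's built-in hash().
    return zlib.crc32(node_name.encode()) % total_shards == shard_index

nodes = ["node-a", "node-b", "node-c", "node-d"]
for shard in range(2):
    owned = [n for n in nodes if assigned_to_shard(n, shard, 2)]
    print(f"shard {shard} exports: {owned}")
```

Every node lands in exactly one shard, so two replicas configured with shard indices 0 and 1 would together cover the whole cluster without duplicating series.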
I noticed there are lots of restarts.
5m14s Warning Unhealthy Pod Liveness probe failed: Get http://10.244.2.95:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
10m Warning Unhealthy Pod Readiness probe failed: Get http://10.244.2.95:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
22m Warning Unhealthy Pod Readiness probe failed: Get http://10.244.2.95:8080/health: EOF
22m Normal Killing Pod Killing container with id docker://kube-eagle:Container failed liveness probe.. Container will be killed and recreated.
Do I need to change something in the health-check parameters to make it work properly?
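If the /health endpoint is occasionally slow (for example because the Kubernetes API calls behind it take a while), relaxing the probe settings may help. A possible adjustment; the values are illustrative, not recommendations from the project:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 10   # give the exporter time to start
  timeoutSeconds: 10        # allow slower responses before counting a failure
  periodSeconds: 30
  failureThreshold: 3
```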
I have just installed kube-eagle and I like it a lot. However, when I spin up a new node pool (via autoscaling or manually), the kube-eagle exporter won't run on it. In comparison, the exporter that prometheus-operator deploys automatically does run on new node pools.
Is this a bug in my deployment configuration? Or is it the intended behavior, and if so, how can I set up kube-eagle so that it scrapes my new node pool?
Hi!!
Amazing stuff you got there.
I have a question/problem.
When a pod dies and is replaced, there's still data for it in kube-eagle, and Prometheus keeps serving it to my Grafana, so it's misleading how many resources are actually being used.
Is there a setting somewhere in kube-eagle, or in Prometheus, that controls where data is stored and for how long?
Thanks
Math
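On the retention question: kube-eagle itself does not store data; how long samples are kept is a Prometheus setting, and Prometheus normally marks series from disappeared targets stale within about five minutes. The retention window is controlled by a Prometheus 2.x flag, for example:

```shell
# Keep 15 days of data
prometheus --storage.tsdb.retention.time=15d
```

If old pods linger in the dashboard much longer than that, the cause is usually the query's time range rather than kube-eagle.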
After deploying Kube-Eagle using the helm chart, I got the following logs:
{"level":"info","msg":"Starting kube eagle v1.1.0","time":"2019-03-27T20:08:20Z"}
{"level":"info","msg":"Creating InCluster config to communicate with Kubernetes master","time":"2019-03-27T20:08:20Z"}
{"level":"info","msg":"Listening on 0.0.0.0:8080","time":"2019-03-27T20:08:20Z"}
{"level":"warning","msg":"Failed to get podMetricses from Kubernetesthe server could not find the requested resource (get pods.metrics.k8s.io)","time":"2019-03-27T20:09:19Z"}
{"level":"error","msg":"Collector 'container_resources' failed after 0.051743s: the server could not find the requested resource (get pods.metrics.k8s.io)","time":"2019-03-27T20:09:19Z"}
{"level":"warning","msg":"Failed to get podList from Kubernetesthe server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2019-03-27T20:09:19Z"}
[...]
Probably something simple; I didn't look into it too much. Maybe a permission issue? (Even though the ClusterRoles and everything for kube-eagle seem correctly configured.)
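The `pods.metrics.k8s.io` / `nodes.metrics.k8s.io` "could not find the requested resource" errors usually mean the resource-metrics API itself isn't available, i.e. metrics-server is missing or unhealthy, rather than a permission problem. A quick way to check:

```shell
# Is the resource-metrics API registered and available?
kubectl get apiservice v1beta1.metrics.k8s.io

# This also fails if metrics-server isn't working:
kubectl top nodes
```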
We recommend using our provided helm chart to deploy kube eagle in your cluster:
Why add all that complexity when a clean Kubernetes YAML file will work [better]? sed is a superior "templating engine" compared to Helm. Can I get the real YAML? I tried downloading the chart and using helm template, but it just gives a gzip error.
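For reference, Helm can render a chart to plain YAML without installing anything. Assuming the chart repo has already been added under the name `kube-eagle`, with Helm 2 syntax that looks roughly like:

```shell
# Fetch and unpack the chart, then render it to plain YAML
helm fetch --untar kube-eagle/kube-eagle
helm template ./kube-eagle > kube-eagle.yaml
```

If `helm fetch` produces a gzip error, the repo URL may be pointing at an HTML page rather than the chart archive.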
In the POD column I only see the kube-eagle pod. How can I determine which container this is if I have multiple pods with the same container name?
Same thing with RAM.
I understand that Prometheus scrapes the metrics from the kube-eagle pod, but I think kube-eagle needs to add labels with the actual pod names where each container is running.
What do you think about this?
So far so good. The data is inserted into Prometheus OK. But in Grafana I'm not sure what node_pool is
or what to set it to. It's just an empty field with a comma for me.
Can you guide me, please?
Thanks a lot!
InitContainers aren't listed in the exposed metrics.
Quick debugging showed that the Go client lists init-container statuses separately from the other containers' statuses.
I created the kube-eagle with your helm chart.
I have a Prometheus operator created with stable/prometheus-operator chart.
The pod logs :
{"level":"info","msg":"Listening on 0.0.0.0:8080","time":"2019-03-04T13:00:35Z"}
{"level":"info","msg":"Creating InCluster config to communicate with Kubernetes master","time":"2019-03-04T13:00:35Z"}
When I port-forward : kubectl port-forward kube-eagle-69c44869d7-qw7sr 8080:8080
http://localhost:8080 => error 404
http://localhost:8080/health => HTTP 200, text is "Ok"
When I look into my Prometheus, I don't have any metric labeled "eagle_*"
Do I have to add some target in my Prometheus to scrape the kube-eagle pod ?
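Yes: with the Prometheus operator, scrape targets are added declaratively via a ServiceMonitor. A sketch; the label names and selectors here are assumptions and must match your kube-eagle Service and your Prometheus CR's serviceMonitorSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-eagle
  labels:
    release: prometheus-operator   # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: kube-eagle              # must match the kube-eagle Service's labels
  endpoints:
  - port: http                     # the Service port name exposing /metrics
    interval: 30s
```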
Hi,
We're facing an issue where the pod is constantly restarting:
NAME READY STATUS RESTARTS AGE
kube-eagle-6b6c46d47d-pjbzl 1/1 Running 98 3d19h
And the logs are full of the following messages:
2019/08/19 06:40:08 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:59)
Any ideas what might be wrong?
Deployment YAML example:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      labels:
        app: kube-eagle
    spec:
      containers:
      - env:
        - name: LOG_LEVEL
          value: info
        image: quay.io/google-cloud-tools/kube-eagle:1.1.0
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-eagle
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 512Mi
      serviceAccount: sa-kube-eagle
      serviceAccountName: sa-kube-eagle
Any ideas how to fix this?
Sometimes I get "insufficient pods" errors on monitored clusters.
For this it would be great to also have running pods per node / max pods per node.
I managed to add such information by editing the "Node CPU" panel and adding the following:
sum (label_replace(kubelet_running_pod_count{instance=~"$node_pool.*", instance=~"$node.*"}, "node", "$1", "instance", "(.*)")) by (node) / sum (kube_node_status_allocatable_pods{node=~"$node_pool.*", node=~"$node.*"}) by (node)
It would be great to have a panel (or additions to this panel) for "running pods", "max pods" and "running/max pods".
I am new to Prometheus Operator, so perhaps that explains my confusion. I deployed kube-eagle via Helm and enabled the Service Monitor. I assume I need to add a scrape config to Prometheus (helm kube-prometheus-stack). They are all in the same monitoring namespace. For some reason I thought that, given the service monitor is in the same namespace, Prometheus would pick it up and start scraping.
Is there an example scrape config?
Looking for standard yaml deployment.
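If the operator's selectors don't match, a plain scrape config also works. A minimal static example (the namespace and port are assumptions); with kube-prometheus-stack it can go under `additionalScrapeConfigs`:

```yaml
scrape_configs:
  - job_name: kube-eagle
    static_configs:
      - targets: ['kube-eagle.monitoring.svc:8080']
```

Note that by default the kube-prometheus-stack chart only selects ServiceMonitors carrying its own release label, which is the usual reason a ServiceMonitor in the same namespace gets ignored.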
In Google Cloud each vCPU and each GB of RAM has a fixed price. Kube Eagle could easily aggregate the allocatable and in-use CPU & RAM and add a pricing metric for that.
This way one could get an overview of how expensive a namespace or deployment is and what saving potential it has (usage compared to allocatable resources).
Challenges:
Resource requests and limits falsify the aggregated node metrics.
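The aggregation itself is simple arithmetic. A sketch of the pricing idea; the per-vCPU and per-GB figures below are placeholders, not real GCP prices:

```python
PRICE_PER_VCPU_HOUR = 0.031     # placeholder, not a real GCP price
PRICE_PER_GB_RAM_HOUR = 0.004   # placeholder

def hourly_cost(cpu_cores: float, memory_bytes: float) -> float:
    """Estimate the hourly cost of a set of allocated or used resources."""
    memory_gb = memory_bytes / 1024**3
    return cpu_cores * PRICE_PER_VCPU_HOUR + memory_gb * PRICE_PER_GB_RAM_HOUR

# e.g. a namespace requesting 4 cores and 8 GiB of RAM:
print(round(hourly_cost(4, 8 * 1024**3), 4))
```

Applied per namespace or deployment, the same formula on usage vs. allocatable resources would give the saving potential mentioned above.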
Nice work! Very useful. But it looks like for cpu the capacity is reported instead of the allocatable.
Currently they are showing up in the "Container CPU" and "Container RAM" tables.
This is causing me a serious issue, as I'm forced to check dozens of terminated containers to find the active container when containers cycle frequently.
EDIT: I'm not sure if this makes sense at all; I think I can deal with it. What do you guys think?
If a node has some reserved resources, its allocatable CPU cores differ from its capacity CPU cores (total node CPU cores).
eagle_node_resource_allocatable_cpu_cores should return the allocatable CPU cores instead of the total core count.
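The difference is visible directly on the node object, where capacity and allocatable are separate fields:

```shell
# Prints "<capacity cpu> <allocatable cpu>"; replace <node-name> with a real node
kubectl get node <node-name> -o jsonpath='{.status.capacity.cpu} {.status.allocatable.cpu}'
```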
It would be a useful feature. From the metrics API we can get all resources, including GPUs (if registered).
K8s 1.16 has removed deprecated API versions.
Do you have any plans to make it work for 1.16 too?
helm install --name=kube-eagle kube-eagle/kube-eagle
Error: validation failed: unable to recognize "": no matches for kind "Deployment" in version "apps/v1beta2"
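The error indicates the chart's Deployment manifest still uses `apps/v1beta2`, which Kubernetes 1.16 removed. The fix is moving the manifest to `apps/v1`, which also makes `spec.selector` mandatory:

```yaml
apiVersion: apps/v1        # was: apps/v1beta2
kind: Deployment
metadata:
  name: kube-eagle
spec:
  selector:                # required in apps/v1
    matchLabels:
      app: kube-eagle
  # (rest of the spec unchanged)
```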
I see there is no data in the collectors yet.
I checked Prometheus, and there are no eagle_scrape_collector_duration_seconds metrics.
I guess they're not added yet?
Hello,
I've just tried to install kube-eagle on my cluster with Helm, and I get Unauthorized errors in the logs:
{"level":"warning","msg":"Failed to get podMetricses from KubernetesUnauthorized","time":"2019-03-15T14:01:05Z"}
{"level":"error","msg":"Collector 'container_resources' failed after 0.456981s: Unauthorized","time":"2019-03-15T14:01:05Z"}
{"level":"warning","msg":"Failed to get podList from KubernetesUnauthorized","time":"2019-03-15T14:01:05Z"}
{"level":"error","msg":"Collector 'node_resource' failed after 0.494984s: Unauthorized","time":"2019-03-15T14:01:05Z"}
kubectl version:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-28T15:20:58Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.5", GitCommit:"51dd616cdd25d6ee22c83a858773b607328a18ec", GitTreeState:"clean", BuildDate:"2019-01-16T18:14:49Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"}
Thanks for your help