Crane is a FinOps Platform for Cloud Resource Analytics and Economics in Kubernetes clusters. The goal is not only to help users manage cloud costs more easily but also to ensure the quality of applications.
There is currently no real-time data provider. Implement metrics-server as a real-time data provider; this can reduce Prometheus traffic when the cluster is large.
We need to propagate labels and annotations when creating the HPA internally; the source is the EHPA.
It would be better to configure this through command-line arguments such as "--ehpa-propagation-label-prefix" and "--ehpa-propagation-annotation-prefix".
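A minimal sketch of what prefix-based propagation could look like, assuming the flag value is parsed into a list of key prefixes. The helper name `propagateByPrefix` and the example label keys are hypothetical and not taken from the Crane code base.

```go
package main

import (
	"fmt"
	"strings"
)

// propagateByPrefix returns the entries of src whose key starts with any of
// the given prefixes. Empty prefixes are ignored, so an empty or blank prefix
// list propagates nothing.
func propagateByPrefix(src map[string]string, prefixes []string) map[string]string {
	out := make(map[string]string)
	for k, v := range src {
		for _, p := range prefixes {
			if p != "" && strings.HasPrefix(k, p) {
				out[k] = v
				break
			}
		}
	}
	return out
}

func main() {
	// Labels on the EHPA object (hypothetical values).
	ehpaLabels := map[string]string{
		"team.example.com/owner": "payments",
		"app":                    "checkout",
	}
	// The prefix list stands in for the value of --ehpa-propagation-label-prefix.
	hpaLabels := propagateByPrefix(ehpaLabels, []string{"team.example.com/"})
	fmt.Println(hpaLabels)
}
```

The same helper could be applied to annotations with the prefixes from "--ehpa-propagation-annotation-prefix".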
Currently the recommendation result is presented in recommendation.status; we could also write it to the target's annotations.
This feature should be optional, controlled in the recommendation's spec.
Describe the bug
NodeCpuUsagePromQLFmtStr = sum(count(node_cpu_seconds_total{mode="idle",instance="%s"}) by (mode, cpu)) - sum(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[%s]))
NodeMemUsagePromQLFmtStr = sum(node_memory_MemTotal_bytes{instance="%s"} - node_memory_MemAvailable_bytes{instance="%s"})
The above two default queries use a regular expression, which causes query results that do not meet expectations.
Reproduce steps
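One way this kind of mismatch can be reproduced is sketched below. It assumes the matcher has the form instance=~"%s.*" and relies on the fact that Prometheus fully anchors regular expression matchers. The node addresses are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesInstance mimics how Prometheus evaluates instance=~"<pattern>":
// the regular expression must match the whole label value.
func matchesInstance(pattern, instance string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(instance)
}

func main() {
	node := "10.0.0.1"      // hypothetical node address
	pattern := node + ".*" // result of formatting "%s.*" with the node address
	for _, inst := range []string{"10.0.0.1:9100", "10.0.0.10:9100", "10.0.0.2:9100"} {
		fmt.Printf("%-16s matched=%v\n", inst, matchesInstance(pattern, inst))
	}
}
```

Under these assumptions the pattern for node 10.0.0.1 also matches 10.0.0.10:9100, so the usage of two different nodes would be summed into one result.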
Expected behavior
Screenshots
Environment (please complete the following information):
The ext-resource service (a service that uses ext resources) is meant to use the idle resources of the Kubernetes node. If the CPU used by the ext-resource service is counted as part of the Kubernetes node's CPU usage, nodeResourceController will double-count that CPU when updating the node's ext resources (the ext resources requested by the service have already been counted as allocated by the kubelet).
Crane-Agent should expose the CPU usage metrics of the ext-resource service, such as node_ext_cpu_usage_seconds_total.
NodeCpuUsagePromQLFmtStr: sum(count(node_cpu_seconds_total{mode="idle",instance=~"%s.*"}) by (mode, cpu)) - sum(irate(node_cpu_seconds_total{mode="idle",instance=~"%s.*"}[%s])) - (sum(irate(node_ext_cpu_usage_seconds_total{node="%s"}[%s])) or vector(0))
We need unit tests (UT) for many functions, and many of them are isolated enough to be picked up by newcomers.
For example, match() in pkg/controller/analytics/analytics_controller.go. Help needed.
Currently the Node Resource Controller updates the Kubernetes node's ext resource with the predicted idle resource. All idle resources are reallocated as ext resource, which can be used by lower-priority pods, especially offline jobs. However, this can exhaust the node's resources, so some of those pods would be evicted during execution.
So the request is to make the percentage of idle resource that is reallocated configurable, e.g. 4 CPU cores are idle, but only 2 cores are reallocated.
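The arithmetic behind the request can be sketched as follows, assuming the setting is expressed as an integer percentage and CPU is tracked in millicores. The function name `reclaimableMilliCPU` is hypothetical.

```go
package main

import "fmt"

// reclaimableMilliCPU returns how much of the predicted idle CPU (in
// millicores) would be exposed as ext resource for a percentage in [0, 100].
// Out-of-range inputs are clamped rather than rejected.
func reclaimableMilliCPU(idleMilli, percent int64) int64 {
	if idleMilli < 0 {
		idleMilli = 0
	}
	if percent < 0 {
		percent = 0
	}
	if percent > 100 {
		percent = 100
	}
	return idleMilli * percent / 100
}

func main() {
	// 4 idle cores with a 50% setting: only 2 cores are reallocated.
	fmt.Println(reclaimableMilliCPU(4000, 50)) // prints 2000
}
```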
Currently, to get more detailed debug information from crane, we must restart crane with a new log level. In a production environment, it would be better to be able to change the log level dynamically for debugging.
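A rough sketch of one possible approach, using only the Go standard library: keep the verbosity in an atomic variable and expose an HTTP handler that updates it. The endpoint path, the `v` query parameter, and all function names are assumptions for illustration; this is not how crane or its logging library is wired today.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync/atomic"
)

// verbosity holds the current log verbosity and can be changed at runtime.
var verbosity int32

// V reports whether messages at the given level should be logged.
func V(level int32) bool { return level <= atomic.LoadInt32(&verbosity) }

// levelHandler handles PUT requests with a ?v=<n> query parameter and
// stores n as the new verbosity.
func levelHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "use PUT", http.StatusMethodNotAllowed)
		return
	}
	n, err := strconv.ParseInt(r.URL.Query().Get("v"), 10, 32)
	if err != nil || n < 0 {
		http.Error(w, "v must be a non-negative integer", http.StatusBadRequest)
		return
	}
	atomic.StoreInt32(&verbosity, int32(n))
	fmt.Fprintf(w, "verbosity set to %d\n", n)
}

// putLevel sends an in-process PUT request to levelHandler and returns the
// HTTP status code; it exists only to exercise the handler without a server.
func putLevel(v string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPut, "/loglevel?v="+v, nil)
	levelHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(V(4)) // false: default verbosity is 0
	putLevel("4")
	fmt.Println(V(4)) // true: level changed without a restart
}
```

In a real process the handler would be registered on an existing admin or health-check HTTP server and protected appropriately.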
Currently crane reads time series from the datasource, and the algorithms treat all time series as plain values without regard to their unit. However, higher-level components do care about the unit.
For example, the metric adapter needs to know the unit of the memory and CPU time series to compute values for HPA metrics.
@mfanjie, I can't find a contributor guide in the readme file. Should we add one? If that is OK, please assign this issue to me. I will try to add the document.
The current NodeResourceController calculates the Kubernetes node's ext resource based only on TSP's prediction data and does not update it when TSP has no data. However, in some cases TSP will not be able to calculate the data, and TSP is not sensitive to bursts, so we need to merge real-time data from the Kubernetes nodes to assist NodeResourceController in calculating the ext resource.
If we want to merge real-time data to assist the calculation, we first have to put them together: the logic of nodeResourceController is implemented in crane-agent.
Crane-agent uses timeSeriesPredictionInformer to watch for changes in TSP and notify NodeResourceManager. NodeResourceManager collects data from other collectors (including collectors of real-time data), merges it with TSP's data, and takes the maximum value of the merged data to calculate the ext resource.
To guard against an abnormal TSP controller leaving TSP without updates for a long time, NodeResourceManager's real-time data collector periodically notifies NodeResourceManager. NodeResourceManager then collects and merges the data of the other collectors (including the TSP collector) and takes the maximum value of the merged data to calculate the ext resource.
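The merge-and-take-maximum step described above can be sketched as below. The text does not say how the ext resource is derived from the maximum; the sketch assumes it is node capacity minus peak usage, floored at zero. The names `maxUsage` and `extResource` are hypothetical.

```go
package main

import "fmt"

// maxUsage merges usage samples from several collectors (for example TSP
// predictions and real-time samples) and returns the largest value. The
// second result is false when no collector supplied any sample.
func maxUsage(sources ...[]float64) (float64, bool) {
	found := false
	var peak float64
	for _, samples := range sources {
		for _, v := range samples {
			if !found || v > peak {
				peak, found = v, true
			}
		}
	}
	return peak, found
}

// extResource is an ASSUMED formula: capacity minus peak usage, never
// negative, and zero when there is no data at all.
func extResource(capacity float64, sources ...[]float64) float64 {
	peak, ok := maxUsage(sources...)
	if !ok {
		return 0
	}
	if ext := capacity - peak; ext > 0 {
		return ext
	}
	return 0
}

func main() {
	tsp := []float64{2.0, 3.5}  // predicted CPU usage in cores
	realtime := []float64{5.0} // a burst seen by the real-time collector
	fmt.Println(extResource(8, tsp, realtime)) // prints 3
}
```

Because the maximum is taken over all sources, a real-time burst lowers the ext resource even when the TSP prediction has not caught up, and real-time data alone is enough when TSP has no data.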
Describe the bug
Currently, when a user deletes a TSP or recommendation, the prediction core still keeps the query registered and computes in the background. We need to release it during deletion of the related CRD.
Reproduce steps
Expected behavior
Have the ability to let controller unregister their query.
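A minimal sketch of an unregister capability, assuming several objects may share the same query, so registrations are reference counted and the background work stops only when the last user goes away. The `queryRegistry` type and its methods are hypothetical and do not reflect the prediction core's actual interface.

```go
package main

import (
	"fmt"
	"sync"
)

// queryRegistry tracks how many callers depend on each query expression.
type queryRegistry struct {
	mu   sync.Mutex
	refs map[string]int
}

func newQueryRegistry() *queryRegistry {
	return &queryRegistry{refs: make(map[string]int)}
}

// Register records one more user of the query.
func (r *queryRegistry) Register(query string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[query]++
}

// Unregister drops one user of the query. It returns true when the last user
// is gone and the query was removed, which is the point at which background
// computation for it could be stopped.
func (r *queryRegistry) Unregister(query string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	n, ok := r.refs[query]
	if !ok {
		return false
	}
	if n <= 1 {
		delete(r.refs, query)
		return true
	}
	r.refs[query] = n - 1
	return false
}

// Active returns the number of queries that still have at least one user.
func (r *queryRegistry) Active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.refs)
}

func main() {
	reg := newQueryRegistry()
	reg.Register("node_cpu_usage") // registered by a TSP
	reg.Register("node_cpu_usage") // registered by a recommendation
	fmt.Println(reg.Unregister("node_cpu_usage"), reg.Active()) // false 1
	fmt.Println(reg.Unregister("node_cpu_usage"), reg.Active()) // true 0
}
```

A controller could call Unregister from its deletion path, for example when handling a finalizer on the deleted object.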
Screenshots
Environment (please complete the following information):
Currently there is no checkpoint for each time series in the percentile algorithm. We can define a behavior in the EVPA CRD to support restoring the algorithm model from Prometheus history data or from a checkpoint store.
Describe the bug
1. Use nodeName + "_" + string(uuid.NewUUID()) as the node name in podList, etc.
2. NewNodeLocal collectors do not always start.
Reproduce steps
Expected behavior
Screenshots
Environment (please complete the following information):
Rename the service for craned. The current service name is webhook-service; we need to change it to craned to support further requirements when using the craned service.