gocrane / crane

Crane is a FinOps platform for cloud resource analytics and economics in Kubernetes clusters. The goal is not only to help users manage cloud costs more easily but also to ensure the quality of applications.

Home Page: https://gocrane.io

License: Apache License 2.0

Dockerfile 0.09% Makefile 0.48% Go 80.23% Shell 0.37% JavaScript 0.36% HTML 0.48% TypeScript 16.79% Less 1.13% SCSS 0.07%

analytics autoscaling cloud-computing cloud-native cncf cost-control cost-estimation cost-optimization finops kubernetes monitoring prediction time-series

crane's People

Contributors

kitianfresh yan234280533 qmhu mfanjie smileusd leeweir 8710925 guanyuding joker-bai kongjhstudy boblis qzoscar yunwang metahone sawyer523 shi1123 talen0702 wcpddd qingtian666 andy-github-info huzm dyiting fly-open-devops edvcc jeromeji bobcatsii jinze1107 xiaozefeng liufeichen ceidion vk1602-l wenlxie wxytjustb jacky68147527 yekangming gaominghao startstorm zhixunjie shiqiuzhang fhjiangzhe winkyi maomao5987370 desperadochn curry1998 gyliu513 zhangyi2022 evanli18 wu526 julian-chu random-zhu kevin-wei-sudo weida fastopen wanziyu honey-yogurt luyanfei wl4g-k8s sparkler-deng codeprh xushaohui alierkilic ghongli sammyluck cxbiao erick-shi chenhong231 hjxhjh star11sky sunqieng yufeiyu simonxu666j jingchenyanan achilleslil jackchen1602 djlalala chenkaiyue szy441687879 shijieqin attlee-wang belyenochi hzux rtcfoundation codingismyall hualin-bj wangqiongkaka zhengguo-jari loverhythm1990 panbuhei patricklai7528 autumn0207 whitebear009 zvier www6v qiezidong tobrainto banna2019 aland-zhang googege oceanchen2012 handong890

crane's Issues

When a time series is almost constant, the DSP algorithm should treat it as a periodic series.

The percentile algorithm should support a real-time data provider as the model update data source

Describe the feature

There is currently no real-time data provider. Implement one backed by metrics-server; this can reduce Prometheus traffic when the cluster is large.

Provide a graphical web page that plots the historical, predicted, and real-time time series in one chart. Initially it is for testing purposes only.

Refine the algorithm and predictor interface, refine the state management including config and model data

Node QoS ensurance should support memory, disk I/O, and network I/O

Describe the feature

Node QoS ensurance should support memory, disk I/O, and network I/O.

Propagate labels and annotations from EHPA to HPA

Describe the feature

When the HPA is created internally, labels and annotations need to be propagated to it from the source EHPA.
It would be better to make this configurable through command-line arguments such as "--ehpa-propagation-label-prefix" and "--ehpa-propagation-annotation-prefix".
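A minimal sketch of the prefix-based propagation described above. The helper name `filterByPrefixes` and the example label keys are hypothetical; only the two flag names come from the issue, and how Crane actually wires them is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// filterByPrefixes returns a copy of src that keeps only the keys starting
// with one of the given prefixes. An empty prefix list propagates nothing.
// (Hypothetical helper, not part of Crane.)
func filterByPrefixes(src map[string]string, prefixes []string) map[string]string {
	out := make(map[string]string)
	for k, v := range src {
		for _, p := range prefixes {
			if p != "" && strings.HasPrefix(k, p) {
				out[k] = v
				break
			}
		}
	}
	return out
}

func main() {
	ehpaLabels := map[string]string{
		"team.example.com/owner": "payments",
		"app":                    "checkout",
	}
	// Assumed value passed via --ehpa-propagation-label-prefix.
	hpaLabels := filterByPrefixes(ehpaLabels, []string{"team.example.com/"})
	fmt.Println(hpaLabels) // only the prefixed key is copied to the HPA
}
```

The same helper could be applied to annotations with the value of "--ehpa-propagation-annotation-prefix".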

Add an introduction video for QoS ensurance

Describe the feature

Add an introduction video for QoS ensurance.

Put the recommendation result into the target's annotations

Describe the feature

Currently the recommendation result is only available in recommendation.status; we could also write it into the target's annotations.
This feature should be optional and controlled from the recommendation's spec.

Ensurance cpu throttle support

Code Standard for crane-agent

Code Standard For TimeSeriesPrediction

Default PromQL query syntax should not use regular expressions

Describe the bug
NodeCpuUsagePromQLFmtStr = sum(count(node_cpu_seconds_total{mode="idle",instance="%s"}) by (mode, cpu)) - sum(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[%s]))
NodeMemUsagePromQLFmtStr = sum(node_memory_MemTotal_bytes{instance="%s"} - node_memory_MemAvailable_bytes{instance="%s"})
The two default queries above match the instance with a regular expression, which can make the query return results that do not match expectations.
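To illustrate why a regular-expression matcher can return unexpected series, here is a small sketch. It assumes the default queries use a matcher of the form `instance=~"%s.*"` (the form that appears in another issue on this page) and relies on the fact that Prometheus regex matchers are fully anchored. The node names are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesInstance mimics a fully anchored regex label matcher of the form
// instance=~"<node>.*". The node name is substituted unescaped, as a plain
// format string would do.
func matchesInstance(node, instance string) bool {
	re := regexp.MustCompile("^(?:" + node + ".*)$")
	return re.MatchString(instance)
}

func main() {
	for _, inst := range []string{"node-1:9100", "node-10:9100", "node-2:9100"} {
		fmt.Printf("%-14s regex=%v exact=%v\n",
			inst, matchesInstance("node-1", inst), inst == "node-1:9100")
	}
}
```

With the regex form, a query for node-1 also picks up node-10, so sums over the matched series mix two nodes; an exact match does not.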

Documentation Enhancement

We need to refine the current README and give it an organized structure.

Add metrics for crane-agent

Describe the feature

Add metrics for crane-agent.

Recommendation for EHPA

Support cron for EHPA

Describe the feature

EHPA currently has no cron-based way to scale. Support it through an external metric.

Modify the names of the node QoS ensurance examples

What version of Crane?
release 0.1.0

Describe the bug
The names used in the node QoS ensurance examples are not suitable.

To Reproduce
not related

Expected behavior
Rename the examples with more suitable names.

Screenshots
not related

Environment (please complete the following information):

K8S Version: ALL
Crane Version: ALL
Browser ALL

Code standard for data providers

Add crane server backend for crane unified UI

Code Standard for autoscaling

The Kubernetes node CPU usage should subtract the CPU usage of services that use EXT resources

Describe the feature

Ext-resource services (services that use EXT resources) exist to fill the idle resources of the Kubernetes node. If the CPU they use is counted as part of the node's CPU usage, nodeResourceController double-counts it when updating the node's ext resources, because the ext resources requested by those services have already been counted as allocated by the kubelet.
Crane-agent should expose the CPU usage of ext-resource services as a metric, such as node_ext_cpu_usage_seconds_total.
NodeCpuUsagePromQLFmtStr: sum(count(node_cpu_seconds_total{mode="idle",instance=~"%s.*"}) by (mode, cpu)) - sum(irate(node_cpu_seconds_total{mode="idle",instance=~"%s.*"}[%s])) - (sum(irate(node_ext_cpu_usage_seconds_total{node="%s"}[%s])) or vector(0))

Unit tests for crane

We need unit tests for many functions, and many of them are isolated enough to be picked up by newcomers.
For example, match() in pkg/controller/analytics/analytics_controller.go. Help needed.

Configurable percent for idle resource reallocation

Describe the feature

The current Node Resource Controller updates the Kubernetes node's ext resource with the predicted idle resource. All idle resources are reallocated as ext resources that lower-priority pods, especially offline jobs, can use. However, this can exhaust the node's resources, so some of those pods may be evicted during execution.
The request is to make the percentage of idle resources that is reallocated configurable, e.g. 4 CPU cores are idle, but only 2 cores are reallocated.
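The request above amounts to scaling the idle amount by a configured percentage. A minimal sketch, assuming the value is expressed in millicores and the percentage is an integer option (both the function name and the units are assumptions, not Crane's actual configuration):

```go
package main

import "fmt"

// reclaimableMilliCPU returns how many idle millicores to expose as ext
// resource. Negative idle values are treated as 0 and percent is clamped
// to the range [0, 100].
func reclaimableMilliCPU(idleMilli, percent int64) int64 {
	if idleMilli < 0 {
		idleMilli = 0
	}
	if percent < 0 {
		percent = 0
	}
	if percent > 100 {
		percent = 100
	}
	return idleMilli * percent / 100
}

func main() {
	// 4 idle cores with a 50% setting: only 2 cores are reallocated.
	fmt.Println(reclaimableMilliCPU(4000, 50))
}
```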

Add documents for analytics and recommendation

Describe the feature

Controller reconcile should ignore not-found errors

Describe the bug
Some controllers' reconcile loops do not ignore not-found errors.
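The usual pattern is to treat "object not found" as a successful no-op, since the object was deleted between the event and the reconcile. The sketch below uses only the standard library, with `errNotFound` as a hypothetical stand-in for the Kubernetes API not-found error; in a controller-runtime based controller the equivalent helper is `client.IgnoreNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes "not found" error.
var errNotFound = errors.New("not found")

// ignoreNotFound returns nil for not-found errors and passes others through.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// reconcile fetches the object via get. A deleted object is not an error,
// so the request is not requeued in that case.
func reconcile(get func() error) error {
	if err := get(); err != nil {
		return ignoreNotFound(err)
	}
	// ... normal reconcile logic would go here ...
	return nil
}

func main() {
	fmt.Println(reconcile(func() error { return fmt.Errorf("get tsp: %w", errNotFound) }))
	fmt.Println(reconcile(func() error { return errors.New("connection refused") }))
}
```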

Hello world samples for Crane document

Dynamically change the klog level through an HTTP request

Describe the feature

Currently, to see more detailed debug output from Crane, we must restart it with a new log level. In a production environment, it would be better to change the log level dynamically for debugging.
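One possible shape for such an endpoint, sketched with the standard library only. The path `/debug/loglevel`, the `v` query parameter, and the `verbosity` variable are assumptions for illustration; wiring the value into klog itself is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync/atomic"
)

// verbosity holds the current log level; it stands in for klog's -v setting.
var verbosity int32

// setVerbosity parses and stores a new level, rejecting values outside 0..10.
func setVerbosity(raw string) error {
	v, err := strconv.Atoi(raw)
	if err != nil || v < 0 || v > 10 {
		return fmt.Errorf("invalid log level %q", raw)
	}
	atomic.StoreInt32(&verbosity, int32(v))
	return nil
}

// enabled reports whether messages at the given level should be logged.
func enabled(level int32) bool { return atomic.LoadInt32(&verbosity) >= level }

// logLevelHandler handles e.g. PUT /debug/loglevel?v=4 (hypothetical endpoint).
func logLevelHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "use PUT", http.StatusMethodNotAllowed)
		return
	}
	if err := setVerbosity(r.URL.Query().Get("v")); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "log level set to %d\n", atomic.LoadInt32(&verbosity))
}

func main() {
	// Exercise the handler in-process instead of binding a port.
	req := httptest.NewRequest(http.MethodPut, "/debug/loglevel?v=4", nil)
	rec := httptest.NewRecorder()
	logLevelHandler(rec, req)
	fmt.Print(rec.Body.String())
	fmt.Println("level 4 enabled:", enabled(4))
}
```

In a real deployment the handler would be registered on an internal-only listener, since changing verbosity is an administrative action.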

Deploying Documentation

Implement a way for the system to know the unit of time series metrics fetched from the data source

Describe the feature

Currently Crane reads time series from the data source and the algorithms treat all values alike, without regard to their unit. Higher-level components, however, do care about the unit.
For example, the metric adapter needs to know the unit of the memory and CPU time series in order to compute values for HPA metrics.

Support external metric provider for metric-adapter

Document for contributor guide

Describe the feature

@mfanjie, I can't find a contributor guide in the README file. Should we add one? If so, please assign this issue to me and I will try to add the document.

thanks.

Code standard for prediction

Can we design a more general thing replacing podqosensurancepolicies & nodeqosensurancepolicies?

Originally posted by @yufeiyu in #5 (comment)

NodeResourceController should merge real-time data from nodes to compute ext resources

Describe the feature

The current NodeResourceController calculates the Kubernetes node's ext resource from TSP's prediction data only, and does not update it when TSP has no data. In some cases TSP cannot compute the data, and TSP is not sensitive to bursts, so we need to merge real-time data from the Kubernetes nodes to help NodeResourceController calculate the ext resource.

To merge real-time data into the calculation, we first have to put the two together, so the logic of nodeResourceController is implemented in crane-agent.
Crane-agent uses timeSeriesPredictionInformer to detect changes in TSP and notify NodeResourceManager. NodeResourceManager collects data from the other collectors (including the real-time data collectors), merges it with TSP's data, and takes the maximum value of the merged data to calculate the ext resource.
To handle the case where the TSP controller misbehaves and TSP is not updated for a long time, NodeResourceManager's real-time data collector also notifies NodeResourceManager periodically. NodeResourceManager then collects and merges the data from the other collectors (including the TSP collector) and again takes the maximum value of the merged data to calculate the ext resource.
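The merge step described above can be sketched as taking the maximum over all collectors' samples. The collector names and the `maxUsage` helper are hypothetical; the sketch only shows the "merge and take the maximum" idea, not how the ext resource is then derived from it.

```go
package main

import "fmt"

// maxUsage returns the largest usage sample across all collectors.
// ok is false when no collector reported any data, so the caller can
// skip the update instead of writing a bogus value.
func maxUsage(samples map[string][]float64) (peak float64, ok bool) {
	for _, vals := range samples {
		for _, v := range vals {
			if !ok || v > peak {
				peak, ok = v, true
			}
		}
	}
	return peak, ok
}

func main() {
	merged := map[string][]float64{
		"tsp":      {1.2, 1.5, 1.4}, // predicted CPU cores in use
		"realtime": {2.1},           // latest observed CPU cores in use
	}
	peak, ok := maxUsage(merged)
	fmt.Println(peak, ok) // the burst seen by the real-time collector wins
}
```

Taking the maximum is the conservative choice: a burst that the prediction missed still reduces the amount offered as ext resource.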

Refactor crane agent

Describe the feature

The main crane agent framework has been merged and some areas were found that can be improved. This ticket tracks the refactoring effort.

Add feature gate for analysis

Add validation for NodeQOSEnsurancePolicy and AvoidanceAction

Describe the feature

Add validation for NodeQOSEnsurancePolicy and AvoidanceAction.

Add util functions for updateNodeCondition, updateNodeTaint, and so on

Util functions should be added for updateNodeCondition and updateNodeTaint.

The file path:
pkg/ensurance/informer/node.go

Support data source with influxDB protocol

How to use

Describe the feature

Do you have usage instructions?

Should the kubebuilder framework be used instead of the informer factory?

If the kubebuilder framework is used instead of the informer factory, this component can stay consistent with the other controllers.

Unregister a query when the related CRD is being deleted

Describe the bug
Currently, when a user deletes a TSP or recommendation, the prediction core keeps the query registered and continues computing it in the background. We need to release it during the deletion of the related CRD.

Expected behavior

Controllers should be able to unregister their queries.

Use string as recommended value

Support checkpoint for percentile algorithm predictor

Describe the feature

Currently there is no checkpoint for each time series in the percentile algorithm. We could describe a behavior in the evpa CRD that supports restoring the algorithm model from Prometheus history data or from a checkpoint store.

If a recommendation CR is removed, stop the prediction process and release its prediction resources

Describe the feature

If a recommendation CR is removed, the recommendation controller should invoke 'DeleteQuery' to stop the prediction and release its resources.
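A rough sketch of what releasing a query on deletion could look like. It assumes several objects may share the same query, so the query is dropped only when its last owner is removed; the `queryRegistry` type, its methods, and the owner keys are hypothetical and do not reflect Crane's actual predictor interface (only the name 'DeleteQuery' comes from the issue).

```go
package main

import (
	"fmt"
	"sync"
)

// queryRegistry tracks which queries are computed in the background and
// which objects (owners) still need them.
type queryRegistry struct {
	mu     sync.Mutex
	owners map[string]map[string]struct{} // query -> set of owner keys
}

func newQueryRegistry() *queryRegistry {
	return &queryRegistry{owners: make(map[string]map[string]struct{})}
}

// register records that owner needs query.
func (r *queryRegistry) register(query, owner string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.owners[query] == nil {
		r.owners[query] = make(map[string]struct{})
	}
	r.owners[query][owner] = struct{}{}
}

// deleteQuery removes owner from query and reports whether the query was
// dropped entirely because no owner is left.
func (r *queryRegistry) deleteQuery(query, owner string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	set, ok := r.owners[query]
	if !ok {
		return false
	}
	delete(set, owner)
	if len(set) > 0 {
		return false
	}
	delete(r.owners, query)
	return true
}

// active returns the number of queries still being computed.
func (r *queryRegistry) active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.owners)
}

func main() {
	reg := newQueryRegistry()
	q := `sum(rate(container_cpu_usage_seconds_total[5m]))`
	reg.register(q, "recommendation/default/r1")
	reg.register(q, "tsp/default/t1")
	fmt.Println(reg.deleteQuery(q, "recommendation/default/r1"), reg.active())
	fmt.Println(reg.deleteQuery(q, "tsp/default/t1"), reg.active())
}
```

A controller would call the delete path from its deletion handling (for example, from a finalizer) so the background computation stops once the CR is gone.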

Show the throttle status for throttled pods

Describe the feature

When pods are throttled, we should surface a message to users explicitly.

This could be done with an event on the pod, a condition on the nep, and so on.

crane-agent does not work normally when using examples/ensurance

Describe the bug
1. nodeName + "_" + string(uuid.NewUUID()) is used as the node name in podList etc.
2. The NewNodeLocal collectors do not always start.

Support longer-term Prometheus queries

Describe the bug
Prometheus only supports 11000 data points per query, so Crane cannot query over a long period of time, e.g. 14 days.

Expected behavior
The Prometheus provider should support queries that exceed the limit of 11000 points.
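One way to stay under the limit is to split a long range into consecutive windows, query each window separately, and concatenate the results. A minimal sketch of the splitting step, assuming a fixed step and the 11000-point limit quoted above (the `window` type and `splitRange` function are illustrative, not Crane's provider code):

```go
package main

import (
	"fmt"
	"time"
)

// maxPoints is the per-query point limit quoted in the issue.
const maxPoints = 11000

// window is one sub-range to send as a separate range query.
type window struct{ Start, End time.Time }

// splitRange splits [start, end] into consecutive windows that each hold at
// most maxPoints samples at the given step. Windows do not overlap: each one
// starts one step after the previous window's end.
func splitRange(start, end time.Time, step time.Duration) []window {
	if step <= 0 || end.Before(start) {
		return nil
	}
	span := step * time.Duration(maxPoints-1) // maxPoints samples per window
	var out []window
	for s := start; !s.After(end); s = s.Add(span + step) {
		e := s.Add(span)
		if e.After(end) {
			e = end
		}
		out = append(out, window{Start: s, End: e})
	}
	return out
}

func main() {
	end := time.Date(2022, 1, 15, 0, 0, 0, 0, time.UTC)
	start := end.Add(-14 * 24 * time.Hour)
	ws := splitRange(start, end, 15*time.Second)
	fmt.Println(len(ws), "windows")
	for _, w := range ws {
		fmt.Println(w.Start.Format(time.RFC3339), "->", w.End.Format(time.RFC3339))
	}
}
```

For 14 days at a 15-second step this yields 8 windows; the results of the individual queries would then be appended in order.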

Add documentation for QoS ensurance

Rename service for craned

Describe the feature

Rename the service for craned. The current service name is webhook-service; we need to change it to craned to support further requirements that use the craned service.
