
crane-scheduler's People

Contributors

garrybest, mfanjie, qmhu, wuxs, xieydd, yuleichun-striving, yuzhiquan, zsnmwy


crane-scheduler's Issues

crane-scheduler API question

Hi, does crane-scheduler provide an API or SDK that third parties can call? I'd like to invoke some crane-scheduler functionality from my own code.

first issue !!

Hello, I deployed crane-scheduler using the Helm chart. My Prometheus service address is as below:

but there are error logs in the controller pod:

Post "192.168.15.25/api/v1/query": unsupported protocol scheme ""

Can someone explain why this happens? Thanks.
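The unsupported protocol scheme "" error usually means the Prometheus address was configured without a URL scheme. A sketch of the controller argument with an explicit http:// prefix (the port below is an assumption based on the default Prometheus port, not taken from this report):

  command:
  - /controller
  - --policy-config-path=/data/policy.yaml
  - --prometheus-address=http://192.168.15.25:9090   # scheme is required; port is an assumption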

Problem with replacing the default scheduler

[root@zcsmaster1 manifests]# cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - /scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=192.168.40.180
    - --config=/etc/kubernetes/kube-scheduler/scheduler-config.yaml
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: docker.io/gocrane/crane-scheduler:0.0.20
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 192.168.40.180
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 12
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 192.168.40.180
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - name: scheduler-config
      mountPath: /etc/kubernetes/kube-scheduler
      readOnly: true
    - name: dynamic-scheduler-policy
      mountPath: /etc/kubernetes
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - name: scheduler-config
    configMap:
      name: scheduler-config
  - name: dynamic-scheduler-policy
    configMap:
      name: dynamic-scheduler-policy
status: {}

[root@zcsmaster1 manifests]#
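Note that the command list above contains both kube-scheduler and /scheduler. In the gocrane/crane-scheduler image the scheduler binary is /scheduler (as the other manifests in these issues use), so a sketch of the intended command list, keeping this report's flags, would be:

  - command:
    - /scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=192.168.40.180
    - --config=/etc/kubernetes/kube-scheduler/scheduler-config.yaml
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true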

Hi, replacing the default scheduler this way keeps failing. Could you publish more detailed documentation?
Events:
Type Reason Age From Message


Normal Scheduled 93s default-scheduler Successfully assigned kube-system/kube-scheduler to zcsnode2
Normal Pulled 36s (x4 over 93s) kubelet Container image "docker.io/gocrane/crane-scheduler:0.0.20" already present on machine
Normal Created 36s (x4 over 93s) kubelet Created container kube-scheduler
Normal Started 36s (x4 over 93s) kubelet Started container kube-scheduler
Warning BackOff 3s (x10 over 91s) kubelet Back-off restarting failed container
[root@zcsmaster1 manifests]# kubectl describe pod kube-scheduler -n kube-system

Events:
Type Reason Age From Message


Normal Scheduled 53m default-scheduler Successfully assigned kube-system/crane-scheduler-controller-7845b4cbf7-dhrkm to zcsnode2
Normal Pulled 52m (x2 over 53m) kubelet Container image "docker.io/gocrane/crane-scheduler-controller:0.0.23" already present on machine
Normal Created 52m (x2 over 53m) kubelet Created container controller
Normal Started 52m (x2 over 53m) kubelet Started container controller
Normal Killing 52m kubelet Container controller failed liveness probe, will be restarted
Warning Unhealthy 51m (x5 over 53m) kubelet Liveness probe failed: Get "http://10.244.234.118:8090/healthz": dial tcp 10.244.234.118:8090: connect: connection refused
Warning BackOff 8m29s (x116 over 46m) kubelet Back-off restarting failed container
Warning Unhealthy 3m40s (x138 over 53m) kubelet Readiness probe failed: Get "http://10.244.234.118:8090/healthz": dial tcp 10.244.234.118:8090: connect: connection refused
[root@zcsmaster1 manifests]#

rbac does not have sufficient permissions

What happened?

When a pod uses a PVC with a WaitForFirstConsumer StorageClass, crane-scheduler does not have sufficient permissions to update the PVC's annotations. The scheduler needs permission to update PVCs.

What did you expect to happen?

Pod was successfully scheduled.

How can we reproduce it (as minimally and precisely as possible)?

Create a pod that uses a PVC, where the PVC uses a WaitForFirstConsumer StorageClass.
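A minimal sketch of the missing RBAC rule described above, to be added to the ClusterRole used by the scheduler (the verb list is an assumption based on this report, not the project's actual fix):

- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update", "patch"]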

Failed to replace the original scheduler

Kubernetes version: 1.21.10
Error output from kube-scheduler-master:

[root@master ~]# kubectl describe pod kube-scheduler-master -n kube-system
Name:                 kube-scheduler-master
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master/192.168.189.100
Start Time:           Sun, 12 Feb 2023 19:10:12 +0800
Labels:               component=kube-scheduler
                      tier=control-plane
Annotations:          kubernetes.io/config.hash: 456d6f68d333532ade0a5a2a7823efaf
                      kubernetes.io/config.mirror: 456d6f68d333532ade0a5a2a7823efaf
                      kubernetes.io/config.seen: 2023-03-02T16:52:39.104787713+08:00
                      kubernetes.io/config.source: file
Status:               Running
IP:                   192.168.189.100
IPs:
  IP:           192.168.189.100
Controlled By:  Node/master
Containers:
  kube-scheduler:
    Container ID:  docker://80e22d215ac0eccdce39322f85307f46c558d84d70346a12c89ad45150b440c7
    Image:         gocrane/crane-scheduler:0.0.23
    Image ID:      docker-pullable://gocrane/crane-scheduler@sha256:9ba6d11b20794b29d35661998e806b5711b36f49f5b57e8bd32af2ca8426c928
    Port:          <none>
    Host Port:     <none>
    Command:
      kube-scheduler
      --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
      --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
      --bind-address=127.0.0.1
      --kubeconfig=/etc/kubernetes/scheduler.conf
      --leader-elect=true
      --port=0
      --config=/etc/kubernetes/scheduler-config.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-scheduler": executable file not found in $PATH: unknown
      Exit Code:    127
      Started:      Thu, 02 Mar 2023 16:53:13 +0800
      Finished:     Thu, 02 Mar 2023 16:53:13 +0800
    Ready:          False
    Restart Count:  2
    Requests:
      cpu:        100m
    Liveness:     http-get https://127.0.0.1:10259/healthz delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:      http-get https://127.0.0.1:10259/healthz delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment:  <none>
    Mounts:
      /etc/kubernetes/scheduler-config.yaml from schedulerconfig (ro)
      /etc/kubernetes/scheduler.conf from kubeconfig (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kubeconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/scheduler.conf
    HostPathType:  FileOrCreate
  schedulerconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/scheduler-config.yaml
    HostPathType:  FileOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute op=Exists
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   17s (x3 over 39s)  kubelet  Container image "gocrane/crane-scheduler:0.0.23" already present on machine
  Normal   Created  17s (x3 over 39s)  kubelet  Created container kube-scheduler
  Warning  Failed   17s (x3 over 39s)  kubelet  Error: failed to start container "kube-scheduler": Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-scheduler": executable file not found in $PATH: unknown
  Warning  BackOff  1s (x6 over 38s)   kubelet  Back-off restarting failed container

The version cannot be modified; it reports that scheduler-config is not found, probably because kube-scheduler-master cannot be created:

[root@master ~]# KUBE_EDITOR="sed -i 's/v1beta2/v1beta1/g'" kubectl edit cm scheduler-config -n crane-system && KUBE_EDITOR="sed -i 's/0.0.23/0.0.20/g'" kubectl edit deploy crane-scheduler -n crane-system
Error from server (NotFound): configmaps "scheduler-config" not found

Content of the modified kube-scheduler.yaml (the image was pulled to the local machine in advance):

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    - --config=/etc/kubernetes/scheduler-config.yaml
    # image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.21.10
    image: gocrane/crane-scheduler:0.0.23
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-config.yaml
      name: schedulerconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-config.yaml
      type: FileOrCreate
    name: schedulerconfig
status: {}
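The ContainerCannotRun message above ('"kube-scheduler": executable file not found in $PATH') points at the first command entry: the gocrane/crane-scheduler image does not ship a kube-scheduler binary; its scheduler binary is /scheduler, as the other manifests in these issues use. A sketch of the corrected command, keeping all other flags as in the manifest above:

  containers:
  - command:
    - /scheduler                  # instead of kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    # remaining flags, probes, and volumeMounts unchanged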

Content of the modified scheduler-config.yaml:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: default-scheduler
    plugins:
      filter:
        enabled:
          - name: Dynamic
      score:
        enabled:
          - name: Dynamic
            weight: 3
    pluginConfig:
      - name: Dynamic
        args:
          policyConfigPath: /etc/kubernetes/policy.yaml
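Also note that the Dynamic plugin above reads policyConfigPath: /etc/kubernetes/policy.yaml, while the pod spec shown earlier only mounts scheduler.conf and scheduler-config.yaml. A sketch of the extra mount, assuming the policy file is also kept on the host under /etc/kubernetes (not the project's official manifest):

    volumeMounts:
    - mountPath: /etc/kubernetes/policy.yaml
      name: dynamic-policy        # hypothetical volume name
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/policy.yaml
      type: FileOrCreate
    name: dynamic-policy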

How should kube-scheduler.yaml be modified?

[root@zcsmaster1 manifests]# cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - /scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    # path of the KubeSchedulerConfiguration file inside the container
    - --kubeconfig=/etc/kubernetes/policy.yaml
    - --config=/etc/kubernetes/scheduler-config.yaml
    - --leader-elect=true
    image: docker.io/gocrane/crane-scheduler:0.0.20
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/
      type: Directory
    name: kubeconfig
status: {}

Where is the problem?

State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 18 Oct 2022 09:49:30 +0800
Finished: Tue, 18 Oct 2022 09:49:30 +0800
Ready: False
Restart Count: 0

crane-scheduler error logs

k8s: v1.21.5

E0625 08:39:09.924340 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:39:09.924391 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"
E0625 08:39:09.940324 1 scheduler.go:379] scheduler cache AssumePod failed: pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed
E0625 08:39:09.940364 1 factory.go:338] "Error scheduling pod; retrying" err="pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed" pod="dev-app/jz-digital-attendance-mobile-web-5f69dd6645-hpgqm"
W0625 08:39:23.173299 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
E0625 08:40:09.914104 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:40:09.914157 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:40:39.915075 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:40:39.915127 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"
E0625 08:40:39.927121 1 scheduler.go:379] scheduler cache AssumePod failed: pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed
E0625 08:40:39.927172 1 factory.go:338] "Error scheduling pod; retrying" err="pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed" pod="dev-app/jz-digital-attendance-mobile-web-5f69dd6645-hpgqm"
E0625 08:41:09.915082 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:41:09.915123 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:42:09.915837 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:42:09.915879 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"
E0625 08:42:09.925737 1 scheduler.go:379] scheduler cache AssumePod failed: pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed
E0625 08:42:09.925772 1 factory.go:338] "Error scheduling pod; retrying" err="pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed" pod="dev-app/jz-digital-attendance-mobile-web-5f69dd6645-hpgqm"
E0625 08:42:09.936894 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:42:09.936970 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:43:09.917671 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:43:09.917714 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"
E0625 08:43:39.918409 1 scheduler.go:379] scheduler cache AssumePod failed: pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed
E0625 08:43:39.918449 1 factory.go:338] "Error scheduling pod; retrying" err="pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed" pod="dev-app/jz-digital-attendance-mobile-web-5f69dd6645-hpgqm"
E0625 08:43:39.930036 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:43:39.930072 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:43:50.288516 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:43:50.302026 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:44:09.919255 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:44:09.919303 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"
E0625 08:44:39.920148 1 scheduler.go:379] scheduler cache AssumePod failed: pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed
E0625 08:44:39.920193 1 factory.go:338] "Error scheduling pod; retrying" err="pod 50450750-6476-4e89-8232-f3f756483a11 is in the cache, so can't be assumed" pod="dev-app/jz-digital-attendance-mobile-web-5f69dd6645-hpgqm"
E0625 08:45:09.920842 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed
E0625 08:45:09.920881 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0a64a369-5b40-41f3-b354-d056f79b5a81 is in the cache, so can't be assumed" pod="dev-app/xiaofang-auth-admin-web-64bbd7fd4-sct6d"
E0625 08:45:09.931887 1 scheduler.go:379] scheduler cache AssumePod failed: pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed
E0625 08:45:09.931959 1 factory.go:338] "Error scheduling pod; retrying" err="pod 0ad913e1-30bb-48e7-b563-78ee26bee313 is in the cache, so can't be assumed" pod="dev-app/base-v1-web-5f9b4fb6fc-wqbcl"

Node resource overcommit

If crane-scheduler is used without changing the pods' default scheduler, only adding the filter and score plugins, can a node run pods whose total requests exceed its allocatable resources (i.e. overcommit)?

Pods are not scheduled according to actual node load

Crane version: Helm chart scheduler-0.2.2
Kubernetes version: 1.24

Using one Kubernetes node with 16 cores and 32 GB of memory.
Node load annotations:

Annotations:        alpha.kubernetes.io/provided-node-ip: 172.30.64.34
                    cpu_usage_avg_5m: 0.63012,2023-10-17T15:04:32Z
                    cpu_usage_max_avg_1d: 0.63666,2023-10-17T14:03:36Z
                    cpu_usage_max_avg_1h: 0.63654,2023-10-17T15:01:29Z
                    mem_usage_avg_5m: 0.21519,2023-10-17T15:04:34Z
                    mem_usage_max_avg_1d: 0.21614,2023-10-17T14:02:41Z
                    mem_usage_max_avg_1h: 0.21700,2023-10-17T15:01:53Z
                    node.alpha.kubernetes.io/ttl: 0
                    node_hot_value: 0,2023-10-17T15:04:34Z

Node requests:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         15367m (96%)   1770m (11%)
  memory                      25943Mi (91%)  1500Mi (5%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Created a test workload with 8 pod replicas; each pod generates a stress load of 2 cores and 1 GB, with requests of 3 cores and 5 GiB.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-nginx
  namespace: demo
  labels:
    app: demo-nginx
spec:
  replicas: 8
  selector:
    matchLabels:
      app: demo-nginx
  template:
    metadata:
      labels:
        app: demo-nginx
    spec:
      schedulerName: crane-scheduler
      containers:
      - name: demo-nginx
        image: xxxxxx/stress:latest
        command: ["stress", "-c", "1","--vm", "1", "--vm-bytes", "1G", "--vm-keep"]
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 3
            memory: 5Gi

Only 5 pods are actually Running, while at least 6 should be able to run.

$  kgp -A -o wide|grep demo-nginx
demo               demo-nginx-69db9d45df-4j2rh                                  1/1     Running                  0                  3h20m   172.30.64.199   ip-172-30-64-34.ap-northeast-1.compute.internal    <none>           <none>
demo               demo-nginx-69db9d45df-4jc5h                                  0/1     Pending                  0                  3h20m   <none>          <none>                                             <none>           <none>
demo               demo-nginx-69db9d45df-6p4jz                                  0/1     Pending                  0                  3h20m   <none>          <none>                                             <none>           <none>
demo               demo-nginx-69db9d45df-7fdn2                                  1/1     Running                  0                  3h20m   172.30.64.111   ip-172-30-64-34.ap-northeast-1.compute.internal    <none>           <none>
demo               demo-nginx-69db9d45df-b75mz                                  1/1     Running                  0                  3h20m   172.30.64.78    ip-172-30-64-34.ap-northeast-1.compute.internal    <none>           <none>
demo               demo-nginx-69db9d45df-vsp6g                                  1/1     Running                  0                  3h20m   172.30.64.97    ip-172-30-64-34.ap-northeast-1.compute.internal    <none>           <none>
demo               demo-nginx-69db9d45df-xxrsb                                  1/1     Running                  0                  3h20m   172.30.64.10    ip-172-30-64-34.ap-northeast-1.compute.internal    <none>           <none>
demo               demo-nginx-69db9d45df-zgkjr                                  0/1     Pending                  0                  8m56s   <none>          <none>                                             <none>           <none>

Predicate configuration:

$ k get cm dynamic-scheduler-policy -n crane-system -o yaml 
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: scheduler.policy.crane.io/v1alpha1
    kind: DynamicSchedulerPolicy
    spec:
      syncPolicy:
        ##cpu usage
        - name: cpu_usage_avg_5m
          period: 3m
        - name: cpu_usage_max_avg_1h
          period: 15m
        - name: cpu_usage_max_avg_1d
          period: 3h
        ##memory usage
        - name: mem_usage_avg_5m
          period: 3m
        - name: mem_usage_max_avg_1h
          period: 15m
        - name: mem_usage_max_avg_1d
          period: 3h

      predicate:
        ##cpu usage
        - name: cpu_usage_avg_5m
          maxLimitPecent: 0.90
        - name: cpu_usage_max_avg_1h
          maxLimitPecent: 0.95
        ##memory usage
        - name: mem_usage_avg_5m
          maxLimitPecent: 0.90
        - name: mem_usage_max_avg_1h
          maxLimitPecent: 0.95

      priority:
        ###score = sum(() * weight) / len,  0 <= score <= 10
        ##cpu usage
        - name: cpu_usage_avg_5m
          weight: 0.2
        - name: cpu_usage_max_avg_1h
          weight: 0.3
        - name: cpu_usage_max_avg_1d
          weight: 0.5
        ##memory usage
        - name: mem_usage_avg_5m
          weight: 0.2
        - name: mem_usage_max_avg_1h
          weight: 0.3
        - name: mem_usage_max_avg_1d
          weight: 0.5

      hotValue:
        - timeRange: 5m
          count: 20
        - timeRange: 1m
          count: 10

crane-scheduler is supposed to schedule pods based on actual node load. The node's memory load is 0.21 and its CPU load is 0.63, and none of the predicate thresholds are triggered, yet only 5 pods are actually running. Based on the remaining resources of roughly 25 GB of memory ((1 - 0.21) * 32) and 5 CPU cores ((1 - 0.63) * 16), at least 6 pods should be able to run. Why?
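Spelling out the reporter's arithmetic from the annotations above (16-core / 32 GB node):

  free CPU    ≈ (1 - 0.63) * 16 ≈ 5.9 cores
  free memory ≈ (1 - 0.21) * 32 ≈ 25.3 GB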

binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding

Scheduling fails

I1121 03:06:27.345666 1 plugins.go:92] [crane] Node[dev-monitoring]'s finalscore is 69, while score is 69 and hotvalue is 0.000000
I1121 03:06:27.345752 1 plugins.go:92] [crane] Node[dev-qchen]'s finalscore is 81, while score is 81 and hotvalue is 0.000000
I1121 03:06:27.345751 1 plugins.go:92] [crane] Node[bqdev02]'s finalscore is 72, while score is 72 and hotvalue is 0.000000
I1121 03:06:27.345775 1 plugins.go:92] [crane] Node[bqdev01]'s finalscore is 85, while score is 85 and hotvalue is 0.000000
I1121 03:06:27.345780 1 plugins.go:92] [crane] Node[bqdev03]'s finalscore is 74, while score is 74 and hotvalue is 0.000000
I1121 03:06:27.345787 1 plugins.go:92] [crane] Node[dev-node4]'s finalscore is 67, while score is 67 and hotvalue is 0.000000
I1121 03:06:27.345797 1 plugins.go:92] [crane] Node[dev-xyli]'s finalscore is 67, while score is 67 and hotvalue is 0.000000
I1121 03:06:27.345790 1 plugins.go:92] [crane] Node[dev-master3]'s finalscore is 79, while score is 79 and hotvalue is 0.000000
I1121 03:06:27.345810 1 plugins.go:92] [crane] Node[dev-node5]'s finalscore is 83, while score is 83 and hotvalue is 0.000000
I1121 03:06:27.345821 1 plugins.go:92] [crane] Node[dev-whliao]'s finalscore is 73, while score is 73 and hotvalue is 0.000000
E1121 03:06:27.358217 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"cpu-stress-59f8597545-7bdrq\": pod cpu-stress-59f8597545-7bdrq is already assigned to node \"dev-node5\"" plugin="DefaultBinder" pod="crane-system/cpu-stress-59f8597545-7bdrq"
E1121 03:06:27.358235 1 scheduler.go:610] "scheduler cache ForgetPod failed" err="pod c2cae006-2ae2-4ca6-b2f6-6af43faaa972 wasn't assumed so cannot be forgotten"
E1121 03:06:27.358250 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"cpu-stress-59f8597545-7bdrq\": pod cpu-stress-59f8597545-7bdrq is already assigned to node \"dev-node5\"" pod="crane-system/cpu-stress-59f8597545-7bdrq"
I1121 03:06:27.358258 1 factory.go:238] "Pod has been assigned to node. Abort adding it back to queue." pod="crane-system/cpu-stress-59f8597545-7bdrq" node="dev-node5"

scheduler-controller cannot run properly

The controller keeps restarting and never stays Running. The logs show the following, although kubectl describe node shows that the annotations have been added successfully:
I0201 18:00:43.153543 1 node.go:75] Finished syncing node event "node-2/mem_usage_max_avg_1d" (20.320214ms)
I0201 18:00:43.175135 1 node.go:75] Finished syncing node event "master/mem_usage_max_avg_1d" (21.563645ms)
I0201 18:00:43.197964 1 node.go:75] Finished syncing node event "node-1/mem_usage_max_avg_1d" (22.784592ms)
I0201 18:00:53.119482 1 node.go:75] Finished syncing node event "node-2/cpu_usage_avg_5m" (2.02963ms)
W0201 18:00:53.119507 1 node.go:61] failed to sync this node ["node-2/cpu_usage_avg_5m"]: can not annotate node[node-2]: failed to get data cpu_usage_avg_5m{node-2=}:
I0201 18:00:53.120460 1 node.go:75] Finished syncing node event "master/cpu_usage_avg_5m" (939.612µs)
W0201 18:00:53.120483 1 node.go:61] failed to sync this node ["master/cpu_usage_avg_5m"]: can not annotate node[master]: failed to get data cpu_usage_avg_5m{master=}:

crane-scheduler: update the API

CSIStorageCapacity has been removed from storage.k8s.io/v1beta1; migrate manifests and API clients to use the storage.k8s.io/v1 API version, available since v1.24. All existing persisted objects remain accessible via the new API.
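On the manifest side this is just an apiVersion bump for any CSIStorageCapacity objects; a sketch with hypothetical object names:

apiVersion: storage.k8s.io/v1     # previously storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: example-capacity          # hypothetical
  namespace: default
storageClassName: example-sc      # hypothetical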

Self-hosted Prometheus: aggregated metrics cannot be retrieved

1. The crane-scheduler-controller logs show that none of the aggregated metrics can be retrieved:
W0626 20:55:02.198329 1 node.go:61] failed to sync this node ["k8s-node4/mem_usage_avg_5m"]: can not annotate node[k8s-node4]: failed to get data mem_usage_avg_5m{k8s-node4=}:
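The controller's "failed to get data <metric>" warnings typically mean the aggregated series (cpu_usage_avg_5m, mem_usage_avg_5m, ...) do not exist in the self-hosted Prometheus, i.e. the recording rules the controller queries were not installed. A minimal sketch of such rules, assuming node-exporter metrics and an instance label matching the node name; the exact expressions and units in crane's documentation may differ:

groups:
- name: crane-scheduler-node-metrics   # hypothetical group name
  rules:
  - record: cpu_usage_avg_5m
    # average CPU busy percentage over 5m per node
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  - record: mem_usage_avg_5m
    # average memory usage percentage over 5m per node
    expr: 100 * (1 - avg_over_time(node_memory_MemAvailable_bytes[5m]) / avg_over_time(node_memory_MemTotal_bytes[5m]))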

Replacing the k8s scheduler with crane-scheduler: new pod Pending

I replaced the k8s scheduler with crane-scheduler and then created a new pod. The new pod is always "Pending", with no related event info.

... ...
QoS Class: Burstable
Node-Selectors:
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:

I could only find some useful info in the logs, as follows:
I0614 02:18:42.492977 1 eventhandlers.go:118] "Add event for unscheduled pod" pod="kube-system/kubernetes-dashboard-jqhhq"

I wonder whether the new pod is ever popped from the SchedulingQueue, and how I can solve this problem.

Plugin scoring has no effect

As shown in the screenshot, the highest-scoring host is 1.250, but the pod was actually placed on the 0.21 machine. Why?

Installed crane-scheduler via Helm as a second scheduler; the test pod from the official example is not scheduled and stays in "Pending"

Installed crane-scheduler via Helm as a second scheduler; the test pod from the official example is not scheduled and stays in "Pending":
1. Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress
spec:
  selector:
    matchLabels:
      app: cpu-stress
  replicas: 1
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      schedulerName: crane-scheduler
      hostNetwork: true
      tolerations:
      - key: node.kubernetes.io/network-unavailable
        operator: Exists
        effect: NoSchedule
      containers:
      - name: stress
        image: docker.io/gocrane/stress:latest
        command: ["stress", "-c", "1"]
        resources:
          requests:
            memory: "1Gi"
            cpu: "1"
          limits:
            memory: "1Gi"
            cpu: "1"
2. Pod details:
Name: cpu-stress-cc8656b6c-b5hhz
Namespace: default
Priority: 0
Node:
Labels: app=cpu-stress
pod-template-hash=cc8656b6c
Annotations:
Status: Pending
IP:
IPs:
Controlled By: ReplicaSet/cpu-stress-cc8656b6c
Containers:
stress:
Image: docker.io/gocrane/stress:latest
Port:
Host Port:
Command:
stress
-c
1
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 1
memory: 1Gi
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9nwd5 (ro)
Volumes:
kube-api-access-9nwd5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors:
Tolerations: node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
3. crane-scheduler logs:
I0824 00:50:47.247851 1 serving.go:331] Generated self-signed cert in-memory
W0824 00:50:48.025758 1 options.go:330] Neither --kubeconfig nor --master was specified. Using default API client. This might not work.
W0824 00:50:48.073470 1 authorization.go:47] Authorization is disabled
W0824 00:50:48.073495 1 authentication.go:40] Authentication is disabled
I0824 00:50:48.073517 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0824 00:50:48.080823 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0824 00:50:48.080862 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0824 00:50:48.080915 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 00:50:48.080927 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 00:50:48.080957 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 00:50:48.080968 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 00:50:48.081199 1 secure_serving.go:197] Serving securely on [::]:10259
I0824 00:50:48.081270 1 tlsconfig.go:240] Starting DynamicServingCertificateController
W0824 00:50:48.091287 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 00:50:48.146624 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I0824 00:50:48.182865 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0824 00:50:48.183903 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 00:50:48.184059 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 00:50:48.284088 1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
W0824 00:57:30.128689 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 01:02:45.130884 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 01:08:48.133483 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 01:14:31.135801 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 01:20:24.138959 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0824 01:30:10.141873 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
4. crane-scheduler-controller logs:
I0824 08:46:16.647776 1 server.go:61] Starting Controller version v0.0.0-master+$Format:%H$
I0824 08:46:16.648237 1 leaderelection.go:248] attempting to acquire leader lease crane-system/crane-scheduler-controller...
I0824 08:46:16.706891 1 leaderelection.go:258] successfully acquired lease crane-system/crane-scheduler-controller
I0824 08:46:16.807546 1 controller.go:72] Caches are synced for controller
I0824 08:46:16.807631 1 node.go:46] Start to reconcile node events
I0824 08:46:16.807653 1 event.go:30] Start to reconcile EVENT events
I0824 08:46:16.885698 1 node.go:75] Finished syncing node event "node6/cpu_usage_avg_5m" (77.952416ms)
I0824 08:46:16.973162 1 node.go:75] Finished syncing node event "node4/cpu_usage_avg_5m" (87.371252ms)
I0824 08:46:17.045250 1 node.go:75] Finished syncing node event "master2/cpu_usage_avg_5m" (72.023298ms)
I0824 08:46:17.109260 1 node.go:75] Finished syncing node event "master3/cpu_usage_avg_5m" (63.673389ms)
I0824 08:46:17.192332 1 node.go:75] Finished syncing node event "node1/cpu_usage_avg_5m" (83.005155ms)
I0824 08:46:17.529495 1 node.go:75] Finished syncing node event "node2/cpu_usage_avg_5m" (337.099052ms)
I0824 08:46:17.927163 1 node.go:75] Finished syncing node event "node3/cpu_usage_avg_5m" (397.603044ms)
I0824 08:46:18.327978 1 node.go:75] Finished syncing node event "node5/cpu_usage_avg_5m" (400.749476ms)
I0824 08:46:18.746391 1 node.go:75] Finished syncing node event "master1/cpu_usage_avg_5m" (418.360885ms)
I0824 08:46:19.129081 1 node.go:75] Finished syncing node event "node6/cpu_usage_max_avg_1h" (382.635495ms)
I0824 08:46:19.524508 1 node.go:75] Finished syncing node event "node4/cpu_usage_max_avg_1h" (395.361539ms)
I0824 08:46:19.948035 1 node.go:75] Finished syncing node event "master2/cpu_usage_max_avg_1h" (423.453672ms)
I0824 08:46:20.332014 1 node.go:75] Finished syncing node event "master3/cpu_usage_max_avg_1h" (383.909395ms)
I0824 08:46:20.737296 1 node.go:75] Finished syncing node event "node1/cpu_usage_max_avg_1h" (405.102002ms)
I0824 08:46:21.245055 1 node.go:75] Finished syncing node event "node2/cpu_usage_max_avg_1h" (507.697871ms)
I0824 08:46:21.573490 1 node.go:75] Finished syncing node event "node3/cpu_usage_max_avg_1h" (328.368489ms)
I0824 08:46:21.937814 1 node.go:75] Finished syncing node event "node5/cpu_usage_max_avg_1h" (364.254837ms)
I0824 08:46:22.335988 1 node.go:75] Finished syncing node event "master1/cpu_usage_max_avg_1h" (397.952357ms)
I0824 08:46:22.724851 1 node.go:75] Finished syncing node event "master2/cpu_usage_max_avg_1d" (388.771915ms)
I0824 08:46:23.126059 1 node.go:75] Finished syncing node event "master3/cpu_usage_max_avg_1d" (401.156708ms)
I0824 08:46:23.528329 1 node.go:75] Finished syncing node event "node6/cpu_usage_max_avg_1d" (402.208827ms)
I0824 08:46:23.937560 1 node.go:75] Finished syncing node event "node4/cpu_usage_max_avg_1d" (409.165081ms)
I0824 08:46:24.331730 1 node.go:75] Finished syncing node event "node5/cpu_usage_max_avg_1d" (394.024206ms)
I0824 08:46:24.730137 1 node.go:75] Finished syncing node event "master1/cpu_usage_max_avg_1d" (398.33551ms)
I0824 08:46:25.127074 1 node.go:75] Finished syncing node event "node1/cpu_usage_max_avg_1d" (396.798913ms)
I0824 08:46:25.528844 1 node.go:75] Finished syncing node event "node2/cpu_usage_max_avg_1d" (401.701104ms)
I0824 08:46:25.932684 1 node.go:75] Finished syncing node event "node3/cpu_usage_max_avg_1d" (403.762529ms)
I0824 08:46:26.330458 1 node.go:75] Finished syncing node event "node4/mem_usage_avg_5m" (397.710372ms)
I0824 08:46:26.736576 1 node.go:75] Finished syncing node event "master2/mem_usage_avg_5m" (406.060927ms)

Enabling multiple replicas has no effect

crane-scheduler-controller version: 0.0.23
craned version: 0.5.0
Kubernetes version: 1.21.10
Docker version: 19.3.14
OS version: Ubuntu 20.04.3 LTS
Scheduler pod logs:
E0920 09:23:09.244915 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:23:09.244970 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:24:33.207711 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:24:33.207754 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:25:10.865375 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:25:10.865408 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:26:07.905010 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:26:07.905083 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:27:33.211667 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:27:33.211720 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:28:33.213730 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:28:33.213767 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:29:33.214463 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:29:33.214499 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"
E0920 09:30:33.215495 1 scheduler.go:379] scheduler cache AssumePod failed: pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed
E0920 09:30:33.215532 1 factory.go:338] "Error scheduling pod; retrying" err="pod 3ed0ea4b-407f-427e-a92d-1c1d2adbc55c is in the cache, so can't be assumed" pod="testpods-test/testpods-test-test-pods-65494bf66c-c8k6t"

Cannot pull the 0.0.20 Docker image

Hi, when I deploy version 0.0.20 it reports that the image cannot be found. Is the problem with my deployment process, or something else?
Kubernetes version: v1.21
Helm version: v3.3.3

Deployment steps:

  1. Deployed directly with Helm: failed. (screenshot)

  2. Cloned the project and deployed it successfully with kubectl.exe apply -f rbac.yaml; then, on the k8s server, changing the version with the command below failed:
    KUBE_EDITOR="sed -i 's/v1beta2/v1beta1/g'" kubectl edit cm scheduler-config -n crane-system && KUBE_EDITOR="sed -i 's/0.0.23/0.0.20/g'" kubectl edit deploy crane-scheduler -n crane-system
    (screenshots)

  3. Modified the YAML files directly and deployed; after deployment it reports that the image cannot be found.

Changed v1beta2 to v1beta1 in git\crane-scheduler\deploy\manifests\scheduler-config.yaml:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
......

Changed 0.0.23 to 0.0.20 in git\crane-scheduler\deploy\controller\deployment.yaml:

......
command:
- /controller
- --policy-config-path=/data/policy.yaml
- --prometheus-address=PROMETHEUS_ADDRESS
image: docker.io/gocrane/crane-scheduler-controller:0.0.20
imagePullPolicy: IfNotPresent
volumeMounts:
- mountPath: /data
name: dynamic-scheduler-policy
......

It reports that the image cannot be found. (screenshot)

Bug: Are there any risks if more than one `NewPodTopologyCache` is running in one scheduler app?

Func NewPodTopologyCache is responsible for building a common cache for the NodeResourceTopologyMatch plugin, and it is called in the plugin's New func as follows:

func New(args runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	...
	topologyMatch := &TopologyMatch{
                // here initializing the cache
		PodTopologyCache:       NewPodTopologyCache(ctx, 30*time.Minute),
		handle:                 handle,
		lister:                 lister,
		topologyAwareResources: sets.NewString(cfg.TopologyAwareResources...),
	}

	return topologyMatch, nil
}

Once the NodeResourceTopologyMatch plugin appears in multiple profiles of one scheduler app, the plugin is initialized multiple times, which means the New func above is triggered more than once.

The key point is that multiple PodTopologyCache instances then exist in one scheduler app. Are there any potential risks in this situation (e.g. data races)?

@Garrybest @qmhu PTAL, thanks

crane-scheduler-controller health checks fail: both readiness and liveness probes fail

Events:
Type Reason Age From Message


Normal Scheduled 5m27s default-scheduler Successfully assigned crane-system/crane-scheduler-controller-5c85f47c45-trmzp to 192.168.227.164
Normal Pulled 5m37s kubelet Container image "docker.io/gocrane/crane-scheduler-controller:0.0.23" already present on machine
Normal Created 5m37s kubelet Created container crane-scheduler-controller
Normal Started 5m36s kubelet Started container crane-scheduler-controller
Warning Unhealthy 32s (x31 over 5m32s) kubelet Readiness probe failed: Get "http://10.244.27.203:8090/healthz": dial tcp 10.244.27.203:8090: connect: connection refused

Both the readiness and liveness probes fail; the controller only starts normally after I comment them out. Please fix this issue.

Scheduler problems after a cluster restart (using the replaced default scheduler)

E1018 09:42:10.621700 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=mem_usage_max_avg_1h, float64=33.699000000000005)
E1018 09:42:10.621708 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=mem_usage_max_avg_1d, float64=33.699000000000005)
I1018 09:42:10.621717 1 plugins.go:92] [crane] Node[zcsnode2]'s finalscore is 6, while score is 16 and hotvalue is 1.000000
E1018 09:48:25.615198 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=cpu_usage_max_avg_1d, float64=45.77980000000001)
E1018 09:48:25.615301 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=mem_usage_max_avg_1d, float64=71.2381)
I1018 09:48:25.615339 1 plugins.go:92] [crane] Node[zcsmaster1]'s finalscore is 35, while score is 35 and hotvalue is 0.000000
E1018 09:48:25.615397 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=cpu_usage_max_avg_1d, float64=47.9795)
E1018 09:48:25.615417 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=mem_usage_max_avg_1d, float64=75.73570000000001)
I1018 09:48:25.615424 1 plugins.go:92] [crane] Node[zcsnode1]'s finalscore is 37, while score is 37 and hotvalue is 0.000000
E1018 09:48:25.615447 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=cpu_usage_max_avg_1d, float64=47.5513)
E1018 09:48:25.615461 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=mem_usage_max_avg_1d, float64=83.1259)
I1018 09:48:25.615468 1 plugins.go:92] [crane] Node[zcsnode2]'s finalscore is 41, while score is 41 and hotvalue is 0.000000
E1018 09:48:56.352200 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=cpu_usage_max_avg_1d, float64=45.77980000000001)
E1018 09:48:56.352275 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=mem_usage_max_avg_1d, float64=71.2381)
I1018 09:48:56.352287 1 plugins.go:92] [crane] Node[zcsmaster1]'s finalscore is 35, while score is 35 and hotvalue is 0.000000
E1018 09:48:56.352346 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=cpu_usage_max_avg_1d, float64=47.9795)
E1018 09:48:56.352368 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=mem_usage_max_avg_1d, float64=75.73570000000001)
I1018 09:48:56.352379 1 plugins.go:92] [crane] Node[zcsnode1]'s finalscore is 37, while score is 37 and hotvalue is 0.000000
E1018 09:48:56.352415 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=cpu_usage_max_avg_1d, float64=47.5513)
E1018 09:48:56.352455 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=mem_usage_max_avg_1d, float64=83.1259)
I1018 09:48:56.352466 1 plugins.go:92] [crane] Node[zcsnode2]'s finalscore is 41, while score is 41 and hotvalue is 0.000000
E1018 09:51:48.506156 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=cpu_usage_max_avg_1d, float64=45.854200000000006)
E1018 09:51:48.506282 1 stats.go:128] [crane] failed to get node 's score: zcsmaster1%!(EXTRA string=mem_usage_max_avg_1d, float64=71.34190000000001)
I1018 09:51:48.506296 1 plugins.go:92] [crane] Node[zcsmaster1]'s finalscore is 35, while score is 35 and hotvalue is 0.000000
E1018 09:51:48.506329 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=cpu_usage_max_avg_1d, float64=48.017900000000004)
E1018 09:51:48.506357 1 stats.go:128] [crane] failed to get node 's score: zcsnode1%!(EXTRA string=mem_usage_max_avg_1d, float64=75.80170000000001)
I1018 09:51:48.506364 1 plugins.go:92] [crane] Node[zcsnode1]'s finalscore is 37, while score is 37 and hotvalue is 0.000000
E1018 09:51:48.506390 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=cpu_usage_max_avg_1d, float64=47.545500000000004)
E1018 09:51:48.506408 1 stats.go:128] [crane] failed to get node 's score: zcsnode2%!(EXTRA string=mem_usage_max_avg_1d, float64=83.0675)
I1018 09:51:48.506416 1 plugins.go:92] [crane] Node[zcsnode2]'s finalscore is 41, while score is 41 and hotvalue is 0.000000

Helm template file syntax error

templates/scheduler-deployment.yaml in the Helm chart has a syntax error; the if block should be fixed as follows:

containers:
      - command:
        - /scheduler
        - --leader-elect=false
        - --config=/etc/kubernetes/kube-scheduler/scheduler-config.yaml
        {{- if ge .Capabilities.KubeVersion.Minor "22" }}
        image: "{{ .Values.scheduler.image.repository }}:0.0.23"
        {{- else }}
        image: "{{ .Values.scheduler.image.repository }}:0.0.20"
        {{- end }}

dynamic plugin score is abnormal

What happened?

applying score defaultWeights on Score plugins: plugin "Dynamic" returns an invalid score -8, it should in the range of [0, 100] after normalizing

What did you expect to happen?

pod can be scheduled successfully

How can we reproduce it (as minimally and precisely as possible)?

This problem occurs when the prometheus result times out, but the hotValue is normal.

crane-scheduler-controller fails to fetch Prometheus metrics

crane-scheduler-controller version: 0.0.23
craned version: 0.5.0
Kubernetes version: 1.21.10
Docker version: 19.3.14
OS version: Ubuntu 20.04.3 LTS

Calling the Prometheus API manually does return the corresponding metrics:
curl -g http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/query?query=cpu_usage_avg_5m
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"cpu_usage_avg_5m","instance":"ceph-01"},"value":[1656488784.456,"2.7104166665715894"]},{"metric":{"name":"cpu_usage_avg_5m","instance":"ceph-02"},"value":[1656488784.456,"1.9583333333351618"]},{"metric":{"name":"cpu_usage_avg_5m","instance":"ceph-03"},"value":[1656488784.456,"2.6000000000931323"]},{"metric":{"name":"cpu_usage_avg_5m","instance":"node-01"},"value":[1656488784.456,"4.0291666666841195"]},{"metric":{"name":"cpu_usage_avg_5m","instance":"node-04"},"value":[1656488784.456,"6.870833333426461"]},{"metric":{"name":"cpu_usage_avg_5m","instance":"ykj"},"value":[1656488784.456,"5.891666666672492"]}]}}/ #

curl -g http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/query?query=mem_usage_avg_5m
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"mem_usage_avg_5m","instance":"ceph-01","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-sn9lp"},"value":[1656488826.549,"32.75862684328356"]},{"metric":{"name":"mem_usage_avg_5m","instance":"ceph-02","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-dgd54"},"value":[1656488826.549,"15.044355868789062"]},{"metric":{"name":"mem_usage_avg_5m","instance":"ceph-03","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-td7k2"},"value":[1656488826.549,"34.21244570563606"]},{"metric":{"name":"mem_usage_avg_5m","instance":"node-01","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-zzxmd"},"value":[1656488826.549,"57.21168005976536"]},{"metric":{"name":"mem_usage_avg_5m","instance":"node-04","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-2zkgk"},"value":[1656488826.549,"72.4792896090607"]},{"metric":{"name":"mem_usage_avg_5m","instance":"ykj","job":"node-exporter","namespace":"monitoring","pod":"node-exporter-xfq4n"},"value":[1656488826./
