
cpu-manager-for-kubernetes's Introduction

DISCONTINUATION OF PROJECT.

This project will no longer be maintained by Intel.

This project has been identified as having known security escapes.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

CPU Manager for Kubernetes

Build Status

Overview

This project provides basic core affinity for NFV-style workloads on top of vanilla Kubernetes v1.5+.

This project ships a single multi-use command-line program to perform various functions for host configuration, managing groups of CPUs, and constraining workloads to specific CPUs.

Concepts

Term            Meaning
Pool            A named group of CPU lists. A pool can be either exclusive or shared. In an exclusive pool, only one task may be allocated to each CPU list simultaneously.
CPU list        A group of logical CPUs, identified by ID as reported by the operating system. CPU lists conform to the Linux cpuset CPU list format.
Task list       A list of Linux process IDs.
Isolation       Steps required to set up a process environment so that it runs only on a desired subset of the available CPUs.
Reconciliation  The process of resolving state between the CMK configuration directory and the Linux procfs.

Usage summary

Usage:
  cmk (-h | --help)
  cmk --version
  cmk cluster-init (--host-list=<list>|--all-hosts) [--cmk-cmd-list=<list>]
                   [--cmk-img=<img>] [--cmk-img-pol=<pol>] [--conf-dir=<dir>]
                   [--install-dir=<dir>] [--num-exclusive-cores=<num>]
                   [--num-shared-cores=<num>] [--pull-secret=<name>]
                   [--saname=<name>] [--shared-mode=<mode>]
                   [--exclusive-mode=<mode>] [--namespace=<name>]
  cmk init [--conf-dir=<dir>] [--num-exclusive-cores=<num>]
           [--num-shared-cores=<num>] [--socket-id=<num>]
           [--shared-mode=<mode>] [--exclusive-mode=<mode>]
  cmk discover [--conf-dir=<dir>]
  cmk describe [--conf-dir=<dir>]
  cmk reconcile [--conf-dir=<dir>] [--publish] [--interval=<seconds>]
  cmk isolate [--conf-dir=<dir>] [--socket-id=<num>] --pool=<pool> <command>
              [-- <args> ...][--no-affinity]
  cmk install [--install-dir=<dir>]
  cmk node-report [--conf-dir=<dir>] [--publish] [--interval=<seconds>]
  cmk uninstall [--install-dir=<dir>] [--conf-dir=<dir>] [--namespace=<name>]
  cmk webhook [--conf-file=<file>]

Options:
  -h --help                    Show this screen.
  --version                    Show version.
  --host-list=<list>           Comma separated list of Kubernetes nodes to
                               prepare for CMK software.
  --all-hosts                  Prepare all Kubernetes nodes for the CMK
                               software.
  --cmk-cmd-list=<list>        Comma separated list of CMK sub-commands to run
                               on each host
                               [default: init,reconcile,install,discover,nodereport].
  --cmk-img=<img>              CMK Docker image [default: cmk:v1.5.2].
  --cmk-img-pol=<pol>          Image pull policy for the CMK Docker image
                               [default: IfNotPresent].
  --conf-dir=<dir>             CMK configuration directory [default: /etc/cmk].
  --install-dir=<dir>          CMK install directory [default: /opt/bin].
  --interval=<seconds>         Number of seconds to wait between rerunning.
                               If set to 0, will only run once. [default: 0]
  --num-exclusive-cores=<num>  Number of cores in exclusive pool. [default: 4].
  --num-shared-cores=<num>     Number of cores in shared pool. [default: 1].
  --pool=<pool>                Pool name: either infra, shared or exclusive.
  --shared-mode=<mode>         Shared pool core allocation mode. Possible
                               modes: packed and spread [default: packed].
  --exclusive-mode=<mode>      Exclusive pool core allocation mode. Possible
                               modes: packed and spread [default: packed].
  --publish                    Whether to publish reports to the Kubernetes
                               API server.
  --pull-secret=<name>         Name of secret used for pulling Docker images
                               from restricted Docker registry.
  --saname=<name>              ServiceAccount name to pass
                               [default: cmk-serviceaccount].
  --socket-id=<num>            ID of the socket the allocated core should come
                               from. If set to -1, the child command may be
                               assigned to any socket [default: -1].
  --no-affinity                Do not set cpu affinity before forking the child
                               command. In this mode the user program is
                               responsible for reading the `CMK_CPUS_ASSIGNED`
                               environment variable and moving a subset of its
                               own processes and/or tasks to the assigned CPUs.
  --namespace=<name>           Set the namespace to deploy pods to during the
                               cluster-init deployment process.
                               [default: default].

For detailed usage information about each subcommand, see Using the cmk command-line tool.
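
As an illustration of the --no-affinity contract described above, here is a minimal sketch of a user program that reads CMK_CPUS_ASSIGNED and pins itself, assuming the variable holds a Linux cpuset-style list such as "2-3,8" (this is an illustration, not part of CMK):

import os

def parse_cpuset(spec):
    # Expand a cpuset list string like "2-3,8" into a set of CPU IDs.
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

assigned = parse_cpuset(os.environ["CMK_CPUS_ASSIGNED"])
os.sched_setaffinity(0, assigned)  # pin this process to the assigned CPUs
print("running on CPUs:", sorted(os.sched_getaffinity(0)))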

Further Reading

cpu-manager-for-kubernetes's People

Contributors

balajismaniam, connordoyle, damenus, flyingcougar, lilyabu, lmdaly, newton-j, nqn, patricia-cahill, pbrownlow7, przemeklal, rdower, rorysexton, squall0gd, timolindqvist, vladimare, williamcaban


cpu-manager-for-kubernetes's Issues

CMK crashes on Intel® Xeon® Processor E5-2690

Hi,

I am running CMK on this CPU "Intel® Xeon® Processor E5-2690":

https://ark.intel.com/content/www/us/en/ark/products/64596/intel-xeon-processor-e5-2690-20m-cache-2-90-ghz-8-00-gt-s-intel-qpi.html

but CMK crashes during the CMK "isolate" command. From dmesg I found the following errors:

[ 925.634828] traps: cmk[13355] trap invalid opcode ip:7f1233813197 sp:7ffc47a3ca40 error:0 in libpython3.8.so.1.0[7f12337ec000+1e6000]
[ 927.070427] traps: cmk[13593] trap invalid opcode ip:7f5bdc608197 sp:7ffdad41c180 error:0 in libpython3.8.so.1.0[7f5bdc5e1000+1e6000]
[ 940.874612] traps: cmk[14441] trap invalid opcode ip:7fbc065e8197 sp:7ffc455299a0 error:0 in libpython3.8.so.1.0[7fbc065c1000+1e6000]
[ 965.930742] traps: cmk[15930] trap invalid opcode ip:7fd982fea197 sp:7ffcceecbdc0 error:0 in libpython3.8.so.1.0[7fd982fc3000+1e6000]
[ 1017.874618] traps: cmk[18621] trap invalid opcode ip:7f2d34b37197 sp:7ffcde79b560 error:0 in libpython3.8.so.1.0[7f2d34b10000+1e6000]
[ 1107.876086] traps: cmk[24883] trap invalid opcode ip:7f6bdcf3e197 sp:7ffed4821120 error:0 in libpython3.8.so.1.0[7f6bdcf17000+1e6000]
[ 1274.864827] traps: cmk[33972] trap invalid opcode ip:7f34da8bf197 sp:7ffdf6b9a180 error:0 in libpython3.8.so.1.0[7f34da898000+1e6000]

Could you please help me fix this?

Thanks

cmk not working in Kubernetes init-container

After installing CMK, I want to deploy a pod with an init container as below:

initContainers:
- name: nwinit
  image: "nwinit:1.1.0"
  imagePullPolicy: IfNotPresent
  stdin: true
  tty: true

But after the pod and containers are running, we can't find the cmk binary in the bin folder in the init container; it seems the admission webhook does not add the needed volume mount to init containers.
The other container in the same pod can find the cmk binary in the bin folder.

So is this the expected behavior? Does CMK already support init containers?

Cannot build CMK Docker image

Hi

There's a problem with building the Docker image of CMK.
Logs:

$ make

docker build -t cmk:v1.4.0 .
Sending build context to Docker daemon  8.183MB
Step 1/8 : FROM python:3.4.6
3.4.6: Pulling from library/python
ad74af05f5a2: Pull complete 
2b032b8bbe8b: Pull complete 
a9a5b35f6ead: Pull complete 
3245b5a1c52c: Pull complete 
032924b710ba: Pull complete 
1c0e73a83cd6: Pull complete 
230cc1f59fea: Pull complete 
b21ee41b6021: Pull complete 
Digest: sha256:9c6c97ea31915fc82d4adeca1f9aa8cbad0ca113f4237d350ab726cf05485585
Status: Downloaded newer image for python:3.4.6
 ---> c6402576e1db
Step 2/8 : ADD requirements.txt /requirements.txt
 ---> b6c3a5ccf3cb
Step 3/8 : RUN pip install -r /requirements.txt
 ---> Running in 38a429396fdf
Collecting tox<3.0,>=2.5 (from -r /requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/1d/4e/20c679f8c5948f7c48591fde33d442e716af66a31a88f5791850a75041eb/tox-2.9.1-py2.py3-none-any.whl (73kB)
Collecting docopt<1.0,>=0.6 (from -r /requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/a2/55/8f8cab2afd404cf578136ef2cc5dfb50baa1761b68c9da1fb1e4eed343c9/docopt-0.6.2.tar.gz
Collecting psutil<5.5,>=5.0 (from -r /requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/e3/58/0eae6e4466e5abf779d7e2b71fac7fba5f59e00ea36ddb3ed690419ccb0f/psutil-5.4.8.tar.gz (422kB)
Collecting pyinstaller<4.0,>=3.2 (from -r /requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/e2/c9/0b44b2ea87ba36395483a672fddd07e6a9cb2b8d3c4a28d7ae76c7e7e1e5/PyInstaller-3.5.tar.gz (3.5MB)
Collecting kubernetes==10.0.0 (from -r /requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/2a/09/365f4ad63f71c698c76edb3e666852b87a751ee4b6d23222b09952557d17/kubernetes-10.0.0-py2.py3-none-any.whl (1.5MB)
Collecting requests<3.0,>=2.21 (from -r /requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl (57kB)
Collecting urllib3==1.24.2 (from -r /requirements.txt (line 7))
  Downloading https://files.pythonhosted.org/packages/df/1c/59cca3abf96f991f2ec3131a4ffe72ae3d9ea1f5894abe8a9c5e3c77cfee/urllib3-1.24.2-py2.py3-none-any.whl (131kB)
Collecting pytest<=3.3.2 (from -r /requirements.txt (line 8))
  Downloading https://files.pythonhosted.org/packages/38/af/8dcf688d192914928393f931b7b550f2530299bbb08018b2f17efa6aab73/pytest-3.3.2-py2.py3-none-any.whl (185kB)
Collecting pytest-cov<2.6.1,>=2.4.0 (from -r /requirements.txt (line 9))
  Downloading https://files.pythonhosted.org/packages/30/0a/1b009b525526cd3cd9f52f52391b426c5a3597447be811a10bcb1f6b05eb/pytest_cov-2.6.0-py2.py3-none-any.whl
Collecting cryptography<=2.4.2,>=2.3 (from -r /requirements.txt (line 10))
  Downloading https://files.pythonhosted.org/packages/60/c7/99b33c53cf3f20a97a4c4bfd3ab66dcc93d99da0a97cc9597aa36ae6bb62/cryptography-2.4.2-cp34-abi3-manylinux1_x86_64.whl (2.1MB)
Collecting yamlreader==3.0.4 (from -r /requirements.txt (line 11))
  Downloading https://files.pythonhosted.org/packages/84/4b/3af5480c26b3235dcd0984b9664b48115c2308c8c4f22e7162322be4ec0f/yamlreader-3.0.4.tar.gz
Collecting pluggy<0.7,>=0.5 (from -r /requirements.txt (line 12))
  Downloading https://files.pythonhosted.org/packages/ba/65/ded3bc40bbf8d887f262f150fbe1ae6637765b5c9534bd55690ed2c0b0f7/pluggy-0.6.0-py3-none-any.whl
Collecting packaging==17.1 (from -r /requirements.txt (line 13))
  Downloading https://files.pythonhosted.org/packages/ad/c2/b500ea05d5f9f361a562f089fc91f77ed3b4783e13a08a3daf82069b1224/packaging-17.1-py2.py3-none-any.whl
Collecting attrs==18.1.0 (from -r /requirements.txt (line 14))
  Downloading https://files.pythonhosted.org/packages/41/59/cedf87e91ed541be7957c501a92102f9cc6363c623a7666d69d51c78ac5b/attrs-18.1.0-py2.py3-none-any.whl
Collecting py>=1.4.17 (from tox<3.0,>=2.5->-r /requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/99/8d/21e1767c009211a62a8e3067280bfce76e89c9f876180308515942304d2d/py-1.8.1-py2.py3-none-any.whl (83kB)
Collecting six (from tox<3.0,>=2.5->-r /requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
Collecting virtualenv>=1.11.2; python_version != "3.2" (from tox<3.0,>=2.5->-r /requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/05/f1/2e07e8ca50e047b9cc9ad56cf4291f4e041fa73207d000a095fe478abf84/virtualenv-16.7.9-py2.py3-none-any.whl (3.4MB)
Requirement already satisfied: setuptools in /usr/local/lib/python3.4/site-packages (from pyinstaller<4.0,>=3.2->-r /requirements.txt (line 4))
Collecting altgraph (from pyinstaller<4.0,>=3.2->-r /requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/0a/cc/646187eac4b797069e2e6b736f14cdef85dbe405c9bfc7803ef36e4f62ef/altgraph-0.16.1-py2.py3-none-any.whl
Collecting pyyaml>=3.12 (from kubernetes==10.0.0->-r /requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/3d/d9/ea9816aea31beeadccd03f1f8b625ecf8f645bd66744484d162d84803ce5/PyYAML-5.3.tar.gz (268kB)
PyYAML requires Python '>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*' but the running Python is 3.4.6
You are using pip version 9.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip install -r /requirements.txt' returned a non-zero code: 1
make: *** [docker] Error 1

From my investigation it looks like PyYAML (which is needed by the Python k8s client) got updated to 5.3 on 6th January 2020 and dropped support for Python 3.4 (related PR: yaml/pyyaml#345).
So: the CMK Dockerfile is based on python:3.4.6, CMK requires the Python k8s client==10.0.0, and the k8s client is satisfied by PyYAML >= 3.12, so pip tries to download the most recent version (5.3, which no longer supports Python 3.4).

Taints upgraded in K8s version 1.7+

While testing the latest currently available version of CMK (v1.1.0) on K8s 1.7.4, I noticed an issue with the node tainting done by CMK.

With the move to K8s 1.7, taints have been moved from alpha to beta and the way to taint nodes and tolerate taints has changed. Here is the current description from kubernetes.io: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/

Below is an excerpt from node descriptions after deploying CMK:
{code}

kubectl describe nodes

Name: k8s-ha-minion1
Role:
Labels: beta.kubernetes.io/arch=amd64
...
node.alpha.kubernetes-incubator.io/node-feature-discovery.version=59a8917
Annotations: node.alpha.kubernetes.io/ttl=0
scheduler.alpha.kubernetes.io/taints=[{"value": "true", "key": "cmk", "effect": "NoSchedule"}]
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:
CreationTimestamp: Tue, 12 Sep 2017 06:04:46 -0700
{code}

Please note that the taint from the current CMK version is placed within annotations and not within the "Taints" field of the node description.
In this case I was able to deploy my example pod even though it had the toleration in annotations commented out:
{code}

cat busybox-crdnetwork.yaml

apiVersion: v1
kind: Pod
metadata:
  name: busybox-crdnetwork
  annotations:
    #scheduler.alpha.kubernetes.io/tolerations: '[{"key":"cmk", "value":"true"}]'
    networks: '[
      { "name": "flannel-conf" },
      { "name": "sriov-fvl-ipam10-30"},
      { "name": "sriov-mellanox-sharedvf-vlan5" }
    ]'
spec:
  containers:
  - name: busybox-crdnetwork
    image: "busybox"
    command: ["top"]
    stdin: true
    tty: true
  nodeSelector:
    kubernetes.io/hostname: k8s-ha-minion1
    node.alpha.kubernetes-incubator.io/nfd-network-sriov-configured: "true"
{code}

{code}

kubectl create -f ../../pods/busybox-crdnetwork.yaml

pod "busybox-crdnetwork" created

kubectl get po -a -o wide

NAME READY STATUS RESTARTS AGE IP NODE
busybox-crdnetwork 1/1 Running 0 3m 192.168.100.7 k8s-ha-minion1
cmk-cluster-init-pod 0/1 Completed 0 6m 192.168.100.9 k8s-ha-minion1
cmk-init-install-discover-pod-k8s-ha-minion1 0/2 Completed 0 6m 192.168.100.10 k8s-ha-minion1
cmk-init-install-discover-pod-k8s-ha-minion2 0/2 Completed 0 6m 192.168.120.15 k8s-ha-minion2
cmk-reconcile-nodereport-ds-k8s-ha-minion1-l4sxx 2/2 Running 0 6m 192.168.100.11 k8s-ha-minion1
cmk-reconcile-nodereport-ds-k8s-ha-minion2-qtjm2 2/2 Running 0 6m 192.168.120.16 k8s-ha-minion2
{code}

Command "kubectl taint" is unaware of the taint from annotations of the node:
{code}

kubectl taint nodes --all cmk-

taint "cmk:" not found
taint "cmk:" not found
{code}
So I have created a proper CMK taint of my own:
{code}

kubectl taint nodes --all cmk=:NoSchedule

node "k8s-ha-minion1" tainted
node "k8s-ha-minion2" tainted
{code}
Here is the node description now:
{code}

kubectl describe nodes

Name: k8s-ha-minion1
Role:
Labels: beta.kubernetes.io/arch=amd64
...
node.alpha.kubernetes-incubator.io/node-feature-discovery.version=59a8917
Annotations: node.alpha.kubernetes.io/ttl=0
scheduler.alpha.kubernetes.io/taints=[{"value": "true", "effect": "NoSchedule", "key": "cmk"}]
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: cmk:NoSchedule
CreationTimestamp: Tue, 12 Sep 2017 06:04:46 -0700
{code}

Now with or without the toleration in the annotation of the pod:
{code}

cat busybox-crdnetwork.yaml

apiVersion: v1
kind: Pod
metadata:
  name: busybox-crdnetwork
  annotations:
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"cmk", "value":"true"}]'
    networks: '[
      { "name": "flannel-conf" },
      { "name": "sriov-fvl-ipam10-30"},
      { "name": "sriov-mellanox-sharedvf-vlan5" }
    ]'
spec:
  containers:
  - name: busybox-crdnetwork
    image: "busybox"
    command: ["top"]
    stdin: true
    tty: true
  nodeSelector:
    kubernetes.io/hostname: k8s-ha-minion1
    node.alpha.kubernetes-incubator.io/nfd-network-sriov-configured: "true"
{code}

It would not deploy:
{code}

kubectl create -f busybox-crdnetwork.yaml

pod "busybox-crdnetwork" created

kubectl get po -a -o wide

NAME READY STATUS RESTARTS AGE IP NODE
busybox-crdnetwork 0/1 Pending 0 1m

kubectl describe po busybox-crdnetwork

Name: busybox-crdnetwork
Namespace: default
Node:
Labels:
Annotations: kubernetes.io/psp=privileged
networks=[ { "name": "flannel-conf" }, { "name": "sriov-fvl-ipam10-30"}, { "name": "sriov-mellanox-sharedvf-vlan5" } ]
scheduler.alpha.kubernetes.io/tolerations=[{"key":"cmk", "value":"true"}]
...
Tolerations:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


1m 13s 8 default-scheduler Warning FailedScheduling No nodes are available that match all of the following predicates:: MatchNodeSelector (1), PodToleratesNodeTaints (2).
{code}

Here is how the taint toleration must now look for the pod to deploy:
{code}

cat busybox-crdnetwork.yaml

apiVersion: v1
kind: Pod
metadata:
  name: busybox-crdnetwork
  annotations:
    networks: '[
      { "name": "flannel-conf" },
      { "name": "sriov-fvl-ipam10-30"},
      { "name": "sriov-mellanox-sharedvf-vlan5" }
    ]'
spec:
  containers:
  - name: busybox-crdnetwork
    image: "busybox"
    command: ["top"]
    stdin: true
    tty: true
  tolerations:
  - key: "cmk"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/hostname: k8s-ha-minion1
    node.alpha.kubernetes-incubator.io/nfd-network-sriov-configured: "true"
{code}

Please note the tolerations in the pod's spec dictionary.

Init cluster failed due to discovery container issue

Having this cluster-init pod definition:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: cmk-cluster-init-pod
  name: cmk-cluster-init-pod
  namespace: intel-cmk
spec:
  serviceAccountName: cmk-serviceaccount
  containers:
  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=test1-worker-0,test1-worker-1 --saname=cmk-serviceaccount --namespace=intel-cmk --cmk-img=quay.io/oglok/ocp4-cmk:latest"
    command:
    - "/bin/bash"
    - "-c"
    image: quay.io/oglok/ocp4-cmk:latest
    name: cmk-cluster-init-pod
  restartPolicy: Never

The image was stored in my own Quay registry, in order to keep it somewhere easily accessible.

[root@booger CPU-Manager-for-Kubernetes]# oc get pods -n intel-cmk
NAME                                           READY   STATUS   RESTARTS   AGE
cmk-cluster-init-pod                           0/1     Error    0          24h
cmk-init-install-discover-pod-test1-worker-0   0/2     Error    0          24h
cmk-init-install-discover-pod-test1-worker-1   0/2     Error    0          24h

The install container copies the cmk binary onto the workers in /opt/bin. However, I'm getting the following trace in the discover container:

oc logs pod/cmk-init-install-discover-pod-test1-worker-0 -c discover -n intel-cmk                                                                             [3/1917]
INFO:root:Patching node status test1-worker-0:
[
  {
    "op": "add",
    "path": "/status/capacity/cmk.intel.com~1exclusive-cores",
    "value": 4
  }
]
Traceback (most recent call last):
  File "/cmk/cmk.py", line 158, in <module>
    main()
  File "/cmk/cmk.py", line 115, in main
    discover.discover(args["--conf-dir"])
  File "/cmk/intel/discover.py", line 41, in discover
    add_node_er(conf_dir)
  File "/cmk/intel/discover.py", line 96, in add_node_er
    patch_k8s_node_status(patch_body)
  File "/cmk/intel/discover.py", line 202, in patch_k8s_node_status
    k8sapi.patch_node_status(node_name, patch_body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/core_v1_api.py", line 17100, in patch_node_status
    (data) = self.patch_node_status_with_http_info(name, body, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/core_v1_api.py", line 17194, in patch_node_status_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 176, in __call_api
    return_data = self.deserialize(response_data, response_type)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 249, in deserialize
    return self.__deserialize(data, response_type)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 289, in __deserialize
    return self.__deserialize_model(data, klass)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 633, in __deserialize_model
    kwargs[attr] = self.__deserialize(value, attr_type)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 289, in __deserialize
    return self.__deserialize_model(data, klass)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 633, in __deserialize_model
    kwargs[attr] = self.__deserialize(value, attr_type)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 267, in __deserialize
    for sub_data in data]
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 267, in <listcomp>
    for sub_data in data]
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 289, in __deserialize
    return self.__deserialize_model(data, klass)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 635, in __deserialize_model
    instance = klass(**kwargs)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/models/v1_container_image.py", line 52, in __init__
    self.names = names
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/models/v1_container_image.py", line 77, in names
    raise ValueError("Invalid value for `names`, must not be `None`")
ValueError: Invalid value for `names`, must not be `None`

I'm not sure exactly what command is being run, but I can reproduce something like this on the worker nodes:

[root@test1-worker-1 bin]# ./cmk discover --conf-dir=/etc/cmk                                                                                                                                                   
Traceback (most recent call last):
  File "cmk.py", line 158, in <module>
  File "cmk.py", line 115, in main
  File "intel/discover.py", line 32, in discover
  File "intel/k8s.py", line 299, in get_kubelet_version
  File "intel/k8s.py", line 135, in version_api_client_from_config
  File "site-packages/kubernetes/config/incluster_config.py", line 96, in load_incluster_config
  File "site-packages/kubernetes/config/incluster_config.py", line 47, in load_and_set
  File "site-packages/kubernetes/config/incluster_config.py", line 53, in _load_config
kubernetes.config.config_exception.ConfigException: Service host/port is not set.
[89004] Failed to execute script cmk

CMK Isolation on different NUMA nodes

Hi,

Are you planning to add NUMA node configuration to cmk isolate?
The idea is to specify the NUMA node during isolation, so that CPUs can be taken from one NUMA node or another depending on the application. That would allow us to split different workloads across different NUMA nodes.

The best use case is running a DPDK application on isolated CPUs of one NUMA node and other applications on isolated CPUs of the other NUMA node.

Please let me know.

Thanks

Add ENV_CPUS_ASSIGNED_MASK env variable

Some applications expect a mask of the CPUs on which they should isolate themselves (pktgen, testpmd). For convenience when running such applications in cmk containers, it would be useful to expose the allocated CPUs in the environment not only as a list, but also in the form of a mask.
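
As a rough illustration of the request, the mask could be derived from the CPU list CMK already exports; a hedged sketch in Python, where ENV_CPUS_ASSIGNED_MASK is the proposed name rather than an existing variable:

def cpus_to_mask(spec):
    # Turn a cpuset-style list such as "2-3,8" into a hex CPU mask.
    mask = 0
    for part in spec.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            for cpu in range(lo, hi + 1):
                mask |= 1 << cpu
        else:
            mask |= 1 << int(part)
    return hex(mask)

print(cpus_to_mask("2-3,8"))  # 0x10c, the hex-mask form tools like pktgen/testpmd typically accept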

Different pools for different containers

I have a pod that has two containers.
I want one container to use 1 CPU exclusively with affinity and another container to use a part of a CPU from a shared pool.
Is it possible?

Add support for allocating multiple cpus to a container

CMK should support allocating multiple cpus to a container. The support is needed for dataplane (exclusive) pool allocations.

For this I have created a PoC which uses a new option (--num-cpus) in isolate command. With this simple solution the requirement can be fulfilled.

k8s node loses CMK exclusive-cores property after reboot

After reboot - K8S compute nodes seem to be losing the cmk-related resource properties:
Before reboot:
"cmk.intel.com/exclusive-cores": "1"
After reboot
"cmk.intel.com/exclusive-cores": "0"

I need to rerun the CMK init & discovery pod for the situation to get back to normal.

Steps to reproduce:

  • set up CMK on the k8s compute node
  • reboot the node

Steps to fix:

  • delete the pods: cmk-cluster-init-pod , cmk-init-install-discover-pod-*
  • delete /etc/cmk directory
  • reapply the cmk cluster-init pod

CMK Configmap is hard to manage

Hi guys,

I am using CMK 1.5.1.
Here the ConfigMap replaced the file-system solution, but the issue is that the ConfigMap is hard-coded inside the Python code, which makes it impossible to manage as part of a Helm chart. For example, if I remove my CMK Helm chart, it cannot remove that ConfigMap.

Could you please remove the creation of the ConfigMap from the Python code and provide a proper manifest that creates it?
Then the init code would only need to update that ConfigMap instead of creating it.

Please let me know as it is blocking my work.

Thanks

CMK resources are not showing after upgrading k8s

No resources are showing for CMK after upgrading to k8s v1.22.1, CMK version 1.5.2.
I cloned the latest tag; every CMK pod comes up, but the nodes are not showing any CMK resources.

Is there any resolution for this issue? Please help!

INTERNAL ERROR: cannot create temporary directory!

Hello, I am using CMK for my project. After I pin CPUs for my container, I run cmk isolate with pool=shared. I get a log with this format:
[num] INTERNAL ERROR: cannot create temporary directory!
Have you encountered this before? Please share any information with me. Thank you!

problem running the webhook validation test on one node, works fine on other node

I have CMK configured to run on four nodes, controller-0, controller-1, compute-0, compute-1. All have 4 exclusive CPUs and 1 shared.

If I run the webhook validation test pod with the node specified as "controller-0" it works fine and gets assigned an exclusive CPU. If I run it against compute-0 the pod fails with the following logs:

controller-0:/tmp$ kubectl logs pod/cmk-isolate-pod
[6] Failed to execute script cmk
/tmp/_MEI1zW97g/requests/init.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
Traceback (most recent call last):
File "cmk.py", line 158, in
main()
File "cmk.py", line 126, in main
args["--socket-id"])
File "intel/isolate.py", line 111, in isolate
p.cpu_affinity(cpu_list)
File "site-packages/psutil/init.py", line 813, in cpu_affinity
File "site-packages/psutil/_pslinux.py", line 1508, in wrapper
File "site-packages/psutil/_pslinux.py", line 1931, in cpu_affinity_set
OSError: [Errno 22] Invalid argument

Add support for SR-IOV device NUMA awareness

CMK should automatically select the isolated pool and a preferred socket_id when a container is configured with an SR-IOV network device. How can such a device be found inside the container?
One way is to find at least one PCIDEVICE_INTEL_COM_<name_of_resource> variable among the environment variables and use its value (the PCI device address) to detect the device's NUMA node.
To obtain the pool and socket-id, the cmk interface could be extended with an sriov_numa_aware command and an appropriate implementation.
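
For reference, the NUMA node of a PCI device can be read from sysfs given its address; a hedged sketch of that lookup (the PCIDEVICE_* handling here is illustrative, not existing CMK behaviour):

import os
from pathlib import Path

def pci_numa_node(pci_addr):
    # The kernel reports -1 when it does not know the device's NUMA node.
    return int(Path("/sys/bus/pci/devices/{}/numa_node".format(pci_addr)).read_text())

for name, value in os.environ.items():
    if name.startswith("PCIDEVICE_INTEL_COM_"):
        # device plugins typically expose the allocated PCI addresses comma separated
        for addr in value.split(","):
            print(name, addr, "NUMA node:", pci_numa_node(addr))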

About why use the pinner binary to pin cpu for a process?

Hi, I am interested in cmk, and we have the same requirement for a container to use both exclusive and shared cores.
But we always pin an exclusive core to a specific thread in a process, not just to the process as a whole: we pass the exclusive core ID into the container through a Docker environment variable and let the thread in the process pin itself to that core.
Why is this feature designed to only support pinning a process to the exclusive core? I see the Nokia CPU-Pooler is also designed like this.
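
For context, on Linux an individual thread can pin only itself by passing its own kernel thread ID to the affinity call; a hedged sketch of that pattern, independent of CMK's pinner binary:

import os
import threading

def worker(cpu_id):
    tid = threading.get_native_id()       # kernel thread ID (Python 3.8+)
    os.sched_setaffinity(tid, {cpu_id})   # affinity applies only to this thread
    print("thread", tid, "pinned to CPU", cpu_id)

t = threading.Thread(target=worker, args=(2,))
t.start()
t.join()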

Pin CPU for Pod on K8s

Hi, I'm using cmk to pin CPUs for my app. But I see that CPUs are pinned per container in a pod. So I want to ask how to pin CPUs to a whole pod (which has many containers inside). Thanks for watching!

spread pool mode does not spread

Hello!

I have k8s cluster with cmk installation. k8s cluster parameters:
1 master, 1 minion

on minion isolcpus=0-4,24-28,12-16,36-40

0-4, 24-28 - sibling cpus on 0 numa node
12-16, 36-40 - sibling cpus on 1 numa node

cmk cluster-init parameters:
--exclusive-mode=spread
--num-exclusive-cores=8
--num-shared-cores=1

resulting exclusive pool:
0-4,12-14

correct spread exclusive pool:
0-3,12-15

So cmk does not work according to the documentation.

Exploring the cmk code showed the following.
The init function (init.py) prepares a spread list of exclusive cores:
isolated_cores_exclusive = platform.get_isolated_cores(mode="spread")
content of this list:

numa core
0 0
1 12
0 1
1 13
0 2
1 14
0 3
1 15
0 4
1 16

Next, this list is sorted:

# always prioritize free SST_BF cores, even if there may be none
free_cores.sort(key=lambda c: (c.is_sst_bf(), -c.core_id), reverse=True)

The CPUs on minion hqnova1 do not support SST-BF, so the list is sorted by CPU ID:

numa cpu
0 0
0 1
0 2
0 3
0 4
1 12
1 13
1 14
1 15
1 16

Then the first num-exclusive-cores entries of this list are marked as belonging to the exclusive pool:
0-4, 12-14
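
A minimal sketch of the behaviour described above, using plain (socket, core) tuples in place of CMK's core objects (names are illustrative, not CMK code):

# The spread list prepared by init: cores interleaved across the two sockets.
spread = [(0, 0), (1, 12), (0, 1), (1, 13), (0, 2), (1, 14),
          (0, 3), (1, 15), (0, 4), (1, 16)]

def is_sst_bf(core):
    return False  # the reporter's CPUs do not support SST-BF

# Equivalent of the quoted sort: on non-SST-BF hardware the key degenerates
# to plain core-id order, which destroys the socket interleave.
ordered = sorted(spread, key=lambda c: (is_sst_bf(c), -c[1]), reverse=True)
print(ordered[:8])  # --num-exclusive-cores=8 -> cores 0-4 and 12-14, not 0-3 and 12-15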

CMK reconfiguration function errors

Hi folks,

I use the CMK v1.5.2 release and Kubernetes v1.16.7.
I tried the CMK reconfigure function but it fails; it seems the code has some errors.

My patch for issues 1~5 below is here: ylhsiehitri@c195694
With the patch, the reconfigure function can now resize the pool successfully.
But the problem in 5(b) remains unsolved: the core affinity in the cmk ConfigMap is changed successfully, but in the OS it is actually unchanged.
(This case can be reproduced by: create two pods with one exclusive core each --> let pod-1 complete while pod-2 keeps running --> reconfigure the exclusive pool size to 1 --> CMK would change pod-2's core affinity to pod-1's core, but in the OS it is actually unchanged.)

  1. Bug in lock:
    Within reconfigure.reconfigure(), it gets into an error loop as follows:
    (a) It calls config.lock():
    in config.lock(), the "owner" annotation derived from the ConfigMap is empty, so it is set to the name of the associated pod.
    (b) Then reconfigure_directory() is called:
    (i) In reconfigure_directory(), discover.add_node_er() is called, which calls config.lock() again.
    (ii) Now in config.lock() the "owner" is not empty (it was just set in (a)), so following the if-else statement it takes the "if" branch, which leads to an infinite loop.
    ==>
    This infinite loop can be fixed by correcting the if statement in config.lock():
    if owner != "" and owner != self.owner

  2. Missing argument "namespace"
    These places are missing the required argument "namespace":
    (a) reconfigure.reconfigure_directory() when it calls discover.add_node_er() and discover.add_node_oir()
    (b) describe.describe() when it calls config.Config()
    (c) isolate.isolate() when it calls config.Config()
    ==> Fixed; the function headers of reconfigure.reconfigure() and describe.describe() are also corrected by adding the "namespace" argument.

  3. Non-default OS environment variables, e.g. NODE_NAME, should be given in the pod YAML
    The CMK code often uses os.getenv("NODE_NAME") to get this environment variable; however, it is not set by default.
    ==> NODE_NAME should be defined as an env in the pod YAML. I added it to cmk-isolate-pod.yaml as an example.

  4. (Discussion) Redundant argument
    In the reconfigure.reconfigure() function header, the argument "node_name" is redundant since it is reassigned in the function body.
    ==>
    Or was the reconfigure function designed to allow the user to specify node_name?

  5. yaml.safe_load() fails for "Procs" object deserialization
    In reaffinitize.py, yaml.safe_load() is used to deserialize the yaml-serialized Procs object, but it fails.
    (a) solved bug:
    source code: yaml.safe_load(config["config"])
    correct to : yaml.safe_load(config.data["config"])
    (b) unsolved:
    We expect the corrected yaml.safe_load() in (a) to return a Procs object, but it actually raises an exception because it does not know how to do the conversion.
    ==>
    Does anyone know how to deal with this simply?
    As far as I know, yaml.safe_load() can deserialize objects of simple structure. But the Procs class defined in reconfigure.py is not simple: it has a dict whose values are Pid objects. For such cases, Python provides the yaml.add_constructor() mechanism to define how to reconstruct the object (a sketch of this mechanism follows after this list).
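
A hedged sketch of that yaml.add_constructor()/add_representer() mechanism, using simplified stand-ins for Procs and Pid rather than CMK's actual classes:

import yaml

class Pid:
    def __init__(self, affinity):
        self.affinity = affinity

class Procs:
    def __init__(self, pids):
        self.pids = pids  # dict: pid number -> Pid object

def procs_representer(dumper, procs):
    # serialize as a tagged mapping of pid -> affinity list
    return dumper.represent_mapping("!Procs", {pid: p.affinity for pid, p in procs.pids.items()})

def procs_constructor(loader, node):
    mapping = loader.construct_mapping(node, deep=True)
    return Procs({pid: Pid(aff) for pid, aff in mapping.items()})

yaml.add_representer(Procs, procs_representer)
yaml.add_constructor("!Procs", procs_constructor, Loader=yaml.SafeLoader)

doc = yaml.dump(Procs({1234: Pid([2, 3])}))
restored = yaml.safe_load(doc)          # safe_load now rebuilds a Procs object
print(restored.pids[1234].affinity)     # [2, 3]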

Only one node picked up by cluster-init which also fails with Forbidden Error

My cmk-cluster-init-pod.yaml looks as below. I am trying to initialize a cluster with 3 worker nodes. However, only one of the nodes (strangely, only "worker-1", which is the second in the list) has been installed with cmk and the related binaries. I found the logs for the cluster-init pod on this node, which show a Forbidden error (403), as shown below.

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: cmk-cluster-init-pod
  name: cmk-cluster-init-pod
  namespace: cmk-namespace
spec:
  serviceAccountName: cmk-serviceaccount
  containers:
  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=worker-0,worker-1,worker-2"
    command:
    - "/bin/bash"
    - "-c"
    image: mjace/cmk:v1.3.1
    name: cmk-cluster-init-pod
    securityContext:
      privileged: false
    ports:
    - containerPort: 8080
  restartPolicy: Never

cluster-init issues the following error on worker-1 node:

2019-12-30T16:45:13.333207607+00:00 stderr F INFO:root:Used ServiceAccount: cmk-serviceaccount
2019-12-30T16:45:13.333207607+00:00 stderr F INFO:root:Creating cmk pod for ['init', 'install', 'discover'] commands ...
2019-12-30T16:45:13.449979685+00:00 stderr F ERROR:root:Exception when creating pod for ['init', 'install', 'discover'] command(s): (403)
2019-12-30T16:45:13.449979685+00:00 stderr F Reason: Forbidden
2019-12-30T16:45:13.449979685+00:00 stderr F HTTP response headers: HTTPHeaderDict({'Content-Length': '301', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 30 Dec 2019 16:45:13 GMT'})
2019-12-30T16:45:13.449979685+00:00 stderr F HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User "system:serviceaccount:cmk-namespace:cmk-serviceaccount" cannot create resource "pods" in API group "" in the namespace "default"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
2019-12-30T16:45:13.449979685+00:00 stderr F
2019-12-30T16:45:13.449979685+00:00 stderr F
2019-12-30T16:45:13.449979685+00:00 stderr F ERROR:root:Aborting cluster-init ...

I am currently using k8s v1.5.2 and cmk images from v1.3.1.

keyerror observed while running isolate command

/opt/bin/cmk  isolate  --pool=exclusive echo "test" --namespace=cmk-namespace
test
Traceback (most recent call last):
  File "cmk.py", line 216, in <module>
  File "cmk.py", line 151, in main
  File "intel/isolate.py", line 151, in isolate
TypeError: __init__() missing 1 required positional argument: 'cm_namespace'
[344] Failed to execute script cmk

operator.md incorrectly states to set --insecure to false to revert back to regular TLS

operator.md states "You can also set the argument --insecure to false and the webhook service will revert back to regular TLS". However, webhook.py uses mutual tls when self.insecure == "False".

Therefore, whilst operator.md is factually accurate because "false"!="False", it is perhaps a little bit misleading (it certainly confused me)!

My proposal is to update operator.md so that it states "You can also set the argument --insecure to True and the webhook service will revert back to regular TLS".

compare requested cores with cores in pool

Hello!
In the check_assignment() function:
if num_exclusive_lists is not num_exclusive_cores:
ERROR

Why is there an exact comparison? What is wrong with the situation where num_exclusive_lists > num_exclusive_cores?
Thank you.
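
For reference (outside of CMK's code), Python's "is not" compares object identity rather than value, so the quoted check can behave surprisingly for integers; a small sketch with illustrative variable names:

num_exclusive_cores = int("1000")   # fresh int objects, e.g. parsed from CLI arguments
num_exclusive_lists = int("1000")
print(num_exclusive_lists is not num_exclusive_cores)  # True: different objects
print(num_exclusive_lists != num_exclusive_cores)      # False: equal values

# A value-based check that also tolerates extra CPU lists in the pool:
if num_exclusive_lists < num_exclusive_cores:
    raise RuntimeError("not enough exclusive CPU lists for the requested cores")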

failure during uninstall

When running "kubectl apply -f cmk-uninstall-all-daemonset.yaml" the uninstall worked on one of my two nodes, but failed on the other. I'm now left with pod/cmk-uninstall-all-nodes-bm8lp in CrashLoopBackOff, daemonset.apps/cmk-uninstall-all-nodes still running, and /etc/cmk still present.

The final logs for the failed pod were as follows:

WARNING:root:"cmk-nodereport" for node "controller-0" does not exist.
INFO:root:"cmk-nodereport" for node "controller-0" removed.
INFO:root:Removing "cmk-reconcilereport" from Kubernetes API server for node "controller-0".
INFO:root:Converted "controller-0" to "controller-0" for TPR/CRD name
WARNING:root:"cmk-reconcilereport" for node "controller-0" does not exist.
INFO:root:"cmk-reconcilereport" for node "controller-0" removed.
INFO:root:Removing node taint.
INFO:root:Patching node controller-0:
[
  {
    "op": "replace",
    "path": "/spec/taints",
    "value": []
  }
]
INFO:root:Removed node taint with key"cmk".
INFO:root:Removing node ERs
INFO:root:Patching node status controller-0:
[
  {
    "op": "remove",
    "path": "/status/capacity/cmk.intel.com~1exclusive-cores"
  }
]
ERROR:root:Aborting uninstall: Exception when removing ER: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Content-Length': '187', 'Date': 'Wed, 15 May 2019 17:44:02 GMT'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the server rejected our request due to an error in our request","reason":"Invalid","details":{},"code":422}

Error getting cmk-config-<node-name> Configmap

Hi guys,

I may have found several bugs in the management of the ConfigMap in the latest 1.5.2 version, and I have also tried to find a workaround to fix them. My patch is here:
hanyin-intel@e8d5903

Here are the details:

  1. The node-report feature can't get the cmk-config-<node-name> ConfigMap because of an error calling the "get_config" function in the "Config" class. Since this function is not defined in the "Config" class, I use the constructor and the "lock/unlock" functions as a workaround. But it seems that the "node-report" container continuously restarts because of a race on that lock.

  2. The describe feature can't get the cmk-config-<node-name> ConfigMap because the "namespace" attribute is not passed when constructing the "Config" class. As a workaround, I use the "/var/run/secrets/kubernetes.io/serviceaccount/namespace" file in the container to get the namespace and pass it to the constructor of the "Config" class.

  3. The isolate feature also misses the "namespace" attribute when getting the cmk-config-<node-name> ConfigMap.

  4. When "hostNetwork" is enabled for the container, the environment variable "HOSTNAME" in the container is set to the node name rather than the pod name. Thus the isolate feature, which is always used as the wrapper of the real workload, cannot get the correct pod name. In the workaround, I use the environment variable "PODNAME" instead of "HOSTNAME" in isolate.py; the user then needs to add "PODNAME" to their containers.

Thanks

Hyperthreading awareness?

Is there a way to ensure that a request is either hyperthread-aware or not? Are there any provisions to make sure multiple requests don't overlap on two logical CPUs that share the same physical core, to prevent noisy neighbors?
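
For background, hyperthread siblings can be discovered from the kernel's topology files under sysfs; a hedged sketch of such a lookup (illustrative only, not CMK's implementation):

from pathlib import Path

def thread_siblings(cpu_id):
    # Logical CPUs sharing a physical core with cpu_id, e.g. [2, 38].
    path = Path("/sys/devices/system/cpu/cpu{}/topology/thread_siblings_list".format(cpu_id))
    siblings = []
    for part in path.read_text().strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            siblings.extend(range(int(lo), int(hi) + 1))
        else:
            siblings.append(int(part))
    return siblings

print(thread_siblings(2))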

What's the suitable branch/commit for k8s v1.10.11?

I was using CMK release 1.2.2, commit fc5d194, on k8s 1.9.1.
When I upgraded my k8s to 1.10.11, there were some problems with cmk:
the cmk reconcile and nodereport pods crashed.

root@master-1:/home/k200# kubectl get po -o wide
NAME                                         READY     STATUS             RESTARTS   AGE       IP              NODE
cmk-cluster-init-pod                         0/1       Completed          0          22d       10.20.190.200   master-1
cmk-init-install-discover-pod-master-1       0/2       Completed          0          22d       10.244.0.5      master-1
cmk-init-install-discover-pod-minion-1       0/2       Completed          0          22d       10.244.1.5      minion-1
cmk-reconcile-nodereport-ds-master-1-hvb9g   0/2       CrashLoopBackOff   12476      22d       10.244.0.6      master-1
cmk-reconcile-nodereport-ds-minion-1-gjb85   0/2       CrashLoopBackOff   12478      22d       10.244.1.6      minion-1
node-feature-discovery-8dmsf                 0/1       Completed          0          22d       10.20.190.201   minion-1
node-feature-discovery-x8klt                 0/1       Completed          0          22d       10.20.190.200   master-1
root@master-1:/home/k200# kubectl logs cmk-reconcile-nodereport-ds-master-1-hvb9g reconcile 
{
  "reclaimedCpuLists": []
}
Traceback (most recent call last):
  File "/cmk/cmk.py", line 145, in <module>
    main()
  File "/cmk/cmk.py", line 122, in main
    args["--publish"])
  File "/cmk/intel/reconcile.py", line 69, in reconcile
    reconcile_report = reconcile_report_type.create(node_name)
  File "/cmk/intel/third_party.py", line 105, in create
    self.save()
  File "/cmk/intel/third_party.py", line 66, in save
    raise e
  File "/cmk/intel/third_party.py", line 63, in save
    self.api.create_third_party_resource(body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 878, in create_third_party_resource
    (data) = self.create_third_party_resource_with_http_info(body, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 962, in create_third_party_resource_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 335, in call_api
    _preload_content, _request_timeout)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 148, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 393, in request
    body=body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/rest.py", line 287, in POST
    body=body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/rest.py", line 240, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Length': '174', 'Date': 'Thu, 27 Dec 2018 08:46:04 GMT', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the server could not find the requested resource","reason":"NotFound","details":{},"code":404}
root@master-1:/home/k200# kubectl logs cmk-reconcile-nodereport-ds-master-1-hvb9g nodereport 
INFO:root:Isolated logical cores: 2,3,4,5,6,38,39,40,41,42
{
..........
..........
..........
              {
                "id": 71,
                "isolated": false
              }
            ],
            "id": 35
          }
        ],
        "id": 1
      }
    }
  }
}
Traceback (most recent call last):
  File "/cmk/cmk.py", line 145, in <module>
    main()
  File "/cmk/cmk.py", line 134, in main
    args["--publish"])
  File "/cmk/intel/nodereport.py", line 66, in nodereport
    node_report = node_report_type.create(node_name)
  File "/cmk/intel/third_party.py", line 105, in create
    self.save()
  File "/cmk/intel/third_party.py", line 66, in save
    raise e
  File "/cmk/intel/third_party.py", line 63, in save
    self.api.create_third_party_resource(body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 878, in create_third_party_resource
    (data) = self.create_third_party_resource_with_http_info(body, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 962, in create_third_party_resource_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 335, in call_api
    _preload_content, _request_timeout)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 148, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/api_client.py", line 393, in request
    body=body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/rest.py", line 287, in POST
    body=body)
  File "/usr/local/lib/python3.4/site-packages/kubernetes/client/rest.py", line 240, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 27 Dec 2018 08:46:04 GMT', 'Content-Type': 'application/json', 'Content-Length': '174'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the server could not find the requested resource","reason":"NotFound","details":{},"code":404}

Would you tell me what the suitable branch/commit is for k8s v1.10.11,
or give any information/clue to deal with this bug for cmk release 1.2.2 on k8s 1.9.1?
Thank you.

CMK on Cluster Multi Node environment

Hi,

I have a very important question: I have a multi-node cluster environment and I am using the following manifests to set up my CMK cluster:

Unfortunately I can't use cluster-init, because that way the Helm chart would not have any information about the deployed pods, and the problem then is that, for example, deleting the CMK Helm chart would not remove any resources created by the Python code.

Therefore I am using the second solution.
But I found that the CMK Init / Discover / Install pods run on just one single node, while the CMK Reconcile pod runs on all nodes.
My questions are:

(1) If the CMK Init / Discover / Install pods run on just one single node, how can I decide on which node to deploy an application that uses CMK?
(2) If the CMK Init / Discover / Install pods run on just one single node in a multi-node environment, who makes the decision about which node? Which node will CMK use?

For example, in my cluster multi node I have:

  • cmk-discover Pod -> 172.20.55.65
  • cmk-init Pod -> 172.20.55.67
  • cmk-install Pod -> 172.20.55.65
  • cmk-reconcile Pod -> 172.20.55.78, 172.20.55.65, 172.20.55.67

but in reality I have isolated CPUs only on 172.20.55.67, so what I expect is for all pods to run only on that node.

Please let me know as those question are very important for us.

Thanks

Asymmetric Core Requests Across Sockets (Feature Request)

This is related to an ongoing conversation with @przemeklal and @lmdaly.

We have a use case where pods have a known affinity to NICs and/or other devices that may be asymmetric across sockets. For example, let's say we have the following hardware configuration:

Dual socket 18-core xeon
4 40Gbps NICs, two per socket

We launch a pod that requests resources for 3 NICs, and to service the throughput on those NICs with DPDK, we need 6 exclusive/isolated data plane cores per NIC. In this case we would want to request 6+6 cores from socket 0, and 6 cores from socket 1 so that our pod does not cross UPI/QPI on the data path. It would be ideal if in CMK there was flexibility to make requests as specific as this. In our case, we have a custom scheduler where we know exactly how many cores and NICs are free on each node, so we can make the appropriate combination of request. This would allow another pod that only needs a single NIC and 6 cores to be scheduled on the same node, and not have any affinity issues.

future of CMK after k8s topology manager released

@ConnorDoyle @przemeklal @lmdaly

Hi.

I have worked with CMK, but I found the k8s built-in CPU Manager and Topology Manager proposals/implementations a few weeks ago.

I believe the k8s built-in CPU Manager is proposed to offer a better experience than CMK.

And the Topology Manager seems to be proposed as a solution for misaligned assignment of CPUs and PCI devices (such as NICs and GPUs) across NUMA nodes.

My question is: will you keep supporting CMK after the Topology Manager is merged and becomes a default feature of k8s?

Processes in shared pool not actually sharing the cores

Hi,

I create multiple processes with "cmk.py isolate --pool=shared ...". Then I use "cmk.py describe" and see them listed in the shared pool. However, their PSR values are the same (see PSR with "ps -o psr "), even though their core affinity is exactly the shared pool cores (see affinity with "taskset -pc "). I suppose their PSR should be spread over the shared pool cores, right?

I guess it is because the shared pool cores are isolated cores (i.e. listed in "isolcpus" in the GRUB config), and the OS does not load-balance processes across isolated cores. What other configuration should I do to make the processes' PSR spread over the shared pool cores?

My environment
OS: Ubuntu 18.04.5
Kernel: 5.4.0-66-generic
CMK: 1.5.2

Thanks!

cmk-cluster-init-pod failed with CreateContainerConfigError

When resources/pods/cmk-cluster-init-pod.yaml is applied, the pod fails with status 'CreateContainerConfigError'.

The pod description has the following message:
Warning Failed 6s (x4 over 18s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "cmk-cluster-init-pod_cmk-namespace(fa1d4889-e80e-4506-ab08-40e0c50305cf)", container: cmk-cluster-init-pod)

Issue about using hyper-thread in exclusive mode

Hi, I am trying to use CPU pinning for pods using your source. I have successfully installed and tested exclusive mode. It works well: each task runs on only one core and nothing is shared.
But I have a problem: each CPU core has 2 threads, like 2-38 and 3-39, but when I use the cmk isolate command, I only see tasks running on threads 2 and 3; then the next task hits the core limit, and tasks never run on threads 38 and 39.
Do you support isolation at the thread level, or only at the core level?
Thanks.

CPU isolation from host daemons without isolcpus

Hi.
Is it possible to isolate a CPU for a container from system/host processes/daemons, but not using isolcpus kernel arg?

I have this use case:
We host game servers in Kubernetes clusters. They are single-threaded. In one pod, we have one container with a game server, and another is a sidecar container with some helper processes.

We also have a few daemonsets for maintenance (like logging (promtail), monitoring (kube-prometheus-stack), updating game server files, uploading game replays to the s3, and so on).

Every container with a game server (actually a Linux thread) should allocate one dedicated CPU thread and be pinned to it (to avoid context switches and CPU cache misses, so we have consistent latency and FPS without any jitter).

I want behavior like this:
For example, I have one server with Ryzen 9 5950x (16 cores / 32 threads).

During peak hours, we have 30 game servers, and all of them allocate CPU threads exclusively (one game server per CPU thread), so all other sidecar containers, daemonsets, and system processes/daemons (including kubelet, etc.) should run on the last CPU core and never be scheduled on the first 15 CPU cores.
During periods of low player counts, we have, for example, 10 game servers that allocate 10 CPU threads (5 CPU cores). The other 11 CPU cores should be available for those system processes/daemons, daemonsets, etc.
Any ideas on how to achieve this behavior?

cmk cluster-init fails to re-run with a newly added node

Description

cmk cluster-init fails with a newly added node

ERROR:root:Exception when creating secret: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 11 Apr 2019 05:03:41 GMT', 'Content-Length': '218', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"cmk-webhook-certs\" already exists","reason":"AlreadyExists","details":{"name":"cmk-webhook-certs","kind":"secrets"},"code":409}


ERROR:root:Aborting webhook deployment ...

steps to reproduce

1. create cmk-cluster-init-pod with the following args:

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=node-0 --saname=cmk-serviceaccount"

2. check that the cmk cluster-init is successful

$ kubectl get pods -o wide
NAME                                             READY   STATUS      RESTARTS   AGE    IP             NODE           NOMINATED NODE
cmk-cluster-init-pod                             0/1     Completed   0          128m   172.16.0.8     node-0         <none>
cmk-init-install-discover-pod-node-0             0/2     Completed   0          128m   172.16.0.9     node-0         <none>
cmk-reconcile-nodereport-ds-node-0-g5ww7         2/2     Running     0          127m   172.16.0.10    node-0         <none>
cmk-webhook-deployment-5b7895df7f-zzwgc          1/1     Running     0          127m   172.16.0.11    node-0         <none>

3. add a new node node-1 and create cmk-cluster-init-pod-node-1 with the following args:

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=node-1 --saname=cmk-serviceaccount"

cmk-cluster-init-pod-node-1 failed with errors:

$ kubectl get pods cmk-cluster-init-pod-node-1 -o wide
NAME                                             READY   STATUS      RESTARTS   AGE    IP             NODE           NOMINATED NODE
cmk-cluster-init-pod-node-1                      0/1     Error       0          18m    172.16.1.98    node-1         <none>

$ kubectl describe logs cmk-cluster-init-pod-node-1
error: the server doesn't have a resource type "logs"
$ kubectl logs cmk-cluster-init-pod-node-1
INFO:root:Used ServiceAccount: cmk-serviceaccount
INFO:root:Creating cmk pod for ['init', 'install', 'discover'] commands ...
INFO:root:Waiting for cmk pod running ['init', 'install', 'discover'] cmds to enter Succeeded state.
INFO:root:Creating cmk pod for ['reconcile', 'nodereport'] commands ...
INFO:root:Waiting for cmk pod running ['reconcile', 'nodereport'] cmds to enter Running state.
ERROR:root:Exception when creating secret: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 11 Apr 2019 05:03:41 GMT', 'Content-Length': '218', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"cmk-webhook-certs\" already exists","reason":"AlreadyExists","details":{"name":"cmk-webhook-certs","kind":"secrets"},"code":409}


ERROR:root:Aborting webhook deployment ...

Failed to build cmk docker image.

I get an InvocationError when building cmk.
It shows an error like:

"ERROR: InvocationError: '/cmk/.tox/lint/bin/flake8 intel cmk.py tests setup.py'"

I tried cmk 1.2.2, 1.3.0, and the master branch, all with the same result.

The full log is attached:
cmk_error.log

no way to uninstall single node while leaving webhook

It would be useful to separate out the cluster-wide uninstall (i.e. the webhook resources) from the per-node uninstall. This would allow you to uninstall a single node and then reinstall it with a different CPU allocation without affecting the webhook resources.
