sig-windows-dev-tools's Introduction

Welcome to the SIG Windows Development Environment!

This is a fully batteries-included development environment for Windows on Kubernetes, including:

  • Vagrant file for launching a two-node cluster
  • containerd 1.6.15
  • Support for two CNIs: antrea, or calico on containerd: configure your CNI option in variables.yml
    • Calico 3.25.0 on containerd runs containers out of the box
    • Antrea 0.13.2 runs, but requires a patch for antrea-io/antrea#2344, which was recently made available
  • NetworkPolicy support for Windows and Linux provided by Antrea and Calico
  • Windows binaries for kube-proxy.exe and kubelet.exe that are either built from source (K8s main branch) or releases
  • Kubeadm installation that can put the latest Linux control plane in place

Quick Start

Prerequisites

  • Linux host - Fedora 38.
    • Experimental support for Windows host with WSL as environment providing make, see Windows with WSL.
  • make
  • Vagrant
  • VirtualBox (we only have VirtualBox automated here, but these recipes have been used with others, like Microsoft HyperV and VMware Fusion).
  • Kubectl

Getting a cluster up and running

Simple steps to a Windows Kubernetes cluster, from scratch, built from source...

  • vagrant plugin install vagrant-reload vagrant-vbguest winrm winrm-elevated. The vagrant-reload plugin is needed to easily reboot the Windows VMs during setup of the container features.
  • make all. This creates the entire cluster for you. To compile k/k/ from local source, see the instructions later in this doc.
    • If the above fails, run vagrant provision winw1, in case you hit a flake during the Windows installation.
  • vagrant ssh controlplane and run kubectl get nodes to see your running dual-os linux+windows k8s cluster.

Windows with WSL (experimental)

All the above Quick Start steps apply, except you have to run the Makefile targets in WSL

  • using vagrant.exe on the host
  • while inside clone of this repo on Windows filesystem, not WSL filesystem.

First, get the path to vagrant.exe on the host by using Get-Command vagrant in PowerShell, as in the following example.

~ > $(get-command vagrant).Source.Replace("\","/").Replace("C:/", "/mnt/c/")
/mnt/c/HashiCorp/Vagrant/bin/vagrant.exe

Next, pass the mount path to the executable on the Windows host with the VAGRANT environment variable exported in WSL.

Then, ensure you clone this repository onto filesystem inside /mnt and not the WSL filesystem, in order to avoid failures similar to this one:

The host path of the shared folder is not supported from WSL.
Host path of the shared folder must be located on a file system with
DrvFs type. Host path: ./sync/shared

Finally, the steps to a Windows Kubernetes cluster on a Windows host in WSL turn into the following sequence:

export VAGRANT=/mnt/c/HashiCorp/Vagrant/bin/vagrant.exe
cd /mnt/c/Users/joe
git clone https://github.com/kubernetes-sigs/sig-windows-dev-tools.git
cd sig-windows-dev-tools
make all
# ...
make clean
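The PowerShell one-liner above can also be mirrored on the WSL side. The following is a minimal sketch, not part of this repository: win_to_wsl_path is a hypothetical helper that applies the same two substitutions (backslashes to slashes, drive prefix to /mnt/<drive>/), and it assumes the default /mnt automount root. Inside a real WSL session, the built-in wslpath tool is likely the better choice.

```shell
# Hypothetical helper: convert a Windows path such as
# C:\HashiCorp\Vagrant\bin\vagrant.exe into the matching WSL mount path.
# Assumes drives are mounted under /mnt (the WSL default).
win_to_wsl_path() {
  # Normalise backslashes to forward slashes.
  p=$(printf '%s' "$1" | tr '\\' '/')
  case "$p" in
    [A-Za-z]:/*)
      # Lower-case the drive letter and drop the "X:/" prefix.
      drive=$(printf '%s' "$p" | cut -c1 | tr 'A-Z' 'a-z')
      rest=$(printf '%s' "$p" | cut -c4-)
      printf '/mnt/%s/%s\n' "$drive" "$rest"
      ;;
    *)
      # Not a drive-letter path: return it unchanged.
      printf '%s\n' "$p"
      ;;
  esac
}

# Usage (in WSL):
#   export VAGRANT="$(win_to_wsl_path 'C:\HashiCorp\Vagrant\bin\vagrant.exe')"
```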

Fedora

Follow the steps presented below to prepare the Linux host environment and create the two-node cluster:

1. Install essential tools for build and vagrant/virtualbox packages.

Example:

Adding hashicorp repo for most recent vagrant bits:

sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/fedora/hashicorp.repo
sudo dnf -y install vagrant

Installing packages:

sudo dnf install -y vagrant VirtualBox
sudo vagrant plugin install vagrant-reload vagrant-vbguest winrm winrm-elevated vagrant-ssh

2. Create /etc/vbox/networks.conf to set the network bits:

Example:

sudo mkdir /etc/vbox
sudo vi /etc/vbox/networks.conf

* 10.0.0.0/8 192.168.0.0/16
* 2001::/64
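If you prefer not to open an editor, the same two lines can be written non-interactively. This is a sketch under assumptions: write_vbox_networks is a hypothetical helper, and the target path is passed as an argument so it can be tried against a scratch file before touching /etc/vbox/networks.conf.

```shell
# Hypothetical helper: write the allowed host-only network ranges from this
# step into the given networks.conf file, creating its directory if needed.
write_vbox_networks() {
  target="$1"
  mkdir -p "$(dirname "$target")"
  cat > "$target" <<'EOF'
* 10.0.0.0/8 192.168.0.0/16
* 2001::/64
EOF
}

# Usage, from a root shell:
#   write_vbox_networks /etc/vbox/networks.conf
```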

3. Clone the repo and build

If you are building Kubernetes components from source, please follow the development guide.

git clone https://github.com/kubernetes-sigs/sig-windows-dev-tools.git
cd sig-windows-dev-tools
touch tools/sync/shared/kubejoin.ps1
make all

4. ssh to the virtual machines

  • Control Plane node (Linux):
vagrant ssh controlplane
kubectl get pods -A
  • Windows node:
vagrant ssh winw1

Goal

Our goal is to make Windows ridiculously easy to contribute to, play with, and learn about for anyone interested in using or contributing to the ongoing Kubernetes-on-Windows story. Windows is rapidly becoming an increasingly viable alternative to Linux thanks to the recent introduction of Windows HostProcess containers and Windows support for NetworkPolicies + Containerd integration.

Let's run it!

Ok let's get started...

1) Pre-Flight checks...

For the happy path, just:

  1. Start Docker so that you can build K8s from source as needed.
  2. Install Vagrant, and then the Vagrant plugins:
vagrant plugin install vagrant-reload vagrant-vbguest winrm winrm-elevated
  3. Modify CPU/memory in the variables.yml file. We recommend four cores and 8G+ for your Windows node if you can spare it, and two cores and 8G for your Linux node as well.

2) Run it!

There are two use cases for these Windows K8s dev environments: Quick testing, and testing K8s from source.

3) Testing from source? make all

To test from source, run vagrant destroy --force ; make all. This will:

  • destroy your existing dev environment (including removing the binaries folder)
  • clone down K8s from GitHub. If you already have the k/k repo locally, you can instead run make path=path_to_k/k all
  • compile the K8s proxy and kubelet (for linux and windows)
  • inject them into the Linux and Windows vagrant environment at the /usr/bin and C:/k/bin/ location
  • start up the Linux and Windows VMs

AND THAT'S IT! Your machines should come up in a few minutes...

NOTE: Do not run the intermediate Makefile targets on their own; they depend on running in sequence to give the full cluster experience.

IMPORTANT

Do not log into the VMs until the provisioning is done. That is especially true for Windows because it will prevent the reboots.

Other notes

If you still have an old instance of these VMs running for the same dir:

vagrant destroy -f && vagrant up

After everything is done (this can take 10+ minutes), SSH into the Linux VM:

vagrant ssh controlplane

and get an overview of the nodes:

kubectl get nodes

The Windows node might stay 'NotReady' for a while, because it takes some time to download the Flannel image.

vagrant@controlplane:~$ kubectl get nodes
NAME           STATUS     ROLES                        AGE    VERSION
controlplane   Ready      control-plane,controlplane   8m4s   v1.20.4
winw1          NotReady   <none>                       64s    v1.20.4

...

NAME           STATUS   ROLES                        AGE     VERSION
controlplane   Ready    control-plane,controlplane   16m     v1.20.4
winw1          Ready    <none>                       9m11s   v1.20.4
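To check for this state from a script rather than by eye, the node listing can be filtered. This is a small sketch, not part of the repo: not_ready_nodes is a hypothetical helper that reads kubectl get nodes output on stdin and prints every node whose STATUS column is not exactly Ready (so a value such as Ready,SchedulingDisabled is also reported).

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes` output on stdin; the first line is the header.
not_ready_nodes() {
  awk 'NR > 1 && NF >= 2 && $2 != "Ready" { print $1 }'
}

# Usage on the control plane node:
#   kubectl get nodes | not_ready_nodes
```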

Accessing the Windows box

You'll obviously want to run commands on the Windows box. The easiest way is to SSH into the Windows machine and use powershell from there:

vagrant ssh winw1
C:\ > powershell

Optionally, you can do this by noting the IP address during vagrant provision and running any RDP client (vagrant/vagrant for username/password, works for SSH). To run a command on the Windows boxes without actually using the UI, you can use winrm, which is integrated into Vagrant. For example, you can run:

vagrant winrm winw1 --shell=powershell --command="ls"

If you want to debug on the Windows node, you can also run crictl:

.\crictl config --set runtime-endpoint=npipe:////./pipe/containerd-containerd

Where we derived these recipes from

Contributing

Working on Windows Kubernetes is a great way to learn about Kubernetes internals and how Kubernetes works in a multi-OS environment.

So, even if you aren't a Windows user, we encourage Kubernetes users of all types to try to get involved and contribute!

We are a new project and we need help with...

  • contributing / testing recipes on different Vagrant providers
  • docs of existing workflows
  • CSI support and testing
  • privileged container support
  • recipes with active directory
  • any other ideas!

If nothing else, filing an issue with your bugs or experiences will be helpful long-term. If interested in pairing with us to do your first contribution, just reach out in #sig-windows (https://slack.k8s.io/). We understand that developing on Kubernetes with Windows is new to many folks, and we're here to help you get started.

sig-windows-dev-tools's People

Contributors

aravindhp, aroradaman, bplasmeijer, dougsland, ffromani, friedrichwilken, jayunit100, johnschnake, k8s-ci-robot, knabben, lappleapple, luckerby, luthermonson, lzhecheng, marosset, mloskot, perithompson, ridavid2002, shivaabhishek07, shivraj-nakum, sladyn98, swastik959

sig-windows-dev-tools's Issues

container runtime not honored, lets remove it from kubelet startup on windows

In the Windows kubelet startup, we try to specify the pause image, but there's no reason to do this:

--pod-infra-container-image=`"mcr.microsoft.com/oss/kubernetes/pause:1.4.1`" 
--enable-debugging-handlers 
--cgroups-per-qos=false 
--enforce-node-allocatable=`"`" 
--network-plugin=cni 
--resolv-conf=`"`" 
--log-dir=/var/log/kubelet 
--logtostderr=false 
--image-pull-progress-deadline=20m 
--container-runtime=remote 
--container-runtime-endpoint=npipe:////.//pipe//containerd-containerd"

Details on why this is a silly thing to do

... But when the kubelet itself runs with --container-runtime=remote, it doesn't actually honor that flag and will not use this option.

Hence the 1.4.0 image named in our current containerd config is the one that actually gets used.

$config = Get-Content "$global:ConainterDPath\config.toml"
$config = $config -replace 'bin_dir = (.*)$', 'bin_dir = "c:/opt/cni/bin"'
$config = $config -replace 'conf_dir = (.*)$', 'conf_dir = "c:/etc/cni/net.d"'
$config | Set-Content "$global:ConainterDPath\config.toml" -Force


So, I guess this is a long-winded way of saying: let's not use the `--pod-infra-container-image` flag at all :)

That is, in a running cluster, our Windows node has this in config.toml:
sandbox_image = "mcr.microsoft.com/oss/kubernetes/pause:1.4.0"
Hence there's a mismatch that could be confusing for folks long term.


This is a minor issue, but I figured it was worth noting in detail in case people search for it.
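To make the mismatch visible, the two values can be pulled out and compared. The sketch below assumes you have a copy of the node's config.toml and of the kubelet argument string available on a Linux or WSL host; both function names are hypothetical and not part of this repo.

```shell
# Extract the sandbox_image value from containerd config.toml content (stdin).
sandbox_image_from_toml() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p'
}

# Extract the value of --pod-infra-container-image from a kubelet argument
# string (stdin). Surrounding quotes and PowerShell backtick escapes are stripped.
pause_image_from_args() {
  tr ' ' '\n' | sed -n 's/^--pod-infra-container-image=//p' | tr -d '`"'
}

# Usage: compare the two results; differing tags reproduce the mismatch above.
#   [ "$(sandbox_image_from_toml < config.toml)" = "$(pause_image_from_args < kubelet-args.txt)" ] \
#     || echo "pause image mismatch"
```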

GH actions

  • vagrantfile lint
  • maybe try vbox install and up ? see how far it gets :)

WinRM Exec error....

weird winrm error that i just saw...

An error occurred executing a remote WinRM command.

Shell: Cmd
Command: hostname
Message: unknown type: 2577786541

add choco to image-builder

Now that we have made some progress with Windows image-builder, let's add Choco to it.

This allows us to take care of other items from our Windows wish list like:
guest tools
vim

More package proposals are welcome.

cloud ci ?

  • vagrant-azure provider
  • github actions vagrant probably not enough CPU
  • virtbox/vmw in nested cloud

windows-dev-tools vagrant boxes

This will require a long term owner - so hope someone wants to sign up for it, really cool project.

We need to build our own vagrant boxes, possibly w/ image-builder and packer.

These would then probably bootstrap and install a lot faster.

https://github.com/kubernetes-sigs/image-builder/tree/master/images/capi most likely can be used as the basis for how this is done .

To get started:

  • learn how image-builder works and try to run it locally. we have a tgik episode describing the process you can watch: https://www.youtube.com/watch?v=l3TWbrWkVzY
  • then try to see if you can run the image-builder windows capi recipes and publish a vagrant.box
  • then replace the vagrantfile box with your windows box, and remove some of the containerd-1.sh and containerd-0.sh scripts, i.e. the ones that are no longer needed
  • for bonus, look into preloading cni binaries for antrea-agent.exe or calico node.exe into the images for even faster spin up times
  • of course, since this is a dev environment, the goal isn't to make it super clean, but rather to document all the image-builder steps as part of this repo so developers can easily use it to build their own images

Increase local timeout for wsmanfault / restart

Just got this at the end of a full cycle of make all....


Shell: Powershell
Command: if ([System.Net.Dns]::GetHostName() -eq 'winw1') { exit 0 } exit 1
Message: [WSMAN ERROR CODE: 995]: <f:WSManFault Code='995' Machine='127.0.0.1' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message>The I/O operation has been aborted because of either a thread exit or an application request. </f:Message></f:WSManFault>
make: *** [vagrant-up] Error 1

figure out what pause image we should use?

  • gcr didnt have windows , but now it does
  • currently we send this argument... but somehow we try to run the wrong pause image i think...
    Warning FailedCreatePodSandBox 26m (x2 over 26m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "mcr.microsoft.com/oss/kubernetes/pause:1.4.0": failed to pull image "mcr.microsoft.com/oss/kubernetes/pause:1.4.0": failed to pull and unpack image "mcr.microsoft.com/oss/kubernetes/pause:1.4.0": failed to resolve reference "mcr.microsoft.com/oss/kubernetes/pause:1.4.0": failed to do request: Head "https://mcr.microsoft.com/v2/oss/kubernetes/pause/manifests/1.4.0": dial tcp: lookup mcr.microsoft.com: no such host
  • this is the current args we send on startup... so why is containerd ignoring the pause image ?
    $cmd = "C:\k\kubelet.exe $global:KubeletArgs --cert-dir=$env:SYSTEMDRIVE\var\lib\kubelet\pki --config=/var/lib/kubelet/config.yaml --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --hostname-override=$(hostname) --pod-infra-container-image=`"mcr.microsoft.com/oss/kubernetes/pause:1.4.1`" --enable-debugging-handlers --cgroups-per-qos=false --enforce-node-allocatable=`"`" --network-plugin=cni --resolv-conf=`"`" --log-dir=/var/log/kubelet --logtostderr=false --image-pull-progress-deadline=20m --container-runtime=remote --container-runtime-endpoint=npipe:////.//pipe//containerd-containerd"

Unable to create control plane

When creating a dev environment using a standard release, the control plane has the following error:

  controlplane: Hit:5 http://security.ubuntu.com/ubuntu bionic-security InRelease
    controlplane: Ign:6 https://apt.kubernetes.io kubernetes-xenial InRelease
    controlplane: Err:7 https://apt.kubernetes.io kubernetes-xenial Release
    controlplane:   Could not handshake: The TLS connection was non-properly terminated. [IP: 34.107.204.206 443]
    controlplane: Reading package lists...
    controlplane: E: The repository 'https://apt.kubernetes.io kubernetes-xenial Release' does not have a Release file.
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.

Switching to deb http://packages.cloud.google.com/apt/ kubernetes-xenial main seems to fix this
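As an illustration of that workaround, the source line can be rewritten in place. This is only a sketch: switch_k8s_apt_repo is a hypothetical helper, and which apt source file the provisioning script actually writes is an assumption, so the file path is passed in as an argument.

```shell
# Hypothetical helper: point a Kubernetes apt source file at
# packages.cloud.google.com instead of apt.kubernetes.io.
# $1 is the path of the .list file to rewrite (assumed, not taken from the repo).
switch_k8s_apt_repo() {
  list_file="$1"
  sed 's|https\{0,1\}://apt\.kubernetes\.io/\{0,1\}|http://packages.cloud.google.com/apt/|' \
    "$list_file" > "$list_file.tmp" && mv "$list_file.tmp" "$list_file"
}

# Usage, as root on the control plane node, followed by `apt-get update`:
#   switch_k8s_apt_repo /path/to/kubernetes.list
```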

KPNG Windows sandbox recipe

Kube-proxy makes firewall rules by using Go to make system calls to Windows, so this is all synergistic. Steps:

  • go through the kube-proxy windows directory and document the codepath and structs https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/winkernel
  • make a golang library that does the same low level HNS calls that it does
  • put that golang library to the test inside of one of your vagrantfiles, and add that recipe
  • make a vagrant recipe that explores those HNS rules by running the go program in two different windows VMs.

cc doug landgraff can help intro to kube proxy overall architecture if needed

Use my local k8s dev folder

Can we add a flag to skip cloning kubernetes if I have a local environment setup already? At the moment this clones k8s down but I already have a copy of this on my laptop
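One way such a switch could look in a build script is sketched below. The function name and its output format are hypothetical; the only thing taken from the README is the idea of passing a path to an existing checkout (make path=path_to_k/k all).

```shell
# Hypothetical sketch: decide whether to reuse a local Kubernetes checkout.
# Prints "use <dir>" when the argument is an existing directory, "clone" otherwise.
k8s_source_plan() {
  if [ -n "${1:-}" ] && [ -d "$1" ]; then
    printf 'use %s\n' "$1"
  else
    printf 'clone\n'
  fi
}

# Usage inside a build script:
#   case "$(k8s_source_plan "$path")" in
#     clone) git clone https://github.com/kubernetes/kubernetes.git ;;
#     *)     echo "reusing local checkout at $path" ;;
#   esac
```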

Windows pods are in CrashLoopBackOff

Investigate why Windows pods are in CrashLoopBackOff, logs attached:

vagrant@controlplane:~$ kubectl logs windows-server-iis-7985c648cc-gmtq9 

Success Restart Needed Exit Code      Feature Result                           
------- -------------- ---------      --------------                           
True    No             Success        {Common HTTP Features, Default Documen...
Invoke-WebRequest : The remote name could not be resolved: 
'dotnetbinaries.blob.core.windows.net'
At line:1 char:32
+ ... Web-Server; Invoke-WebRequest -UseBasicParsing -Uri 'https://dotnetbi ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:Htt 
   pWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShe 
   ll.Commands.InvokeWebRequestCommand
 
C:\ServiceMonitor.exe : The term 'C:\ServiceMonitor.exe' is not recognized as 
the name of a cmdlet, function, script file, or operable program. Check the 
spelling of the name, or if a path was included, verify that the path is 
correct and try again.
At line:1 char:311
+ ... ml>' > C:\inetpub\wwwroot\default.html; C:\ServiceMonitor.exe 'w3svc' ...
+                                             ~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (C:\ServiceMonitor.exe:String) [ 
   ], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

from Jay:

I guess it's failing because it does some weird DNS request, probably because that private network doesn't have public DNS. That's fine; maybe we can just remove the Invoke-WebRequest call from the IIS pod.

execute k.ps1 "remotely"

When I try to execute k.ps1 via winrm on a fresh vm I get this error:

vagrant winrm -c "powershell 'C:\sync\k.ps1'" winw1
powershell.exe : C:\sync\k.ps1 : The term 'C:\sync\k.ps1' is not recognized as the name of a cmdlet, function, 
    + CategoryInfo          : NotSpecified: (C:\sync\k.ps1 :...let, function, :String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ C:\sync\k.ps1
+ ~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (C:\sync\k.ps1:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
make: *** [Makefile:29: rerun] Error 1

Same happens when I log into Windows (no ssh or winrm, directly over the VirtualBox UI) and try to run k.ps1 from powershell:

PS C:\Windows\system32> powershell "C:\sync\k.ps1"
C:\sync\k.ps1 : The term 'C:\sync\k.ps1' is not recognized as the name of a cmdlet, function, script file, or operable
program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ C:\sync\k.ps1
+ ~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (C:\sync\k.ps1:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

I can see c:\sync from powershell:

PS C:\Windows\system32> cd C:\
PS C:\> ls
    Directory: C:\

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
...
d----l        4/17/2021   5:05 AM                sync
...

but I can't see its contents:

PS C:\> cd "C:\sync\"
PS C:\sync> dir
dir : Could not find a part of the path 'C:\sync'.
At line:1 char:1
+ dir
+ ~~~
    + CategoryInfo          : ReadError: (C:\sync:String) [Get-ChildItem], DirectoryNotFoundException
    + FullyQualifiedErrorId : DirIOError,Microsoft.PowerShell.Commands.GetChildItemCommand

This, however, changes when I simply navigate over to c:\sync\ with the Windows Explorer:

PS C:\sync> ls
ls : Could not find a part of the path 'C:\sync'.
At line:1 char:1
+ ls
+ ~~
    + CategoryInfo          : ReadError: (C:\sync:String) [Get-ChildItem], DirectoryNotFoundException
    + FullyQualifiedErrorId : DirIOError,Microsoft.PowerShell.Commands.GetChildItemCommand

PS C:\sync> ls


    Directory: C:\sync


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
------        4/17/2021   5:05 AM           5567 config
------        4/10/2021   7:12 AM            216 docker.ps1
------        4/17/2021   3:44 AM           4801 k.ps1

So it looks like the folder needs some 'refresh'. But I don't know yet how to pull that off.

The options to solve this I can see are:

  1. Learn how to 'refresh' the file system.
  2. Execute k.ps1 not from windows, but from the vagrantfile. This would require to also move the installation of docker to the vagrantfile. The problem with this is the provisioned reboot of windows from vagrant. It is extra work, but it would make the makefile obsolete which is the ultimate goal anyway: a single clean vagrantfile.
  3. Execute k.ps1 from the makefile. Dumping the content of k.ps1 into the makefile is not really an option because it a) creates a mess and b) requires to escape a lot of characters and that causes even more of a mess.

stop using the old Install-Containerd.ps1

We used to download and execute Install-Containerd.ps1. Since then, we have moved to our own version of that script, yet somewhere we still have a line that executes the downloaded version, which no longer exists. This causes a (non-breaking) error that makes it harder to scan the output for actually relevant error messages:

    winw1: C:\k\Install-Containerd.ps1 : The term 'C:\k\Install-Containerd.ps1' is not recognized as the name of a cmdlet,
    winw1: function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the
    winw1: path is correct and try again.
    winw1: At line:1 char:1
    winw1: + C:\k\Install-Containerd.ps1
    winw1: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    winw1:     + CategoryInfo          : ObjectNotFound: (C:\k\Install-Containerd.ps1:String) [], CommandNotFoundException
    winw1:     + FullyQualifiedErrorId : CommandNotFoundException
    winw1:

add antrea as a cni

Easy way to get a dev antrea instance running for CNI

  1. Make sure you use Antrea 0.13.2! It has a fix for a race condition in binding the vNIC in containerd.

  2. Set up Antrea on the Linux nodes:

kubectl apply -f https://github.com/antrea-io/antrea/releases/download/v0.13.2/antrea.yml

Now it's running on Linux...

  3. Now install on your Windows node:

mkdir -Force C:/nssm/
curl.exe -L -o nssm.zip https://k8stestinfrabinaries.blob.core.windows.net/nssm-mirror/nssm-2.24.zip
tar C $global:NssmInstallDirectory -xvf .\nssm.zip --strip-components 2 */$arch/*.exe
Remove-Item -Force .\nssm.zip

then enable testsigning

Bcdedit.exe -set TESTSIGNING ON
Restart-Computer

Then do this:

curl.exe -LO https://raw.githubusercontent.com/antrea-io/antrea/main/hack/windows/Install-OVS.ps1
.\Install-OVS.ps1 # Test-only
.\Install-OVS.ps1 -ImportCertificate $false -Local -LocalFile <PathToOVSPackage> # Production

.... Now, follow the steps here .... (some of these steps we already did above)
https://github.com/antrea-io/antrea/blob/main/docs/windows.md#installation-as-a-service-containerd-based-runtimes

Faster iteration via preloading Windows images

Pulling the windows images often takes a long time, slowing down iteration time.

We should establish a way to prepull the images locally on the host and then move/load them into the Windows machine.

Since you can't do something like docker pull... for Windows images when you're on Mac/Linux, we need to download them as tarballs and, once the Windows machine comes up, load that tarball of images.

[idea] host all windows bits on the linux node

  • Vagrant sync doesn't work on all platforms (i.e. requires rsync on AWS)
  • Vagrant sync is slow and locks up on VirtualBox
  • Windows does a lot of wgetting and that slows down load time, even though most of the time the
    stuff we load is the same (antrea-agent.exe, kube-proxy.exe, and so on) and doesn't need to be compiled (we support compilation, but it would be easier if we could just run without compilation)

ONE SOLUTION

if we just

  1. make a pod.yaml in /etc/kubernetes/manifests that runs python SimpleHTTPServer -p 80 /artifacts/
  2. with a hostPath volume mounted to /artifacts/

and then we have the windows node do

  1. to bootstrap antrea or calico, kube-proxy, and so on, on the Windows side:
    • curl.exe 10.20.30.40:80/artifacts/kube-proxy.exe
    • curl.exe 10.20.30.40:80/artifacts/kubelet.exe
    • ...

Then we could really easily run fast performant windows provisioning
... IT ALSO allows us to build a custom vagrant LINUX box that hosts all the WINDOWS artifacts,
thus allowing us to leverage ultra fast images that have all the stuff in them, preloaded.
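The Windows-side downloads in this proposal could be generated from a single artifact list. The sketch below only prints the commands; artifact_download_cmds is a hypothetical helper, and the base URL is the example address used above rather than a value taken from the repo's configuration.

```shell
# Print one curl.exe command per artifact, to be run on the Windows node.
# $1 is the base URL of the artifact server; remaining arguments are file names.
artifact_download_cmds() {
  base="${1%/}"   # drop one trailing slash, if present
  shift
  for name in "$@"; do
    printf 'curl.exe -LO %s/%s\n' "$base" "$name"
  done
}

# Usage:
#   artifact_download_cmds http://10.20.30.40:80/artifacts kube-proxy.exe kubelet.exe
```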

ALTERNATIVE

We could use a tool like imgpkg to bundle up all the Windows artifacts, and extract them using something like imgpkg extract gcr.io/friedrichwilkenasdfasdfs/my-windows-artifacts

add natural ordering to makefile and split out download

Anything we can do to make the explicit steps obvious will be nice for our overall UX as we add more customizations

all: 0-pull-kubernetes 1-build-binaries 2-vagrant-up

0-pull-kubernetes:
	
1-build-binaries:
	chmod +x build.sh
	./build.sh $(path)

2-vagrant-up:
	vagrant destroy -f && vagrant up

# 3-install-cni-antrea:
# 4-install-cni-calico:

5-e2e-test:
	sonobuoy run --e2e-focus=...
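With numbered target names, the intended order can be read straight out of the Makefile. A minimal sketch, assuming targets follow the <number>-<name>: pattern proposed here; ordered_targets is a hypothetical helper, not an existing make feature.

```shell
# List numbered targets ("<n>-<name>:") from a Makefile on stdin, in numeric order.
# Commented-out targets (lines starting with "#") and recipe lines are ignored.
ordered_targets() {
  sed -n 's/^\([0-9][0-9]*-[A-Za-z0-9_-]*\):.*$/\1/p' | sort -t- -k1,1n
}

# Usage:
#   ordered_targets < Makefile
```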

Install SSH on the node

Please can we add an ssh server to the windows box

The steps below are how Ansible installs SSH on Windows. They require a relatively up-to-date Server 2019 image. There is also a KB that enables this, but these steps should be enough to understand how to set it up.

Requires admin rights to install

- name: Install OpenSSH
  win_shell: Add-WindowsCapability -online -Name OpenSSH.Server~~~~0.0.1.0
  become: yes
  become_method: runas
  become_user: SYSTEM
  retries: 5
  delay: 3
  register: result
  until: result is not failed

- name: Set default SSH shell to Powershell
  win_regedit:
    path: HKLM:\SOFTWARE\OpenSSH
    state: present
    name: DefaultShell
    data: '{{ systemdrive.stdout | trim }}\Windows\System32\WindowsPowerShell\v1.0\powershell.exe'
    type: string

- name: Enable ssh login without a password
  win_shell: Add-Content -Path "$env:ProgramData\ssh\sshd_config" -Value "PasswordAuthentication no`nPubkeyAuthentication yes"

- name: Set SSH service startup mode to auto and ensure it is started
  win_service:
    name: sshd
    start_mode: auto
    state: started

figure out Why do our calico routes use pod IPs as the gateway ? (either vxlan or config errors)

This may be a VXLAN networking issue, but it seems like the Calico Linux
node can't contact the Calico Windows node because it's trying to route to
something using a 100.244.206.65 IP address.

kind

In a BGP cluster, the routes are easier to reason about:

root@calico-worker:/# ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.7
192.168.9.128 dev cali868d924bafb scope link
blackhole 192.168.9.128/26 proto bird
192.168.9.129 dev cali73d4c1d05a0 scope link
192.168.9.130 dev calif5885c8f5f0 scope link
192.168.9.131 dev calicb414295737 scope link
192.168.9.132 dev cali05890af621a scope link
192.168.9.133 dev cali4fb2b6d5479 scope link

192.168.88.0/26 via 172.18.0.6 dev eth0 proto bird #### <-- this makes sense :) 

But in our clusters, the networks don't really make sense:

windows

vagrant@controlplane:~$ sudo su
root@controlplane:/home/vagrant# ip route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.20.30.0/24 dev eth1 proto kernel scope link src 10.20.30.10
100.244.49.65 dev caliee300a6309c scope link
100.244.49.66 dev cali4bad977260a scope link
100.244.49.67 dev cali45019e4d786 scope link
100.244.49.68 dev cali2826d4d9b78 scope link
100.244.49.69 dev calie348bef9edc scope link
100.244.206.64/26 via 100.244.206.65 dev vxlan.calico onlink
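To isolate the route in question from a longer table, the ip route output can be filtered for entries that leave through the vxlan.calico device via a gateway. This is a small sketch; vxlan_calico_routes is a hypothetical helper name.

```shell
# From `ip route` output on stdin, print "<destination> <gateway>" for every
# route that goes through the vxlan.calico device via a gateway.
vxlan_calico_routes() {
  awk '$2 == "via" && $4 == "dev" && $5 == "vxlan.calico" { print $1, $3 }'
}

# Usage on the control plane node:
#   ip route | vxlan_calico_routes
```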

antrea CNI not setting default gateway correctly

Antrea CNI works by using the node IP address to find a device, and then tries to use that specific device to pick a gateway. That default gateway is the uplink for OVS.

However, it looks like it wants that device to have a destinationPrefix of 0.0.0.0/0, as a way to find the default gateway by then extracting the NextHop value.

So the question is: can we hardcode that NextHop value in cases like this repo's, where the node IP doesn't come up with a device that is directly connected to a default gateway?

Update: this is fixed now :) antrea-io/antrea@da17b5d

  • agent_windows.go: nodeConfig.NodeIpAddr.Ip → GetIPNetDeviceFromIP(..) → adaptor .index → GetDefaultGatewayByInterface(...) → default gateway string
  • Why is the GetDefaultGateway call failing ?
    • because the node ip address device doesn't have a destination prefix for the default gateway... 0.0.0.0/0 !
      • what device DOES have a 0.0.0.0 destination prefix ???
        • Can we hack or hardcode antrea to consume THAT device to be the thing as the uplink interface ?

add tolerations to smoke test

Pretty soon we'll be tainting our Windows nodes, and along the way we'll want to also tolerate them accordingly.

A good first issue is to add these tolerations into our smoke-test.yaml file:

spec:
  tolerations:
    - effect: NoSchedule
      key: os
      operator: Equal
      value: windows
    - effect: NoSchedule
      key: os
      operator: Equal
      value: Windows

Move to Server Core 2019

We should create a base image with hyperv (disabled hypervisor) and containers already setup on server core and vim installed as default for text editing to make the box as fast and small as possible

/kind feature

modularize cni provider

one possible solution:

  1. make a dir cni/ that is in .gitignore
  2. make a dir called cni_antrea/
  3. make a dir called cni_calico/
  4. make a dir called cni_flannel/

add a Makefile command that copies contents of 2/3/4 into cni/ on demand and have vagrant run the scripts in the cni folder on startup as a default.
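The copy step behind such a Makefile command might look like the sketch below. The directory names follow the proposal above; select_cni itself is hypothetical, and the example assumes the active cni/ directory can simply be replaced on each switch.

```shell
# Hypothetical sketch of the proposed command: copy cni_<provider>/ into cni/.
# $1 is the provider name (antrea, calico, flannel); $2 is the repo root (default ".").
select_cni() {
  provider="$1"
  root="${2:-.}"
  src="$root/cni_$provider"
  if [ ! -d "$src" ]; then
    echo "unknown CNI provider: $provider" >&2
    return 1
  fi
  # Start from an empty cni/ so files from a previous provider do not linger.
  rm -rf "$root/cni"
  mkdir -p "$root/cni"
  cp -R "$src/." "$root/cni/"
}

# Usage from the repo root:
#   select_cni calico
```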

Problem with vagrant-shell script

    winw1: Running: sync/kubejoin.ps1 as c:\tmp\vagrant-shell.ps1
    winw1: At C:\tmp\vagrant-shell.ps1:1 char:4
    winw1: +  --cri-socket="npipe:////./pipe/containerd-containerd"--cri-socket "n ...
    winw1: +    ~
    winw1: Missing expression after unary operator '--'.
    winw1: At C:\tmp\vagrant-shell.ps1:1 char:4
    winw1: +  --cri-socket="npipe:////./pipe/containerd-containerd"--cri-socket "n ...
    winw1: +    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    winw1: Unexpected token 'cri-socket="npipe:////./pipe/containerd-containerd"--cri-socket' in expression or statement.
    winw1:     + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    winw1:     + FullyQualifiedErrorId : MissingExpressionAfterOperator

I'm trying to get this to run on my Macbook but ran into #20 and also the above issue. Not certain yet how to resolve this one; I was hoping it would be apparent to someone more familiar with the codebase already. Otherwise I'm happy to dig into it a bit.

vagrant up controlplane fails on Windows machine

    controlplane: Removing insecure key from the guest if it's present...
    controlplane: Key inserted! Disconnecting and reconnecting using new SSH key...
==> controlplane: Machine booted and ready!
==> controlplane: Checking for guest additions in VM...
==> controlplane: Setting hostname...
==> controlplane: Configuring and enabling network interfaces...
==> controlplane: Mounting shared folders...
    controlplane: /var/sync/linux => C:/Users/jsturtevant/sig-windows-dev-tools/sync/linux
Vagrant was unable to mount VirtualBox shared folders. This is usually
because the filesystem "vboxsf" is not available. This filesystem is
made available via the VirtualBox Guest Additions and kernel module.
Please verify that these guest additions are properly installed in the
guest. This is not a bug in Vagrant and is usually caused by a faulty
Vagrant box. For context, the command attempted was:

mount -t vboxsf -o uid=1000,gid=1000,_netdev var_sync_linux /var/sync/linux

The error output from the command was:

: Invalid argument

I lowered the CPU/memory variables but otherwise left everything untouched:

linux_ram: 2048
linux_cpus: 2
windows_ram: 4096
windows_cpus: 2

Run dev build control plane

I would like to be able to run a cluster with an alpha build on the control plane as well as the Windows node. Can we create a way to run that from one command, like we can for Linux?

allow pluggable antrea

Part of #46 is making it possible to put a local copy of antrea-agent.exe and the Antrea CNI on the Windows node. That will make it easy to test Antrea modifications; specifically, we want to test the antrea/2344 patch, which allows specifying a network interface for pod communication.

Fix InternalIP value if possible for linux

Not sure how we create internal IPs; this is probably a kubelet metadata issue:

NAME     STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                    KERNEL-VERSION     CONTAINER-RUNTIME
master   Ready    control-plane,master   11m   v1.20.4   10.0.2.15     <none>        Ubuntu 20.04.2 LTS                          5.4.0-72-generic   docker://20.10.5
winw1    Ready    <none>                 83s   v1.21.0   10.20.30.11   <none>        Windows Server 2019 Datacenter Evaluation   10.0.17763.1879    containerd://1.4.1

Windows pods are slow to start

Investigate why Windows pods hang in ContainerCreating and only start running after ~15 minutes.

reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-wlzhm\" (UniqueName: \"kubernetes.io/projected/6471b4a8-1c66-41c2-8542-7e8436d1cf7c-kube-api-access-wlzhm\") pod \"windows-server-iis-7985c648cc-5xfhr\" (UID: \"6471b4a8-1c66-41c2-8542-7e8436d1cf7c\") "

... after ~15-20 min

fake_memory_manager.go:50] "Add container" pod="default/porter" containerName="porter" containerID="59c9c7636f5783556d8e6c20d24f080b311e9fca28624b8ba4d895e14816c0c5"
fake_topology_manager.go:43] "AddContainer" pod="default/porter" containerName="porter" containerID="59c9c7636f5783556d8e6c20d24f080b311e9fca28624b8ba4d895e14816c0c5"

NuGet, CheckNSSM, containerd optimizations

We want to be able to re-run provisioning continually, so let's:

  1. Check whether nuget is installed (maybe find nuget | something ?) and skip it if we can
  2. Check Get-Service *containerd and, if it is installed/running, skip containerd1 and containerd2
  3. Check whether nssm is installed and, if so, skip its setup

We estimate this will save us a lot of time.
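
The check-then-skip pattern in the list above can be sketched as follows. This is a Python illustration only: the actual provisioning scripts are PowerShell, and the `Step` type, the `provision` runner, and the `on_path` helper are hypothetical names introduced here.

```python
import shutil
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    """One provisioning step plus a cheap check for whether it already ran."""
    name: str
    already_done: Callable[[], bool]
    run: Callable[[], None]


def on_path(tool: str) -> bool:
    """One possible check: is the tool's executable already on PATH?"""
    return shutil.which(tool) is not None


def provision(steps: List[Step]) -> List[str]:
    """Run only the steps whose check fails; return the names that were run."""
    executed = []
    for step in steps:
        if step.already_done():
            print(f"skipping {step.name}: already set up")
            continue
        step.run()
        executed.append(step.name)
    return executed
```

A second call to `provision` with the same steps should then run nothing, which is what makes repeated provisioning cheap.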

CNI plugin not configured?

I know the Windows node takes time to come up, but so far it's been over an hour and it still reads as not ready:

NAME           STATUS     ROLES                  AGE   VERSION
controlplane   Ready      control-plane,master   78m   v1.21.0
winw1          NotReady   <none>                 68m   v1.21.2-rc.0

The reason is:

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
        message:Network plugin returns error: cni plugin not initialized

When I ssh'd into the winw1 box and checked the kubelet process, it was started with --network-plugin=cni. I don't usually tweak kubelet args or CNI configs, so I'm not sure whether the problem is within one of the listed config files or whether --node-ip= should have had an IP value and, since it was empty, is causing the problem.

"C:\k\kubelet.exe" --container-runtime=remote --container-runtime-endpoint=npipe:////./pipe/containerd-containerd --pod-infra-container-image=gcr.io/k8s-staging-ci-images/pause:3.4.1 --cert-dir=C:\var\lib\kubelet\pki --config=/var/lib/kubelet/config.yaml --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --hostname-override=winw1 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:1.4.1 --enable-debugging-handlers --cgroups-per-qos=false --enforce-node-allocatable= --network-plugin=cni --resolv-conf= --log-dir=/var/log/kubelet --logtostderr=false --image-pull-progress-deadline=20m --container-runtime=remote --container-runtime-endpoint=npipe:////.//pipe//containerd-containerd --node-ip=

FWIW this was from a cluster made using master as of today.

Missing logs

vagrant up seems to get the nodes up, but after 20+ minutes of provisioning I still ended up here:

    winw1: This may take several seconds if the vSwitch needs to be created.
    winw1: Waiting for Calico initialisation to finish...
    winw1: Waiting for Calico initialisation to finish...StoredLastBootTime , CurrentLastBootTime 20210802101406.813459-420
    winw1: Waiting for Calico initialisation to finish...StoredLastBootTime , CurrentLastBootTime 20210802101406.813459-420
    winw1: Waiting for Calico initialisation to finish...StoredLastBootTime , CurrentLastBootTime 20210802101406.813459-420
    winw1: Waiting for Calico initialisation to finish...StoredLastBootTime , CurrentLastBootTime 20210802101406.813459-420

And kubectl shows both nodes:

kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
controlplane   Ready    control-plane,master   31m   v1.21.0
winw1          Ready    <none>                 19m   v1.21.2-rc.0

Then when I run sonobuoy it starts running but I can't get logs or status info and the IIS server container seems to have a problem too:

$ kubectl logs windows-server-iis-7985c648cc-k8qnc
Error from server (NotFound): the server could not find the requested resource ( pods/log windows-server-iis-7985c648cc-k8qnc)
$ sonobuoy logs
error streaming logs from container [kube-sonobuoy]: the server could not find the requested resource ( pods/log sonobuoy)%

Originally posted by @johnSchnake in vmware-tanzu/sonobuoy#1320 (comment)

Replace Makefile w/ golang kow

The current flow of the code is:

        edit yaml file
        read makefile and pick a target
        run and make sure defaults aren't overriding your variables

What we want is:

kow generate --config=variables.yml # after this you can vagrant up manually
kow create --config=variables.yml
kow destroy --config=variables.yml

Implementation:

  • replace fetch.sh w/ golang cli
  • replace makefile w/ golang cli
  • remove ALL defaults and ALL alternate inputs from bash scripts
  • remove all SED and BASH nonsense
  • generate Vagrantfile
  • use kubernetes-sigs/sig-windows-tools#153 as the artifacts server
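
For illustration, the proposed command surface can be prototyped in a few lines. The sketch below is Python rather than the Go the issue calls for, the help strings are one reading of the commands above, and nothing here exists in the repo. The `--config` flag is required and has no default, in line with the "remove ALL defaults" goal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the proposed generate/create/destroy subcommands."""
    parser = argparse.ArgumentParser(prog="kow")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name, help_text in (
        ("generate", "write the Vagrantfile; vagrant up can then be run manually"),
        ("create", "bring the cluster up from the given config"),
        ("destroy", "tear the cluster down"),
    ):
        sub = subcommands.add_parser(name, help=help_text)
        # No default on purpose: the config file is the single source of inputs.
        sub.add_argument("--config", required=True, help="path to variables.yml")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"would run '{args.command}' with config {args.config}")
```

Omitting `--config` makes the parser exit with a usage error instead of silently falling back to a default.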
