
epoxy-images

Support for building Linux kernels, rootfs images, and ROMs for ePoxy

An ePoxy managed system depends on several image types:

  • generic Linux images that provide a minimal network boot environment.
  • stage1 images that embed node network configuration and are either flashed to NICs or burned to CDs.
  • stage3 Linux ROM update images that (re)flash iPXE ROMs to NICs.

Build Automation

The epoxy-images repo is connected to Google Cloud Build.

  • mlab-sandbox - a push to a branch matching sandbox-* builds cloudbuild.yaml & cloudbuild-stage1.yaml.
  • mlab-staging - a push to master builds cloudbuild.yaml & cloudbuild-stage1.yaml.
  • mlab-oti - a tag matching v[0-9]+.[0-9]+.[0-9]+ builds cloudbuild.yaml & cloudbuild-stage1.yaml.

Building images

See cloudbuild-stage1.yaml for current steps for stage1 images.

You can also run the build locally using docker.

docker build -t epoxy-images-builder .

docker run --privileged -e PROJECT=mlab-sandbox -e ARTIFACTS=/workspace/output \
  -v $PWD:/workspace -it epoxy-images-builder /workspace/builder.sh stage1_minimal

docker run --privileged -e PROJECT=mlab-sandbox -e ARTIFACTS=/workspace/output \
  -v $PWD:/workspace -it epoxy-images-builder /workspace/builder.sh stage1_mlxrom

docker run --privileged -e PROJECT=mlab-sandbox -e ARTIFACTS=/workspace/output \
  -v $PWD:/workspace -it epoxy-images-builder /workspace/builder.sh stage1_isos

Using an ISO, you should be able to boot the image using VirtualBox or a similar tool. If your ssh key is in configs/stage2/authorized_keys, and the VM is configured to attach to a Host-only network on the 192.168.0.0/24 subnet, then you can ssh to the machine at its address on that network.
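
For example, assuming the VM was assigned 192.168.0.2 on the Host-only network (both the address and the login user below are assumptions; they depend on the image and VM configuration):

ssh root@192.168.0.2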

Deploying images

The M-Lab deployment of the ePoxy server reads images from GCS. The cloudbuild steps deploy images to similarly named folders:

  • output/stage1_mlxrom/* -> gs://epoxy-mlab-sandbox/stage1_mlxrom/
  • output/stage1_isos/* -> gs://epoxy-mlab-sandbox/stage1_isos/
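
For reference, the equivalent manual copy with gsutil would look like the following; this is a sketch, and the actual cloudbuild steps may differ:

gsutil -m cp -r output/stage1_mlxrom/* gs://epoxy-mlab-sandbox/stage1_mlxrom/
gsutil -m cp -r output/stage1_isos/* gs://epoxy-mlab-sandbox/stage1_isos/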

BIOS & UEFI Support

The simpleiso command creates ISO images capable of booting on either BIOS or UEFI systems. BIOS systems use isolinux, while UEFI systems use grub. These images should also work with USB media.

Testing USB images

VirtualBox natively supports booting from ISO images and supports both BIOS and UEFI boot environments. To boot a VM from a USB image, we must first convert the raw USB disk image into a VirtualBox disk image (VDI):

VBoxManage convertdd boot.fat16.gpt.img boot.vdi --format VDI

Then select that image in the VM configuration.
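
The disk can also be attached from the command line; a minimal sketch, assuming a VM named "epoxy-test" with a storage controller named "SATA":

VBoxManage storageattach epoxy-test --storagectl SATA \
  --port 0 --device 0 --type hdd --medium boot.vdi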

Upgrading Kubernetes components

Upgrading Kubernetes components on platform nodes is a separate process from upgrading them in the API cluster, and it should always occur after the API cluster has been upgraded for a given project. The script ./setup_stage3_ubuntu.sh contains logic designed to enforce this requirement, but it is still worth mentioning here.

Upgrading Kubernetes components on platform nodes should be as simple as copying the values of the identically named config variables from the k8s-support repository into ./config.sh in this repository (a hypothetical example follows the list):

  • K8S_VERSION
  • K8S_CNI_VERSION
  • K8S_CRICTL_VERSION
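
For illustration, the relevant block in ./config.sh might look like the following (the version values shown are hypothetical; copy the real ones from k8s-support):

# Keep these in sync with the identically named variables in k8s-support.
K8S_VERSION=v1.26.5
K8S_CNI_VERSION=v1.2.0
K8S_CRICTL_VERSION=v1.26.0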

Once the version strings are updated and match those in the k8s-support repository, just follow the usual deployment path for epoxy-images, i.e., push to sandbox, create a PR, merge to master, and tag the repository. The Cloud Builds for this repository will generate new boot images with the updated Kubernetes components. In mlab-sandbox and mlab-staging, the newly built images will be automatically deployed to a node upon reboot. However, in production (mlab-oti) they will not be deployed without further action. See the following section for more details.

Configure ePoxy to use a newer image version

In order to deploy the new boot images to production, you will need to modify the ImagesVersion property of every ePoxy Host entity in Google Cloud Datastore (GCD) to match the tag name of the production release for this repository. This can be done using the epoxy_admin tool. If you don't already have it installed, install it with:

$ go get github.com/m-lab/epoxy/cmd/epoxy_admin

Once installed, you can update the ePoxy Host GCD entities in the mlab-oti project with a command like the following. NOTE: do not run this command against the mlab-sandbox or mlab-staging projects, as ImagesVersion is a static value in those projects and should always be "latest":

$ epoxy_admin update --project mlab-oti --images-version <tag> --hostname "^mlab[1-3]"

Trigger a rolling reboot

None of the nodes in any project will be running the updated images until they are rebooted. You can trigger a rolling reboot of all nodes in a cluster with a small shell command like the following:

$ for node in $(kubectl --context <project> get nodes | grep '^mlab[1-4]' | awk '{print $1}'); do \
    ssh $node 'sudo touch /var/run/mlab-reboot'; \
  done

The command above assumes you have ssh access to every platform node. It leverages the Kured DaemonSet running on the platform by creating the "reboot sentinel" file (/var/run/mlab-reboot) on every node, which tells Kured that a reboot is required. From there, Kured handles rebooting all flagged nodes in a safe way (one node at a time).

You can check the progress and/or completion of the upgrade by looking at the kubelet version for a node as reported by kubectl:

$ kubectl --context <project> get nodes


epoxy-images's Issues

COREOS_VERSION must be set in at least two locations, maybe three

Currently, the variable COREOS_VERSION must be set in at least two places, and possibly a third:

epoxy-images$ grep -r COREOS_VERSION= .
./cloudbuild.yaml:   - 'COREOS_VERSION=2079.6.0'
./.travis.yml:- COREOS_VERSION="2079.6.0"
./manual/manually_generate_images.sh:COREOS_VERSION="2079.6.0"

This means that every time we decide to update the CoreOS version, the person doing it has to remember to update it in three places, which is not ideal. We should figure out a way to define it in a single location.
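
One possible approach, sketched here, is to define the version once in a shared shell file that every build script sources (the file name is an assumption, and the YAML configs would still need a way to read the same value):

# versions.sh: single source of truth for the CoreOS version.
COREOS_VERSION="2079.6.0"

# In any build script:
source "$(dirname "$0")/versions.sh"
echo "Building images with CoreOS ${COREOS_VERSION}"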

epoxy-images deployment for images should preserve older versions

Currently, epoxy-images deployment overwrites old versions with the new version. This makes rollback impossible without a rebuild. Instead, the deployment step should make two copies: 1) to latest, 2) to the current version.

Boot scripts would reference 'latest', and the explicit version directory would serve as a backup (a sketch follows).
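
A sketch of the proposed two-copy deployment step (the bucket name and tag variable are placeholders):

# Copy once to 'latest' for boot scripts, and once to a versioned backup.
gsutil -m cp -r output/stage1_isos/* "gs://epoxy-mlab-oti/stage1_isos/latest/"
gsutil -m cp -r output/stage1_isos/* "gs://epoxy-mlab-oti/stage1_isos/${TAG_NAME}/"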

Add device & model names to kargs for epoxy

In the stage1-template.ipxe script (and however else we generate stage1 images for legacy hardware), we should encode some known hardware information into the kargs so that it is available to ePoxy during boot.

In particular, this could be helpful for updating the mlx rom. For example, if the kargs include:

epoxy.model=Mellanox epoxy.device=ConnectX3

Then the process that downloads and flashes the ROM could construct a URL to locate the latest image, e.g.

https://storage.googleapis.com/epoxy-mlab-staging/roms/{{.kargs.epoxy.model}}/{{.kargs.epoxy.device}}/{{.kargs.epoxy.hostname}}.mrom
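
A minimal shell sketch of how the flashing process might derive that URL from the kargs (the karg names mirror the template above; everything else is an assumption):

# Extract the epoxy.* values from the kernel command line.
model=$(tr ' ' '\n' < /proc/cmdline | sed -n 's/^epoxy\.model=//p')
device=$(tr ' ' '\n' < /proc/cmdline | sed -n 's/^epoxy\.device=//p')
host=$(tr ' ' '\n' < /proc/cmdline | sed -n 's/^epoxy\.hostname=//p')
curl --fail --output /tmp/update.mrom \
  "https://storage.googleapis.com/epoxy-mlab-staging/roms/${model}/${device}/${host}.mrom"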

IPv6 SLAAC and RAs appear to take down container IPv6

We recently discovered that IPv6 was down for a good number of sites, and we were never aware of it. Most of the issues are likely legitimate IPv6 issues with the transit providers. However, for one site, nbo01, something else is going on: IPv6 works to the host machine, but not to containers. Interestingly, rebooting a machine will cause IPv6 to work for about 30 minutes, then go down again. Further investigation revealed that the usual default route on nbo01 containers was getting removed, causing IPv6 to break. Digging into that revealed that this is likely an issue with SLAAC and RAs (Router Advertisements). There is a kernel setting, net.ipv6.conf.<iface>.accept_ra, which instructs the network stack to either accept or ignore RAs. It turns out that this setting is 0 in the default network namespace, but 1 in a container namespace. I don't know why this has seemingly affected only nbo01 thus far, but it is likely because most upstream providers have SLAAC disabled on the interface that connects the M-Lab equipment.
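
A sketch for inspecting the setting in each namespace (the interface name eth0 and the container PID variable are assumptions):

# In the default namespace (reportedly 0):
sysctl net.ipv6.conf.eth0.accept_ra
# Inside a container's network namespace (reportedly 1):
nsenter --target "${CONTAINER_PID}" --net sysctl net.ipv6.conf.eth0.accept_ra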

reboot-api-node.service systemd unit is failing

This unit is failing with errors like:

api-platform-cluster-us-west2-c reboot-api-node.sh[2973302]: {"level":"warn","ts":"2023-11-29T15:00:07.066Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003c0700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received"}

api-platform-cluster-us-west2-c reboot-api-node.sh[2973302]: Error: context deadline exceeded

After some investigation, I found that etcdctl was not getting the proper configuration from its environment. The necessary environment variables are set in /root/.bashrc, and the unit script sources /root/.profile, which itself should source /root/.bashrc. However, the first line of .bashrc checks whether the shell is running interactively and, if not, exits.
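
For reference, the standard guard at the top of an Ubuntu .bashrc looks like this (the exact form in our image may differ):

# If not running interactively, don't do anything.
case $- in
    *i*) ;;
      *) return ;;
esac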

This used to work, so something must have changed in etcd when we upgraded to a newer version of k8s. The underlying errors are TLS related; maybe etcd became more strict about TLS in some way.

What is confusing here is that the create-control-plane.sh script should be putting the necessary environment variables into both /root/.profile and /root/.bashrc, yet for some reason the variables are not present in .profile:

https://github.com/m-lab/epoxy-images/blob/main/configs/virtual_ubuntu/opt/mlab/bin/create-control-plane.sh#L292

I need to figure out why create-control-plane.sh is not getting the variables into .profile.

Servers won't boot if an Internet connection isn't immediately available

We have recently discovered that, when configuring a new site, the nodes do not have access to the Internet until the full switch configuration is applied. This might happen some time after remote hands have set up the boot sequence and powered on the server, due to timezone differences, after-the-fact communication from remote hands, or the Ops team not being immediately available.

Currently, the lack of an Internet connection means that boot fails, even if the connection becomes available later (e.g., when we apply the full switch configuration). Implementing some kind of retry logic, as sketched below, would simplify new site deployment significantly.
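
A hedged sketch of such retry logic in a boot script (the probe target and intervals are assumptions):

# Block until the network is reachable instead of failing the boot outright.
until curl --silent --output /dev/null --max-time 10 https://storage.googleapis.com/; do
    echo "No Internet connectivity yet; retrying in 30s..."
    sleep 30
done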

Make DRAC configuration steps more robust

We currently apply a basic DRAC configuration in stage1 and a full one during stage2. To apply these configurations we use ipmitool, which waits for confirmation from the DRAC after sending each command. For unknown reasons, on R640s some of the commands can take a long time to be confirmed, and ipmitool times out.

Both the stage1 and stage2 scripts should be modified to tolerate these transient failures by retrying each command several times (e.g., 10), as sketched below.
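
A hedged sketch of the retry wrapper (the example ipmitool command is a placeholder):

# Retry a single DRAC command up to 10 times before giving up.
for attempt in $(seq 1 10); do
    ipmitool lan set 1 ipsrc static && break
    echo "ipmitool attempt ${attempt} failed; retrying..."
    sleep 5
done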

Write the USB equivalent of mlx_update

We have an automated way of updating stage1 images for the R630s booting from Mellanox NICs (the mlx_update stage3 images), but we have no automated way of updating stage1 boot images on machines that boot from USB media. This needs to be addressed.

epoxy_client should construct URL for epoxy.mrom

Today, updaterom.sh performs the logic of parsing /proc/cmdline, finding the epoxy.mrom prefix URL, discovering the device model, and building a full URL to download a new mrom image before burning it.

The URL discovery and construction logic should be executed by epoxy_client instead. An epoxy_client action should be able to specify a URL variable such as:

{{kargs `epoxy.mrom_url`}}/{{.vars.device}}/{{.vars.hostname}}.mrom

epoxy_client would then download that image and pass the local file to a modified updaterom.sh script that accepts a ROM file as a parameter.

With this support, updaterom.sh can be replaced with flashrom.sh.

Update packages on each build of stage3 Ubuntu image

The debootstrap utility can only source files from a single apt repository. In our case, when generating machine images, we point it at the point-release repository. This means that the images built by epoxy-images will only ever contain packages at the versions they had at the time of the point release. For LTS releases of Ubuntu, this can mean we are running very old versions of packages with known bugs that have been fixed in updated versions.

The OS is configured to automatically install security updates, but not regular package updates. This is how we want it, but we need a mechanism to get updated packages from our base OS release onto the platform; one possible mechanism is sketched below.
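
A hedged sketch: upgrade packages inside the image chroot after debootstrap finishes (the chroot path and suite name are assumptions):

# Add the updates pocket and upgrade inside the image chroot.
chroot "${IMAGE_ROOT}" /bin/bash -c '
    echo "deb http://archive.ubuntu.com/ubuntu focal-updates main universe" \
        >> /etc/apt/sources.list
    apt-get update
    DEBIAN_FRONTEND=noninteractive apt-get --yes upgrade
'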

Create systemd unit to trigger node reboot when uptime is too long

For a long time we have talked about the possibility of regularly rebooting the entire fleet. In the past year we launched a new reboot management system called Kured. Kured runs as a pod on all machines (a DaemonSet), and watches for the existence of a configurable "sentinel" file. When that file exists, the service will queue the machine for a reboot.

Because Kured operates at the local filesystem level, we could write a simple systemd service (with an associated timer unit) that regularly (daily?) compares the machine's uptime to some configurable maximum (60 days?) and, if the uptime is greater, writes out the sentinel file, triggering a reboot. A sketch of the check follows.
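
A sketch of the check such a timer-driven service could run (the maximum uptime and sentinel path follow the discussion above; the rest is an assumption):

# Write Kured's reboot sentinel when uptime exceeds the configured maximum.
MAX_UPTIME_SECONDS=$((60 * 24 * 60 * 60))  # 60 days
uptime_seconds=$(cut -d. -f1 /proc/uptime)
if [ "${uptime_seconds}" -gt "${MAX_UPTIME_SECONDS}" ]; then
    touch /var/run/mlab-reboot
fi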

Consideration: automatically rebooting a node will pull in any new changes to the underlying ePoxy stage3 boot images. For example, if we are in the process of updating to a new Kubernetes version and merge those changes to the master branch, or tag the repository, before the API cluster is updated, then nodes will fail to join the cluster. This is just the first example that occurs to me. It means that we will need to be somewhat deliberate about the changes that get merged into master of this repo, and especially considerate about tagging the repo, knowing that those changes will be automatically applied to all nodes somewhere between 0 and, say, 60 days later.

Create versioned stage3 images and URLs

In a discussion with @stephen-soltesz today, we realized that deployment of stage3 images into production is currently condensed into a single, non-versioned step, but should probably be two.

Currently, when the epoxy-images repository is tagged, new stage1 and stage3 images are built and published to static, non-versioned locations in GCS. This means that the URL for images never changes, while the underlying file that backs the URL does. This can lead to confusion about what is actually being deployed and booting at any given time.

We should probably implement versioned stage3 image URLs, such that tagging epoxy-images builds versioned images (and URLs), and then a second step would be required to cause them to be deployed. That second step would be to somehow modify the GCD entity for every node to point to the new versioned URL.

Enforce that epoxy-images are not tagged before k8s master updates

One of the root causes of the pusher/TSO outage was that epoxy-images created new boot images that included later versions of k8s than the k8s master supported.

We should enforce that epoxy-images are only updated after a k8s master version is updated to support the client.

stage1 images should have a fall-back login mechanism

Today, on the legacy PLC platform, if a node fails to boot fully to PLC for some reason, it falls back to a "SafeBoot" mode, allowing an operator or PLC admin to log in and inspect the issue. However, our ePoxy stage1 images don't have this facility, though it does appear that this occurred to @stephen-soltesz when developing ePoxy.

One use case for this feature: if ePoxy boot fails for any reason, an operator can fix the issue with ePoxy, log in to the machine, and reboot it. Generally the reboot could be done via the DRAC, but at new sites the DRAC may not yet have been configured. This is probably another good use case for configuring DRAC basics during stage1 boot.

Add project to kernel cmdline

Now that ROMs can target ePoxy servers in different GCP projects, the project name should be accessible at runtime so that epoxy_client can automatically fetch images from the correct project's buckets.

In particular, the later stage actions should refer to URLs that reference the GCP project.

"files" : {
     "vmlinuz" : {
        "url" : "https://storage.googleapis.com/epoxy-{{get_karg `epoxy.project`}}/stage3_coreos/coreos_production_pxe.vmlinuz"
     },
     "initram" : {
        "url" : "https://storage.googleapis.com/epoxy-{{get_karg `epoxy.project`}}/stage3_coreos/coreos_custom_pxe_image.cpio.gz"
     }
 },

TCP pacing maxrate is currently set to the max uplink rate: revisit this setting

We are currently setting the TCP pacing maxrate to the uplink's maximum rate. This may be fine, but once Ethernet flow control (PAUSE frames) is turned off on all ports, we'll want to keep a close eye on discards at sites. We may want to reduce the pacing maxrate to something less than the uplink's capacity if we see discards.

stage1 builds are failing due to python missing from the image

+ python ./mlabconfig.py --format=server-network-config --sites https://siteinfo.mlab-sandbox.measurementlab.net/v2/sites/sites.json --physical --project mlab-sandbox --label project=mlab-sandbox --template_input /workspace/configs/stage1_mlxrom/stage1-template.ipxe --template_output '/tmp/stage1_scripts.WUWltK/stage1-{{hostname}}.ipxe'
/workspace/setup_stage1_mlxrom.sh: line 78: python: command not found

Need separate builds for stage1 images (longer) and generic images (faster)

Right now epoxy-images builds everything on commits and tags.

Instead, we want differential builds so that certain changes can build much faster:

  • build only generic images (e.g. stage3*)
  • build only stage1 images (e.g. roms, isos, usbs, etc)
  • build only new stage1 images (it's redundant to rebuild images that haven't changed)

fix-hung-shim.sh must recognize multiple shim processes

Recently, a rollout of the host DaemonSet stalled for multiple days because several machines had two containers hung during shutdown. Currently, fix-hung-shim.sh expects only a single shim process to hang; it needs to detect multiple hung shim processes.


For example:

$ docker ps | grep host
06ee5a8d016e        e4c310f63d8a    "/traceroute-caller …"   13 days ago         Up 13 days                              k8s_traceroute_host-cpzdj_default_dd0c6aab-3dbb-44aa-8ea1-3dd6fd2bf299_0
e38f71d2de57        c478ac8aa02b    "/bin/tcp-info -prom…"   13 days ago         Up 13 days                              k8s_tcpinfo_host-cpzdj_default_dd0c6aab-3dbb-44aa-8ea1-3dd6fd2bf299_0
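
A hedged sketch of iterating over every candidate shim process rather than just the first (the pgrep pattern and the per-process handling are assumptions):

# Handle every hung shim, not just the first match.
for pid in $(pgrep -f containerd-shim); do
    echo "Inspecting shim pid ${pid}..."
    # ...the existing hang-detection and cleanup logic would run here per pid...
done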

Create all-in-one ROM update images

Up to now all ROM update processes have relied on downloading a suitable ROM image over the network. In some cases, this may not be possible, such as if the NIC is non-operational due to a bad firmware update.

To support cases like this we need an all-in-one ROM update image that bundles the suitable ROMs in the ISO.

epoxy_client context deadline exceeded errors

On multiple occasions now, epoxy_client has received a context deadline exceeded error during download. When this happens, it interrupts the download and causes the client to fail.

This should almost never happen, or happen only after many failures.

Add epoxy_client to all images

Stage2 uses epoxy_client to boot the third stage.

The stage3 CoreOS and stage3 mlxupdate images should also include epoxy_client so that they can execute any final nextboot commands and report success (or failure).

Unpin linux-source-4.4.0 version

The latest linux-source versions, 4.4.0-109.132 and 4.4.0-108.131, do not build.

The cause appears to be fixed in 4.4.111, according to the ChangeLog.

However, that version is not yet available in xenial packages, so we will temporarily pin linux-source-4.4.0 to version 4.4.0-104.127.
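
For illustration, installing the pinned version explicitly might look like this (standard apt syntax; the repo may implement the pin differently):

sudo apt-get install --yes linux-source-4.4.0=4.4.0-104.127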

R640 CPU frequency set for "performance"

R640s should be configured to use the "performance" CPU frequency governor rather than the default "powersave".

After manually setting the "performance" governor on mlab3-hkg01 using:

sudo apt-get install cpufrequtils
for i in $(seq 0 31); do sudo cpufreq-set --cpu $i --governor performance; done

Shortly after, the very high 95th-percentile polling interval from tcpinfo became much closer to the median.


And the overall pod CPU usage and load dropped, without an apparent increase in CPU temperature.


Configure DRACs during stage1

Currently, DRAC configuration is done manually after the nodes are up and running CoreOS. However, if anything goes wrong during the boot process, we have no way of accessing the node and debugging the issue unless we involve remote hands.

We should configure DRACs as soon as possible, which probably means during stage1.
