metal3-io / metal3-docs
Architecture documentation that describes the components being built under Metal³.
Home Page: http://metal3.io
License: Apache License 2.0
Add a document briefly explaining the automated cleaning feature in BMO.
Describe:
/kind documentation
/help
As we discussed in the last meeting, Metal3 currently relies only on standalone Ironic, so all Ironic calls are initiated directly. @maelk has proposed an authentication mechanism based on SSL/TLS certificates, which needs to be configured outside BMO.
There are plenty of use cases where Ironic will not run standalone; instead, it will run as part of an OpenStack cloud, or alongside Keystone. I think we reached the point where we have to add Keystone support to BMO, so that each Ironic call is authenticated against Keystone based on the provided token.
This implementation only targets the non-standalone Ironic scenario.
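To make the requirement concrete, here is a sketch of the kind of Keystone credentials such a setup would have to consume, shown in the standard OpenStack clouds.yaml form. The cloud name and values are placeholders, and how BMO would actually mount and use this is an open design question:

```yaml
# Hypothetical example only: standard OpenStack clouds.yaml credentials
# that a Keystone-aware BMO could use to authenticate its Ironic calls.
clouds:
  metal3:                        # placeholder cloud name
    auth:
      auth_url: https://keystone.example.com:5000/v3
      username: bmo-service      # placeholder service account
      password: REDACTED
      project_name: metal3
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```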
Write the sections below for the Introduction to the BMO section of the user-guide book.
Describe:
/kind documentation
/help
Having a diagram to explain the inspection/provisioning workflow for Metal3 (similar to https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/bare_metal_provisioning/index#the-bare-metal-provisioning-service) would greatly help in understanding how it works.
A diagram would also help identify which part to troubleshoot when an issue arises in the workflow.
The Cluster API includes an optional Machine Health Check controller component that implements automated health checking, but the only remediation it offers is replacing the underlying infrastructure.
Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.
Other environments and components, such as KCP, also have specific remediation requirements, so there is a need for a generic mechanism for implementing custom remediation logic.
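For context, CAPI's hook for delegating remediation to the infrastructure provider is wired through the MachineHealthCheck resource. A sketch (names and timeouts are illustrative) of pointing it at a Metal3-specific remediation template could look like:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck       # illustrative name
spec:
  clusterName: my-cluster
  selector:
    matchLabels:
      nodepool: workers
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
  # Instead of deleting the Machine, hand remediation off to the
  # infrastructure provider's custom remediation controller,
  # which can e.g. power-cycle the host first.
  remediationTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3RemediationTemplate
    name: worker-remediation     # illustrative name
```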
Add a document briefly explaining the Node Reuse feature in CAPM3.
Describe:
/kind documentation
/help
Describe:
/kind documentation
/help
Write the sections below for the Introduction to the ip-address-manager section of the user-guide book.
Describe:
/kind documentation
/help
CAPI has recently adopted the use of Conditions to provide better visibility into the current state of the workload cluster. We should add similar functionality to CAPM3.
References:
CAEP: kubernetes-sigs/cluster-api#3017
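For reference, CAPI-style conditions are a list under the resource's status. A Metal3Machine adopting them might report something like the following (the condition types and reason here are illustrative, not an agreed design):

```yaml
# Illustrative sketch of CAPI-style conditions on a Metal3Machine status.
status:
  conditions:
    - type: Ready
      status: "False"
      severity: Warning
      reason: AssociateBMHFailed          # illustrative reason
      message: no available BareMetalHost matched the host selector
      lastTransitionTime: "2021-03-01T12:00:00Z"
    - type: KubernetesNodeReady           # illustrative condition type
      status: "True"
      lastTransitionTime: "2021-03-01T11:55:00Z"
```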
Add a document briefly explaining the Remediation feature in CAPM3.
Describe:
/kind documentation
/help
An administrator places into a BMH a PCIe card carrying a fully programmable device, where:
The device can act as a switch or as an offload network controller, and requires uploading an OS image to the PCIe-attached device on the host, after which the device is externally programmable/configurable.
The device is a GPU that requires uploading an OS image or a bootstrap image to the PCIe-attached device on the host, after which it can be programmed independently of its network connection.
Question: How shall we classify such dependent hosts and then bring them into full manageability?
Note: The key to this issue is that the host-like device is capable of being detected via BMH introspection.
Add a document briefly explaining the pivoting feature in CAPM3 and its requirements.
Describe:
/kind documentation
/help
For day-2 operations, users may want to make in-place changes to various host configurations. The configuration changes are in-place in the sense that they can be carried out without host reprovisioning or reboot.
This proposal seeks to extend the Metal3MachineTemplate with a new CRD for storing the in-place host configuration data. In particular, it seeks to (i) add the ability to look up the set of machines built from a specific Metal3MachineTemplate, and (ii) when the in-place host configuration data is updated by the user, pass this data, along with the set of machines impacted, to the in-place executors.
https://docs.google.com/document/d/1Ozby8U08WIor1xkqXTDVMyjRUpdGUArKOZTgNOQAOn4/edit?usp=sharing
NOTE: I'm keeping the proposal in google docs to allow for quicker iteration. I will move it into the /design section once we're close to the final design.
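As a rough illustration of the idea (all names here are hypothetical, pending the actual design in the doc above), the new CRD might carry the in-place data and point back at the template whose machines it applies to:

```yaml
# Hypothetical CRD sketch - the kind, group and fields are placeholders.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: Metal3HostConfig
metadata:
  name: workers-host-config
spec:
  # Template whose Machines should receive this configuration in place.
  templateRef:
    name: workers-metal3machinetemplate
  # Example day-2 settings that need no reprovisioning or reboot.
  ntp:
    servers:
      - pool.ntp.example.com
  sshAuthorizedKeys:
    - ssh-ed25519 AAAA... admin@example.com
```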
Instead of using (i)PXE, we could use the virtual media boot of some BMC drivers (such as Redfish) to lift some networking constraints on booting IPA. We could even extend that and investigate booting the target node via Redfish virtual media, to understand the constraints of such a solution. Another possibility would be to network-boot the nodes and use the local drives for persistent storage.
Add a document briefly explaining the External Inspection feature in BMO.
Describe:
/kind documentation
/help
Considering the self-managed cluster use case, Ironic would run on the target cluster, in host networking mode. Hence, all nodes would need physical access to the provisioning network in order to be able to host Ironic. However, by default, all traffic from the pods is NATed and routed, meaning that all pods could interact with Ironic.
Currently there is no authentication in Ironic, making this setup very insecure. We should investigate adding some authentication to Ironic. This issue is to keep track of securing the Ironic setup.
Add a document briefly explaining the Live ISO feature in BMO.
Describe:
/kind documentation
/help
In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster.
Rather than make some external process manage the size of MachineSets as BareMetalHosts come and go, we could create a small controller that (optionally) automatically ensures a MachineSet has a size equal to the number of matching BareMetalHosts.
The controller would be an additional controller in this project. It would watch MachineSets as its primary resource and, if they have a particular annotation, ensure that their size equals the number of matching BareMetalHosts. It would watch BareMetalHosts as a secondary resource.
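Opting in could be as simple as annotating the MachineSet. The annotation key below is a placeholder for whatever we settle on, not an existing API:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  name: workers
  annotations:
    # Placeholder annotation: asks the controller to keep replicas
    # equal to the number of matching BareMetalHosts.
    metal3.io/scale-to-matching-hosts: "true"
spec:
  replicas: 3   # continuously reconciled by the controller, not the user
```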
Thoughts?
Describe:
/kind documentation
/help
We just merged a provider spec definition. We should update the example Machine objects to reflect what is currently implemented.
Write a subsection of the user-guide explaining how to install cluster-api-provider-metal3.
/kind documentation
/help
Will it support provisioning VMs in ESXi and deploying Kubernetes on top?
The current IP Address Manager could be extended to support using Infoblox as a backend. Alternatively, create a new CRD and controllers on the same model for Infoblox IPAM: an InfobloxIPPool to back the IPClaim and provide the IPAddress objects.
There are use cases where certain information is inherently tied to the BMH but, at the same time, is valuable to the users/operators/schedulers in the Kubernetes workload cluster. For example:
As a user, I would like to place my workloads across hosts that are in different failure zones. For example, I may want replicas of a specific workload spread out across different server racks or geographical locations.
As a user, I would like to place my security sensitive workloads on machines that meet specific security requirements. For example, certain hosts may be in a fortified zone within a data center or certain hosts may have strong hardware based security mechanisms in place (e.g. hardware attestation via TPM).
In Kubernetes, labels on Node objects are the primary means of solving this problem. The ask here is for CAPM3 to synchronize a specific set of labels placed on a BMH object with labels on the corresponding Kubernetes Node running on that BMH. CAPM3 is already capable of mapping BMH<->Node, so a controller that keeps the labels in sync may be a straightforward addition. The synchronization would be limited to only a specific set of labels matching a certain prefix. For example, the user may specify my-prefix.metal3.io/ as their prefix (e.g. via a command-line flag). Labels placed on the BMH that match this prefix would be synchronized with the labels on the Kubernetes Node object. For example:
---
kind: BareMetalHost
metadata:
  name: node-0
  labels:
    my-prefix.metal3.io/rack: xyz-123
    my-prefix.metal3.io/zone: security-level-0
...
---
kind: Node
metadata:
  name: worker-node-0
  labels:
    my-prefix.metal3.io/rack: xyz-123
    my-prefix.metal3.io/zone: security-level-0
...
Proposal doc: https://docs.google.com/document/d/1qMCkggaLGQLHNPnEVGYjpabSrO-DTvSsr8uIW06W0ck/edit?usp=sharing
Related:
There is a related issue in the CAPI community, linked below. The proposal there is for CAPI to synchronize labels placed on MachineDeployment objects with the Kubernetes Nodes created from that deployment. While similar, they address different things. However, the proposed CAPI approach also uses prefixes to limit the scope of the synchronization.
CAPI: Support syncing a set of labels from MachineDeployment/MachineSet/Machine to Nodes
Currently, when deploying a node, the networking configuration (switches etc.) needs to be done beforehand and be ready for deployment. However, it would be great to be able to leverage available solutions such as SDN to perform the network configuration just in time, right before provisioning the node. This could be done in two steps: the first via hooks that would be called before provisioning, the second by creating new CRDs and a controller that would manage the network configuration. Each host would then have a network configuration object representing the configuration to apply before provisioning.
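The second step could introduce something like the following entirely hypothetical per-host object; the kind and every field are placeholders meant only to illustrate the shape:

```yaml
# Hypothetical sketch of a per-host network configuration CRD.
apiVersion: metal3.io/v1alpha1
kind: HostNetworkConfiguration    # placeholder kind
metadata:
  name: node-0-network
spec:
  hostRef:
    name: node-0                  # the BareMetalHost to configure
  # Switch configuration to apply just before provisioning.
  switchPorts:
    - switch: tor-switch-1        # placeholder switch identifier
      port: Ethernet1/3
      untaggedVlan: 101
      taggedVlans: [200, 300]
```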
When I tried out the 'try-it-out' I was presented with the following:
The make command is waiting on:
./04_verify.sh
Logging to ./logs/04_verify-2019-10-08-220516.log
OK - Network provisioning exists
OK - Network baremetal exists
OK - Kubernetes cluster reachable
OK - Fetch CRDs
OK - CRD baremetalhosts.metal3.io created
OK - CRD clusters.cluster.k8s.io created
OK - CRD machineclasses.cluster.k8s.io created
OK - CRD machinedeployments.cluster.k8s.io created
OK - CRD machines.cluster.k8s.io created
OK - CRD machinesets.cluster.k8s.io created
- Waiting for task completion (up to 2400 seconds) - Command: 'check_k8s_entity statefulsets cluster-api-controller-manager cluster-api-provider-baremetal-controller-manager'
OK - statefulsets cluster-api-controller-manager created
OK - cluster-api-controller-manager statefulsets replicas correct
OK - statefulsets cluster-api-provider-baremetal-controller-manager created
OK - cluster-api-provider-baremetal-controller-manager statefulsets replicas correct
- Waiting for task completion (up to 2400 seconds) - Command: 'check_k8s_entity deployments metal3-baremetal-operator'
OK - deployments metal3-baremetal-operator created
OK - metal3-baremetal-operator deployments replicas correct
OK - Replica set metal3-baremetal-operator created
OK - metal3-baremetal-operator replicas correct
OK - Fetch Baremetalhosts
OK - Fetch Baremetalhosts VMs
- Waiting for task completion (up to 2400 seconds) - Command: 'check_bm_hosts master-0 ipmi://192.168.111.1:6230 admin password 00:e8:69:bb:a0:99'
OK - master-0 Baremetalhost exist
OK - master-0 Baremetalhost address correct
OK - master-0 Baremetalhost mac address correct
OK - master-0 Baremetalhost status OK
OK - master-0 Baremetalhost credentials secret exist
OK - master-0 Baremetalhost password correct
OK - master-0 Baremetalhost user correct
OK - master-0 Baremetalhost VM exist
OK - master-0 Baremetalhost VM interface provisioning exist
OK - master-0 Baremetalhost VM interface baremetal exist
FAIL - master-0 Baremetalhost introspecting completed
expected ready, got inspecting
- Waiting for task completion (up to 2400 seconds) - Command: 'check_bm_hosts worker-0 ipmi://192.168.111.1:6231 admin password 00:e8:69:bb:a0:9d'
Indeed through kubectl, this is also the case:
$ kubectl get baremetalhosts -n metal3
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
master-0 OK inspecting ipmi://192.168.111.1:6230 true
worker-0 OK inspecting ipmi://192.168.111.1:6231 false
What could be the cause?
Write the sections below for the Introduction to the CAPM3 section of the user-guide book.
Describe:
/kind documentation
/help wanted
Following a Kubernetes-wide effort to remove terms such as "master", "slave", "whitelist" and "blacklist", we should take steps to remove this type of word from our repos. In Kubernetes repos, the GitHub "master" branches will be renamed to "main".
We should follow the movement and remove at least those four words with the following proposed replacements:
We should also rename all our master branches to main.
Write the sections below for the Introduction to the Metal3.io user-guide book.
I'm looking for a getting-started guide to get a cluster provisioned using Metalkube, but I can't find it anywhere. I wouldn't mind submitting a PR for one once I've figured out how to get the code to work.
Add a document briefly explaining the Automated Cleaning feature in CAPM3.
Describe:
/kind documentation
/help
Problem Statement:
The Airship community has raised a requirement: how will the user find out the cause of a failure on a host?
This is a use case for a failure detection mechanism, where the user can learn what has failed and then fix that particular issue on the node.
More broadly, there are many reasons a host can fail, either during introspection or in day-n scenarios. The scenarios are:
Disk errors - a disk goes into an error/failed state, or RAID creation fails.
PXE errors - the PXE NIC is not properly configured.
Management and power errors - the node cannot be powered on, or the node goes into recovery mode and is not accessible.
Boot/BIOS errors - boot fails because device settings are incorrect.
Currently, on the Ironic side, none of the above failures are put into the introspection data. I have investigated, but I never found this level of detail that would help the user identify the reason behind a failure.
Thinking through the above scenarios, we can introduce a classification for error detection, where hosts are classified based on specific failures.
For that, we need to write the entire failure detection mechanism in HCC. I will put my investigation and solution approach in a Google doc with more details.
Addresses this.
This has already been done in CAPM3 and IPAM. Check these repos for reference.
We should adopt a releasing approach similar to that of CAPI and follow the well-known semantic versioning guidelines (see references below).
Besides being good practice, integration with the redesigned clusterctl will require that projects follow the above-mentioned releasing strategy. A similar issue was created for capbm: metal3-io/cluster-api-provider-baremetal#224
PR Tracking:
Considering the stacked storage use case: during the upgrade of a CAPI cluster, I would like to make sure that the upgraded Metal3Machine lands on the same physical node, to be able to re-use its disks in a Ceph cluster, for example, without wiping them out.
Notice from Netlify
After November 15, 2022, builds for this site will fail unless you update the build image.
As reported via deploy logs:
DEPRECATION NOTICE: Builds using the Xenial build image will fail after November 15th, 2022.
The build image for this site uses Ubuntu 16.04 Xenial Xerus, which is no longer supported.
All Netlify builds using the Xenial build image will begin failing in the week of November 15th, 2022. To avoid service disruption, please select a newer build image at the following link:
https://app.netlify.com/sites/etcd/settings/deploys#build-image-selection
For more details, visit the build image migration guide:
https://answers.netlify.com/t/please-read-end-of-support-for-xenial-build-image-everything-you-need-to-know/68239
Hi, Metal3 team!
Currently we have encountered a bare metal specific use case.
One of our customers has a bare metal environment with multiple clusters deployed on it. Each cluster has a different priority - for example, one production cluster with high priority and another test cluster with low priority. Since all the physical machines have been consumed, ensuring the performance of the production cluster means scaling up by stealing nodes from the test cluster (ignoring the performance of the test cluster). The workflow includes determining whether the cluster needs to scale up, choosing a node in another cluster, removing the node from that cluster, reconfiguring the physical machine, and joining it to the target cluster.
Our customer wants to automate this workflow, and we think this can be done by combining Cluster Autoscaler, Cluster API and Metal3, although CA does not support Cluster API right now. But if it did, it could create a new machine based on the load, and we would need Metal3 to be able to handle the remaining tasks, i.e., provisioning a BareMetalHost for the new machine even when there are no available BareMetalHosts. So we want to add such a feature to Metal3: scaling up by grabbing BareMetalHosts consumed by other clusters. This feature assumes a multi-cluster environment in which only one cluster needs to scale up; the performance of the others can be ignored temporarily.
We are looking forward to discussion.
README.md has:
## Useful links
* [Quick start](http://metal3.io/try-it.html)
* [Demos](https://www.youtube.com/watch?v=VFbIHc3NbJo&list=PL2h5ikWC8viKmhbXHo1epPelGdCkVlF16&ab_channel=Metal3)
* [Blog posts](https://metal3.io/blog/index.html)
* [FAQ](https://metal3.io/blog/index.html)
Note that the "Blog posts" and "FAQ" links both go to the blog
Describe:
/kind documentation
/help
Add a document briefly explaining the Automatic Secure Boot feature in BMO.
Describe:
/kind documentation
/help
The page linked to in "How Ironic Works" -> "Read on Github" (https://github.com/metal3-io/metal3-docs/blob/master/design/how-ironic-works.md) on https://www.openstack.org/use-cases/bare-metal/ does not exist (404 not found).
As a user, I would like to upgrade the Kubernetes version of my target cluster and be able to upgrade the OS of my machines.
This is documented in the CAPI project here - https://cluster-api.sigs.k8s.io/tasks/kubeadm-control-plane.html?highlight=upgrade#upgrading-workload-clusters
This issue requests documentation for handling the Kubernetes version upgrade and OS upgrade using Metal3.
Add a document briefly explaining the Reboot annotation feature in BMO.
Describe:
/kind documentation
/help
Introduction:
In the metal3 baremetal-operator repository, the hardware RAID configuration of a bare metal host is being extended to include the 'physical disks' and 'controller' items. Here is the specification pull request:
https://github.com/metal3-io/metal3-docs/pull/148/files
When specifying the disk numbers, the user specifies a simple number like '0', '1', etc. However, this needs to be translated to vendor-specific disk names (without the user specifying the vendor), so that the vendor-specific disk names can be passed to Ironic (and eventually to the iDRAC driver) for RAID configuration.
It was found that the Ironic inspector output does not contain the disk names.
Proposal:
One solution for this issue is:
The Ironic iDRAC driver API can be extended to allow the operator to query the physical disk and RAID controller names. The following two APIs are proposed:
Disk name API Query:
/v1/nodes/{node-ident}/disks
Sample output:
[Disk.Bay.0:Enclosure.Internal.0-1:RAID.Integrated.1-1, ...]
RAID Controller name API Query:
/v1/nodes/{node-ident}/RAID-controller
Sample output:
[‘Boss-F1 RAID Controller’,...]
Can anyone suggest any other good way to resolve this issue?
Add a document briefly explaining the Detached Annotation feature in BMO.
Describe:
/kind documentation
/help
Describe:
/kind documentation
/help
LUKS provides a well-understood disk encryption methodology for Linux systems. Tang+Clevis provide a scalable methodology to support disk encryption unseal operations in an automated data center. MetalKube should incorporate support for disk encryption as part of the initial design.
Describe how to install IPAM in a k8s cluster and how to utilize it.
/kind documentation
/help