gluster / anthill
A Kubernetes/OpenShift operator to manage Gluster clusters
Home Page: http://gluster-anthill.readthedocs.io/
License: Apache License 2.0
Describe the feature you'd like to have.
Instead of having to set the size of the cluster manually (#11), it should be possible to have the cluster grow & shrink as necessary. This dynamic sizing should be subject to limits to contain costs.
What is the value to the end user? (why is it a priority?)
With manual sizing, the admin must constantly monitor the cluster and vary the number of nodes as storage usage changes. This requires a good amount of knowledge and a willingness to probe into the cluster's state to track free space. Instead, admins should be able to provide a min and max size for the cluster that they are willing to have, and the operator should dynamically size the cluster, trading off cost (large cluster) with spare capacity that is available for new volume allocations.
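For illustration only, a sketch of how such min/max bounds might be expressed in the cluster CR. None of these field names exist yet; the auto mode shown here is purely an assumption, mirroring the manual mode described in #11:

```yaml
# Hypothetical sketch only: an automatic-sizing mode with cost limits.
# Every field name below is an assumption, not part of any current CR schema.
spec:
  operatorMode:
    mode: auto            # hypothetical counterpart to operatorMode.mode: manual (#11)
    auto:
      minNodeCount: 3     # never shrink below this
      maxNodeCount: 9     # never grow beyond this (cost ceiling)
```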
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
We'll need to consider:
Describe the feature you'd like to have.
We need to be able to install/start the operator from a set of yaml descriptions.
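A rough sketch of what one manifest in such a set might look like follows; the namespace, labels, service account, and image names are assumptions rather than the project's actual manifests:

```yaml
# Minimal sketch of an operator Deployment; namespace, labels, service account,
# and image name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anthill-operator
  namespace: gluster
spec:
  replicas: 1
  selector:
    matchLabels:
      name: anthill-operator
  template:
    metadata:
      labels:
        name: anthill-operator
    spec:
      serviceAccountName: anthill-operator   # assumed service account name
      containers:
        - name: operator
          image: gluster/anthill:latest      # assumed image name
```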
What is the value to the end user? (why is it a priority?)
This provides a (first) method for starting the operator. Without this, we have nothing.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
This is the "get something off the ground" item. Once we get this, we'll have a running system (though useless).
Describe the feature you'd like to have.
We need to have automated E2E testing to make sure the operator continues to function correctly.
What is the value to the end user? (why is it a priority?)
Automated E2E testing will help keep master working and decrease the likelihood of regressions, leading to higher quality releases.
How will we know we have a good solution? (acceptance criteria)
On each commit/PR:
Additional context
Child of #5
Describe the feature you'd like to have.
The operator should be able to non-disruptively upgrade a Gluster cluster. When an admin changes the Gluster template, the operator should automatically roll out the change to the entire cluster. As it does so, it must ensure data volumes are available and healed such that restarting a pod does not cause a loss of availability nor split-brain situation.
What is the value to the end user? (why is it a priority?)
Gluster upgrades need to be carefully choreographed to maintain cluster health and data availability. Relying on an admin to carry out these steps is both time consuming and error-prone. Implementing this feature ensures that best practices will be followed during the upgrade while also freeing the admin.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Interacts with:
Describe the feature you'd like to have.
It should be possible to manually specify the number of Gluster nodes and the operator should ensure that number of nodes remains operational.
What is the value to the end user? (why is it a priority?)
This capability allows users to manually scale their gluster-based storage to meet their needs.
How will we know we have a good solution? (acceptance criteria)
By setting operatorMode.mode: manual and specifying operatorMode.manual.nodeCount: n, the user can change the size of the gluster cluster.
Increasing nodeCount causes additional nodes to be created and probed into the cluster.
Decreasing nodeCount causes one or more nodes to be drained of bricks, removed from the cluster (detach), and have the pod + backing PV deleted. (A hypothetical CR sketch appears at the end of this issue.)
Work items
Additional context
Depends on GD2 to drain the nodes of bricks based on node state tag
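A sketch of the manual-mode settings named in the acceptance criteria above; only operatorMode.mode and operatorMode.manual.nodeCount come from this issue, and the surrounding CR structure (kind, apiVersion, metadata) is an assumption:

```yaml
# Sketch of manual sizing in the cluster CR. Only the operatorMode fields are
# taken from this issue; the group/version and kind are assumptions.
apiVersion: operator.gluster.org/v1alpha1   # assumed group/version
kind: GlusterCluster                        # assumed kind
metadata:
  name: example-cluster
spec:
  operatorMode:
    mode: manual
    manual:
      nodeCount: 3   # raise to add and probe nodes; lower to drain and remove them
```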
Describe the feature you'd like to have.
Gluster pods should be able to float between nodes in response to failures. This includes support for more than one gluster pod per node. Note: the storage for the bricks must be able to move for this to be possible.
What is the value to the end user? (why is it a priority?)
Currently, Gluster is deployed onto specific nodes, and when a node fails, loses connectivity, or is taken down for some reason, the corresponding gluster pod remains down until the node is repaired. Users would like better availability for their data by allowing the gluster pod to restart elsewhere if the back-end storage can still be accessed. This also allows users to run multiple gluster clusters on the same set of storage nodes, lowering their minimum investment.
How will we know we have a good solution? (acceptance criteria)
Additional context
This requires a fixed identity for the pod that can travel with it (i.e., DNS name), and that ID must be used properly by CSI & GD2 peers. If a stable IP is also needed, there will probably have to also be a service per pod.
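If a stable IP does turn out to be required, a per-pod Service along these lines could provide one. This is a hedged sketch: the names, the StatefulSet-style pod-name selector, and the port are all assumptions:

```yaml
# Hypothetical per-pod Service giving one Gluster pod a stable DNS name and IP.
# Names, selector, and port are assumptions (the selector assumes the pods run
# in a StatefulSet, which labels each pod with its name).
apiVersion: v1
kind: Service
metadata:
  name: gluster-node-0
spec:
  selector:
    statefulset.kubernetes.io/pod-name: gluster-node-0   # selects exactly one pod
  ports:
    - name: glusterd2
      port: 24007
```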
Depends on:
Describe the feature you'd like to have.
When a gluster pod fails, kube will attempt to restart it; if it was a simple crash or other transient problem, this should be sufficient to repair the system (plus automatic heal). However, if the node's state becomes corrupt or is lost, it may be necessary to remove the failed node from the cluster and potentially spawn a new one to take its place.
What is the value to the end user? (why is it a priority?)
If a gluster node (pod) remains offline, the associated bricks will have a reduced level of availability & reliability. Being able to automatically repair failures will help increase system availability and protect users' data.
How will we know we have a good solution? (acceptance criteria)
Additional context
This relies on the node state machine (#17) and an as-yet-unimplemented GD2 auto-migration plugin.
Describe the feature you'd like to have.
The operator should properly secure all components (CSI, Gluster pods, etcd) at time of deployment.
The CR will contain a reference to a secret with a CA key pair. This key pair should be used to secure the Gluster cluster.
What is the value to the end user? (why is it a priority?)
In a kubernetes environment, pods can get traffic from arbitrary sources. In order to maintain the integrity of the infrastructure and properly protect user data, the operator should properly secure all components via TLS (or other supported/appropriate method)
How will we know we have a good solution? (acceptance criteria)
Additional context
Depends on:
Describe the feature you'd like to have.
The operator should maintain a pod disruption budget for the Gluster cluster pods to prevent voluntary disruptions from hurting service availability. After a Gluster pod is down for any reason, the data hosted on that pod will likely need to be healed before the next outage can be fully tolerated. Having a disruption budget will prevent kubernetes from voluntarily taking down a pod until the proper number are up and healthy.
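A minimal sketch of the kind of PodDisruptionBudget the operator might maintain; the selector labels and the minAvailable value are assumptions about what the operator would actually create:

```yaml
# Hypothetical PDB for a 3-node Gluster cluster; labels and minAvailable are
# assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gluster-cluster
spec:
  minAvailable: 2          # block voluntary evictions that would leave < 2 healthy pods
  selector:
    matchLabels:
      app: gluster-node    # assumed pod label
```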
What is the value to the end user? (why is it a priority?)
Users expect storage to be continuously available, through both planned and unplanned events. Having properly maintained disruption budgets will prevent voluntary events (upgrades, etc.) from causing outages.
How will we know we have a good solution? (acceptance criteria)
Additional context
This item will need some investigation (and may not actually be usable):
Describe the feature you'd like to have.
What new functionality do you want?
The gluster operator will create a k8s Prometheus object file, which can be applied using an oc/kubectl create -f <file> command. It will also create a YAML for a Grafana dashboard that can be applied to see the Gluster-related dashboards.
Right now we do this manually using a shell script in the gluster-mixins project.
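One possible shape for such an object is a ServiceMonitor from the Prometheus Operator API; whether the operator would emit this particular kind is an assumption, as are the labels and port name below:

```yaml
# Hypothetical ServiceMonitor for scraping Gluster metrics; all names, labels,
# and the port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gluster-metrics
spec:
  selector:
    matchLabels:
      app: gluster-node    # assumed label on the Gluster metrics Service
  endpoints:
    - port: metrics        # assumed port name
      interval: 30s
```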
What is the value to the end user? (why is it a priority?)
How would the end user gain value from having this feature?
How will we know we have a good solution? (acceptance criteria)
Add a list of criteria that should be met for this feature to be useful
Additional context
Add any other context or screenshots about the feature request here.
Describe the feature you'd like to have.
The operator should properly deploy a given version of the CSI-based gluster file volume driver.
What is the value to the end user? (why is it a priority?)
The operator should be the single point of contact for users and admins to manage a Gluster deployment. As such, starting the operator and creating the "cluster CR" should be everything needed to be up and running.
How will we know we have a good solution? (acceptance criteria)
Additional context
This depends on having a CSI driver: https://github.com/gluster/gluster-csi-driver
Describe the feature you'd like to have.
It should be possible to install the operator as a part of OpenShift installation, either via openshift-ansible or whatever other method OpenShift supports.
What is the value to the end user? (why is it a priority?)
It is important that we make Gluster storage easy to install (and do our best to at least maintain parity w/ the current workflow).
How will we know we have a good solution? (acceptance criteria)
Describe the feature you'd like to have.
We need to have a design for:
What is the value to the end user? (why is it a priority?)
Users expect to be able to deploy the latest version of GCS and have the system upgrade automatically. This includes proper sequencing of the upgrade process and multi-step upgrades where necessary. The user needs to be able to set the version and walk away, because they will not typically have the in-depth knowledge to perform a manual system upgrade.
How will we know we have a good solution? (acceptance criteria)
Additional context
The requirements above come from the behavior of OLM. OLM upgrades the operator by stepping through its versions. However, it does not have the ability to wait for underlying resources to be upgraded. As soon as the operator's Deployment becomes ready, it is subject to upgrade if a new version is available. This means the current version of the operator may see an arbitrarily old version of the Gluster cluster and must be able to upgrade it.
Describe the feature you'd like to have.
The operator should be able to deploy the infrastructure necessary to use gluster-block volumes.
What is the value to the end user? (why is it a priority?)
Gluster-block volumes are commonly used for metadata-intensive workloads as well as metrics and logging. Given how common these workloads are, users should have access to gluster-block volumes when deploying a Gluster cluster through the operator.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Add any other context or screenshots about the feature request here.
Describe the feature you'd like to have.
Each commit/PR should be checked via Travis.
What is the value to the end user? (why is it a priority?)
Having linting and unit tests provide a minimum level of assurance for new code submissions, helping to keep quality high.
How will we know we have a good solution? (acceptance criteria)
Each Commit and PR should be checked:
Additional context
Child of #5
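A minimal .travis.yml sketch for the checks described above, assuming standard Go tooling; the Go version and exact lint/test commands are assumptions, not the project's actual CI configuration:

```yaml
# Minimal .travis.yml sketch; the Go version and commands are assumptions.
language: go
go:
  - "1.11"
script:
  - go vet ./...     # basic static analysis / lint
  - go test ./...    # unit tests
```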
Describe the feature you'd like to have.
It should be possible to deploy a Gluster cluster via a Service Catalog.
What is the value to the end user? (why is it a priority?)
Using the service catalog, we can provide a "point-and-click", menu-driven method for installing a Gluster cluster... Answer a few questions and be on your way. Not all admins are interested in editing yamls to configure their storage, and this can provide a way to guide them to a working storage system in a "wizard-like" manner.
How will we know we have a good solution? (acceptance criteria)
Additional context
I view this as just a wrapper around the yaml method. I think it would be beneficial for maintainability to keep the service catalog specific code/infra as light as possible.
Describe the feature you'd like to have.
When a user removes the top-level CR that defines a Gluster cluster, the operator should properly clean up the Gluster resources.
What is the value to the end user? (why is it a priority?)
An end-user should be able to remove Gluster just as easily as they installed it. With this proposal, removal would be:
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Describe the feature you'd like to have.
We need a CI system to ensure master remains healthy as development progresses.
What is the value to the end user? (why is it a priority?)
Having a healthy master means that users don't need to search for a "stable build" to use, and we will be able to release new features faster with higher quality than would otherwise be possible.
How will we know we have a good solution? (acceptance criteria)
Work items
Describe the bug
Naming this repository just "operator" feels wrong; why isn't this called "gluster-operator" instead?
Steps to reproduce
Steps to reproduce the behavior:
Actual results
The repository is named just "operator"
Expected behavior
The repository is named "gluster-operator"
Additional context
See examples of other operator projects named in the same way:
Describe the feature you'd like to have.
It should be possible to monitor the Gluster cluster via prometheus
What is the value to the end user? (why is it a priority?)
Prometheus is the monitoring solution for Kubernetes, and as such, admins expect it to be the place to go for metrics about the services running in their infra. By providing statistics about the cluster, admins can be assured things are working well, or they know that they need to dig deeper.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
We need to consider what data should come from the operator and what should come from the Gluster pods themselves or the CSI driver.
Describe the feature you'd like to have.
We currently support specifying node affinity to constrain where pods are placed. We should also support Tolerations so that our pods can be placed on dedicated nodes if desired.
What is the value to the end user? (why is it a priority?)
An admin may want to have dedicated kube worker nodes for the gluster storage server pods. This requires 2 things to happen:
In order to implement this, the admin would apply a NoSchedule taint to the designated nodes, preventing pods from being scheduled there, effectively implementing (1). To achieve (2), both the existing NodeAffinity (to direct the pods to these nodes) and a toleration on the pods (to permit them to ignore the taint) are required.
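For illustration, a hypothetical excerpt of the cluster CR showing a tolerations list that the operator would copy onto the Gluster pods. The taint key and value are assumptions; the toleration fields themselves follow the standard corev1.Toleration schema:

```yaml
# Hypothetical CR excerpt: a toleration matching a NoSchedule taint on the
# dedicated storage nodes. The key and value are assumptions.
spec:
  tolerations:
    - key: "storage"
      operator: "Equal"
      value: "gluster"
      effect: "NoSchedule"   # lets the pods ignore the NoSchedule taint
```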
How will we know we have a good solution? (acceptance criteria)
It should be possible to run gluster pods on a dedicated set of nodes via the combination of taint/toleration and node affinity.
Additional context
This requirement arose based on a discussion of how machinesets will be used with workloads requiring dedicated resources in OpenShift.
Major work items:
Add a tolerations field ([]*corev1.Toleration) to both CRDs.
Describe the feature you'd like to have.
It should be possible to remove or replace a specific Gluster pod, maximally preserving affected data's resiliency.
What is the value to the end user? (why is it a priority?)
In certain deployment scenarios, it may be necessary to decommission an admin-specified Gluster node. Examples include:
How will we know we have a good solution? (acceptance criteria)
Additional context
Requires state machine #17
Describe the feature you'd like to have.
The operator needs to be aware of failure domains so that it can:
What is the value to the end user? (why is it a priority?)
Users want to be able to control how their data is spread relative to failure boundaries. For example, they may want a R3 volume to use 3 different domains so that an infrastructure outage does not affect the storage. They may also want to co-locate their storage with their workload to increase performance, since crossing infrastructure boundaries tends to increase latency & decrease bandwidth.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Dependencies:
Describe the feature you'd like to have.
The operator should be automatically deployed into a continuous testing environment to help with hardening of the code.
What is the value to the end user? (why is it a priority?)
Deploying the latest version of the operator into a live testing environment will allow faster feedback and higher quality code, resulting in a better, more robust product for the final user.
How will we know we have a good solution? (acceptance criteria)
Additional context
Implementing this will require a separate effort to develop something similar to Chaos Monkey for our system.
Describe the feature you'd like to have.
We need to have a set of well organized documentation for both users and developers.
What is the value to the end user? (why is it a priority?)
Well organized documentation would make it easy for users to get started using the operator to deploy a Gluster cluster. They should also be able to read about how to perform day 2 operations and troubleshoot their installation.
In addition to end-user documentation, new developers can use the resource as a friendly way to learn how to contribute to the project.
How will we know we have a good solution? (acceptance criteria)
Additional context
My suggestion is to use readthedocs for documentation
Describe the feature you'd like to have.
The operator should log all of its actions to stdout to help diagnose problems.
What is the value to the end user? (why is it a priority?)
While the operator is intended to automate the management of a Gluster cluster, its actions must be both observable and understandable. An admin will hesitate to allow automated management if it is difficult to understand what the operator is doing and why.
How will we know we have a good solution? (acceptance criteria)
Additional context
I expect this to be the main source of data for answering "why did it (not) do that?" This will become increasingly important as the sophistication of the operator's algorithms increases.
Describe the feature you'd like to have.
Determine a set of RBAC roles that are appropriate for the gluster operator ecosystem. This includes access that:
What is the value to the end user? (why is it a priority?)
Admins need to be able to properly secure their cluster, both to prevent accidental changes as well as to prevent malicious actors from exploiting the system. A security conscious admin would like to know what permissions are required and why.
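As one hedged illustration, a fragment of the sort of ClusterRole the operator itself might need; the CRD API group and resource names are assumptions:

```yaml
# Hypothetical ClusterRole fragment for the operator; the CRD group and
# resource names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: anthill-operator
rules:
  - apiGroups: ["operator.gluster.org"]   # assumed CRD group
    resources: ["glusterclusters"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```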
How will we know we have a good solution? (acceptance criteria)
Additional context
Child of #6
Describe the feature you'd like to have.
Build the initial skeleton of the operator. It should start and watch the defined CRs.
What is the value to the end user? (why is it a priority?)
With this, the project has an initial instance of a thing that can be run e2e.
How will we know we have a good solution? (acceptance criteria)
Describe the feature you'd like to have.
Gluster storage can currently be deployed either with gluster running in pods (converged) or using an external gluster cluster that lives outside the kubernetes cluster. The operator should have an operatorMode.mode: external setting for the "non-converged" (i.e., external) gluster deployment.
The functionality of such a mode of operation is limited to just managing the CSI driver.
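A sketch of a cluster CR using this external mode; only operatorMode.mode: external comes from this issue, and everything else about the schema is an assumption:

```yaml
# Sketch of external mode; only operatorMode.mode is taken from this issue.
apiVersion: operator.gluster.org/v1alpha1   # assumed group/version
kind: GlusterCluster                        # assumed kind
metadata:
  name: external-gluster
spec:
  operatorMode:
    mode: external
    external:
      nodes:              # hypothetical: endpoints of the existing external cluster
        - 192.168.10.11
        - 192.168.10.12
```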
What is the value to the end user? (why is it a priority?)
By supporting this "external" mode, users can continue to use separate Gluster deployments in their environment instead of forcing everyone using gluster to deploy it in a converged manner.
How will we know we have a good solution? (acceptance criteria)
Additional context
Describe the feature you'd like to have.
The operator should be able to upgrade the CSI driver associated with the Gluster cluster. It should also ensure version compatibility before performing the upgrade.
What is the value to the end user? (why is it a priority?)
The operator is meant to be the single point of contact for the user to manage the Gluster storage. In keeping with that aim, the user should be able to upgrade the storage system by updating the image tag for the CSI driver associated with the cluster.
How will we know we have a good solution? (acceptance criteria)
Additional context
Add any other context or screenshots about the feature request here.
It's time to get this repo up and running:
Describe the feature you'd like to have.
It should be possible to enable VDO in the Gluster pods to get better storage efficiency.
What is the value to the end user? (why is it a priority?)
By increasing storage efficiency, we can decrease the cost of the storage infrastructure.
How will we know we have a good solution? (acceptance criteria)
Additional context
This depends on VDO support in both the gluster containers and GD2
Describe the feature you'd like to have.
A number of features require the operator and GD2 to coordinate their actions, such as when decommissioning or upgrading a gluster node. This coordination can be handled via a state machine that is represented by a (GD2-level) metadata tag that is applied to gluster nodes. This issue is to fully document the states, allowed transitions, actors, and permissible actions in each state.
What is the value to the end user? (why is it a priority?)
Without proper coordination, the operator may cause a user's data to be destroyed or become unavailable.
How will we know we have a good solution? (acceptance criteria)
Additional context
This is used by a number of features, including: #11 #13 #14
Describe the feature you'd like to have.
We need to have a well-thought out design for our custom resources so that the project's capabilities can grow over time while minimizing the need for breaking changes to the CRs.
What is the value to the end user? (why is it a priority?)
Avoiding breaking changes will allow the end user to more easily upgrade their system, and it will also permit implementers to build out the system without significant rework of kube-based state.
How will we know we have a good solution? (acceptance criteria)
Additional context
Child of #6