gluster / anthill
A Kubernetes/OpenShift operator to manage Gluster clusters
Home Page: http://gluster-anthill.readthedocs.io/
License: Apache License 2.0
Describe the feature you'd like to have.
Instead of having to set the size of the cluster manually (#11), it should be possible to have the cluster grow & shrink as necessary. This dynamic sizing should be subject to limits to contain costs.
What is the value to the end user? (why is it a priority?)
With manual sizing, the admin must constantly monitor the cluster and vary the number of nodes as storage usage changes. This requires a good amount of knowledge and a willingness to probe into the cluster's state to track free space. Instead, admins should be able to provide a min and max size for the cluster that they are willing to have, and the operator should dynamically size the cluster, trading off cost (large cluster) with spare capacity that is available for new volume allocations.
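For illustration only, a sketch of how such min/max bounds might be expressed in the cluster CR. None of these field names exist yet; the auto mode shown here is purely an assumption, mirroring the manual mode described in #11:

```yaml
# Hypothetical sketch only: an automatic-sizing mode with cost limits.
# Every field name below is an assumption, not part of any current CR schema.
spec:
  operatorMode:
    mode: auto            # hypothetical counterpart to operatorMode.mode: manual (#11)
    auto:
      minNodeCount: 3     # never shrink below this
      maxNodeCount: 9     # never grow beyond this (cost ceiling)
```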
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
We'll need to consider:
Describe the feature you'd like to have.
We need to be able to install/start the operator from a set of yaml descriptions.
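A rough sketch of what one manifest in such a set might look like follows; the namespace, labels, service account, and image names are assumptions rather than the project's actual manifests:

```yaml
# Minimal sketch of an operator Deployment; namespace, labels, service account,
# and image name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anthill-operator
  namespace: gluster
spec:
  replicas: 1
  selector:
    matchLabels:
      name: anthill-operator
  template:
    metadata:
      labels:
        name: anthill-operator
    spec:
      serviceAccountName: anthill-operator   # assumed service account name
      containers:
        - name: operator
          image: gluster/anthill:latest      # assumed image name
```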
What is the value to the end user? (why is it a priority?)
This provides a (first) method for starting the operator. Without this, we have nothing.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
This is the "get something off the ground" item. Once we get this, we'll have a running system (though useless).
Describe the feature you'd like to have.
We need to have automated E2E testing to make sure the operator continues to function correctly.
What is the value to the end user? (why is it a priority?)
Automated E2E testing will help keep master working and decrease the likelihood of regressions, leading to higher quality releases.
How will we know we have a good solution? (acceptance criteria)
On each commit/PR:
Additional context
Child of #5
Describe the feature you'd like to have.
The operator should be able to non-disruptively upgrade a Gluster cluster. When an admin changes the Gluster template, the operator should automatically roll out the change to the entire cluster. As it does so, it must ensure data volumes are available and healed such that restarting a pod does not cause a loss of availability nor split-brain situation.
What is the value to the end user? (why is it a priority?)
Gluster upgrades need to be carefully choreographed to maintain cluster health and data availability. Relying on an admin to carry out these steps is both time consuming and error-prone. Implementing this feature ensures that best practices will be followed during the upgrade while also freeing the admin.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Interacts with:
Describe the feature you'd like to have.
It should be possible to manually specify the number of Gluster nodes and the operator should ensure that number of nodes remains operational.
What is the value to the end user? (why is it a priority?)
This capability allows users to manually scale their gluster-based storage to meet their needs.
How will we know we have a good solution? (acceptance criteria)
By setting operatorMode.mode: manual and specifying operatorMode.manual.nodeCount: n, the user can change the size of the gluster cluster.
Increasing nodeCount causes additional nodes to be created and probed into the cluster.
Decreasing nodeCount causes one or more nodes to be drained of bricks, removed from the cluster (detach), and have the pod + backing PV deleted. (A hypothetical CR sketch appears at the end of this issue.)
Work items
Additional context
Depends on GD2 to drain the nodes of bricks based on node state tag
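A sketch of the manual-mode settings named in the acceptance criteria above; only operatorMode.mode and operatorMode.manual.nodeCount come from this issue, and the surrounding CR structure (kind, apiVersion, metadata) is an assumption:

```yaml
# Sketch of manual sizing in the cluster CR. Only the operatorMode fields are
# taken from this issue; the group/version and kind are assumptions.
apiVersion: operator.gluster.org/v1alpha1   # assumed group/version
kind: GlusterCluster                        # assumed kind
metadata:
  name: example-cluster
spec:
  operatorMode:
    mode: manual
    manual:
      nodeCount: 3   # raise to add and probe nodes; lower to drain and remove them
```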
Describe the feature you'd like to have.
Gluster pods should be able to float between nodes in response to failures. This includes support for more than one gluster pod per node. Note: the storage for the bricks must be able to move for this to be possible.
What is the value to the end user? (why is it a priority?)
Currently, Gluster is deployed onto specific nodes, and when a node fails, loses connectivity, or is taken down for some reason, the corresponding gluster pod remains down until the node is repaired. Users would like better availability for their data by allowing the gluster pod to restart elsewhere if the back-end storage can still be accessed. This also allows users to run multiple gluster clusters on the same set of storage nodes, lowering their minimum investment.
How will we know we have a good solution? (acceptance criteria)
Additional context
This requires a fixed identity for the pod that can travel with it (i.e., DNS name), and that ID must be used properly by CSI & GD2 peers. If a stable IP is also needed, there will probably have to also be a service per pod.
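If a stable IP does turn out to be required, a per-pod Service along these lines could provide one. This is a hedged sketch: the names, the StatefulSet-style pod-name selector, and the port are all assumptions:

```yaml
# Hypothetical per-pod Service giving one Gluster pod a stable DNS name and IP.
# Names, selector, and port are assumptions (the selector assumes the pods run
# in a StatefulSet, which labels each pod with its name).
apiVersion: v1
kind: Service
metadata:
  name: gluster-node-0
spec:
  selector:
    statefulset.kubernetes.io/pod-name: gluster-node-0   # selects exactly one pod
  ports:
    - name: glusterd2
      port: 24007
```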
Depends on:
Describe the feature you'd like to have.
When a gluster pod fails, kube will attempt to restart it; if it was a simple crash or other transient problem, this should be sufficient to repair the system (plus automatic heal). However, if the node's state becomes corrupt or is lost, it may be necessary to remove the failed node from the cluster and potentially spawn a new one to take its place.
What is the value to the end user? (why is it a priority?)
If a gluster node (pod) remains offline, the associated bricks will have a reduced level of availability & reliability. Being able to automatically repair failures will help increase system availability and protect users' data.
How will we know we have a good solution? (acceptance criteria)
Additional context
This relies on the node state machine (#17) and an as-yet-unimplemented GD2 auto-migration plugin.
Describe the feature you'd like to have.
The operator should properly secure all components (CSI, Gluster pods, etcd) at time of deployment.
The CR will contain a reference to a secret with a CA key pair. This key pair should be used to secure the Gluster cluster.
What is the value to the end user? (why is it a priority?)
In a kubernetes environment, pods can get traffic from arbitrary sources. In order to maintain the integrity of the infrastructure and properly protect user data, the operator should properly secure all components via TLS (or other supported/appropriate method)
How will we know we have a good solution? (acceptance criteria)
Additional context
Depends on:
Describe the feature you'd like to have.
The operator should maintain a pod disruption budget for the Gluster cluster pods to prevent voluntary disruptions from hurting service availability. After a Gluster pod is down for any reason, the data hosted on that pod will likely need to be healed before the next outage can be fully tolerated. Having a disruption budget will prevent kubernetes from voluntarily taking down a pod until the proper number are up and healthy.
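A minimal sketch of the kind of PodDisruptionBudget the operator might maintain; the selector labels and the minAvailable value are assumptions about what the operator would actually create:

```yaml
# Hypothetical PDB for a 3-node Gluster cluster; labels and minAvailable are
# assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gluster-cluster
spec:
  minAvailable: 2          # block voluntary evictions that would leave < 2 healthy pods
  selector:
    matchLabels:
      app: gluster-node    # assumed pod label
```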
What is the value to the end user? (why is it a priority?)
Users expect storage to be continuously available, through both planned and unplanned events. Having properly maintained disruption budgets will prevent voluntary events (upgrades, etc.) from causing outages.
How will we know we have a good solution? (acceptance criteria)
Additional context
This item will need some investigation (and may not actually be usable):
Describe the feature you'd like to have.
What new functionality do you want?
The gluster operator will create a k8s Prometheus object file, which can be applied using an oc/kubectl create -f <file> command. It will also create a YAML for a Grafana dashboard that can be applied to see the Gluster-related dashboards.
Right now we do this manually using a shell script in the gluster-mixins project.
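One possible shape for such an object is a ServiceMonitor from the Prometheus Operator API; whether the operator would emit this particular kind is an assumption, as are the labels and port name below:

```yaml
# Hypothetical ServiceMonitor for scraping Gluster metrics; all names, labels,
# and the port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gluster-metrics
spec:
  selector:
    matchLabels:
      app: gluster-node    # assumed label on the Gluster metrics Service
  endpoints:
    - port: metrics        # assumed port name
      interval: 30s
```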
What is the value to the end user? (why is it a priority?)
How would the end user gain value from having this feature?
How will we know we have a good solution? (acceptance criteria)
Add a list of criteria that should be met for this feature to be useful
Additional context
Add any other context or screenshots about the feature request here.
Describe the feature you'd like to have.
The operator should properly deploy a given version of the CSI-based gluster file volume driver.
What is the value to the end user? (why is it a priority?)
The operator should be the single point of contact for users and admins to manage a Gluster deployment. As such, starting the operator and creating the "cluster CR" should be everything needed to be up and running.
How will we know we have a good solution? (acceptance criteria)
Additional context
This depends on having a CSI driver: https://github.com/gluster/gluster-csi-driver
Describe the feature you'd like to have.
It should be possible to install the operator as a part of OpenShift installation, either via openshift-ansible or whatever other method OpenShift supports.
What is the value to the end user? (why is it a priority?)
It is important that we make Gluster storage easy to install (and do our best to at least maintain parity w/ the current workflow).
How will we know we have a good solution? (acceptance criteria)
Describe the feature you'd like to have.
We need to have a design for:
What is the value to the end user? (why is it a priority?)
Users expect to be able to deploy the latest version of GCS and have the system upgrade automatically. This includes proper sequencing of the upgrade process and multi-step upgrades where necessary. The user needs to be able to set the version and walk away, because they will not typically have the in-depth knowledge to perform a manual system upgrade.
How will we know we have a good solution? (acceptance criteria)
Additional context
The requirements above come from the behavior of OLM. OLM upgrades the operator by stepping through its versions. However, it does not have the ability to wait for underlying resources to be upgraded. As soon as the operator's Deployment becomes ready, it is subject to upgrade if a new version is available. This means the current version of the operator may see an arbitrarily old version of the Gluster cluster and must be able to upgrade it.
Describe the feature you'd like to have.
The operator should be able to deploy the infrastructure necessary to use gluster-block volumes.
What is the value to the end user? (why is it a priority?)
Gluster-block volumes are commonly used for metadata-intensive workloads as well as metrics and logging. Given how common these workloads are, users should have access to gluster-block volumes when deploying a Gluster cluster through the operator.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Add any other context or screenshots about the feature request here.
Describe the feature you'd like to have.
Each commit/PR should be checked via Travis.
What is the value to the end user? (why is it a priority?)
Having linting and unit tests provide a minimum level of assurance for new code submissions, helping to keep quality high.
How will we know we have a good solution? (acceptance criteria)
Each Commit and PR should be checked:
Additional context
Child of #5
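A minimal .travis.yml sketch for the checks described above, assuming standard Go tooling; the Go version and exact lint/test commands are assumptions, not the project's actual CI configuration:

```yaml
# Minimal .travis.yml sketch; the Go version and commands are assumptions.
language: go
go:
  - "1.11"
script:
  - go vet ./...     # basic static analysis / lint
  - go test ./...    # unit tests
```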
Describe the feature you'd like to have.
It should be possible to deploy a Gluster cluster via a Service Catalog.
What is the value to the end user? (why is it a priority?)
Using the service catalog, we can provide a "point-and-click", menu-driven method for installing a Gluster cluster... Answer a few questions and be on your way. Not all admins are interested in editing yamls to configure their storage, and this can provide a way to guide them to a working storage system in a "wizard-like" manner.
How will we know we have a good solution? (acceptance criteria)
Additional context
I view this as just a wrapper around the yaml method. I think it would be beneficial for maintainability to keep the service catalog specific code/infra as light as possible.
Describe the feature you'd like to have.
When a user removes the top-level CR that defines a Gluster cluster, the operator should properly clean up the Gluster resources.
What is the value to the end user? (why is it a priority?)
An end-user should be able to remove Gluster just as easily as they installed it. With this proposal, removal would be:
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Describe the feature you'd like to have.
We need a CI system to ensure master remains healthy as development progresses.
What is the value to the end user? (why is it a priority?)
Having a healthy master means that users don't need to search for a "stable build" to use, and we will be able to release new features faster with higher quality than would otherwise be possible.
How will we know we have a good solution? (acceptance criteria)
Work items
Describe the bug
Naming this repository just "operator" feels wrong; why isn't this called "gluster-operator" instead?
Steps to reproduce
Steps to reproduce the behavior:
Actual results
The repository is named just "operator"
Expected behavior
The repository is named "gluster-operator"
Additional context
See examples of other operator projects named in the same way:
Describe the feature you'd like to have.
It should be possible to monitor the Gluster cluster via prometheus
What is the value to the end user? (why is it a priority?)
Prometheus is the monitoring solution for Kubernetes, and as such, admins expect it to be the place to go for metrics about the services running in their infra. By providing statistics about the cluster, admins can be assured things are working well, or they know that they need to dig deeper.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
We need to consider what data should come from the operator and what should come from the Gluster pods themselves or the CSI driver.
Describe the feature you'd like to have.
We currently support specifying node affinity to constrain where pods are placed. We should also support Tolerations so that our pods can be placed on dedicated nodes if desired.
What is the value to the end user? (why is it a priority?)
An admin may want to have dedicated kube worker nodes for the gluster storage server pods. This requires 2 things to happen:
In order to implement this, the admin would apply a NoSchedule taint to the designated nodes, preventing pods from being scheduled there, effectively implementing (1). To achieve (2), both the existing NodeAffinity (to direct the pods to these nodes) and a toleration on the pods (to permit them to ignore the taint) are required.
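For illustration, a hypothetical excerpt of the cluster CR showing a tolerations list that the operator would copy onto the Gluster pods. The taint key and value are assumptions; the toleration fields themselves follow the standard corev1.Toleration schema:

```yaml
# Hypothetical CR excerpt: a toleration matching a NoSchedule taint on the
# dedicated storage nodes. The key and value are assumptions.
spec:
  tolerations:
    - key: "storage"
      operator: "Equal"
      value: "gluster"
      effect: "NoSchedule"   # lets the pods ignore the NoSchedule taint
```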
How will we know we have a good solution? (acceptance criteria)
It should be possible to run gluster pods on a dedicated set of nodes via the combination of taint/toleration and node affinity.
Additional context
This requirement arose based on a discussion of how machinesets will be used with workloads requiring dedicated resources in OpenShift.
Major work items:
Add a tolerations field ([]*corev1.Toleration) to both CRDs.
Describe the feature you'd like to have.
It should be possible to remove or replace a specific Gluster pod, maximally preserving affected data's resiliency.
What is the value to the end user? (why is it a priority?)
In certain deployment scenarios, it may be necessary to decommission an admin-specified Gluster node. Examples include:
How will we know we have a good solution? (acceptance criteria)
Additional context
Requires state machine #17
Describe the feature you'd like to have.
The operator needs to be aware of failure domains so that it can:
What is the value to the end user? (why is it a priority?)
Users want to be able to control how their data is spread relative to failure boundaries. For example, they may want a R3 volume to use 3 different domains so that an infrastructure outage does not affect the storage. They may also want to co-locate their storage with their workload to increase performance, since crossing infrastructure boundaries tends to increase latency & decrease bandwidth.
How will we know we have a good solution? (acceptance criteria)
Work items
Additional context
Dependencies:
Describe the feature you'd like to have.
The operator should be automatically deployed into a continuous testing environment to help with hardening of the code.
What is the value to the end user? (why is it a priority?)
Deploying the latest version of the operator into a live testing environment will allow faster feedback and higher quality code, resulting in a better, more robust product for the final user.
How will we know we have a good solution? (acceptance criteria)
Additional context
Implementing this will require a separate effort to develop something similar to Chaos Monkey for our system.
Describe the feature you'd like to have.
We need to have a set of well organized documentation for both users and developers.
What is the value to the end user? (why is it a priority?)
Well organized documentation would make it easy for users to get started using the operator to deploy a Gluster cluster. They should also be able to read about how to perform day 2 operations and troubleshoot their installation.
In addition to end-user documentation, new developers can use the resource as a friendly way to learn how to contribute to the project.
How will we know we have a good solution? (acceptance criteria)
Additional context
My suggestion is to use readthedocs for documentation
Describe the feature you'd like to have.
The operator should log all of its actions to stdout to help diagnose problems.
What is the value to the end user? (why is it a priority?)
While the operator is intended to automate the management of a Gluster cluster, its actions must be both observable and understandable. An admin will hesitate to allow automated management if it is difficult to understand what the operator is doing and why.
How will we know we have a good solution? (acceptance criteria)
Additional context
I expect this to be the main source of data for answering "why did it (not) do that?" This will become increasingly important as the sophistication of the operator's algorithms increases.
Describe the feature you'd like to have.
Determine a set of RBAC roles that are appropriate for the gluster operator ecosystem. This includes access that:
What is the value to the end user? (why is it a priority?)
Admins need to be able to properly secure their cluster, both to prevent accidental changes as well as to prevent malicious actors from exploiting the system. A security conscious admin would like to know what permissions are required and why.
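As one hedged illustration, a fragment of the sort of ClusterRole the operator itself might need; the CRD API group and resource names are assumptions:

```yaml
# Hypothetical ClusterRole fragment for the operator; the CRD group and
# resource names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: anthill-operator
rules:
  - apiGroups: ["operator.gluster.org"]   # assumed CRD group
    resources: ["glusterclusters"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```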
How will we know we have a good solution? (acceptance criteria)
Additional context
Child of #6
Describe the feature you'd like to have.
Build the initial skeleton of the operator. It should start and watch the defined CRs.
What is the value to the end user? (why is it a priority?)
With this, the project has an initial instance of a thing that can be run e2e.
How will we know we have a good solution? (acceptance criteria)
Describe the feature you'd like to have.
Gluster storage can currently be deployed either with gluster running in pods (converged) or using an external gluster cluster that lives outside the kubernetes cluster. The operator should have an operatorMode.mode: external setting for the "non-converged" (i.e., external) gluster deployment.
The functionality of such a mode of operation is limited to just managing the CSI driver.
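A sketch of a cluster CR using this external mode; only operatorMode.mode: external comes from this issue, and everything else about the schema is an assumption:

```yaml
# Sketch of external mode; only operatorMode.mode is taken from this issue.
apiVersion: operator.gluster.org/v1alpha1   # assumed group/version
kind: GlusterCluster                        # assumed kind
metadata:
  name: external-gluster
spec:
  operatorMode:
    mode: external
    external:
      nodes:              # hypothetical: endpoints of the existing external cluster
        - 192.168.10.11
        - 192.168.10.12
```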
What is the value to the end user? (why is it a priority?)
By supporting this "external" mode, users can continue to use separate Gluster deployments in their environment instead of forcing everyone using gluster to deploy it in a converged manner.
How will we know we have a good solution? (acceptance criteria)
Additional context
Describe the feature you'd like to have.
The operator should be able to upgrade the CSI driver associated with the Gluster cluster. It should also ensure version compatibility before performing the upgrade.
What is the value to the end user? (why is it a priority?)
The operator is meant to be the single point of contact for the user to manage the Gluster storage. In keeping with that aim, the user should be able to upgrade the storage system by updating the image tag for the CSI driver associated with the cluster.
How will we know we have a good solution? (acceptance criteria)
Additional context
Add any other context or screenshots about the feature request here.
It's time to get this repo up and running:
Describe the feature you'd like to have.
It should be possible to enable VDO in the Gluster pods to get better storage efficiency.
What is the value to the end user? (why is it a priority?)
By increasing storage efficiency, we can decrease the cost of the storage infrastructure.
How will we know we have a good solution? (acceptance criteria)
Additional context
This depends on VDO support in both the gluster containers and GD2
Describe the feature you'd like to have.
A number of features require the operator and GD2 to coordinate their actions, such as when decommissioning or upgrading a gluster node. This coordination can be handled via a state machine that is represented by a (GD2-level) metadata tag that is applied to gluster nodes. This issue is to fully document the states, allowed transitions, actors, and permissible actions in each state.
What is the value to the end user? (why is it a priority?)
Without proper coordination, the operator may cause a user's data to be destroyed or become unavailable.
How will we know we have a good solution? (acceptance criteria)
Additional context
This is used by a number of features, including: #11 #13 #14
Describe the feature you'd like to have.
We need to have a well-thought out design for our custom resources so that the project's capabilities can grow over time while minimizing the need for breaking changes to the CRs.
What is the value to the end user? (why is it a priority?)
Avoiding breaking changes will allow the end user to more easily upgrade their system, and it will also permit implementers to build out the system without significant rework of kube-based state.
How will we know we have a good solution? (acceptance criteria)
Additional context
Child of #6