
hive's Introduction

OpenShift Hive

API driven OpenShift 4 cluster provisioning and management.

Hive is an operator which runs as a service on top of Kubernetes/OpenShift. The Hive service can be used to provision and perform initial configuration of OpenShift clusters.
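
For illustration, a minimal ClusterDeployment manifest might look like the following. This is only a sketch using the current hive.openshift.io/v1 field names; older v1alpha1 clusters (as seen in some of the issues below) used a different schema, and all names here are placeholders.

apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: mycluster
  namespace: mycluster
spec:
  clusterName: mycluster
  baseDomain: example.com            # cluster DNS becomes <clusterName>.<baseDomain>
  platform:
    aws:
      region: us-east-1
      credentialsSecretRef:
        name: mycluster-aws-creds    # secret holding AWS credentials (placeholder name)
  provisioning:
    imageSetRef:
      name: openshift-release        # ClusterImageSet naming the release image (placeholder)
    installConfigSecretRef:
      name: mycluster-install-config # secret containing the install-config (placeholder)
  pullSecretRef:
    name: mycluster-pull-secret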

Supported cloud providers

  • AWS
  • Azure
  • Google Cloud Platform
  • IBM Cloud
  • OpenStack
  • vSphere

In the future Hive will support more cloud providers.

Documentation

hive's People

Contributors

2uasimojo, abhinavdahiya, abraverm, abutcher, akhil-rane, celebdor, csrwng, dgoodwin, dhellmann, dlom, fxierh, gregsheremeta, gyliu513, hassenius, jianping-shu, jmelis, jstuever, lalatendumohanty, lleshchi, maorfr, mjlshen, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, shivamchamoli, staebler, suhanime, waseem-h, wking, yithian

hive's Issues

Stuck uninstall when deleting a ClusterDeployment before provisioning is complete

I deleted a ClusterDeployment before provisioning was complete. Now the uninstall pod is stuck trying to delete network interfaces:

time="2018-11-28T02:14:23Z" level=debug msg="Exiting deleting EIPs (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:47Z" level=debug msg="Deleting internet gateways (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:47Z" level=debug msg="deleting internet gateway: igw-0597f04b2b21e9150"
time="2018-11-28T05:32:47Z" level=debug msg="detaching Internet GW igw-0597f04b2b21e9150 from VPC vpc-02dbc31a784cf1a70"
time="2018-11-28T05:32:47Z" level=debug msg="error detaching igw: error detaching internet gateway: DependencyViolation: Network vpc-02dbc31a784cf1a70 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.\n\tstatus code: 400, request id: 3f1c9453-42d5-4680-b708-1538dc27b1d8"
time="2018-11-28T05:32:47Z" level=debug msg="Exiting deleting internet gateways (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:57Z" level=debug msg="Deleting subnets (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:57Z" level=debug msg="error deleting subnet: DependencyViolation: The subnet 'subnet-0aab27e3305a5f066' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: ad047fa6-2244-41c5-99c4-067780dc3c8b"
time="2018-11-28T05:32:57Z" level=debug msg="error deleting subnet: DependencyViolation: The subnet 'subnet-0524453910b3f2445' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: c8acf976-0942-4b66-abf0-1636aea03791"
time="2018-11-28T05:32:57Z" level=debug msg="Exiting deleting subnets (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:57Z" level=debug msg="Deleting VPCs (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:32:58Z" level=debug msg="Deleting load balancers (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="from 16 total load balancers, 0 scheduled for deletion"
time="2018-11-28T05:32:58Z" level=debug msg="Exiting deleting load balancers (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="Deleting V2 load balancers (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="from 4 total V2 load balancers, 0 scheduled for deletion"
time="2018-11-28T05:32:58Z" level=debug msg="Deleting target groups (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="from 7 total target groups, 0 scheduled for deletion"
time="2018-11-28T05:32:58Z" level=debug msg="Exiting deleting target groups (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="Exiting deleting V2 load balancers (vpc-02dbc31a784cf1a70)"
time="2018-11-28T05:32:58Z" level=debug msg="deleting VPC: vpc-02dbc31a784cf1a70"
time="2018-11-28T05:32:58Z" level=debug msg="error deleting VPC vpc-02dbc31a784cf1a70: DependencyViolation: The vpc 'vpc-02dbc31a784cf1a70' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 623dcf6e-7f39-491d-8379-2b3b0aa84a8c"
time="2018-11-28T05:32:58Z" level=debug msg="Exiting deleting VPCs (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:33:09Z" level=debug msg="Deleting EIPs (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"
time="2018-11-28T05:33:09Z" level=debug msg="deleting EIP: eni-0dbf268139e5a6263"
time="2018-11-28T05:33:09Z" level=debug msg="deleting network interface: eni-0dbf268139e5a6263"
time="2018-11-28T05:33:09Z" level=debug msg="error deleting network iface: InvalidParameterValue: Network interface 'eni-0dbf268139e5a6263' is currently in use.\n\tstatus code: 400, request id: 5714fd9f-840a-4cc7-92dc-fe8b808297a4"
time="2018-11-28T05:33:09Z" level=debug msg="error deleting network iface: InvalidParameterValue: Network interface 'eni-0dbf268139e5a6263' is currently in use.\n\tstatus code: 400, request id: 5714fd9f-840a-4cc7-92dc-fe8b808297a4"
time="2018-11-28T05:33:09Z" level=debug msg="deleting EIP: eni-091fed9de78f91bd6"
time="2018-11-28T05:33:09Z" level=debug msg="deleting network interface: eni-091fed9de78f91bd6"
time="2018-11-28T05:33:10Z" level=debug msg="error deleting network iface: InvalidParameterValue: Network interface 'eni-091fed9de78f91bd6' is currently in use.\n\tstatus code: 400, request id: d3554626-ac73-4168-a020-a49d37299205"
time="2018-11-28T05:33:10Z" level=debug msg="error deleting network iface: InvalidParameterValue: Network interface 'eni-091fed9de78f91bd6' is currently in use.\n\tstatus code: 400, request id: d3554626-ac73-4168-a020-a49d37299205"
time="2018-11-28T05:33:10Z" level=debug msg="Exiting deleting EIPs (map[tectonicClusterID:db802f7a-0af7-4c5a-983e-dc5eea2d78d2])"

These lines have been repeating for ~18 hours now.
ealfassa-test-7-8xqhk-uninstall-bvzrb 1/1 Running 0 18h

Cluster provisioning without cluster installation

Goal: on an existing (freshly installed) cluster, have Hive set up all of the standard SRE tooling, especially monitoring. The installation is openshift-installer based, but is not in any way controlled by Hive.

Issues I'd like to clarify regarding this goal:

  1. What communication channel(s) does Hive use for Hive<->end-cluster syncing (especially SyncSets, as they form the basis AFAIK), and how can such a channel be set up? (See the sketch after this list.)
  2. Are there any additional installation steps that a Hive-managed cluster is expected to have completed after openshift-install?
  3. Can the communication channels between Hive and the end cluster be severed once everything is provisioned? The idea is to have monitored clusters without them being managed.
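
Regarding question 1: as far as I understand, Hive talks to the end cluster using the admin kubeconfig it stores as a secret alongside the ClusterDeployment, and SyncSets are applied over that connection. A hedged sketch of a SyncSet follows; names are placeholders, and the API version shown is the current hive.openshift.io/v1 (the issues in this repo predate it with v1alpha1):

apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: sre-monitoring-config
  namespace: mycluster
spec:
  clusterDeploymentRefs:
  - name: mycluster              # ClusterDeployment(s) in this namespace to sync to
  resourceApplyMode: Sync        # Sync also removes resources deleted from the SyncSet
  resources:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sre-monitoring
      namespace: openshift-monitoring
    data:
      example: "value"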

Hive fails to provision cluster - index out of range error

During the process of provisioning a new cluster, hive failed with the following error:

Error: Error applying plan:

3 error(s) occurred:

* module.vpc.data.aws_route_table.worker[5]: data.aws_route_table.worker.5: Your query returned no results. Please change your search criteria and try again.
*module.vpc.aws_route.to_nat_gw[5]: index 5 out of range for list aws_route_table.private_routes.*.id (max 5) in:

${aws_route_table.private_routes.*.id[count.index]}
*module.vpc.aws_route_table_association.worker_routing[5]: index 5 out of range for list aws_route_table.private_routes.*.id (max 5) in:

${aws_route_table.private_routes.*.id[count.index]}

After re-running with the same arguments to provision another cluster, it passed this phase successfully, so this seems to be an intermittent failure.

Support for post-installation hooks

We have a use case where we would like to perform additional configuration steps once the cluster has been installed. These steps will most probably need the cluster's admin kubeconfig to run. Will Hive support this kind of post-install hook?

CC: @zgalor

incorrect console URL on deployed clusters

Hi,

Our QE noticed that webConsoleURL for deployed clusters has an incorrect value. We've been able to reproduce this every time.

The webConsoleURL in the ClusterDeployment status is usually something like https://yasun-stg4-03-api.devcluster.openshift.com:6443/console, which leads to an error page, but running oc get route -n openshift-console on the deployed cluster shows a different URL, e.g. console-openshift-console.apps.yasun-stg4-03.devcluster.openshift.com, and that URL works.

If I'm reading the Hive source code correctly, it assumes the console URL can be derived from the API server URL, and that assumption is apparently not accurate:

// We should be able to assume only one cluster in here:
server := cluster.Server
cdLog.Debugf("found cluster API URL in kubeconfig: %s", server)
u, err := url.Parse(server)
if err != nil {
    return err
}
cd.Status.APIURL = server
u.Path = path.Join(u.Path, "console")
cd.Status.WebConsoleURL = u.String()

Can the `hiveImage` and `installerImage` be removed or optional?

Currently, in order to create a cluster deployment, the user of the API has to explicitly specify the hiveImage and installerImage parameters. This is error prone, as it is easy to make mistakes and to use images that aren't in sync with the rest of Hive. Can these be removed or made optional? If optional, Hive should have enough information/configuration to decide which images are the right ones to use.
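
One hedged sketch of how this could look (and roughly the direction later Hive versions took): a ClusterImageSet names a release image, and ClusterDeployments reference it instead of carrying image fields directly. Names and versions below are placeholders:

apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: openshift-v4.x            # placeholder name
spec:
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.x-placeholder
  # the installer image is derived from the release payload, and the hive image is
  # configured centrally rather than per ClusterDeployment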

Retrying Install Can Fail on IAM role already existing.

Not sure how this happens; it looks like this role somehow didn't get cleaned up. The cluster install had been retried several times by the point this happened.

module.vpc.aws_lb_listener.api_internal_api: Creation complete after 1s
module.vpc.aws_lb_listener.api_internal_services: Creation complete after 1s
aws_route53_record.etcd_cluster: Still creating... (20s elapsed)
module.dns.aws_route53_record.tectonic_api_external: Still creating... (20s elapsed)
module.dns.aws_route53_record.tectonic_api_internal: Still creating... (10s elapsed)
aws_route53_record.etcd_cluster: Still creating... (30s elapsed)
module.dns.aws_route53_record.tectonic_api_external: Still creating... (30s elapsed)
module.dns.aws_route53_record.tectonic_api_internal: Still creating... (20s elapsed)
aws_route53_record.etcd_cluster: Still creating... (40s elapsed)
module.dns.aws_route53_record.tectonic_api_external: Creation complete after 35s (ID: Z2I29TC6NNC5SM_jamesh-test-4-api.aws.openshift.com_A)                                                                        
aws_route53_record.etcd_cluster: Creation complete after 45s (ID: Z1S45H7KZWOOGM__etcd-server-ssl._tcp.jamesh-test-4_SRV)                                                                                          
module.dns.aws_route53_record.tectonic_api_internal: Still creating... (30s elapsed)
module.dns.aws_route53_record.tectonic_api_internal: Still creating... (40s elapsed)
module.dns.aws_route53_record.tectonic_api_internal: Creation complete after 46s (ID: Z1S45H7KZWOOGM_jamesh-test-4-api.aws.openshift.com_A)                                                                        

Error: Error applying plan:

1 error(s) occurred:

* module.iam.aws_iam_role.worker_role: 1 error(s) occurred:

* aws_iam_role.worker_role: Error creating IAM Role jamesh-test-4-worker-role: EntityAlreadyExists: Role with name jamesh-test-4-worker-role already exists.                                                       
        status code: 409, request id: a953a657-ecd9-11e8-80ce-5fcb6c926834

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.


level=fatal msg="Error executing openshift-install: failed to fetch Cluster: failed to generate asset \"Cluster\": failed to run terraform: failed to execute Terraform: exit status 1"                            
time="2018-11-20T15:37:39Z" level=error msg="error capturing openshift-install stdout" error="read |0: file already closed"                                                                                        
time="2018-11-20T15:37:39Z" level=error msg="error capturing openshift-install stderr" error="read |0: file already closed"                                                                                        
time="2018-11-20T15:37:39Z" level=error msg="error running openshift-install" error="exit status 1"
time="2018-11-20T15:37:39Z" level=info msg="uploading cluster metadata"
time="2018-11-20T15:37:39Z" level=error msg="error creating metadata configmap" error="configmaps \"jamesh-test-4-854wl-metadata\" already exists"                                                                 
time="2018-11-20T15:37:39Z" level=fatal msg="error uploading cluster metadata.json" error="configmaps \"jamesh-test-4-854wl-metadata\" already exists"

Role Used For Cluster Installs Is Not Updated For New Permissions

Our code in the clusterdeployment controller that creates the role used to run the installer pod ("cluster-installer") does not update the role if it already exists. We should probably ensure the permissions on the role are kept correct.

Workaround is to delete the role and let it be recreated.

Persist install logs for debugging?

Hi. Currently the main log is the installer pod's log. These logs have a limited lifetime and also get trampled if the pod keeps restarting.
Are there any plans to copy them somewhere more persistent?

There are also various other logs that might be of interest: the controller's logs (though I guess it could create Events when interesting things happen), bootstrap node logs, and so on.

cc @elad661 @tzvatot

Support optional adopting ownership of AWS cred secret (possibly SSH keys as well)

SD would like the AWS credentials secret to go away with the cluster deployment. We did not support this because we thought credentials would be shared by multiple clusters. However, we could optionally add a boolean to the ClusterDeployment spec indicating that it should adopt and own the AWS creds secret (and probably the SSH key secret as well). The controller could then make sure we own these objects as soon as it's possible to do so.

ClusterDeployment without aws credentials => uninstall stuck

Might be related to #114 ? Might also be NOTABUG, PEBKAC 😉

I created ClusterDeployments (actually 2 of them) without providing AWS credentials; the AWS secrets contain:

data:
  awsAccessKeyId: null
  awsSecretAccessKey: null

Naturally, no cluster got created.
I deleted the ClusterDeployments and now uninstall pods seem stuck.
Logs: https://gist.github.com/cben/3bf9c8c5e4e8c8f207c24b7460b2ee0c
(as can be expected lots of UnauthorizedOperation and AccessDenied...)

Installer Not Completing Successfully When Run in Hive Pods

We're getting failed clusters where all masters are left in the NotReady state due to CNI not being configured, but only when we run the installer in Hive pods. If we run externally, even with the same image via podman, the cluster comes up fine.

Suspect this points to a problem in how Hive's install manager is executing the installer.

End of a failed Hive provision:

level=info msg="API v1.11.0+d4cacc0 up"                                                                                                                                                                     [0/3239]
level=debug msg="added kube-controller-manager.1566fa9c9ff3b86e: ip-10-0-10-83_b3efcb53-e801-11e8-8d14-0e50234f79ea became leader"                                                                                 
level=debug msg="added kube-scheduler.1566fa9cca7f44e9: ip-10-0-10-83_b3feb2d0-e801-11e8-98c8-0e50234f79ea became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 64"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 54.234.142.41:6443: con
nect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 18.214.173.45:6443: con
nect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 54.234.142.41:6443: con
nect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 54.211.8.168:6443: conn
ect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 54.161.146.11:6443: con
nect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 18.207.48.49:6443: conn
ect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://dgoodwin1-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=64&watch=true: dial tcp 54.159.141.39:6443: con
nect: connection refused"
level=error msg="waiting for bootstrap-complete: watch closed before UntilWithoutRetry timeout"
level=info msg="Install complete! Run 'export KUBECONFIG=/output/auth/kubeconfig' to manage your cluster."
level=info msg="After exporting your kubeconfig, run 'oc -h' for a list of OpenShift client commands."
time="2018-11-14T11:58:23Z" level=info msg="uploading cluster metadata"
time="2018-11-14T11:58:23Z" level=info msg="uploaded cluster metadata configmap" configMapName=dgoodwin1-metadata
time="2018-11-14T11:58:23Z" level=info msg="uploading admin kubeconfig"
time="2018-11-14T11:58:23Z" level=info msg="uploaded admin kubeconfig secret" secretName=dgoodwin1-admin-kubeconfig

Running in Podman with:

sudo podman run -ti --rm -e AWS_ACCESS_KEY_ID=SNIP -e AWS_SECRET_ACCESS_KEY=SNIP -e OPENSHIFT_INSTALL_CLUSTER_NAME="dgoodwin2" -e OPENSHIFT_INSTALL_BASE_DOMAIN="SNIP" -e OPENSHIFT_INSTALL_EMAIL_ADDRESS="SNIP" -e OPENSHIFT_INSTALL_PASSWORD="password" -e  OPENSHIFT_INSTALL_SSH_PUB_KEY="SNIP" -e OPENSHIFT_INSTALL_PULL_SECRET="SNIP" -e OPENSHIFT_INSTALL_PLATFORM=aws -e OPENSHIFT_INSTALL_AWS_REGION=us-east-1 -v /home/dgoodwin/go/src/github.com/openshift/installer/output:/output:Z registry.svc.ci.openshift.org/openshift/origin-v4.0:installer create cluster --log-level=debug

The podman install comes up healthy with different output near the end:

WARNING Failed to connect events watcher: Get https://dgoodwin2-api.new-installer.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=1940&watch=true: dial tcp 54.145.130.123:6443: connect: co
nnection refused                                                                                                                                                                                                    
DEBUG added openshift-master-controllers.1566fde611cb83a1: controller-manager-nnvhs became leader                                                                                                                  
DEBUG added kube-controller-manager.1566fde67693b5ec: ip-10-0-17-165_06eba434-e80a-11e8-9e6c-02b5eb23c7ce became leader                                                                                             
DEBUG added bootstrap-complete: cluster bootstrapping has completed              
INFO Destroying the bootstrap resources...                                                                                                                                                                          
DEBUG Stopping RetryWatcher.                                                             
INFO Using Terraform to destroy bootstrap resources...    

The code we use to execute the installer can be seen here: https://github.com/openshift/hive/blob/master/contrib/pkg/installmanager/installmanager.go#L246

This was working previously but may have gone bad with installer changes around monitoring the logs from the bootstrap node.

Appears to be the root cause of openshift/cluster-network-operator#35

CC @wking

Install/Uninstall Jobs Must Be Deleted For Changes to Take Effect

Certain fields in the cluster deployment affect the jobs that are created, image overrides for example. If the user updates the cluster deployment to correct a problem with these fields, the jobs are never updated and must be deleted instead, at which point they are recreated correctly. It would be nice if we made some attempt to update them for critical fields like this.

How to get OpenShift version out of the ClusterDeployment?

In UHC, we're looking at the ClusterDeployment's status.clusterVersionStatus.current.version, hoping that this would give us the version of OpenShift deployed on the cluster, but the values we're seeing are something like 0.0.1-2018-12-08-172651.

I assume that this is the version of the Cluster Version Operator, and not OpenShift itself (as this is OpenShift 4, not 0.0.1).

Is there a canonical way to convert this string into an OpenShift version? Is this string expected to always look like this, or to turn into the user facing 4.0.x style number when OpenShift 4 is GA?

Thanks.

cluster provisioning doesn't finish

I've provisioned a cluster using api.openshift.com.

As a result I see the instances were created in AWS; however, the bootstrap node is still running.
The end of the output of "journalctl --unit=bootkube.service" on the bootstrap node is:

Nov 26 14:32:48 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-apiserver/openshift-kube-apiserver DoesNotExist
Nov 26 14:32:53 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-controller-manager/openshift-kube-controller-manager-ip-10-0-43-201.ec2.internal Pending
Nov 26 14:32:53 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-cluster-version/cluster-version-operator-8bb6cff75-7fhxd Running
Nov 26 14:32:53 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-apiserver/openshift-kube-apiserver DoesNotExist
Nov 26 14:32:53 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-18-37.ec2.internal Running
Nov 26 14:34:38 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-apiserver/openshift-kube-apiserver-ip-10-0-43-201.ec2.internal Pending
Nov 26 14:34:38 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-18-37.ec2.internal Running
Nov 26 14:34:38 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-controller-manager/openshift-kube-controller-manager-ip-10-0-43-201.ec2.internal Pending
Nov 26 14:34:38 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-cluster-version/cluster-version-operator-8bb6cff75-7fhxd Running
Nov 26 14:34:43 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-apiserver/openshift-kube-apiserver-ip-10-0-43-201.ec2.internal Pending
Nov 26 14:34:43 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-18-37.ec2.internal Running
Nov 26 14:34:43 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-controller-manager/openshift-kube-controller-manager-ip-10-0-43-201.ec2.internal Running
Nov 26 14:34:43 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-cluster-version/cluster-version-operator-8bb6cff75-7fhxd Running
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-18-37.ec2.internal Running
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-controller-manager/openshift-kube-controller-manager-ip-10-0-43-201.ec2.internal Running
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-cluster-version/cluster-version-operator-8bb6cff75-7fhxd Running
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: Pod Status:openshift-kube-apiserver/openshift-kube-apiserver-ip-10-0-43-201.ec2.internal Running
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: All self-hosted control plane components successfully started
Nov 26 14:34:48 ip-10-0-7-196 bootkube.sh[4235]: Tearing down temporary bootstrap control plane...

And on one of the master nodes I see:
[core@ip-10-0-13-248 ~]$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
E1127 08:00:27.762235 3992 remote_runtime.go:278] ContainerStatus "CONTAINER" from runtime service failed: rpc error: code = Unknown desc = container with ID starting with CONTAINER not found: ID does not exist
FATA[0000] rpc error: code = Unknown desc = container with ID starting with CONTAINER not found: ID does not exist
[core@ip-10-0-13-248 ~]$ sudo crictl pods --name=etcd-member
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
73e729675f488 17 hours ago Ready etcd-member-ip-10-0-13-248.ec2.internal kube-system 1
0110484f5a685 18 hours ago NotReady etcd-member-ip-10-0-13-248.ec2.internal kube-system 0

Cluster provision fails on missing cluster-config configmap

During provisioning of a cluster using Hive, the network-operator pod ends up in CreateContainerConfigError:

[core@ip-10-0-3-26 ~]$ sudo oc get pods --config=/var/opt/tectonic/auth/kubeconfig --all-namespaces 
NAMESPACE                              NAME                                                              READY     STATUS                       RESTARTS   AGE
kube-system                            etcd-member-ip-10-0-8-201.ec2.internal                            1/1       Running                      0          1h
kube-system                            kube-proxy-n64xx                                                  1/1       Running                      0          1h
kube-system                            kube-scheduler-mjsc9                                              0/1       ContainerCreating            0          1h
kube-system                            tectonic-network-operator-gjps2                                   1/1       Running                      0          1h
openshift-cluster-api                  machine-api-operator-649c446d5b-49rf9                             0/1       Pending                      0          1h
openshift-cluster-dns-operator         cluster-dns-operator-6b8b9cbbcd-rmhsx                             0/1       Pending                      0          1h
openshift-cluster-network-operator     cluster-network-operator-cz466                                    0/1       CreateContainerConfigError   0          1h
openshift-cluster-version              cluster-version-operator-4lkmd                                    1/1       Running                      0          1h
openshift-core-operators               openshift-cluster-kube-apiserver-operator-576768b698-7mwg9        0/1       Pending                      0          1h
openshift-core-operators               openshift-cluster-kube-controller-manager-operator-76c57b55sgf8   0/1       Pending                      0          1h
openshift-core-operators               openshift-cluster-kube-scheduler-operator-5b6c66dd59-cc9nr        0/1       Pending                      0          1h
openshift-core-operators               openshift-cluster-openshift-apiserver-operator-759fd945d8-gw8tl   0/1       Pending                      0          1h
openshift-core-operators               openshift-cluster-openshift-controller-manager-operator-58prxwq   0/1       Pending                      0          1h
openshift-core-operators               openshift-service-cert-signer-operator-7fd688bd7f-rjft5           0/1       Pending                      0          1h
openshift-machine-config-operator      machine-config-operator-744f64d9c7-lrhtk                          0/1       Pending                      0          1h
openshift-operator-lifecycle-manager   catalog-operator-cf4cd9c5c-hhc9h                                  0/1       Pending                      0          1h
openshift-operator-lifecycle-manager   olm-operator-7f44dd6495-fv2wg                                     0/1       Pending                      0          1h
openshift-operator-lifecycle-manager   package-server-54d99f7dfc-s2jbq                                   0/1       Pending                      0          1h

Running sudo oc get -o yaml pod -n openshift-cluster-network-operator cluster-network-operator-cz466 --config=/var/opt/tectonic/auth/kubeconfig shows that the pod is waiting for the cluster-config configmap:

containerStatuses:
  - image: registry.svc.ci.openshift.org/openshift/origin-v4.0-20181107015454@sha256:af8333760046cefb84d5f222a96e28c8af06823de8e2fed44647d614cedc925a
    imageID: ""
    lastState: {}
    name: cluster-network-operator
    ready: false
    restartCount: 0
    state:
      waiting:
        message: configmaps "cluster-config" not found
        reason: CreateContainerConfigError

Please report installation errors in the ClusterDeployment status

Right now the ClusterDeployment status is very vague: a boolean of installed: true or installed: false. However, it seems that in some cases Hive can know when the installer errored out, for example:

if err := m.updateClusterDeploymentStatus(cd, adminKubeconfigSecret.Name, m); err != nil {
    // non-fatal. log and continue.
    // will fix up any updates to the clusterdeployment in the periodic controller
    m.log.WithError(err).Warning("error updating cluster deployment status")
}
if installErr != nil {
    m.log.WithError(installErr).Fatal("failed due to install error")
}
return nil

In these cases, it would be beneficial if the status of the ClusterDeployment contained error: true or even error: "failed due to installer error".
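
As a purely illustrative sketch (not necessarily the shape Hive ends up using), a condition on the status could carry the failure:

status:
  installed: false
  conditions:
  - type: ProvisionFailed
    status: "True"
    reason: InstallError
    message: "failed due to install error: exit status 1"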

Clarify Cluster ID, UUID, Name in API

Right now we have the cluster deployment name, a ClusterID, and a ClusterUUID.

The installer has a Name, and a ClusterID. Name is used in DNS in combination with the base domain.

Our ClusterUUID -> their ClusterID
Our ClusterID -> their cluster Name.
Our cluster deployment name is not used.

SD has raised a use case where name is not unique even within one account, as it could be combined with different base domains. As such we can't really map our cluster name to their cluster name. (We're going to need something validating uniqueness)

We should probably rename our ClusterID to ClusterName to eliminate the confusion around ClusterID in each API. Or should we make it "DNSName" instead?

Should we then rename our ClusterUUID to just ClusterID to match the installer? Or is the UUID more precise and clear?

The result would be:

ClusterDeployment:
  Name: foo-xldkj
  DNSName: foo
  BaseDomain: example.com
  ClusterUUID: UUIDHERE

Resource deletion races with MachineAPI actuators

When the installer attempts to tear down a cluster, it ends up racing with MAO, which is desperately trying to replace the machines that keep getting deleted. The result is that a few machines get recreated before we completely tear down the control plane. The deletion logic in hive doesn't scan for new resources when it runs, so it never sees these new machines and will get stuck trying to delete dependent resources.

I think there are basically two options going forward:

  • Hive/Installer tells MAO (and every future infrastructure operator) to pause before destroying the cluster.
  • The deletion code continually scans for and deletes resources. This process would end once no more resources have been observed.

For what it's worth, I've observed this on both AWS and libvirt.

Some `hive-controller-manager` messages contain `ERROR: logging before flag.Parse`

Some of the messages in the log of the hive-controller-manager pod contain the prefix ERROR: logging before flag.Parse, for example:

ERROR: logging before flag.Parse: W1120 11:25:50.863560       1 reflector.go:341] github.com/openshift/hive/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *v1alpha1.DNSZone ended with: The resourceVersion for the provided watch is too old.

This is confusing. In the above example it looks like an error, but it is actually a warning. Can this prefix be removed?

Cannot provision the openshift cluster

Following the document https://github.com/openshift/hive/blob/master/docs/using-hive.md, I cloned the latest Hive code and used hiveutil to create an OpenShift cluster. It always fails with Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused. The three master nodes and one bootstrap node are created in AWS.

More detailed log information is appended below:

time="2019-10-14T07:40:07Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.190.58.3:6443: connect: connection refused"
time="2019-10-14T07:40:07Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.190.58.3:6443: connect: connection refused"
time="2019-10-14T07:40:41Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:40:41Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:40:41Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:41:15Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:41:15Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:41:15Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.clyang-oc-cluster.clyang.de:6443/version?timeout=32s: dial tcp 18.189.82.228:6443: connect: connection refused"
time="2019-10-14T07:41:29Z" level=debug msg="Fetching \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Loading \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Fetching \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Fetching \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Loading \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Loading \"Install Config\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"SSH Key\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"SSH Key\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Cluster Name\"..."
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Pull Secret\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"SSH Key\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Cluster Name\"..."
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Pull Secret\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Using \"Install Config\" loaded from state file"
time="2019-10-14T07:41:29Z" level=debug msg="Reusing previously-fetched \"Install Config\""
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Cluster Name\"..."
time="2019-10-14T07:41:29Z" level=debug msg="    Loading \"Base Domain\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Pull Secret\"..."
time="2019-10-14T07:41:29Z" level=debug msg="  Loading \"Platform\"..."
time="2019-10-14T07:41:29Z" level=debug msg="Using \"Install Config\" loaded from state file"
time="2019-10-14T07:41:29Z" level=debug msg="Reusing previously-fetched \"Install Config\""
time="2019-10-14T07:41:29Z" level=debug msg="Using \"Install Config\" loaded from state file"
time="2019-10-14T07:41:29Z" level=debug msg="Reusing previously-fetched \"Install Config\""
time="2019-10-14T07:41:29Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2019-10-14T07:41:29Z" level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: failed to initialize the SSH agent: failed to read directory \"/root/.ssh\": open /root/.ssh: no such file or directory"
time="2019-10-14T07:41:29Z" level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"
time="2019-10-14T07:41:29Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2019-10-14T07:41:29Z" level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: failed to initialize the SSH agent: failed to read directory \"/root/.ssh\": open /root/.ssh: no such file or directory"
time="2019-10-14T07:41:29Z" level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"
time="2019-10-14T07:41:29Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2019-10-14T07:41:29Z" level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: failed to initialize the SSH agent: failed to read directory \"/root/.ssh\": open /root/.ssh: no such file or directory"

I have tried using openshift/installer to install an OpenShift cluster with the same configuration and credentials as used by Hive; it creates the cluster successfully in AWS.

Any comments? Thanks.

Cannot get the htpasswd secret created in the provisioned cluster

I have followed this guide to create a SyncIdentityProvider, but the htpasswd secret does not get created in the target cluster. Here is my SyncIdentityProvider file content:

---
apiVersion: hive.openshift.io/v1alpha1
kind: SyncIdentityProvider
metadata:
  name: allowall-identity-provider
spec:
  identityProviders:
  - name: htpasswd
    challenge: true
    login: true
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpasswd-zzl2r
  clusterDeploymentRefs:
  - name: "mycluster"

Am I missing anything? Update: I just tried creating the htpasswd-zzl2r secret in the openshift-config namespace, and everything is working now. Thanks.
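
For anyone hitting the same thing, a hedged sketch of the secret the fileData reference expects: the contents must live under the htpasswd key, and the value below is a truncated placeholder.

apiVersion: v1
kind: Secret
metadata:
  name: htpasswd-zzl2r
  namespace: openshift-config     # on the target cluster; where Hive expects it when syncing may differ by version
type: Opaque
data:
  htpasswd: dXNlcjE6JGFwcjEk...   # base64-encoded htpasswd file contents (placeholder)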

Drop Admin Password From API

This functionality was recently dropped from the installer in favor of using a static user with an auto-generated password.

We need to revendor the installer, adjust our API to match the change in openshift/installer#771, and I suspect we will need to pull out the generated admin password, similar to how we upload the admin kubeconfig after an install.

CC @jhernand: the installer dropped the functionality that let you specify the OpenShift admin password and email. Is this workable on your side?

Is it planned for Hive to provide information about the nodes once the cluster is deployed?

It would be really useful for us (the SD-A team) to get information about the nodes once the cluster is deployed: a list of nodes, their types, IP addresses, their OS version, and their container runtime version. All of this information is in the output of oc get nodes -o wide.

I was wondering if it's planned for Hive to provide this information in the ClusterDeployment object of a deployed cluster.

Support for disconnecting a cluster from Hive without de-provisioning it

We have a use case where users would like to end their relationship with us, but they would like to keep the cluster that they have created. How can this be achieved with Hive? Deleting the cluster is not OK for this, because it will de-provision it. In some cases we edited the ClusterDeployment manually to remove the finalizer, and then we deleted the ClusterDeployment object. Is that enough, or is there anything else that needs to be done to make sure that the cluster is completely disconnected from Hive? Also, will that be supported going forward? (A sketch of the finalizer edit is shown below.)
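
A hedged sketch of the manual workaround mentioned above: remove Hive's deprovision finalizer from the ClusterDeployment before deleting it, so the delete does not trigger a deprovision. The finalizer name below is what current Hive uses; verify against your version.

metadata:
  name: mycluster
  finalizers: []                  # was: ["hive.openshift.io/deprovision"]; with it removed, deleting the object skips deprovisioning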

Hive limitations on CIDR

It seems like Hive has limitations/constraints on the CIDR blocks for ServiceCIDR and PodCIDR.
Can you please share what the limitations are?
Could Hive calculate the ServiceCIDR and PodCIDR itself based on the VPCCIDRBlock?
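
For reference, a sketch of the relevant install-config networking block with the installer's usual defaults (field names vary by installer version, and as far as I know Hive passes these through to the installer). The main constraint is that the machine, service, and cluster (pod) networks must not overlap each other or the VPC CIDR:

networking:
  machineCIDR: 10.0.0.0/16        # must contain the VPC/subnet addresses
  serviceCIDR: 172.30.0.0/16      # the ServiceCIDR; called serviceNetwork in later install-config versions
  clusterNetworks:                # the PodCIDR; called clusterNetwork in later versions
  - cidr: 10.128.0.0/14
    hostSubnetLength: 9           # later versions use hostPrefix: 23 instead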

Hive can't load the admin kubeconfig it created for a new cluster.

Description:

After creating a new cluster, Hive creates a secret with the new cluster's kubeconfig.
Hive then tries to load it and fails.

TL;DR: It looks like a generateName vs name bug.

What happens:

In my Hive logs I get:

DEBU[9541] reconcile complete                            clusterDeployment=yaacov-05-jnptv job=yaacov-05-jnptv-install namespace=unified-hybrid-cloud
ERRO[9542] unable to load admin kubeconfig               clusterDeployment=yaacov-05-jnptv controller=remotemachineset error="Secret \"yaacov-05-admin-kubeconfig\" not found" namespace=unified-hybrid-cloud
ERRO[9543] unable to load admin kubeconfig               clusterDeployment=yaacov-05-jnptv controller=remotemachineset error="Secret \"yaacov-05-admin-kubeconfig\" not found" namespace=unified-hybrid-cloud

When I do oc get secrets I see a kubeconfig secret named yaacov-05-jnptv-admin-kubeconfig

Use registry.svc.ci.openshift.org/openshift/hive-v4.0:hive as the default hive image

Presently we're using local image defaults in the code; this should match the installer:

defaultInstallerImage           = "registry.svc.ci.openshift.org/openshift/origin-v4.0:installer"                                                                                                           
defaultInstallerImagePullPolicy = corev1.PullAlways                                                                                                                                                         
defaultHiveImage                = "hive-controller:latest"                                                                                                                                                  
defaultHiveImagePullPolicy      = corev1.PullNever       

The pull policy should be Always as well.

'Zones' field does not work

Specifying the "Zones" field in the ClusterDeployment object should provision the nodes on the specified list of zones, Assuming I understand correctly. However this does not work - for example creating the following deployment:

Name:         degas-mtpjc
Namespace:    uhc-development
Labels:       api.openshift.com/id=1F4FyulChH0kFyhSlewODxbjYVV
              api.openshift.com/name=degas
              controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  hive.openshift.io/v1alpha1
Kind:         ClusterDeployment
...
Spec:
  Cluster UUID:  94d16fa8-3e75-4b0b-a40d-9579c269c5a0
  Config:
    Base Domain:  sdev.devshift.net
    Cluster ID:   degas
    Machines:
      Name:  master
      Platform:
        Aws:
          Iam Role Name:  
          Root Volume:
            Iops:  100
            Size:  32
            Type:  gp2
          Type:    m5.xlarge
      Replicas:    3
      Name:        worker
      Platform:
        Aws:
          Iam Role Name:  TBD
          Root Volume:
            Iops:  100
            Size:  32
            Type:  gp2
          Type:    m5.xlarge
          Zones:
            us-east-1a
....

This does not create a single-AZ cluster in zone us-east-1a, even though the AZ is available according to my AWS console (see the screenshot reference below).

[Screenshot: AWS console availability zones, 2018-12-30 11-13-54]

cc: @jhernand @oourfali @tzvatot @dgoodwin
