
ocp4-metal-install's Introduction

OpenShift 4 Bare Metal Install - User Provisioned Infrastructure (UPI)

Architecture Diagram


Download Software

  1. Download CentOS 8 x86_64 image

  2. Log in to the Red Hat OpenShift Cluster Manager

  3. Select 'Create Cluster' from the 'Clusters' navigation menu

  4. Select 'RedHat OpenShift Container Platform'

  5. Select 'Run on Bare Metal'

  6. Download the following files:

    • OpenShift Installer for Linux
    • Pull secret
    • Command Line Interface for Linux and your workstation's OS
    • Red Hat Enterprise Linux CoreOS (RHCOS)
      • rhcos-X.X.X-x86_64-metal.x86_64.raw.gz
      • rhcos-X.X.X-x86_64-installer.x86_64.iso (or rhcos-X.X.X-x86_64-live.x86_64.iso for newer versions)

Prepare the 'Bare Metal' environment

VMware ESXi is used in this guide

  1. Copy the CentOS 8 iso to an ESXi datastore
  2. Create a new Port Group called 'OCP' under Networking
    • (For VirtualBox, choose "Internal Network" when creating each VM and give them all the same name, e.g. 'ocp')
    • (For Proxmox, you can use the same network bridge and choose a specific VLAN tag, e.g. 50)
  3. Create 3 Control Plane virtual machines with minimum settings:
    • Name: ocp-cp-# (Example ocp-cp-1)
    • 4 vCPU
    • 8 GB RAM
    • 50 GB HDD
    • NIC connected to the OCP network
    • Load the rhcos-X.X.X-x86_64-installer.x86_64.iso image into the CD/DVD drive
  4. Create 2 Worker virtual machines (or more if you want) with minimum settings:
    • Name: ocp-w-# (Example ocp-w-1)
    • 4 vCPU
    • 8 GB RAM
    • 50 GB HDD
    • NIC connected to the OCP network
    • Load the rhcos-X.X.X-x86_64-installer.x86_64.iso image into the CD/DVD drive
  5. Create a Bootstrap virtual machine (this vm will be deleted once installation completes) with minimum settings:
    • Name: ocp-bootstrap
    • 4 vCPU
    • 8 GB RAM
    • 50 GB HDD
    • NIC connected to the OCP network
    • Load the rhcos-X.X.X-x86_64-installer.x86_64.iso image into the CD/DVD drive
  6. Create a Services virtual machine with minimum settings:
    • Name: ocp-svc
    • 4 vCPU
    • 4 GB RAM
    • 120 GB HDD
    • NIC1 connected to the VM Network (LAN)
    • NIC2 connected to the OCP network
    • Load the CentOS_8.iso image into the CD/DVD drive
  7. Boot all virtual machines so that each is assigned a MAC address
  8. Shut down all virtual machines except for 'ocp-svc'
  9. Use the VMware ESXi dashboard to record the MAC address of each VM; these will be used later to set static IPs

Configure Environmental Services

  1. Install CentOS8 on the ocp-svc host

    • Remove the home dir partition and assign all free storage to '/'
    • Optionally you can install the 'Guest Tools' package to have monitoring and reporting in the VMware ESXi dashboard
    • Enable only the LAN NIC, so it obtains a DHCP address from the LAN network, and make note of the IP address (ocp-svc_IP_address) assigned to the VM
  2. Boot the ocp-svc VM

  3. Move the files downloaded from the RedHat Cluster Manager site to the ocp-svc node

    scp ~/Downloads/openshift-install-linux.tar.gz ~/Downloads/openshift-client-linux.tar.gz ~/Downloads/rhcos-metal.x86_64.raw.gz root@{ocp-svc_IP_address}:/root/
  4. SSH to the ocp-svc vm

    ssh root@{ocp-svc_IP_address}
  5. Extract Client tools and copy them to /usr/local/bin

    tar xvf openshift-client-linux.tar.gz
    mv oc kubectl /usr/local/bin
  6. Confirm Client Tools are working

    kubectl version
    oc version
  7. Extract the OpenShift Installer

    tar xvf openshift-install-linux.tar.gz
  8. Update CentOS so we get the latest packages for each of the services we are about to install

    dnf update
  9. Install Git

    dnf install git -y
  10. Download config files for each of the services

    git clone https://github.com/ryanhay/ocp4-metal-install
  11. OPTIONAL: Create a file '~/.vimrc' and paste the following (this helps with editing in vim, particularly yaml files):

    cat <<EOT >> ~/.vimrc
    syntax on
    set nu et ai sts=0 ts=2 sw=2 list hls
    EOT

    Update the preferred editor

    export OC_EDITOR="vim"
    export KUBE_EDITOR="vim"
  12. Set a static IP for the OCP network interface with nmtui-edit ens224 or by editing /etc/sysconfig/network-scripts/ifcfg-ens224

    • Address: 192.168.22.1
    • DNS Server: 127.0.0.1
    • Search domain: ocp.lan
    • Never use this network for default route
    • Automatically connect

    If the changes aren't applied automatically, you can bounce the NIC with nmcli connection down ens224 and nmcli connection up ens224
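
    If you prefer the command line over nmtui, the same settings can be applied with nmcli. This is a minimal sketch assuming the OCP-facing interface is named ens224 as in this guide; adjust the interface name to match your system.

    # Assumed interface: ens224 (OCP network)
    nmcli connection modify ens224 \
      ipv4.method manual \
      ipv4.addresses 192.168.22.1/24 \
      ipv4.dns 127.0.0.1 \
      ipv4.dns-search ocp.lan \
      ipv4.never-default yes \
      connection.autoconnect yes
    # Bounce the connection to apply the changes
    nmcli connection down ens224 && nmcli connection up ens224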

  13. Setup firewalld

    Create internal and external zones

    nmcli connection modify ens224 connection.zone internal
    nmcli connection modify ens192 connection.zone external

    View zones:

    firewall-cmd --get-active-zones

    Set masquerading (source NAT) on both zones.

    A quick example of source NAT: packets leaving the external interface (ens192 in this case) have their source address rewritten, after routing, to the address of ens192, so that return packets can find their way back to this interface, where the reverse translation happens.

    firewall-cmd --zone=external --add-masquerade --permanent
    firewall-cmd --zone=internal --add-masquerade --permanent

    Reload firewall config

    firewall-cmd --reload

    Check the current settings of each zone

    firewall-cmd --list-all --zone=internal
    firewall-cmd --list-all --zone=external

    When masquerading is enabled, IP forwarding is enabled as well, which effectively makes this host a router. Check:

    cat /proc/sys/net/ipv4/ip_forward
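
    IP forwarding is normally switched on for you when masquerading is enabled, so no extra step is required here. If you ever need to enable it explicitly and persistently, a standard approach (shown only for completeness) is:

    # Persist the setting and apply all sysctl config
    echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/99-ip-forward.conf
    sysctl --system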
  14. Install and configure BIND DNS

    Install

    dnf install bind bind-utils -y

    Apply configuration

    \cp ~/ocp4-metal-install/dns/named.conf /etc/named.conf
    cp -R ~/ocp4-metal-install/dns/zones /etc/named/
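
    For orientation, the zone files provide records along these lines. This is only a sketch; the authoritative records, TTLs and serial numbers are in the files you just copied, and the IPs follow this guide's addressing.

    ; illustrative records for the ocp.lan zone
    ocp-svc.ocp.lan.            IN  A  192.168.22.1
    api.lab.ocp.lan.            IN  A  192.168.22.1    ; HAProxy on ocp-svc
    api-int.lab.ocp.lan.        IN  A  192.168.22.1    ; HAProxy on ocp-svc
    *.apps.lab.ocp.lan.         IN  A  192.168.22.1    ; HAProxy on ocp-svc
    ocp-bootstrap.lab.ocp.lan.  IN  A  192.168.22.200
    ocp-cp-1.lab.ocp.lan.       IN  A  192.168.22.201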

    Configure the firewall for DNS

    firewall-cmd --add-port=53/udp --zone=internal --permanent
    # for OCP 4.9 and later 53/tcp is required
    firewall-cmd --add-port=53/tcp --zone=internal --permanent
    firewall-cmd --reload

    Enable and start the service

    systemctl enable named
    systemctl start named
    systemctl status named

    At the moment DNS will still be pointing to the LAN DNS server. You can see this by testing with dig ocp.lan.

    Change the LAN NIC (ens192) to use 127.0.0.1 for DNS and ensure 'Ignore automatically obtained DNS parameters' is ticked

    nmtui-edit ens192
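
    The same change can be made non-interactively with nmcli, shown here as an alternative (ens192 is the LAN NIC name used in this guide):

    nmcli connection modify ens192 ipv4.dns 127.0.0.1 ipv4.ignore-auto-dns yes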

    Restart Network Manager

    systemctl restart NetworkManager

    Confirm dig now sees the correct DNS results by using the DNS Server running locally

    dig ocp.lan
    # The following should return the answer ocp-bootstrap.lab.ocp.lan from the local server
    dig -x 192.168.22.200
  15. Install & configure DHCP

    Install the DHCP Server

    dnf install dhcp-server -y

    Edit dhcpd.conf from the cloned git repo so it has the correct MAC address for each host, then copy the conf file to the location the DHCP service uses

    \cp ~/ocp4-metal-install/dhcpd.conf /etc/dhcp/dhcpd.conf
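
    The host entries in dhcpd.conf look roughly like the following sketch. Replace the placeholder MAC addresses with the ones recorded from the ESXi dashboard and keep the fixed addresses consistent with the DNS zone.

    host ocp-bootstrap {
      hardware ethernet 00:00:00:00:00:00;   # MAC recorded earlier for ocp-bootstrap
      fixed-address 192.168.22.200;
    }
    host ocp-cp-1 {
      hardware ethernet 00:00:00:00:00:00;   # MAC recorded earlier for ocp-cp-1
      fixed-address 192.168.22.201;
    }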

    Configure the Firewall

    firewall-cmd --add-service=dhcp --zone=internal --permanent
    firewall-cmd --reload

    Enable and start the service

    systemctl enable dhcpd
    systemctl start dhcpd
    systemctl status dhcpd
  16. Install & configure Apache Web Server

    Install Apache

    dnf install httpd -y

    Change default listen port to 8080 in httpd.conf

    sed -i 's/Listen 80/Listen 0.0.0.0:8080/' /etc/httpd/conf/httpd.conf

    Configure the firewall for Web Server traffic

    firewall-cmd --add-port=8080/tcp --zone=internal --permanent
    firewall-cmd --reload

    Enable and start the service

    systemctl enable httpd
    systemctl start httpd
    systemctl status httpd

    Making a GET request to localhost on port 8080 should now return the default Apache webpage

    curl localhost:8080
  17. Install & configure HAProxy

    Install HAProxy

    dnf install haproxy -y

    Copy HAProxy config

    \cp ~/ocp4-metal-install/haproxy.cfg /etc/haproxy/haproxy.cfg
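
    As a rough illustration of what the copied config does, it defines TCP frontends on ocp-svc that balance traffic across the cluster nodes. The sketch below is an assumption based on this guide's addressing; the repo's haproxy.cfg is the authoritative version and also covers ports 22623, 80 and 443.

    # Kubernetes API server, balanced across bootstrap and control plane nodes
    frontend ocp4-kubernetes-api-server
      mode tcp
      bind *:6443
      default_backend ocp4-kubernetes-api-server
    backend ocp4-kubernetes-api-server
      mode tcp
      balance source
      server ocp-bootstrap 192.168.22.200:6443 check
      server ocp-cp-1 192.168.22.201:6443 check
      server ocp-cp-2 192.168.22.202:6443 check
      server ocp-cp-3 192.168.22.203:6443 check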

    Configure the Firewall

    Note: Opening port 9000 in the external zone allows access to HAProxy stats that are useful for monitoring and troubleshooting. The UI can be accessed at: http://{ocp-svc_IP_address}:9000/stats

    firewall-cmd --add-port=6443/tcp --zone=internal --permanent # kube-api-server on control plane nodes
    firewall-cmd --add-port=6443/tcp --zone=external --permanent # kube-api-server on control plane nodes
    firewall-cmd --add-port=22623/tcp --zone=internal --permanent # machine-config server
    firewall-cmd --add-service=http --zone=internal --permanent # web services hosted on worker nodes
    firewall-cmd --add-service=http --zone=external --permanent # web services hosted on worker nodes
    firewall-cmd --add-service=https --zone=internal --permanent # web services hosted on worker nodes
    firewall-cmd --add-service=https --zone=external --permanent # web services hosted on worker nodes
    firewall-cmd --add-port=9000/tcp --zone=external --permanent # HAProxy Stats
    firewall-cmd --reload

    Enable and start the service

    setsebool -P haproxy_connect_any 1 # SELinux name_bind access
    systemctl enable haproxy
    systemctl start haproxy
    systemctl status haproxy
  18. Install and configure NFS for the OpenShift Registry. Providing storage for the Registry is a requirement; emptyDir can be specified if necessary.

    Install NFS Server

    dnf install nfs-utils -y

    Create the Share

    Check available disk space and its location with df -h

    mkdir -p /shares/registry
    chown -R nobody:nobody /shares/registry
    chmod -R 777 /shares/registry

    Export the Share

    echo "/shares/registry  192.168.22.0/24(rw,sync,root_squash,no_subtree_check,no_wdelay)" > /etc/exports
    exportfs -rv

    Set Firewall rules:

    firewall-cmd --zone=internal --add-service mountd --permanent
    firewall-cmd --zone=internal --add-service rpc-bind --permanent
    firewall-cmd --zone=internal --add-service nfs --permanent
    firewall-cmd --reload

    Enable and start the NFS related services

    systemctl enable nfs-server rpcbind
    systemctl start nfs-server rpcbind nfs-mountd

Generate and host install files

  1. Generate an SSH key pair keeping all default options

    ssh-keygen
  2. Create an install directory

    mkdir ~/ocp-install
  3. Copy the install-config.yaml included in the cloned repository to the install directory

    cp ~/ocp4-metal-install/install-config.yaml ~/ocp-install
  4. Update the install-config.yaml with your own pull-secret and ssh key.

    • Line 23 should contain the contents of your pull-secret.txt
    • Line 24 should contain the contents of your '~/.ssh/id_rsa.pub'
    vim ~/ocp-install/install-config.yaml
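
    For orientation, the file has this general shape. It is a sketch only; use the install-config.yaml from the cloned repo, which already matches this guide's cluster name, domain and networking.

    apiVersion: v1
    baseDomain: ocp.lan
    metadata:
      name: lab                  # cluster name -> *.lab.ocp.lan
    compute:
    - name: worker
      replicas: 0                # workers are added manually in a UPI install
    controlPlane:
      name: master
      replicas: 3
    platform:
      none: {}                   # bare metal / user provisioned
    pullSecret: '<contents of pull-secret.txt>'
    sshKey: '<contents of ~/.ssh/id_rsa.pub>'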
  5. Generate Kubernetes manifest files

    ~/openshift-install create manifests --dir ~/ocp-install

    A warning is shown about making the control plane nodes schedulable. It is up to you whether you want to run workloads on the Control Plane nodes. If you don't want to, you can disable it with: sed -i 's/mastersSchedulable: true/mastersSchedulable: false/' ~/ocp-install/manifests/cluster-scheduler-02-config.yml. Make any other custom changes you like to the core Kubernetes manifest files.

    Generate the Ignition config and Kubernetes auth files

    ~/openshift-install create ignition-configs --dir ~/ocp-install/
  6. Create a hosting directory to serve the configuration files for the OpenShift booting process

    mkdir /var/www/html/ocp4
  7. Copy all generated install files to the new web server directory

    cp -R ~/ocp-install/* /var/www/html/ocp4
  8. Move the CoreOS image to the web server directory (you will need to type this path multiple times later, so it is a good idea to shorten the name)

    mv ~/rhcos-X.X.X-x86_64-metal.x86_64.raw.gz /var/www/html/ocp4/rhcos
  9. Change ownership and permissions of the web server directory

    chcon -R -t httpd_sys_content_t /var/www/html/ocp4/
    chown -R apache: /var/www/html/ocp4/
    chmod 755 /var/www/html/ocp4/
  10. Confirm you can see all files added to the /var/www/html/ocp4/ dir through Apache

    curl localhost:8080/ocp4/

Deploy OpenShift

  1. Power on the ocp-bootstrap host and ocp-cp-# hosts and press 'Tab' to enter boot configuration. Enter the following configuration:

    # Bootstrap Node - ocp-bootstrap
    coreos.inst.install_dev=sda coreos.inst.image_url=http://192.168.22.1:8080/ocp4/rhcos coreos.inst.insecure=yes coreos.inst.ignition_url=http://192.168.22.1:8080/ocp4/bootstrap.ign
    
    # Or, if you waited for it to boot, use the following command, then reboot after it finishes and make sure you remove the attached .iso
    sudo coreos-installer install /dev/sda -u http://192.168.22.1:8080/ocp4/rhcos -I http://192.168.22.1:8080/ocp4/bootstrap.ign --insecure --insecure-ignition
    # Each of the Control Plane Nodes - ocp-cp-#
    coreos.inst.install_dev=sda coreos.inst.image_url=http://192.168.22.1:8080/ocp4/rhcos coreos.inst.insecure=yes coreos.inst.ignition_url=http://192.168.22.1:8080/ocp4/master.ign
    
    # Or, if you waited for it to boot, use the following command, then reboot after it finishes and make sure you remove the attached .iso
    sudo coreos-installer install /dev/sda -u http://192.168.22.1:8080/ocp4/rhcos -I http://192.168.22.1:8080/ocp4/master.ign --insecure --insecure-ignition
  2. Power on the ocp-w-# hosts and press 'Tab' to enter boot configuration. Enter the following configuration:

    # Each of the Worker Nodes - ocp-w-#
    coreos.inst.install_dev=sda coreos.inst.image_url=http://192.168.22.1:8080/ocp4/rhcos coreos.inst.insecure=yes coreos.inst.ignition_url=http://192.168.22.1:8080/ocp4/worker.ign
    
    # Or, if you waited for it to boot, use the following command, then reboot after it finishes and make sure you remove the attached .iso
    sudo coreos-installer install /dev/sda -u http://192.168.22.1:8080/ocp4/rhcos -I http://192.168.22.1:8080/ocp4/worker.ign --insecure --insecure-ignition

Monitor the Bootstrap Process

  1. You can monitor the bootstrap process from the ocp-svc host at different log levels (debug, error, info)

    ~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug
  2. Once bootstrapping is complete, the ocp-bootstrap node can be removed

Remove the Bootstrap Node

  1. Remove all references to the ocp-bootstrap host from the /etc/haproxy/haproxy.cfg file

    # Two entries
    vim /etc/haproxy/haproxy.cfg
    # Restart HAProxy - If you are still watching the HAProxy stats console you will see that the ocp-bootstrap host has been removed from the backends.
    systemctl reload haproxy
  2. The ocp-bootstrap host can now be safely shut down and deleted from the VMware ESXi console; the host is no longer required

Wait for installation to complete

IMPORTANT: if you set mastersSchedulable to false, the worker nodes will need to be joined to the cluster to complete the installation. This is because the OpenShift Router needs to be scheduled on the worker nodes and is a dependency for cluster operators such as ingress, console and authentication.

  1. Collect the OpenShift Console address and kubeadmin credentials from the output of the install-complete event

    ~/openshift-install --dir ~/ocp-install wait-for install-complete
  2. Continue to join the worker nodes to the cluster in a new tab whilst waiting for the above command to complete

Join Worker Nodes

  1. Setup 'oc' and 'kubectl' clients on the ocp-svc machine

    export KUBECONFIG=~/ocp-install/auth/kubeconfig
    # Test auth by viewing cluster nodes
    oc get nodes
  2. View and approve pending CSRs

    Note: Once you approve the first set of CSRs additional 'kubelet-serving' CSRs will be created. These must be approved too. If you do not see pending requests wait until you do.

    # View CSRs
    oc get csr
    # Approve all pending CSRs
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
    # Wait for kubelet-serving CSRs and approve them too with the same command
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
  3. Watch and wait for the Worker Nodes to join the cluster and enter a 'Ready' status

    This can take 5-10 minutes

    watch -n5 oc get nodes

Configure storage for the Image Registry

A bare-metal cluster does not provide storage by default, so the Image Registry Operator bootstraps itself as 'Removed' to allow the installer to complete. Now that the installation has completed, storage can be added for the Registry and the operator updated to a 'Managed' state.

  1. Create the 'image-registry-storage' PVC by editing the Image Registry operator config: set the management state to 'Managed' and add the 'pvc' and 'claim' keys under the storage key:

    oc edit configs.imageregistry.operator.openshift.io
    managementState: Managed
    storage:
      pvc:
        claim: # leave the claim blank
  2. Confirm the 'image-registry-storage' pvc has been created and is currently in a 'Pending' state

    oc get pvc -n openshift-image-registry
  3. Create the persistent volume for the 'image-registry-storage' pvc to bind to

    oc create -f ~/ocp4-metal-install/manifest/registry-pv.yaml
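
    The persistent volume simply points the claim at the NFS export created earlier. It looks roughly like the sketch below; registry-pv.yaml in the cloned repo is the authoritative manifest, and the capacity shown here is an assumption.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: registry-pv
    spec:
      capacity:
        storage: 100Gi
      accessModes:
      - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: 192.168.22.1     # ocp-svc
        path: /shares/registry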
  4. After a short wait the 'image-registry-storage' pvc should now be bound

    oc get pvc -n openshift-image-registry

Create the first Admin user

  1. Apply the oauth-htpasswd.yaml file to the cluster

    This will create a user 'admin' with the password 'password'. To set a different username and password, substitute the htpasswd key in the '~/ocp4-metal-install/manifest/oauth-htpasswd.yaml' file with the output of htpasswd -n -B -b <username> <password>

    oc apply -f ~/ocp4-metal-install/manifest/oauth-htpasswd.yaml
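
    Conceptually the manifest pairs an htpasswd Secret with an OAuth config that references it, along the lines of the sketch below. The object names here are illustrative; the real values are in the repo's oauth-htpasswd.yaml.

    apiVersion: v1
    kind: Secret
    metadata:
      name: htpasswd-secret      # illustrative name
      namespace: openshift-config
    stringData:
      htpasswd: |
        admin:$2y$05$...         # output of: htpasswd -n -B -b admin password
    ---
    apiVersion: config.openshift.io/v1
    kind: OAuth
    metadata:
      name: cluster
    spec:
      identityProviders:
      - name: htpasswd_provider
        type: HTPasswd
        mappingMethod: claim
        htpasswd:
          fileData:
            name: htpasswd-secret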
  2. Assign the new user (admin) admin permissions

    oc adm policy add-cluster-role-to-user cluster-admin admin

Access the OpenShift Console

  1. Wait for the 'console' Cluster Operator to become available

    oc get co
  2. Append the following to your local workstation's /etc/hosts file:

    If you do not want to add an entry on your local workstation for each new service made available on OpenShift, you can instead configure the ocp-svc DNS server to serve external clients and create a wildcard entry for *.apps.lab.ocp.lan

    # Open the hosts file
    sudo vi /etc/hosts
    
    # Append the following entries:
    192.168.0.96 ocp-svc api.lab.ocp.lan console-openshift-console.apps.lab.ocp.lan oauth-openshift.apps.lab.ocp.lan downloads-openshift-console.apps.lab.ocp.lan alertmanager-main-openshift-monitoring.apps.lab.ocp.lan grafana-openshift-monitoring.apps.lab.ocp.lan prometheus-k8s-openshift-monitoring.apps.lab.ocp.lan thanos-querier-openshift-monitoring.apps.lab.ocp.lan
  3. Navigate to the OpenShift Console URL and log in as the 'admin' user

    You will get self-signed certificate warnings, which you can ignore. If you need to log in as kubeadmin again, you can retrieve the password with: cat ~/ocp-install/auth/kubeadmin-password

Troubleshooting

  1. You can collect logs from all cluster hosts by running the following command from the 'ocp-svc' host:

    ./openshift-install gather bootstrap --dir ocp-install --bootstrap=192.168.22.200 --master=192.168.22.201 --master=192.168.22.202 --master=192.168.22.203
  2. Modify the role of the Control Plane Nodes

    If you would like to schedule workloads on the Control Plane nodes, apply the 'worker' role by changing the value of 'mastersSchedulable' to true.

    If you do not want to schedule workloads on the Control Plane nodes, remove the 'worker' role by changing the value of 'mastersSchedulable' to false.

    Remember that, depending on where you host your workloads, you will have to update HAProxy to include or exclude the control plane nodes from the ingress backends.

    oc edit schedulers.config.openshift.io cluster
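
    The relevant field sits under spec in the cluster-wide Scheduler resource; a minimal sketch of the object you are editing:

    apiVersion: config.openshift.io/v1
    kind: Scheduler
    metadata:
      name: cluster
    spec:
      mastersSchedulable: false  # set to true to run workloads on the control plane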

ocp4-metal-install's People

Contributors

idemery, ryanhay


ocp4-metal-install's Issues

DEBUG Still waiting for the Kubernetes API: the server has asked for the client to provide credentials

[root@ocp-svc ~]# ~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.9.18
DEBUG Built from commit eb132dae953888e736c382f1176c799c0e1aa49e
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.ocp.lan:6443...
DEBUG Still waiting for the Kubernetes API: the server has asked for the client to provide credentials

Does anyone know why this is happening?

overlayfs: unrecognized mount option ""volatile" or missing value

I'm trying to install OpenShift 4.9 using IPI and UPI, but the same error appears on the bootstrap machine: "overlayfs: unrecognized mount option ""volatile" or missing value",
and the installation didn't complete, with the following output (if anyone can help, I'll be very thankful):

time="2021-11-08T05:38:16+02:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2021-11-08T05:38:16+02:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2021-11-08T05:38:16+02:00" level=fatal msg="Bootstrap failed to complete"

======

I'm using vSphere 6.7U2.
.openshift_install.log is attached

Error when issuing the setsebool command under the HAProxy section

When issuing this command

setsebool -P haproxy_connect_any 1 # SELinux name_bind access

I get the following error and cannot enable haproxy

[root@ocp-svc haproxy]# libsepol.context_from_record: type hwtracing_device_t is not defined
libsepol.context_from_record: could not create context structure
libsepol.context_from_string: could not create context structure
libsepol.sepol_context_to_sid: could not convert system_u:object_r:hwtracing_device_t:s0 to sid
invalid context system_u:object_r:hwtracing_device_t:s0
Failed to commit changes to booleans: Success

RHEL 9

Kubernetes API Issue

Hi Ryan,

Thank you so much for this repository. I am running OpenShift and am trying to get the bootstrap to complete, debugging it using the command ~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug.

Error:
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.ocp.lan:6443/...
DEBUG Still waiting for the Kubernetes Get " https://api.lab.ocp.lan:6443/version":EOF.

If anyone else had the same error or knows how to fix it let me know.
Thanks

Critical documentation error

I don't know if this was introduced with OpenShift 4.7, but you need to change the kernel boot arguments, otherwise it will fail to boot complaining about missing GPG signature files that you cannot get hold of.

The official documents state ...

"If you are using coreos.inst.image_url, you must also use coreos.inst.insecure. This is because the bare-metal media are not GPG-signed for OpenShift Container Platform."

Your kernel boot documentation is omitting the final

coreos.inst.insecure

!!!
It took me 2 weeks to figure out; I kept trying to sign the image files with ssh-keygen, which is currently impossible with the tools in RHEL 8.

Suggestion for NFS shares

Hey Ryan,

I have a couple of suggestions for your NFS section (18).

  1. create a couple of extra shares for when the cluster is ready - folks are going to need a few, might as well set them up here...
   mkdir -p /shares/{registry,pv0001,pv0002,pv0003}
   chown -R nobody:nobody /shares/{registry,pv0001,pv0002,pv0003}
   chmod -R 777 /shares/{registry,pv0001,pv0002,pv0003}

and

echo "/shares/registry 192.168.22.0/24(rw,sync,root_squash,no_subtree_check,no_wdelay)" > /etc/exports
echo "/shares/pv0001 192.168.22.0/24(rw,sync,root_squash,no_subtree_check,no_wdelay)" >> /etc/exports
echo "/shares/pv0002 192.168.22.0/24(rw,sync,root_squash,no_subtree_check,no_wdelay)" >> /etc/exports
echo "/shares/pv0003 192.168.22.0/24(rw,sync,root_squash,no_subtree_check,no_wdelay)" >> /etc/exports
exportfs -arv
  2. Folks should pay close attention to the subnet and use the correct one. I deployed a cluster on IBM's cloud and my subnet was 10.70.174.128/26 - which looks like a normal IP address but isn't; it's a "network" because the CIDR is /26 ;-)

I would even recommend spinning up an extra "test" VM on the cluster's subnet and actually testing the NFS mounts before attempting to use them as persistent volumes:

sudo mkdir /test
mount -t nfs ocp-svc:/shares/registry /test
touch /test/it-works
rm /test/it-works
umount /test

If this works, it can save you some real issues down the road...

You could also use the VM to validate some network routes and name resolution, to ensure the firewall has opened its ports and is forwarding traffic correctly.

Thanks again for your great work capturing all of this.

E.

Is the ~/ocp-install/auth directory needed to be copied to /var/www/html/ocp4?

For this step
cp -R ~/ocp-install/* /var/www/html/ocp4

This would also copy the ~/ocp-install/auth directory and exposed via the web server.
It would include the two files:
kubeadmin-password and kubeconfig

In case it is necessary for installation of the control plane / nodes, would it be OK to remove the directory /var/www/html/ocp4/auth after installation?

HAProxy stats not updating on adding new worker nodes

I followed the same steps to configure ocp4-metal-install and successfully configured the cluster with 3 control plane nodes and two worker nodes.
I added an additional node to the cluster successfully, but the status is not updating in HAProxy. Do I need to make any changes in HAProxy? Any support is much appreciated.

[root@ocp-svc log]# oc get nodes
NAME STATUS ROLES AGE VERSION
ocp-cp-1.lab.ocp.lan Ready control-plane,master,worker 28h v1.27.8+4fab27b
ocp-cp-2.lab.ocp.lan Ready control-plane,master,worker 27h v1.27.8+4fab27b
ocp-cp-3.lab.ocp.lan Ready control-plane,master,worker 27h v1.27.8+4fab27b
ocp-w-1.lab.ocp.lan Ready worker 26h v1.27.8+4fab27b
ocp-w-2.lab.ocp.lan Ready worker 26h v1.27.8+4fab27b
ocp-w-3.lab.ocp.lan Ready worker 7h37m v1.27.8+4fab27b
ocp-w-4.lab.ocp.lan Ready worker 7h17m v1.27.8+4fab27b
[root@ocp-svc log]#


Documentation issue under Deploy OpenShift

Step 1 - I cannot get either option to work.

Q1: Is rhcos equal to the rhcos*.iso file?
Q2: For the bootstrap method, do you replace the entire line after pressing the TAB?
Q3: For the other method I get the error shown in the attached screenshot.

# Bootstrap Node - ocp-bootstrap
coreos.inst.install_dev=sda coreos.inst.image_url=http://192.168.22.1:8080/ocp4/rhcos coreos.inst.insecure=yes coreos.inst.ignition_url=http://192.168.22.1:8080/ocp4/bootstrap.ign

# Or if you waited for it boot, use the following command then just reboot after it finishes and make sure you remove the attached .iso
sudo coreos-installer install /dev/sda -u http://192.168.22.1:8080/ocp4/rhcos -I http://192.168.22.1:8080/ocp4/bootstrap.ign --insecure --insecure-ignition

Issue during boot on bootstrap and control plane

Version
$ openshift-install version
./openshift-install 4.5.9
built from commit 0d5c871ce7d03f3d03ab4371dc39916a5415cf5c
release image quay.io/openshift-release-dev/ocp-release@sha256:7ad540594e2a667300dd2584fe2ede2c1a0b814ee6a62f60809d87ab564f4425
Platform:
baremetal

UPI (semi-manual installation on customised infrastructure)
What happened?
Cluster details:
Control plane nodes on bare metal and workers on bare metal, but the bootstrap is running on an ESXi server which is on the same network.

After I launch my bootstrap and control plane nodes I can see this message for the bootstrap:

~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.5.9
DEBUG Built from commit 0d5c871
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.ocp.lan:6443...
DEBUG Still waiting for the Kubernetes API: Get https://api.lab.ocp.lan:6443/version?timeout=32s: EOF
DEBUG Still waiting for the Kubernetes API: Get https://api.lab.ocp.lan:6443/version?timeout=32s: EOF
(the same DEBUG line repeats until the wait times out)

And on the control plane nodes I can see the error shown in the attached screenshot.

May I know if I have missed something? I feel there is some issue with connectivity.

expose applications to external IPs

I deployed an Oracle Database container and am trying to access the DB using a service. But after exposing the service, I'm not able to access or ping the external IP. I used NodePort and an external IP as well but am still not able to access the database.

[root@ocp-svc ~]# oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 172.30.0.1 443/TCP 21d
openshift ExternalName kubernetes.default.svc.cluster.local 21d
os-sample-java-web ClusterIP 172.30.104.87 8080/TCP,8443/TCP,8778/TCP 4d19h
sdsqa-oracle-svc02 NodePort 172.30.111.211 10.221.92.160 1521:31864/TCP 5d23h
sdsqa-oracle-svc03 NodePort 172.30.167.211 10.221.92.160 1521:31865/TCP 4d21h
sdsqa-oracle-svc04 NodePort 172.30.213.22 1521:31866/TCP 2d4h
[root@ocp-svc ~]#

Any idea how we can access applications using external IPs or the NodePort mechanism?

Error installing worker node

Hi,

Following your install guide, I got an error on worker node installation; it has attempted more than 600 times and keeps going. See the attached screenshot.


Please help with what needs to be checked and done here.

RHCOS Install file

This file (rhcos-X.X.X-x86_64-metal.x86_64.raw.gz) is no longer an available download when creating a cluster. Any idea what I should use instead?

CP nodes not taking URLs

Hello Sir,

I did the same configuration of the services node on CentOS 7 rather than 8, because I was facing issues downloading repositories on CentOS 8. All configuration was successful on CentOS 7, but whenever I try to pull the files on the bootstrap node it fails to download the files from the URL.

I attached a screenshot of the error.

Bootstrap fails with error

Hello, I have a problem when installing OpenShift 4.11; I get the following error on the bootstrap machine.
Error: couldn't find boot device for /dev/sda
Resetting partition table
Error : install failed

https://photos.app.goo.gl/jFRcjr24hm11rVdN7

OpenShift 4.7: there is a known bug that is causing the bootstrap and install phases to fail on VMware hardware version 14 VMs.

You may want to put a warning up in your docs :-) - 2 months of pain.

https://bugzilla.redhat.com/show_bug.cgi?id=1935539

RH have issued a warning on their web site

Virtual machines (VMs) configured to use virtual hardware version 14 or greater might result in a failed installation. It is recommended to configure VMs with virtual hardware version 13. This is a known issue that is being addressed in BZ#1935539.

I have seen this with both ESXi 7.0b and Proxmox 6.3-6. This may save people the 2 months of pain I have been going through.

The nature of the problem means you just have to keep restarting the bootstrap/install phase every time it times out and fails; eventually, if you are lucky, you will get through the problem within 24 hours.

Unable to "Open Console"

Hello All
I am new to OpenShift. I have successfully deployed a cluster as per all the given instructions.
In the Overview tab the cluster status is "Ready", but I am unable to open the link through the "Open Console" button.
It gives the error "This site can't be reached".

Attached a snapshot for reference. Thanks in advance!


Combination of ESXi and BareMetal

This is more of a question than an issue, @ryanhay. I would like to know if I can use a combination of ESXi and bare metal: can I spin up all the control plane nodes on bare metal, and even the worker nodes on bare metal, and keep the services and bootstrap VMs on the ESXi server?

My catch here is the OCP network (the 2nd network that we create): how do I bridge that to my VMs? Also, my bootstrap machine, which is on ESXi, is on the OCP network but is unable to fetch files from the CentOS VM. I'm a newbie to this, so these might sound like basic questions. Any help would be appreciated.

Image registry access outside the private network

Hi ,

Your guide was very useful.
Just one question though: how do I access the OpenShift image registry from outside the network? The image registry URL ends with apps.lab.ocp.lan, but when I try to access it from the ocp-svc machine it doesn't show up. The pod is running and it has an internal IP (assigned by libvirt, I think), so how can I access it from the ocp-svc machine?

HAProxy stats not updating on adding two more worker nodes.

I added two more workers to the cluster and added their entries in the HAProxy configuration.

The entries are listed on the stats page, but details for only 2 out of 4 workers get updated there. What more configuration should I do to see updates from all 4 worker nodes?

Making sure that master is not schedulable

Hi Ryan

I would suggest making it clear that the:

sed -i 's/mastersSchedulable: true/mastersSchedulable: false/' ~/ocp-install/manifests/cluster-scheduler-02-config.yml

command is not optional. I missed it the first time, and the following instructions only work when the command has been run beforehand. Great guide though, highly appreciated!

Unable to access https://192.168.2.200:9000/stats and https://192.168.2.200:6443/

It looks like the OpenShift environment built, but I am not able to access the following:

https://192.168.2.200:6443/

{
"kind": "Status",
"apiVersion": "v1",
"metadata": {

},
"status": "Failure",
"message": "forbidden: User "system:anonymous" cannot get path "/"",
"reason": "Forbidden",
"details": {

},
"code": 403
}

https://192.168.2.200:9000/stats

This site can't provide a secure connection. 192.168.2.200 sent an invalid response.
ERR_SSL_PROTOCOL_ERROR

Details of configured env below.

[root@okd4-services ~]# curl -kv https://oauth-openshift.apps.lab.ocp.lan/healthz

  • Trying 192.168.2.200...
  • TCP_NODELAY set
  • Connected to oauth-openshift.apps.lab.ocp.lan (192.168.2.200) port 443 (#0)
  • ALPN, offering h2
  • ALPN, offering http/1.1
  • successfully set certificate verify locations:
  • CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.lab.ocp.lan:443
    curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.lab.ocp.lan:443

[root@okd4-services ~]# sh -x /tmp/.g

  • oc get csr
    No resources found
  • oc get nodes
    NAME STATUS ROLES AGE VERSION
    okd4-compute-1.lab.ocp.lan Ready worker 6h10m v1.22.0-rc.0+a44d0f0
    okd4-compute-2.lab.ocp.lan Ready worker 5h35m v1.22.0-rc.0+a44d0f0
    okd4-control-plane-1.lab.ocp.lan Ready master,worker 6h20m v1.22.0-rc.0+a44d0f0
    okd4-control-plane-2.lab.ocp.lan Ready master,worker 6h17m v1.22.0-rc.0+a44d0f0
    okd4-control-plane-3.lab.ocp.lan Ready master,worker 6h14m v1.22.0-rc.0+a44d0f0
  • oc get pods
    NAME READY STATUS RESTARTS AGE
    myapache-7bcf9c6d44-tjmzs 1/1 Running 0 4h5m
    myapache-7bcf9c6d44-xmtqx 1/1 Running 0 4h5m
  • oc get pods -n openshift-network-operator
    NAME READY STATUS RESTARTS AGE
    network-operator-6f4564ffb-b82hr 1/1 Running 15 (14m ago) 6h25m
  • oc get daemonsets -n openshift-sdn
    NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
    sdn 5 5 5 5 5 kubernetes.io/os=linux 6h19m
    sdn-controller 3 3 3 3 3 node-role.kubernetes.io/master= 6h19m
  • oc get network.config.openshift.io cluster -o yaml
    apiVersion: config.openshift.io/v1
    kind: Network
    metadata:
      creationTimestamp: "2021-11-12T17:50:19Z"
      generation: 2
      name: cluster
      resourceVersion: "3114"
      uid: 67438d09-d9df-4b9c-b19d-9863e990df05
    spec:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      externalIP:
        policy: {}
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    status:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      clusterNetworkMTU: 1450
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16

[root@okd4-services ~]# dig ocp.lan

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> ocp.lan
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58296
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 0478592c3088da21edc7206c618f046c2b6a75fe33e2ea60 (good)
;; QUESTION SECTION:
;ocp.lan. IN A

;; AUTHORITY SECTION:
ocp.lan. 604800 IN SOA okd4-services.ocp.lan. contact.ocp.lan.ocp.lan. 1 604800 86400 2419200 604800

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Nov 12 19:18:52 EST 2021
;; MSG SIZE rcvd: 130

[root@okd4-services ~]# dig -x 192.168.22.211

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> -x 192.168.22.211
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16721
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 4c8bb19fd94b10e384c6db3b618f046f084882a1da69e033 (good)
;; QUESTION SECTION:
;211.22.168.192.in-addr.arpa. IN PTR

;; ANSWER SECTION:
211.22.168.192.in-addr.arpa. 604800 IN PTR okd4-compute-1.lab.ocp.lan.

;; AUTHORITY SECTION:
22.168.192.in-addr.arpa. 604800 IN NS okd4-services.ocp.lan.

;; ADDITIONAL SECTION:
okd4-services.ocp.lan. 604800 IN A 192.168.22.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Nov 12 19:18:55 EST 2021
;; MSG SIZE rcvd: 168
