support-tools

This repository contains Rancher support-tools to assist with investigating and troubleshooting issues with Rancher clusters, as well as other maintenance tasks.

Caution:

This repository contains scripts that can cause harm if used without the guidance of Rancher Support. We advise reaching out to Rancher Support before executing any of these scripts. Failure to reach out could incur production downtime.

The repository consists of the following directories of tools:

  • collection: non-mutating, non-destructive scripts for the purpose of collecting information/logs from a cluster or node.
  • files: common files used in conjunction with troubleshooting commands.

support-tools's People

Contributors

aemneina, ansilh, axeal, bentastic27, celidon, dhawton, dkeightley, dnavarrete-suse, dnoland1, jambajaar, juanbrny, kourosh7, leflambeur, leodotcloud, leonardoalvesprates, mallardduck, masap, mattmattox, moio, oleg-vorobiov-suse, oxr463, patrick0057, rbreddy, richardcase, rosskirkpat, ryanelliottsmith, superseb, suryatejaboorlu, tlatino, weyfonk

support-tools's Issues

grab kubelet logs for rke2 deployments

It seems the kubelet logs aren't in journald. They can be found here:
/var/lib/rancher/rke2/agent/logs/kubelet.log

Ignore the following, this was from manual fiddling:
There's also another file we might want to grab at:

/var/lib/rancher/rke2/agent/logs/archive-log.log
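
As a rough illustration of what the collector could do for RKE2 nodes, here is a minimal sketch that copies the kubelet log from the path quoted above when it exists. The function name and the RKE2_KUBELET_LOG override variable are hypothetical and not part of the existing script.

```shell
# Path taken from the issue text; the override variable is an assumption.
RKE2_KUBELET_LOG="${RKE2_KUBELET_LOG:-/var/lib/rancher/rke2/agent/logs/kubelet.log}"

# Copy the kubelet log into the collection directory if it is present.
# A missing file is reported on stderr but is not treated as an error.
collect_rke2_kubelet_log() {
  dest_dir="$1"
  if [ -f "$RKE2_KUBELET_LOG" ]; then
    mkdir -p "$dest_dir"
    cp -p "$RKE2_KUBELET_LOG" "$dest_dir/"
  else
    echo "kubelet log not found at $RKE2_KUBELET_LOG, skipping" >&2
  fi
}
```

It would be called as `collect_rke2_kubelet_log "$TMPDIR/rke2/kubelet"`, where the destination layout is only an example.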

Invalid node role for cluster-agent-tool using example from help text

On a call with a customer who ran the tool with the -r option taken from the example in the help text:

bash cluster-agent-tool.sh -r'--etcd --controlplane --worker'
You passed an invalid node role.  Listing what you specified below.
--etcd --controlplane --worker

Valid options are: --etcd --controlplane --worker

gz#17878
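
The cause of the rejection is not confirmed in this issue. One possible way to accept a combined role string is to validate each whitespace-separated word on its own, as in this sketch. The function name is hypothetical; the three role flags are the ones listed in the error output above.

```shell
# Return 0 only when the argument contains at least one word and every
# whitespace-separated word is one of the known role flags.
validate_roles() {
  count=0
  for role in $1; do
    case "$role" in
      --etcd|--controlplane|--worker) ;;
      *) echo "Invalid node role: $role" >&2; return 1 ;;
    esac
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}
```

With this check, `validate_roles '--etcd --controlplane --worker'` succeeds, while a string containing any other word fails and names the offending word.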

[Bug] rancher-single-tool.sh, is not compatible with the last v2.5.x version

Hi there,
Thanks for the useful support tools, but I have a problem using them with the latest Rancher v2.5.x.
I would like to change the server URL from an IP address to a DNS name on Rancher v2.5.7.
https://support.tools/post/how-to-change-rancher-2-x-server-url

I noticed that "--privileged" is missing from the standard settings, so I added it.
The update is created successfully and the container is restarted, but after a server / Docker restart the Docker service does not start anymore.

Is it possible to change the server URL of a Rancher v2.5.7 installation with this tool?

Review use of UTF8 in Windows Logging Script

$PSDefaultParameterValues['*:Encoding'] = 'utf8'

Per rancher/rancher#30701, Windows 20H2 now defaults to UTF-8 with BOM whereas all previous versions were UTF-8 without BOM. Fix for the above GH issue - rancher/rke-tools#125

The logging script may produce undesirable results on 20H2 due to this change. It requires further testing and possibly some slight changes.

Create data collection such that it unpacks into a unique directory

RFE....

This looks like a useful data collection, but I would suggest that the resulting tarball create a directory based on the name of the tarball itself. As it stands, it creates a 'flat' layout, which means that if you unpack several of these into the same directory, they will overwrite each other.

Example:

tar tf ../somemachine-2022-10-18_10_02_25.tar.gz | head
./
./etcd/
./etcd/findserverdbetcd
./etcd/endpointhealth
./etcd/findserverdbsnapshots
./etcd/endpointstatus
./etcd/alarmlist
./networking/
./networking/ip6tablesnat
./networking/ip6tablesmangle
I think ideally, it would create a 'somemachine-2022..' directory and in there put the resulting files.
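
A minimal sketch of the requested layout, assuming the collected files already sit in a working directory named after the bundle. The function name and its arguments are hypothetical. Archiving the directory by name from its parent, rather than archiving `.` from inside it, is what puts every entry under a common top-level directory.

```shell
# Create <out_dir>/<name>.tar.gz whose entries all live under "<name>/".
# src_dir: directory holding the collected files (its basename becomes <name>).
# out_dir: directory where the tarball is written.
create_bundle() {
  src_dir="$1"
  out_dir="$2"
  name=$(basename "$src_dir")
  tar -czf "$out_dir/$name.tar.gz" -C "$(dirname "$src_dir")" "$name"
}
```

Unpacking two such bundles into the same directory then yields two sibling directories instead of one merged tree.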

Cluster-agent-tool for RKE2

I would like to see a version of the cluster-agent-tool script for downstream RKE2 clusters for when you change your Rancher url or certs. I've currently got 4 RKE2 clusters that are sitting in "Waiting for at least one bootstrap node" in my Cluster Management list because I changed certs for my Rancher app from the default Rancher self-signed to LetsEncrypt certs. Some kind of documentation/discussion would suffice in the interim. Thank you!

Retrieve etcd metrics on etcd nodes

When an etcd node is detected, retrieve the output from the metrics endpoint of the etcd container

Eg:

curl --cacert /etc/kubernetes/ssl/kube-ca.pem --key /etc/kubernetes/ssl/kube-etcd-xxx-key.pem --cert /etc/kubernetes/ssl/kube-etcd-xxx.pem https://127.0.0.1:2379/metrics
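
The node-specific part of the certificate name is elided in the example above, so a collector would have to discover it. The sketch below looks for the first `kube-etcd-*.pem` file that is not a key and derives the key name from it, following the naming pattern shown in the curl command. Both function names are hypothetical.

```shell
# Print the first etcd certificate in a directory, skipping "-key.pem" files.
find_etcd_cert() {
  for f in "$1"/kube-etcd-*.pem; do
    case "$f" in
      *-key.pem) continue ;;
    esac
    [ -f "$f" ] && { echo "$f"; return 0; }
  done
  return 1
}

# Fetch the metrics endpoint with the discovered cert/key pair.
# Assumes the key is named like the certificate with a "-key" suffix.
fetch_etcd_metrics() {
  ssl_dir="${1:-/etc/kubernetes/ssl}"
  cert=$(find_etcd_cert "$ssl_dir") || { echo "no etcd cert in $ssl_dir" >&2; return 1; }
  key="${cert%.pem}-key.pem"
  curl --silent --cacert "$ssl_dir/kube-ca.pem" --key "$key" --cert "$cert" \
    https://127.0.0.1:2379/metrics
}
```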

Update log collector to get specific fleet logs

I've got a request to add these fleet-specific one-liners [sanitised URL] to the Linux log collector script. Would it be possible to integrate these scripts directly into the log collector tool?

low priority nice/ionice by default

To ensure the log collector has no adverse performance impact while troubleshooting an issue, I would like the script to lower its CPU and I/O priority to the lowest level by default, perhaps with an optional flag to disable or change the priority if needed.

#gz14380
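
One way this could look is a small wrapper that runs external commands under `nice` and, where available, `ionice`. This is a sketch only: the function name and the LOW_PRIORITY opt-out variable are hypothetical, and the wrapper can only run external programs, not shell functions.

```shell
# Run a command at the lowest CPU priority, and in the idle I/O class when
# ionice exists and is permitted. Set LOW_PRIORITY=0 to run it unchanged.
run_low_priority() {
  if [ "${LOW_PRIORITY:-1}" = "0" ]; then
    "$@"
    return
  fi
  if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
    # Niceness 19 is the lowest CPU priority; class 3 is the idle I/O class.
    nice -n 19 ionice -c 3 "$@"
  else
    nice -n 19 "$@"
  fi
}
```

For example, `run_low_priority tar -czf bundle.tar.gz somedir` would compress at low priority, where the tar arguments are placeholders.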

Add support for both etcd 3.3 and 3.4 to 2.x logs-collector

To collect logs from both versions successfully, detect and run the appropriate command to avoid errors, such as:

2020-05-10 19:42:05.100309 C | pkg/flags: conflicting environment variable "ETCDCTL_ENDPOINTS" is shadowed by corresponding command-line flag (either unset environment variable or disable flag)
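
Detection could start from the version banner that etcdctl prints. The sketch below only extracts "major.minor" from that output; which command form to run for each version is left to the caller, since this issue does not spell it out. The function name is hypothetical.

```shell
# Read etcdctl version output on stdin and print "major.minor",
# e.g. a line "etcdctl version: 3.4.13" yields "3.4".
etcd_major_minor() {
  sed -n 's/^etcdctl version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}
```

A caller might pipe the banner from the etcd container into it and branch on the result with a `case` statement.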

Add pidstat to log collector script

Eg:-

pidstat -drshut -p ALL
Linux 5.4.0-74-generic (rancher24x) 	10/06/21 	_x86_64_	(4 CPU)

# Time        UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM StkSize  StkRef   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
09:30:08        0         2         -    0.00    0.00    0.00    0.00    0.00     3      0.00      0.00       0       0   0.00       0       0     -1.00     -1.00     -1.00       0  kthreadd
09:30:08        0         -         2    0.00    0.00    0.00    0.00    0.00     3      0.00      0.00       0       0   0.00       0       0     -1.00     -1.00     -1.00       0  |__kthreadd
09:30:08        0         3         -    0.00    0.00    0.00    0.00    0.00     0      0.00      0.00       0       0   0.00       0       0     -1.00     -1.00     -1.00       0  rcu_gp
....

PII Redaction Script

It would be great to have a tool to redact sensitive information from files that are shared by customers.

For example, removing ssh keys from a cluster.yml or hostnames and/or IP addresses.
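
As a starting point for the IP address case only, here is a deliberately loose sketch that replaces anything shaped like an IPv4 address. The function name and the placeholder text are hypothetical. It does not range-check octets, so any four dot-separated numbers will also be redacted, and it does not handle SSH keys or hostnames.

```shell
# Replace every IPv4-shaped token on stdin with a placeholder.
redact_ipv4() {
  sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/<REDACTED_IP>/g'
}
```

It reads stdin and writes stdout, so it can be used as `redact_ipv4 < cluster.yml > cluster.redacted.yml`.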

More kubectl output for k3s server hosts

For hosts running k3s server, gather the output of these kubectl commands in the log collector:

kubectl get deployments --all-namespaces
kubectl get daemonsets --all-namespaces
kubectl get statefulsets --all-namespaces
kubectl get pvc --all-namespaces
kubectl get pv
kubectl get replicasets --all-namespaces
kubectl get crds
kubectl top nodes
kubectl top pods --all-namespaces
kubectl get helmcharts --all-namespaces
kubectl get roles --all-namespaces
kubectl get rolebindings --all-namespaces
kubectl get clusterroles
kubectl get clusterrolebindings
kubectl get ingress --all-namespaces
kubectl get jobs --all-namespaces
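
The list above could be driven from a single loop, as in the sketch below. It assumes a KUBECTL_CMD variable like the one used elsewhere in the collector; the function name and the output file naming are hypothetical. A failing command (for example `top` without a metrics server) leaves its error text in the output file and does not stop the loop.

```shell
KUBECTL_CMD="${KUBECTL_CMD:-kubectl}"

# Run each kubectl argument line and save its output to a file named after
# the command, e.g. "get pvc --all-namespaces" is written to "get_pvc".
collect_k3s_kubectl_output() {
  out_dir="$1"
  mkdir -p "$out_dir"
  while IFS= read -r args; do
    [ -n "$args" ] || continue
    file=$(printf '%s' "$args" | sed -e 's/ --all-namespaces//' -e 's/ /_/g')
    # Unquoted on purpose so the command and its arguments split into words.
    $KUBECTL_CMD $args > "$out_dir/$file" 2>&1 < /dev/null || true
  done <<'EOF'
get deployments --all-namespaces
get daemonsets --all-namespaces
get statefulsets --all-namespaces
get pvc --all-namespaces
get pv
get replicasets --all-namespaces
get crds
top nodes
top pods --all-namespaces
get helmcharts --all-namespaces
get roles --all-namespaces
get rolebindings --all-namespaces
get clusterroles
get clusterrolebindings
get ingress --all-namespaces
get jobs --all-namespaces
EOF
}
```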

option to tail filesystem logs

We can limit the scope of the k8s logs by days, but log files gathered from the filesystem, such as those in /var/log, are grabbed in their entirety. On space-limited nodes I would like an option to tail those files into smaller bundles, e.g. a -t 2000 parameter to grab the last 2k lines of any file that cannot be truncated by the -s param.
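
A minimal sketch of the tailing step, assuming the line count would come from the proposed -t option. The function name is hypothetical, and only regular files directly inside the source directory are handled.

```shell
# Copy the last N lines (default 2000) of each regular file in src_dir
# into a file of the same name under dest_dir.
tail_logs() {
  src_dir="$1"
  dest_dir="$2"
  lines="${3:-2000}"
  mkdir -p "$dest_dir"
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    tail -n "$lines" "$f" > "$dest_dir/$(basename "$f")"
  done
}
```

A call such as `tail_logs /var/log "$TMPDIR/systemlogs" 2000` would then replace a full copy, with the destination path being an example only.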

Collect all rancher/rancher-agent:vx.x.x logs in v2.x log collector script

When a user runs the v2.x log collector script on a node of a Rancher-provisioned cluster that is failing to join the cluster, we do not collect container logs for share-mnt or for the unnamed agent containers (which are assigned a random Docker container name), even though both are relevant to the investigation. In addition to collecting logs for known names (e.g. share-mnt), we should identify agent containers by another attribute (env vars or image hash, as a few first ideas), because the image name isn't known or guaranteed, for example with customers in air-gapped environments.

Rancher Pod Collector seems to hang while collecting cluster info

It looks like it gets stuck when running the cluster info dump. I suspect it may be on this line:

${KUBECTL_CMD} cluster-info dump -o yaml -n cattle-system --log-file-max-size 200 --output-directory $TMPDIR/clusterinfo/cluster-info-dump

After 5 minutes we ended up hitting ctrl-c and tarballing the logs that had been generated so far. The dump is attached to 00309403 as out.tar.gz
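
Until the cause of the hang is known, the dump could be bounded with coreutils `timeout` so the rest of the collection still completes. This is a sketch under that assumption; the function name is hypothetical and the time limit is up to the caller.

```shell
# Run a command but give up after the given number of seconds.
# coreutils timeout exits with status 124 when the limit is reached.
run_with_timeout() {
  limit="$1"
  shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "Timed out after ${limit}s: $*" >&2
  fi
  return "$status"
}
```

The dump line would then be prefixed with something like `run_with_timeout 300`, where 300 seconds is only an illustrative value.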

overlaytest url wrong

The README says to run:
kubectl apply -f https://raw.githubusercontent.com/rancherlabs/support-tools/master/swiss-army-knife/deploy/overlaytest.yaml
But that URL gives a 404 error. It should say:
kubectl apply -f https://raw.githubusercontent.com/rancherlabs/support-tools/master/swiss-army-knife/overlaytest.yaml

Update windows log collection script

flannel version

C:\opt\bin\rancher-wins-flanneld.exe --version

  • label container log names

  • grab evtx files in addition to json

  • refactor

  • better organize directories (top-level directory should contain all log files and dirs)

k3s log collection needs to differentiate between agent or server config

Running tests on the new log-collector

It seems the script can't quite tell the difference between a k3s agent and a k3s server, which results in some errors:

k3s Server (As expected):

19:23:25 [pi] @ rpi-cluster-a:(~) %
-> wget -O- https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh | sudo bash -s

--2020-05-14 19:23:39--  https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19357 (19K) [text/plain]
Saving to: ‘STDOUT’

-                                                                                                                          100%[======================================================================================================================================================================================================================================================================================================================================>]  18.90K  --.-KB/s    in 0.03s

2020-05-14 19:23:39 (670 KB/s) - written to stdout [19357/19357]

2020-05-14 19:23:39: Created /tmp/tmp.5eg1orXrEI
2020-05-14 19:23:39: Detecting OS... raspbian 10
2020-05-14 19:23:39: Detecting container runtime... k3s
2020-05-14 19:23:40: Detecting init type... systemd
2020-05-14 19:23:40: Collecting system info
2020-05-14 19:23:46: Collecting network output
2020-05-14 19:23:46: Collecting k3s info
2020-05-14 19:23:56: Collecting k3s logs
2020-05-14 19:24:03: Collecting Rancher logs
2020-05-14 19:24:20: Collecting k3s directory state
2020-05-14 19:24:20: Collecting k3s certificates
2020-05-14 19:24:20: Collecting system logs from /var/log
2020-05-14 19:24:20: Collecting system logs from journald
2020-05-14 19:24:21: Created /tmp/rpi-cluster-a-2020-05-14_19_24_21.tar.gz
2020-05-14 19:24:21: Removing /tmp/tmp.5eg1orXrEI

k3s Agents (not as expected):

19:32:41 [pi] @ rpi-cluster-b:(~) %
-> wget -O- https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh | sudo bash -s
--2020-05-14 19:32:54--  https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19357 (19K) [text/plain]
Saving to: ‘STDOUT’

-                                                                                                                          100%[======================================================================================================================================================================================================================================================================================================================================>]  18.90K  --.-KB/s    in 0.05s

2020-05-14 19:32:54 (413 KB/s) - written to stdout [19357/19357]

2020-05-14 19:32:54: Created /tmp/tmp.1vjPzPYVwQ
2020-05-14 19:32:54: Detecting OS... raspbian 10
2020-05-14 19:32:54: Detecting container runtime... k3s
2020-05-14 19:32:55: Detecting init type... systemd
2020-05-14 19:32:55: Collecting system info
2020-05-14 19:33:01: Collecting network output
2020-05-14 19:33:01: Collecting k3s info
2020-05-14 19:33:10: Collecting k3s logs
2020-05-14 19:33:16: Collecting Rancher logs
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
2020-05-14 19:33:22: Collecting k3s directory state
2020-05-14 19:33:22: Collecting k3s certificates
find: ‘/var/lib/rancher/k3s/server/tls’: No such file or directory
2020-05-14 19:33:22: Collecting system logs from /var/log
2020-05-14 19:33:22: Collecting system logs from journald
2020-05-14 19:33:22: Created /tmp/rpi-cluster-b-2020-05-14_19_33_22.tar.gz
2020-05-14 19:33:22: Removing /tmp/tmp.1vjPzPYVwQ
19:40:07 [pi] @ rpi-cluster-c:(~) %
-> wget -O- https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh | sudo bash -s

--2020-05-14 19:40:13--  https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19357 (19K) [text/plain]
Saving to: ‘STDOUT’

-                                                                                                                          100%[======================================================================================================================================================================================================================================================================================================================================>]  18.90K  --.-KB/s    in 0.04s

2020-05-14 19:40:14 (440 KB/s) - written to stdout [19357/19357]

2020-05-14 19:40:14: Created /tmp/tmp.lH6q4J6lDQ
2020-05-14 19:40:14: Detecting OS... raspbian 10
2020-05-14 19:40:14: Detecting container runtime... k3s
2020-05-14 19:40:15: Detecting init type... systemd
2020-05-14 19:40:15: Collecting system info
2020-05-14 19:40:20: Collecting network output
2020-05-14 19:40:20: Collecting k3s info
2020-05-14 19:40:29: Collecting k3s logs
2020-05-14 19:40:35: Collecting Rancher logs
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
2020-05-14 19:40:41: Collecting k3s directory state
2020-05-14 19:40:41: Collecting k3s certificates
find: ‘/var/lib/rancher/k3s/server/tls’: No such file or directory
2020-05-14 19:40:41: Collecting system logs from /var/log
2020-05-14 19:40:41: Collecting system logs from journald
2020-05-14 19:40:41: Created /tmp/rpi-cluster-c-2020-05-14_19_40_41.tar.gz
2020-05-14 19:40:41: Removing /tmp/tmp.lH6q4J6lDQ
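
The agent runs above fail on `/var/lib/rancher/k3s/server/tls`, which suggests the server data directory is absent on agents. A sketch of role detection based on that observation follows. The function name is hypothetical, and the directory check is an assumption drawn from this output rather than a documented contract.

```shell
# Print "server" when the k3s server data directory exists, else "agent".
detect_k3s_role() {
  base_dir="${1:-/var/lib/rancher/k3s}"
  if [ -d "$base_dir/server" ]; then
    echo server
  else
    echo agent
  fi
}
```

The collector could then skip the kubectl and server certificate steps whenever the result is `agent`.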

Add weave CLI output to log collector

When the weave container is present, collect additional information with exec

./weave --local report
./weave --local status peers
./weave --local status connections
./weave --local status ipam

Add safeguards to prevent deleting leftover PVs as part of extended-cleanup-rancher2.sh ++

We are still seeing the cleanup script get stuck trying to rm container volumes that contain NFS mount points which are still mounted.

/var/lib/kubelet/pods/06dd64f1-bb2a-411d-9192-0aab6e7cbc73/volumes/kubernetes.io~nfs/nfs-mount-prd/prod/rancher
du: cannot access ‘./06dd64f1-bb2a-411d-9192-0aab6e7cbc73/volumes/kubernetes.io~nfs/nfs-mount-prd-binarystore/prod/rancher/artifactory-ha/data/filestore/17/17c6c483e99a0a604e27e87f5631989f2d5bf2c5’: No such file or directory

The customer is using an external NFS client provisioner (quay.io/external_storage/nfs-client-provisioner:v3.1.0-k8s1.11)

Workaround is to remove /var/lib/kubelet/* from the CLEANUP_DIRS variable in the script, which of course leaves behind directories we may wish to remove
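
A possible safeguard is to check for active mounts beneath a directory before removing it. The sketch below reads the mount table (by default /proc/mounts) and reports whether any mount point lies at or below the given path. The function name is hypothetical, and paths containing spaces are not handled because /proc/mounts escapes them.

```shell
# Return 0 if any mount point in mounts_file is dir itself or below it.
has_mounts_under() {
  dir="${1%/}"
  mounts_file="${2:-/proc/mounts}"
  # The second field of each line is the mount point.
  while read -r _dev mnt _rest; do
    case "$mnt" in
      "$dir"|"$dir"/*) return 0 ;;
    esac
  done < "$mounts_file"
  return 1
}
```

The cleanup loop could skip, or try to unmount first, any directory for which `has_mounts_under` succeeds instead of running rm on it.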

Systems summary script sometimes missing Role, OS, and Docker Version for local cluster

Sometimes the systems summary report script does not report the Role, OS, and Docker Version for the local cluster. Example run on a k3s local cluster:

Cluster: local (local)
Node Id         Address                  Role     CPU   RAM         OS       Docker Version   Created
machine-ktglc   10.1.1.55,ip-10-1-1-55   <none>   2     7890600Ki   <none>   <none>           2021-09-13T22:40:07Z
machine-wzj67   10.1.1.43,ip-10-1-1-43   <none>   2     7890608Ki   <none>   <none>           2021-09-13T22:40:08Z

Probably due to the cluster being k3s or containerd-based.

Etcd restore script appears to fail on RKE1

I'm trying to perform an etcd snapshot restore on a cluster that was provisioned with the RKE CLI and has an nginx pod deployed to represent a user workload. The restore script appears to encounter an error and does not start a restored etcd container.

during restore script:

sed: unrecognized option '--name=etcd'
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

after script finishes:

ubuntu@ip-10-0-0-220:~$ docker logs etcd
Error: No such container: etcd

RKE CLI version: 1.3.14
Kubernetes version: 1.20.15

cluster.yml:

nodes:
    - address: X.X.X.X
      user: ubuntu
      role:
        - controlplane
        - etcd
        - worker
kubernetes_version: v1.20.15-rancher1-4
services:
  kube-api:
    secrets-encryption-config:
      enabled: true

I have attached full log output. Let me know if you need more info:
etcd-restore.log

logs for k8s components dont get pulled down for rke2 systems

We need to set the env var for the rke2 binary path, as well as the crictl config path, in order for crictl to work. I believe we assume these are set before running the script, which isn't always the case.

It looks like the right env vars might be set, but we're still not getting k8s component logs.
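
For the first hypothesis only, a sketch of setting the environment inside the script rather than assuming it. The paths are the default RKE2 locations for the bundled binaries and the crictl config; the function name and the RKE2_BIN_DIR override are hypothetical. This does not explain the follow-up observation that logs are still missing when the variables appear to be set.

```shell
# Make the RKE2-bundled crictl usable: add the RKE2 bin directory to PATH
# (once) and point crictl at the RKE2 config unless one is already set.
setup_rke2_crictl_env() {
  rke2_bin="${RKE2_BIN_DIR:-/var/lib/rancher/rke2/bin}"
  case ":$PATH:" in
    *":$rke2_bin:"*) ;;
    *) PATH="$PATH:$rke2_bin" ;;
  esac
  export PATH
  export CRI_CONFIG_FILE="${CRI_CONFIG_FILE:-/var/lib/rancher/rke2/agent/etc/crictl.yaml}"
}
```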
