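Rancher Troubleshooting Notes

How to tell that Rancher is not starting properly: list the pods in the cattle-system namespace and look for any that are not ready.
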
rkeuser@iiidevops4:~$ kubectl get pod -n cattle-system
NAME                                    READY   STATUS             RESTARTS   AGE
cattle-cluster-agent-6bf6f8fcc4-sznpp   1/1     Running            0          18m
cattle-node-agent-79nrh                 1/1     Running            23         67d
cattle-node-agent-ch6pn                 1/1     Running            23         67d
cattle-node-agent-jr5bq                 1/1     Running            7          7d20h
cattle-node-agent-k2fcs                 1/1     Running            26         67d
rancher-98d8d5cf5-hbjjv                 1/1     Running            1          25m
rancher-98d8d5cf5-nhlwz                 0/1     CrashLoopBackOff   8          25m
rancher-98d8d5cf5-zjbzs                 0/1     Running            0          105s
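
With many pods, the unhealthy ones are easy to miss. The following is a minimal sketch, not part of the original notes, that filters the listing down to pods that are not fully ready or not in the Running state. The function name not_ready_pods is hypothetical, and it assumes the default column layout of kubectl get pod shown above.

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and prints
# "NAME STATUS" for every pod that is not fully ready or not Running.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, for example 0/1
    if (ready[1] != ready[2] || $3 != "Running") print $1, $3
  }'
}

# Usage against a live cluster:
#   kubectl get pod -n cattle-system | not_ready_pods
```

Run against the listing above, this would report rancher-98d8d5cf5-nhlwz (CrashLoopBackOff) and rancher-98d8d5cf5-zjbzs (Running but 0/1 ready).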

Find out which rancher pod is the leader:

$ kubectl describe configMap cattle-controllers -n kube-system
Name:         cattle-controllers
Namespace:    kube-system
Labels:       <none>
Annotations:  control-plane.alpha.kubernetes.io/leader:
                {"holderIdentity":"rancher-98d8d5cf5-hbjjv","leaseDurationSeconds":45,"acquireTime":"2021-09-08T06:40:25Z","renewTime":"2021-09-08T07:02:5...

Data
====
Events:  <none>

The current leader is rancher-98d8d5cf5-hbjjv, so check that pod's logs:
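
Reading the leader out of the describe output by eye works, but the annotation can also be extracted directly. This is a sketch under assumptions: the helper name leader_from_annotation is hypothetical, and it assumes the annotation value is the single-line JSON shown above.

```shell
# Hypothetical helper: reads the leader annotation JSON on stdin and prints
# the value of holderIdentity.
leader_from_annotation() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p'
}

# Usage against a live cluster (dots in the annotation key are escaped
# for kubectl's JSONPath syntax):
#   kubectl get configmap cattle-controllers -n kube-system \
#     -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' \
#     | leader_from_annotation
```

The printed pod name can then be passed to kubectl logs with -n cattle-system.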

$ kubectl logs rancher-98d8d5cf5-hbjjv -n cattle-system
2021/09/08 06:38:27 [INFO] Rancher version v2.4.15 (cdb64d640) is starting
2021/09/08 06:38:27 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:auto Embedded:false HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features:}
2021/09/08 06:38:27 [INFO] Listening on /tmp/log.sock
I0908 06:38:27.719747       6 http.go:122] HTTP2 has been explicitly disabled
:
2021/09/08 06:56:18 [ERROR] AppController p-gn54t/test-20210831-master-sq [helm-controller] failed with : Get "https://10.43.0.1:443/apis/project.cattle.io/v3/namespaces/p-gn54t/apprevisions?labelSelector=io.cattle.field%!F(MISSING)appId%!D(MISSING)test-20210831-master-sq&timeout=30s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/09/08 06:57:04 [ERROR] PipelineExecutionController p-gn54t/p-qp9qq-1 [pipeline-execution-controller] failed with : pipeline.project.cattle.io "p-gn54t/p-qp9qq" not found
2021/09/08 07:01:20 [ERROR] PipelineExecutionController p-gn54t/p-qp9qq-1 [pipeline-execution-controller] failed with : pipeline.project.cattle.io "p-gn54t/p-qp9qq" not found
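
The log is long, so a per-controller count of the [ERROR] lines helps show which controller is failing most often. This is a rough sketch, not something from the original notes: error_summary is a hypothetical name, and it assumes the "date time [LEVEL] Controller ..." layout seen in the lines above.

```shell
# Hypothetical helper: reads Rancher logs on stdin and prints
# "<count> <controller>" for every [ERROR] line, most frequent first.
# Assumes the layout: date time [LEVEL] ControllerName ...
error_summary() {
  awk '$3 == "[ERROR]" { count[$4]++ }
       END { for (name in count) print count[name], name }' | sort -rn
}

# Usage against a live cluster:
#   kubectl logs rancher-98d8d5cf5-hbjjv -n cattle-system | error_summary
```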

Suppose the jenkins pod shown below has gone missing (for example, it was deleted by accident). The pipeline then cannot start.

~$ kubectl get namespace | grep pipeline
cattle-pipeline               Active   66d
p-gn54t-pipeline              Active   66d
~$ kubectl get pod -n p-gn54t-pipeline
NAME                               READY   STATUS    RESTARTS   AGE
docker-registry-57fbddc6cc-drt29   1/1     Running   4          66d
jenkins-75cf8d9966-m2vc8           1/1     Running   0          168m
minio-7b7866c65f-7hpl5             1/1     Running   0          167m

Deleting the project's pipeline namespace (e.g. p-gn54t-pipeline) is enough; it is recreated automatically.
Reference: https://github.com/rancher/rancher/issues/18779
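
The fix above can be sketched as follows. The helper project_pipeline_namespaces is hypothetical, and it assumes project pipeline namespaces follow the p-<id>-pipeline pattern seen in the listing, so that the shared cattle-pipeline namespace is left alone.

```shell
# Hypothetical helper: reads `kubectl get namespace` output on stdin and
# prints only per-project pipeline namespaces (p-<id>-pipeline).
project_pipeline_namespaces() {
  awk '$1 ~ /^p-.*-pipeline$/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get namespace | project_pipeline_namespaces
#   kubectl delete namespace p-gn54t-pipeline
```

Listing the candidates first makes it harder to delete the wrong namespace by typo.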

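Environment: Rancher installed with RKE / helm.
After helm uninstall followed by helm install, Rancher still failed to start properly.
After consulting one article on cleanly removing Rancher and another explaining the CRDs in Rancher, the following steps resolved the problem:
1. Delete the CRD dynamicschemas.management.cattle.io
2. Delete the cert-manager and cattle-system namespaces
3. Reinstall Rancher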

Reference: https://gist.github.com/janeczku/d3b9eed3b1dee7863b66fba3367a1bd4
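
Steps 1 and 2 above are destructive, so one cautious way to script them is with a dry-run wrapper that prints each command unless told otherwise. This is only a sketch: run, cleanup_rancher, and the DRY_RUN variable are hypothetical names, and step 3 is left as a comment because the chart repository, release name, and hostname depend on the installation.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN is 1 (the default),
# executes it when DRY_RUN is set to 0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1 and 2 from the list above.
cleanup_rancher() {
  run kubectl delete crd dynamicschemas.management.cattle.io
  run kubectl delete namespace cert-manager
  run kubectl delete namespace cattle-system
}

# Step 3 (reinstall) is site-specific: reinstall cert-manager and Rancher
# with the same helm commands used for the original installation.

# Usage:
#   cleanup_rancher              # preview only
#   DRY_RUN=0 cleanup_rancher    # actually delete
```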

To change the Rancher server URL, open the Rancher advanced settings page:

https://<old_rancher_hostname>/g/settings/advanced

Find server-url and edit it.
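
If the web UI is unreachable, the same setting can likely be changed from the command line, since Rancher stores its settings as settings.management.cattle.io resources with a value field. This is an alternative to the UI method described above, not what the original notes did. The helper name server_url_patch and the hostname rancher.example.com are hypothetical; verify against your own cluster before relying on it.

```shell
# Hypothetical helper: builds the merge-patch body for the server-url setting.
# The URL is inserted as-is, so it must not contain double quotes.
server_url_patch() {
  printf '{"value":"%s"}' "$1"
}

# Usage against a live cluster (rancher.example.com is a placeholder):
#   kubectl get settings.management.cattle.io server-url
#   kubectl patch settings.management.cattle.io server-url --type merge \
#     -p "$(server_url_patch https://rancher.example.com)"
```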
