Created on 04-11-2024 09:19 PM - edited 04-11-2024 10:07 PM
I ran into a failure at the Initialize embedded Vault step while installing Data Services.
Once it happened, it happened every time, even if I restarted the installation from the beginning.
I tried the installation on the system below:
- Red Hat Enterprise Linux release 8.4 (Ootpa)
- Cloudera Manager 7.11.3 (#50275000 built by jenkins on 20240213-1404 git: 14e82e253ab970bfd576e4f80d297769a527df18)
- 1.5.2-b886-ecs-1.5.2-b886.p0.46792599 / 1.5.3-b297-ecs-1.5.3-b297.p0.50802651 (I tried both)
stdout
Fri Apr 12 11:36:52 KST 2024
Running on: cdppvc1.hostname.com (192.168.10.10)
JAVA_HOME=/usr/lib/jvm/java-openjdk
using /usr/lib/jvm/java-openjdk as JAVA_HOME
namespace/vault-system created
helmchart.helm.cattle.io/vault created
certificatesigningrequest.certificates.k8s.io/vault-csr created
certificatesigningrequest.certificates.k8s.io/vault-csr approved
secret/vault-server-tls created
secret/ingress-cert created
helmchart.helm.cattle.io/vault unchanged
Wait 30 seconds for startup
...
Timed out waiting for vault to come up
stderr
++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 600 -gt 600 ']'
++ echo ...
++ sleep 10
++ time_elapsed=610
++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 610 -gt 600 ']'
++ echo 'Timed out waiting for vault to come up'
++ exit 1
describe pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 108s default-scheduler Successfully assigned vault-system/vault-0 to cdppvc2.hostname.com
Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with attachmentID csi-b57965889e8c6c2de7ffd7d045d52175b3415fa69c5e09d1cadc9c7ac1e5a467
Created 04-22-2024 10:01 AM
Hello @Hae
Apologies for the delay, as I was unavailable for some time.
Let's check the volume logs on the cdppvc2 node at the location below:
# /var/log/instances/pvc-33f9624d-4d90-48fa-8469-02a104df1d10.log
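If it is easier, you can tail that log and filter for errors directly; a quick sketch:
# show the most recent entries for this volume
tail -n 200 /var/log/instances/pvc-33f9624d-4d90-48fa-8469-02a104df1d10.log
# or filter for obvious failures
grep -iE 'error|fail|timeout' /var/log/instances/pvc-33f9624d-4d90-48fa-8469-02a104df1d10.log | tail -n 50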
Created 04-22-2024 09:33 PM
Hello @Hae
Glad to know that the issue is fixed
On my setup, the log file is present at the location below:
# pwd
/var/lib/kubelet/pods/c2fa4324-b324-40c5-97a6-e55bd7fa1a65/volumes/kubernetes.io~csi/pvc-697a0f20-499f-4896-b6e9-a5e87435db9b/mount
[CDP-DS Tue Apr 23 04:32:08 UTC root@pvc-ds-readiness05.novalocal [/var/lib/kubelet/pods/c2fa4324-b324-40c5-97a6-e55bd7fa1a65/volumes/kubernetes.io~csi/pvc-697a0f20-499f-4896-b6e9-a5e87435db9b/mount]
# ls -lrth
total 99M
drwxrws---. 2 root 28536 16K Mar 12 09:08 lost+found
-rw-rw-r--. 1 28536 28536 0 Mar 12 09:08 LOCK
-rw-rw-r--. 1 28536 28536 37 Mar 12 09:08 IDENTITY
-rw-rw-r--. 1 28536 28536 11M Apr 3 10:38 LOG.old.1712143774120001
-rw-rw-r--. 1 28536 28536 482K Apr 4 07:59 LOG.old.1712218718445033
-rw-rw-r--. 1 28536 28536 5.9M Apr 15 06:18 LOG.old.1713163409204237
-rw-rw-r--. 1 28536 28536 40K Apr 15 07:43 LOG.old.1713167051095602
-rw-rw-r--. 1 28536 28536 4.7K Apr 15 07:44 OPTIONS-000017
-rw-rw-r--. 1 28536 28536 2.4M Apr 15 07:44 000018.sst
-rw-rw-r--. 1 28536 28536 559K Apr 16 05:44 LOG.old.1713246769612940
-rw-rw-r--. 1 28536 28536 4.8K Apr 16 05:52 000020.sst
-rw-rw-r--. 1 28536 28536 185 Apr 16 05:52 MANIFEST-000021
-rw-rw-r--. 1 28536 28536 16 Apr 16 05:52 CURRENT
-rw-rw-r--. 1 28536 28536 4.7K Apr 16 05:52 OPTIONS-000024
-rw-rw-r--. 1 28536 28536 2.0K Apr 16 07:20 000022.log
-rw-rw-r--. 1 28536 28536 4.1M Apr 23 04:22 LOG
Created 04-11-2024 11:17 PM
Hello @Hae
Thank you for reaching out
From the error below, it seems that there could be some issue with the volume:
Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with
Can you please check the status of the above volume from the Longhorn UI? Is that volume in good health or bad health?
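If the UI is hard to reach, a rough CLI alternative is to query the Longhorn volume objects directly (a sketch, assuming the default Longhorn CRDs and the standard longhorn-frontend service name, which may differ in your ECS build):
# list Longhorn volume objects and check the state of the failing one
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n longhorn-system get volumes.longhorn.io pvc-33f9624d-4d90-48fa-8469-02a104df1d10 -o yaml
# or temporarily port-forward the Longhorn UI and open http://localhost:8080
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80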
Created 04-11-2024 11:23 PM
Thank you for your answer @upadhyayk04
I can't access the Longhorn UI.
The [Storage UI] link in ECS is not working.
I think Longhorn is not in a good state right now. What should I check?
Created 04-11-2024 11:26 PM
Hello @Hae
You might need to check the status of the Longhorn pods, see why they are failing, and then fix them:
# kubectl -n longhorn-system get pods
Then describe the pods that are failing; that will help us understand why they are failing. For example:
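A quick sketch (replace <failing-pod> with the name of any pod that is not Running; the placeholder is just illustrative):
# events and status of the failing pod
kubectl -n longhorn-system describe pod <failing-pod>
# logs of all containers, including the previously crashed instance
kubectl -n longhorn-system logs <failing-pod> --all-containers --previous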
Created on 04-11-2024 11:29 PM - edited 04-11-2024 11:32 PM
Pod list
[root@cdppvc1 ~]# k get pod -n longhorn-system
NAME READY STATUS RESTARTS AGE
csi-attacher-5f79c59664-gsfc4 1/1 Running 0 96m
csi-attacher-5f79c59664-rppmd 1/1 Running 0 96m
csi-attacher-5f79c59664-spmmt 1/1 Running 1 (34m ago) 96m
csi-provisioner-7f9fff657d-mvmb6 1/1 Running 0 96m
csi-provisioner-7f9fff657d-r76kv 1/1 Running 1 (34m ago) 96m
csi-provisioner-7f9fff657d-wm77w 1/1 Running 0 96m
csi-resizer-7667995d7-fgkbd 1/1 Running 0 97m
csi-resizer-7667995d7-rn5ts 1/1 Running 1 (34m ago) 97m
csi-resizer-7667995d7-zx94l 1/1 Running 0 97m
csi-snapshotter-56954ddc99-b44ds 1/1 Running 0 97m
csi-snapshotter-56954ddc99-fmw8x 1/1 Running 1 (34m ago) 97m
csi-snapshotter-56954ddc99-jkwhv 1/1 Running 0 97m
engine-image-ei-6b4330bf-nnwmm 1/1 Running 0 3h52m
engine-image-ei-6b4330bf-npf9k 1/1 Running 1 (30m ago) 3h52m
instance-manager-12ec73857d1e3aea875a32230969da75 1/1 Running 0 34m
instance-manager-ad30a9ee514d3e836de7c5077cfe5ca6 1/1 Running 0 94m
longhorn-csi-plugin-j5xw4 3/3 Running 0 3h52m
longhorn-csi-plugin-v7bdh 3/3 Running 6 (26m ago) 3h52m
longhorn-driver-deployer-75c7cb9999-v8xgb 1/1 Running 0 96m
longhorn-manager-d495r 1/1 Running 1 (30m ago) 3h52m
longhorn-manager-nvgk7 1/1 Running 0 3h52m
longhorn-ui-64c4bfff54-d6c7n 1/1 Running 0 97m
longhorn-ui-64c4bfff54-vrx4q 1/1 Running 0 97m
describe pod (longhorn-csi-plugin-v7bdh)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 28m (x5 over 30m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Started 28m kubelet Started container longhorn-liveness-probe
Normal Created 28m kubelet Created container node-driver-registrar
Normal Started 28m kubelet Started container node-driver-registrar
Normal Pulled 28m kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/livenessprobe:v2.12.0" already present on machine
Normal Created 28m kubelet Created container longhorn-liveness-probe
Normal Pulled 28m kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/csi-node-driver-registrar:v2.9.2" already present on machine
Warning BackOff 28m (x2 over 28m) kubelet Back-off restarting failed container longhorn-csi-plugin in pod longhorn-csi-plugin-v7bdh_longhorn-system(4fe460af-df96-4006-a631-dcc21bd46a07)
Normal Pulled 28m (x2 over 28m) kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/longhorn-manager:v1.5.4" already present on machine
Normal Created 28m (x2 over 28m) kubelet Created container longhorn-csi-plugin
Normal Started 28m (x2 over 28m) kubelet Started container longhorn-csi-plugin
Warning Unhealthy 27m (x3 over 28m) kubelet Liveness probe failed: Get "http://10.42.0.6:9808/healthz": dial tcp 10.42.0.6:9808: connect: connection refused
Normal Killing 27m kubelet Container longhorn-csi-plugin failed liveness probe, will be restarted
Warning BackOff 27m (x2 over 28m) kubelet Back-off restarting failed container node-driver-registrar in pod longhorn-csi-plugin-v7bdh_longhorn-system(4fe460af-df96-4006-a631-dcc21bd46a07)
logs of the longhorn-csi-plugin pod
[root@cdppvc1 ~]# k logs -f longhorn-csi-plugin-v7bdh -n longhorn-system
Defaulted container "node-driver-registrar" out of: node-driver-registrar, longhorn-liveness-probe, longhorn-csi-plugin
I0412 06:02:45.498503 12176 main.go:135] Version: v2.9.2
I0412 06:02:45.498547 12176 main.go:136] Running node-driver-registrar in mode=
I0412 06:02:45.498553 12176 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
W0412 06:02:55.498699 12176 connection.go:232] Still connecting to unix:///csi/csi.sock
I0412 06:03:00.414873 12176 main.go:164] Calling CSI driver to discover driver name
I0412 06:03:00.417352 12176 main.go:173] CSI driver name: "driver.longhorn.io"
I0412 06:03:00.417373 12176 node_register.go:55] Starting Registration Server at: /registration/driver.longhorn.io-reg.sock
I0412 06:03:00.417530 12176 node_register.go:64] Registration Server started at: /registration/driver.longhorn.io-reg.sock
I0412 06:03:00.417667 12176 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0412 06:03:01.396603 12176 main.go:90] Received GetInfo call: &InfoRequest{}
I0412 06:03:01.402598 12176 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Created 04-12-2024 12:17 AM
Hello,
Thank you for the update
What about pods in other namespaces? Are they running fine, especially in the cdp and kube-system namespaces? Is the DB pod running fine?
You might also need to check the kubelet logs to learn more about the problem.
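On ECS (RKE2-based) nodes, the kubelet typically logs under the rke2 agent directory; a sketch, assuming the default RKE2 paths:
# kubelet log on the node where vault-0 / the volume is scheduled
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log
# the rke2 service logs can also help
journalctl -u rke2-server --since "1 hour ago"
journalctl -u rke2-agent --since "1 hour ago"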
Created 04-12-2024 12:30 AM
All pods look fine:
[root@cdppvc1 ~]# k get ns
NAME STATUS AGE
default Active 4h56m
ecs-webhooks Active 4h55m
kube-node-lease Active 4h56m
kube-public Active 4h56m
kube-system Active 4h56m
local-path-storage Active 4h55m
longhorn-system Active 4h55m
vault-system Active 116s
[root@cdppvc1 ~]# k get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
ecs-webhooks ecs-tolerations-webhook-77d857599d-b8hsh 1/1 Running 0 39m
ecs-webhooks ecs-tolerations-webhook-77d857599d-h6qxk 1/1 Running 0 39m
kube-system etcd-cdppvc1.hadoop.com 1/1 Running 1 4h54m
kube-system helm-install-rke2-ingress-nginx-mk845 0/1 Completed 0 10m
kube-system kube-apiserver-cdppvc1.hadoop.com 1/1 Running 1 4h54m
kube-system kube-controller-manager-cdppvc1.hadoop.com 1/1 Running 3 (89m ago) 4h54m
kube-system kube-proxy-cdppvc1.hadoop.com 1/1 Running 0 86m
kube-system kube-proxy-cdppvc2.hadoop.com 1/1 Running 0 4h53m
kube-system kube-scheduler-cdppvc1.hadoop.com 1/1 Running 1 (90m ago) 4h54m
kube-system rke2-canal-9h5hh 2/2 Running 0 4h53m
kube-system rke2-canal-qk2wg 2/2 Running 2 (90m ago) 4h53m
kube-system rke2-coredns-rke2-coredns-565dfc7d75-djp4t 1/1 Running 0 38m
kube-system rke2-coredns-rke2-coredns-565dfc7d75-gvxcj 1/1 Running 0 153m
kube-system rke2-coredns-rke2-coredns-autoscaler-6c48c95bf9-7ln92 1/1 Running 0 39m
kube-system rke2-ingress-nginx-controller-869fc5f494-xcz6x 1/1 Running 0 39m
kube-system rke2-metrics-server-c9c78bd66-blrwg 1/1 Running 0 156m
kube-system rke2-snapshot-controller-6f7bbb497d-wk5mg 1/1 Running 0 39m
kube-system rke2-snapshot-validation-webhook-65b5675d5c-7fst2 1/1 Running 0 39m
local-path-storage local-path-provisioner-6b8fcdf4f9-fqqnw 1/1 Running 0 155m
longhorn-system csi-attacher-5f79c59664-gsfc4 1/1 Running 0 156m
longhorn-system csi-attacher-5f79c59664-rppmd 1/1 Running 0 156m
longhorn-system csi-attacher-5f79c59664-spmmt 1/1 Running 1 (93m ago) 156m
longhorn-system csi-provisioner-7f9fff657d-mvmb6 1/1 Running 0 156m
longhorn-system csi-provisioner-7f9fff657d-r76kv 1/1 Running 1 (93m ago) 156m
longhorn-system csi-provisioner-7f9fff657d-wm77w 1/1 Running 0 156m
longhorn-system csi-resizer-7667995d7-fgkbd 1/1 Running 0 156m
longhorn-system csi-resizer-7667995d7-rn5ts 1/1 Running 1 (93m ago) 156m
longhorn-system csi-resizer-7667995d7-zx94l 1/1 Running 0 156m
longhorn-system csi-snapshotter-56954ddc99-b44ds 1/1 Running 0 156m
longhorn-system csi-snapshotter-56954ddc99-fmw8x 1/1 Running 1 (93m ago) 156m
longhorn-system csi-snapshotter-56954ddc99-jkwhv 1/1 Running 0 156m
longhorn-system engine-image-ei-6b4330bf-nnwmm 1/1 Running 0 4h52m
longhorn-system engine-image-ei-6b4330bf-npf9k 1/1 Running 1 (90m ago) 4h52m
longhorn-system instance-manager-12ec73857d1e3aea875a32230969da75 1/1 Running 0 38m
longhorn-system instance-manager-ad30a9ee514d3e836de7c5077cfe5ca6 1/1 Running 0 153m
longhorn-system longhorn-csi-plugin-j5xw4 3/3 Running 0 4h51m
longhorn-system longhorn-csi-plugin-v7bdh 3/3 Running 6 (86m ago) 4h51m
longhorn-system longhorn-driver-deployer-75c7cb9999-v8xgb 1/1 Running 0 156m
longhorn-system longhorn-manager-d495r 1/1 Running 1 (90m ago) 4h52m
longhorn-system longhorn-manager-nvgk7 1/1 Running 0 4h52m
longhorn-system longhorn-ui-64c4bfff54-d6c7n 1/1 Running 0 156m
longhorn-system longhorn-ui-64c4bfff54-vrx4q 1/1 Running 0 156m
Created 04-12-2024 10:58 AM
Hello,
Based on the error in the stack trace and the above output, it seems that the Vault pod itself was deleted somehow. Do you see the Vault pods present there?
++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 610 -gt 600 ']'
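For example, something like this (a quick sketch):
# check whether vault-0 exists and which node it is scheduled on
kubectl -n vault-system get pods -o wide
kubectl -n vault-system describe pod vault-0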
Created on 04-12-2024 06:30 PM - edited 04-13-2024 01:39 AM
The vault-0 pod keeps cycling between Terminating and ContainerCreating status, over and over,
because the volume attach failed:
Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with
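The stuck attachment can also be confirmed from the CLI (a rough sketch; VolumeAttachment is a standard cluster-scoped Kubernetes object, and the name placeholder below is hypothetical):
# show the Kubernetes VolumeAttachment objects referencing the failing PV
kubectl get volumeattachments | grep pvc-33f9624d-4d90-48fa-8469-02a104df1d10
kubectl describe volumeattachment <attachment-name-from-previous-output>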
Created on 04-13-2024 05:22 PM - edited 04-13-2024 05:23 PM
logs -f -n longhorn-system longhorn-csi-plugin-cxglq
Defaulted container "node-driver-registrar" out of: node-driver-registrar, longhorn-liveness-probe, longhorn-csi-plugin
I0413 09:50:20.091344 290593 main.go:166] Version: v2.5.0
I0413 09:50:20.091369 290593 main.go:167] Running node-driver-registrar in mode=registration
I0413 09:50:20.092527 290593 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0413 09:50:21.093286 290593 main.go:198] Calling CSI driver to discover driver name
I0413 09:50:21.094471 290593 main.go:208] CSI driver name: "driver.longhorn.io"
I0413 09:50:21.094497 290593 node_register.go:53] Starting Registration Server at: /registration/driver.longhorn.io-reg.sock
I0413 09:50:21.094656 290593 node_register.go:62] Registration Server started at: /registration/driver.longhorn.io-reg.sock
I0413 09:50:21.094779 290593 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0413 09:50:21.466617 290593 main.go:102] Received GetInfo call: &InfoRequest{}
I0413 09:50:21.466820 290593 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/driver.longhorn.io/registration"
I0413 09:50:23.205994 290593 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}