Support Questions

GrazittiAPI · ‎04-11-2024

The "Initialize embedded Vault" step failed while I was installing Data Services.

Once it has failed, it fails every time, even if I restart the installation from the beginning.

I tried installing on the following system:

- Red Hat Enterprise Linux release 8.4 (Ootpa)

- Cloudera Manager 7.11.3 (#50275000 built by jenkins on 20240213-1404 git: 14e82e253ab970bfd576e4f80d297769a527df18)

- 1.5.2-b886-ecs-1.5.2-b886.p0.46792599 / 1.5.3-b297-ecs-1.5.3-b297.p0.50802651 (I tried both)

stdout

Fri Apr 12 11:36:52 KST 2024
Running on: cdppvc1.hostname.com (192.168.10.10)
JAVA_HOME=/usr/lib/jvm/java-openjdk
using /usr/lib/jvm/java-openjdk as JAVA_HOME
namespace/vault-system created
helmchart.helm.cattle.io/vault created
certificatesigningrequest.certificates.k8s.io/vault-csr created
certificatesigningrequest.certificates.k8s.io/vault-csr approved
secret/vault-server-tls created
secret/ingress-cert created
helmchart.helm.cattle.io/vault unchanged
Wait 30 seconds for startup
...
Timed out waiting for vault to come up

stderr

++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 600 -gt 600 ']'
++ echo ...
++ sleep 10
++ time_elapsed=610
++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 610 -gt 600 ']'
++ echo 'Timed out waiting for vault to come up'
++ exit 1
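
The trace above is a poll loop: the installer retries the kubectl exec call every 10 seconds and gives up once 600 seconds have elapsed. A minimal sketch of that pattern, for reproducing the wait by hand, could look like this. wait_for and vault_pod_running are hypothetical helpers, not part of the Cloudera installer, and checking the pod phase is only an assumption about what a useful readiness test might be.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every INTERVAL seconds until it
# succeeds, and returns 1 once TIMEOUT seconds have elapsed.
wait_for() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Assumed readiness check: succeeds once the vault-0 pod reports phase Running.
vault_pod_running() {
  [ "$(kubectl get pod vault-0 -n vault-system -o jsonpath='{.status.phase}')" = "Running" ]
}

# Example (needs cluster access, so it is left commented out):
# wait_for 600 10 vault_pod_running
```

If the pod never leaves ContainerCreating, a loop like this can only time out, so the pod events are the place to look next.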

describe pod

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 108s default-scheduler Successfully assigned vault-system/vault-0 to cdppvc2.hostname.com
Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with attachmentID csi-b57965889e8c6c2de7ffd7d045d52175b3415fa69c5e09d1cadc9c7ac1e5a467

upadhyayk04 · ‎04-22-2024

Hello @Hae

Apologies for the delay; I was unavailable for some time.

Let's check the volume logs on the cdppvc2 node at the following location:

# /var/log/instances/pvc-33f9624d-4d90-48fa-8469-02a104df1d10.log

upadhyayk04 · ‎04-22-2024

Hello @Hae

Glad to know that the issue is fixed.

On my system, the log files are in the following location:

# pwd
/var/lib/kubelet/pods/c2fa4324-b324-40c5-97a6-e55bd7fa1a65/volumes/kubernetes.io~csi/pvc-697a0f20-499f-4896-b6e9-a5e87435db9b/mount
[CDP-DS Tue Apr 23 04:32:08 UTC [email protected] [/var/lib/kubelet/pods/c2fa4324-b324-40c5-97a6-e55bd7fa1a65/volumes/kubernetes.io~csi/pvc-697a0f20-499f-4896-b6e9-a5e87435db9b/mount]
# ls -lrth
total 99M
drwxrws---. 2 root  28536  16K Mar 12 09:08 lost+found
-rw-rw-r--. 1 28536 28536    0 Mar 12 09:08 LOCK
-rw-rw-r--. 1 28536 28536   37 Mar 12 09:08 IDENTITY
-rw-rw-r--. 1 28536 28536  11M Apr  3 10:38 LOG.old.1712143774120001
-rw-rw-r--. 1 28536 28536 482K Apr  4 07:59 LOG.old.1712218718445033
-rw-rw-r--. 1 28536 28536 5.9M Apr 15 06:18 LOG.old.1713163409204237
-rw-rw-r--. 1 28536 28536  40K Apr 15 07:43 LOG.old.1713167051095602
-rw-rw-r--. 1 28536 28536 4.7K Apr 15 07:44 OPTIONS-000017
-rw-rw-r--. 1 28536 28536 2.4M Apr 15 07:44 000018.sst
-rw-rw-r--. 1 28536 28536 559K Apr 16 05:44 LOG.old.1713246769612940
-rw-rw-r--. 1 28536 28536 4.8K Apr 16 05:52 000020.sst
-rw-rw-r--. 1 28536 28536  185 Apr 16 05:52 MANIFEST-000021
-rw-rw-r--. 1 28536 28536   16 Apr 16 05:52 CURRENT
-rw-rw-r--. 1 28536 28536 4.7K Apr 16 05:52 OPTIONS-000024
-rw-rw-r--. 1 28536 28536 2.0K Apr 16 07:20 000022.log
-rw-rw-r--. 1 28536 28536 4.1M Apr 23 04:22 LOG

upadhyayk04 · ‎04-11-2024

Hello @Hae

Thank you for reaching out

From the error below, it seems there could be an issue with the volume:

Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with

Can you please check the status of the above volume from Longhorn UI? Is that volume in good health or bad health?
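
If the Longhorn UI cannot be reached, the same health information can usually be read from the Longhorn custom resources with kubectl. The sketch below assumes the volumes.longhorn.io resource in the longhorn-system namespace and its status.state and status.robustness fields. unhealthy_volumes is a hypothetical filter over that output.

```shell
#!/usr/bin/env bash
# unhealthy_volumes: hypothetical filter. Reads "NAME STATE ROBUSTNESS" lines
# on stdin and prints every volume that is not both attached and healthy.
unhealthy_volumes() {
  awk 'NF >= 3 && ($2 != "attached" || $3 != "healthy") {
    print $1 ": state=" $2 ", robustness=" $3
  }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
#   -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness \
#   | unhealthy_volumes
```

A detached volume is normal for a workload that is not running, so the entries worth a closer look are degraded or faulted volumes, or a volume that stays outside the attached state while its pod is already scheduled.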

Hae · ‎04-11-2024

Thank you for your answer @upadhyayk04

I can't access the Longhorn UI.

The [Storage UI] link in ECS is not working.

I think Longhorn is not in a good state right now. What should I check?

upadhyayk04 · ‎04-11-2024

Hello @Hae

You will need to check the status of the Longhorn pods, find out why they are failing, and fix them:

# kubectl -n longhorn-system get pods

Then describe the failing pods; that will help us understand why they are failing.
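
When every pod reports Running, the restart counts are often the only hint. A small sketch that narrows the pod list down to candidates worth describing; suspect_pods is a hypothetical helper that reads the plain kubectl get pods output, header line included.

```shell
#!/usr/bin/env bash
# suspect_pods: hypothetical filter over "kubectl get pods" output.
# Prints pods that are not Running, are not fully ready, or have restarted.
# Completed pods (finished jobs) are skipped.
suspect_pods() {
  awk 'NR > 1 {
    if ($3 == "Completed") next
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2] || ($4 + 0) > 0)
      print $1, $2, $3, "restarts=" $4
  }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl -n longhorn-system get pods | suspect_pods
# kubectl -n longhorn-system describe pod <pod-name-from-the-list>
```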

Hae · ‎04-11-2024

@upadhyayk04

Pod list

[root@cdppvc1 ~]# k get pod -n longhorn-system
NAME                                                READY   STATUS    RESTARTS      AGE
csi-attacher-5f79c59664-gsfc4                       1/1     Running   0             96m
csi-attacher-5f79c59664-rppmd                       1/1     Running   0             96m
csi-attacher-5f79c59664-spmmt                       1/1     Running   1 (34m ago)   96m
csi-provisioner-7f9fff657d-mvmb6                    1/1     Running   0             96m
csi-provisioner-7f9fff657d-r76kv                    1/1     Running   1 (34m ago)   96m
csi-provisioner-7f9fff657d-wm77w                    1/1     Running   0             96m
csi-resizer-7667995d7-fgkbd                         1/1     Running   0             97m
csi-resizer-7667995d7-rn5ts                         1/1     Running   1 (34m ago)   97m
csi-resizer-7667995d7-zx94l                         1/1     Running   0             97m
csi-snapshotter-56954ddc99-b44ds                    1/1     Running   0             97m
csi-snapshotter-56954ddc99-fmw8x                    1/1     Running   1 (34m ago)   97m
csi-snapshotter-56954ddc99-jkwhv                    1/1     Running   0             97m
engine-image-ei-6b4330bf-nnwmm                      1/1     Running   0             3h52m
engine-image-ei-6b4330bf-npf9k                      1/1     Running   1 (30m ago)   3h52m
instance-manager-12ec73857d1e3aea875a32230969da75   1/1     Running   0             34m
instance-manager-ad30a9ee514d3e836de7c5077cfe5ca6   1/1     Running   0             94m
longhorn-csi-plugin-j5xw4                           3/3     Running   0             3h52m
longhorn-csi-plugin-v7bdh                           3/3     Running   6 (26m ago)   3h52m
longhorn-driver-deployer-75c7cb9999-v8xgb           1/1     Running   0             96m
longhorn-manager-d495r                              1/1     Running   1 (30m ago)   3h52m
longhorn-manager-nvgk7                              1/1     Running   0             3h52m
longhorn-ui-64c4bfff54-d6c7n                        1/1     Running   0             97m
longhorn-ui-64c4bfff54-vrx4q                        1/1     Running   0             97m

describe pod csi-plugin

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 28m (x5 over 30m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Started 28m kubelet Started container longhorn-liveness-probe
Normal Created 28m kubelet Created container node-driver-registrar
Normal Started 28m kubelet Started container node-driver-registrar
Normal Pulled 28m kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/livenessprobe:v2.12.0" already present on machine
Normal Created 28m kubelet Created container longhorn-liveness-probe
Normal Pulled 28m kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/csi-node-driver-registrar:v2.9.2" already present on machine
Warning BackOff 28m (x2 over 28m) kubelet Back-off restarting failed container longhorn-csi-plugin in pod longhorn-csi-plugin-v7bdh_longhorn-system(4fe460af-df96-4006-a631-dcc21bd46a07)
Normal Pulled 28m (x2 over 28m) kubelet Container image "registry.ecs.internal/cloudera_thirdparty/longhornio/longhorn-manager:v1.5.4" already present on machine
Normal Created 28m (x2 over 28m) kubelet Created container longhorn-csi-plugin
Normal Started 28m (x2 over 28m) kubelet Started container longhorn-csi-plugin
Warning Unhealthy 27m (x3 over 28m) kubelet Liveness probe failed: Get "http://10.42.0.6:9808/healthz": dial tcp 10.42.0.6:9808: connect: connection refused
Normal Killing 27m kubelet Container longhorn-csi-plugin failed liveness probe, will be restarted
Warning BackOff 27m (x2 over 28m) kubelet Back-off restarting failed container node-driver-registrar in pod longhorn-csi-plugin-v7bdh_longhorn-system(4fe460af-df96-4006-a631-dcc21bd46a07)

log of csi-plugin pod

[root@cdppvc1 ~]# k logs -f longhorn-csi-plugin-v7bdh -n longhorn-system
Defaulted container "node-driver-registrar" out of: node-driver-registrar, longhorn-liveness-probe, longhorn-csi-plugin
I0412 06:02:45.498503 12176 main.go:135] Version: v2.9.2
I0412 06:02:45.498547 12176 main.go:136] Running node-driver-registrar in mode=
I0412 06:02:45.498553 12176 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
W0412 06:02:55.498699 12176 connection.go:232] Still connecting to unix:///csi/csi.sock
I0412 06:03:00.414873 12176 main.go:164] Calling CSI driver to discover driver name
I0412 06:03:00.417352 12176 main.go:173] CSI driver name: "driver.longhorn.io"
I0412 06:03:00.417373 12176 node_register.go:55] Starting Registration Server at: /registration/driver.longhorn.io-reg.sock
I0412 06:03:00.417530 12176 node_register.go:64] Registration Server started at: /registration/driver.longhorn.io-reg.sock
I0412 06:03:00.417667 12176 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0412 06:03:01.396603 12176 main.go:90] Received GetInfo call: &InfoRequest{}
I0412 06:03:01.402598 12176 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
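
Note that the log above comes from node-driver-registrar, the container kubectl picks by default, while the events show that the longhorn-csi-plugin container failed its liveness probe and was restarted. A sketch of pulling the logs of the restarted containers instead follows. previous_log_commands is a hypothetical helper that only prints the kubectl commands, so they can be reviewed before running them.

```shell
#!/usr/bin/env bash
# previous_log_commands POD NAMESPACE CONTAINER...
# Hypothetical helper: prints one "kubectl logs --previous" command per
# container. --previous shows the log of the last terminated instance.
previous_log_commands() {
  local pod=$1 ns=$2 c
  shift 2
  for c in "$@"; do
    printf 'kubectl logs %s -n %s -c %s --previous\n' "$pod" "$ns" "$c"
  done
}

previous_log_commands longhorn-csi-plugin-v7bdh longhorn-system \
  longhorn-csi-plugin node-driver-registrar
```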

upadhyayk04 · ‎04-12-2024

Hello,

Thank you for the update

What about the pods in other namespaces? Are they running fine, especially in the cdp and kube-system namespaces? Is the DB pod running fine?

You might need to check kubelet logs to know more about the problem
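
For the kubelet logs, here is a sketch under two assumptions: that ECS runs on RKE2 (the rke2-* pod names seen in this thread suggest so), and that the kubelet writes to the RKE2 default location /var/lib/rancher/rke2/agent/logs/kubelet.log. The path may differ on your hosts. volume_log_lines is a hypothetical helper.

```shell
#!/usr/bin/env bash
# volume_log_lines LOG_FILE VOLUME_ID
# Hypothetical helper: prints the last 20 lines of LOG_FILE that mention
# VOLUME_ID (matched as a fixed string, not a regular expression).
volume_log_lines() {
  local log_file=$1 volume=$2
  grep -F -- "$volume" "$log_file" | tail -n 20
}

# Example, run on the node the pod was scheduled to (the path is an assumption):
# volume_log_lines /var/lib/rancher/rke2/agent/logs/kubelet.log \
#   pvc-33f9624d-4d90-48fa-8469-02a104df1d10
```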

Hae · ‎04-12-2024

@upadhyayk04

It looks like all pods are fine.

[root@cdppvc1 ~]# k get ns
NAME                 STATUS   AGE
default              Active   4h56m
ecs-webhooks         Active   4h55m
kube-node-lease      Active   4h56m
kube-public          Active   4h56m
kube-system          Active   4h56m
local-path-storage   Active   4h55m
longhorn-system      Active   4h55m
vault-system         Active   116s

[root@cdppvc1 ~]# k get pod -A
NAMESPACE            NAME                                                    READY   STATUS      RESTARTS      AGE
ecs-webhooks         ecs-tolerations-webhook-77d857599d-b8hsh                1/1     Running     0             39m
ecs-webhooks         ecs-tolerations-webhook-77d857599d-h6qxk                1/1     Running     0             39m
kube-system          etcd-cdppvc1.hadoop.com                                 1/1     Running     1             4h54m
kube-system          helm-install-rke2-ingress-nginx-mk845                   0/1     Completed   0             10m
kube-system          kube-apiserver-cdppvc1.hadoop.com                       1/1     Running     1             4h54m
kube-system          kube-controller-manager-cdppvc1.hadoop.com              1/1     Running     3 (89m ago)   4h54m
kube-system          kube-proxy-cdppvc1.hadoop.com                           1/1     Running     0             86m
kube-system          kube-proxy-cdppvc2.hadoop.com                           1/1     Running     0             4h53m
kube-system          kube-scheduler-cdppvc1.hadoop.com                       1/1     Running     1 (90m ago)   4h54m
kube-system          rke2-canal-9h5hh                                        2/2     Running     0             4h53m
kube-system          rke2-canal-qk2wg                                        2/2     Running     2 (90m ago)   4h53m
kube-system          rke2-coredns-rke2-coredns-565dfc7d75-djp4t              1/1     Running     0             38m
kube-system          rke2-coredns-rke2-coredns-565dfc7d75-gvxcj              1/1     Running     0             153m
kube-system          rke2-coredns-rke2-coredns-autoscaler-6c48c95bf9-7ln92   1/1     Running     0             39m
kube-system          rke2-ingress-nginx-controller-869fc5f494-xcz6x          1/1     Running     0             39m
kube-system          rke2-metrics-server-c9c78bd66-blrwg                     1/1     Running     0             156m
kube-system          rke2-snapshot-controller-6f7bbb497d-wk5mg               1/1     Running     0             39m
kube-system          rke2-snapshot-validation-webhook-65b5675d5c-7fst2       1/1     Running     0             39m
local-path-storage   local-path-provisioner-6b8fcdf4f9-fqqnw                 1/1     Running     0             155m
longhorn-system      csi-attacher-5f79c59664-gsfc4                           1/1     Running     0             156m
longhorn-system      csi-attacher-5f79c59664-rppmd                           1/1     Running     0             156m
longhorn-system      csi-attacher-5f79c59664-spmmt                           1/1     Running     1 (93m ago)   156m
longhorn-system      csi-provisioner-7f9fff657d-mvmb6                        1/1     Running     0             156m
longhorn-system      csi-provisioner-7f9fff657d-r76kv                        1/1     Running     1 (93m ago)   156m
longhorn-system      csi-provisioner-7f9fff657d-wm77w                        1/1     Running     0             156m
longhorn-system      csi-resizer-7667995d7-fgkbd                             1/1     Running     0             156m
longhorn-system      csi-resizer-7667995d7-rn5ts                             1/1     Running     1 (93m ago)   156m
longhorn-system      csi-resizer-7667995d7-zx94l                             1/1     Running     0             156m
longhorn-system      csi-snapshotter-56954ddc99-b44ds                        1/1     Running     0             156m
longhorn-system      csi-snapshotter-56954ddc99-fmw8x                        1/1     Running     1 (93m ago)   156m
longhorn-system      csi-snapshotter-56954ddc99-jkwhv                        1/1     Running     0             156m
longhorn-system      engine-image-ei-6b4330bf-nnwmm                          1/1     Running     0             4h52m
longhorn-system      engine-image-ei-6b4330bf-npf9k                          1/1     Running     1 (90m ago)   4h52m
longhorn-system      instance-manager-12ec73857d1e3aea875a32230969da75       1/1     Running     0             38m
longhorn-system      instance-manager-ad30a9ee514d3e836de7c5077cfe5ca6       1/1     Running     0             153m
longhorn-system      longhorn-csi-plugin-j5xw4                               3/3     Running     0             4h51m
longhorn-system      longhorn-csi-plugin-v7bdh                               3/3     Running     6 (86m ago)   4h51m
longhorn-system      longhorn-driver-deployer-75c7cb9999-v8xgb               1/1     Running     0             156m
longhorn-system      longhorn-manager-d495r                                  1/1     Running     1 (90m ago)   4h52m
longhorn-system      longhorn-manager-nvgk7                                  1/1     Running     0             4h52m
longhorn-system      longhorn-ui-64c4bfff54-d6c7n                            1/1     Running     0             156m
longhorn-system      longhorn-ui-64c4bfff54-vrx4q                            1/1     Running     0             156m

upadhyayk04 · ‎04-12-2024

Hello,

Based on the error in the trace and the output above, it seems that the Vault pod itself was deleted somehow. Do you see any Vault pods there?

++ kubectl exec vault-0 -n vault-system -- vault operator init -tls-skip-verify -key-shares=1 -key-threshold=1 -format=json
error: unable to upgrade connection: container not found ("vault")
++ '[' 610 -gt 600 ']'

Hae · ‎04-12-2024

@upadhyayk04

The vault-0 pod keeps cycling between Terminating and ContainerCreating, over and over,

because the volume attach fails:

Warning FailedAttachVolume 108s attachdetach-controller AttachVolume.Attach failed for volume "pvc-33f9624d-4d90-48fa-8469-02a104df1d10" : rpc error: code = DeadlineExceeded desc = volume pvc-33f9624d-4d90-48fa-8469-02a104df1d10 failed to attach to node cdppvc2.hadoop.com with
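
The event names both the volume and the target node, which are the two values needed to check the attachment. A small sketch that pulls them out of FailedAttachVolume event text follows. failed_attach_targets is a hypothetical helper, and the commented kubectl commands assume cluster access.

```shell
#!/usr/bin/env bash
# failed_attach_targets: hypothetical helper. Reads event text on stdin and
# prints "VOLUME NODE" for every AttachVolume.Attach failure it finds.
failed_attach_targets() {
  sed -n 's/.*AttachVolume\.Attach failed for volume "\([^"]*\)".*failed to attach to node \([^ ]*\).*/\1 \2/p'
}

# Example (needs cluster access, so it is left commented out):
# kubectl get events -n vault-system --field-selector reason=FailedAttachVolume \
#   | failed_attach_targets | sort -u
# kubectl get volumeattachment
```

kubectl get volumeattachment lists the attachment objects with an ATTACHED column, which shows whether the CSI attach for that volume and node ever completed.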

Hae · ‎04-13-2024

@upadhyayk04

logs -f -n longhorn-system longhorn-csi-plugin-cxglq
Defaulted container "node-driver-registrar" out of: node-driver-registrar, longhorn-liveness-probe, longhorn-csi-plugin
I0413 09:50:20.091344 290593 main.go:166] Version: v2.5.0
I0413 09:50:20.091369 290593 main.go:167] Running node-driver-registrar in mode=registration
I0413 09:50:20.092527 290593 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0413 09:50:21.093286 290593 main.go:198] Calling CSI driver to discover driver name
I0413 09:50:21.094471 290593 main.go:208] CSI driver name: "driver.longhorn.io"
I0413 09:50:21.094497 290593 node_register.go:53] Starting Registration Server at: /registration/driver.longhorn.io-reg.sock
I0413 09:50:21.094656 290593 node_register.go:62] Registration Server started at: /registration/driver.longhorn.io-reg.sock
I0413 09:50:21.094779 290593 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0413 09:50:21.466617 290593 main.go:102] Received GetInfo call: &InfoRequest{}
I0413 09:50:21.466820 290593 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/driver.longhorn.io/registration"
I0413 09:50:23.205994 290593 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Failed Initialize embedded Vault while installing DataServices.