Context
In some Aerospike Kubernetes Operator versions, if a node is replaced, the operator may not detect the new PV and therefore will not reinitialize the device. When replacing a Kubernetes node that uses local drives for the Aerospike pods, it may be necessary to zeroize the device manually, because some cloud providers do not zeroize the header of the drive. The issue typically presents as the pod crashing with an error similar to the following in the pod log:
Mar 02 2023 20:12:49 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2215) /aerospike/dev/xvdf_test: not an Aerospike device but not erased - check config or erase device
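To confirm that a crashing pod is hitting this specific error, the log can be filtered for the message shown above. The following is a minimal sketch, not part of the original procedure: the helper name find_unerased_devices is hypothetical, and the pattern assumes the log line keeps the same layout as the example.

```shell
# Hypothetical helper: read Aerospike log lines on stdin and print the
# device path(s) named in "not an Aerospike device but not erased" errors.
# Prints nothing when the error is absent.
find_unerased_devices() {
  sed -nE 's/.*\(drv_ssd\.c:[0-9]+\) ([^:]+): not an Aerospike device but not erased.*/\1/p' |
    sort -u
}

# Assumed usage (names taken from the example cluster in this article);
# --previous reads the log of the last crashed container instance:
#   kubectl logs -naerospike aerocluster-1-0 -c aerospike-server --previous | find_unerased_devices
```

Note that the path printed is the device path inside the Aerospike container, not the device name on the node.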
Method
The steps below were performed in an EKS environment, but they should apply to other cloud environments that use the OpenEBS storage provisioner. One requirement of the OpenEBS storage provisioner is that the NDM pods run in privileged mode: https://openebs.io/docs/concepts/ndm#privileged-access
Because the NDM pod is privileged, you should be able to exec into it and run the following commands to wipe the drive and zeroize its header:
Locate the NDM pod running on the node with the crashing asd pod:
root@4ef080953322:/# kubectl get pods -naerospike -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aerocluster-1-0 1/2 CrashLoopBackOff 7 (4m55s ago) 18m 172.30.12.47 ip-172-30-12-149.us-west-1.compute.internal <none> <none>
aerocluster-1-1 2/2 Running 0 124m 172.30.4.8 ip-172-30-2-189.us-west-1.compute.internal <none> <none>
aerocluster-2-0 2/2 Running 0 173m 172.30.19.124 ip-172-30-30-206.us-west-1.compute.internal <none> <none>
root@4ef080953322:/# kubectl get pods -nopenebs -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openebs-localpv-provisioner-795dc8dcff-9knqq 1/1 Running 0 3h4m 172.30.27.53 ip-172-30-30-206.us-west-1.compute.internal <none> <none>
openebs-ndm-9p4mw 1/1 Running 0 174m 172.30.30.206 ip-172-30-30-206.us-west-1.compute.internal <none> <none>
openebs-ndm-hbvhf 1/1 Running 0 17m 172.30.12.149 ip-172-30-12-149.us-west-1.compute.internal <none> <none>
openebs-ndm-operator-55f7cfb488-242fx 1/1 Running 0 3h18m 172.30.20.91 ip-172-30-30-206.us-west-1.compute.internal <none> <none>
openebs-ndm-ptrfk 1/1 Running 0 127m 172.30.2.189 ip-172-30-2-189.us-west-1.compute.internal <none> <none>
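Matching the NODE column by eye works for a small cluster. As a sketch of how the same lookup could be scripted, the function below reads the node name from the crashing pod and then lists the pods in the openebs namespace scheduled on that node. The function name ndm_pod_for and the KUBECTL override variable are hypothetical, and the filter assumes the NDM daemon pods are named openebs-ndm-* as in the output above.

```shell
# KUBECTL can be overridden for a dry run; it defaults to the real CLI.
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper. Usage: ndm_pod_for <namespace> <pod>
# Prints the NDM daemon pod that runs on the same node as the given pod.
ndm_pod_for() {
  # Node on which the (crashing) Aerospike pod is scheduled.
  node=$($KUBECTL get pod -n "$1" "$2" -o jsonpath='{.spec.nodeName}') || return 1
  [ -n "$node" ] || return 1
  # Pods in the openebs namespace on that node, keeping only the NDM
  # daemon pod and dropping the NDM operator (assumed naming).
  $KUBECTL get pods -n openebs --field-selector "spec.nodeName=$node" -o name |
    grep '^pod/openebs-ndm-' | grep -v '^pod/openebs-ndm-operator-'
}

# Assumed usage with the names from this article:
#   ndm_pod_for aerospike aerocluster-1-0
```

In the example cluster this would be expected to select openebs-ndm-hbvhf, the NDM pod on ip-172-30-12-149.us-west-1.compute.internal.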
Exec into the pod and list the devices:
root@4ef080953322:/# kubectl exec -it -nopenebs openebs-ndm-hbvhf -- sh
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 139.7G 0 loop
nvme1n1 259:0 0 139.7G 0 disk
nvme0n1 259:1 0 40G 0 disk
|-nvme0n1p1 259:2 0 40G 0 part /var/openebs/sparse
| /var/openebs/ndm
| /etc/hostname
| /dev/termination-log
| /etc/hosts
| /etc/resolv.conf
| /host/node-disk-manager.config
`-nvme0n1p128 259:3 0 1M 0 part
nvme2n1 259:4 0 1G 0 disk
If you're unsure which device is the one in question, describe the PV and `ls` its path. In the example below, the pod has two PVCs, and only one is provisioned by OpenEBS for the local drive:
root@4ef080953322:/# kubectl get pv -owide
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE VOLUMEMODE
pvc-2cf9d4e9-653e-47bf-8c45-b3594e342baa 50Gi RWO Delete Bound aerospike/test-aerocluster-2-0 openebs-device 3h12m Block
pvc-468003f9-b398-45c6-a97d-0113c59b5e3a 1Gi RWO Delete Bound aerospike/workdir-aerocluster-2-0 gp2 3h33m Filesystem
pvc-49b926ed-074b-49dd-a7de-428e8c0297b6 50Gi RWO Delete Bound aerospike/test-aerocluster-1-0 openebs-device 37m Block
pvc-6ffb66ca-0159-4e2e-97d8-254593081983 50Gi RWO Delete Bound aerospike/test-aerocluster-1-1 openebs-device 145m Block
pvc-a5e9851c-50ba-4e70-9a46-840be1c576a5 1Gi RWO Delete Bound aerospike/workdir-aerocluster-1-0 gp2 38m Filesystem
pvc-e23a79d4-27b2-4b26-8740-e865c77128b3 1Gi RWO Delete Bound aerospike/workdir-aerocluster-1-1 gp2 145m Filesystem
root@4ef080953322:/# kubectl describe pv pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
Name: pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
Labels: openebs.io/cas-type=local-device
Annotations: local.openebs.io/blockdeviceclaim: bdc-pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
pv.kubernetes.io/provisioned-by: openebs.io/local
Finalizers: [kubernetes.io/pv-protection]
StorageClass: openebs-device
Status: Bound
Claim: aerospike/test-aerocluster-1-0
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Block
Capacity: 50Gi
Node Affinity:
Required Terms:
Term 0: kubernetes.io/hostname in [ip-172-30-12-149.us-west-1.compute.internal]
Message:
Source:
Type: LocalVolume (a persistent volume backed by local storage on a node)
Path: /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A
Events: <none>
root@4ef080953322:/# kubectl exec -it -nopenebs openebs-ndm-hbvhf -- sh
# ls -la /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A
lrwxrwxrwx 1 root root 13 May 19 15:51 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A -> ../../nvme1n1
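The symlink lookup above can also be done in one step with readlink. The sketch below is an illustration only: resolve_device is a hypothetical helper name, and it assumes a readlink that supports the -f option (GNU coreutils and most busybox builds do).

```shell
# Hypothetical helper. Usage: resolve_device <path>
# Follows symlinks and prints the canonical path of the target, for
# example /dev/nvme1n1 for the by-id link shown above.
resolve_device() {
  target=$(readlink -f "$1") || return 1
  # Refuse to report a target that does not exist.
  [ -e "$target" ] || return 1
  printf '%s\n' "$target"
}

# Assumed usage inside the NDM pod, with the PV path from this article:
#   resolve_device /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A
# The path itself can be read from the PV without describe:
#   kubectl get pv pvc-49b926ed-074b-49dd-a7de-428e8c0297b6 -o jsonpath='{.spec.local.path}'
```

Double-check the resolved name against the lsblk output before wiping anything, since the next step is destructive.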
Then run blkdiscard to discard all blocks on the drive, chained with a second blkdiscard that uses the zeroize option (-z) to write zeros over the first 8 MiB, which covers the Aerospike device header:
# blkdiscard /dev/nvme1n1 && blkdiscard -z --length 8MiB /dev/nvme1n1
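Before restarting the pod, it can be reassuring to confirm that the header region really reads back as zeros. The following is a minimal sketch under stated assumptions: header_is_zeroed is a hypothetical helper, and the default of 8388608 bytes matches the 8 MiB zeroized above and the header size that the server reports in its startup log.

```shell
# Hypothetical helper. Usage: header_is_zeroed <device-or-file> [bytes]
# Succeeds (exit 0) when the first <bytes> bytes contain only zero bytes.
header_is_zeroed() {
  dev=$1
  bytes=${2:-8388608}   # 8 MiB
  # Strip NUL bytes and count what is left; anything left is non-zero data.
  # A device shorter than <bytes> is judged on what can be read.
  nonzero=$(head -c "$bytes" "$dev" | tr -d '\000' | wc -c | tr -d ' ')
  [ "$nonzero" -eq 0 ]
}

# Assumed usage inside the NDM pod, with the device from this article:
#   header_is_zeroed /dev/nvme1n1 && echo "header is zeroed"
```

The check only reads from the device, so it is safe to repeat.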
Once this is complete, delete the crashing pod so that it is recreated and rescheduled; it should then start normally:
root@4ef080953322:/# kubectl delete pod -naerospike aerocluster-1-0
pod "aerocluster-1-0" deleted
root@4ef080953322:/# kubectl logs -naerospike -f aerocluster-1-0
Defaulted container "aerospike-server" out of: aerospike-server, exporter, aerospike-init (init)
link eth0 state up
link eth0 state up in 0
May 19 2023 15:51:27 GMT: INFO (as): (as.c:310) <><><><><><><><><><> Aerospike Enterprise Edition build 6.1.0.3 <><><><><><><><><><>
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) logging {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) console {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) context any info
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) namespace test {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) memory-size 2000000000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) nsup-period 120
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) rack-id 1
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) replication-factor 2
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) storage-engine device {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) device /aerospike/dev/xvdf_test
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) network {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) fabric {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) port 3001
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) heartbeat {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) mesh-seed-address-port aerocluster-2-0.aerocluster.aerospike 3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) mesh-seed-address-port aerocluster-1-1.aerocluster.aerospike 3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) mode mesh
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) port 3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) service {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) access-address 172.30.12.149
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) access-port 3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) alternate-access-address 13.57.205.206
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) alternate-access-port 3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) port 3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) security {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) service {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) cluster-name aerocluster
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) debug-allocations all
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) feature-key-file /etc/aerospike/secret/features.conf
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) node-id 1a0
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (features_ee.c:258) loaded feature key #224670751 (/etc/aerospike/secret/features.conf)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3988) system file descriptor limit: 1048576, proto-fd-max: 15000
May 19 2023 15:51:27 GMT: INFO (hardware): (hardware.c:1946) detected 4 CPU(s), 2 core(s), 1 NUMA node(s)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:4020) node-id 1a0
May 19 2023 15:51:27 GMT: WARNING (os): (os.c:239) failed min-free-kbytes check - min_free_kbytes should be at least 1150976
May 19 2023 15:51:27 GMT: WARNING (os): (os.c:260) failed swappiness check - swappiness not set to 0
May 19 2023 15:51:27 GMT: WARNING (config): (cfg.c:4418) failed best-practices checks - see 'https://docs.aerospike.com/docs/operations/install/linux/bestpractices/index.html'
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/evict.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/XDR.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/roster.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (storage): (drv_common_ee.c:108) {test} peek found fresh device /aerospike/dev/xvdf_test
May 19 2023 15:51:27 GMT: INFO (namespace): (namespace_ee.c:218) {test} found no stored data, will cold start
May 19 2023 15:51:27 GMT: INFO (namespace): (namespace_ee.c:365) {test} beginning cold start
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/sindex.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/truncate.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3247) usable device size must be header size 8388608 + multiple of 1048576, rounding down
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3336) opened device /aerospike/dev/xvdf_test: usable size 149999845376, io-min-size 512
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1072) /aerospike/dev/xvdf_test has 143051 wblocks of size 1048576
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3101) {test} found all 1 devices fresh, initializing to random 8133044483554163411
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1040) {test} loading free & defrag queues
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:977) /aerospike/dev/xvdf_test init defrag profile: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1059) /aerospike/dev/xvdf_test init wblocks: pristine-id 8 pristine 143043 free-q 0, defrag-q 0
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:2166) {test} starting device maintenance threads
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1512) {test} starting write threads
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:897) {test} starting defrag threads
May 19 2023 15:51:27 GMT: INFO (as): (as.c:387) initializing services...
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/security.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (service): (service.c:168) starting 20 service threads
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6789) added new mesh seed aerocluster-2-0.aerocluster.aerospike:3002
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6789) added new mesh seed aerocluster-1-1.aerocluster.aerospike:3002
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:792) updated fabric published address list to {172.30.15.136:3001}
May 19 2023 15:51:27 GMT: INFO (partition): (partition_balance.c:201) {test} 4096 partitions: found 4096 absent, 0 stored
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:5519) updated heartbeat published address list to {172.30.15.136:3002}
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/UDF.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (batch): (batch.c:807) starting 4 batch-index-threads
May 19 2023 15:51:27 GMT: INFO (health): (health.c:318) starting health monitor thread
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:417) starting 8 fabric send threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 16 fabric rw channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric ctrl channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric bulk channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric meta channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:443) starting fabric accept thread
May 19 2023 15:51:27 GMT: INFO (xdr): (xdr_ee.c:148) {test} starting XDR tomb raider thread
May 19 2023 15:51:27 GMT: INFO (fabric): (socket.c:817) Started fabric endpoint 0.0.0.0:3001
May 19 2023 15:51:27 GMT: INFO (fabric): (socket.c:817) Started fabric endpoint [::]:3001
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6974) initializing mesh heartbeat socket: 0.0.0.0:3002
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:7004) mtu of the network is 9001
May 19 2023 15:51:27 GMT: INFO (hb): (socket.c:817) Started mesh heartbeat endpoint 0.0.0.0:3002
May 19 2023 15:51:27 GMT: INFO (hb): (socket.c:817) Started mesh heartbeat endpoint [::]:3002
May 19 2023 15:51:27 GMT: INFO (nsup): (nsup.c:192) starting namespace supervisor threads
May 19 2023 15:51:27 GMT: INFO (service): (service.c:941) starting reaper thread
May 19 2023 15:51:27 GMT: INFO (service): (socket.c:817) Started client endpoint 0.0.0.0:3000
May 19 2023 15:51:27 GMT: INFO (service): (socket.c:817) Started client endpoint [::]:3000
May 19 2023 15:51:27 GMT: INFO (service): (service.c:200) starting accept thread
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1452) {test} starting tomb raider thread
May 19 2023 15:51:27 GMT: INFO (as): (as.c:427) service ready: soon there will be cake!
May 19 2023 15:51:28 GMT: INFO (hb): (hb.c:8575) node arrived 2a0
May 19 2023 15:51:28 GMT: INFO (fabric): (fabric.c:2580) fabric: node 2a0 arrived
May 19 2023 15:51:28 GMT: INFO (hb): (hb.c:8575) node arrived 1a1
May 19 2023 15:51:28 GMT: INFO (clustering): (clustering.c:5989) sent cluster join request to 2a0
May 19 2023 15:51:28 GMT: INFO (fabric): (fabric.c:2580) fabric: node 1a1 arrived
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5794) applied new cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5796) applied new succession list 2a0 1a1 1a0
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5798) applied cluster size 3
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:2346) data exchange started with cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:2729) exchange-compatibility-id: self 11 cluster-min 0 -> 11 cluster-max 0 -> 11
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:3297) received commit command from principal node 2a0
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:3260) data exchange completed with cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (partition): (partition_balance.c:1003) {test} replication factor is 2
May 19 2023 15:51:29 GMT: INFO (partition): (partition_balance.c:976) {test} rebalanced: expected-migrations (0,2028,2730) fresh-partitions 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:163) NODE-ID 1a0 CLUSTER-SIZE 3
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:234) cluster-clock: skew-ms 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:255) system: total-cpu-pct 13 user-cpu-pct 5 kernel-cpu-pct 8 free-mem-kbytes 32134436 free-mem-pct 98 thp-mem-kbytes 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:277) process: cpu-pct 11 threads (9,63,41,41) heap-kbytes (238112,241260,352768) heap-efficiency-pct 98.7
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:287) in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:311) fds: proto (0,0,0) heartbeat (2,2,0) fabric (48,48,0)
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:320) heartbeat-received: self 0 foreign 130
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:346) fabric-bytes-per-second: bulk (9,9) ctrl (22132,27035) meta (14,32) rw (38,38)
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:405) {test} objects: all 0 master 0 prole 0 non-replica 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:462) {test} migrations: remaining (0,18,439) active (0,1,1) complete-pct 99.11
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:495) {test} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:564) {test} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
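In the startup log above, the line "peek found fresh device" indicates that the server accepted the wiped drive, and the original CRITICAL error no longer appears. As a rough sketch, the same check could be scripted as follows. The helper name device_came_up_fresh is hypothetical, and the two message strings are taken from the logs in this article, so other server versions may word them differently.

```shell
# Hypothetical helper: read an Aerospike startup log on stdin and succeed
# only if a fresh device was detected and the "not erased" error is absent.
device_came_up_fresh() {
  log=$(cat)
  printf '%s\n' "$log" | grep -q 'peek found fresh device' &&
    ! printf '%s\n' "$log" | grep -q 'not an Aerospike device but not erased'
}

# Assumed usage with the names from this article:
#   kubectl logs -naerospike aerocluster-1-0 -c aerospike-server | device_came_up_fresh && echo "device OK"
```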