How to zeroize a local drive after replacing a Kubernetes node when using OpenEBS

Context

In some Aerospike Kubernetes Operator versions, if a node is replaced, the operator may not see the new PV and therefore will not reinitialize the device. When replacing a Kubernetes node that uses local drives for the Aerospike pods, it may be necessary to zeroize the device, because some cloud providers do not zeroize the header of the drive. The issue typically presents as the pod crashing with an error similar to the following in the pod log:
 
Mar 02 2023 20:12:49 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2215) /aerospike/dev/xvdf_test: not an Aerospike device but not erased - check config or erase device

Method

The steps below were performed in an EKS environment, but they should apply to other cloud environments that use the OpenEBS storage provisioner.
The OpenEBS storage provisioner requires the NDM pods to run in privileged mode: https://openebs.io/docs/concepts/ndm#privileged-access
Because the NDM pod is privileged, you can exec into it and run the commands below to wipe the device and zeroize its header.
Locate the NDM pod running on the same node as the crashing asd pod. In this example the crashing pod aerocluster-1-0 is on node ip-172-30-12-149.us-west-1.compute.internal, so the matching NDM pod is openebs-ndm-hbvhf:
root@4ef080953322:/# kubectl get pods -naerospike -owide
NAME              READY   STATUS             RESTARTS        AGE    IP              NODE                                          NOMINATED NODE   READINESS GATES
aerocluster-1-0   1/2     CrashLoopBackOff   7 (4m55s ago)   18m    172.30.12.47    ip-172-30-12-149.us-west-1.compute.internal   <none>           <none>
aerocluster-1-1   2/2     Running            0               124m   172.30.4.8      ip-172-30-2-189.us-west-1.compute.internal    <none>           <none>
aerocluster-2-0   2/2     Running            0               173m   172.30.19.124   ip-172-30-30-206.us-west-1.compute.internal   <none>           <none>

root@4ef080953322:/# kubectl get pods -nopenebs -owide
NAME                                           READY   STATUS    RESTARTS   AGE     IP              NODE                                          NOMINATED NODE   READINESS GATES
openebs-localpv-provisioner-795dc8dcff-9knqq   1/1     Running   0          3h4m    172.30.27.53    ip-172-30-30-206.us-west-1.compute.internal   <none>           <none>
openebs-ndm-9p4mw                              1/1     Running   0          174m    172.30.30.206   ip-172-30-30-206.us-west-1.compute.internal   <none>           <none>
openebs-ndm-hbvhf                              1/1     Running   0          17m     172.30.12.149   ip-172-30-12-149.us-west-1.compute.internal   <none>           <none>
openebs-ndm-operator-55f7cfb488-242fx          1/1     Running   0          3h18m   172.30.20.91    ip-172-30-30-206.us-west-1.compute.internal   <none>           <none>
openebs-ndm-ptrfk                              1/1     Running   0          127m    172.30.2.189    ip-172-30-2-189.us-west-1.compute.internal    <none>           <none>
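
The node-to-pod lookup above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function names `ndm_pod_on_node` and `find_ndm_pod` are hypothetical, and it assumes that OpenEBS runs in the `openebs` namespace and that the NDM daemon pods keep the `openebs-ndm-` name prefix shown in the output above.

```shell
#!/bin/sh
# Filter "NAME NODE" pairs read from stdin and print the NDM daemon pod(s)
# scheduled on the node given as $1. The NDM operator pod is excluded.
ndm_pod_on_node() {
  awk -v node="$1" '
    $2 == node && $1 ~ /^openebs-ndm-/ && $1 !~ /^openebs-ndm-operator-/ { print $1 }
  '
}

# Hypothetical wrapper: $1 = namespace of the crashing pod, $2 = its name.
# Assumes OpenEBS is installed in the "openebs" namespace, as in this article.
find_ndm_pod() {
  node=$(kubectl get pod -n "$1" "$2" -o jsonpath='{.spec.nodeName}') || return 1
  kubectl get pods -n openebs --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    ndm_pod_on_node "$node"
}
```

With the cluster shown in this article, `find_ndm_pod aerospike aerocluster-1-0` would be expected to print `openebs-ndm-hbvhf`.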

Exec into that NDM pod and list the block devices:

root@4ef080953322:/# kubectl exec -it -nopenebs openebs-ndm-hbvhf -- sh
# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0           7:0    0 139.7G  0 loop 
nvme1n1       259:0    0 139.7G  0 disk 
nvme0n1       259:1    0    40G  0 disk 
|-nvme0n1p1   259:2    0    40G  0 part /var/openebs/sparse
|                                       /var/openebs/ndm
|                                       /etc/hostname
|                                       /dev/termination-log
|                                       /etc/hosts
|                                       /etc/resolv.conf
|                                       /host/node-disk-manager.config
`-nvme0n1p128 259:3    0     1M  0 part 
nvme2n1       259:4    0     1G  0 disk 

If you're unsure which device is the one in question, describe the PV and `ls` the path it reports. In the example below each pod has two PVCs, and only one of them (storage class openebs-device) is provisioned by OpenEBS on the local drive:

root@4ef080953322:/# kubectl get pv -owide
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS     REASON   AGE     VOLUMEMODE
pvc-2cf9d4e9-653e-47bf-8c45-b3594e342baa   50Gi       RWO            Delete           Bound    aerospike/test-aerocluster-2-0      openebs-device            3h12m   Block
pvc-468003f9-b398-45c6-a97d-0113c59b5e3a   1Gi        RWO            Delete           Bound    aerospike/workdir-aerocluster-2-0   gp2                       3h33m   Filesystem
pvc-49b926ed-074b-49dd-a7de-428e8c0297b6   50Gi       RWO            Delete           Bound    aerospike/test-aerocluster-1-0      openebs-device            37m     Block
pvc-6ffb66ca-0159-4e2e-97d8-254593081983   50Gi       RWO            Delete           Bound    aerospike/test-aerocluster-1-1      openebs-device            145m    Block
pvc-a5e9851c-50ba-4e70-9a46-840be1c576a5   1Gi        RWO            Delete           Bound    aerospike/workdir-aerocluster-1-0   gp2                       38m     Filesystem
pvc-e23a79d4-27b2-4b26-8740-e865c77128b3   1Gi        RWO            Delete           Bound    aerospike/workdir-aerocluster-1-1   gp2                       145m    Filesystem

root@4ef080953322:/# kubectl describe pv pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
Name:              pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
Labels:            openebs.io/cas-type=local-device
Annotations:       local.openebs.io/blockdeviceclaim: bdc-pvc-49b926ed-074b-49dd-a7de-428e8c0297b6
                   pv.kubernetes.io/provisioned-by: openebs.io/local
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      openebs-device
Status:            Bound
Claim:             aerospike/test-aerocluster-1-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Block
Capacity:          50Gi
Node Affinity:     
  Required Terms:  
    Term 0:        kubernetes.io/hostname in [ip-172-30-12-149.us-west-1.compute.internal]
Message:           
Source:
    Type:  LocalVolume (a persistent volume backed by local storage on a node)
    Path:  /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A
Events:    <none>

root@4ef080953322:/# kubectl exec -it -nopenebs openebs-ndm-hbvhf -- sh
# ls -la /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A
lrwxrwxrwx 1 root root 13 May 19 15:51 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS229CD92DA5F7BEF6A -> ../../nvme1n1
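
The describe-and-`ls` steps above can be condensed into two small helpers. This is a sketch under assumptions: the function names are hypothetical, `pv_local_path` relies on the PV being a LocalVolume (so the path is stored in `.spec.local.path`), and `resolve_device` has to run inside the NDM pod and assumes `readlink -f` is available in that image.

```shell
#!/bin/sh
# Print the local path recorded in a LocalVolume PV ($1 = PV name).
# Run this where kubectl is configured.
pv_local_path() {
  kubectl get pv "$1" -o jsonpath='{.spec.local.path}'
}

# Resolve a /dev/disk/by-id symlink ($1) to the device node it points at.
# Run this inside the NDM pod on the affected node.
resolve_device() {
  readlink -f "$1"
}
```

For the PV in this example, `pv_local_path pvc-49b926ed-074b-49dd-a7de-428e8c0297b6` should return the `/dev/disk/by-id/...` path shown above, and `resolve_device` on that path inside the NDM pod should return `/dev/nvme1n1`.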

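Then run blkdiscard to discard the whole drive, and chain a second blkdiscard with the zeroize option (-z) to write zeros over the first 8 MiB of the device, which matches the header size the Aerospike server reports in its log (8388608 bytes):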

# blkdiscard /dev/nvme1n1 && blkdiscard -z --length 8MiB /dev/nvme1n1
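
Because blkdiscard is destructive, it may be worth wrapping the command above in a guard. The following is only a sketch: `wipe_aerospike_device` is a hypothetical name, and the mount check reflects only what `lsblk` reports from inside the container, so it does not replace confirming the device against the PV as described earlier.

```shell
#!/bin/sh
# Discard a device and zero its first 8 MiB, refusing obviously wrong targets.
# $1 = device node, for example /dev/nvme1n1.
wipe_aerospike_device() {
  dev="$1"
  # Refuse anything that is not a block device (typos, regular files, ...).
  if [ ! -b "$dev" ]; then
    echo "refusing: $dev is not a block device" >&2
    return 1
  fi
  # Refuse if the device or one of its partitions reports a mount point.
  if lsblk -no MOUNTPOINT "$dev" | grep -q .; then
    echo "refusing: $dev or one of its partitions is mounted" >&2
    return 2
  fi
  # Same two commands as in the article: full discard, then zero the header.
  blkdiscard "$dev" && blkdiscard -z --length 8MiB "$dev"
}
```

In the example from this article the call would be `wipe_aerospike_device /dev/nvme1n1`, run inside the NDM pod.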

Once this completes, delete the crashing pod so that it is recreated and rescheduled. It should then come up normally:

root@4ef080953322:/# kubectl delete pod -naerospike aerocluster-1-0
pod "aerocluster-1-0" deleted
root@4ef080953322:/# kubectl logs -naerospike -f aerocluster-1-0 
Defaulted container "aerospike-server" out of: aerospike-server, exporter, aerospike-init (init)
link eth0 state up
link eth0 state up in 0
May 19 2023 15:51:27 GMT: INFO (as): (as.c:310) <><><><><><><><><><>  Aerospike Enterprise Edition build 6.1.0.3  <><><><><><><><><><>
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) logging {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     console {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         context any    info
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) namespace test {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     memory-size    2000000000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     nsup-period    120
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     rack-id    1
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     replication-factor    2
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     storage-engine device {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         device    /aerospike/dev/xvdf_test
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) network {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     fabric {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         port    3001
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     heartbeat {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         mesh-seed-address-port aerocluster-2-0.aerocluster.aerospike 3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         mesh-seed-address-port aerocluster-1-1.aerocluster.aerospike 3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         mode    mesh
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         port    3002
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     service {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         access-address    172.30.12.149
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         access-port    3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         alternate-access-address    13.57.205.206
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         alternate-access-port    3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)         port    3000
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) security {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) 
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) service {
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     cluster-name    aerocluster
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     debug-allocations    all
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     feature-key-file    /etc/aerospike/secret/features.conf
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960)     node-id    1a0
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3960) }
May 19 2023 15:51:27 GMT: INFO (config): (features_ee.c:258) loaded feature key #224670751 (/etc/aerospike/secret/features.conf)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:3988) system file descriptor limit: 1048576, proto-fd-max: 15000
May 19 2023 15:51:27 GMT: INFO (hardware): (hardware.c:1946) detected 4 CPU(s), 2 core(s), 1 NUMA node(s)
May 19 2023 15:51:27 GMT: INFO (config): (cfg.c:4020) node-id 1a0
May 19 2023 15:51:27 GMT: WARNING (os): (os.c:239) failed min-free-kbytes check - min_free_kbytes should be at least 1150976
May 19 2023 15:51:27 GMT: WARNING (os): (os.c:260) failed swappiness check - swappiness not set to 0
May 19 2023 15:51:27 GMT: WARNING (config): (cfg.c:4418) failed best-practices checks - see 'https://docs.aerospike.com/docs/operations/install/linux/bestpractices/index.html'
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/evict.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/XDR.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/roster.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (storage): (drv_common_ee.c:108) {test} peek found fresh device /aerospike/dev/xvdf_test
May 19 2023 15:51:27 GMT: INFO (namespace): (namespace_ee.c:218) {test} found no stored data, will cold start
May 19 2023 15:51:27 GMT: INFO (namespace): (namespace_ee.c:365) {test} beginning cold start
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/sindex.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/truncate.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3247) usable device size must be header size 8388608 + multiple of 1048576, rounding down
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3336) opened device /aerospike/dev/xvdf_test: usable size 149999845376, io-min-size 512
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1072) /aerospike/dev/xvdf_test has 143051 wblocks of size 1048576
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:3101) {test} found all 1 devices fresh, initializing to random 8133044483554163411
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1040) {test} loading free & defrag queues
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:977) /aerospike/dev/xvdf_test init defrag profile: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1059) /aerospike/dev/xvdf_test init wblocks: pristine-id 8 pristine 143043 free-q 0, defrag-q 0
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:2166) {test} starting device maintenance threads
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:1512) {test} starting write threads
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd.c:897) {test} starting defrag threads
May 19 2023 15:51:27 GMT: INFO (as): (as.c:387) initializing services...
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/security.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (service): (service.c:168) starting 20 service threads
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6789) added new mesh seed aerocluster-2-0.aerocluster.aerospike:3002
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6789) added new mesh seed aerocluster-1-1.aerocluster.aerospike:3002
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:792) updated fabric published address list to {172.30.15.136:3001}
May 19 2023 15:51:27 GMT: INFO (partition): (partition_balance.c:201) {test} 4096 partitions: found 4096 absent, 0 stored
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:5519) updated heartbeat published address list to {172.30.15.136:3002}
May 19 2023 15:51:27 GMT: INFO (smd): (smd.c:2327) no file '/opt/aerospike/smd/UDF.smd' - starting empty
May 19 2023 15:51:27 GMT: INFO (batch): (batch.c:807) starting 4 batch-index-threads
May 19 2023 15:51:27 GMT: INFO (health): (health.c:318) starting health monitor thread
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:417) starting 8 fabric send threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 16 fabric rw channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric ctrl channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric bulk channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric meta channel recv threads
May 19 2023 15:51:27 GMT: INFO (fabric): (fabric.c:443) starting fabric accept thread
May 19 2023 15:51:27 GMT: INFO (xdr): (xdr_ee.c:148) {test} starting XDR tomb raider thread
May 19 2023 15:51:27 GMT: INFO (fabric): (socket.c:817) Started fabric endpoint 0.0.0.0:3001
May 19 2023 15:51:27 GMT: INFO (fabric): (socket.c:817) Started fabric endpoint [::]:3001
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:6974) initializing mesh heartbeat socket: 0.0.0.0:3002
May 19 2023 15:51:27 GMT: INFO (hb): (hb.c:7004) mtu of the network is 9001
May 19 2023 15:51:27 GMT: INFO (hb): (socket.c:817) Started mesh heartbeat endpoint 0.0.0.0:3002
May 19 2023 15:51:27 GMT: INFO (hb): (socket.c:817) Started mesh heartbeat endpoint [::]:3002
May 19 2023 15:51:27 GMT: INFO (nsup): (nsup.c:192) starting namespace supervisor threads
May 19 2023 15:51:27 GMT: INFO (service): (service.c:941) starting reaper thread
May 19 2023 15:51:27 GMT: INFO (service): (socket.c:817) Started client endpoint 0.0.0.0:3000
May 19 2023 15:51:27 GMT: INFO (service): (socket.c:817) Started client endpoint [::]:3000
May 19 2023 15:51:27 GMT: INFO (service): (service.c:200) starting accept thread
May 19 2023 15:51:27 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1452) {test} starting tomb raider thread
May 19 2023 15:51:27 GMT: INFO (as): (as.c:427) service ready: soon there will be cake!
May 19 2023 15:51:28 GMT: INFO (hb): (hb.c:8575) node arrived 2a0
May 19 2023 15:51:28 GMT: INFO (fabric): (fabric.c:2580) fabric: node 2a0 arrived
May 19 2023 15:51:28 GMT: INFO (hb): (hb.c:8575) node arrived 1a1
May 19 2023 15:51:28 GMT: INFO (clustering): (clustering.c:5989) sent cluster join request to 2a0
May 19 2023 15:51:28 GMT: INFO (fabric): (fabric.c:2580) fabric: node 1a1 arrived
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5794) applied new cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5796) applied new succession list 2a0 1a1 1a0
May 19 2023 15:51:29 GMT: INFO (clustering): (clustering.c:5798) applied cluster size 3
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:2346) data exchange started with cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:2729) exchange-compatibility-id: self 11 cluster-min 0 -> 11 cluster-max 0 -> 11
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:3297) received commit command from principal node 2a0
May 19 2023 15:51:29 GMT: INFO (exchange): (exchange.c:3260) data exchange completed with cluster key 1c9127282376
May 19 2023 15:51:29 GMT: INFO (partition): (partition_balance.c:1003) {test} replication factor is 2
May 19 2023 15:51:29 GMT: INFO (partition): (partition_balance.c:976) {test} rebalanced: expected-migrations (0,2028,2730) fresh-partitions 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:163) NODE-ID 1a0 CLUSTER-SIZE 3
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:234)    cluster-clock: skew-ms 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:255)    system: total-cpu-pct 13 user-cpu-pct 5 kernel-cpu-pct 8 free-mem-kbytes 32134436 free-mem-pct 98 thp-mem-kbytes 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:277)    process: cpu-pct 11 threads (9,63,41,41) heap-kbytes (238112,241260,352768) heap-efficiency-pct 98.7
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:287)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:311)    fds: proto (0,0,0) heartbeat (2,2,0) fabric (48,48,0)
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:320)    heartbeat-received: self 0 foreign 130
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:346)    fabric-bytes-per-second: bulk (9,9) ctrl (22132,27035) meta (14,32) rw (38,38)
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:405) {test} objects: all 0 master 0 prole 0 non-replica 0
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:462) {test} migrations: remaining (0,18,439) active (0,1,1) complete-pct 99.11
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:495) {test} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
May 19 2023 15:51:37 GMT: INFO (info): (ticker.c:564) {test} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
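
Rather than reading the whole log, the two messages that matter can be checked for directly. This is a minimal sketch with a hypothetical function name, `classify_device_log`. It only looks for the two message texts quoted in this article, which may differ in other server versions.

```shell
#!/bin/sh
# Read an Aerospike server log on stdin and print one of:
#   not-erased  the "not an Aerospike device but not erased" error is present
#   fresh       the device was detected as fresh (header zeroized)
#   unknown     neither message was found
classify_device_log() {
  awk '
    /not an Aerospike device but not erased/ { bad = 1 }
    /peek found fresh device/               { fresh = 1 }
    END {
      if (bad) print "not-erased"
      else if (fresh) print "fresh"
      else print "unknown"
    }
  '
}
```

Against a live cluster this could be used as `kubectl logs -n aerospike aerocluster-1-0 -c aerospike-server | classify_device_log`, where `aerospike-server` is the container name shown in the log output above.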

 


Notes

With this method there is no need to modify the BD (BlockDevice) or the BDCs (BlockDeviceClaims). You may still need to delete the PVC so that the pod can be rescheduled to the new node, if you have not already done so.
 

Applies To Earliest Version

5.2

Applies To Latest Version

Current Version