Containerized kubelet and local disk volume lifecycle - kubernetes

Platform: OEL 7.7 + kube 1.15.5 + docker 19.03.1
We're building an erasure-coded object store on k8s using a containerized kubelet approach, and we're having a tough time coming up with a viable disk lifecycle approach. As it stands, we must provide an "extra_binds" argument to the kubelet which specifies the base mount point where our block devices are mounted (80 SSDs per node, formatted as ext4).
That all works fine: creating PVs and deploying apps works fine. Our problem comes when a PVC is deleted and we want to scrub the disk(s) that were used and make them available again.
So far the only thing that works is to cordon that node, remove the extra binds from kubelet, bounce the node, reconfigure the block device, and re-add the kubelet binds. Obviously this is too clunky for production; for starters, bouncing kubelet is not an option.
Once a PV gets used, something is locking the block device, even though lsof on the bare-metal system shows no open handles. I can't unmount the device or create a new filesystem on it, and merely bouncing kubelet doesn't free up the "lock".
Anyone using a containerized kubernetes control plane with an app using local disks in a similar fashion? Anyone found a viable way to work around this issue?
Our long term plan is to write an operator that manages disks but even with an operator I don't see how it can mitigate this problem.
Thanks for any help,

First look at your Finalizers:
$ kubectl describe pvc <PVC_NAME> | grep Finalizers
$ kubectl describe pv <PV_NAME> | grep Finalizers
If they are set to Finalizers: [kubernetes.io/pvc-protection] (explained here), that means they are protected and you need to edit that, for example using:
$ kubectl patch pvc <PVC_NAME> -p '{"metadata":{"finalizers":null}}'
As for forcefully removing PersistentVolumes you can try
$ kubectl delete pv <PV_NAME> --force --grace-period=0
Also, please check whether VolumeAttachment objects still exist ($ kubectl get volumeattachment), as they might be what is blocking the device.
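If a stale VolumeAttachment turns out to be the blocker, a rough cleanup sketch (the attachment name is a placeholder, and clearing finalizers should be a last resort):
$ kubectl get volumeattachment | grep <PV_NAME>
$ kubectl delete volumeattachment <ATTACHMENT_NAME>
# if the delete hangs on a finalizer:
$ kubectl patch volumeattachment <ATTACHMENT_NAME> -p '{"metadata":{"finalizers":null}}'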
I also remember there was an issue on Stack Overflow, Kubernetes PV refuses to bind after delete/re-create, stating that the PV holds the uid of the PVC that previously claimed it.
You can check that by displaying the whole YAML of the PV:
$ kubectl get pv <PV_NAME> -o yaml and looking for:
claimRef:
  apiVersion: v1
  kind: PersistentVolumeClaim
  name: packages-pvc
  namespace: default
  resourceVersion: "10218121"
  uid: 1aede3e6-eaa1-11e9-a594-42010a9c0005
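If the PV is stuck holding a stale claimRef like the one above, one common workaround (a sketch, not something verified against your setup) is to clear the claimRef so the PV can bind to a freshly created PVC:
$ kubectl patch pv <PV_NAME> -p '{"spec":{"claimRef":null}}'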
You would need to provide more information regarding your k8s cluster and PV/PVC configuration so I could go deeper into it or even test it.

Related

Kubernetes - All PVCs Bound, yet "pod has unbound immediate PersistentVolumeClaims"

Unfortunately I am unable to paste configs or kubectl output, but please bear with me.
We are using Helm to deploy a series of containers to K8s 1.14.6; all containers deploy successfully except those that have initContainer sections defined within them.
In these failing deployments, their templates define container and initContainer stanzas that reference the same persistent-volume (and associated persistent-volume-claim, both defined elsewhere).
The purpose of the initContainer is to copy persisted files from a mounted drive location into the appropriate place before the main container is established.
Other containers (without initContainer stanzas) mount properly and run as expected.
These pods with initContainer stanzas, however, report "failed to initialize" or "CrashLoopBackOff" as they continually try to start up. kubectl describe pod on these pods gives only a Warning in the Events section that "pod has unbound immediate PersistentVolumeClaims." The initContainer section of the pod description says it has failed with "Error" and no further elaboration.
When looking at the associated pv and pvc entries from kubectl, however, none are left pending, and all report "Bound" with no Events to speak of in the description.
I have been able to find plenty of articles suggesting fixes when your pvc list shows Pending claims, yet none so far that address this particular set of circumstances where all PVCs are bound.
When a PVC is "Bound", this means that you do have a PersistentVolume object in your cluster, whose claimRef refers to that PVC (and usually: that your storage provisioner is done creating the corresponding volume in your storage backend).
When a volume is "not bound" in one of your Pods, this means the node where your Pod was scheduled is unable to attach your persistent volume. If you're sure there's no mistake in your Pod's volumes, you should then check the logs of your CSI volume attacher pod when using CSI, or the node logs directly when using an in-tree driver.
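A quick way to check both sides (a diagnostic sketch; the attacher deployment name and its namespace are assumptions, adjust them to whatever your CSI driver ships):
# confirm the PV really points back at the PVC the Pod asks for
$ kubectl get pv <pv-name> -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'
# CSI: logs of the external-attacher deployment/sidecar
$ kubectl -n kube-system logs deploy/<csi-attacher-deployment> | grep -i attach
# in-tree drivers: kubelet logs on the node where the Pod was scheduled
$ journalctl -u kubelet | grep -i -e attach -e mount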
The CrashLoopBackOff is something else. You should check the logs of your initContainer: kubectl logs <pod-name> -c <init-container-name> -p. From your explanation, I would suppose there are permission issues when copying the files over.

Exporting PersistentVolumes and PersistentVolumesClaims from Kubernetes API

On GKE I created a statefulset containing a volumeClaimTemplates. Then all the related PersistentVolumesClaims, PersistentVolumes and Google Persistent Disks are automatically created:
kubectl get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
qserv-data-qserv-worker-0   Bound    pvc-c5e060dc-88cb-4630-8229-c4b1fcb4f64b   3Gi        RWO            qserv          76m
qserv-data-qserv-worker-1   Bound    pvc-5dfffc24-165c-4e2c-a1fa-fa11dd45616f   3Gi        RWO            qserv          76m
qserv-data-qserv-worker-2   Bound    pvc-14aa9a63-fae0-4328-aaaa-17db2dee4b79   3Gi        RWO            qserv          76m
qserv-data-qserv-worker-3   Bound    pvc-8b701396-42ab-4d15-8b68-9b03ce5a2d07   3Gi        RWO            qserv          76m
qserv-data-qserv-worker-4   Bound    pvc-7c49e7a0-fd73-467d-b677-820d899f41ee   3Gi        RWO            qserv          76m
kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS   AGE
pvc-14aa9a63-fae0-4328-aaaa-17db2dee4b79   3Gi        RWO            Retain           Bound    default/qserv-data-qserv-worker-2   qserv          77m
pvc-5dfffc24-165c-4e2c-a1fa-fa11dd45616f   3Gi        RWO            Retain           Bound    default/qserv-data-qserv-worker-1   qserv          77m
pvc-7c49e7a0-fd73-467d-b677-820d899f41ee   3Gi        RWO            Retain           Bound    default/qserv-data-qserv-worker-4   qserv          77m
pvc-8b701396-42ab-4d15-8b68-9b03ce5a2d07   3Gi        RWO            Retain           Bound    default/qserv-data-qserv-worker-3   qserv          77m
pvc-c5e060dc-88cb-4630-8229-c4b1fcb4f64b   3Gi        RWO            Retain           Bound    default/qserv-data-qserv-worker-0   qserv          77m
gcloud compute disks list
NAME                                       LOCATION       LOCATION_SCOPE   SIZE_GB   TYPE          STATUS
...
pvc-14aa9a63-fae0-4328-aaaa-17db2dee4b79   us-central1-c  zone             3         pd-balanced   READY
pvc-5dfffc24-165c-4e2c-a1fa-fa11dd45616f   us-central1-c  zone             3         pd-balanced   READY
pvc-7c49e7a0-fd73-467d-b677-820d899f41ee   us-central1-c  zone             3         pd-balanced   READY
pvc-8b701396-42ab-4d15-8b68-9b03ce5a2d07   us-central1-c  zone             3         pd-balanced   READY
pvc-c5e060dc-88cb-4630-8229-c4b1fcb4f64b   us-central1-c  zone             3         pd-balanced   READY
Is there a simple way to extract the PVC/PV YAML so that I can re-create all PVs/PVCs using the same Google disks? (This might be useful to move the data to a new GKE cluster in case I delete the current one, or to restore the data if somebody accidentally removes the PVCs/PVs.)
kubectl get pv,pvc -o yaml > export.yaml
The above command does not work because there are too many technical fields set at runtime, which prevent kubectl apply -f export.yaml from working. Would you know a way to remove these fields from export.yaml?
As asked in the question:
Is there a simple way to extract the PVC/PV YAML so that I can re-create all PVs/PVCs using the same Google disks?
Some scripting would be needed to extract a manifest that could be used without any hassles.
I found a StackOverflow thread about similar question (how to export manifests):
Stackoverflow.com: Questions: Kubectl export is deprecated any alternative
A side note!
I also stumbled upon kubectl neat (a plugin for kubectl) which will be referenced later in that answer.
As correctly pointed out by the author of that post, kubectl neat will show this message at installation time:
WARNING: You installed plugin "neat" from the krew-index plugin repository.
These plugins are not audited for security by the Krew maintainers.
Run them at your own risk.
I would consider going with a backup solution as the more viable option, for data persistence and, in general, for data protection in case of any failure.
From backup solution side you can look here:
Portworx.com: How to migrate stateful application from one GCP region to another with Portworx Kubemotion
Velero.io
PVs in GKE are in fact backed by Google Persistent Disks. You can create a snapshot/image of a disk as a backup measure, and you can also use this feature to test how your migration behaves:
Cloud.google.com: Compute: Docs: Disks: Create snapshots
Cloud.google.com: Compute: Docs: Images: Create delete deprecate private images
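For example, a one-off snapshot of one of the disks listed earlier could look like this (the zone and snapshot name are placeholders to adapt):
$ gcloud compute disks snapshot pvc-c5e060dc-88cb-4630-8229-c4b1fcb4f64b \
    --zone=us-central1-c \
    --snapshot-names=qserv-worker-0-backup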
Please consider the example below as a workaround.
I've managed to migrate an example Statefulset from one cluster to another with data stored on gce-pd.
Once more, I encourage you to check these docs about using preexisting disks in GKE to create a StatefulSet:
Cloud.google.com: Kubernetes Engine: Docs: How to: Persistent Volumes: Preexisting PD
Assuming that you used the manifest from the official Kubernetes site:
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Statefulset
You can migrate it to another cluster by:
Setting the ReclaimPolicy to Retain on each PV used. <-- IMPORTANT
Using kubectl neat to extract needed manifests
Editing previously extracted manifests
Deleting the existing workload on old cluster
Creating a workload on new cluster
Setting the ReclaimPolicy to Retain on each PV used
You will need to check whether your PVs' ReclaimPolicy is set to Retain. This stops the gce-pd from being deleted after the PVC and PV are deleted from the cluster. You can do it by following the Kubernetes documentation:
Kubernetes.io: Docs: Tasks: Administer cluster: Change PV reclaim policy
More reference:
Cloud.google.com: Kubernetes Engine: Docs: Concepts: Persistent Volumes: Dynamic provisioning
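Following the docs linked above, the change itself is a one-line patch per PV (<pv-name> is a placeholder), and you can verify the result afterwards:
$ kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
$ kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy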
Using kubectl neat to extract needed manifests
There are many ways to extract the manifest from Kubernetes API. I stumbled upon kubectl neat here. kubectl-neat will remove some of the fields in the manifests.
I used it in a following manner:
$ kubectl get statefulset -o yaml | kubectl neat > final-sts.yaml
$ kubectl get pvc -o yaml | kubectl neat > final-pvc.yaml
$ kubectl get pv -o yaml | kubectl neat > final-pv.yaml
Disclaimer!
This workaround uses the names of the dynamically created disks in GCP. If you were to create new disks (from a snapshot, for example) you would need to modify the whole setup (use the preexisting disks guide referenced earlier).
The above commands store the manifests of the StatefulSet used in the Kubernetes examples.
Editing previously extracted manifests
You will need to edit these manifests before using them in the newly created cluster. This part could be automated:
final-pv.yaml - delete the .claimRef in .spec
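This edit can be scripted, for instance with yq v4 (a sketch assuming final-pv.yaml is the List produced by the kubectl get pv command above):
$ yq eval 'del(.items[].spec.claimRef)' -i final-pv.yaml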
Deleting the existing workload on old cluster
You will need to release the used disks so that the new cluster can use them. You will need to delete this StatefulSet and the accompanying PVCs and PVs. Please make sure that the PVs' reclaimPolicy is set to Retain.
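Assuming the official example (a StatefulSet named web with a volumeClaimTemplate called www), the deletion step could look roughly like this; adjust the names to your own workload:
$ kubectl delete statefulset web
# PVC names follow <volumeClaimTemplate>-<statefulset>-<ordinal>
$ kubectl delete pvc www-web-0 www-web-1 www-web-2
$ kubectl delete pv <pv-name-0> <pv-name-1> <pv-name-2>
# the gce-pd disks survive because the reclaimPolicy is Retain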
Creating a workload on new cluster
You will need to take the previously created manifests and apply them on the new cluster:
$ kubectl apply -f final-pv.yaml
$ kubectl apply -f final-pvc.yaml
$ kubectl apply -f final-sts.yaml
As for exporting manifests, you could also look (if feasible) at the Kubernetes client libraries:
Kubernetes.io: Docs: Reference: Using API: Client libraries
Here is an example script which replaces kubectl neat and the manual editing of the manifest files (removal of the .spec.claimRef field) in order to export the mapping between PVCs, PVs and Google Persistent Disks.
https://github.com/k8s-school/skateful/blob/stackoverflow/main.go
How to use it:
git clone https://github.com/k8s-school/skateful.git
cd skateful
git checkout stackoverflow
# Requirements: go >= 1.14.7 and a kubeconfig file
make
./skateful
It will create a pvc-pv.yaml file that can be applied to any new GKE Kubernetes cluster in order to attach the existing Google Persistent Disks to new PVCs/PVs.

Kubernetes Persistent Volume Claim FileSystemResizePending

I have a PersistentVolumeClaim for a Kubernetes pod which shows the message "Waiting for user to (re-)start a pod to finish file system resize of volume on node." when I check it with kubectl describe pvc ...
The resizing itself, which was done with Terraform in our deployments, worked, but this message still shows up and I'm not really sure how to get it fixed. The pod was already restarted several times - I tried kubectl delete pod and scaled it down with kubectl scale deployment.
Does anyone have an idea how to get rid of this message?
There are a few things to consider:
Instead of using Terraform, try resizing the PVC by editing it manually. After that, wait for the underlying volume to be expanded by the storage provider and verify whether the FileSystemResizePending condition is present by executing kubectl get pvc <pvc_name> -o yaml. Then make sure that all the associated pods are restarted so the whole process can be completed. Once the file system resizing is done, the PVC will automatically be updated to reflect the new size (see the command sketch after this list).
Make sure that your volume type is supported for expansion. You can expand the following types of volumes:
gcePersistentDisk
awsElasticBlockStore
Cinder
glusterfs
rbd
Azure File
Azure Disk
Portworx
FlexVolumes
CSI
Check that the allowVolumeExpansion field in your StorageClass is set to true.
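A sketch of what checking those points could look like (the PVC/StorageClass names and the 20Gi size are placeholders):
# resize the claim by hand instead of through Terraform
$ kubectl patch pvc <pvc_name> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# watch for the FileSystemResizePending condition
$ kubectl get pvc <pvc_name> -o jsonpath='{.status.conditions}'
# confirm the StorageClass allows expansion
$ kubectl get storageclass <sc_name> -o jsonpath='{.allowVolumeExpansion}'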

Recreate Pod managed by a StatefulSet with a fresh PersistentVolume

On an occasional basis I need to perform a rolling replace of all Pods in my StatefulSet such that all PVs are also recreated from scratch. The reason to do so is to get rid of all underlying hard drives that use old versions of the encryption key. This operation should not be confused with regular rolling upgrades, for which I still want volumes to survive Pod terminations. The best routine I have figured out so far is the following:
1. Delete the PV.
2. Delete the PVC.
3. Delete the Pod.
4. Wait until all deletions complete.
5. Manually recreate the PVC deleted in step 2.
6. Wait for the new Pod to finish streaming data from other Pods in the StatefulSet.
7. Repeat from step 1 for the next Pod.
I'm not happy about step 5. I wish StatefulSet recreated the PVC for me, but unfortunately it does not. I have to do it myself, otherwise Pod creation fails with following error:
Warning FailedScheduling 3s (x15 over 15m) default-scheduler persistentvolumeclaim "foo-bar-0" not found
Is there a better way to do that?
I just recently had to do this. The following worked for me:
# Delete the PVC
$ kubectl delete pvc <pvc_name>
# Delete the underlying statefulset WITHOUT deleting the pods
$ kubectl delete statefulset <statefulset_name> --cascade=false
# Delete the pod with the PVC you don't want
$ kubectl delete pod <pod_name>
# Apply the statefulset manifest to re-create the StatefulSet,
# which will also recreate the deleted pod with a new PVC
$ kubectl apply -f <statefulset_yaml>
This is described in https://github.com/kubernetes/kubernetes/issues/89910. The workaround proposed there, deleting the new Pod that is stuck pending, works: the second time the Pod is replaced, a new PVC is created. It was marked as a duplicate of https://github.com/kubernetes/kubernetes/issues/74374, and reported as potentially fixed in 1.20.
It seems like you're using the "Persistent" volume in the wrong way: it's designed to keep the data between roll-outs, not to delete it. There are other ways to renew the keys. One can use a k8s Secret (or ConfigMap) to mount the key into the Pod; then you just need to recreate the Secret during a rolling update.
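A minimal sketch of that idea, mounting a key from a Secret into the Pod (all names and paths here are made up for illustration):
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: encryption-key
      mountPath: /etc/keys        # the app reads the key from here
      readOnly: true
  volumes:
  - name: encryption-key
    secret:
      secretName: disk-encryption-key
After rotating the Secret, a rolling restart of the workload (for example kubectl rollout restart statefulset <name>) would roll the Pods onto the new key without touching the volumes.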

What is the recommended way to move lone pods to different node before draining? Such that kubectl drain node1 --force does not delete the pod

I cannot find how to do so in the docs. After draining the node with --ignore-daemonsets --force, pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet are lost. How should I move such pods prior to issuing the drain command? I want to preserve the local data on these pods.
A good practice is to always run a Pod as a Deployment with spec.replicas: 1. It's very easy, as the Deployment's spec.template literally takes in your Pod spec, and quite convenient, as the Deployment will make sure your Pod is always running.
Then, assuming you'll only have 1 replica of your Pod, you can simply use a PersistentVolumeClaim and attach it to the Pod as a volume; you do not need a StatefulSet in that case. Your data will be stored in the PVC, and whenever your Pod is moved across nodes for whatever reason, it will reattach the volume automatically without losing any data.
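A minimal sketch of that setup (all names and the image are placeholders; /the/path matches the copy example below):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest
        volumeMounts:
        - name: data
          mountPath: /the/path    # data lands on the PVC, not on the node
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: my-app-data  # a PVC created separately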
Now, if it's too late for you, and your Pod hasn't got a volume pointing to a PVC, you can still get ready to change that by implementing the Deployment/PVC approach, and manually copy data out of your current pod:
kubectl cp theNamespace/thePod:/the/path /somewhere/on/your/local/computer
Before copying it back to the new pod:
kubectl cp /somewhere/on/your/local/computer theNamespace/theNewPod:/the/path
This time, just make sure /the/path (to reuse the example above) is actually a Volume mapped to a PVC so you won't have to do that manually again!